2005-04-21 01:32:14

by Jeff Dike

Subject: [RFC] Resource management through virtualization - the scheduler

I have long believed that general-purpose resource management is best
done by virtualizing the subsystem responsible for the resource in
question.

Here, I present a virtualization of the scheduler which allows the
construction of CPU compartments and confinement of processes within
them. The scheduler is virtualized in the sense that it is possible
to have more than one scheduler running on the system, and each new
guest scheduler runs as a process on the host scheduler. Thus, it
competes for CPU time as a single process, and the processes confined
to the guest compete against each other for the CPU time given to the
guest scheduler.

For example, given three CPU hogs, one of which is running directly on
the host scheduler, and two of which are confined to a guest
scheduler, the CPU hog on the host will compete with the guest
scheduler for CPU time, and each will receive half. The two hogs
inside the guest scheduler will then compete for the half of the CPU
given to the guest scheduler, each receiving 1/4 of the CPU.

This has uses aside from the resource control which motivated it:
  - The guest scheduler doesn't need to be the standard Linux
    scheduler. If you feel the current scheduler doesn't do justice to
    your workload, you can write your own, load it in as a guest, and
    put your workload into it.
  - The guest can be a bug-fixed version of the host. Move your
    workload into it and see if your bug is fixed. If not, move your
    workload back out, unload the scheduler, and try again. If so, you
    can leave everything as is, and you don't have to reboot.

Each guest scheduler creates a "sched group" (I would have used "sched
domain", but the NUMA people beat me to it). In addition, there is a
sched group for the host scheduler. These are represented in /proc as
/proc/sched-groups/<n>, where <n> is the pid of the process creating
the sched group. The host sched group is /proc/sched-groups/0, and is
created at boot time.

The /proc/sched-groups/<n> directory contains the current /proc/<pid>
directories for all processes inside the sched group. Sched group 0
initially contains all of the processes on the system. Since the
/proc/<pid> directories have moved, symlinks are left behind at the
old locations for compatibility.
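
As a very rough sketch (this is not code from the patches), the
boot-time creation of /proc/sched-groups and the host group's
directory might look something like the below; sched_groups_proc_init()
and sched_groups_root are made-up names, with proc_mkdir() being the
stock procfs call:

#include <linux/proc_fs.h>
#include <linux/init.h>

/* hypothetical: the root of the /proc/sched-groups hierarchy */
static struct proc_dir_entry *sched_groups_root;

/* hypothetical sketch: create /proc/sched-groups and the host
 * scheduler's group directory, /proc/sched-groups/0, at boot */
void __init sched_groups_proc_init(void)
{
	sched_groups_root = proc_mkdir("sched-groups", NULL);
	if (sched_groups_root != NULL)
		proc_mkdir("0", sched_groups_root);
}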

A new sched group is created by a process opening
/proc/schedulers/guest_o1_scheduler, as with
% cat /proc/schedulers/guest_o1_scheduler
This interface is somewhat hokey, and better suggestions would be
appreciated. However, it does have the property that the container
initially has the scheduling properties of the opening process, such
as nice level, priority, scheduling policy, processor affinity, etc.
These can be manipulated after the fact just as for any other process.
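
For illustration only (again, not lifted from the patches), the open
handler behind a /proc/schedulers/<name> entry could be roughly this,
with sched_group_create() and struct sched_group being hypothetical
names:

/* hypothetical sketch of the /proc/schedulers/<name> open handler */
static int guest_sched_open(struct inode *inode, struct file *file)
{
	struct sched_group *group;

	/* The new group is represented on the host by the opening
	 * process, so it starts out with that process's nice level,
	 * policy, affinity, and so on. */
	group = sched_group_create(current);
	if (group == NULL)
		return -ENOMEM;

	file->private_data = group;
	return 0;
}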

Processes are confined to a compartment by moving them from one
/proc/sched-groups/<n> directory to another. New processes also
inherit the compartment of their parent.

To make this concrete, here is a session producing the 50-25-25 CPU
split described above:

# This cat will become the guest scheduler by opening guest_o1_scheduler
usermode:~# cat /proc/schedulers/guest_o1_scheduler &
Created sched_group 290 ('guest_o1_scheduler')
# This is now /proc/sched-groups/290/
[1] 290
# Create three CPU hogs
usermode:~# bash -c 'while true; do true; done' &
[2] 292
usermode:~# bash -c 'while true; do true; done' &
[3] 293
usermode:~# bash -c 'while true; do true; done' &
[4] 294
# Move 293 and 294 into the compartment, leaving 292 on the host
# scheduler
usermode:~# mv /proc/sched-groups/0/293 /proc/sched-groups/290/
usermode:~# mv /proc/sched-groups/0/294 /proc/sched-groups/290/
# ...wait a bit...
usermode:~# ps uax
...
root 292 49.1 0.7 2324 996 tty0 R 21:51 14:40 bash -c
root 293 24.7 0.7 2324 996 tty0 R 21:51 7:23 bash -c
root 294 24.7 0.7 2324 996 tty0 R 21:51 7:23 bash -c
...

More arbitrary divisions of CPU can be made by having the guest
schedulers be SMP with more than one virtual processor. This could
also be adjusted on the fly by adding CPU hotplug support.

Compartments can be nested, and don't require root privileges to
create. What does require root is loading the guests into the kernel
in the first place. In the absence of a reentrant scheduler, new
instances are made available by modprobing more instances of the
scheduler module into the kernel. However, given the presence of
available guests, activating one is a non-privileged operation.

The design is that each task struct contains a pointer to a scheduler
struct, which holds the entry points of the scheduler controlling that
task. Every scheduler operation now goes through
p->scheduler->task_ops.<function>. Moving a process from one sched
group to another changes which scheduler struct it points at.
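
As a sketch of that indirection (only schedule() is named in this
mail; the other fields and types here are illustrative guesses, not
the patch's actual struct):

/* sketch of the per-scheduler operations table */
struct scheduler_task_ops {
	void (*schedule)(void);
	void (*sched_fork)(struct task_struct *p);
	void (*scheduler_tick)(void);
};

struct scheduler {
	struct scheduler_task_ops task_ops;
	/* per-instance state, such as the runqueues, hangs off here */
};

/* task_struct gains a field along the lines of
 *	struct scheduler *scheduler;
 * and wakeups, schedule(), tick handling, etc. all dispatch through
 * p->scheduler->task_ops.<function> */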

When a new sched group is created, a second thread is created, which
becomes the idle thread of the guest scheduler. The original process
remains on the host scheduler. When it gets CPU time from the host, it
switches to the idle thread of the guest scheduler. When the guest
idle thread runs again, instead of sleeping, it switches back to the
container process, which sleeps in the host scheduler.
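
A rough sketch of that handoff (guest_switch_to() and the struct
sched_group fields are hypothetical names for what the patch does):

/* hypothetical sketch of the container process's loop */
static void sched_group_container_loop(struct sched_group *group)
{
	for (;;) {
		/* the host scheduler just gave us the CPU; hand it to
		 * the guest's idle thread, which runs the guest
		 * scheduler and whatever tasks are confined to it */
		guest_switch_to(group->guest_idle);

		/* the guest idle thread switched back here, meaning
		 * the compartment has nothing runnable, so sleep in
		 * the host scheduler until a wakeup inside the group
		 * makes it runnable again */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}
}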

Clock ticks are handled such that the host scheduler gets to see them
first, getting a chance to schedule away from the compartment. When
the compartment next runs, it will handle the tick on its own, and
have a chance to make its own scheduling decisions.
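
In terms of the sketches above (names still hypothetical), that
ordering looks something like:

/* hypothetical sketch of tick handling for a compartment */
void sched_group_tick(struct sched_group *group)
{
	/* the host scheduler accounts the tick first and may decide
	 * to preempt the compartment's container process */
	host_scheduler.task_ops.scheduler_tick();

	/* remember the tick; the guest scheduler processes it, and
	 * makes its own preemption decision, the next time the
	 * compartment runs */
	group->pending_ticks++;
}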

The implementation is as follows:
The O(1) scheduler algorithm was split out from sched.c and moved
to o1_sched.c. This is the code that we want duplicated; the code
that remains in sched.c is stuff like system call entry points and
primitives which are independent of the actual algorithm, and
shouldn't be duplicated.

The scheduling primitives were made static and used to fill a
structure of function pointers in the scheduler struct. Their names
were also changed so as not to conflict with the new macros in sched.h
which call through the scheduler struct. For example, schedule() in
sched.h is now
#define schedule() current->scheduler->task_ops.schedule()
and the scheduler is now
static asmlinkage void __sched o1_schedule(void)
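
Continuing the sketch from above (the field list here is illustrative,
and o1_sched_fork/o1_scheduler_tick are made-up names for the other
renamed primitives), host_sched.c would then instantiate the host
scheduler along these lines:

/* illustrative only: host_sched.c filling in the scheduler struct
 * with the now-static O(1) primitives from o1_sched.c */
struct scheduler host_scheduler = {
	.task_ops = {
		.schedule	= o1_schedule,
		.sched_fork	= o1_sched_fork,
		.scheduler_tick	= o1_scheduler_tick,
	},
};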

o1_sched.c is now linked into two intermediate object files, a host
and a guest, together with either host_sched.c or guest_sched.c.
These two files customize the scheduler struct as either a host or a
guest scheduler. The output of the guest build is either
guest_o1_sched.o (when CONFIG_GUEST_SCHED == y) or
guest_o1_scheduler.ko (when CONFIG_GUEST_SCHED == m). When built into
the kernel, you get exactly one guest scheduler, which is registered
at boot time. Since that limits you to a single compartment, it is
recommended to build it as a module and modprobe it whenever you need
a new compartment.

modprobe is somewhat reluctant to load the same module twice, so the
-o switch is needed to give each instance a different name:
% modprobe -o guest1 guest_o1_scheduler
at which point you'd open /proc/schedulers/guest1 to activate it.

I've broken this into three patches:
  - guest-sched-prep - adds the procfs changes and makes the single
    host scheduler use it, adds the sched.h changes, the sched.c
    redeclarations, and the scheduler struct, but doesn't change the
    system's behavior.
  - guest-sched-movement - creates o1_sched.c by splitting the
    relevant code from sched.c.
  - guest-sched-guest - adds guest support by adding the necessary
    build changes (along with the kbuild nastiness that seems to be a
    hallmark of my work, sigh) and the code in guest_sched.c which
    actually virtualizes the scheduler.

Patches 1 and 3 are attached; patch 2 is code movement with no
functional changes and is way too large to post to LKML. It can be
found at http://www.user-mode-linux.org/~jdike/guest-sched-movement

Here are diffstats:

guest-sched-prep:

arch/um/kernel/skas/process_kern.c | 2
arch/um/kernel/tt/process_kern.c | 2
fs/proc/base.c | 201 +++++++++++++++++++++++++++++++++++--
include/linux/init_task.h | 1
include/linux/kernel_stat.h | 11 +-
include/linux/sched.h | 122 +++++++++++++++++++---
kernel/Makefile | 7 -
kernel/host_sched.c | 10 +
kernel/sched.c | 142 ++++++++++++++++++--------
kernel/sched_group.c | 138 +++++++++++++++++++++++++
10 files changed, 560 insertions(+), 76 deletions(-)

guest-sched-movement:

Makefile | 2
o1_sched.c | 3506 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
sched.c | 3510 -------------------------------------------------------------
3 files changed, 3512 insertions(+), 3506 deletions(-)

guest-sched-guest:

arch/um/defconfig | 1
arch/um/kernel/ksyms.c | 4 +
arch/um/kernel/process_kern.c | 2
arch/um/kernel/signal_kern.c | 2
arch/um/kernel/time_kern.c | 2
include/linux/rcupdate.h | 2
include/linux/sched.h | 2
init/Kconfig | 4 +
kernel/Makefile | 32 +++++++++
kernel/acct.c | 3
kernel/fork.c | 4 +
kernel/guest_sched.c | 143 ++++++++++++++++++++++++++++++++++++++++++
kernel/o1_sched.c | 26 +++++++
kernel/profile.c | 2
kernel/rcupdate.c | 7 ++
kernel/sched_group.c | 52 +++++++++++++++
mm/memory.c | 2
17 files changed, 288 insertions(+), 2 deletions(-)

The many little changes in -guest are a bunch of EXPORT_SYMBOLs of
random crap that were needed to get the scheduler to load as a module.

I've built and tested this with UML (2.6.11-bk8, but it should be fine
with anything near that). I think there are one or two minor
portability problems that will prevent this from working as-is on
other arches. However, there's nothing intrinsically arch-dependent
here.

Jeff


Attachments:
guest-sched-prep (32.67 kB)
guest-sched-guest (17.24 kB)