2002-12-30 21:37:42

by Ed Tomlinson

Subject: [PATCH,RFC] fix o(1) handling of threads

The o(1) scheduler is an interesting beast. It handles most workloads
well. One not-uncommon corner case is that of a heavily threaded
application. A common source of these is java. In my case I want to
run a freenet node, which normally runs with 80-100 threads. With o(1)
I get two choices: run the application at nice 19 and wait when using it
(who wants to run a _server_ at nice 19?), or run it at nice 10 or so
and watch my box slow to a crawl when the node gets busy. When this
happens the load average can shoot up into the high teens or more -
which causes the crawl... (first reported in Feb 2002).

It's a bit of a catch-22. If the application kept a 'stable' number of
threads active, nice could be used to control it. In reality anywhere
from 1 to 20 or more threads are active (in a runqueue).

What this patch does is recognise threads as processes that clone both
their mm and files. For these 'threads' it tracks how many are active in
a given group. When many are active, it reduces their timeslices as below:

weight = (threads active in group) * THREAD_PENALTY / 100
timeslice = BASE_TIMESLICE(p) / weight

The weight is tested for validity and MIN_TIMESLICE is respected.
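
As a sanity check, here is a minimal standalone sketch of that arithmetic
(plain userspace C; the 150ms nice-0 base timeslice and 10ms minimum are
hardcoded here for illustration - the real logic is in task_timeslice() in
the patch below):

#include <stdio.h>

#define MIN_TIMESLICE 10 /* ms, i.e. 10 * HZ / 1000 with HZ=1000 */
#define THREAD_PENALTY 80

/* mirrors the throttled task_timeslice() calculation */
static int throttled_slice(int base_slice, int active)
{
	int weight = active * THREAD_PENALTY / 100; /* integer division */

	if (weight > 1) {
		int slice = base_slice / weight;
		return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
	}
	return base_slice; /* weight <= 1 (1 or 2 active): no throttle */
}

int main(void)
{
	int n;

	/* 3 active -> 75ms, 5 active -> 37ms, 19 or more -> clamped to 10ms */
	for (n = 1; n <= 20; n++)
		printf("%2d active -> %3d ms\n", n, throttled_slice(150, n));
	return 0;
}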

The effect of this change is to limit the amount of CPU used by threads
when many in a given thread group are active. How much they are
limited is controlled by the THREAD_PENALTY tunable.

This seems quite effective here, allowing freenet to run at nice 0 without
the system slowing down unreasonably when the node gets active.

The patch is against 2.5.53-bk from Saturday. A few comments:

- I have tested UP with preempt enabled. I am not sure of the locking with
SMP - there may be races when a ptg_struct is destroyed. I think
creation is probably OK.

- Should I respect the minimum timeslice? Maybe it should be set lower, which
would help with very large active lists. As it's set now, we need more than 18
active nice-0 threads in one group before this is an issue (with a 150ms base
timeslice and a 10ms minimum, the clamp kicks in once the weight reaches 15,
which at THREAD_PENALTY 80 means 19 active threads).

- I suspect that most systems will have very few ptgs allocated - here two
long-lived structures are the norm.

Comments?
Ed Tomlinson


diffstat ptg_A1
include/linux/sched.h | 6 ++++++
kernel/fork.c | 21 +++++++++++++++++++++
kernel/sched.c | 12 ++++++++++++
3 files changed, 39 insertions(+)


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.922 -> 1.924
# include/linux/sched.h 1.116 -> 1.117
# kernel/fork.c 1.93 -> 1.95
# kernel/sched.c 1.145 -> 1.147
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/30 [email protected] 1.923
# Allow heavily threaded tasks to run with normal priorities without
# destroying responsiveness. This is done by throttling the threads
# in a group when many are active (in a runqueue) at the same time.
# --------------------------------------------
# 02/12/30 [email protected] 1.924
# remove printks, adjust THREAD_PENALTY and respect MIN_TIMESLICE
# --------------------------------------------
#
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Mon Dec 30 10:04:31 2002
+++ b/include/linux/sched.h Mon Dec 30 10:04:31 2002
@@ -172,6 +172,11 @@

#include <linux/aio.h>

+struct ptg_struct { /* pseudo thread groups */
+ atomic_t active; /* number of tasks in run queues */
+ atomic_t count; /* number of refs */
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -301,6 +306,7 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ struct ptg_struct * ptgroup; /* pseudo thread group for this task */

/* task state */
struct linux_binfmt *binfmt;
diff -Nru a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c Mon Dec 30 10:04:31 2002
+++ b/kernel/fork.c Mon Dec 30 10:04:31 2002
@@ -59,6 +59,11 @@

void __put_task_struct(struct task_struct *tsk)
{
+ if (tsk->ptgroup && atomic_dec_and_test(&tsk->ptgroup->count)) {
+ kfree(tsk->ptgroup);
+ tsk->ptgroup = NULL;
+ }
+
if (tsk != current) {
free_thread_info(tsk->thread_info);
kmem_cache_free(task_struct_cachep,tsk);
@@ -432,6 +437,7 @@

tsk->mm = NULL;
tsk->active_mm = NULL;
+ tsk->ptgroup = NULL;

/*
* Are we cloning a kernel thread?
@@ -819,6 +825,21 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
+
+ /* detect a 'thread' and link to the ptg block for group */
+ if ( ((clone_flags & CLONE_VM) && (clone_flags & CLONE_FILES)) ||
+ (clone_flags & CLONE_THREAD)) {
+ if (current->ptgroup)
+ atomic_inc(&current->ptgroup->count);
+ else {
+ current->ptgroup = kmalloc(sizeof(struct ptg_struct), GFP_ATOMIC);
+ if (likely(current->ptgroup)) {
+ atomic_set(&current->ptgroup->count,2);
+ atomic_set(&current->ptgroup->active,1);
+ }
+ }
+ p->ptgroup = current->ptgroup;
+ }

if (clone_flags & CLONE_CHILD_SETTID)
p->set_child_tid = child_tidptr;
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Mon Dec 30 10:04:31 2002
+++ b/kernel/sched.c Mon Dec 30 10:04:31 2002
@@ -62,6 +62,7 @@
#define MAX_TIMESLICE (300 * HZ / 1000)
#define CHILD_PENALTY 95
#define PARENT_PENALTY 100
+#define THREAD_PENALTY 80
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
@@ -122,6 +123,13 @@

static inline unsigned int task_timeslice(task_t *p)
{
+ if (p->ptgroup) {
+ int weight = atomic_read(&p->ptgroup->active) * THREAD_PENALTY / 100;
+ if (weight > 1) {
+ int slice = BASE_TIMESLICE(p) / weight;
+ return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
+ }
+ }
return BASE_TIMESLICE(p);
}

@@ -295,6 +303,8 @@
}
enqueue_task(p, array);
rq->nr_running++;
+ if (p->ptgroup)
+ atomic_inc(&p->ptgroup->active);
}

/*
@@ -302,6 +312,8 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
+ if (p->ptgroup)
+ atomic_dec(&p->ptgroup->active);
rq->nr_running--;
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;


------------


2002-12-30 22:00:05

by Alan

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

Very interesting, but I'll note there are actually two groupings to
solve - per user and per threadgroup. Also for small numbers of threads
you don't want to punish a task and ruin its balancing across CPUs

Have you looked at the per user fair share stuff too ?

2002-12-30 22:05:26

by William Lee Irwin III

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On Mon, Dec 30, 2002 at 04:45:50PM -0500, Ed Tomlinson wrote:
> The o(1) scheduler is an interesting beast. It handles most workloads

O(1) is very, very different from o(1). Don't skip that shift key!

Also see:

http://mathworld.wolfram.com/AsymptoticNotation.html


Bill

2002-12-30 22:24:40

by Ed Tomlinson

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On December 30, 2002 05:50 pm, Alan Cox wrote:
> Very interesting, but I'll note there are actually two groupings to
> solve - per user and per threadgroup. Also for small numbers of threads
> you don't want to punish a task and ruin its balancing across CPUs

This is easily tuneable. As it's set now, 2 in-queue threads from the
same group are not punished; with 3 their timeslices are halved.
Setting THREAD_PENALTY to 65 means no adjustment until 4 in-queue
threads exist.
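
To spell out the integer arithmetic (weight = active * THREAD_PENALTY / 100,
with no throttle unless weight > 1):

THREAD_PENALTY=80: 2 active -> weight 1 (full slice); 3 active -> weight 2 (halved)
THREAD_PENALTY=65: 3 active -> weight 1 (full slice); 4 active -> weight 2 (halved)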

> Have you looked at the per user fair share stuff too ?

No, but a variant of the same code could be cooked up - interested? As
I am the only real user here it's not much of an issue. Anyone have
boxes that can be used to test per-user throttles?

Ed Tomlinson

2002-12-30 22:52:10

by David Schwartz

Subject: Re: [PATCH,RFC] fix o(1) handling of threads


On Mon, 30 Dec 2002 16:45:50 -0500, Ed Tomlinson wrote:

>What this patch does is recognise threads as processes that clone both
>their mm and files. For these 'threads' it tracks how many are active in
>a given group. When many are active, it reduces their timeslices as below:

In general, changes that cause the system to become less efficient as load
increases are not such a good idea. By reducing timeslices, you increase
context-switching overhead. So the busier you are, the less efficient you
get. I think it would be wiser to keep the timeslice the same but assign
fewer timeslices.

DS


2002-12-31 03:55:25

by Ed Tomlinson

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On December 30, 2002 06:00 pm, David Schwartz wrote:
> On Mon, 30 Dec 2002 16:45:50 -0500, Ed Tomlinson wrote:
> >What this patch does is recognise threads as processes that clone both
> >their mm and files. For these 'threads' it tracks how many are active in
> >a given group. When many are active, it reduces their timeslices as below:
>
> In general, changes that cause the system to become less efficient as load
> increases are not such a good idea. By reducing timeslices, you increase
> context-switching overhead. So the busier you are, the less efficient you
> get. I think it would be wiser to keep the timeslice the same but assign
> fewer timeslices.

That would be better - I cannot see a way to do it within O(1). What might
be possible (I am not sure how) is to decrease the timeslices only IF there
are other tasks being slowed down by the thread group...

Ed

2002-12-31 04:37:09

by Rik van Riel

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On Mon, 30 Dec 2002, Ed Tomlinson wrote:
> On December 30, 2002 06:00 pm, David Schwartz wrote:
> >
> > In general, changes that cause the system to become less efficient as load
> > increases are not such a good idea. By reducing timeslices, you increase
> > context-switching overhead. So the busier you are, the less efficient you
> > get. I think it would be wiser to keep the timeslice the same but assign
> > fewer timeslices.
>
> That would be better - I cannot see a way to do it using O(1).

I've been thinking about this problem for a while, but haven't
found a good solution yet. I've got a long way to go before I
can port the per-user fair scheduling stuff to the O(1) base.

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]

2002-12-31 14:26:05

by Ed Tomlinson

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On December 30, 2002 05:50 pm, Alan Cox wrote:
> Very interesting, but I'll note there are actually two groupings to
> solve - per user and per threadgroup. Also for small numbers of threads
> you don't want to punish a task and ruin its balancing across CPUs
>
> Have you looked at the per user fair share stuff too ?

Two changes here. First, I have modified the timeslice compression
calculation to make it more understandable. Now 100/penalty gives
the number of timeslices to distribute equally among the processes
in the thread group or the processes of a user. For thread groups it
is now set so that the time of 2 timeslices is distributed equally
among the members of the group.

The second change adds a user throttle. It would be better to get the
USER_PENALTY from a per-user source - suggestions? Since limiting
users is not a problem here, I have set the limit so that a normal user
can have 10 active processes in runqueues before the timeslice compression
starts. Root is excluded from this logic.
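
To make the new calculation concrete (names as in the patch below):

weight = max(100, (active in group) * THREAD_PENALTY,
                  (user's active) * USER_PENALTY)    [user part skipped for root]
timeslice = 100 * BASE_TIMESLICE(p) / weight

With THREAD_PENALTY=50 and N >= 3 threads in runqueues each thread gets
2 * BASE_TIMESLICE(p) / N, so the group as a whole consumes 100/50 = 2 full
timeslices per round. With USER_PENALTY=10, compression starts with the 11th
active process, capping a user at 100/10 = 10 timeslices.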

The effect of this patch is to lower the priority of a user's processes
or of the threads in a group. It uses the same technique that O(1) uses
for priorities to do this.

Comments?
Ed Tomlinson

PS. It's trivial to factor apart the user vs thread-group limits if required.

diffstat ptg_B0
include/linux/sched.h | 7 +++++++
kernel/fork.c | 22 ++++++++++++++++++++++
kernel/sched.c | 22 +++++++++++++++++++++-
kernel/user.c | 2 ++
4 files changed, 52 insertions(+), 1 deletion(-)

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.922 -> 1.925
# include/linux/sched.h 1.116 -> 1.118
# kernel/fork.c 1.93 -> 1.96
# kernel/user.c 1.5 -> 1.6
# kernel/sched.c 1.145 -> 1.148
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/30 [email protected] 1.923
# Allow heavily threaded tasks to run with normal priorities without
# destroying responsiveness. This is done by throttling the threads
# in a group when many are active (in a runqueue) at the same time.
# --------------------------------------------
# 02/12/30 [email protected] 1.924
# remove printks, adjust THREAD_PENALTY and respect MIN_TIMESLICE
# --------------------------------------------
# 02/12/31 [email protected] 1.925
# Add user throttling
# Improve the timeslice calculations for throttling
# --------------------------------------------
#
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Dec 31 08:53:27 2002
+++ b/include/linux/sched.h Tue Dec 31 08:53:27 2002
@@ -172,6 +172,11 @@

#include <linux/aio.h>

+struct ptg_struct { /* pseudo thread groups */
+ atomic_t active; /* number of tasks in run queues */
+ atomic_t count; /* number of refs */
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -256,6 +261,7 @@
struct user_struct {
atomic_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
+ atomic_t active; /* How many active processes does this user have? */
atomic_t files; /* How many open files does this user have? */

/* Hash table maintenance information */
@@ -301,6 +307,7 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ struct ptg_struct * ptgroup; /* pseudo thread group for this task */

/* task state */
struct linux_binfmt *binfmt;
diff -Nru a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c Tue Dec 31 08:53:27 2002
+++ b/kernel/fork.c Tue Dec 31 08:53:27 2002
@@ -59,6 +59,11 @@

void __put_task_struct(struct task_struct *tsk)
{
+ if (tsk->ptgroup && atomic_dec_and_test(&tsk->ptgroup->count)) {
+ kfree(tsk->ptgroup);
+ tsk->ptgroup = NULL;
+ }
+
if (tsk != current) {
free_thread_info(tsk->thread_info);
kmem_cache_free(task_struct_cachep,tsk);
@@ -432,6 +437,7 @@

tsk->mm = NULL;
tsk->active_mm = NULL;
+ tsk->ptgroup = NULL;

/*
* Are we cloning a kernel thread?
@@ -819,6 +825,22 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
+
+ /* detect a 'thread' and link to the ptg block for group */
+ if ( ((clone_flags & CLONE_VM) && (clone_flags & CLONE_FILES)) ||
+ (clone_flags & CLONE_THREAD)) {
+ if (current->ptgroup)
+ atomic_inc(&current->ptgroup->count);
+ else {
+ current->ptgroup = kmalloc(sizeof(struct ptg_struct), GFP_ATOMIC);
+ if (likely(current->ptgroup)) {
+ atomic_set(&current->ptgroup->count,2);
+ atomic_set(&current->ptgroup->active,1);
+ /* printk(KERN_INFO "ptgroup - pid %u\n",current->pid); */
+ }
+ }
+ p->ptgroup = current->ptgroup;
+ }

if (clone_flags & CLONE_CHILD_SETTID)
p->set_child_tid = child_tidptr;
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Dec 31 08:53:27 2002
+++ b/kernel/sched.c Tue Dec 31 08:53:27 2002
@@ -62,6 +62,8 @@
#define MAX_TIMESLICE (300 * HZ / 1000)
#define CHILD_PENALTY 95
#define PARENT_PENALTY 100
+#define THREAD_PENALTY 50 /* allow thread groups 2 full timeslices */
+#define USER_PENALTY 10 /* allow a user 10 full timeslices */
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
@@ -122,7 +124,19 @@

static inline unsigned int task_timeslice(task_t *p)
{
- return BASE_TIMESLICE(p);
+ int work, slice, weight = 100;
+ if (p->ptgroup) {
+ work = atomic_read(&p->ptgroup->active) * THREAD_PENALTY;
+ if (work > weight)
+ weight = work;
+ }
+ if (p->user->uid) {
+ work = atomic_read(&p->user->active) * USER_PENALTY;
+ if (work > weight)
+ weight = work;
+ }
+ slice = 100 * BASE_TIMESLICE(p) / weight;
+ return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
}

/*
@@ -295,6 +309,9 @@
}
enqueue_task(p, array);
rq->nr_running++;
+ if (p->ptgroup)
+ atomic_inc(&p->ptgroup->active);
+ atomic_inc(&p->user->active);
}

/*
@@ -302,6 +319,9 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
+ atomic_dec(&p->user->active);
+ if (p->ptgroup)
+ atomic_dec(&p->ptgroup->active);
rq->nr_running--;
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
diff -Nru a/kernel/user.c b/kernel/user.c
--- a/kernel/user.c Tue Dec 31 08:53:27 2002
+++ b/kernel/user.c Tue Dec 31 08:53:27 2002
@@ -30,6 +30,7 @@
struct user_struct root_user = {
.__count = ATOMIC_INIT(1),
.processes = ATOMIC_INIT(1),
+ .active = ATOMIC_INIT(1),
.files = ATOMIC_INIT(0)
};

@@ -96,6 +97,7 @@
new->uid = uid;
atomic_set(&new->__count, 1);
atomic_set(&new->processes, 0);
+ atomic_set(&new->active, 0);
atomic_set(&new->files, 0);

/*
--------------


2003-01-01 12:58:37

by Ed Tomlinson

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

Here is the scheduler-tunables patch updated to include USER_PENALTY
and THREAD_PENALTY. This goes on top of ptg_B0.
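
Once applied, the penalties can be adjusted on the fly, e.g.
"echo 25 > /proc/sys/sched/user_penalty", and read back with
"cat /proc/sys/sched/thread_penalty".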

Ed Tomlinson

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.975 -> 1.977
# kernel/sysctl.c 1.37 -> 1.39
# Documentation/filesystems/proc.txt 1.10 -> 1.12
# include/linux/sysctl.h 1.38 -> 1.40
# kernel/sched.c 1.146 -> 1.148
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/31 [email protected] 1.976
# scheduler-tunables.patch
# --------------------------------------------
# 02/12/31 [email protected] 1.977
# update scheduler tunables for thread and user penalties
# --------------------------------------------
#
diff -Nru a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt Tue Dec 31 13:48:26 2002
+++ b/Documentation/filesystems/proc.txt Tue Dec 31 13:48:26 2002
@@ -37,6 +37,7 @@
2.8 /proc/sys/net/ipv4 - IPV4 settings
2.9 Appletalk
2.10 IPX
+ 2.11 /proc/sys/sched - scheduler tunables

------------------------------------------------------------------------------
Preface
@@ -1658,6 +1659,111 @@
The /proc/net/ipx_route table holds a list of IPX routes. For each route it
gives the destination network, the router node (or Directly) and the network
address of the router (or Connected) for internal networks.
+
+2.11 /proc/sys/sched - scheduler tunables
+-----------------------------------------
+
+Useful knobs for tuning the scheduler live in /proc/sys/sched.
+
+child_penalty
+-------------
+
+Percentage of the parent's sleep_avg that children inherit. sleep_avg is
+a running average of the time a process spends sleeping. Tasks with high
+sleep_avg values are considered interactive and given a higher dynamic
+priority and a larger timeslice. You typically want this set to a value
+just under 100.
+
+exit_weight
+-----------
+
+When a CPU hog task exits, its parent's sleep_avg is reduced by a factor of
+exit_weight against the exiting task's sleep_avg.
+
+interactive_delta
+-----------------
+
+If a task is "interactive" it is reinserted into the active array after it
+has expired its timeslice, instead of being inserted into the expired array.
+How "interactive" a task must be in order to be deemed interactive is a
+function of its nice value. This interactive limit is scaled linearly by nice
+value and is offset by the interactive_delta.
+
+max_sleep_avg
+-------------
+
+max_sleep_avg is the largest value (in ms) stored for a task's running sleep
+average. The larger this value, the longer a task needs to sleep to be
+considered interactive (maximum interactive bonus is a function of
+max_sleep_avg).
+
+max_timeslice
+-------------
+
+Maximum timeslice, in milliseconds. This is the value given to tasks of the
+highest dynamic priority.
+
+min_timeslice
+-------------
+
+Minimum timeslice, in milliseconds. This is the value given to tasks of the
+lowest dynamic priority. Every task gets at least this slice of the processor
+per array switch.
+
+parent_penalty
+--------------
+
+Percentage of the parent's sleep_avg that it retains across a fork().
+sleep_avg is a running average of the time a process spends sleeping. Tasks
+with high sleep_avg values are considered interactive and given a higher
+dynamic priority and a larger timeslice. Normally, this value is 100 and
+thus tasks retain their sleep_avg on fork. If you want to punish interactive
+tasks for forking, set this below 100.
+
+prio_bonus_ratio
+----------------
+
+Middle percentage of the priority range that tasks can receive as a dynamic
+priority. The default value of 25% ensures that nice values at the
+extremes are still enforced. For example, nice +19 interactive tasks will
+never be able to preempt a nice 0 CPU hog. Setting this higher will increase
+the size of the priority range the tasks can receive as a bonus. Setting
+this lower will decrease this range, making the interactivity bonus less
+apparent and user nice values more applicable.
+
+starvation_limit
+----------------
+
+Sufficiently interactive tasks are reinserted into the active array when they
+run out of timeslice. Normally, tasks are inserted into the expired array.
+Reinserting interactive tasks into the active array allows them to remain
+runnable, which is important to interactive performance. This could starve
+expired tasks, however, since the interactive task could prevent the array
+switch. To prevent starving the tasks on the expired array for too long, the
+starvation_limit is the longest time (in ms) we will let the expired array
+starve at the expense of reinserting interactive tasks back into the active
+array. Higher values give more preference to running interactive tasks, at
+the expense of expired tasks. Lower values provide fairer scheduling
+behavior, at the expense of interactivity.
+
+thread_penalty
+--------------
+
+Limit the sum of timeslices used by a thread group to 100/n timeslices. This
+is used to prevent heavily threaded applications from slowing down the system
+when many threads are active. For this item, threads are defined as processes
+sharing their mm and files. This implies that if this is set to 33 and six
+processes from a given thread group are in runqueues, each process will have
+its timeslice reduced by roughly 50%. Set to zero to disable.
+
+user_penalty
+------------
+
+Limit the sum of timeslices used by a user to 100/n timeslices. This prevents
+one user from stealing the CPU by creating many active threads. For example,
+if this is set to 25 and six processes are in runqueues, the timeslice of
+each process will be reduced by 33%. Set to zero to disable - root is always
+excluded from this logic.

------------------------------------------------------------------------------
Summary
diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h Tue Dec 31 13:48:26 2002
+++ b/include/linux/sysctl.h Tue Dec 31 13:48:26 2002
@@ -66,7 +66,8 @@
CTL_DEV=7, /* Devices */
CTL_BUS=8, /* Busses */
CTL_ABI=9, /* Binary emulation */
- CTL_CPU=10 /* CPU stuff (speed scaling, etc) */
+ CTL_CPU=10, /* CPU stuff (speed scaling, etc) */
+ CTL_SCHED=11, /* scheduler tunables */
};

/* CTL_BUS names: */
@@ -157,6 +158,20 @@
VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
};

+/* Tunable scheduler parameters in /proc/sys/sched/ */
+enum {
+ SCHED_MIN_TIMESLICE=1, /* minimum process timeslice */
+ SCHED_MAX_TIMESLICE=2, /* maximum process timeslice */
+ SCHED_CHILD_PENALTY=3, /* penalty on fork to child */
+ SCHED_PARENT_PENALTY=4, /* penalty on fork to parent */
+ SCHED_EXIT_WEIGHT=5, /* penalty to parent of CPU hog child */
+ SCHED_PRIO_BONUS_RATIO=6, /* percent of max prio given as bonus */
+ SCHED_INTERACTIVE_DELTA=7, /* delta used to scale interactivity */
+ SCHED_MAX_SLEEP_AVG=8, /* maximum sleep avg attainable */
+ SCHED_STARVATION_LIMIT=9, /* no re-active if expired is starved */
+ SCHED_THREAD_PENALTY=10, /* thread group throttle */
+ SCHED_USER_PENALTY=11, /* user process throttle */
+};

/* CTL_NET names: */
enum
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Dec 31 13:48:26 2002
+++ b/kernel/sched.c Tue Dec 31 13:48:26 2002
@@ -57,18 +57,33 @@
* Minimum timeslice is 10 msecs, default timeslice is 150 msecs,
* maximum timeslice is 300 msecs. Timeslices get refilled after
* they expire.
+ *
+ * They are configurable via /proc/sys/sched
*/
-#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (300 * HZ / 1000)
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define THREAD_PENALTY 50 /* allow thread groups 2 full timeslices */
-#define USER_PENALTY 10 /* allow a user 10 full timeslices */
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 25
-#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (2*HZ)
+
+int min_timeslice = (10 * HZ) / 1000;
+int max_timeslice = (300 * HZ) / 1000;
+int child_penalty = 95;
+int parent_penalty = 100;
+int thread_penalty = 50;
+int user_penalty = 10;
+int exit_weight = 3;
+int prio_bonus_ratio = 25;
+int interactive_delta = 2;
+int max_sleep_avg = 2 * HZ;
+int starvation_limit = 2 * HZ;
+
+#define MIN_TIMESLICE (min_timeslice)
+#define MAX_TIMESLICE (max_timeslice)
+#define CHILD_PENALTY (child_penalty)
+#define PARENT_PENALTY (parent_penalty)
+#define THREAD_PENALTY (thread_penalty)
+#define USER_PENALTY (user_penalty)
+#define EXIT_WEIGHT (exit_weight)
+#define PRIO_BONUS_RATIO (prio_bonus_ratio)
+#define INTERACTIVE_DELTA (interactive_delta)
+#define MAX_SLEEP_AVG (max_sleep_avg)
+#define STARVATION_LIMIT (starvation_limit)

/*
* If a task is 'interactive' then we reinsert it in the active
diff -Nru a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c Tue Dec 31 13:48:26 2002
+++ b/kernel/sysctl.c Tue Dec 31 13:48:26 2002
@@ -55,6 +55,17 @@
extern int cad_pid;
extern int pid_max;
extern int sysctl_lower_zone_protection;
+extern int min_timeslice;
+extern int max_timeslice;
+extern int child_penalty;
+extern int parent_penalty;
+extern int exit_weight;
+extern int prio_bonus_ratio;
+extern int interactive_delta;
+extern int max_sleep_avg;
+extern int starvation_limit;
+extern int thread_penalty;
+extern int user_penalty;

/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -112,6 +123,7 @@

static ctl_table kern_table[];
static ctl_table vm_table[];
+static ctl_table sched_table[];
#ifdef CONFIG_NET
extern ctl_table net_table[];
#endif
@@ -156,6 +168,7 @@
{CTL_FS, "fs", NULL, 0, 0555, fs_table},
{CTL_DEBUG, "debug", NULL, 0, 0555, debug_table},
{CTL_DEV, "dev", NULL, 0, 0555, dev_table},
+ {CTL_SCHED, "sched", NULL, 0, 0555, sched_table},
{0}
};

@@ -358,7 +371,33 @@

static ctl_table dev_table[] = {
{0}
-};
+};
+
+static ctl_table sched_table[] = {
+ {SCHED_MAX_TIMESLICE, "max_timeslice",
+ &max_timeslice, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_MIN_TIMESLICE, "min_timeslice",
+ &min_timeslice, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_CHILD_PENALTY, "child_penalty",
+ &child_penalty, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_PARENT_PENALTY, "parent_penalty",
+ &parent_penalty, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_EXIT_WEIGHT, "exit_weight",
+ &exit_weight, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_PRIO_BONUS_RATIO, "prio_bonus_ratio",
+ &prio_bonus_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_INTERACTIVE_DELTA, "interactive_delta",
+ &interactive_delta, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_MAX_SLEEP_AVG, "max_sleep_avg",
+ &max_sleep_avg, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_STARVATION_LIMIT, "starvation_limit",
+ &starvation_limit, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_THREAD_PENALTY, "thread_penalty",
+ &thread_penalty, sizeof(int), 0644, NULL, &proc_dointvec},
+ {SCHED_USER_PENALTY, "user_penalty",
+ &user_penalty, sizeof(int), 0644, NULL, &proc_dointvec},
+ {0}
+};

extern void init_irq_proc (void);

-----------------

2003-01-03 00:09:02

by Ingo Molnar

Subject: Re: [PATCH,RFC] fix o(1) handling of threads


On Wed, 1 Jan 2003, Ed Tomlinson wrote:

> Here is the scheduler-tunables patch updated to include USER_PENALTY and
> THREAD_PENALTY. This goes on top of ptg_B0.

there's no way we'll make the scheduler's internal constants tunable over such
a wide range. Such a patch was submitted a couple of months ago
already. I do use something like that to test tunings, but it's definitely
not something we want to make tunable directly in the stock kernel.

Ingo

2003-01-03 12:42:04

by Ed Tomlinson

Subject: Re: [PATCH,RFC] fix o(1) handling of threads

On January 2, 2003 07:22 pm, Ingo Molnar wrote:
> On Wed, 1 Jan 2003, Ed Tomlinson wrote:
> > Here is the scheduler-tunables patch updated to include USER_PENALTY and
> > THREAD_PENALTY. This goes on top of ptg_B0.
>
> there's no way we'll make the scheduler internal constants tunable in such
> a wide range. Such a patch has been submitted a couple of months ago
> already. I do use something like that to test tunings, but it's definitely
> not something we want to make tunable directly in the stock kernel.

Nor would I advocate doing so. I added two 'constants' that I wanted to
be able to test, so I updated Robert's patch... Two questions
for you.

1. Do you have any comments/suggestions on the ptg_B0 patch?

2. I have been playing with using the user and thread penalties together
- they often interact badly. Using just one works very well. This
can be fixed - but it gets messy. Alternatively, I am thinking about
implementing per-user policies, i.e.:

a. govern thread groups
b. govern all threads, ignoring groups, for a user
c. govern processes for a user

This can be done cleanly. Would something along the lines of sys_nice
be the way to implement the kernel side of the user interface to this?
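
For discussion, a purely hypothetical sketch of what the kernel side might
look like, modeled on sys_nice - the syscall name, the policy constants and
the idea of storing the policy in the user_struct are all invented here,
not part of any posted patch:

/* hypothetical only - nothing below exists in any posted patch */
#define PTG_POLICY_GROUPS  0 /* a. govern thread groups */
#define PTG_POLICY_THREADS 1 /* b. govern all of a user's threads */
#define PTG_POLICY_PROCS   2 /* c. govern all of a user's processes */

asmlinkage long sys_throttle(uid_t uid, int policy, int penalty)
{
	/* As with sys_nice, an unprivileged user could only tighten the
	 * throttle on himself; CAP_SYS_NICE would be needed to relax it
	 * or to set another user's policy. The chosen policy and penalty
	 * would live in the user_struct, where task_timeslice() could
	 * read them in place of the compile-time *_PENALTY constants. */
	return -ENOSYS; /* sketch only */
}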

TIA,
Ed Tomlinson