2002-02-02 15:50:36

by Ed Tomlinson

Subject: [PATCH] improving O(1)-J9 in heavily threaded situations

The following patch improves the performance of O(1) when under load
and threaded programs are running. For this patch, threaded is taken
to mean processes sharing their vm (p->mm). While this type of
programming is usually not optimal, it is very common (e.g. java).

Under load, with threaded tasks running niced and a cpu bound task at
normal priority, we quickly notice the load average shooting up. What
is happening is that all the threads are trying to run. Due to the nature
of O(1), all the processes in the current array must run to get on to the
inactive array. This quickly leads to the higher priority cpu bound
task starving. Running the threaded application at nice +19 can help
but does not solve the base issue. On UP, I have observed loads of
between 20 and 40 when this is happening.

I tried various approaches to correcting this. Neither reducing the
timeslices nor playing with the effective prio of the threaded tasks
helped much. What does seem quite effective is to monitor the total
time _all_ the tasks sharing a vm use. Once a threshold is exceeded,
we move any tasks in the same group directly to the inactive array,
temporarily increasing their effective prio.

With this change the cpu bound tasks can get over 80% of the cpu.

A couple of comments. The patch applies its logic to all processes.
This limits the number of times interactive tasks can requeue. With
this in place we can probably drop the expired_timestamp logic.

I see no reason why some of the tunables should not become part of
the standard scheduler, provided they do not add significant
overhead. The THREAD_PENALTY tunable is only used in the tick
processing logic (and not in the common path) - keeping it tunable
should not introduce measurable overhead.

Patch against 2.4.17pre7

Comments and feedback welcome,
Ed Tomlinson

--- linux/include/linux/sched.h.orig Fri Feb 1 23:19:44 2002
+++ linux/include/linux/sched.h Fri Feb 1 23:16:34 2002
@@ -211,6 +211,8 @@
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ unsigned long mm_hogstamp[NR_CPUS]; /* timestamp to reset the hogslice */
+ unsigned long mm_hogslice[NR_CPUS]; /* when this reaches 0, tasks using this mm are hogs */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects task page tables and mm->rss */
@@ -486,6 +488,7 @@
#define INTERACTIVE_DELTA 3
#define MAX_SLEEP_AVG (2*HZ)
#define STARVATION_LIMIT (2*HZ)
+#define THREAD_PENALTY 8

#define USER_PRIO(p) ((p)-MAX_RT_PRIO)
#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
--- linux/kernel/sched.c.orig Fri Feb 1 23:23:50 2002
+++ linux/kernel/sched.c Fri Feb 1 23:12:49 2002
@@ -41,7 +41,7 @@
*/
struct runqueue {
spinlock_t lock;
- unsigned long nr_running, nr_switches, expired_timestamp;
+ unsigned long nr_running, nr_switches, expired_timestamp, switch_timestamp;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
@@ -614,6 +614,28 @@
} else
enqueue_task(p, rq->active);
}
+
+ /*
+ * Here we stop a single interactive task or a group of tasks sharing
+ * a mm_struct (like java threads) from monopolizing the cpu.
+ */
+ if (p->mm) {
+ if (p->mm->mm_hogstamp[smp_processor_id()] < rq->switch_timestamp) {
+ p->mm->mm_hogstamp[smp_processor_id()] = rq->switch_timestamp;
+ p->mm->mm_hogslice[smp_processor_id()] =
+ NICE_TO_TIMESLICE(p->__nice) * THREAD_PENALTY;
+ }
+ if (!p->mm->mm_hogslice[smp_processor_id()]) {
+ dequeue_task(p, rq->active);
+ p->need_resched = 1;
+ p->prio--;
+ if (p->prio < MAX_RT_PRIO)
+ p->prio = MAX_RT_PRIO;
+ enqueue_task(p, rq->expired);
+ } else
+ p->mm->mm_hogslice[smp_processor_id()]--;
+ }
+
out:
#if CONFIG_SMP
if (!(now % BUSY_REBALANCE_TICK))
@@ -676,6 +698,7 @@
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
+ rq->switch_timestamp = jiffies;
}

idx = sched_find_first_bit(array->bitmap);



2002-02-03 06:00:20

by Ed Tomlinson

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Found a bug in the patch:

if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
enqueue_task(p, rq->expired);
>>> goto out;
} else
enqueue_task(p, rq->active);
}

/*
* Here we stop a single interactive task or a group of tasks sharing
* a mm_struct (like java threads) from monopolizing the cpu.
*/
if (p->mm) {
>>> if (p->mm->mm_hogstamp[smp_processor_id()] != rq->switch_timestamp) {

I added a 'goto out;' after the enqueue_task(p, rq->expired); I suspect bad
things can happen if I dequeue from the wrong array... Also changed the test
'p->mm->mm_hogstamp[smp_processor_id()] < rq->switch_timestamp' from <
to != as it will be safer when jiffies rolls over.

Ed Tomlinson

On February 2, 2002 10:50 am, Ed Tomlinson wrote:
> The following patch improves the performance of O(1) when under load
> and threaded programs are running. For this patch, threaded is taken
> to mean processes sharing their vm (p->mm). While this type of
> programming is usually not optimal it is very common (ie. java).
>
> Under load, with threaded tasks running niced and a cpu bound task at
> normal priority, we quickly notice the load average shooting up. What
> is happening is that all the threads are trying to run. Due to the nature
> of O(1), all the processes in the current array must run to get on to the
> inactive array. This quickly leads to the higher priority cpu bound
> task starving. Running the threaded application at nice +19 can help
> but does not solve the base issue. On UP, I have observed loads of
> between 20 and 40 when this is happening.
>
> I tried various approaches to correcting this. Neither reducing the
> timeslices nor playing with the effective prio of the threaded tasks
> helped much. What does seem quite effective is to monitor the total
> time _all_ the tasks sharing a vm use. Once a threshold is exceeded,
> we move any tasks in the same group directly to the inactive array,
> temporarily increasing their effective prio.
>
> With this change the cpu bound tasks can get over 80% of the cpu.
>
> A couple of comments. The patch applies its logic to all processes.
> This limits the number of times interactive tasks can requeue. With
> this in place we can probably drop the expired_timestamp logic.
>
> I see no reason why some of the tunables should not become part of
> the standard scheduler, provided they do not add significant
> overhead. The THREAD_PENALTY tunable is only used in the tick
> processing logic (and not in the common path) - keeping it tunable
> should not introduce measurable overhead.
>
> Patch against 2.4.17pre7
>
> Comments and feedback welcome,
> Ed Tomlinson
>
> --- linux/include/linux/sched.h.orig Fri Feb 1 23:19:44 2002
> +++ linux/include/linux/sched.h Fri Feb 1 23:16:34 2002
> @@ -211,6 +211,8 @@
> pgd_t * pgd;
> atomic_t mm_users; /* How many users with user space? */
> atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
> + unsigned long mm_hogstamp[NR_CPUS]; /* timestamp to reset the hogslice */
> + unsigned long mm_hogslice[NR_CPUS]; /* when this reaches 0, tasks using this mm are hogs */
> int map_count; /* number of VMAs */
> struct rw_semaphore mmap_sem;
> spinlock_t page_table_lock; /* Protects task page tables and mm->rss */
> @@ -486,6 +488,7 @@
> #define INTERACTIVE_DELTA 3
> #define MAX_SLEEP_AVG (2*HZ)
> #define STARVATION_LIMIT (2*HZ)
> +#define THREAD_PENALTY 8
>
> #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
> #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
> --- linux/kernel/sched.c.orig Fri Feb 1 23:23:50 2002
> +++ linux/kernel/sched.c Fri Feb 1 23:12:49 2002
> @@ -41,7 +41,7 @@
> */
> struct runqueue {
> spinlock_t lock;
> - unsigned long nr_running, nr_switches, expired_timestamp;
> + unsigned long nr_running, nr_switches, expired_timestamp, switch_timestamp;
> task_t *curr, *idle;
> prio_array_t *active, *expired, arrays[2];
> int prev_nr_running[NR_CPUS];
> @@ -614,6 +614,28 @@
> } else
> enqueue_task(p, rq->active);
> }
> +
> + /*
> + * Here we stop a single interactive task or a group of tasks sharing
> + * a mm_struct (like java threads) from monopolizing the cpu.
> + */
> + if (p->mm) {
> + if (p->mm->mm_hogstamp[smp_processor_id()] < rq->switch_timestamp) {
> + p->mm->mm_hogstamp[smp_processor_id()] = rq->switch_timestamp;
> + p->mm->mm_hogslice[smp_processor_id()] =
> + NICE_TO_TIMESLICE(p->__nice) * THREAD_PENALTY;
> + }
> + if (!p->mm->mm_hogslice[smp_processor_id()]) {
> + dequeue_task(p, rq->active);
> + p->need_resched = 1;
> + p->prio--;
> + if (p->prio < MAX_RT_PRIO)
> + p->prio = MAX_RT_PRIO;
> + enqueue_task(p, rq->expired);
> + } else
> + p->mm->mm_hogslice[smp_processor_id()]--;
> + }
> +
> out:
> #if CONFIG_SMP
> if (!(now % BUSY_REBALANCE_TICK))
> @@ -676,6 +698,7 @@
> rq->expired = array;
> array = rq->active;
> rq->expired_timestamp = 0;
> + rq->switch_timestamp = jiffies;
> }
>
> idx = sched_find_first_bit(array->bitmap);

2002-02-03 09:35:49

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Sat, 2 Feb 2002, Ed Tomlinson wrote:

> The following patch improves the performance of O(1) when under load
> and threaded programs are running. For this patch, threaded is taken
> to mean processes sharing their vm (p->mm). While this type of
> programming is usually not optimal it is very common (ie. java).

generally we do not make any distinction between Linux threads that share
or do not share their VM. We certainly do not want to handle them
differently wrt. timeslices distributed.

> Under load, with threaded tasks running niced and a cpu bound task at
> normal priority, we quickly notice the load average shooting up. What
> is happening is that all the threads are trying to run. Due to the nature
> of O(1), all the processes in the current array must run to get on to the
> inactive array. This quickly leads to the higher priority cpu bound
> task starving. Running the threaded application at nice +19 can help
> but does not solve the base issue. On UP, I have observed loads of
> between 20 and 40 when this is happening.

the problem of lower priority tasks taking away CPU time from a higher
priority task is not new, it's present in the old scheduler as well. In no
way is this symptom related to threads that share their VM. The same
workload with forked processes will show the same symptoms.

i'd suggest to do the following tunings on stock -J9:

- increase HZ to 1024, in include/asm/param.h
- decrease MIN_TIMESLICE to 1 msec in sched.h.
- decrease PRIO_BONUS_RATIO from 70 to 40.

these changes do two things: they decrease the timeslice of nice +19 tasks
(pretty dramatically, relative to current kernels), and they make sure
that heavily reniced tasks cannot reach interactive status easily.
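
(roughly, i mean edits along these lines - treat this as a sketch only, the
exact baseline definitions in -J9 may differ a bit:)

        /* include/asm/param.h */
        #define HZ 1024

        /* include/linux/sched.h */
        #define MIN_TIMESLICE           ( 1 * HZ / 1000)        /* ~1 msec */
        #define PRIO_BONUS_RATIO        40                      /* was 70 */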

do you still see higher priority CPU-bound task starving?

> I tried various approaches to correcting this. Neither reducing the
> timeslices nor playing with the effective prio of the threaded tasks
> helped much. What does seem quite effective is to monitor the total
> time _all_ the tasks sharing a vm use. Once a threshold is exceeded, we
> move any tasks in the same group directly to the inactive array,
> temporarily increasing their effective prio.

this in essence punishes threads that share their VM - including system
threads such as eg. kswapd or bdflush as well. This concept is broken
because the problem is not that the threads share their VM, that is just a
property of your particular workload. The problem is that while you have
lowered the priority of your java threads, they still show interactive
behavior and get a higher proportion of the CPU share they got previously.

i think your workload shows a weakness in the current handling of reniced
workloads, which can be fixed without adding any new mechanism.

> With this change the cpu bound tasks can get over 80% of the cpu.

please also try the tunings i suggested, and compare the two kernels.

Ingo

2002-02-03 15:46:18

by Ed Tomlinson

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

On February 3, 2002 06:32 am, Ingo Molnar wrote:
> On Sat, 2 Feb 2002, Ed Tomlinson wrote:
> > The following patch improves the performance of O(1) when under load
> > and threaded programs are running. For this patch, threaded is taken
> > to mean processes sharing their vm (p->mm). While this type of
> > programming is usually not optimal it is very common (ie. java).
>
> generally we do not make any distinction between Linux threads that share
> or do not share their VM. We certainly do not want to handle them
> differently wrt. timeslices distributed.

My change only defers the execution of 'threaded' tasks when they are placing
a heavy load on a cpu - it does not change the timeslices at all. Also, if
the tuneable patch is installed, the behaviour is completely controllable -
turn THREAD_PENALTY up to a high number and it reverts to the J9 behaviour.

> > Under load, with threaded tasks running niced and a cpu bound task at
> > normal priority, we quickly notice the load average shooting up. What
> > is happening is that all the threads are trying to run. Due to the nature
> > of O(1), all the processes in the current array must run to get on to the
> > inactive array. This quickly leads to the higher priority cpu bound
> > task starving. Running the threaded application at nice +19 can help
> > but does not solve the base issue. On UP, I have observed loads of
> > between 20 and 40 when this is happening.
>
> the problem of lower priority tasks taking away CPU time from a higher
> priority task is not new, it's present in the old scheduler as well. In no
> way is this symptom related to threads that share their VM. The same
> workload with forked processes will show the same symptoms.

Yes, and O(1) is already doing a better job. The reason this sort of
workload worries me is mainly java. Creating hundreds of threads is
not uncommon with java. The 1 to 1 relationship between kernel and
java threads makes this an issue. I do not think that there are that
many applications forking hundreds of active tasks (excepting fork bombs).

> i'd suggest to do the following tunings on stock -J9:
>
> - increase HZ to 1024, in include/asm/param.h
> - decrease MIN_TIMESLICE to 1 msec in sched.h.
> - decrease PRIO_BONUS_RATIO from 70 to 40.

I am sure this will help.

> these changes do two things: they decrease the timeslice of nice +19 tasks
> (pretty dramatically, relative to current kernels), and they make sure
> that heavily reniced tasks cannot reach interactive status easily.
>
> do you still see higher priority CPU-bound task starving?

The other half of this is: does the java application remain responsive?
Remember it is interactive in that parts of it feed a local browser.

> > I tried various approaches to correcting this. Neither reducing the
> > timeslices nor playing with the effective prio of the threaded tasks
> > helped much. What does seem quite effective is to monitor the total
> > time _all_ the tasks sharing a vm use. Once a threshold is exceeded, we
> > move any tasks in the same group directly to the inactive array,
> > temporarily increasing their effective prio.
>
> this in essence punishes threads that share their VM - including system
> threads such as eg. kswapd or bdflush as well. This concept is broken
> because the problem is not that the threads share their VM, that is just a
> property of your particular workload. The problem is that while you have
> lowered the priority of your java threads, they still show interactive
> behavior and get a higher proportion of the CPU share they got previously.

If system tasks are a problem it's easy to exclude them. I did not do
this since monitoring who was triggering this code did not show system
tasks.

What happens when the java threads really _are_ interactive? In my case
the test application is a freenet node. Part of it is acting as an http
proxy. Starving this results in an unresponsive system. Why should I
have to renice at all?

There is another way this could be coded. We could divide the normal
priorities into N buckets, say -20, -12, -4, +4, +12, +19. Then we
limit the total execution time spent by tasks in each bucket using the
same sort of logic I am using, except it's by bucket, not by mm. Time
for each bucket would be <tuneable> * NICE_TO_TIMESLICE(-20),
<tuneable> * NICE_TO_TIMESLICE(-12), etc.
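
A rough, untested sketch of what I mean - the bucket fields and helper here
are made up for illustration, they are not in my patch:

        /* divide nice -20..+19 into NR_BUCKETS groups; each bucket gets a
         * per-runqueue budget, reset when the arrays are switched - same
         * shape as the mm_hogslice logic, just keyed by bucket, not by mm */
        #define NR_BUCKETS      6
        #define BUCKET_PENALTY  8

        static inline int nice_to_bucket(int nice)
        {
                return (nice + 20) * NR_BUCKETS / 40;   /* -20..+19 -> 0..5 */
        }

        /* in the tick handler, assuming bucket_stamp[] and bucket_slice[]
         * were added to struct runqueue: */
        int b = nice_to_bucket(p->__nice);

        if (rq->bucket_stamp[b] != rq->switch_timestamp) {
                rq->bucket_stamp[b] = rq->switch_timestamp;
                rq->bucket_slice[b] = NICE_TO_TIMESLICE(p->__nice) * BUCKET_PENALTY;
        }
        if (!rq->bucket_slice[b]) {
                /* this bucket has used its share - defer to the expired array */
                dequeue_task(p, rq->active);
                p->need_resched = 1;
                enqueue_task(p, rq->expired);
        } else
                rq->bucket_slice[b]--;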

> i think your workload shows a weakness in the current handling of reniced
> workloads, which can be fixed without adding any new mechanizm.

Are you sure we really want renice to be needed to get good response for
common workloads? I understand the effect of priorities; will Aunt
Martha? I did not code using the bucket approach since it implies
more use of nice than the by-mm mechanism.

> > With this change the cpu bound tasks can get over 80% of the cpu.
>
> please also try the tunings i suggested, and compare the two kernels.

Will do.

Ed Tomlinson

2002-02-03 22:49:29

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Sun, 3 Feb 2002, Ed Tomlinson wrote:

> > these changes do two things: they decrease the timeslice of nice +19 tasks
> > (pretty dramatically, relative to current kernels), and they make sure
> > that heavily reniced tasks cannot reach interactive status easily.
> >
> > do you still see higher priority CPU-bound task starving?
>
> The other half of this is: does the java application remain responsive?
> Remember it is interactive in that parts of it feed a local browser.

yes. Priority boost/penalty works for reniced tasks just as well.

with the -K2 scheduler (i will release the patch soon) it will be
progressively harder for reniced tasks to gain 'heavily interactive'
status (ie. to be reinserted into the active array). For nice +19 tasks it
will be impossible to get this status. (for nice +15 tasks it's still
possible.)

> If system tasks are a problem its easy to exclude them. I did not do
> this since monitoring who was triggering this code did not show system
> tasks.

the fact that we might need to 'exclude' certain tasks from a mechanism
shows that the mechanism is not robust.

> What happens when the java threads really _are_ interactive? In my
> case the test application is a freenet node. Part of it is acting as
> a http proxy. Starving this results in an unresponsive system. Why
> should I have to renice at all?

you have to renice if you want to give non-java tasks a higher share of
the CPU time. Java threads will still be interactive relative to each
other.

> > i think your workload shows a weakness in the current handling of reniced
> > workloads, which can be fixed without adding any new mechanism.
>
> Are you sure we really want renice to be needed to get good response for
> common workloads? [...]

'response' in terms of interactive latencies should be good, yes.

'response' in terms of relative CPU time given to CPU hogs and interactive
tasks won't be as 'good' as with the old scheduler. (ie. CPU hogs *will* be
punished harder - this is what is needed for good interactivity after
all.) So if you see that some of your interactive tasks are not as
important as you'd like them to be, then renicing them somewhat will give
more CPU time even to CPU hogs. The kernel won't be able to figure out what
is important to you though - the default right now is that interactive
tasks are more important. If the opposite is desired then the kernel needs
external help - ie. nice values.

Ingo

2002-02-04 04:41:20

by Ed Tomlinson

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

On February 4, 2002 12:27 am, Ingo Molnar wrote:
> Ed,
>
> i've just uploaded -K2. It should handle your java load better, eg.
> renicing to +15 or +19 should give nice 0 CPU-hogs a clear advantage.
> Please let me know what you find,

Will see what K2 does tomorrow (probably evening).

>> If system tasks are a problem its easy to exclude them. I did not do
>> this since monitoring who was triggering this code did not show system
>> tasks.

>the fact that we might need to 'exclude' certain tasks from a mechanism
>shows that the mechanism is not robust.

Testing showed no reason to remove the system tasks. They were not being
deferred... We can always add exceptions - here I did not need to. I
do agree that if we have to start adding exceptions we probably have not
found the best mechanism.

One point that seems to get missed is that a group of java threads,
posix threads or sometimes forked processes combine to make an application.
Linux, at the scheduler level at least, does not have the ability to
determine that all the tasks are really one application. Under light
loads this makes no difference. When the load gets heavy, having
this ability helps.

Maybe we can control this with nice. Is this the best or only
way to do it? I am not at all sure it is. After all nice is just
another knob. The fewer knobs we have to tweak the easier linux
is to use.

I really think giving the linux scheduler more information (not necessarily
using a shared mm) about which groups of tasks comprise an application would
help things.

What I coded was an attempt to give the scheduler a way to cope under load.
If it knows groups of processes belong together then it can control them
when required. With my current code in place, running my freenet node at
nice +10 still leaves me with a very responsive system.

I am very interested to see what K2 does here - if no new code is needed,
great. On the other hand, if we can figure out a way to add a simple and
understandable knob that lets it perform better under load, I do not think
it's a bad thing either.

Ed




2002-02-04 10:34:27

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Sun, 3 Feb 2002, Ed Tomlinson wrote:

> Maybe we can control this with nice. Is this the best or only way to
> do it? I am not at all sure it is. After all nice is just another
> knob. The fewer knobs we have to tweak the easier linux is to use.

> [...] On the other hand, if we can figure out a way to add a simple and
> understandable knob that lets it perform better under load, I do not think
> it's a bad thing either.

comparing these two paragraphs you'll see what kind of paradox we face.
I'm trying to keep the number of external knobs down, and i'm trying to
revive nice levels as a thing that makes some real difference, while still
handling all the important cases automatically, wherever we can.

Ingo

2002-02-04 10:40:27

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Sun, 3 Feb 2002, Ed Tomlinson wrote:

> One point that seems to get missed is that a group of java threads,
> posix threads or sometimes forked processes combine to make an
> application. [...]

yes - but what makes them an application is not really the fact that they
share the VM (i can very much imagine thread-based login servers where
different users use different threads - a single application as well?),
but the intention of the application designer, which is hard to guess.

if this becomes inevitable then perhaps a better line we can guess along
is the child-parent relationship. Looking at 'pstree' output shows some
clear application boundaries. I'd say an application is 'all children of
a parent'. Ie. if two threads (shared VM or not shared VM, does not
matter) have the same parent (which is not init) then they form an
'application'. This will cover FreeNet java threads just as well as
hundreds of Apache processes.
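
(just to illustrate the kind of test i mean - a sketch only, using the 2.4
parent pointer; i'm not proposing this exact code:)

        /* treat 'all children of a non-init parent' as one application */
        static inline task_t *app_leader(task_t *p)
        {
                if (p->p_pptr && p->p_pptr->pid != 1)
                        return p->p_pptr;
                return p;
        }

        /* two tasks form one 'application' if their leaders match */
        static inline int same_application(task_t *p, task_t *q)
        {
                return app_leader(p) == app_leader(q);
        }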

but this method is guesswork as well, so it could mishandle certain cases.
Eg. i'm quite certain that most people would notice the interactive
effects if we handled all processes forked by kdeinit as a single
application. So lets do it only if everything else fails to fix your
workload.

Ingo

2002-02-04 10:48:17

by Arjan van de Ven

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> On Sun, 3 Feb 2002, Ed Tomlinson wrote:
>
> > One point that seems to get missed is that a group of java threads,
> > posix threads or sometimes forked processes combine to make an
> > application. [...]
>
> yes - but what makes them an application is not really the fact that they
> share the VM (i can very much imagine thread-based login servers where
> different users use different threads - a single application as well?),
> but the intention of the application designer, which is hard to guess.

sharing the same Thread Group ID would be a very obvious quantity to check,
and would very much show the indication of the application author.
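
e.g. keying the grouping off the tgid rather than the mm - just a sketch,
the helper name here is made up:

        /* group tasks by thread group id instead of by shared mm */
        static inline int same_tgid_group(task_t *p, task_t *q)
        {
                return p->tgid == q->tgid;
        }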

2002-02-04 12:12:54

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Mon, 4 Feb 2002, Arjan van de Ven wrote:

> > yes - but what makes them an application is not really the fact that they
> > share the VM (i can very much imagine thread-based login servers where
> > different users use different threads - a single application as well?),
> > but the intention of the application designer, which is hard to guess.
>
> sharing the same Thread Group ID would be a very obvious quantity to
> check, and would very much show the indication of the application
> author.

agreed.

Ingo

2002-02-04 12:18:35

by Ed Tomlinson

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

On February 4, 2002 07:30 am, Ingo Molnar wrote:
> On Sun, 3 Feb 2002, Ed Tomlinson wrote:
> > Maybe we can control this with nice. Is this the best or only way to
> > do it? I am not at all sure it is. After all nice is just another
> > knob. The fewer knobs we have to tweak the easier linux is to use.
> >
> > [...] On the other hand, if we can figure out a way to add a simple and
> > understandable knob that lets it perform better under load, I do not think
> > it's a bad thing either.
>
> comparing these two paragraphs you'll see what kind of paradox we face.
> I'm trying to keep the number of external knobs down, and i'm trying to
> revive nice levels as a thing that makes some real difference, while still
> handling all the important cases automatically, wherever we can

I am very much aware of this and agree that the fewer knobs one _has_
to play with the better.

> > One point that seems to get missed is that a group of java threads,
> > posix threads or sometimes forked processes combine to make an
> > application. [...]
>
> yes - but what makes them an application is not really the fact that they
> share the VM (i can very much imagine thread-based login servers where
> different users use different threads - a single application as well?),
> but the intention of the application designer, which is hard to guess.
>
> if this becomes inevitable then perhaps a better line we can guess along
> is the child-parent relationship. Looking at 'pstree' output shows some
> clear application boundaries. I'd say an application is 'all children of
> a parent'. Ie. if two threads (shared VM or not shared VM, does not
> matter) have the same parent (which is not init) then they form an
> 'application'. This will cover FreeNet java threads just as well as
> hundreds of Apache processes.

I had thought of this too. This could work ok if we only go back one level.
For my load the ->mm method was more specific, though; thinking of Apache,
it's probably not general enough.

> but this method is guesswork as well, so it could mishandle certain cases.
> Eg. i'm quite certain that most people would notice the interactive
> effects if we handled all processes forked by kdeinit as a single
> application. So lets do it only if everything else fails to fix your
> workload.

K2 today; it will take a day or two to get a real feel for K2 - this week is
going to be a busy one (it's time to put our serviced SAP system into
production).

I wonder just how much they will notice though. If we use a scheme like
I implemented for the shared mm, it only triggers under load and then still
runs the interactive tasks first.

BTW, there is another buglet in my code. If we defer a task we need to
check the expired timestamp and set it if it's zero. Probably it will not
be zero, but checking it would be a good idea - that is, if the code ends
up being required at all.
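
Something like this in the deferral path, mirroring what the existing
expired-array enqueue already does (a sketch, not yet in the posted patch):

        dequeue_task(p, rq->active);
        p->need_resched = 1;
        if (!rq->expired_timestamp)
                rq->expired_timestamp = jiffies;
        enqueue_task(p, rq->expired);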

Ed

2002-02-04 12:33:06

by Ed Tomlinson

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Arjan van de Ven wrote:

> Ingo Molnar wrote:
>>
>> On Sun, 3 Feb 2002, Ed Tomlinson wrote:
>>
>> > One point that seems to get missed is that a group of java threads,
>> > posix threads or sometimes forked processes combine to make an
>> > application. [...]
>>
>> yes - but what makes them an application is not really the fact that they
>> share the VM (i can very much imagine thread-based login servers where
>> different users use different threads - a single application as well?),
>> but the intention of the application designer, which is hard to guess.
>
> sharing the same Thread Group ID would be a very obvious quantity to
> check,
> and would very much show the indication of the application author.

I tried this. Looks like not all (many?) apps actually use this.

Ed

2002-02-04 12:34:05

by Arjan van de Ven

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

On Mon, Feb 04, 2002 at 07:32:46AM -0500, Ed Tomlinson wrote:
> > sharing the same Thread Group ID would be a very obvious quantity to
> > check,
> > and would very much show the indication of the application author.
>
> I Tried this. Looks like not all (many?) apps actually use this.

would be a case of fixing those apps imo

2002-02-04 19:55:52

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> 'response' in terms of interactive latencies should be good, yes.
>
> 'response' in terms of relative CPU time given to CPU hogs and
> interactive tasks wont be as 'good' as with the old scheduler. (ie. CPU
> hogs *will* be punished harder - this is what is needed for good
> interactivity after all.) So if you see that some of your interactive

How about priority inheritance, for when an interactive task relies heavily
on output from a CPU hog?

My application uses a three tier architecture where a low HAL layer reads
audio from the soundcard, which is compressed and sent (TCP) to a distributor
process which decompresses the audio and distributes it (UNIX/LOCAL) to
clients. The distributor's clients are the CPU hogs, doing various processing
tasks on the signal and then sending (TCP) the results to the very thin user
interface.

Now the user interface can take some CPU time and is considered to be
interactive. But if that leads to starvation of the CPU hog processing tasks,
it leads to unusable output on the user interface, because starvation leads
to losing blocks of data in the distributor process.

The HAL and distributor are running as SCHED_FIFO, but the CPU hog processing
tasks are dynamically fork()/exec()'d and run at default priority (not as
root). So should I nice the user interfaces to 15+?

The app can be found at: http://hasas.sf.net


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-04 20:04:52

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ed Tomlinson wrote:
>
> One point that seems to get missed is that a group of java threads,
> posix threads or sometimes forked processes combine to make an
> application. Linux, at the scheduler level at least, does not have the
> ability to determine that all the tasks are really one application.
> Under light loads this makes no difference. When the load gets heavy
> having this ability helps here.

My application is a very good example of this kind of application. I'm very
worried about the way the new scheduler is beginning to behave. It's a
combination of single processes with many threads and many processes with
single threads. And because it's a kind of realtime application, _all_
processes and threads are "interactive". CPU load is typically very high.


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-04 20:08:52

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Mon, 4 Feb 2002, Jussi Laako wrote:

> My application is a very good example of this kind of application. I'm
> very worried about the way the new scheduler is beginning to behave. It's
> a combination of single processes with many threads and many processes
> with single threads. [...]

please give it a test then (i'd suggest using the latest, -K2 patch) and
let me know about what you find - this way we can fix any possible real
problems instead of talking in hypotheticals.

Ingo

2002-02-04 22:25:25

by Bill Davidsen

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

On Sun, 3 Feb 2002, Ed Tomlinson wrote:

> I really think giving the linux schedule more information (not necessarily
> using a shared mm) about which groups of tasks comprise an application would
> help things.
>
> What I coded was an attempt to give the scheduler a way to cope under load.
> If it knows groups of processes belong together then it can control them
> when required. With my current code it place running my freenet node at
> nice +10 still leaves me with a very responsive system.

I don't much like the vm as a way to determine association, but between
process groups and thread groups I would think that if there isn't useful
information perhaps something isn't working as desired.

I firmly believe that nice should be used to get uncommon behaviour, and
that the scheduler should do a decent job with most common loads, which
includes threads.

I hope to try K2 this weekend; I want to get O1, rmap, and low
latency. If the latest O1 doesn't have child run first I want to add that
as well, barring someone providing a reason not to. I suspect I'll go
head down for some hours doing that, which is why I only evaluate multiple
features every few weeks :-(

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-04 22:58:51

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Tue, 5 Feb 2002, Jussi Laako wrote:

> Ok, now I made a comparison test with exactly the same kernel except the
> other one with the -K2 patch. O1-K2 behaves significantly worse than the
> old scheduler. I think this behaviour was introduced somewhere around the
> beginning of the -J series. I can't make a kernel with the old scheduler
> lose datablocks, but with O1 it loses a large percentage of the blocks.

what does 'lose datablocks' mean? What application loses datablocks?

Ingo

2002-02-04 22:59:21

by Jussi Laako

[permalink] [raw]
Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> please give it a test then (i'd suggest using the latest, -K2 patch) and
> let me know about what you find - this way we can fix any possible real
> problems instead of talking in hypotheticals.

I've been testing all your scheduler versions since first one.

Ok, now I made a comparison test with exactly the same kernel except the other
one with the -K2 patch. O1-K2 behaves significantly worse than the old
scheduler. I think this behaviour was introduced somewhere around the
beginning of the -J series. I can't make a kernel with the old scheduler lose
datablocks, but with O1 it loses a large percentage of the blocks.


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-04 23:01:51

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Mon, 4 Feb 2002, Jussi Laako wrote:

> My application uses three tier architecture where is low HAL layer
> reading audio from soundcard which is compressed and sent (TCP) to
> distributor process which decompresses the audio and distributes
> (UNIX/LOCAL) it to clients. Distributor's clients are the CPU hogs
> doing various processing tasks to the signal and then sending (TCP)
> the results to the very thin user interface.

Please renice your CPU hog soundcard processes to -11, does that make any
difference? (under -K2)

> HAL and distributor are running as SCHED_FIFO, but CPU hog processing
> tasks are dynamically fork()/exec()'d and run on default priority (not
> as root). So I should nice user interfaces to 15+?

is it more important to run these CPU hogs than to run interactive tasks?
If yes then renice them to -11.

Ingo

2002-02-04 23:33:16

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> what does 'lose datablocks' mean? What application loses datablocks?

My app http://hasas.sf.net

It's because either the distributor's send thread isn't woken up at
pthread_cond_broadcast(), or it's not at pthread_cond_wait(), because the
receiving process is not receiving its data due to CPU time starvation.
Thus the receiving "CPU hog" process is losing blocks of data.

Data path basically is:

1) read() data from soundcard
2) pthread_cond_broadcast() it
3) pthread_cond_wait() returns | there are N of these
4) write() data to tcp socket |

5) read() data from tcp socket
6) pthread_cond_broadcast() it
7) pthread_cond_wait() returns | there are N of these
8) write() data to unix socket | (here the loss probably happens)

9) read() data from unix socket | there are N of these
10) do some CPU hog stuff | (this process doesn't get all
11) write() results to tcp socket | of the data)

12) read() results from tcp socket | ...and N of these
13) scale data for drawing to screen |
14) do a bit system-cpu-time hog drawing to screen |


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-04 23:37:16

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> Please renice your CPU hog soundcard processes to -11, does that make any
> difference? (under -K2)

I can renice this only for testing purposes. Normally these are not run as
root so I can't do negative renice.

> is it more important to run these CPU hogs than to run interactive tasks?
> If yes then renice them to -11.

Yes and no... :) Interactive tasks get their work from the CPU hogs, so those
are strongly related. If an interactive task puts a CPU hog to wait, it will
also lose its data.


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-04 23:37:16

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Tue, 5 Feb 2002, Jussi Laako wrote:

> Thus receiving "CPU hog" process is losing blocks of data.

i understand, and we want to handle such cases perfectly as well.

How much of a CPU hog is your task? What does 'top' show while you are
using your app? (pasting top output here will show the situation.)

Ingo

2002-02-04 23:40:26

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Tue, 5 Feb 2002, Jussi Laako wrote:

> > Please renice your CPU hog soundcard processes to -11, does that make any
> > difference? (under -K2)
>
> I can renice this only for testing purposes. Normally these are not
> run as root so I can't do negative renice.

but you can run the audio tasks as SCHED_FIFO?

(you do not have to run the tasks as root, you only have to do the renice
as root.)

> > is it more important to run these CPU hogs than to run interactive tasks?
> > If yes then renice them to -11.
>
> Yes and no... :) Interactive tasks get their work from CPU hogs so
> those are strongly related. If interactive task puts CPU hog to wait
> it will also lose it's data.

you can tune the actual weight of importance by decreasing the priority of
the CPU-hog (or increasing the priority of the interactive task) a bit.
But ideally you'd want to decrease the priority of the CPU hog, because
that way you protect it not only from your 'own' interactive tasks, but
from other activities on the system as well.

Ingo

2002-02-04 23:43:36

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Tue, 5 Feb 2002, Jussi Laako wrote:

> I can renice this only for testing purposes. Normally these are not
> run as root so I can't do negative renice.

by the way, i'm thinking about allowing unprivileged tasks to go back to
nice -5. But i fear it will be abused and people will just put a renice
into their .bashrc which defeats the purpose of this.

Ingo

2002-02-06 00:44:22

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> > I can renice this only for testing purposes. Normally these are not
> > run as root so I can't do negative renice.
>
> but you can run the audio tasks as SCHED_FIFO?

Yes, two server processes (input and distributor) are running all the time
transferring data; the actual CPU-time-consuming calculation processes (which
connect to the distributor) are spawned dynamically, as requested by ui
clients, from an inetd-kind of server running at normal user privileges.

> (you do not have to run the tasks as root, you only have to do the renice
> as root.)
> > > is it more important to run these CPU hogs than to run interactive
> > > tasks? If yes then renice them to -11.

I tried to renice many of the processes in different combinations and it
didn't make any difference.

> How much of a CPU hog is your task? What does 'top' show while you are
> using your app? (pasting top output here will show the situation.)

This looks like some kind of resonance, as it works better with some
specific loads and worse with others. Is it possible this has something to
do with longer timeslices?

Here's top screens of two badly behaving load combinations:

Data path is

1) soundsrv2
2) streamdist
3) spectrum
4) guispect

--- 8< ---

5:01am up 6 min, 2 users, load average: 0.44, 0.47, 0.22
75 processes: 72 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 14.3% user, 11.9% system, 0.0% nice, 73.6% idle
Mem: 257040K av, 155588K used, 101452K free, 0K shrd, 39632K buff
Swap: 257032K av, 0K used, 257032K free 46024K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
809 root 15 0 58312 14M 3744 S 11.8 5.7 1:12 X
946 visitor 15 0 9572 9572 3584 S 8.0 3.7 0:26 guispect
999 visitor 15 0 1616 1616 1324 S 2.2 0.6 0:01 spectrum
944 visitor 15 0 1184 1184 984 S 1.5 0.4 0:03 soundsrv2
939 visitor 15 0 1184 1184 984 R 1.4 0.4 0:06 soundsrv2
1000 visitor 15 0 1236 1236 1024 S 0.7 0.4 0:00 streamdist
943 visitor 15 0 1236 1236 1024 S 0.5 0.4 0:02 streamdist
912 visitor 15 0 4564 4564 3544 R 0.2 1.7 0:00 gnome-terminal
1006 visitor 15 0 1076 1076 856 R 0.1 0.4 0:00 top
1 root 15 0 532 532 464 S 0.0 0.2 0:04 init
2 root 15 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 15 0 0 0 0 SW 0.0 0.0 0:00 kapm-idled
4 root 15 0 0 0 0 SW 0.0 0.0 0:00 kswapd
5 root 25 0 0 0 0 SW 0.0 0.0 0:00 bdflush
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 kupdated
7 root 15 0 0 0 0 SW 0.0 0.0 0:00 kjournald
105 root 15 0 0 0 0 SW 0.0 0.0 0:00 kjournald

--- 8< ---

5:01am up 6 min, 2 users, load average: 0.35, 0.44, 0.22
75 processes: 71 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 19.4% user, 27.7% system, 0.0% nice, 52.7% idle
Mem: 257040K av, 155648K used, 101392K free, 0K shrd, 39664K buff
Swap: 257032K av, 0K used, 257032K free 46032K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
809 root 15 0 58328 14M 3744 S 27.9 5.7 1:19 X
946 visitor 15 0 9572 9572 3584 R 11.3 3.7 0:29 guispect
939 visitor 15 0 1184 1184 984 R 2.4 0.4 0:07 soundsrv2
1014 visitor 15 0 1612 1612 1324 S 2.4 0.6 0:00 spectrum
944 visitor 15 0 1184 1184 984 S 1.0 0.4 0:03 soundsrv2
1015 visitor 15 0 1236 1236 1024 S 0.5 0.4 0:00 streamdist
943 visitor 15 0 1236 1236 1024 S 0.4 0.4 0:02 streamdist
862 visitor 15 0 5416 5416 4012 S 0.1 2.1 0:00 panel
891 visitor 15 0 4348 4348 3552 S 0.1 1.6 0:00 tasklist_applet
1 root 15 0 532 532 464 S 0.0 0.2 0:04 init
2 root 15 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 15 0 0 0 0 SW 0.0 0.0 0:00 kapm-idled
4 root 15 0 0 0 0 SW 0.0 0.0 0:00 kswapd
5 root 25 0 0 0 0 SW 0.0 0.0 0:00 bdflush
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 kupdated
7 root 15 0 0 0 0 SW 0.0 0.0 0:00 kjournald
105 root 15 0 0 0 0 SW 0.0 0.0 0:00 kjournald

--- 8< ---


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-06 10:40:04

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


On Wed, 6 Feb 2002, Jussi Laako wrote:

> > > > tasks? If yes then renice them to -11.
>
> I tried to renice many of the processes with different combinations
> and didn't make any difference.

> Here's top screens of two badly behaving load combinations:

> CPU states: 14.3% user, 11.9% system, 0.0% nice, 73.6% idle

73% of CPU time is idle! So priorities will make no (or little)
difference.

> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
> 809 root 15 0 58312 14M 3744 S 11.8 5.7 1:12 X
> 946 visitor 15 0 9572 9572 3584 S 8.0 3.7 0:26 guispect
> 999 visitor 15 0 1616 1616 1324 S 2.2 0.6 0:01 spectrum
> 944 visitor 15 0 1184 1184 984 S 1.5 0.4 0:03 soundsrv2
> 939 visitor 15 0 1184 1184 984 R 1.4 0.4 0:06 soundsrv2
> 1000 visitor 15 0 1236 1236 1024 S 0.7 0.4 0:00 streamdist
> 943 visitor 15 0 1236 1236 1024 S 0.5 0.4 0:02 streamdist
> 912 visitor 15 0 4564 4564 3544 R 0.2 1.7 0:00 gnome-terminal

since there is lots of idle CPU time left, the only thing that could make
a difference is timeslice length.

to 'fix' this (temporarily), renice all the non-SCHED_FIFO processes to
nice +19, that will give them the minimum timeslice length.

still dropped sound? If this fixes it then we can think about ways to fix
it for real.

again, what does it need to cause the 'bad' situation - too much latency
in the sound delivery path causing the audio buffers to starve? (or audio
buffers to be dropped?)

> CPU states: 19.4% user, 27.7% system, 0.0% nice, 52.7% idle

idle time in this snapshot too.

Ingo

2002-02-07 01:11:11

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> since there is lots of idle CPU time left, the only thing that could make
> a difference is timeslice length.
> to 'fix' this (temporarily), renice all the non-SCHED_FIFO processes to
> nice +19, that will give them the minimum timeslice length.

That caused a "choke" effect of about 10 seconds and then returned to
normal with no visible change.

> again, what does it need to cause the 'bad' situation - too much latency
> in the sound delivery path causing the audio buffers to starve? (or audio
> buffers to be dropped?)

If processes doing read()/write() on a socket are not scheduled fast enough,
the sender thread in the distributor process (SCHED_FIFO) will lose the
pthread_cond_*() event for a new block of data. This non-waiting datablock
posting mechanism is made to prevent one hung or otherwise non-responding
process from affecting the other processes' dataflow, esp. in SMP systems. So
all the output threads must be back from write() and on pthread_cond_wait()
before the input thread finishes its read() and issues
pthread_cond_broadcast() for a new datablock. If one of those output threads
is not on pthread_cond_wait() it will of course lose its datablock.
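
In simplified form the posting mechanism looks roughly like this (a sketch
with made-up names, not the real code):

        #include <pthread.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  new_block = PTHREAD_COND_INITIALIZER;
        static unsigned long   block_seq;   /* number of the current block */

        /* input thread: publish the next block and wake everyone, never
         * waiting for receivers - the previous block is simply overwritten */
        void post_block(void)
        {
                pthread_mutex_lock(&lock);
                block_seq++;
                pthread_cond_broadcast(&new_block);
                pthread_mutex_unlock(&lock);
        }

        /* output thread: if the scheduler kept us in write() too long,
         * block_seq has advanced by more than one and those blocks are lost */
        unsigned long wait_block(unsigned long last_seen)
        {
                pthread_mutex_lock(&lock);
                while (block_seq == last_seen)
                        pthread_cond_wait(&new_block, &lock);
                pthread_mutex_unlock(&lock);
                return block_seq;
        }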


I did even more testing and found a situation where the old scheduler also
does the same thing. It seems to be related to the frequency (/size) at which
processes issue read()/write() calls on the socket. Below _and_ above this
frequency everything works fine. This frequency is a bit lower with the old
scheduler. So there is some kind of grey frequency region.


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-02-07 20:30:54

by Ingo Molnar

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations


Jussi,

there is one more thing in the -K2 patch that could cause your problems.
In kernel/softirq.c, you'll find this line:

//__initcall(spawn_ksoftirqd);

please uncomment it - this was just a debugging thing that was left in the
patch accidentally. I've made a -K3 patch that has this fixed. Do you
still see the audio problems?

Ingo

2002-02-10 15:02:03

by Jussi Laako

Subject: Re: [PATCH] improving O(1)-J9 in heavily threaded situations

Ingo Molnar wrote:
>
> there is one more thing in the -K2 patch that could cause your problems.
> In kernel/softirq.c, you'll find this line:
> //__initcall(spawn_ksoftirqd);
> please uncomment it - this was just a debugging thing that was left in
> the patch accidentally. I've made a -K3 patch that has this fixed. Do you
> still see the audio problems?

I did this and also tried -K3. It didn't fix the problem.

I added lost block count printing to the SCHED_FIFO server processes. Most
of the loss (about 75%) happens at the lowest level soundcard server and the
rest in the distributor process. Usually it loses 1 block at a time, but
occasionally there is a peak of about 18 lost blocks.

If I make the client process read larger blocks (> 4 kB) from the distributor
process, the number of lost blocks at the soundcard server rises
significantly. I can make it a bit smaller without increased loss, but of
course that means more overhead, and eventually more lost blocks if made
something like 512 bytes. 4 kB is the optimal block size as it's also the
internal block size used by the soundcard and distributor servers.

4 kB block size means 2.9 ms in time in this case.


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers