2002-06-04 15:53:56

by Robert Love

Subject: [PATCH] scheduler hints

So I went ahead and implemented scheduler hints on top of the O(1)
scheduler.

I tried to find a decent paper on the web covering scheduler hints
(sometimes referred to as hint-based scheduling) but could not find
anything worthwhile. Solaris, for example, implements scheduler hints
so perhaps the "Solaris Internals" book has some information.

Basically, scheduler hints are a way for a program to give a "hint" to
the scheduler about its present behavior in the hopes of the scheduler
subsequently making better scheduling decisions. After all, who knows
better than the application what it is about to do?

For example, consider a group of SCHED_RR threads that share a
semaphore. Before one of the threads acquires the semaphore, it
could give a "hint" to the scheduler to increase its remaining timeslice
in order to ensure it can complete its work and drop the semaphore
before being preempted. If it were preempted, it would just end up
being rescheduled anyway, as the other real-time threads would
eventually block on the held semaphore.

Other hints could be "I am interactive" or "I am a batch (i.e. cpu hog)
task" or "I am cache hot: try to keep me on this CPU".

The scheduler already tries hard to figure out these three qualities on
its own and is usually right, although perhaps it can react more quickly
to these hints than it can figure things out by itself. If nothing else,
this serves as a useful tool for determining just how accurate our O(1)
scheduler is.

I am not necessarily suggesting this for inclusion; it is more of a
"just for fun" thing that turned into something with which I am actually
seeing improvements, so I post it here for others to see.

You use scheduler hints in a program by doing something like,

sched_hint(hint)

where `hint' is currently one or more of:

HINT_TIME - task needs some more quanta, boost
remaining timeslice

HINT_INTERACTIVE - task is interactive, give it a
small priority bonus to help.

HINT_BATCH - task is a batch-processed task, give
it a small priority penalty to be fair.
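
To make this concrete, a caller might look something like the following.
This is only a sketch: there is no glibc wrapper for the new syscall, so
a program would go through syscall(2), and the semaphore and work
function are made-up placeholders. The syscall number (243) is the i386
value from the unistd.h hunk in the patch below; note that the call
currently requires CAP_SYS_NICE.

#include <unistd.h>
#include <sys/syscall.h>
#include <semaphore.h>

#define HINT_TIME		1	/* increase remaining timeslice */
#define HINT_INTERACTIVE	2	/* interactive task: prio bonus */
#define HINT_BATCH		4	/* batch task: prio penalty */

#define __NR_sched_hint		243	/* i386 value from the patch below */

static inline long sched_hint(unsigned long hint)
{
	/* no glibc wrapper yet, so invoke the raw syscall */
	return syscall(__NR_sched_hint, hint);
}

extern sem_t resource_sem;		/* hypothetical shared semaphore */
extern void do_critical_work(void);	/* hypothetical work function */

void worker(void)
{
	/* ask for a fresh timeslice so the critical section can
	 * complete before we are preempted while holding the lock */
	sched_hint(HINT_TIME);

	sem_wait(&resource_sem);
	do_critical_work();
	sem_post(&resource_sem);
}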

Right now the code makes no attempt to be fair - a program giving
HINT_TIME now will not receive some sort of penalty later. Thus
sched_hint requires CAP_SYS_NICE. This really is not what we want; we
need any arbitrary task to be able to give these hints. Since the
current solution is not fair, however, we cannot sanely do that.

A patch for 2.5.20 is attached. A patch for 2.4 + O(1) scheduler as
well as a lame example program can be had at:

ftp://ftp.kernel.org/pub/linux/kernel/people/rml/sched/scheduler-hints

Any comments or suggestions are welcome. Thanks,

Robert Love

diff -urN linux-2.5.20/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.20/arch/i386/kernel/entry.S Sun Jun 2 18:44:44 2002
+++ linux/arch/i386/kernel/entry.S Mon Jun 3 13:48:43 2002
@@ -785,6 +785,7 @@
.long sys_futex /* 240 */
.long sys_sched_setaffinity
.long sys_sched_getaffinity
+ .long sys_sched_hint

.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff -urN linux-2.5.20/include/asm-i386/unistd.h linux/include/asm-i386/unistd.h
--- linux-2.5.20/include/asm-i386/unistd.h Sun Jun 2 18:44:51 2002
+++ linux/include/asm-i386/unistd.h Mon Jun 3 13:48:59 2002
@@ -247,6 +247,7 @@
#define __NR_futex 240
#define __NR_sched_setaffinity 241
#define __NR_sched_getaffinity 242
+#define __NR_sched_hint 243

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -urN linux-2.5.20/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.20/include/linux/sched.h Sun Jun 2 18:44:41 2002
+++ linux/include/linux/sched.h Mon Jun 3 17:10:10 2002
@@ -116,6 +116,13 @@
#endif

/*
+ * Scheduling Hints
+ */
+#define HINT_TIME 1 /* increase remaining timeslice */
+#define HINT_INTERACTIVE 2 /* interactive task: prio bonus */
+#define HINT_BATCH 4 /* batch task: prio penalty */
+
+/*
* Scheduling policies
*/
#define SCHED_OTHER 0
diff -urN linux-2.5.20/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.20/kernel/sched.c Sun Jun 2 18:44:44 2002
+++ linux/kernel/sched.c Mon Jun 3 17:09:09 2002
@@ -1143,7 +1143,7 @@
policy != SCHED_OTHER)
goto out_unlock;
}
-
+
/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
* 1..MAX_USER_RT_PRIO, valid priority for SCHED_OTHER is 0.
@@ -1336,6 +1336,64 @@
return real_len;
}

+/*
+ * sys_sched_hint - give the scheduler a hint to (hopefully) provide
+ * better scheduling behavior. For example, if a task is about
+ * to acquire a highly contended resource, it would be wise to
+ * increase its remaining timeslice to ensure it could drop the
+ * resource before being preempted.
+ *
+ * `hint' is the hint to the scheduler, defined in include/linux/sched.h
+ */
+asmlinkage int sys_sched_hint(unsigned long hint)
+{
+ int ret = -EINVAL;
+ unsigned long flags;
+ runqueue_t *rq;
+
+ /*
+ * Requiring CAP_SYS_NICE is an issue: we really want any task
+ * to be able to give the scheduler a `hint' but we have no
+ * way of ensuring fairness. The compromise is to require
+ * some sort of permission... you may want to get rid of this.
+ */
+ if (!capable(CAP_SYS_NICE))
+ return -EPERM;
+
+ rq = task_rq_lock(current, &flags);
+
+ if (hint & HINT_TIME) {
+ current->time_slice = MAX_TIMESLICE;
+ /*
+ * we may have run out of timeslice and have been
+ * put on the expired runqueue: if so, fix that.
+ */
+ if (unlikely(current->array != rq->active)) {
+ dequeue_task(current, current->array);
+ enqueue_task(current, rq->active);
+ }
+ ret = 0;
+ }
+
+ if (hint & HINT_INTERACTIVE) {
+ dequeue_task(current, current->array);
+ current->sleep_avg = MAX_SLEEP_AVG;
+ current->prio = effective_prio(current);
+ enqueue_task(current, rq->active);
+ ret = 0;
+ } else if (hint & HINT_BATCH) {
+ dequeue_task(current, current->array);
+ current->sleep_avg = 0;
+ current->prio = effective_prio(current);
+ enqueue_task(current, rq->active);
+ ret = 0;
+ }
+
+ task_rq_unlock(rq, &flags);
+
+ return ret;
+}
+
asmlinkage long sys_sched_yield(void)
{
runqueue_t *rq;
@@ -1376,6 +1434,7 @@

return 0;
}
+
asmlinkage long sys_sched_get_priority_max(int policy)
{
int ret = -EINVAL;


2002-06-04 17:32:22

by Simon Trimmer

Subject: Re: [PATCH] scheduler hints

On 4 Jun 2002, Robert Love wrote:
> I tried to find a decent paper on the web covering scheduler hints
> (sometimes referred to as hint-based scheduling) but could not find
> anything worthwhile. Solaris, for example, implements scheduler hints
> so perhaps the "Solaris Internals" book has some information.

Hi Robert,
This isn't my thing but my flatmate had left a copy of solaris internals on
the table ;)

This is briefly mentioned around page 384 and appears to be targeted
at userspace processes for exactly the cases you're suggesting (holding
global resources).

A good entry point into the sun online documentation for this stuff is
schedctl_init() -
http://docs.sun.com/db?q=schedctl_init&p=/doc/816-0216/6m6ngupm0&a=view

-Simon

Simon Trimmer <[email protected]> VERITAS R&D Watford, UK



2002-06-04 18:07:20

by Robert Love

Subject: Re: [PATCH] scheduler hints

On Tue, 2002-06-04 at 10:38, Simon Trimmer wrote:

> Hi Robert,
> This isn't my thing but my flatmate had left a copy of solaris internals on
> the table ;)
>
> This is briefly mentioned around about page 384 and appears to be targetted
> at userspace processes for exactly the cases you're suggesting (holding
> global resources).

I knew I read it there ;) My copy of "Solaris Internals" is elsewhere so
I could not confirm.

> A good entry point into the sun online documentation for this stuff is
> schedctl_init() -
> http://docs.sun.com/db?q=schedctl_init&p=/doc/816-0216/6m6ngupm0&a=view

Hm, what they export is a bit different. I wonder what the internal
kernel interface is like (i.e. how close to sched_hint it is)?

Since they have a start_hint and stop_hint, that is where they are able
to enforce their fairness. When you call stop, I suspect they penalize
your timeslice by some amount similar to the duration from start to
stop. If you don't call stop before you reschedule, then you probably
forfeit a large chunk of your timeslice.

This would be doable with our scheduler - and perhaps even with minimal
impact (which is my goal). However, since I wrote this more as an
exercise in fun than something to merge, I do not know if it is worth it
to make a whole infrastructure around this. Those who really see
benefit (scientific computing or real-time or whatever) could just grab
the patch, remove the permission check, and code their applications to
fit -- they trust their application base.

Anyhow, to pique interest, here are some benchmark numbers. I have 5
pthreads contending for a single semaphore. Each iteration they do some
busy looping, down the semaphore, busy loop some more, and then up the
semaphore. Thus they use a lot of their timeslice and spend the rest of
the time blocking on the semaphore. I let them loop a fixed number of
times before exiting.
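
The test program itself is not included here, but its structure is
roughly the following (a sketch only: the loop counts and the spin()
helper are made up, not the actual benchmark source):

#include <pthread.h>
#include <semaphore.h>

#define NR_THREADS	5
#define NR_LOOPS	1000		/* fixed iteration count (made up) */

static sem_t sem;

static void spin(long n)
{
	volatile long i;

	for (i = 0; i < n; i++)
		;			/* burn some timeslice */
}

static void *worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NR_LOOPS; i++) {
		spin(100000);		/* busy loop outside the lock */
		sem_wait(&sem);		/* down the semaphore */
		/* the hinted run calls sched_hint(HINT_TIME) here */
		spin(100000);		/* busy loop while holding it */
		sem_post(&sem);		/* up the semaphore */
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	sem_init(&sem, 0, 1);
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}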

(These are average of ~10 runs)

With a call to sched_hint(HINT_TIME) after successfully downing the
semaphore the avg total duration is 7233459 us. Without the sched_hint,
the avg total duration is 7683220 us.

That is an improvement of 6% - with only 5 threads.

A quick glance shows a reduction in context switches, but what really
matters is whether we avoid entering schedule() only to (a) reschedule
the same task or (b) run another thread that quickly blocks on the
semaphore.

It is all academic anyhow...

Robert Love

2002-06-05 08:11:11

by Helge Hafting

Subject: Re: [PATCH] scheduler hints

Robert Love wrote:
[...]
> Basically, scheduler hints are a way for a program to give a "hint" to
> the scheduler about its present behavior in the hopes of the scheduler
> subsequently making better scheduling decisions. After all, who knows
> better than the application what it is about to do?
>
> For example, consider a group of SCHED_RR threads that share a
> semaphore. Before one of the threads were to acquire the semaphore, it
> could give a "hint" to the scheduler to increase its remaining timeslice
> in order to ensure it could complete its work and drop the semaphore
> before being preempted. Since, if it were preempted, it would just end
> up being rescheduled as the other real-time threads would eventually
> block on the held semaphore.
>
Seems to me this particular case is covered by increasing
priority when grabbing the semaphore and normalizing
priority when releasing.

Only root can do that - but only root does real-time
anyway. And I guess only root should be able to increase
its timeslice too...

> Other hints could be "I am interactive"
Already shows up as a thread that always ends its timeslice
blocking for I/O. Such threads do get a priority
boost for the next timeslice.

> or "I am a batch (i.e. cpu hog)
shows up as a thread that spends its entire timeslice - these
don't get the above-mentioned boost, as it is assumed it gets
"enough cpu" while the interactive threads block.

> task" or "I am cache hot: try to keep me on this CPU".
Perhaps that might be useful.

> The scheduler tries hard to figure out the three qualities and it is
> usually right, although perhaps it can react quicker to these hints than
> it can figure things out on its own. If nothing else, this serves as a
> useful tool for determining just how accurate our O(1) scheduler is.

Well, hog/interactive is determined in one timeslice already...

The problem is that this may be abused. Someone nasty could
write a cpu hog that drops a lot of hints about being
interactive, starving real interactive programs.

Generally, it degenerates into application programmers
using _all_ the hints to get performance, so they
can beat some competitor in benchmarks. And all
other programs just get penalized.

Helge Hafting

2002-06-05 10:23:54

by Dave Jones

Subject: Re: [PATCH] scheduler hints

On Wed, Jun 05, 2002 at 10:11:02AM +0200, Helge Hafting wrote:
> The problem is that this may be abused. Someone nasty could
> write a cpu hog that drops a lot of hints about being
> interactive, starving real interactive programs.
>
> Generally, it degenerates into application programmers
> using _all_ the hints to get performance, so they
> can beat some competitor in benchmarks. And all
> other programs just get penalized.

I can see how that would be a problem.
If I didn't have the source to $programs.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-06-05 16:17:30

by Robert Love

Subject: Re: [PATCH] scheduler hints

On Wed, 2002-06-05 at 01:11, Helge Hafting wrote:

> Seems to me this particular case is covered by increasing
> priority when grabbing the semaphore and normalizing
> priority when releasing.
>
> Only root can do that - but only root does real-time
> anyway. And I guess only root should be able to increase
> its timeslice too...

Increasing its priority has no bearing on whether it runs out of
timeslice, however. The idea here is to help the task complete its
critical section (and thus not block other tasks) before being
preempted. The only way to achieve that is to boost its timeslice.

Boosting its priority will assure there is no priority inversion and
that, eventually, the task will run - but it does nothing to avoid the
nasty "grab resource, be preempted, reschedule a bunch, finally find
yourself running again since everyone else blocked" issue.

And I don't think only root should be able to do this. If we later
punish the task (take back the timeslice we gave it) then this is fair.

> > Other hints could be "I am interactive"
>
> Already shows up as a thread who always ends its timeslice
> blocking for io. Such threads do get an priority
> boost for the next timeslice.
>
> > or "I am a batch (i.e. cpu hog)
>
> shows up as a thread who spends its entire timeslice - these
> don't get the above mentioned boost as it is assumed it gets
> "enough cpu" while the interactive threads blocks.
>
> Well, hog/interactive is determined in one timeslice already...

In the O(1) scheduler it is determined based on a running sleep average,
not timeslice used (this is effectively the same thing - although we
turn it into a heuristic so it is more accurate over time).

The problem is it takes time to figure these out: one whole schedule of
the app to determine anything, and then a series of schedules to perfect
it. My idea here was to let the app tell the system what it is, giving
the system a head start. The scheduler will slowly readjust whatever it
is told, based on the task's behavior, anyhow.

Giving a hint at the start of an interactive task, for example, skips
the second or two of low priority where the task is not receiving its
full boost.

> The problem is that this may be abused. Someone nasty could
> write a cpu hog that drops a lot of hints about being
> interactive, starving real interactive programs.

Agreed. The code does require CAP_SYS_NICE and the comments explain the
issue... One thing worth saying is I don't think this is as useful as
the HINT_TIME hint anyhow.

> Generally, it degenerates into application programmers
> using _all_ the hints to get performance, so they
> can beat some competitor in benchmarks. And all
> other programs just get penalized.

Well they can already nice themselves or make themselves real-time, so
we have to trust them in numerous ways already not to cheat.

Robert Love

2002-06-06 00:46:29

by Rick Bressler

Subject: Re: [PATCH] scheduler hints

> So I went ahead and implemented scheduler hints on top of the O(1)
> scheduler.

> Other hints could be "I am interactive" or "I am a batch (i.e. cpu hog)
> task" or "I am cache hot: try to keep me on this CPU".

Sequent had an interesting hint they cooked up with Oracle. (Or maybe it
was the other way around.) As I recall they called it 'twotask.'
Essentially Oracle client processes spend a lot of time exchanging
information with their server process. It usually makes sense to bind them
to the same CPU in an SMP (and especially NUMA) machine. (Probably
obvious to most of the folks on the group, but it is generally lots
better to essentially communicate through the cache and local memory
than across the NUMA bus.)

As I recall it made a significant difference in Oracle performance, and
would probably also translate to similar performance in many situations
where you had a client and server process doing lots of interaction in
an SMP environment.

Don't know if there is enough application to warrant it, but you asked.
:-)

--
+--------------------------------------------+ Rick Bressler
|Mushrooms and other fungi have several |
|important roles in nature. They help things|
|grow, they are a source of food, they |
|decompose organic matter and they |
|infect, debilitate and kill organisms. | Linux: Because a PC is a
+--------------------------------------------+ terrible thing to waste.

2002-06-06 00:53:52

by Robert Love

Subject: Re: [PATCH] scheduler hints

On Wed, 2002-06-05 at 17:46, Rick Bressler wrote:

> Sequent had an interesting hint they cooked up with Oracle. (Or maybe it
> was the other way around.) As I recall they called it 'twotask.'
> Essentially Oracle clients processes spend a lot of time exchanging
> information with its server process. It usually makes sense to bind them
> to the same CPU in an SMP (and especially NUMA) machine. (Probably
> obvious to most of the folks on the group, but it is generally lots
> better to essentially communicate through the cache and local memory
> than across the NUMA bus.)

This is similar in theory to why we used to have the sync option on
wake_up for pipes... it does work.

We don't need a scheduler "hint" for this, though. A big loud command
"bind me to this processor!" would do fine, and in 2.5 we have that:

just have one of the tasks do:

unsigned long mask = 2; /* CPU 1 */

sched_setaffinity(0, sizeof(mask), &mask);
sched_setaffinity(other_guys_pid, sizeof(mask), &mask);

and both will be affined to CPU 1.
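
(glibc most likely does not wrap this syscall yet, so sched_setaffinity()
above would itself be a thin wrapper around syscall(2) - a minimal
sketch, with the i386 syscall number taken from the unistd.h hunk in the
original patch:)

#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

#ifndef __NR_sched_setaffinity
#define __NR_sched_setaffinity 241	/* i386 */
#endif

static int sched_setaffinity(pid_t pid, unsigned int len,
			     unsigned long *mask)
{
	return syscall(__NR_sched_setaffinity, pid, len, mask);
}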

Robert Love

2002-06-06 01:06:39

by Gerrit Huizenga

Subject: Re: [PATCH] scheduler hints

In message <[email protected]>, Rick Bressler writes:
> > So I went ahead and implemented scheduler hints on top of the O(1)
> > scheduler.
>
> > Other hints could be "I am interactive" or "I am a batch (i.e. cpu hog)
> > task" or "I am cache hot: try to keep me on this CPU".
>
> Sequent had an interesting hint they cooked up with Oracle. (Or maybe it
> was the other way around.) As I recall they called it 'twotask.'
> Essentially Oracle clients processes spend a lot of time exchanging
> information with its server process. It usually makes sense to bind them
> to the same CPU in an SMP (and especially NUMA) machine. (Probably
> obvious to most of the folks on the group, but it is generally lots
> better to essentially communicate through the cache and local memory
> than across the NUMA bus.)

Actually, process-to-process affinity, which was later generalized
as a process gang affinity.

> As I recall it made a significant difference in Oracle performance, and
> would probably also translate to similar performance in many situations
> where you had a client and server process doing lots of interaction in
> an SMP environment.

Yep. Must be used with care, but not terribly damaging for general
access. Typically arranged as a many to one linkage by the callers,
which simplified the rebalancing decisions a bit. I think there
was a paper written about it somewhere by Phil Krueger.

gerrit

2002-06-06 01:11:50

by Robert Love

Subject: Re: [PATCH] scheduler hints

On Wed, 2002-06-05 at 18:05, Gerrit Huizenga wrote:

> Actually, process-to-process affinity, which was later generalized
> as a process gang affinity.

Oh OK, gang affinity - a bit different and not what we do now :)

Interesting to look into, although I suspect it is not terribly useful
when weighed against its implementation...

Robert Love

2002-06-06 01:14:49

by Rick Bressler

Subject: Re: [PATCH] scheduler hints

> We don't need a scheduler "hint" for this, though. A big loud command
> "bind me to this processor!" would do fine, and in 2.5 we have that:
>
> just have one of the tasks do:
>
> unsigned long mask = 2; /* CPU 1 */
>
> sched_setaffinity(0, sizeof(mask), &mask);
> sched_setaffinity(other_guys_pid, sizeof(mask), &mask);
>
> and both will be affined to CPU 1.

I think that in some ways they were trying to simplify the code. It is
a bit more complicated to do well from user space. You're talking
dozens to thousands of process pairs, and maybe dozens of CPUs. I
think the idea was that the scheduler has a better idea of which CPUs
are least busy, where to put the processes, and indeed should migrate
tasks as necessary. It just does it in pairs. Keep 'em together is the
idea, rather than keep them in any one specific place, thus the hint.

I note that Gerrit replied also, and as I recall he is one of those
ex-Sequent guys who really knows this stuff, so I'll bow out in favor of
the experts. :-)

--
+--------------------------------------------+ Rick Bressler
|Mushrooms and other fungi have several | G-4781 (425)342-1554
|important roles in nature. They help things| Pager 1-800-946-4646
|grow, they are a source of food, they | Pin: 1700898
|decompose organic matter and they | [email protected]
|infect, debilitate and kill organisms. | Linux: Because a PC is a
+--------------------------------------------+ terrible thing to waste.

2002-06-06 01:20:41

by Gerrit Huizenga

Subject: Re: [PATCH] scheduler hints

In message <1023325903.912.390.camel@sinai>, Robert Love writes:
> On Wed, 2002-06-05 at 18:05, Gerrit Huizenga wrote:
>
> > Actually, process-to-process affinity, which was later generalized
> > as a process gang affinity.
>
> Oh OK, gang affinity - a bit different and not what we do now :)
>
> Interesting to look into, although not terribly useful I suspect weighed
> against its implementation...
>
> Robert Love

Our scheduler *was* a long set of conditionals. However, from the
stock BSD scheduler through the contorted thing that it became, we
saw something like 30-50% increases in some workloads. I think
the motivator was actually not Oracle originally but something like
SAP. Specific numbers are hard to extract now since we did so many
SMP & NUMA changes over the years, but I think I remember a slide
showing over 30% increase in SAP for this one additional feature.
I don't know that I ever saw specific Oracle or Oracle Apps numbers
for this although it was viewed as a "large" benefit, especially
in NUMA machines, but even on SMP machines.

gerrit

2002-06-07 19:13:19

by Pavel Machek

Subject: Re: [PATCH] scheduler hints

Hi!

> > Seems to me this particular case is covered by increasing
> > priority when grabbing the semaphore and normalizing
> > priority when releasing.
> >
> > Only root can do that - but only root does real-time
> > anyway. And I guess only root should be able to increase
> > its timeslice too...
>
> Increasing its priority has no bearing on whether it runs out of
> timeslice, however. The idea here is to help the task complete its
> critical section (and thus not block other tasks) before being
> preempted. Only way to achieve that is boost its timeslice.
>
> Boosting its priority will assure there is no priority inversion and
> that, eventually, the task will run - but it does nothing to avoid the
> nasty "grab resource, be preempted, reschedule a bunch, finally find
> yourself running again since everyone else blocked" issue.
>
> And I don't think only root should be able to do this. If we later
> punish the task (take back the timeslice we gave it) then this is
> fair.

Another possibility might be to allow it to *steal* time from other
processes... Of course, only from processes of the same UID ;-).
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-06-10 21:10:43

by Bill Davidsen

Subject: Re: [PATCH] scheduler hints

On Wed, 5 Jun 2002, Rick Bressler wrote:

> > So I went ahead and implemented scheduler hints on top of the O(1)
> > scheduler.
>
> > Other hints could be "I am interactive" or "I am a batch (i.e. cpu hog)
> > task" or "I am cache hot: try to keep me on this CPU".
>
> Sequent had an interesting hint they cooked up with Oracle. (Or maybe it
> was the other way around.) As I recall they called it 'twotask.'
> Essentially Oracle clients processes spend a lot of time exchanging
> information with its server process. It usually makes sense to bind them
> to the same CPU in an SMP (and especially NUMA) machine. (Probably
> obvious to most of the folks on the group, but it is generally lots
> better to essentially communicate through the cache and local memory
> than across the NUMA bus.)

Are you really saying that you think serializing all the clients through a
single processor will gain more than you lose by not using all the
other CPUs for clients?

> As I recall it made a significant difference in Oracle performance, and
> would probably also translate to similar performance in many situations
> where you had a client and server process doing lots of interaction in
> an SMP environment.

I've certainly seen a "significant difference" between uni and SMP, but it
was always in the other direction. Is this particular to some hardware, or
running multiple servers somehow? I'm only familiar with Linux, AIX and
Solaris, maybe this is Sequent magic? Or were you talking about having
only one client total on the machine and just making that run fast?

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-06-10 22:30:00

by Gerrit Huizenga

Subject: Re: [PATCH] scheduler hints

In message <[email protected]>, Bill Davidsen writes:
> On Wed, 5 Jun 2002, Rick Bressler wrote:
>
> > > So I went ahead and implemented scheduler hints on top of the O(1)
> > > scheduler.
> >
> > > Other hints could be "I am interactive" or "I am a batch (i.e. cpu hog)
> > > task" or "I am cache hot: try to keep me on this CPU".
> >
> > Sequent had an interesting hint they cooked up with Oracle. (Or maybe it
> > was the other way around.) As I recall they called it 'twotask.'
> > Essentially Oracle clients processes spend a lot of time exchanging
> > information with its server process. It usually makes sense to bind them
> > to the same CPU in an SMP (and especially NUMA) machine. (Probably
> > obvious to most of the folks on the group, but it is generally lots
> > better to essentially communicate through the cache and local memory
> > than across the NUMA bus.)
>
> Are you really saying that you think serializing all the clients through a
> single processor will gain more than you lose by not using all the
> other CPUs for clients?

When the number of runnable processes exceeds the number of CPUs
in an SMP system, and subsets of the runnable processes share data
(pipes, sockets, shared memory, etc.), minimizing the cache invalidate
effects of the subset by scheduling them on the same CPU (with some
level of cache-affinity, or stickiness) can increase throughput
dramatically. Oracle Apps and BAAN are two application sets that have
this kind of behavior which benefited from having these subsets of
related processes "tied at the wrists" when they were scheduled.

Figure 1000 runnable processes, in sets of two or even sets of ten.
The subsets should be scheduled together if possible, but still
smeared across all processors. No one is suggesting that a UP machine
will always outperform an SMP machine. ;-)

As was mentioned in another aspect of this thread, this is different
from explicit user-specified CPU affinity in that there are more processes
than a user wants to allocate explicitly to CPUs. Instead, the scheduler
can do load balancing by moving/migrating sets of processes from an
overloaded CPU to a less loaded CPU. However, the cache effects can
make a difference of something like 20-50% of overall throughput in
a fairly intensive data sharing workload like this.

> > As I recall it made a significant difference in Oracle performance, and
> > would probably also translate to similar performance in many situations
> > where you had a client and server process doing lots of interaction in
> > an SMP environment.
>
> I've certainly seen a "significant difference" between uni and SMP, but it
> was always in the other direction. Is this particular to some hardware, or
> running multiple servers somehow? I'm only familiar with Linux, AIX and
> Solaris, maybe this is Sequent magic? Or were you talking about having
> only one client total on the machine and just making that run fast?

This is an SMP thing, which also benefits NUMA pretty dramatically.
And this is about how processes are scheduled, and how hints can
be provided to the scheduler. It also relates to the overhead of
cache invalidation, the size of CPU caches, etc. Sequent's hardware
might have seen a bigger improvement from this type of change than
other types of hardware might. Or vice versa.

gerrit

2002-06-12 19:25:17

by Ingo Oeser

Subject: Re: [PATCH] scheduler hints

On Fri, Jun 07, 2002 at 01:32:31PM +0200, Pavel Machek wrote:
> > Boosting its priority will assure there is no priority inversion and
> > that, eventually, the task will run - but it does nothing to avoid the
> > nasty "grab resource, be preempted, reschedule a bunch, finally find
> > yourself running again since everyone else blocked" issue.
> >
> > And I don't think only root should be able to do this. If we later
> > punish the task (take back the timeslice we gave it) then this is
> > fair.
>
> Another possibility might be to allow it to *steal* time from another
> processes... Of course only processes of same UID ;-).
> Pavel

Good idea!

And I would say SID instead of UID, and give up if no task in the
same SID is runnable.

One could provide different policies here, which the user can
choose/combine.

That way we are at least not unfair to other users on our remote
machine.

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2002-06-12 19:39:49

by Robert Love

Subject: Re: [PATCH] scheduler hints

On Wed, 2002-06-12 at 11:37, Ingo Oeser wrote:

> Good idea!
>
> And I would say SID instead of UID and give up, if no task in the
> same SID is runnable.
>
> One could provide different policies here, which the user can
> choose/combine.
>
> That way we aren't at least unfair to other users on our remote
> machine.

The solution I am working on now is to just take the timeslice away
later. I.e., extra timeslice is more beneficial to me now than later, so
give me some of my future timeslice now. This is per-process.

It is enforced right now only by a call to sched_hint() with a hint
saying "I am done". The timeslice used is calculated, your timeslice
is adjusted, and if applicable you are removed from the active runqueue.

Next I need to add a check in schedule() for processes that are
scheduling off without having explicitly given the "I am done" hint. I
am wary of what to do here as I do not want to adversely affect the
fastpath.
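
For illustration only, the accounting described above might look
something like the following fragment in the style of sys_sched_hint()
from the posted patch. HINT_TIME_DONE and the ->borrowed_time field are
invented names for this sketch; none of this is in the patch:

	if (hint & HINT_TIME) {
		/* remember how much extra time we are handing out */
		current->borrowed_time += MAX_TIMESLICE - current->time_slice;
		current->time_slice = MAX_TIMESLICE;
		ret = 0;
	}

	if (hint & HINT_TIME_DONE) {
		if (current->time_slice > current->borrowed_time) {
			/* pay back the timeslice borrowed earlier */
			current->time_slice -= current->borrowed_time;
		} else {
			/* nothing left to pay with: expire the task now */
			current->time_slice = 1;
			dequeue_task(current, current->array);
			enqueue_task(current, rq->expired);
		}
		current->borrowed_time = 0;
		ret = 0;
	}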

That is what I am playing with now, anyhow... but I have been a bit busy
of late and not put enough cycles into it.

Robert Love