2002-06-15 22:15:35

by Emmanuel Mogenet

Subject: Question about sched_yield()


Hello,

I am seeing some strange Linux scheduler behaviours,
and I thought this'd be the best place to ask.

I have two processes: one loops forever and does
nothing but call sched_yield(), and the other
benchmarks how fast it can compute a long math
calculation.

When I run the math simulator with nothing else running,
I get the following:

lenslark # ./loop
1.86543 MLoops per sec, sum=0.291969
1.84847 MLoops per sec, sum=1.65587
1.84064 MLoops per sec, sum=1.74601

When I run one instance of the math simulator and
one instance of the sched_yielder, I see the behaviour
I was expecting (and want): all CPU is allocated to the
process that does some actual work:

lenslark # ./yield &
[1] 26316
lenslark #
lenslark # ./loop
1.86502 MLoops per sec, sum=0.291969
1.84868 MLoops per sec, sum=1.65587
1.84035 MLoops per sec, sum=1.74601

and top confirms what I'm seeing: loop gets 99% of
the CPU, and yield gets close to none.

However, when I have two instances of the
sched_yielder running, things start to behave strangely:

lenslark # ./yield &
[1] 26350
lenslark # ./yield &
[2] 26351
lenslark # ./loop
0.686901 MLoops per sec, sum=0.291969
0.674352 MLoops per sec, sum=1.65587
0.665542 MLoops per sec, sum=1.74601
0.674117 MLoops per sec, sum=0.407357

and sure enough, top is showing that for
some reason the sched_yielders each get
about 1/3rd of the CPU:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
26364 root      18   0   364  364   304 R    36.0  0.3   0:04 loop
26351 root      12   0   252  252   204 R    32.2  0.2   0:31 yield
26350 root      14   0   252  252   204 R    31.2  0.2   0:33 yield
26365 root      11   0  1024 1024   812 R     0.3  1.0   0:00 top


Does anyone have a clue as to why this happens?
Is this the expected behaviour given the current
scheduler algorithm (of which I know nothing :))?

For comparison purposes, I ran the same test on
Win2k+MingWin32, and things behave the way I'd expect
there: no matter how many sched_yielders are running,
loop gets almost all of the CPU, and barely slows down.

I am running on an AMD-K6@400MHz, on the following kernel:

lenslark # cat /proc/version
Linux version 2.4.18 ([email protected]) (gcc version 2.96 20000731 (Mandrake Linux 8.2 2.96-0.76mdk)) #1 Sun Apr 28 15:25:19 PDT 2002

This kernel was compiled with SMP turned off.

If anyone can explain to me what's going on, or even better
how to replicate the behaviour I'm seeing on Win2k, I'd
be very happy. Also, I'm not a subscriber to linux-kernel,
so I'd appreciate a CC.

I've included the source to the two small programs below:

================================= YIELD.CPP

#ifdef linux
#include <sched.h>
/* Burn the CPU doing nothing but yielding it back to the scheduler. */
int main()
{
    while (1)
    {
        sched_yield();
    }
}
#endif


#ifdef WIN32
#include <windows.h>
/* Win32 equivalent: Sleep(0) yields the rest of the timeslice. */
int main()
{
    while (1)
    {
        Sleep(0);
    }
}
#endif

================================= LOOP.CPP
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

#ifdef WIN32
#include <windows.h>
#endif

typedef unsigned long long uint64_t;

/* Wall-clock time in microseconds. */
uint64_t usecs()
{
#ifdef linux
    struct timeval t;
    gettimeofday(&t, 0);

    uint64_t result;
    result = t.tv_sec;
    result *= 1000000;
    result += t.tv_usec;
    return result;
#endif

#ifdef WIN32
    return 1000 * (uint64_t)GetTickCount();
#endif
}

int main()
{
    uint64_t t = 0;
    double sum = 0.0;
    uint64_t lastTime = 0;
    uint64_t lastLoops = 0;
    while (t < 100000000)
    {
        sum += sin((double)t);
        if ((t % 4000000) == 0)
        {
            uint64_t nowTime = usecs();
            uint64_t nowLoops = t;

            if (lastTime != 0)
            {
                double loops = (nowLoops - lastLoops);
                double elapsed = (nowTime - lastTime) / 1000000.0;
                fprintf(
                    stderr,
                    "%g MLoops per sec, sum=%g\n",
                    (loops / elapsed) / 1000000.0,
                    sum
                );
            }

            lastTime = nowTime;
            lastLoops = nowLoops;
        }
        ++t;
    }
    fprintf(stderr, "Sum=%g\n", sum);
    return 0;
}
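
(For reference, these presumably build with something like
"g++ -O2 -o yield yield.cpp" and "g++ -O2 -o loop loop.cpp -lm";
the exact commands are an assumption, the post doesn't give them.)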



2002-06-16 14:46:16

by Ingo Molnar

Subject: [patch] Re: Question about sched_yield()


On Sat, 15 Jun 2002 [email protected] wrote:

> I am seeing some strange linux scheduler behaviours, and I thought
> this'd be the best place to ask.
>
> I have two processes, one that loops forever and does nothing but
> calling sched_yield(), and the other is basically benchmarking how fast
> it can compute some long math calculation.

sched_yield() is indeed misbehaving both in 2.4 and 2.5. (I think you are
using some variant of 2.4.18, does that kernel have the O(1) scheduler
patches applied?) In fact in 2.5 it's behaving slightly worse than in
vanilla 2.4.18.

the O(1)-scheduler's sched_yield() implementation does the following to
'give up' the CPU:

- it decreases its priority by 1 until it reaches the lowest level
- it queues the task to the end of the priority queue

this scheme works fine in most cases, but if sched_yield()-active tasks
are mixed with CPU-using processes then it's quite likely that the
CPU-using process is in the expired array. In that case the yield()-ing
process only requeues itself in the active array - a true context-switch
to the expired process will only occur once the timeslice of the
yield()-ing process has expired: in ~150 msecs. This leads the
yield()-ing and CPU-using processes to use up roughly the same amount of
CPU time, which is arguably deficient.

i've fixed this problem by extending sched_yield() the following way:

+ * There are three levels of how a yielding task will give up
+ * the current CPU:
+ *
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
+ */

the attached patch against vanilla 2.5.21 also includes this sched_yield()
improvement, and in my tests it now behaves well. Could you give it a go,
does it now behave for your workload as well?

Ingo

diff -rNu linux-2.5.21-final/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.21-final/arch/i386/kernel/entry.S Sun Jun 9 07:28:19 2002
+++ linux/arch/i386/kernel/entry.S Sun Jun 16 16:35:52 2002
@@ -193,6 +193,7 @@

ENTRY(ret_from_fork)
#if CONFIG_SMP || CONFIG_PREEMPT
+ # NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
#endif
GET_THREAD_INFO(%ebx)
diff -rNu linux-2.5.21-final/fs/pipe.c linux/fs/pipe.c
--- linux-2.5.21-final/fs/pipe.c Sun Jun 9 07:26:29 2002
+++ linux/fs/pipe.c Sun Jun 16 16:35:36 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -rNu linux-2.5.21-final/include/asm-i386/system.h linux/include/asm-i386/system.h
--- linux-2.5.21-final/include/asm-i386/system.h Sun Jun 9 07:26:29 2002
+++ linux/include/asm-i386/system.h Sun Jun 16 16:35:52 2002
@@ -11,9 +11,12 @@
struct task_struct; /* one of the stranger aspects of C forward declarations.. */
extern void FASTCALL(__switch_to(struct task_struct *prev, struct task_struct *next));

-#define prepare_to_switch() do { } while(0)
+#define prepare_arch_schedule(prev) do { } while(0)
+#define finish_arch_schedule(prev) do { } while(0)
+#define prepare_arch_switch(rq) do { } while(0)
+#define finish_arch_switch(rq) spin_unlock_irq(&(rq)->lock)

-#define switch_to(prev,next) do { \
+#define switch_to(prev,next,last) do { \
asm volatile("pushl %%esi\n\t" \
"pushl %%edi\n\t" \
"pushl %%ebp\n\t" \
diff -rNu linux-2.5.21-final/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.21-final/include/linux/sched.h Sun Jun 9 07:27:21 2002
+++ linux/include/linux/sched.h Sun Jun 16 16:35:36 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -rNu linux-2.5.21-final/include/linux/spinlock.h linux/include/linux/spinlock.h
--- linux-2.5.21-final/include/linux/spinlock.h Sun Jun 9 07:30:01 2002
+++ linux/include/linux/spinlock.h Sun Jun 16 16:35:46 2002
@@ -26,6 +26,7 @@
#define write_lock_bh(lock) do { local_bh_disable(); write_lock(lock); } while (0)

#define spin_unlock_irqrestore(lock, flags) do { spin_unlock(lock); local_irq_restore(flags); } while (0)
+#define _raw_spin_unlock_irqrestore(lock, flags) do { _raw_spin_unlock(lock); local_irq_restore(flags); } while (0)
#define spin_unlock_irq(lock) do { spin_unlock(lock); local_irq_enable(); } while (0)
#define spin_unlock_bh(lock) do { spin_unlock(lock); local_bh_enable(); } while (0)

@@ -183,6 +184,12 @@
preempt_schedule(); \
} while (0)

+#define preempt_check_resched() \
+do { \
+ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ preempt_schedule(); \
+} while (0)
+
#define spin_lock(lock) \
do { \
preempt_disable(); \
@@ -197,6 +204,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -206,20 +219,22 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
+#define preempt_get_count() (0)
+#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
+#define preempt_enable() do { } while (0)
+#define preempt_check_resched() do { } while (0)

-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
-
-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -rNu linux-2.5.21-final/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.21-final/kernel/ksyms.c Sun Jun 9 07:26:33 2002
+++ linux/kernel/ksyms.c Sun Jun 16 16:35:36 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -rNu linux-2.5.21-final/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21-final/kernel/sched.c Sun Jun 9 07:28:13 2002
+++ linux/kernel/sched.c Sun Jun 16 16:36:00 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,21 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
goto repeat_lock_task;
}
return rq;
@@ -172,7 +175,23 @@
static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -284,9 +303,15 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ if (unlikely(rq->curr == p)) {
cpu_relax();
- barrier();
+ /*
+ * enable/disable preemption just to make this
+ * a preemption point - we are busy-waiting
+ * anyway.
+ */
+ preempt_enable();
+ goto repeat;
}
rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
@@ -322,40 +347,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +406,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -401,19 +435,18 @@
}

#if CONFIG_SMP || CONFIG_PREEMPT
-asmlinkage void schedule_tail(void)
+asmlinkage void schedule_tail(task_t *prev)
{
- spin_unlock_irq(&this_rq()->frozen);
+ finish_arch_switch(this_rq());
+ finish_arch_schedule(prev);
}
#endif

-static inline void context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

- prepare_to_switch();
-
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
@@ -427,7 +460,9 @@
}

/* Here we just switch the register state and the stack. */
- switch_to(prev, next);
+ switch_to(prev, next, prev);
+
+ return prev;
}

unsigned long nr_running(void)
@@ -773,6 +808,7 @@
rq = this_rq();

release_kernel_lock(prev, smp_processor_id());
+ prepare_arch_schedule(prev);
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);

@@ -828,28 +864,20 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
- context_switch(prev, next);
-
- /*
- * The runqueue pointer might be from another CPU
- * if the new task was last running on a different
- * CPU - thus re-load it.
- */
- mb();
+
+ prepare_arch_switch(rq);
+ prev = context_switch(prev, next);
+ barrier();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
+ finish_arch_switch(rq);
+ } else
spin_unlock_irq(&rq->lock);
- }
+ finish_arch_schedule(prev);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
if (test_thread_flag(TIF_NEED_RESCHED))
goto need_resched;
- return;
}

#ifdef CONFIG_PREEMPT
@@ -880,7 +908,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +919,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +933,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1339,27 +1386,34 @@

asmlinkage long sys_sched_yield(void)
{
- runqueue_t *rq;
- prio_array_t *array;
-
- preempt_disable();
- rq = this_rq();
+ runqueue_t *rq = rq_lock(rq);
+ prio_array_t *array = current->array;

/*
- * Decrease the yielding task's priority by one, to avoid
- * livelocks. This priority loss is temporary, it's recovered
- * once the current timeslice expires.
+ * There are three levels of how a yielding task will give up
+ * the current CPU:
*
- * If priority is already MAX_PRIO-1 then we still
- * roundrobin the task within the runlist.
- */
- spin_lock_irq(&rq->lock);
- array = current->array;
- /*
- * If the task has reached maximum priority (or is a RT task)
- * then just requeue the task to the end of the runqueue:
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
*/
- if (likely(current->prio == MAX_PRIO-1 || rt_task(current))) {
+ if (likely(current->prio == MAX_PRIO-1)) {
+ if (current->time_slice <= 1) {
+ dequeue_task(current, rq->active);
+ enqueue_task(current, rq->expired);
+ } else
+ current->time_slice--;
+ } else if (unlikely(rt_task(current))) {
list_del(&current->run_list);
list_add_tail(&current->run_list, array->queue + current->prio);
} else {
@@ -1370,8 +1424,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1652,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
@@ -1687,7 +1739,15 @@
task_rq_unlock(rq, &flags);
goto out;
}
-
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && (p != rq->curr)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
init_MUTEX_LOCKED(&req.sem);
req.task = p;
list_add(&req.list, &rq->migration_queue);

2002-06-18 00:46:31

by David Schwartz

Subject: Re: Question about sched_yield()


On Sat, 15 Jun 2002 15:15:32 -0700, [email protected] wrote:
>
>Hello,
>
>I am seeing some strange linux scheduler behaviours,
>and I thought this'd be the best place to ask.
>
>I have two processes, one that loops forever and
>does nothing but calling sched_yield(), and the other
>is basically benchmarking how fast it can compute
>some long math calculation.
[snip]

You seem to have a misconception about what sched_yield is for. It is not a
replacement for blocking or a scheduling priority adjustment. It simply lets
other ready-to-run tasks be scheduled before returning to the current task.

Here's a quote from SuS3:

"The sched_yield() function shall force the running thread to relinquish the
processor until it again becomes the head of its thread list. It takes no
arguments."

This neither says nor implies anything about CPU usage. It simply says that
the current thread will yield and be put at the end of the list.

DS


2002-06-18 00:55:06

by Robert Love

Subject: Re: Question about sched_yield()

On Mon, 2002-06-17 at 17:46, David Schwartz wrote:

> You seem to have a misconception about what sched_yield is for.
> It is not a replacement for blocking or a scheduling priority
> adjustment. It simply lets other ready-to-run tasks be scheduled
> before returning to the current task.
>
> Here's a quote from SuS3:
>
> "The sched_yield() function shall force the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes no
> arguments."
>
> This neither says nor implies anything about CPU usage. It simply says
> that the current thread will yield and be put at the end of the list.

And you seem to have a misconception about sched_yield, too. If a
machine has n tasks, half of which are doing CPU-intense work and the
other half of which are just yielding... why on Earth would the yielding
tasks get any noticeable amount of CPU use?

Seems to me the behavior of sched_yield is a bit broken. If the tasks
are correctly returned to the end of their runqueue, this should not
happen. Note, for example, you will not see this behavior in 2.5.

Quite frankly, even if the supposed standard says nothing of this... I
do not care: calling sched_yield in a loop should not show up as a CPU
hog.

Robert Love

2002-06-18 01:41:38

by Emmanuel Mogenet

Subject: RE: Question about sched_yield()



> -----Original Message-----
> From: David Schwartz [mailto:[email protected]]
> Sent: Monday, June 17, 2002 5:46 PM
> To: [email protected]; [email protected]
> Subject: Re: Question about sched_yield()
>
>
>
> On Sat, 15 Jun 2002 15:15:32 -0700, [email protected] wrote:
> >
> >Hello,
> >
> >I am seeing some strange linux scheduler behaviours,
> >and I thought this'd be the best place to ask.
> >
> >I have two processes, one that loops forever and
> >does nothing but calling sched_yield(), and the other
> >is basically benchmarking how fast it can compute
> >some long math calculation.
> [snip]
>
> You seem to have a misconception about what sched_yield is for. It is not a
> replacement for blocking or a scheduling priority adjustment. It simply lets
> other ready-to-run tasks be scheduled before returning to the current task.
>
> Here's a quote from SuS3:
>
> "The sched_yield() function shall force the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes no
> arguments."
>
> This neither says nor implies anything about CPU usage. It simply says that
> the current thread will yield and be put at the end of the list.

If so, please enlighten me as to when, why, and what for you would use sched_yield.

If willingly and knowingly relinquishing a CPU does not make it possible
for other processes to use what would otherwise have been your very own slice
of processing time, then what could it be used for, I really wonder.

Second, I have tried to run my misconception on various other OSes I have
access to: Win2k, Mac OS X and OpenBSD, and surprisingly enough, all of them
seem to share my twisted views of How Things Should Be (tm).

- Mgix


2002-06-18 01:51:20

by Emmanuel Mogenet

Subject: RE: Question about sched_yield()




> Seems to me the behavior of sched_yield is a bit broken. If the tasks
> are correctly returned to the end of their runqueue, this should not
> happen. Note, for example, you will not see this behavior in 2.5.

Actually, it seems to happen in 2.5 too.
However, Ingo sent me a patch for 2.5.21
that fixes the issue.

See this message:

http://marc.theaimsgroup.com/?l=linux-kernel&m=102423901727214&w=2

- Mgix

2002-06-18 03:18:24

by David Schwartz

Subject: Re: Question about sched_yield()


>> This neither says nor implies anything about CPU usage. It simply says
>> that the current thread will yield and be put at the end of the list.

>And you seem to have a misconception about sched_yield, too. If a
>machine has n tasks, half of which are doing CPU-intense work and the
>other half of which are just yielding... why on Earth would the yielding
>tasks get any noticeable amount of CPU use?

Because they're running infinite loops!

>Quite frankly, even if the supposed standard says nothing of this... I
>do not care: calling sched_yield in a loop should not show up as a CPU
>hog.

Calling any function that does not block in an endless loop *should* show up
as a CPU hog. Yielding is not blocking or even lowering your priority.

DS


2002-06-18 03:21:51

by David Schwartz

Subject: RE: Question about sched_yield()


>> This neither says nor implies anything about CPU usage. It simply says
>> that the current thread will yield and be put at the end of the list.
>
>If so, please enlighten me as to when, why, and what for you would use
>sched_yield.

Generally because you can't do something until some other thread/process
does something, so you give it a chance to finish immediately before
falling back to a more expensive approach.
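
A minimal sketch of this pattern (illustrative; pthread_mutex_trylock
stands in for whatever cheap attempt the application makes):

#include <pthread.h>
#include <sched.h>

/* Try cheaply, yield once so the current holder can finish,
 * and only then pay for the expensive path: really blocking. */
int get_lock(pthread_mutex_t *m)
{
    if (pthread_mutex_trylock(m) == 0)
        return 0;                  /* cheap attempt succeeded */
    sched_yield();                 /* let the holder run and finish */
    if (pthread_mutex_trylock(m) == 0)
        return 0;                  /* often succeeds after one yield */
    return pthread_mutex_lock(m);  /* the more expensive way */
}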

>If willingly and knowingly relinquinshing a CPU does not make it possible
>for other processes to use what would otherwise have been your very own
>slice
>of processing time then what could it be used for, I really wonder.

It does! That's what it's for.

>Second, I have tried to run my misconception on various other OS'es I have
>access to:Win2k, Mac OSX and OpenBSD, and suprinsingly enough, all of them
>seem to be sharing my twisted views of How Things Should Be (tm).

Just because they do what you naively expect doesn't validate your
expectations.

DS


2002-06-18 03:52:13

by Emmanuel Mogenet

Subject: RE: Question about sched_yield()


> >If willingly and knowingly relinquinshing a CPU does not make it possible
> >for other processes to use what would otherwise have been your very own
> >slice
> >of processing time then what could it be used for, I really wonder.
>
> It does! That's what it's for.

Good. Now that we agree, back to the original point:

sched_yield does not do the above on Linux as of
today, which was the point of my original posting,
and the reason Ingo posted a scheduler patch.

- Mgix


2002-06-18 04:59:01

by Ingo Molnar

Subject: Re: Question about sched_yield()


On Mon, 17 Jun 2002, David Schwartz wrote:

> >I am seeing some strange linux scheduler behaviours,
> >and I thought this'd be the best place to ask.
> >
> >I have two processes, one that loops forever and
> >does nothing but calling sched_yield(), and the other
> >is basically benchmarking how fast it can compute
> >some long math calculation.

> [...] It is not a replacement for blocking or a scheduling priority
> adjustment. It simply lets other ready-to-run tasks be scheduled before
> returning to the current task.

and this is what the scheduler didn't do properly: it actually didn't
schedule valid ready-to-run processes for long milliseconds, it switched
between two sched_yield processes, starving the CPU-intensive process. I
posted a patch for 2.5 that fixes this, and the 2.4.19-pre10-ac2 backport
I did includes this fix as well:

http://redhat.com/~mingo/O(1)-scheduler/sched-2.4.19-pre10-ac2-A4

a good sched_yield() implementation should give *all* other tasks a chance
to run, instead of switching between multiple sched_yield()-ing tasks. I
don't think this is specified anywhere, but it's a quality-of-implementation
issue.

Ingo

2002-06-18 09:36:47

by David Schwartz

Subject: Re: Question about sched_yield()


>And you seem to have a misconception about sched_yield, too. If a
>machine has n tasks, half of which are doing CPU-intense work and the
>other half of which are just yielding... why on Earth would the yielding
>tasks get any noticeable amount of CPU use?

Because they are not blocking. They are in an endless CPU burning loop. They
should get CPU use for the same reason they should get CPU use if they're the
only threads running. They are always ready-to-run.

>Quite frankly, even if the supposed standard says nothing of this... I
>do not care: calling sched_yield in a loop should not show up as a CPU
>hog.

It has to. What if the only task running is:

while(1) sched_yield();

What would you expect? You cannot use sched_yield as a replacement for
blocking or scheduling priority. You cannot use it to be 'nice'. You can only
use it to try to give other threads a chance to run, usually to give them a
chance to release a resource.

Imagine if a spinlock uses sched_yield in its wait loop; what would you
expect on a dual-CPU system with a process on CPU A holding the lock? You
*want* the sched_yield process to burn the CPU fully, so that it notices
the release of the spinlock as soon as possible.

While Linux's sched_yield behavior is definitely sub-optimal, the particular
criticism being leveled (that sched_yield processes get too much CPU) is
wrong-headed.

DS


2002-06-18 18:51:49

by Rusty Russell

Subject: Re: Question about sched_yield()

On Mon, 17 Jun 2002 17:46:29 -0700
David Schwartz <[email protected]> wrote:
> "The sched_yield() function shall force the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes no
> arguments."

Notice how incredibly useless this definition is. It's even defined in terms
of UP.

Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-06-18 19:14:25

by David Schwartz

Subject: Re: Question about sched_yield()


On Wed, 19 Jun 2002 04:56:06 +1000, Rusty Russell wrote:

>On Mon, 17 Jun 2002 17:46:29 -0700
>David Schwartz <[email protected]> wrote:

>>"The sched_yield() function shall force the running thread to relinquish
>>the
>>processor until it again becomes the head of its thread list. It takes no
>>arguments."

>Notice how incredibly useless this definition is. It's even defined in
>terms of UP.

Huh?! This definition is beautiful in that it makes no such assumptions. How
would you say this is invalid on an SMP machine? By "the processor", they
mean "the processor on which the thread is running" (the only one it could
relinquish, after all).

DS


2002-06-18 20:15:24

by Rusty Russell

Subject: Re: Question about sched_yield()

In message <[email protected]@whenever> you write:
>
> On Wed, 19 Jun 2002 04:56:06 +1000, Rusty Russell wrote:
>
> >On Mon, 17 Jun 2002 17:46:29 -0700
> >David Schwartz <[email protected]> wrote:
>
> >>"The sched_yield() function shall force the running thread to relinquish
> >>the processor until it again becomes the head of its thread list.
> >> It takes no arguments."
>
> >Notice how incredibly useless this definition is. It's even defined in
> >terms of UP.
>
> Huh?! This definition is beautiful in that it makes no such
> assumptions. How would you say this is invalid on an SMP machine? By
> "the processor", they mean "the processor on which the thread is
> running" (the only one it could relinquish, after all).

Read again: they use "relinquish ... until", not "relinquish". Subtle
difference.

I have 32 processors and 32 threads. One does a yield(). What
happens? What should happen?

Given that yield is "sleep for some time but I won't tell you what I'm
doing", I have no sympathy for yield users 8)

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-18 20:41:00

by David Schwartz

Subject: Re: Question about sched_yield()


>>>>"The sched_yield() function shall force the running thread to relinquish
>>>>the processor until it again becomes the head of its thread list.
>>>>It takes no arguments."

>>>Notice how incredibly useless this definition is. It's even defined in
>>>terms of UP.

>>Huh?! This definition is beautiful in that it makes no such
>>assumptions. How would you say this is invalid on an SMP machine? By
>>"the processor", they mean "the processor on which the thread is
>>running" (the only one it could relinquish, after all).

>Read again: they use "relinquish ... until", not "relinquish". Subtle
>difference.

So?

>I have 32 processors and 32 threads. One does a yield(). What
>happens? What should happen?

It should relinquish the processor it is running on until it again becomes
the head of its thread list. (I.e., for as long as it takes the scheduler to
decide that it's the thread to run.)

>Given that yield is "sleep for some time but I won't tell you what I'm
>doing", I have no sympathy for yield users 8)

I have sympathy for those who use it properly, I have no sympathy for those
who loop on sched_yield burning the CPU and then complaining that it burns
CPU.

DS


2002-06-18 20:42:30

by Emmanuel Mogenet

Subject: RE: Question about sched_yield()



> Given that yield is "sleep for some time but I won't tell you what I'm
> doing", I have no sympathy for yield users 8)

Here's something interesting at last.

True, when userland calls sched_yield, the kernel is left in the dark.

However, I am not aware of any alternative to communicate what
I really want to the scheduler, and here's why. If anyone has
ideas on how to do this better, please, I'm all ears.

It's basically about spinlocks and the cost of task switching.

I'm trying to implement "smart" spinlocks.

Regular blocking locks based on some kind of kernel object are:

1. Slow, because every unsuccessful attempt to take the lock will
trigger a task switch. There is no way to have two threads
exchange messages at a very fast rate with a blocking lock.

2. Expensive in terms of kernel resource consumption: you end
up allocating a kernel object each time you need a lock.
If your app's locking is sufficiently fine-grained, you end
up allocating a huge number of kernel objects.

3. To avoid problem 2, you can try to allocate kernel objects
on the fly, but then performance gets even worse: every lock
acquisition/release requires the creation/destruction of a
kernel object.

The alternative, userland spinlocks:

1. Cost zero in terms of kernel resource consumption.

2. Are a major catastrophe when running on a UP box
(you can spin as long as you want, nothing's going to
change since there's only one CPU and you're hogging it)

3. A CPU hog at best when running on an SMP box: the spinning
thread gobbles up a whole 100% of a CPU.

"Smart" spinlocks basically try and do it this way:

    int spinLoops = GetNumberOfProcsICanRunOn() > 1 ? someBigNumber : 1;
    while (1)
    {
        int n = spinLoops;
        while (n-- && !gotIt)
            tryAndGetTheSpinLock();
        if (gotIt) break;
        sched_yield();
    }

These seem to have all the qualities I want:

1. On a UP box, the thread will poke at the spinlock only once,
and then yield and *hopefully* not get rescheduled until later, giving
the lock owner a chance to finish its stuff and release.

Net result: the spinning thread causes no CPU consumption, no kernel
object is allocated, and the locking behaviour is proper.

2. On an SMP box, the thread will bang on the spinlock a large
number of times, hoping to get it before it gets task-switched away.
If it does, great: no time lost.
If it doesn't, we're out of luck: yield the CPU and try again next time.

Net result: a tunable balance between CPU consumption and task-switching
costs, no kernel object allocated, and proper locking behaviour.
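
A compilable rendering of the sketch above, assuming C11 atomics in
place of the hypothetical helpers, and sysconf() as an approximation
of GetNumberOfProcsICanRunOn() (it ignores CPU affinity):

#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void smart_spin_lock(void)
{
    /* Spin hard only if more than one CPU could be running us. */
    int spins = sysconf(_SC_NPROCESSORS_ONLN) > 1 ? 10000 : 1;

    for (;;)
    {
        for (int n = 0; n < spins; n++)
            if (!atomic_flag_test_and_set(&lock))
                return;            /* got it before being switched away */
        sched_yield();             /* out of luck: give up the CPU */
    }
}

void smart_spin_unlock(void)
{
    atomic_flag_clear(&lock);
}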


I'm sure this issue has been beaten to death before, but I
just wanted to give my $.02.

- Mgix




2002-06-18 22:03:36

by David Schwartz

Subject: RE: Question about sched_yield()


> 3. A CPU hog at best when running on an SMP box: the spinning
> thread gobbles up a whole 100% of a CPU.

For the few hundred cycles some other thread holds the lock.

>"Smart" spinlocks basically try and do it this way:
>
> int spinLoops = GetNumberOfProcsICanRunOn() > 1 ? someBigNumber : 1;
> while (1)
> {
>     int n = spinLoops;
>     while (n-- && !gotIt)
>         tryAndGetTheSpinLock();
>     if (gotIt) break;
>     sched_yield();
> }
>
>These seem to have all the qualities I want:

Almost.

>2. On an SMP box, the thread will bang on the spinlock a large
>number of times, hoping to get it before it gets task-switched away.
>If it does, great: no time lost.
>If it doesn't, we're out of luck, yield the CPU and try again next time.

You should limit how many times you spin in this loop. If it gets to be too
many, you should block.

You can simply block by sleeping for a few milliseconds. If you don't like
the idea that one thread will release the lock while the other wastes time
sleeping, then associate a kernel lock with the spinlock when a thread gives
up waiting; have your unlock function check for an associated kernel lock and,
if there is one, unlock it.
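
A hedged sketch of that second scheme; the struct and names are
invented for illustration, with a POSIX semaphore playing the
"kernel lock" associated with the spinlock:

#include <semaphore.h>
#include <stdatomic.h>

/* Simplified: a userland spin word plus a semaphore touched only by
 * threads that gave up spinning. Initialize locked = 0, waiters = 0,
 * and sem_init(&sem, 0, 0). Real code needs more care with ordering. */
struct smartlock {
    atomic_int locked;   /* 0 = free, 1 = held */
    atomic_int waiters;  /* threads that stopped spinning */
    sem_t      sem;
};

void smartlock_acquire(struct smartlock *l)
{
    for (int n = 0; n < 1000; n++)        /* bounded spin, tunable */
        if (!atomic_exchange(&l->locked, 1))
            return;                        /* got it while spinning */
    atomic_fetch_add(&l->waiters, 1);      /* too many spins: block */
    while (atomic_exchange(&l->locked, 1))
        sem_wait(&l->sem);                 /* sleep until an unlock */
    atomic_fetch_sub(&l->waiters, 1);
}

void smartlock_release(struct smartlock *l)
{
    atomic_store(&l->locked, 0);
    if (atomic_load(&l->waiters))          /* the unlock-time check */
        sem_post(&l->sem);                 /* wake one sleeper */
}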

DS


2002-06-18 22:39:04

by Ingo Molnar

Subject: RE: Question about sched_yield()


On Tue, 18 Jun 2002 [email protected] wrote:

> However, I am not aware of any alternative to communicate what I really
> want to the scheduler, and here's why. If anyone has ideas on how to do
> this better, please, I'm all ears.
>
> It's basically about spinlocks and the cost of task switching.
>
> I'm trying to implement "smart" spinlocks.

the right solution for you is the new ultra-fast kernel-helped,
user-space semaphores. Those are kernel objects where the semaphore data
structure is accessible to user-space as well. Acquiring the semaphore in
the no-contention case is very cheap: basically a single assembly
instruction. In the contention case you'll enter the kernel and schedule
away.

(in the sched_yield() case you schedule away just as much in the
contention case, so there isn't any difference.)
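
(These semaphores became the futex(2) interface.) A minimal sketch of
the fast/slow path described above, written against the later futex
API; the constants and raw syscall shown here are today's interface,
not what existed in 2.5.21:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int futex_word;   /* 0 = free, 1 = held */

void fmutex_lock(void)
{
    int expected = 0;
    /* No-contention case: one atomic instruction, no kernel entry. */
    if (atomic_compare_exchange_strong(&futex_word, &expected, 1))
        return;
    /* Contention case: enter the kernel and schedule away until woken. */
    do {
        syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
        expected = 0;
    } while (!atomic_compare_exchange_strong(&futex_word, &expected, 1));
}

void fmutex_unlock(void)
{
    atomic_store(&futex_word, 0);
    /* A real implementation tracks waiters to skip this syscall. */
    syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}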

for short-lived but contended locks this mechanism can be improved a bit by
looping 'some more' on the lock before entering the kernel and blocking -
but it could also be a loss if the amount of additional looping is too
long or too short. Since the O(1) scheduler context-switches in under 1
microsecond on a 1 GHz box, and has no SMP interlocking or cache-thrashing
problems, you'll still get 1 million context switches per second, per CPU.
And this is the 'slow' case.

Ingo

2002-06-19 02:11:08

by jw schultz

Subject: Re: Question about sched_yield()

Let's back up a moment, please.

On Mon, Jun 17, 2002 at 05:46:29PM -0700, David Schwartz wrote:
>
> On Sat, 15 Jun 2002 15:15:32 -0700, [email protected] wrote:
> >
> >Hello,
> >
> >I am seeing some strange linux scheduler behaviours,
> >and I thought this'd be the best place to ask.
> >
> >I have two processes, one that loops forever and
> >does nothing but calling sched_yield(), and the other
> >is basically benchmarking how fast it can compute
> >some long math calculation.
> [snip]
>
> You seem to have a misconception about what sched_yield is for. It is not a
> replacement for blocking or a scheduling priority adjustment. It simply lets
> other ready-to-run tasks be scheduled before returning to the current task.
>
> Here's a quote from SuS3:
>
> "The sched_yield() function shall force the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes no
> arguments."

From this description it sounds like sched_yield was
intended to deal with scheduling of threads within a
process, not between processes - namely, that there is a
"thread list" within a single process context.

This would be used for yielding to another thread within the
same list (context) when the current thread needs a
contended-for resource (an inter-thread/intra-context lock)
or the result of another thread's calculation. Or simply to
allow switching, round-robin style, between a set of
compute-intensive threads by salting the compute loops with
sched_yield.

Am I right?

Linux, as we know, treats threads and independent processes
as equals. Therefore, the process scheduler becomes
responsible for dealing with sched_yield and somehow
translating sched_yield's intent to a situation where threads are
contending with other processes instead of just with their
fellows, and where their fellows can wind up with differing
priorities.

Handling sched_yield wrongly could result in a livelock or
deadlock (I'm fuzzy at the moment on the distinction, sorry)
if one thread is sched_yield/spinning on a resource pending
the action of a lower-priority thread. It would seem that
sched_yield should make sure that the calling thread has a
priority lower than the lowest-priority thread within the
thread group.

"until it again becomes the head of its thread list"
sounds to me like all other threads in the group should have
an opportunity to run before the calling thread does.

I would suggest that, in the case where no other threads
in the group are runnable, the process should
sleep(0) so another (lower-priority) process can run.



--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-06-19 11:29:08

by Bill Davidsen

Subject: Re: Question about sched_yield()

On Tue, 18 Jun 2002, Ingo Molnar wrote:

>
> On Mon, 17 Jun 2002, David Schwartz wrote:
>
> > >I am seeing some strange linux scheduler behaviours,
> > >and I thought this'd be the best place to ask.
> > >
> > >I have two processes, one that loops forever and
> > >does nothing but calling sched_yield(), and the other
> > >is basically benchmarking how fast it can compute
> > >some long math calculation.
>
> > [...] It is not a replacement for blocking or a scheduling priority
> > adjustment. It simply lets other ready-to-run tasks be scheduled before
> > returning to the current task.
>
> and this is what the scheduler didn't do properly: it actually didn't
> schedule valid ready-to-run processes for long milliseconds, it switched
> between two sched_yield processes, starving the CPU-intensive process. I
> posted a patch for 2.5 that fixes this, and the 2.4.19-pre10-ac2 backport
> I did includes this fix as well:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-2.4.19-pre10-ac2-A4
>
> a good sched_yield() implementation should give *all* other tasks a chance
> to run, instead of switching between multiple sched_yield()-ing tasks. I
> don't think this is specified anywhere, but it's a quality-of-implementation
> issue.

Clearly your fix is subtle enough that a quick reading, at least by me,
can't follow all the nuances, but it looks on first reading as if your
patch fixes this too well, and will now starve the threaded process by
giving it only one turn at the CPU per full pass through the queue, rather
than sharing the timeslice between threads of a process.

I posted some thoughts on this to Robert; if you would comment I would
appreciate it. My thought is that if you have N processes and one is
threaded, the aggregate of all threads should be 1/N of the CPU(s), not
vastly more or less. I think with your change it will be less if they are
sharing a resource. Feel free to tell me I misread what it does, or that my
desired behaviour is not correct.

I do run 15 machines with a long-running threaded server application and
periodic CPU-hog things like log roll and compress, stats generation,
certain database cleanup, etc., and I have more than intellectual curiosity
about this behaviour.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-06-19 11:33:34

by Bill Davidsen

Subject: Re: Question about sched_yield()

On Wed, 19 Jun 2002, Rusty Russell wrote:

> On Mon, 17 Jun 2002 17:46:29 -0700
> David Schwartz <[email protected]> wrote:
> > "The sched_yield() function shall force the running thread to relinquish the
> > processor until it again becomes the head of its thread list. It takes no
> > arguments."
>
> Notice how incredibly useless this definition is. It's even defined in terms
> of UP.

I think you parse this differently than I do; I see no reference to UP. The
term "the processor" clearly (to me at least) means the processor running
that thread at the time of the yield.

The number of processors running a single thread at any one time is an
integer in the range zero to one.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-06-19 11:49:33

by Ingo Molnar

Subject: scheduler timeslice distribution, threads, processes. [was: Re: Question about sched_yield()]


On Wed, 19 Jun 2002, Bill Davidsen wrote:

> Clearly your fix is subtle enough that a quick reading, at least by me,
> can't follow all the nuances, but it looks on first reading as if your
> patch fixes this too well, and will now starve the threaded process by
> only giving one turn at the CPU per full pass through the queue, rather
> than sharing the timeslice between threads of a process.

the important point is that to the Linux scheduler there is only one
context of execution that matters: one Linux thread. The kernel itself is
deeply threaded - in kernel mode all threads access all data structures of
each other. Whether threads share certain resources (the VM or files) at
the userspace level is irrelevant to the scheduler. My strong point is
that this is not an accidental property of Linux, it's very intentional.
CPU time partitioning and VM-sharing are two distinct concepts and they
should not be artificially connected.

> I posted some thoughts on this to Robert, if you would comment I would
> appreciate. My thought is that if you have N processes and one is
> threaded, the aggregate of all threads should be 1/N of the CPU(s), not
> vastly more or less. I think with your change it will be less if they
> are sharing a resource. Feel free to tell me I misread what it does, or
> my desired behaviour is not correct.

'processes' are threads that have an isolated set of resources. The
possible sharing relationships between threads are much more complex than
what can be expressed via a simplistic 'lightweight process' vs.
'heavyweight process' picture. [or the 'container process/shell' and
'member threads' concept used in a particular OS, designed in the early
90s ;-) ] It's a fundamental property of the scheduler that it shares
timeslices between threads of execution - regardless of the
resource-sharing done by those threads.

[in fact it's almost impossible to get an accurate picture of resource
sharing that can be used by the scheduler; the resource-sharing
capabilities of Linux are very fine-grained. It's possible to share just a
single file descriptor between two threads, and it's possible to share a
given segment of virtual memory as well.]

so in your example, if you have 10 isolated threads ('normal processes'),
and a 'process' that has 5 shared-all threads, the scheduler only sees
that there are 15 threads in the system - and each thread will get 1/15th
of the CPU resources [if all threads have the same priority].

but I'm not fundamentally against enabling users to partition their CPU
resource usage better - i.e. to be able to tell which particular set of
threads should get a given percentage of CPU time. What I'm against is
tying, hardcoding this to the 'all VM sharing' property of threads - like
other OSs do. Why shouldn't it be possible to give the 10 isolated threads
30% of all CPU time, and the remaining 5 shared-all threads 70%? By
decoupling the ability to partition CPU time in a fine-grained way, we
can not only do what you describe, but also win lots of flexibility,
and the whole model becomes much cleaner.

> I do run 15 machines with long running a threaded server application and
> periodic CPU hog things like log roll and compress, stats generation,
> certain database cleanup, etc, and I have more than intelectual
> curiousity on this behaviour.

right now you cannot set specific hard percentages for given jobs, but you
can give them a relative priority via nice(). While nice() isn't exactly a
partitioning tool (e.g. the CPU time used up increases with the number of
threads), it's pretty close. But being able to partition timeslices more
accurately is something that will happen eventually - and the scheduler is
ready for it.
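
A small illustration of the nice() approach (a sketch, not from the
original mail): the periodic batch jobs renice themselves so the
server threads win scheduling contention:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Drop our own priority by 10 before doing batch work
     * (log rolling, compression, stats), so that server
     * threads are preferred by the scheduler. */
    errno = 0;
    if (nice(10) == -1 && errno != 0)
        perror("nice");

    /* ... batch work goes here ... */
    return 0;
}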

Ingo

2002-06-19 15:31:04

by Rusty Russell

Subject: Re: Question about sched_yield()

In message <[email protected]> you write:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> > On Mon, 17 Jun 2002 17:46:29 -0700
> > David Schwartz <[email protected]> wrote:
> > > "The sched_yield() function shall force the running thread to relinquish
the
> > > processor until it again becomes the head of its thread list. It takes no

> > > arguments."
> >
> > Notice how incredibly useless this definition is. It's even defined
> > in terms of UP.
>
> I think you parse this differently than I, I see no reference to UP. The
> term "the processor" clearly (to me at least) means the processor running
> in that thread at the time of the yeild.
>
> The number of processors running in a single thread at any one time is an
> integer number in the range zero to one.

It's the word "until": "relinquish the processor until".

It's pretty clearly implied that it's going to "unrelinquish" *the
processor* at the end of this process.

So, by your definition, it can be scheduled on another CPU before it
becomes head of the thread list?

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-19 22:29:49

by Bill Davidsen

Subject: Re: Question about sched_yield()

On Thu, 20 Jun 2002, Rusty Russell wrote:

> In message <[email protected]> you write:
> > On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > > On Mon, 17 Jun 2002 17:46:29 -0700
> > > David Schwartz <[email protected]> wrote:
> > > > "The sched_yield() function shall force the running thread to relinquish
> the
> > > > processor until it again becomes the head of its thread list. It takes no
>
> > > > arguments."
> > >
> > > Notice how incredibly useless this definition is. It's even defined in ter
> ms
> > > of UP.
> >
> > I think you parse this differently than I, I see no reference to UP. The
> > term "the processor" clearly (to me at least) means the processor running
> > in that thread at the time of the yeild.
> >
> > The number of processors running in a single thread at any one time is an
> > integer number in the range zero to one.
>
> It's the word "until": "relinquish the processor until".
>
> It's pretty clearly implied that it's going to "unrelinquish" *the
> processor* at the end of this process.
>
> So, by your definition, it can be scheduled on another CPU before it
> becomes head of the thread list?

I have to read "head of the thread list" broadly, and assume it means the
thread will be run when it is the most appropriate thread to be run. I
don't read the wording as requiring or forbidding SMP, uni, or a strict
round-robin scheduler. The term "head of the thread list" doesn't state
that all other processes must get a time slice.

I believe I clarified my concerns in another post, I don't want to repeat
them. I don't want a process with threads contending for a resource to get
all or none of the CPU, each of which is possible with various schedulers
and patches recently available. I'd like to see threads of a single
process be able to get, use, and share a timeslice before some CPU hog
comes in and gets his timeslice.

I don't read the text you quoted as requiring much more than "run someone
else", because it's comfortably vague. To me anyway. I'm not knocking
anyone who reads it more strictly, I just don't see what you did.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-06-19 22:39:20

by Ingo Molnar

Subject: Re: Question about sched_yield()


On Wed, 19 Jun 2002, Bill Davidsen wrote:

> [...] I'd like to see threads of a single process be able to get, use,
> and share a timeslice before some cpu hog comes in and get his
> timeslice.

there is no such concept as 'threads of a single process' in Linux, and
this is not just a naming difference. In Linux threads are threads, and
whether they share the same set of pagetables or not is secondary to the
kernel. (there are lots of other resources they might or might not share
between each other.)

the OS where processes 'own' threads, where the process is a container,
where this concept is pretty much the only meaningful multiprogramming
concept, and where the kernel API is separated into per-thread and
per-process parts is not called Linux.

Ingo