The semantics of sched_yield() changed in the 2.5 kernel.
In the 2.4 series it meant "sleep a little".
The new 2.5 semantics are correct (move to the end of the
run queue) but can mean "sleep a lot" under load.
This already bit ext3 transaction batching, c.f. Andrew Morton's
>[PATCH] remove the sched_yield from the ext3 fsync path
>
>The changed sched_yield() semantics have made ext3's transaction
>batching terribly slow.
>
>Apparently a schedule() fixes that, although it probably breaks
>transaction batching.
>
>This patch largely fixes my complaints about the new scheduler being
>extremely sluggish to interactive applications. Evidently those
>applications were calling fsync() and were spending extremely long
>periods in sched_yield().
Maybe it is worth auditing the kernel source files using yield()?
[There are only 33 of them, so not too bad - see below].
A number of them have comments like /* sleep a little */, so the
authors presumably weren't expecting to get "sleep a lot"...
Here is the list of files using yield(), excluding non-i386 arch specific files:
net/ipv4/tcp_output.c
net/sched/sch_generic.c
net/sunrpc/sched.c
net/unix/af_unix.c
net/socket.c
mm/oom_kill.c
mm/page_alloc.c
kernel/sched.c (in migration_call)
kernel/softirq.c
kernel/suspend.c
init/do_mounts.c
fs/jbd/journal.c
fs/jbd/revoke.c
fs/nfs/pagelist.c
fs/reiserfs/journal.c
fs/ufs/truncate.c
fs/buffer.c
fs/exec.c
fs/locks.c
fs/super.c
drivers/cdrom/cdu31a.c
drivers/cdrom/sonycd535.c
drivers/ide/ide-disk.c
drivers/message/i2o/i2o_core.c
drivers/net/e100/e100_eeprom.c
drivers/net/e100/e100_main.c
drivers/net/e100/e100_phy.c
drivers/net/e100/e100_test.c
drivers/net/depca.c
drivers/net/sb1000.c
drivers/net/sis900.c
drivers/net/slip.c
arch/i386/mm/fault.c
Thoughts?
Duncan.
On Tue, 15 Oct 2002, Duncan Sands wrote:
> Maybe it is worth auditing the kernel source files using yield()?
most definitely.
> Here is the list of files using yield(), excluding non-i386 arch
> specific files:
>
[...]
> mm/oom_kill.c
this one i think is OK.
> kernel/sched.c (in migration_call)
this is okay as well.
> kernel/softirq.c
these are okay too - both are nonperformance bits.
> arch/i386/mm/fault.c
okay as well, it's a last-ditch effort to not kill init, so yielding is
the right thing to do here.
the others i think should be fixed. (but there might be exceptions.)
Ingo
Hi Duncan,
> The semantics of sched_yield() changed in the 2.5 kernel.
> In the 2.4 series it meant "sleep a little".
> The new 2.5 semantics are correct (move to the end of the
> run queue) but can mean "sleep a lot" under load.
>
> This already bit ext3 transaction batching, c.f. Andrew Morton's
>
>> [PATCH] remove the sched_yield from the ext3 fsync path
where did you read this ^^? :)
ciao, Marc
> > This already bit ext3 transaction batching, c.f. Andrew Morton's
> >
> >> [PATCH] remove the sched_yield from the ext3 fsync path
>
> where did you read this ^^? :)
>
> ciao, Marc
http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html|ChangeSet@-7d
Duncan.
On Tue, Oct 15, 2002 at 03:36:38PM +0200, Duncan Sands wrote:
> fs/locks.c
http://marc.theaimsgroup.com/?l=linux-kernel&m=103352923323135&w=2
I don't think the maintainer has submitted a patch for it yet.
regards
john
--
"CUT IT OUT FACEHEAD"
- jeffk
On Tuesday 15 October 2002 19:12, John Levon wrote:
> On Tue, Oct 15, 2002 at 03:36:38PM +0200, Duncan Sands wrote:
> > fs/locks.c
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103352923323135&w=2
>
> I don't think the maintainer has submitted a patch for it yet.
A fix has been applied in Linus's tree. See
http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html|ChangeSet@-3d
Duncan.
Hi!
> Here is the list of files using yield(), excluding non-i386 arch specific files:
>
...
> kernel/suspend.c
This is okay.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]
> > Here is the list of files using yield(), excluding non-i386 arch specific
> > files:
>
> ...
>
> > kernel/suspend.c
>
> This is okay.
Hi Pavel, I agree. I have some questions about the code though:
when you come across a thread with (p->flags & PF_FROZEN), why
break out of the loop? Why not just skip this thread and go on to the
next one? Also, does it matter if the code doing the freezing is itself
frozen?
Ciao, Duncan.
PS: Here is the code, for reference:
do {
todo = 0;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
unsigned long flags;
INTERESTING(p);
if (p->flags & PF_FROZEN)
continue;
/* FIXME: smp problem here: we may not access other proc
ess' flags
without locking */
p->flags |= PF_FREEZE;
spin_lock_irqsave(&p->sig->siglock, flags);
signal_wake_up(p);
spin_unlock_irqrestore(&p->sig->siglock, flags);
todo++;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
yield();
if (time_after(jiffies, start_time + TIMEOUT)) {
printk( "\n" );
return todo;
}
} while(todo);
Hi!
> > > Here is the list of files using yield(), excluding non-i386 arch specific
> > > files:
> >
> > ...
> >
> > > kernel/suspend.c
> >
> > This is okay.
>
> Hi Pavel, I agree. I have some questions about the code though:
> when you come across a thread with (p->flags & PF_FROZEN), why
> break out of the loop? Why not just skip this thread and go on to
> the
There's "continue;" in there, and it should "just skip this thread".
Pavel
> next one? Also, does it matter if the code doing the freezing is itself
> frozen?
>
> Ciao, Duncan.
>
> PS: Here is the code, for reference:
>
> do {
> todo = 0;
> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> unsigned long flags;
> INTERESTING(p);
> if (p->flags & PF_FROZEN)
> continue;
>
> /* FIXME: smp problem here: we may not access other proc
> ess' flags
> without locking */
> p->flags |= PF_FREEZE;
> spin_lock_irqsave(&p->sig->siglock, flags);
> signal_wake_up(p);
> spin_unlock_irqrestore(&p->sig->siglock, flags);
> todo++;
> } while_each_thread(g, p);
> read_unlock(&tasklist_lock);
> yield();
> if (time_after(jiffies, start_time + TIMEOUT)) {
> printk( "\n" );
> return todo;
> }
> } while(todo);
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
> > Hi Pavel, I agree. I have some questions about the code though:
> > when you come across a thread with (p->flags & PF_FROZEN), why
> > break out of the loop? Why not just skip this thread and go on to
> > the
>
> There's "continue;" in there, and it should "just skip this thread".
> Pavel
I meant, why not just do as in the following code (I've changed
INTERESTING also, for same reason, as explained below):
do {
todo = 0;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
unsigned long flags;
if (
!(p->flags & PF_IOTHREAD) &&
(p != current) &&
(p->state != TASK_ZOMBIE) &&
!(p->flags & PF_FROZEN)
) {
/* FIXME: smp problem here: we may not access other process' flags
without locking */
p->flags |= PF_FREEZE;
spin_lock_irqsave(&p->sig->siglock, flags);
signal_wake_up(p);
spin_unlock_irqrestore(&p->sig->siglock, flags);
todo++;
}
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
yield();
if (time_after(jiffies, start_time + TIMEOUT)) {
printk( "\n" );
printk(KERN_ERR " stopping tasks failed (%d tasks remaining)\n", todo );
return todo;
}
} while(todo);
The reason is that yield(), which sends the current task to the expired list,
can take a long time before it runs again. With the current code, every time
you meet, for example, a kernel thread you break out of the loop, perform
a yield (= wait a long time), before going on to the next thread. This could
take forever. With code like that above, you mark as many tasks frozen as
possible, with as few yields as possible. Isn't that better?
Ciao,
Duncan.
Well goodness me! That just goes to show that spending
all your time programming in other languages can be harmful
to your C (or: to your brain)! Let me just crawl away and find
a hole to die in before someone pours salt on me...
Ciao,
Duncan.
Would it be sensible to add a "yield_short()" function to the kernel?
mark
On Sun, Oct 20, 2002 at 11:10:33AM +0200, Duncan Sands wrote:
> > > Hi Pavel, I agree. I have some questions about the code though:
> > > when you come across a thread with (p->flags & PF_FROZEN), why
> > > break out of the loop? Why not just skip this thread and go on to
> > > the
> >
> > There's "continue;" in there, and it should "just skip this thread".
> > Pavel
>
> I meant, why not just do as in the following code (I've changed
> INTERESTING also, for same reason, as explained below):
>
> do {
> todo = 0;
> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> unsigned long flags;
>
> if (
> !(p->flags & PF_IOTHREAD) &&
> (p != current) &&
> (p->state != TASK_ZOMBIE) &&
> !(p->flags & PF_FROZEN)
> ) {
>
> /* FIXME: smp problem here: we may not access other process' flags
> without locking */
> p->flags |= PF_FREEZE;
> spin_lock_irqsave(&p->sig->siglock, flags);
> signal_wake_up(p);
> spin_unlock_irqrestore(&p->sig->siglock, flags);
> todo++;
> }
> } while_each_thread(g, p);
> read_unlock(&tasklist_lock);
> yield();
> if (time_after(jiffies, start_time + TIMEOUT)) {
> printk( "\n" );
> printk(KERN_ERR " stopping tasks failed (%d tasks remaining)\n", todo );
> return todo;
> }
> } while(todo);
>
> The reason is that yield(), which sends the current task to the expired list,
> can take a long time before it runs again. With the current code, every time
> you meet, for example, a kernel thread you break out of the loop, perform
> a yield (= wait a long time), before going on to the next thread. This could
> take forever. With code like that above, you mark as many tasks frozen as
> possible, with as few yields as possible. Isn't that better?
>
> Ciao,
>
> Duncan.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/
On Tuesday 22 October 2002 19:24, Mark Mielke wrote:
> Would it be sensible to add a "yield_short()" function to the kernel?
You can do:
set_current_state(TASK_RUNNING);
schedule();
I'm not clear on what you are guaranteed to get - for example,
it looks as if it can sometimes return without sleeping at all.
There is also cond_resched(), which is
static inline void cond_resched(void)
{
if (need_resched())
__cond_resched();
}
void __cond_resched(void)
{
set_current_state(TASK_RUNNING);
schedule();
}
I don't know when need_resched evaluates to true, I haven't
had time to look into this yet.
Ciao, Duncan.
On Tuesday 15 October 2002 18:27, Duncan Sands wrote:
> > > This already bit ext3 transaction batching, c.f. Andrew Morton's
> > >
> > >> [PATCH] remove the sched_yield from the ext3 fsync path
> >
> > where did you read this ^^? :)
> >
> > ciao, Marc
>
> http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html|Change
>Set@-7d
Actually its
http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html|ChangeSet@-3w
Sorry about that.
Duncan.
This is my current understanding of some different ways to schedule:
(1) Yield to a higher priority task (if there is one):
cond_resched();
(2) Yield to the next task (if there is another one):
set_current_state(TASK_RUNNING);
schedule();
(3) Yield to all tasks:
yield();
Notes:
(1) Also yields if the current task's time quantum has expired.
(3) Only guaranteed to yield to all tasks on the active list.
Corrections?
Duncan.