2002-01-10 22:41:50

by Ingo Molnar

[permalink] [raw]
Subject: [patch] O(1) scheduler, -H5


the -H5 patch adds a debugging check:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre11-H5.patch

it adds code to catch places that call schedule() from global-cli()
sections. Right now release_kernel_lock() doesnt automatically release the
IRQ lock if there is no kernel lock held. A fair amount of code does this
still, and i think we should fix them in 2.5.

(Such code, while of questionable quality, is safe if it also holds the
big kernel lock, but it's definitely SMP-unsafe it doesnt hold the bkl -
the BUG() assert only catches the later case.)

(Andi Kleen noticed this on the first day the patch was released, and
Andrew Morton reminded me today that i forgot to fix it ... :-| )

my systems do not trigger the BUG(), so there cannot be all that much
broken code left.

Ingo


2002-01-11 11:31:59

by Russell King

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

On Fri, Jan 11, 2002 at 01:38:51AM +0100, Ingo Molnar wrote:
> it adds code to catch places that call schedule() from global-cli()
> sections. Right now release_kernel_lock() doesnt automatically release the
> IRQ lock if there is no kernel lock held. A fair amount of code does this
> still, and i think we should fix them in 2.5.

The serial driver (old or new) open/close functions are one of the worst
offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
I'm currently working on fixing this in the new serial driver.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-11 11:42:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

Russell King <[email protected]> writes:
>
> The serial driver (old or new) open/close functions are one of the worst
> offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
> I'm currently working on fixing this in the new serial driver.

When they hold the kernel lock in addition to the global cli() before
schedule() it should be ok. Only the behaviour of code not holding
kernel lock but global cli and calling schedule() has changed.

-Andi

2002-01-11 11:57:50

by Russell King

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

On Fri, Jan 11, 2002 at 12:42:14PM +0100, Andi Kleen wrote:
> When they hold the kernel lock in addition to the global cli() before
> schedule() it should be ok. Only the behaviour of code not holding
> kernel lock but global cli and calling schedule() has changed.

Agreed, however, there is one thing that has bugged me for a long time
(and which I believe is causing someone a problem at the moment) - when
we shut down a port, we're holding the BKL, and have global IRQs disabled.
We unhook the port from the serial drivers chain, and maybe free and
reclaim the IRQ with a different handler, and then disable the IRQ from
the port in question.

If we happen to schedule within request_irq, it doesn't take too much
imagination to see that Bad Things can happen.

(There is a report of complete lockup, and re-ordering stuff around here
fixes the problem, but the example patch changed a number of things, and
I'm trying to work towards a proper solution).

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-11 12:01:00

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

Andi Kleen wrote:
>
> Russell King <[email protected]> writes:
> >
> > The serial driver (old or new) open/close functions are one of the worst
> > offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
> > I'm currently working on fixing this in the new serial driver.
>
> When they hold the kernel lock in addition to the global cli() before
> schedule() it should be ok. Only the behaviour of code not holding
> kernel lock but global cli and calling schedule() has changed.

well the biggest serial.c offender is block_til_ready of course...
oh and there's quite some dusty old code that does

save_flags();
cli();

while (some_condition)
sleep_on(&queue);

eg not re-disabling interrupts after the sleep_on().....
to the point where just about every use of
sleep_on/interruptible_sleep_on is buggy
except in serial.c ;(

2002-01-11 12:07:15

by David Miller

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

From: Russell King <[email protected]>
Date: Fri, 11 Jan 2002 11:31:31 +0000

The serial driver (old or new) open/close functions are one of the worst
offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
I'm currently working on fixing this in the new serial driver.

The tty layer is really the only layer that hasn't had any
"SMP love and care" given to it.

Last time I looked at trying to do something it appeared you could fix
the whole thing up with a semaphore for the user bits and a spinlock
with which to interlock with the drivers.

2002-01-11 12:57:56

by Alan

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

> The serial driver (old or new) open/close functions are one of the worst
> offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
> I'm currently working on fixing this in the new serial driver.

Someone fixed serial.c to use spinlocks a long while ago. Its just not
merged

2002-01-11 14:58:42

by Russell King

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

On Fri, Jan 11, 2002 at 01:09:03PM +0000, Alan Cox wrote:
> > The serial driver (old or new) open/close functions are one of the worst
> > offenders of the global-cli-and-hold-kernel-lock-and-schedule problem.
> > I'm currently working on fixing this in the new serial driver.
>
> Someone fixed serial.c to use spinlocks a long while ago. Its just not
> merged

Unfortunately it wasn't a simple "replace global irq with spinlocks" - some
code also got moved around so its not clear that the problem was fixed by
the spinlocks or the code reordering. I'd rather know which it was.

In addition, holding a spinlock while calling request_irq is asking for
trouble.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-11 15:11:40

by Alan

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

> Unfortunately it wasn't a simple "replace global irq with spinlocks" - some
> code also got moved around so its not clear that the problem was fixed by
> the spinlocks or the code reordering. I'd rather know which it was.

The code re-ordering fixes the bug. The spinlocks are an unrelated change
that belong in a seperate diff.

2002-01-11 15:24:06

by Russell King

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5

On Fri, Jan 11, 2002 at 03:22:48PM +0000, Alan Cox wrote:
> > Unfortunately it wasn't a simple "replace global irq with spinlocks" - some
> > code also got moved around so its not clear that the problem was fixed by
> > the spinlocks or the code reordering. I'd rather know which it was.
>
> The code re-ordering fixes the bug. The spinlocks are an unrelated change
> that belong in a seperate diff.

This is at odds with the evidence at present:

| I'll give it a try, but from what I experienced in those days was that
| adding the _spinlock protection_ finally solved all.

He tried 2.4.17 without patch which locked up, and then he tried 2.4.17
with the patch, which didn't. Unfortunately it contains both the reordering
and the spinlocking.

I'm trying to work with him at the moment to find out which change fixed
the problem.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-01-13 15:21:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -H5


On Fri, 11 Jan 2002, Mike Kravetz wrote:

> reacquire_kernel_lock(current);
> if (unlikely(current->need_resched))
> goto need_resched_back;
> return;
>
> The question is why do we reacquire the kernel lock before checking
> for need_resched?. If it is not needed, we can save a lock/unlock
> cycle in the case of need_resched.

that code is something i discovered when trying to reduce scheduling
latencies for the very first lowlatency patchset i released. Back then,
1-2-3 years ago, kernel_lock usage was still common in the kernel. So it
often happened that tasks were spending more than 1 msec spinning for the
kernel lock, and often they did that in the reacquire code. So to reduce
latencies, i've added ->need_resched checking to kernel_lock() and
reacquire_kernel_lock() as well.

these days kernel_lock contention doesnt happen all that often, so i think
we should remove it from the 2.5 tree. I consider the preemptible kernel
patch to be the most advanced method to get low scheduling-latencies
anyway.

> This code isn't new with the O(1) scheduler, so I'm guessing there is
> a need to hold the kernel_lock when checking need_resched. I just
> don't know what it is.

there is no need for it to be under the kernel_lock - i simply found it to
be an easy and common preemption point.

Ingo