2007-10-15 16:56:17

by Ingo Molnar

[permalink] [raw]
Subject: [patch] scsi: fix crash in gdth_timeout()


the patch below fixes a bootup-crash bug merged via today's SCSI git
merge:

commit df3d80f5a5c74168be42788364d13cf6c83c7b9c
Merge: 3d06f7a... c8e91b0...
Author: Linus Torvalds <[email protected]>
Date: Mon Oct 15 08:19:33 2007 -0700

Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6

--------------------->
Subject: scsi: fix crash in gdth_timeout()
From: Ingo Molnar <[email protected]>

fix the following bootup crash in gdth_timeout():

[ 40.739625] BUG: spinlock bad magic on CPU#1, swapper/1
[ 40.744822] lock: 788a6e40, .magic: 784d8cec, .owner: <none>/-1, .owner_cpu: 2018346022
[ 40.752881] [<78104f2c>] show_trace_log_lvl+0x1a/0x2f
[ 40.757994] [<781058d1>] show_trace+0x12/0x14
[ 40.762414] [<781058e9>] dump_stack+0x16/0x18
[ 40.766833] [<782f1683>] spin_bug+0xa1/0xa9
[ 40.771079] [<782f17b0>] _raw_spin_lock+0x1e/0xe4
[ 40.775846] [<78643f9f>] _spin_lock_irqsave+0x53/0x64
[ 40.781461] [<784b224a>] gdth_timeout+0x1c/0xa6
[ 40.786054] [<7812632f>] run_timer_softirq+0x142/0x1a4
[ 40.791254] [<78123fb8>] __do_softirq+0x7b/0xf1
[ 40.795847] [<78105fb8>] do_softirq+0x61/0xc7
[ 40.800267] =======================

the bug is that list_first_entry() assumes a non-empty list.

A further problem is probably that the GDTH timer is not stopped by a
failed GDTH probe?

Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/scsi/gdth.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux/drivers/scsi/gdth.c
===================================================================
--- linux.orig/drivers/scsi/gdth.c
+++ linux/drivers/scsi/gdth.c
@@ -3793,6 +3793,9 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;

+ if (list_empty(&gdth_instances))
+ return;
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);


2007-10-15 17:12:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()



On Mon, 15 Oct 2007, Ingo Molnar wrote:
>
> A further problem is probably that the GDTH timer is not stopped by a
> failed GDTH probe?

Indeed. Maybe this is a better fix?

That driver is pretty messy, and this should have been found ealier.
James? Boaz?

Linus

---
drivers/scsi/gdth.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index e8010a7..3ac080e 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -5213,6 +5213,10 @@ static int __init gdth_init(void)
#endif /* CONFIG_PCI */

TRACE2(("gdth_detect() %d controller detected\n", gdth_ctr_count));
+
+ if (list_empty(&gdth_instances))
+ return -ENODEV;
+
#ifdef GDTH_STATISTICS
TRACE2(("gdth_detect(): Initializing timer !\n"));
init_timer(&gdth_timer);

2007-10-15 17:57:31

by Jeff Garzik

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()

Linus Torvalds wrote:
>
> On Mon, 15 Oct 2007, Ingo Molnar wrote:
>> A further problem is probably that the GDTH timer is not stopped by a
>> failed GDTH probe?
>
> Indeed. Maybe this is a better fix?
>
> That driver is pretty messy, and this should have been found ealier.
> James? Boaz?

FWIW, the gdth driver was "super-messy". With this latest SCSI pull,
that severity has been successfully downgraded to "messy" :)

IMO some easy-to-fix breakage was inevitable with such a large volume of
fundamental changes.

Honestly, the driver is probably rarely run by people that lack the
hardware, I bet...

Jeff



2007-10-15 18:48:56

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()

On Mon, Oct 15 2007 at 19:57 +0200, Jeff Garzik <[email protected]> wrote:
> Linus Torvalds wrote:
>> On Mon, 15 Oct 2007, Ingo Molnar wrote:
>>> A further problem is probably that the GDTH timer is not stopped by a
>>> failed GDTH probe?
>> Indeed. Maybe this is a better fix?
>>
>> That driver is pretty messy, and this should have been found ealier.
>> James? Boaz?
>
> FWIW, the gdth driver was "super-messy". With this latest SCSI pull,
> that severity has been successfully downgraded to "messy" :)
>
> IMO some easy-to-fix breakage was inevitable with such a large volume of
> fundamental changes.
>
> Honestly, the driver is probably rarely run by people that lack the
> hardware, I bet...
>
> Jeff

It was all "flight by instruments only". I called for HW testers and none
came forward. All these changes, apart from "successful downgrade to messy"
where also needed in order to push important changes to scsi.

But a little bird said that QEMU might simulate this HW. SO I guess it is
QEMU time for me.

Boaz

2007-10-15 19:25:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()


* Linus Torvalds <[email protected]> wrote:

> On Mon, 15 Oct 2007, Ingo Molnar wrote:
> >
> > A further problem is probably that the GDTH timer is not stopped by a
> > failed GDTH probe?
>
> Indeed. Maybe this is a better fix?
>
> That driver is pretty messy, and this should have been found ealier.
> James? Boaz?

> @@ -5213,6 +5213,10 @@ static int __init gdth_init(void)
> #endif /* CONFIG_PCI */
>
> TRACE2(("gdth_detect() %d controller detected\n", gdth_ctr_count));
> +
> + if (list_empty(&gdth_instances))
> + return -ENODEV;

indeed - with this patch instead of the one i did the CONFIG_SCSI_GDTH=y
bzImage boots up fine too, and no crash:

Calling initcall 0x80a417e0: gdth_init+0x0/0xb20()
GDT-HA: Storage RAID Controller Driver. Version: 3.05
GDT-HA: Found 0 PCI Storage RAID Controllers
initcall 0x80a417e0: gdth_init+0x0/0xb20() returned -19.
initcall 0x80a417e0 ran for 5 msecs: gdth_init+0x0/0xb20()

Tested-by: Ingo Molnar <[email protected]>

Ingo

2007-10-15 19:27:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()


* Boaz Harrosh <[email protected]> wrote:

> > Honestly, the driver is probably rarely run by people that lack the
> > hardware, I bet...
>
> It was all "flight by instruments only". I called for HW testers and
> none came forward. All these changes, apart from "successful downgrade
> to messy" where also needed in order to push important changes to
> scsi.
>
> But a little bird said that QEMU might simulate this HW. SO I guess it
> is QEMU time for me.

heh. Incidentally i was thinking about using KVM for automated testing.
Important pieces of hardware should get an in-KVM simulator/emulator,
that way developers who do not own that hardware can do functionality
testing too. So basically the highest-quality drivers would have an
"inverse driver" in KVM, which simulates the hardware. (that model is
evidently useful to the hardware maker even for new hardware: it can
then also be used to test the Linux compatibility and Linux performance
of future planned releases of the hardware, etc.)

Ingo

2007-10-15 19:39:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()



On Mon, 15 Oct 2007, Ingo Molnar wrote:
>
> heh. Incidentally i was thinking about using KVM for automated testing.

Using emulators to test device drivers is almost certain to be pointless.

The problem with device drivers tends to be timing issues, odd hardware
interactions, and lots of strange (and sometimes undocumented) behaviour
and dependencies (eg things like "you have to wait 50us after setting the
reset bit until the hardware has actually reset").

These are all things that you'd generally not catch in emulation - because
the emulation by necessity is only going to be a very weak picture of the
real thing.

So I suspect you can find the easy stuff, but only by writing insanely
complex device model descriptions in the emulator environment itself, and
only for those device models that have actually been written.

Can it be donein theory? Sure. Practically? Not so sure. Would it help? I
suspect the effort to do the device model would be many times bigger than
*any* conceivable effort to just fix the driver bugs as they get found
through other means.

So we could perhaps have *really* good emulated hardware for a few models
of hw out there, but likely they'd be fewer and less varied platforms than
most kernel developers end up having hidden under their desk anyway..

Linus

2007-10-15 20:07:29

by Alan

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()

On Mon, 15 Oct 2007 12:38:06 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 15 Oct 2007, Ingo Molnar wrote:
> >
> > heh. Incidentally i was thinking about using KVM for automated testing.
>
> Using emulators to test device drivers is almost certain to be pointless.

For some things. I do it a bit because you can use it to fake
failures that are tricky to do in the real world. It won't tell you the
driver works but its suprisingly good for testing for races (forcing IRQ
delivery at specific points), buggy hardware you don't posess, and things
like media failures and timeouts your real hardware refuses to do.

Alan

2007-10-15 20:18:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()



On Mon, 15 Oct 2007, Alan Cox wrote:
>
> For some things. I do it a bit because you can use it to fake
> failures that are tricky to do in the real world. It won't tell you the
> driver works but its suprisingly good for testing for races (forcing IRQ
> delivery at specific points), buggy hardware you don't posess, and things
> like media failures and timeouts your real hardware refuses to do.

Heh. I do agree that you likely find bugs, even if quite often it's
exactly because the behaviour is something that will never happen on real
hardware.

But failure testing is very useful - I forget who it was who debugged some
driver by taking a CD and just scrathing it mercilessly to induce read
errors ;)

Having a really *bad* HW emulator can certainly work that way too, even if
it also would probably end up hitting just a few of the potential error
paths..

Linus

2007-10-15 21:56:00

by James Bottomley

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()

On Mon, 2007-10-15 at 10:08 -0700, Linus Torvalds wrote:
>
> On Mon, 15 Oct 2007, Ingo Molnar wrote:
> >
> > A further problem is probably that the GDTH timer is not stopped by a
> > failed GDTH probe?
>
> Indeed. Maybe this is a better fix?
>
> That driver is pretty messy, and this should have been found ealier.
> James? Boaz?

It was:

http://marc.info/?t=119238793200002

and that patch was my best guess for fixing this as well ... do we have
someone with actual hardware to confirm it yet? (apparently QEMU uses
gdth as some type of emulated controller).

James


2007-10-15 22:23:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] scsi: fix crash in gdth_timeout()


* Linus Torvalds <[email protected]> wrote:

> On Mon, 15 Oct 2007, Ingo Molnar wrote:
> >
> > heh. Incidentally i was thinking about using KVM for automated
> > testing.
>
> Using emulators to test device drivers is almost certain to be
> pointless.

something like that wont enable 100% coverage (or even reasonable
coverage for most hardware), so it's no replacement for actual hard
testing, but it could push out the domain of minimally tested code quite
a bit and increase the quality of the kernel. Races are always tough and
so are bugs on the side of the hardware, but it's the silly boot-time
crash showstoppers and "device does not work anymore" mistakes that
causes us to lose most of the testers and early adopters.

I'm not really worried about the 1% of bugs that are tough to find and
fix (because they are actually fun to find and fix), i'm worried about
the 99% easy and boring bugs - because they annoy users just as much as
the exciting bugs do. If we fix them faster then there's more time (and
more tester stamina) left for the harder to find bugs.

so i think adding redundancy in form of a simplified hw emulator can
certainly not hurt and fundamentally increases robustness - and it will
definitely reduce the chance for a whole host of stupid bugs that are
not in the hardware but are in the ~4 million lines of Linux driver
codebase. Limits and scalability would also be testable: "if i put 32 of
these networking cards into a Linux box, will the Linux driver blow up".

not that i think this is realistic for any significant portion of the
hardware currently - unless some hw maker starts doing it. But KVM will
have a good portion of the core PC platform emulated (APIC, etc.) - and
that's a nice step forward already.

Ingo