2004-10-28 13:28:39

by Jaroslav Kysela

Subject: Re: [Alsa-devel] Oops in 2.6.10-rc1

On Thu, 28 Oct 2004, Christian wrote:

> [<c01fc7b8>] pci_enable_device_bars+0x28/0x40
> [<c01fc7ef>] pci_enable_device+0x1f/0x40
> [<e082729d>] snd_ensoniq_create+0x1d/0x480 [snd_ens1371]
> [<e08469cf>] snd_card_new+0x1cf/0x2c0 [snd]

It's a bit of a deadlock, because we cannot help you. It seems that
the pci structure passed to our code is broken. The driver has had
no changes in initialization for a long time.

Jaroslav

-----
Jaroslav Kysela <[email protected]>
Linux Kernel Sound Maintainer
ALSA Project, SUSE Labs


2004-10-28 14:10:00

by Christian Kujau

Subject: Re: [Alsa-devel] Oops in 2.6.10-rc1

Jaroslav Kysela wrote:
> On Thu, 28 Oct 2004, Christian wrote:
>
>
>> [<c01fc7b8>] pci_enable_device_bars+0x28/0x40
>> [<c01fc7ef>] pci_enable_device+0x1f/0x40
>> [<e082729d>] snd_ensoniq_create+0x1d/0x480 [snd_ens1371]
>> [<e08469cf>] snd_card_new+0x1cf/0x2c0 [snd]
>
>
> It's a bit of a deadlock, because we cannot help you. It seems that
> the pci structure passed to our code is broken. The driver has had
> no changes in initialization for a long time.

so, it's a kernel problem again, not related to the alsa framework?

i see in

http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.10-rc1

[...]
<[email protected]>
[PATCH] i386/io_apic init section fixups

<[email protected]>
[PATCH] vm: convert users of remap_page_range() under sound/ to
use remap_pfn_range()
[...]

so i'll revert the patches and see what it gives.

thank you,
Christian
--
BOFH excuse #131:

telnet: Unable to connect to remote host: Connection refused

2004-11-04 15:19:24

by Christian Kujau

Subject: Re: [Alsa-devel] Oops in 2.6.10-rc1

hm,

still no sound with snd_ens1371, but now i spent some time finding out how
to revert a patch with bk. while compiling is still ongoing, let me tell
you how i tried to revert the patch with bk, because i am not entirely
sure if i did the right thing here:

bk changes > ../changes-04-11-2004.txt

as written before, i suspect (!) two changes here:

> [...]
> <[email protected]>
> [PATCH] i386/io_apic init section fixups
>
> <[email protected]>
> [PATCH] vm: convert users of remap_page_range() under sound/ to
> use remap_pfn_range()
> [...]
>
> so i'll revert the patches and see what it gives.

in ../changes-04-11-2004.txt i found the ChangeSet numbers:
1.1988.72.76 + 1.2000.5.77. then i did

bk undo -a1.1988.72.76

only to find out that i misread the manual and 1.1988.72.76 is still in
place. i did

bk changes > ../changes-1.1988.72.76.txt

and the very patch has a different ChangeSet now: 1.2202. so i did

bk undo -a1.2201

is this the right way to revert patches when subsequent patches might not
allow a simple "bk undo -r<vers>" (because subsequent patches rely on
this single ChangeSet)?

thank you for your assistance,
Christian
--
BOFH excuse #182:

endothermal recalibration

2004-11-05 02:36:18

by Christian Kujau

Subject: Re: [Alsa-devel] Oops in 2.6.10-rc1

hi again,

i *think* i found the ChangeSet leading to the bug i tried to report in
http://marc.theaimsgroup.com/?l=linux-kernel&m=109888178603516&w=2

the error is still present here (and only here? strange...), the latest -BK
does not fix it. i had some difficulties in telling BK to do the right
thing. to summarise the error:

- upon loading of snd_ens1371 the Oops occurs. system is still stable
  then, but no sound available.
- this occurred somewhere between 2.6.9 (released 15-Oct-2004) and 2.6.10-rc1
  (released 22-Oct-2004)

one interesting changeset was:

[email protected], 2004-10-20 20:33:06+02:00, [email protected]
Merge suse.cz:/home/perex/bk/linux-sound/linux-2.5
into suse.cz:/home/perex/bk/linux-sound/linux-sound

i tried to back it out:

$ bk clone -r1.2000.7.1 linux-2.6-BK linux-2.6-BK-test

but the said ChangeSet was still there (of course). i tried to back it out
(now for sure):

$ bk undo -a1.2010
(hm: the changesets get renumbered every time i "do" something with the
tree) this one reverted quite a few ChangeSets but i let it happen.

compiling & booting this thing goes fine and i am now running 2.6.9-BK(?)
with working snd_ens1371.

if someone could give me a hint here what to do next or perhaps tell me
that the whole thing was totally pointless - please say so.
i am somehow lost as to which is the right person to bug here.

thank you for your time,
Christian.
--
BOFH excuse #328:

Fiber optics caused gas main leak

2004-11-07 01:24:52

by Christian Kujau

Subject: Oops in 2.6.10-rc1

hi again,

i *think* i found the ChangeSet leading to the bug i tried to report in
http://marc.theaimsgroup.com/?l=linux-kernel&m=109888178603516&w=2

the error is still present here (and only here? strange...), the latest -BK
does not fix it. i had some difficulties in telling BK to do the right
thing. to summarise the error:

- upon loading of snd_ens1371 the Oops occurs. system is still stable
  then, but no sound available.
- this occurred somewhere between 2.6.9 (released 15-Oct-2004) and 2.6.10-rc1
  (released 22-Oct-2004)

one interesting changeset was:

[email protected], 2004-10-20 20:33:06+02:00, [email protected]
Merge suse.cz:/home/perex/bk/linux-sound/linux-2.5
into suse.cz:/home/perex/bk/linux-sound/linux-sound

i tried to back it out:

$ bk clone -r1.2000.7.1 linux-2.6-BK linux-2.6-BK-test

but the said ChangeSet was still there (of course). i tried to back it out
(now for sure):

$ bk undo -a1.2010
(hm: the changesets get renumbered every time i "do" something with the
tree) this one reverted quite a few ChangeSets but i let it happen.

compiling & booting this thing goes fine and i am now running 2.6.9-BK(?)
with working snd_ens1371.

if someone could give me a hint here what to do next or perhaps tell me
that the whole thing was totally pointless - please say so.
i am somehow lost as to which is the right person to bug here.

thank you for your time,
Christian.
--
BOFH excuse #328:

Fiber optics caused gas main leak

2004-11-07 07:02:37

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Sun, 7 Nov 2004, Christian Kujau wrote:
>
> if someone could give me a hint here what to do next or perhaps tell me
> that the whole thing was totally pointless - please say so.
> i am somehow lost as to which is the right person to bug here.

Since you seem to be a BK user, try doing a

bk revtool sound/pci/ens1370.c

and see if you can find the change that caused your problem. Of course,
the real change might be somewhere else in the sound driver initialization
path, so it's not like just that one file might be the cause. Regardless,
the more you can pinpoint when the problem started, the better.

Also, if you enable frame pointers (under kernel debugging), the traceback
will look a bit better. As it is, your oops looks like something has
jumped off into la-la-land by jumping through a bad pointer (the value is
still in %ecx), but it's definitely not clear _where_ that happened.
Your trace points to pci_enable_device_bars(), but that may well be just
stale stack contents.

Linus

2004-11-07 13:05:14

by Pekka Enberg

Subject: Re: Oops in 2.6.10-rc1

Hi Christian,

On Sun, 07 Nov 2004 02:24:41 +0100, Christian Kujau <[email protected]> wrote:
> if someone could give me a hint here what to do next or perhaps tell me
> that the whole thing was totally pointless - please say so.
> i am somehow lost as to which is the right person to bug here.

I am running 2.6.10-rc1-bk14 with ens-1371 working ok. Could you
please post your .config so I can try to reproduce your oops?

Pekka

2004-11-07 13:10:50

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

On Sat, 6 Nov 2004 23:02:28 -0800 (PST), Linus Torvalds wrote
>
> Since you seem to be a BK user, try doing a

s/BK user/BK beginner/

>
> bk revtool sound/pci/ens1370.c
>
> and see if you can find the change that caused your problem.

hm, i already found the ChangeSet ([email protected]), but it seems
the ChangeSets get renumbered when linux makes progress. the issuer of
this changeset did not comment yet.

> Of course, the real change might be somewhere else in the
> sound driver initialization path, so it's not like just that
> one file might be the cause. Regardless, the more you can
> pinpoint when the problem started, the better.

yes.

>
> Also, if you enable frame pointers (under kernel debugging),
> the traceback will look a bit better. As it is, your oops

ah, ok, will do.

thank you for your time,
Christian.
--
BOFH excuse #206:

Police are examining all internet packets in the search for a
narco-net-trafficker

2004-11-07 13:44:25

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

Pekka Enberg schrieb:
>
> I am running 2.6.10-rc1-bk14 with ens-1371 working ok. Could you
> please post your .config so I can try to reproduce your oops?

i put it on
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config

thank you,
Christian.
--
BOFH excuse #361:

Communist revolutionaries taking over the server room and demanding all
the computers in the building or they shoot the sysadmin. Poor misguided
fools.

2004-11-07 16:02:29

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

>> bk revtool sound/pci/ens1370.c
>>
>>and see if you can find the change that caused your problem.

since i got this oops between 2.6.9 and 2.6.10-rc1 i am still assuming
that the change was made somewhere between 15-Oct-2004 (2.6.9) and
22-Oct-2004 (2.6.10-rc1). so the only Changeset matching this timespan is:

-------------------------
[email protected], 2004-10-20 08:10:43-07:00, [email protected]
[PATCH] module_param_array() should take a pointer

module_param_array() takes a variable to put the number of elements in.
Looking through the uses, many people don't care, so they declare a dummy
or share one variable between several parameters. The latter is
problematic because sysfs uses that number to decide how many to display.

The solution is to change the variable arg to a pointer, and if the
pointer is NULL, use the "max" value. This change is fairly small, but
fixing up the callers is a lot of (trivial) churn.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
-------------------------

>>Also, if you enable frame pointers (under kernel debugging),
>>the traceback will look a bit better. As it is, your oops

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config

the new config has this enabled:

CONFIG_DEBUG_DRIVER=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_FRAME_POINTER=y
CONFIG_KPROBES=y

shows the output of dmesg after doing "modprobe snd-ens1371". after this,
snd-ens1371 seems to be loaded:

Module                  Size  Used by
snd_ens1371            29928  1
snd_rawmidi            25952  1 snd_ens1371
snd_ac97_codec         77856  1 snd_ens1371
snd_pcm               101768  2 snd_ens1371,snd_ac97_codec
snd_timer              31940  1 snd_pcm
snd                    51620  5 snd_ens1371,snd_rawmidi,snd_ac97_codec,snd_pcm,snd_timer
soundcore               9440  1 snd
snd_page_alloc          7620  1 snd_pcm
ipv6                  260480  8
psmouse                20424  0
rtc                    20188  0

but is not working and cannot be unloaded:

prinz:~$ rmmod snd_ens1371
ERROR: Module snd_ens1371 is in use
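
For reading a listing like the one above: the third lsmod column is the module's use count and the fourth names the modules holding a reference. A minimal Python sketch, assuming a trimmed copy of the listing as input (`parse_lsmod` is a hypothetical helper, not an existing tool), shows that snd_ens1371 has a use count of 1 with no module listed as its user, which is consistent with rmmod refusing to unload it:

```python
# Trimmed sample of the listing above; not live output.
LSMOD = """\
Module                  Size  Used by
snd_ens1371            29928  1
snd_rawmidi            25952  1 snd_ens1371
snd_pcm               101768  2 snd_ens1371,snd_ac97_codec
"""

def parse_lsmod(text):
    """Map module name -> (use_count, [modules holding a reference])."""
    table = {}
    for line in text.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue                        # ignore malformed or wrapped lines
        name, count = fields[0], int(fields[2])
        users = fields[3].split(",") if len(fields) > 3 else []
        table[name] = (count, [u for u in users if u])
    return table

mods = parse_lsmod(LSMOD)
count, users = mods["snd_ens1371"]
print("snd_ens1371: use count", count, "held by modules:", users)
```

A nonzero count with an empty user list suggests the reference is held by something other than another module, but the listing alone does not say what.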

there was an answer from the alsa-devel folks here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109897024116288&w=2

"It's a bit of a deadlock, because we cannot help you. It seems that
the pci structure passed to our code is broken. The driver has had
no changes in initialization for a long time."

i hope this information will help a bit.
thank you for your assistance, i really appreciate it
Christian

(still wondering why nobody else has this bug, 1370 is not *that* weird, i
thought)


PS: if someone could explain to me why the ChangeSet numbers are always
different: i've used "bk revtool sound/pci/ens1370.c" to find out the
changes for this file and the suspicious patch reads

sound/pci/[email protected], 2004-10-20....

in "bk revtool". the changelog however reads:

[email protected], 2004-10-20 08:10:43-07:00, [email protected]

--
BOFH excuse #62:

need to wrap system in aluminum foil to fix problem

2004-11-07 16:58:13

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Sun, 7 Nov 2004, Christian Kujau wrote:
>
> since i got this oops between 2.6.9 and 2.6.10-rc1 i am still assuming
> that the change was made somewhere between 15-Oct-2004 (2.6.9) and
> 22-Oct-2004 (2.6.10-rc1).

Not necessarily. The ALSA merge is the most likely reason for the oops,
and since ALSA development does not merge with the kernel very often, it
may be some much older change in the ALSA tree.

You can check the ALSA tree _before_ the merge, by doing (in the current
tree):

bk undo -a1.2000.7.2

which should give you a tree without any of "my" stuff, ie it was what
Jaroslav was working on before he merged it into the standard tree.

(BK revision numbers change on merges, so the above number is not
necessarily the right one unless you have the current -bk tree. It should
have a changeset something like:

[email protected], 2004-10-20 20:51:33+02:00, [email protected]
Merge suse.cz:/home/perex/bk/linux-sound/linux-sound
into suse.cz:/home/perex/bk/linux-sound/work

so that you can double-check).

> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops.txt

Yup, it's a call through a bad pointer again, and again the EIP value
can be found in %ecx. But the source of the bug is not clear. The stack
trace implies "show_stack()", but that function doesn't do any indirect
calls, so I suspect the frame pointer didn't help in this case.

And it's not "pci_enable_device()" either (which was there last time too),
since that one calls "pci_enable_device_bars()" at the point it shows in
the stack trace.

Quite frankly, it looks like something smashed the stack, and the fact
that it happens _around_ when "pci_enable_device()" was called makes me
seriously suspect the IRQ handler for the device. That's when IRQ routing
is enabled, so often the interrupts start at that point. And since
FRAME_POINTER didn't make the stack frame look sane, it's very possible
that the bogus call isn't due to a real "call", but due to a return from a
broken stack.

> there was an answer from the alsa-devel folks here:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109897024116288&w=2
>
> "It's a bit of a deadlock, because we cannot help you. It seems that
> the pci structure passed to our code is broken. The driver has had
> no changes in initialization for a long time."

I seriously doubt that it's the PCI structure being broken. It's the ALSA
merge, almost certainly - it's just that the stack is so confused that
it's hard to tell where the bug has happened.

And I'll double-check the "regparm" changes, just in case. They change
some irq calling conventions, although none of the involved stuff seems to
be implied here.

A quick suggestion: make sure that there is not some stale object file
lying around confusing things about memory layout, and do a "make clean"
and make sure that all old modules are clean too and re-installed. The
kernel dependencies should be correct, but even then there can be problems
with clocks that are off a bit etc.

> (still wondering why nobody else has this bug, 1370 is not *that* weird, i
> thought)

Yes, that makes me suspicious, and is one reason why I wonder if it's just
your tree not being built right.

> PS: if someone could explain to me why the ChangeSet numbers are always
> different: i've used "bk revtool sound/pci/ens1370.c" to find out the
> changes for this file and the suspicious patch reads
>
> sound/pci/[email protected], 2004-10-20....
>
> in "bk revtool". the changelog however reads:
>
> [email protected], 2004-10-20 08:10:43-07:00, [email protected]

There are different revision numbers: there's the revision number for the
_file_, and there is the revision number for the _change_.

Also, both (or one) of them can change when a merge occurs, since other
people may have had different merge histories, and in a distributed
environment the revision numbers are a lot more fluid than in CVS.

Linus

2004-11-07 18:31:14

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

On Sun, 7 Nov 2004 08:57:40 -0800 (PST), Linus Torvalds wrote
>
> You can check the ALSA tree _before_ the merge, by doing (in
> the current tree):
>
> bk undo -a1.2000.7.2
>
> which should give you a tree without any of "my" stuff, ie it
> was what Jaroslav was working on before he merged it into the
> standard tree.

yes, i already did so, i think:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109979092216919&w=2

but i did it this way:
bk clone -r1.2000.7.1 linux-2.6-BK linux-2.6-BK-test
bk undo -a1.2010

(probably wrong, so i'll repeat it as you suggested)

> (BK revision numbers change on merges, so the above number is
> not necessarily the right one unless you have the current -bk

aha!

> A quick suggestion: make sure that there is not some stale
> object file lying around confusing things about memory layout,
> and do a "make clean" and make sure that all old modules are
> clean too and re-installed.

really: i always do "make clean", even "make mrproper" sometimes, just
to be sure. and i am quite certain, that i did not forget to install the
modules. but i'll keep my eyes open, yes.

> The kernel dependencies should be correct, but even then there can be
> problems with clocks that are off a bit etc.

i'm updating via "ntpdate" on every boot. i am even using a (faster) 2nd
machine for my build and the bk things right now: building a current -bk
on both hosts gives me this error.

> Yes, that makes me suspicious, and is one reason why I wonder
> if it's just your tree not being built right.

i'll build a -bk snapshot from a tar.bz2 later on and see what it gives.

> There are different revision numbers: there's the revision
> number for the _file_, and there is the revision number for
> the _change_.

aha. it was kinda confusing...now i got it, i think ;)

again: thank you for your time on this rainy weekend,
Christian.
--
BOFH excuse #8:

static buildup

2004-11-07 18:44:27

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Sun, 7 Nov 2004, Christian Kujau wrote:

> On Sun, 7 Nov 2004 08:57:40 -0800 (PST), Linus Torvalds wrote
> >
> > You can check the ALSA tree _before_ the merge, by doing (in
> > the current tree):
> >
> > bk undo -a1.2000.7.2
> >
> > which should give you a tree without any of "my" stuff, ie it
> > was what Jaroslav was working on before he merged it into the
> > standard tree.
>
> yes, i already did so, i think:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109979092216919&w=2
>
> but i did it this way:
> bk clone -r1.2000.7.1 linux-2.6-BK linux-2.6-BK-test
> bk undo -a1.2010

Hmm.. That may well have worked fine, but it sounds in that post like you
tried to undo the ALSA stuff, and what I suggested was really to do the
reverse: take _only_ the ALSA changes, and then if it still fails, at
least you have now pinpointed it a bit more (admittedly to the _likely_
source, but that's as it should be: you narrow down the "known bad" source
base until you've narrowed it down to the smallest change you can find
that causes the problem).
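
The narrowing-down idea can be sketched as a binary search over an ordered list of changesets. This is only an illustration in Python under simplifying assumptions: the history is linear, the changeset labels and the `oopses` test are made up, and each test stands for one build-and-boot cycle. It is not a BK feature.

```python
from typing import Callable, Sequence

def first_bad(changesets: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest changeset for which is_bad() is True.

    Assumes the list is ordered oldest to newest, the newest entry is bad,
    and every changeset after the first bad one is bad as well.
    """
    lo, hi = 0, len(changesets) - 1      # invariant: changesets[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changesets[mid]):
            hi = mid                     # culprit is at mid or earlier
        else:
            lo = mid + 1                 # culprit is after mid
    return changesets[lo]

# Hypothetical history of 16 changesets; the labels are invented.
history = [f"cs{i}" for i in range(1, 17)]
tested = []

def oopses(cs: str) -> bool:
    """Stand-in for 'build, boot, modprobe, look for an oops'."""
    tested.append(cs)
    return int(cs[2:]) >= 11             # pretend cs11 introduced the oops

culprit = first_bad(history, oopses)
print(culprit, "found after", len(tested), "test builds")
```

With a real tree the history is a graph rather than a list, so merges have to be handled separately, as the rest of the thread shows.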

> > Yes, that makes me suspicious, and is one reason why I wonder
> > if it's just your tree not being built right.
>
> i'll build a -bk snapshot from a tar.bz2 later on and see what it gives.

Sounds like you're doing everything right, but hey, it can't hurt to
double-check.

Linus

2004-11-07 23:46:11

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

Christian Kujau schrieb:
> On Sun, 7 Nov 2004 08:57:40 -0800 (PST), Linus Torvalds wrote
>
>> bk undo -a1.2000.7.2
>>
>>which should give you a tree without any of "my" stuff, ie it
>>was what Jaroslav was working on before he merged it into the
>>standard tree.

i did so from a current tree (bk pull, undo, -r get) and it's working
fine (url wraps):

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops-2.6.9_a1.2000.7.2.txt

so i can see with "bk changes" that the ChangeSet is still there. this is
what i expected, because -a says:

-a<rev> Remove all changesets which occurred after <rev>.

what i did not expect is that this ChangeSet is now *not* the culprit,
because there is no oops. am i right? [1]

>>Yes, that makes me suspicious, and is one reason why I wonder
>>if it's just your tree not being built right.
>
> i'll build a -bk snapshot from a tar.bz2 later on and see what it gives.

i've built from linux-2.6.10-rc1.tar.bz2 with patch-2.6.10-rc1-bk17.bz2
from kernel.org with the same .config and "modprobe snd-ens1371" oopses as
expected :(

> Hmm.. That may well have worked fine, but it sounds in that post like
> you tried to undo the ALSA stuff, and what I suggested was really to
> do the reverse: take _only_ the ALSA changes, and then if it still

yes, i wanted to undo the alsa changes because i suspected the alsa
framework (sorry guys) and wanted to see if it still oopses when the
latest alsa patch was not applied.

i did another thing: i enabled the (deprecated) OSS driver (es1371.ko)
and tried to load this thing:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-OSS.txt

it oopses.
- you said it's not a b0rken pci thingy
- i have to assume now that it's not an ALSA issue (since oss oopses too)
- is it OSS? the driver? i've CC'ed linux-sound...


> fails, at least you have now pinpointed it a bit more (admittedly to
> the _likely_ source, but that's as it should be: you narrow down the
> "known bad" source base until you've narrowed it down to the smallest
> change you can find that causes the problem).

yes, like Documentation/BUG-HUNTING says. but i seem to have difficulties
in using my tools (bk). sorry for that.

> Sounds like you're doing everything right, but hey, it can't hurt to
> double-check.

yes, i really hope that it's not just a user error (on my side). building
kernels since 2.0...but you never know...


thanks again for help,
Christian
(whose only wish these days is to get over this strange thing and not
wasting people's precious time with a "sound driver". hey, at least the
box is booting...)

--
BOFH excuse #224:

Jan 9 16:41:27 huber su: 'su root' succeeded for .... on /dev/pts/1

2004-11-08 01:17:00

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Mon, 8 Nov 2004, Christian Kujau wrote:
>
> what i did not expect is that this ChangeSet is now *not* the culprit,
> because there is no oops. am i right? [1]

Yes.

So now I'd like to know _where_ the culprit is, since it turned out to be
not the ALSA code.

> i did another thing: i enabled the (deprecated) OSS driver (es1371.ko)
> tried to load this thing:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-OSS.txt
>
> it oopses.
> - you said it's not a b0rken pci thingy
> - i have to assume now that it's not an ALSA issue (since oss oopses too)
> - is it OSS? the driver? i've CC'ed linux-sound...

Sounds like something else changed, and likely the ALSA _and_ the OSS
driver both broke. Which is not all that unlikely, since I suspect they
share a lot of history.

> yes, like Documentation/BUG-HUNTING says. but i seem to have difficulties
> in using my tools (bk). sorry for that.

Not your fault. Think of this as a learning experience ;)

Anyway, now the _other_ driver also oopses, and with a very similar
oops too, so it looks like they both depended on some undocumented (or
changed) detail in the PCI layer. Next step would be to see if the thing
that breaks is this merge:

[email protected], 2004-11-04 17:07:16-08:00, [email protected]
Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux

which merges Greg's PCI/driver model changes.

It's all the same steps you took with the ALSA merge, you're a
professional by now ;)

Greg, have you followed this thread?

> (whose only wish these days is to get over this strange thing and not
> wasting people's precious time with a "sound driver". hey, at least the
> box is booting...)

Hey, sound is important. And especially if you somehow found something
non-sound that just broke sound by mistake, all the more important to fix
it.

Linus

2004-11-08 13:02:03

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

Linus Torvalds schrieb:
>
> Not your fault. Think of this as a learning experience ;)

it definitely is, yes.

> Anyway, now the _other_ driver also oopses, and with a very similar
> oops too, so it looks like they both depended on some undocumented (or
> changed) detail in the PCI layer. Next step would be to see if the thing
> that breaks is this merge:

may i ask how you come to this conclusion? by technical knowledge or could
this be deduced by some bk magic too?

>
> [email protected], 2004-11-04 17:07:16-08:00, [email protected]
> Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
> into ppc970.osdl.org:/home/torvalds/v2.6/linux
>
> which merges Greg's PCI/driver model changes.
>
> It's all the same steps you took with the ALSA merge, you're a
> professional by now ;)

i did "bk undo -a1.2463" from a current -BK tree and it oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-a1.2463.txt

(i've booted with different boot options this time, because i noticed that
i always booted with "acpi=force". changing this did not help either.)

next i wanted to do "bk undo -r1.2463" now to see if it does *not* break
without this ChangeSet (because i already know it *breaks* with this
ChangeSet) but that would leave some parentless child deltas. i read in
the BK docs that "bk cset -x<version>" would help here. but
"bk cset -x1.2463" aborts:

---------------------
evil@atlant:~/kernel/linux-2.6-BK$ bk changes | head -n3
[email protected], 2004-11-04 17:07:16-08:00, [email protected]
Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux

evil@atlant:~/kernel/linux-2.6-BK$ bk cset -x1.2463
cset: Merge cset found in revision list: (1.2463). Aborting. (cset1)
---------------------

i've put everything on http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/
the .configs, the oopses are there. i've double checked a kernel built
from "bk undo -a1.2000.7.2" yesterday but the result was the same (no oops)

thank you,
Christian.
--
BOFH excuse #121:

halon system went off and killed the operators.

2004-11-08 18:19:22

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Mon, 8 Nov 2004, Christian Kujau wrote:
>
> > Anyway, now the _other_ driver also oopses, and with a very similar
> > oops too, so it looks like they both depended on some undocumented (or
> > changed) detail in the PCI layer. Next step would be to see if the thing
> > that breaks is this merge:
>
> may i ask how you come to this conclusion? by technical knowledge or could
> this be deduced by some bk magic too?

No, just gut feel. If the pre-merge ALSA works, and the post-merge one
doesn't, and the oops in both cases happen somewhere close to where it
does "pci_enable_device()", there's not a lot left. There are interrupts,
and there is the PCI layer...

> > [email protected], 2004-11-04 17:07:16-08:00, [email protected]
> > Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
> > into ppc970.osdl.org:/home/torvalds/v2.6/linux
> >
> > which merges Greg's PCI/driver model changes.
> >
> > It's all the same steps you took with the ALSA merge, you're a
> > professional by now ;)
>
> i did "bk undo -a1.2463" from a current -BK tree and it oopses:

Note that "bk undo -axxx" will _leave_ xxx in place, and undo everything
after.

So what you did still has the merge in the tree, and that it still oopses
is thus to be expected. BUT, we're getting closer.
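
A small Python model of the behaviour described here, assuming that "everything after" means every changeset that is neither the named revision nor part of its history. The graph and the names `ancestors` and `undo_after` are hypothetical and only illustrate the semantics; they say nothing about how BK is implemented.

```python
def ancestors(parents, rev):
    """Return rev plus every changeset reachable through parent links."""
    seen, stack = set(), [rev]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def undo_after(parents, rev):
    """Model of 'undo -a<rev>': keep rev and its history, drop the rest."""
    return ancestors(parents, rev)

# Hypothetical graph: "merge" joins the "trunk" and "side" lines of work.
parents = {
    "base": [],
    "trunk": ["base"],
    "side": ["base"],
    "merge": ["trunk", "side"],
    "later": ["merge"],
}

kept_at_merge = undo_after(parents, "merge")   # the merge itself survives
kept_at_trunk = undo_after(parents, "trunk")   # the merge and "side" are gone
print(sorted(kept_at_merge))
print(sorted(kept_at_trunk))
```

In this model, naming the merge itself keeps the merged work in the tree, while naming the pre-merge tip on one side drops both the merge and the other side.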

> next i wanted to do "bk undo -r1.2463" now to see if it does *not* break
> without this ChangeSet (because i already know it *breaks* with this
> ChangeSet) but that would leave some parentless child deltas. i read in
> the BK docs that "bk cset -x<version>" would help here. but
> "bk cset -x1.2463" aborts:

"cset -x" only works on patches, not on complex operations. You still want
"bk undo", but you want to use "bk revtool" to see what the merge point
was, and tell _which_ of the merged top-of-trees you want to get to.

In other words, you can't just undo a merge, you need to tell which _way_
to undo it. See? It does actually make sense, and "bk revtool" will show
you the relationships of merges (at least if the time range is big enough
to show enough info).

Anyway, if you have the top-of-tree-is-1.2463, then go to "bk revtool",
and select that node in the graph by clicking on it. Notice how those
edges turned white, and you can now easily see which children were
pre-merge.

In this case, the top-of-tree tree _without_ the PCI merge is 1.2462:

[email protected], 2004-11-04 17:06:13-08:00, [email protected]
Merge bk://kernel.bkbits.net/gregkh/linux/usb-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux

(you won't see it in "bk changes", since it's a trivial merge: use "bk
changes -a" to see it). So just before I merged Greg's PCI changes, I
merged his USB changes.

Now, that's fine - the USB merge is likely to be ok, so try doing

bk undo -a1.2462

and you will now have a tree that is exactly the same as before, except it
does _not_ have the PCI merge from Greg.

And if this one does not oops, you can now officially blame Greg.

Now, if you want to get _really_ fancy, you can now look at each changeset
that differed, with something like

bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' -

which is black magic that does a set operation and shows all the changes
in between the sets of "bk at 1.2462" and "bk at 1.2463".

(This is _not_ the same as "bk changes -r1.2462..1.2463", because that one
just shows the single merge change that is on the direct _path_ from one
changeset to another. The black magic thing shows the set difference of
changesets that comes from the full graph at two points).
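
The difference between the two can be sketched in Python on a toy graph.
This is an illustration of the idea only, not how bk implements it, and the
changesets "base", "pci-a" and "pci-b" are invented placeholders:

```python
# Toy changeset graph: {revision: [parents]}. Only the two revision
# numbers are from the thread; the other nodes are hypothetical.
PARENTS = {
    "base": [],
    "1.2462": ["base"],
    "pci-a": ["base"],
    "pci-b": ["pci-a"],
    "1.2463": ["1.2462", "pci-b"],  # merge with two parents
}

def ancestors(graph, rev):
    """rev plus everything reachable by following parent links."""
    seen = set()
    stack = [rev]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph[cur])
    return seen

def set_difference(graph, old, new):
    """Everything in the tree at 'new' that is not in the tree at 'old'."""
    return ancestors(graph, new) - ancestors(graph, old)

def direct_path(graph, old, new):
    """Only the changesets that sit between old and new on the graph."""
    return {r for r in ancestors(graph, new)
            if r != old and old in ancestors(graph, r)}

full_diff = set_difference(PARENTS, "1.2462", "1.2463")
path_only = direct_path(PARENTS, "1.2462", "1.2463")
```

Here path_only holds just the merge changeset, while full_diff also holds
the changesets that the merge pulled in from the other side.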

Then you can look at each change individually and see if they matter.

And once you can do the set operations, you're officially a BK poweruser.
Me, I just have a script, I'm a BK dabbler.

Looking at the list (appended), I don't see anything obvious, but hey, if
it was obvious it wouldn't have been merged in the first place.

Thanks for your willingness to pursue this thing,

Linus

-----
<[email protected]>
[PATCH] sysfs: fix sysfs backing store error path confusion

o sysfs_new_dirent to return 0 if kmalloc fails. Thanks to Milton Miller
for spotting this.

Signed-off-by: Maneesh Soni <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] small sysfs cleanups

The patch below does the following cleanups for the sysfs code:
- remove the unused global function sysfs_mknod
- make some structs and functions static

Please check whether this patch is correct, or whether some of the
things I made static should be used globally in the foreseeable future.


Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] add the physical device and the bus to the hotplug environment

Add the sysfs path of the physical device to the hotplug event of class
and block devices. This should solve the userspace issue not to know if
the device is a virtual one and the "device" symlink will never be created,
but we sit there and wait for it to show up not knowing when we should
give up.

Also the bus name is added to the hotplug event, so we don't need to
reverse lookup in the /sys/bus/* directory which bus our physical
device belongs to. This is e.g. the value matched against the BUS= key,
that may be used in an udev rule.

This is a PCI network card:
ACTION=add
SUBSYSTEM=net
DEVPATH=/class/net/eth0
PHYSDEVPATH=/devices/pci0000:00/0000:00:1e.0/0000:02:01.0
PHYSDEVBUS=pci
INTERFACE=eth0
SEQNUM=827
PATH=/sbin:/bin:/usr/sbin:/usr/bin
HOME=/

This is an IDE CDROM:
ACTION=add
SUBSYSTEM=block
DEVPATH=/block/hdc
PHYSDEVPATH=/devices/pci0000:00/0000:00:1f.1/ide1/1.0
PHYSDEVBUS=ide
SEQNUM=1017
PATH=/sbin:/bin:/usr/sbin:/usr/bin
HOME=/

This is a USB-stick partition:
ACTION=add
SUBSYSTEM=block
DEVPATH=/block/sda/sda1
PHYSDEVPATH=/devices/pci0000:00/0000:00:1d.1/usb3/3-1/3-1:1.0/host1/target1:0:0/1:0:0:0
PHYSDEVBUS=scsi
SEQNUM=1032
PATH=/sbin:/bin:/usr/sbin:/usr/bin
HOME=/


Signed-off-by: Kay Sievers <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] driver-model: comment fix in bus.c

df_01_driver_attach_comment_fix.patch

bus_match() was renamed to driver_probe_device() but the comment for
device_attach() wasn't updated. This patch updates it.


Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] driver-model: bus_rescan_devices() locking fix

df_02_bus_rescan_devcies_fix.patch

bus_rescan_devices() eventually calls device_attach() and thus
requires write locking the corresponding bus. The original code just
called bus_for_each_dev() which only read locks the bus. This patch
separates __bus_for_each_dev() and __bus_for_each_drv(), which don't
do locking themselves, out from the original functions and call them
with read lock in the original functions and with write lock in
bus_rescan_devices().


Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] driver-model: sysfs_release() dangling pointer reference fix

df_03_sysfs_release_fix.patch

Some attributes are allocated dynamically (e.g. module and device
parameters) and are usually deallocated when the associated kobject is
released. So, it's not safe to access attr after putting the kobject.


Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] driver-model: kobject_add() error path reference counting fix

df_04_kobject_add_ref_fix.patch

In kobject_add(), @kobj wasn't put'd properly on error path. This
patch fixes it.


Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] driver-model: device_add() error path reference counting fix

df_05_device_add_ref_fix.patch

In device_add(), @dev wasn't put'd properly when it has zero length
bus_id (error path). Fixed.


Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
kevent: fix build error if CONFIG_KOBJECT_UEVENT is not selected.

Thanks to Serge Hallyn <[email protected]> for pointing this out.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
[PATCH] kobject_uevent: fix init ordering

Looks like kobject_uevent_init is executed before netlink_proto_init and
consequently always fails. Not cool.

Attached patch switches the initialization over from core_initcall (init
level 1) to postcore_initcall (init level 2). Netlink's initialization
is done in core_initcall, so this should fix the problem. We should be
fine waiting until postcore_initcall.

Also a couple white space changes mixed in, because I am anal.

Signed-Off-By: Robert Love <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
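
The ordering argument in the kobject_uevent patch above can be pictured with
a small Python simulation. This is not kernel code: the level numbers follow
the patch description (core_initcall = 1, postcore_initcall = 2), and the
link order within level 1 is an assumption made to reproduce the failure
the patch describes.

```python
# Toy simulation: initcalls run level by level; inside one level the
# order is simply the order in which they were linked (list order here).
CORE, POSTCORE = 1, 2

def netlink_proto_init(state):
    state["netlink_ready"] = True

def kobject_uevent_init(state):
    # Creating the netlink socket can only work once netlink is set up.
    state["uevent_ok"] = state.get("netlink_ready", False)

def run_initcalls(registered):
    """registered: list of (level, func) in link order. Returns final state."""
    state = {}
    # sorted() is stable, so link order is preserved within each level.
    for _level, func in sorted(registered, key=lambda entry: entry[0]):
        func(state)
    return state

# Assumed link order: kobject_uevent happens to come first in level 1.
before_patch = run_initcalls([(CORE, kobject_uevent_init),
                              (CORE, netlink_proto_init)])
after_patch = run_initcalls([(POSTCORE, kobject_uevent_init),
                             (CORE, netlink_proto_init)])
```

With both calls at the same level the result depends on link order; moving
the uevent init one level later makes it independent of that order.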

<[email protected]>
[PATCH] kobject_uevent: add MAINTAINER entry

Attached patch adds a MAINTAINER entry for the kernel event layer.


Signed-Off-By: Robert Love <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
Merge kroah.com:/home/greg/linux/BK/bleed-2.6
into kroah.com:/home/greg/linux/BK/driver-2.6

<[email protected]>
[PATCH] fix kernel BUG at fs/sysfs/dir.c:20!

On Thu, Nov 04, 2004 at 12:52:38PM -0800, Greg KH wrote:
> Hi,
>
> I get the following BUG in the sysfs code when I do:
> - plug in a usb-serial device.
> - open the port with 'cat /dev/ttyUSB0'
> - unplug the device.
> - stop the 'cat' process with control-C
>
> This used to work just fine before your big sysfs changes.

There is a similar problem reported by s390 people where we see the parent
kobject (directory) going away before the child kobject (sub-directory). It
seems the kobject code is able to handle this, but not sysfs. What could
be happening is that in sysfs_remove_dir() of the parent directory, we try
to remove its contents. That works well for regular files, as it is the
final removal of the sysfs_dirent corresponding to each file. But in the
case of a sub-directory we end up doing an extra sysfs_put(): once while
removing the parent, and once more when sysfs_remove_dir() is called for
the child.

The following patch worked for the s390 people, I hope same will work in
this case also.


o Do not remove sysfs_dirents corresponding to the sub-directory in
sysfs_remove_dir(). They will be removed in the sysfs_remove_dir() call
for the specific sub-directory.

Signed-off-by: Maneesh Soni <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

<[email protected]>
Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux

2004-11-08 18:51:34

by Pekka Enberg

Subject: Re: Oops in 2.6.10-rc1

Hi Christian,

On Mon, 08 Nov 2004 14:01:39 +0100, Christian Kujau <[email protected]> wrote:
> i've put everything on http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/
> the .configs, the oopses are there. i've double checked a kernel built
> from "bk -a a1.2000.7.2" yesterday but the result was the same (no oops)

Just to update, I cannot reproduce the oops with your config (nor
mine) on my machine running 2.6.10-rc1-bk14.

Pekka

0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365
[KT133/KM133] (rev 03)
Subsystem: ASUSTeK Computer Inc. A7V133/A7V133-C Mainboard
Flags: bus master, medium devsel, latency 8
Memory at e7000000 (32-bit, prefetchable)
Capabilities: [a0] AGP version 2.0
Capabilities: [c0] Power Management version 2

0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365
[KT133/KM133 AGP] (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, medium devsel, latency 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: d7000000-d7efffff
Prefetchable memory behind bridge: d7f00000-e6ffffff
Expansion ROM at 0000d000 [disabled] [size=4K]
Capabilities: [80] Power Management version 2

0000:00:04.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super
South] (rev 40)
Subsystem: ASUSTeK Computer Inc. A7V133/A7V133-C Mainboard
Flags: bus master, stepping, medium devsel, latency 0
Capabilities: [c0] Power Management version 2

0000:00:04.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
(prog-if 8a [Master SecP PriP])
Flags: bus master, medium devsel, latency 32
I/O ports at b800 [size=16]
Capabilities: [c0] Power Management version 2

0000:00:04.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB
1.1 Controller (rev 16) (prog-if 00 [UHCI])
Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
Flags: bus master, medium devsel, latency 32, IRQ 10
I/O ports at b400 [size=32]
Capabilities: [80] Power Management version 2

0000:00:04.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB
1.1 Controller (rev 16) (prog-if 00 [UHCI])
Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
Flags: bus master, medium devsel, latency 32, IRQ 10
I/O ports at b000 [size=32]
Capabilities: [80] Power Management version 2

0000:00:04.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super
ACPI] (rev 40)
Subsystem: ASUSTeK Computer Inc. A7V133/A7V133-C Mainboard
Flags: medium devsel, IRQ 9
Capabilities: [68] Power Management version 2

0000:00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
Subsystem: Realtek Semiconductor Co., Ltd. RT8139
Flags: bus master, medium devsel, latency 32, IRQ 10
I/O ports at 9400
Memory at d6800000 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2

0000:00:0a.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 04)
Subsystem: Ensoniq Sound Blaster 16PCI 4.1ch
Flags: bus master, slow devsel, latency 32, IRQ 11
I/O ports at 9000
Capabilities: [dc] Power Management version 2

0000:00:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
Subsystem: Realtek Semiconductor Co., Ltd. RT8139
Flags: bus master, medium devsel, latency 32, IRQ 10
I/O ports at 8800
Memory at d6000000 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2

0000:01:00.0 VGA compatible controller: ATI Technologies Inc Radeon
RV100 QY [Radeon 7000/VE] (prog-if 00 [VGA])
Subsystem: Hightech Information System Ltd.: Unknown device 0f02
Flags: bus master, stepping, 66Mhz, medium devsel, latency 64
Memory at d8000000 (32-bit, prefetchable) [size=d7fe0000]
I/O ports at d800 [size=256]
Memory at d7000000 (32-bit, non-prefetchable) [size=64K]
Expansion ROM at 00020000 [disabled]
Capabilities: [58] AGP version 2.0
Capabilities: [50] Power Management version 2



Linux version 2.6.10-rc1-bk14 (root@cherry) (gcc version 3.4.2 (Gentoo
Linux 3.4.2-r2, ssp-3.4.1-1, pie-8.7.6.5)) #8 Mon Nov 8 20:18:45 EET
2004
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003ffec000 (usable)
BIOS-e820: 000000003ffec000 - 000000003ffef000 (ACPI data)
BIOS-e820: 000000003ffef000 - 000000003ffff000 (reserved)
BIOS-e820: 000000003ffff000 - 0000000040000000 (ACPI NVS)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
On node 0 totalpages: 262124
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 225280 pages, LIFO batch:16
HighMem zone: 32748 pages, LIFO batch:7
DMI 2.3 present.
ACPI: RSDP (v000 ASUS ) @ 0x000f6a80
ACPI: RSDT (v001 ASUS A7V133-C 0x30303031 MSFT 0x31313031) @ 0x3ffec000
ACPI: FADT (v001 ASUS A7V133-C 0x30303031 MSFT 0x31313031) @ 0x3ffec080
ACPI: BOOT (v001 ASUS A7V133-C 0x30303031 MSFT 0x31313031) @ 0x3ffec040
ACPI: DSDT (v001 ASUS A7V133-C 0x00001000 MSFT 0x0100000b) @ 0x00000000
ACPI: PM-Timer IO Port: 0xe408
Built 1 zonelists
Kernel command line: root=/dev/ram0 init=/linuxrc real_root=/dev/hda3 acpi=force
No local APIC present or hardware disabled
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 1009.328 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 1034128k/1048496k available (2582k kernel code, 13664k
reserved, 770k data, 148k init, 130992k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay loop... 1998.84 BogoMIPS (lpj=999424)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: After generic identify, caps: 0383f9ff c1c7f9ff 00000000 00000000
CPU: After vendor identify, caps: 0383f9ff c1c7f9ff 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 64K (64 bytes/line)
CPU: After all inits, caps: 0383f9ff c1c7f9ff 00000000 00000020
CPU: AMD Duron(tm) Processor stepping 00
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
ACPI: IRQ9 SCI: Edge set to Level Trigger.
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
Freeing initrd memory: 885k freed
kobject_uevent: unable to create netlink socket!
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xf1180, last bus=1
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040816
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
PCI: Using ACPI for IRQ routing
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
ACPI: PCI interrupt 0000:00:04.2[D] -> GSI 10 (level, low) -> IRQ 10
ACPI: PCI interrupt 0000:00:04.3[D] -> GSI 10 (level, low) -> IRQ 10
ACPI: PCI interrupt 0000:00:09.0[A] -> GSI 10 (level, low) -> IRQ 10
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI interrupt 0000:00:0a.0[A] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI interrupt 0000:00:0d.0[A] -> GSI 10 (level, low) -> IRQ 10
Simple Boot Flag at 0x3a set to 0x1
highmem bounce pool size: 64 pages
devfs: 2004-01-31 Richard Gooch ([email protected])
devfs: boot_options: 0x0
SGI XFS with ACLs, realtime, no debug enabled
SGI XFS Quota Management subsystem
Applying VIA southbridge workaround.
PCI: Disabling Via external APIC routing
Real Time Clock Driver v1.12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
Equalizer2002: Simon Janes ([email protected]) and David S. Miller
([email protected])
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:04.1
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci0000:00:04.1
ide0: BM-DMA at 0xb800-0xb807, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xb808-0xb80f, BIOS settings: hdc:DMA, hdd:pio
Probing IDE interface ide0...
hda: Maxtor 4D060H3, ATA DISK drive
elevator: using anticipatory as default io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: Hewlett-Packard CD-Writer Plus 8200a, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
Probing IDE interface ide2...
ide2: Wait for ready failed before probe !
Probing IDE interface ide3...
ide3: Wait for ready failed before probe !
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
Probing IDE interface ide5...
ide5: Wait for ready failed before probe !
hda: max request size: 128KiB
hda: 120069936 sectors (61475 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
hda: cache flushes not supported
/dev/ide/host0/bus0/target0/lun0: p1 p2 p3
hdc: ATAPI 32X CD-ROM CD-R/RW drive, 4096kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard on isa0060/serio0
input: ImPS/2 Logitech Wheel Mouse on isa0060/serio1
NET: Registered protocol family 2
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET: Registered protocol family 1
NET: Registered protocol family 10
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
ACPI: (supports S0 S1 S4 S5)
ACPI wakeup devices:
PWRB PCI0 UAR1 UAR2 USB0 USB1
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 148k freed
usbcore: registered new driver usbfs
usbcore: registered new driver hub
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
SCSI subsystem initialized
Initializing USB Mass Storage driver...
usbcore: registered new driver usb-storage
USB Mass Storage support registered.
ohci_hcd: 2004 Feb 02 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ReiserFS: hda3: warning: sh-2021: reiserfs_fill_super: can not find
reiserfs on hda3
kjournald starting. Commit interval 5 seconds
EXT3 FS on hda3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 2040244k swap on /dev/hda2. Priority:-1 extents:1
EXT3 FS on hda3, internal journal
8139too Fast Ethernet driver 0.9.27
PCI: Enabling device 0000:00:09.0 (0004 -> 0007)
ACPI: PCI interrupt 0000:00:09.0[A] -> GSI 10 (level, low) -> IRQ 10
eth0: RealTek RTL8139 at 0xf8814000, 00:06:4f:01:66:57, IRQ 10
eth0: Identified 8139 chip type 'RTL-8139C'
PCI: Enabling device 0000:00:0d.0 (0004 -> 0007)
ACPI: PCI interrupt 0000:00:0d.0[A] -> GSI 10 (level, low) -> IRQ 10
eth1: RealTek RTL8139 at 0xf8816000, 00:06:4f:01:66:58, IRQ 10
eth1: Identified 8139 chip type 'RTL-8139C'
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
[drm] Initialized radeon 1.11.0 20020828 on minor 0: ATI Technologies
Inc Radeon RV100 QY [Radeon 7000/VE]
[drm:radeon_cp_init] *ERROR* radeon_cp_init called without lock held
[drm:radeon_unlock] *ERROR* Process 6283 using kernel context 0
inserting floppy driver for 2.6.10-rc1-bk14
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
PCI: Enabling device 0000:00:0a.0 (0004 -> 0005)
ACPI: PCI interrupt 0000:00:0a.0[A] -> GSI 11 (level, low) -> IRQ 11

2004-11-08 19:03:48

by Greg KH

Subject: Re: Oops in 2.6.10-rc1

On Mon, Nov 08, 2004 at 08:44:37PM +0200, Pekka Enberg wrote:
> Hi Christian,
>
> On Mon, 08 Nov 2004 14:01:39 +0100, Christian Kujau <[email protected]> wrote:
> > i've put everything on http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/
> > the .configs, the oopses are there. i've double checked a kernel built
> > from "bk -a a1.2000.7.2" yesterday but the result was the same (no oops)
>
> Just to update, I cannot reproduce the oops with your config (nor
> mine) on my machine running 2.6.10-rc1-bk14.

But 2.6.10-rc1-bk15 does have the problem?

Trying to figure out where the issue is...

greg k-h

2004-11-08 19:20:48

by Pekka Enberg

Subject: Re: Oops in 2.6.10-rc1

Hi,

On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <[email protected]> wrote:
> But 2.6.10-rc1-bk15 does have the problem?
>
> Trying to figure out where the issue is...

No, -bk14 is just the kernel I am running right now (I haven't tried
-bk15) and I haven't had the problem. I cannot reproduce the oops _at
all_ which is why I suspect it's his hardware. I included my lspci and
dmesg output because we have similar (but not exactly the same)
setups.

FWIW, I've asked Christian for an objdump of the kernel to see if I can
track down where it oopses at because I cannot find anything in the
code. I suspected pcibios_enable_irq (which is a function pointer)
might be wrong but looking at his logs, I don't think we get that far.

Pekka

2004-11-08 19:34:28

by Pekka Enberg

Subject: Re: Oops in 2.6.10-rc1

On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <[email protected]> wrote:
> > But 2.6.10-rc1-bk15 does have the problem?
> >
> > Trying to figure out where the issue is...

On Mon, 8 Nov 2004 21:18:09 +0200, Pekka Enberg <[email protected]> wrote:
> No, -bk14 is just the kernel I am running right now (I haven't tried
> -bk15) and I haven't had the problem.

Sorry for not being clear: according to Christian, any kernel after
2.6.10-rc1 oopses, which is why I haven't bothered to test anything
except -bk14.

Pekka

2004-11-08 20:32:49

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pekka Enberg schrieb:
> Hi,
>
> On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <[email protected]> wrote:
>
>>But 2.6.10-rc1-bk15 does have the problem?
>>
>>Trying to figure out where the issue is...

i could use the -bk snapshots too, but since i am using bk myself (i try),
i think we can narrow it down a bit more.

>
> No, -bk14 is just the kernel I am running right now (I haven't tried
> -bk15) and I haven't had the problem. I cannot reproduce the oops _at
> all_ which is why I suspect it's his hardware. I included my lspci and
> dmesg output because we have similar (but not exactly the same)
> setups.

i've put an lspci output here:
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-v.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-vv.txt

i do not suspect hw problems *yet*, because kernels up to 2.6.9 (tracking
bk) do not show this behaviour.

> FWIW, I've asked Christian for an objdump of the kernel to see if I can

will show up in a couple of minutes here:
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/objdump-d_a1.2463.txt.bz2

this is from the vmlinux from a "bk undo -a1.2463" kernel, IOW it still
contains:

[email protected], 2004-11-04 17:07:16-08:00, [email protected]
Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux


thank you for the hints,
Christian.

PS: should i un'CC linux-sound and alsa-devel, now that we are sure it's a
pci thing?
- --
BOFH excuse #228:

That function is not currently supported, but Bill Gates assures us it
will be featured in the next upgrade.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBj9e9+A7rjkF8z0wRAregAJ9TyK5Mt00CFmCcgA1pOKmzvIxv2QCg0OBi
/9eNZ41Kp2GAOg4J5l0QR8E=
=OkFI
-----END PGP SIGNATURE-----

2004-11-08 21:01:26

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds schrieb:
>
> No, just gut feel. If the pre-merge ALSA works, and the post-merge one
> doesn't, and the oops in both cases happen somewhere close to where it
> does "pci_enable_device()", there's not a lot left. There are interrupts,
> and there is the PCI layer...

yes, makes sense.

>>
>>i did "bk undo -a1.2463" from a current -BK tree and it oopses:
>
> Note that "bk undo -axxx" will _leave_ xxx in place, and undo everything
> after.
>
> So what you did still has the merge in the tree, and that it still oopses
> is thus to be expected. BUT, we're getting closer.

yes, i think i understood that. that's why i wanted to revert 1.2463 too.

[...]

>
> Now, that's fine - the USB merge is likely to be ok, so try doing
>
> bk undo -a1.2462

for now i appreciate your work here but i have to postpone the "bk
revtool" stuff because i have no X _and_ bk here. (but i'm a good student
and will do my homework)

> and you will now have a tree that is exactly the same as before, except it
> does _not_ have the PCI merge from Greg.
>
> And if this one does not oops, you can now officially blame Greg.

i can't wait... ;)

> Now, if you want to get _really_ fancy, you can now look at each changeset
> that differed, with something like
>
> bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' -
>
> which is black magic that does a set operation and shows all the changes
> in between the sets of "bk at 1.2462" and "bk at 1.2463".
>
> (This is _not_ the same as "bk changes -r1.2462..1.2463", because that one
> just shows the single merge change that is on the direct _path_ from one
> changeset to another. The black magic thing shows the set difference of
> changesets that comes from the full graph at two points).
>
> Then you can look at each change individually and see if they matter.

will do, after the build

>
> And once you can do the set operations, you're officially a BK poweruser.
> Me, I just have a script, I'm a BK dabbler.
>
> Looking at the list (appended), I don't see anything obvious, but hey, if
> it was obvious it wouldn't have been merged in the first place.
>
> Thanks for your willingness to pursue this thing,

hey, thanks to you and to the folks in the Cc: field for chasing a bug
which only _i_ have encountered so far.

/me is building now....
thanks,
Christian.
- --
BOFH excuse #111:

The salesman drove over the CPU board.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBj94f+A7rjkF8z0wRAm/uAJ0eTBa20JnX+250GpFiSED4b+arQwCggSgo
CO/MQ+1jeOOvb7WaJRKg7uY=
=Qlt1
-----END PGP SIGNATURE-----

2004-11-08 23:49:38

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>Now, that's fine - the USB merge is likely to be ok, so try doing
>>>
>>> bk undo -a1.2462

i did so, 1.2463 went away, building as usual - but the oops persists :(

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-a1.2462.txt

>
> for now i appreciate your work here but i have to postpone the "bk
> revtool" stuff because i have no X _and_ bk here. (but i'm a good student
> and will do my homework)

...in progress...



>>>
>>> bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' -
>>>
>>>which is black magic that does a set operation and shows all the changes
>>>in between the sets of "bk at 1.2462" and "bk at 1.2463".

hm, i guess this has to wait now.

>>>Looking at the list (appended), I don't see anything obvious, but hey, if
>>>it was obvious it wouldn't have been merged in the first place.

yes, i'll look for changes regarding PCI. i've started to compile the -bk
snapshots too. there i can do less wrong things. when i have the "bad" -bk
snapshot i'll use "bk" itself again to find the detailed change leading to
the oops.
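
the snapshot hunt is basically a binary search. a rough sketch in Python,
assuming the snapshots are ordered oldest to newest and that once the oops
appears it stays in every later snapshot (which may not hold if the oops
also depends on the .config). the snapshot names and the "first bad"
snapshot below are invented for the example, not results:

```python
def first_bad(snapshots, oopses):
    """Return the earliest snapshot for which oopses(snapshot) is True.

    Assumes snapshots are ordered oldest to newest and that every
    snapshot after the first bad one is bad as well.
    """
    lo, hi = 0, len(snapshots) - 1
    if not oopses(snapshots[hi]):
        return None  # even the newest snapshot is fine
    while lo < hi:
        mid = (lo + hi) // 2
        if oopses(snapshots[mid]):
            hi = mid      # first bad one is at mid or earlier
        else:
            lo = mid + 1  # first bad one is after mid
    return snapshots[lo]

# Hypothetical list of builds to test, oldest first.
SNAPSHOTS = ["2.6.9"] + [f"2.6.9-bk{i}" for i in range(1, 9)] + ["2.6.10-rc1"]

# Stand-in for "build, boot, load snd_ens1371, check dmesg": here we
# simply pretend that everything from 2.6.9-bk5 on oopses.
PRETEND_FIRST_BAD = SNAPSHOTS.index("2.6.9-bk5")

def pretend_oopses(snapshot):
    return SNAPSHOTS.index(snapshot) >= PRETEND_FIRST_BAD

culprit = first_bad(SNAPSHOTS, pretend_oopses)
```

with ten snapshots that is at most five builds (one for the newest snapshot
plus four halving steps) instead of ten.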

i hope to get another machine with a another es1371 tomorrow and see if
the error is reproduceable.

thanks,
Christian.

PS: i've taken linux-sound and alsa-devel from CC.
- --
BOFH excuse #74:

You're out of memory
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBkAXx+A7rjkF8z0wRAttsAJ9sOI7FVw+Lx8rBYHusHILQvIkeJACfZWDX
zMY4MtVYCCxU3y0Tb/muG5Y=
=CBO/
-----END PGP SIGNATURE-----

2004-11-09 01:12:45

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Tue, 9 Nov 2004, Christian Kujau wrote:
>
> >>>Looking at the list (appended), I don't see anything obvious, but hey, if
> >>>it was obvious it wouldn't have been merged in the first place.
>
> yes, i'll look for changes regarding PCI. i've started to compile the -bk
> snapshots too. there i can do less wrong things. when i have the "bad" -bk
> snapshot i'll use "bk" itself again to find the detailed change leading to
> the oops.

Actually, looking a bit closer, I think the PCI merge we just looked at
was the PCI merge that happened _after_ 2.6.10-rc1. And since 2.6.10-rc1
already oopsed for you, it shouldn't be an issue.

I think the _real_ PCI merge we should have looked at is:

[email protected], 2004-10-19 16:59:19-07:00, [email protected]
Merge PCI updates

and in particular, that merged the PCI changes from

[email protected], 2004-10-19 14:48:04-07:00, [email protected]
PCI: fix up pci_save/restore_state in via-agp due to api change.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

with my pre-PCI-merge tree at:

[email protected], 2004-10-19 15:06:19-07:00, [email protected]
Merge bk://bart.bkbits.net/ide-2.6
into ppc970.osdl.org:/home/torvalds/v2.6/linux

(all of these revision numbers are relative to a pristine 2.6.10-rc1
tree: remember that they change with merges, so they may not be the same
in your tree. "bk changes -a" is your friend).

So what I'd like you to do is to take the pre-PCI-merge tree, and see if
that works for you

# assuming a 2.6.10-rc1 tree
bk undo -a1.2000.1.6

and if that works, then try the post-PCI-merge tree:

# assuming a 2.6.10-rc1 tree
bk undo -a1.2000.1.7

(I just checked: the above numbers are actually valid even in the current
-bk tree, so you don't have to first go to 2.6.10-rc1, you can just start
from a current tree)

Thanks for testing, and sorry for the confusion with the more recent PCI
merge.

Linus

2004-11-09 01:33:25

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

ok, i've done some other things here and built kernels from
2.6.10-rc1-bk13 and all were giving the oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1-bk13
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-2.6.10-rc1-bk13.txt

the config is the same config i am usually using, never gave me a
headache, new options (due to new kernel version) were left to default in
most cases. anyway - i've pulled again a recent tree, did
"bk undo -a1.2463" again but this time i stripped down my .config (via
menuconfig) to the absolute necessary things:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_a1.2463_take2

...and it did *NOT* oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops-2.6.10-rc1_a1.2463.txt

i'll investigate further, building former -bk snapshots, using other
configs before i'll fiddle around with bk again (to get the smaller
changes). but this is a tomorrow thing, real life calls in :(

Thank you all so far,
Christian.
- --
BOFH excuse #92:

Stale file handle (next time use Tupperware(tm)!)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBkB3v+A7rjkF8z0wRAjU/AKCGPnfuJiBzamcRwU9hIiH+GXZNSwCgi2YK
kwN9O4z/1MzWEakWX0p6IGo=
=d8GA
-----END PGP SIGNATURE-----

2004-11-09 01:48:16

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds schrieb:
>
> So what I'd like you to do is to take the pre-PCI-merge tree, and see if
> that works for you
>
> # assuming a 2.6.10-rc1 tree
> bk undo -a1.2000.1.6
>
> and if that works, then try the post-PCI-merge tree:
>
> # assuming a 2.6.10-rc1 tree
> bk undo -a1.2000.1.7
>
> (I just checked: the above numbers are actually valid even in the current
> -bk tree, so you don't have to first go to 2.6.10-rc1, you can just start
> from a current tree)

thanks, Linus. i'll do all this tomorrow, see my other mail i just sent.
i'll definitely do all this 'cause i'm really curious about this thing.
(it's not even the need of sound any more. heck, i could just put in
another soundcard but that'd be too easy :)

>
> Thanks for testing, and sorry for the confusion with the more recent PCI
> merge.

doh, you can't imagine how thankful i am for your (and the other people's!)
help here. but don't waste too many cycles on this weird issue here. if it
does not break for a million users out there now - why bother at all?
perhaps it'll break later on but then we have the lkml-archives and
someone will eventually remember this thing. but no, i don't want to
discourage anyone here ;-)

regards,
Christian.
--
BOFH excuse #19:

floating point processor overflow

2004-11-09 07:40:53

by Pekka Enberg

Subject: Re: Oops in 2.6.10-rc1

Hi,

On Tue, 09 Nov 2004 02:31:28 +0100, Christian Kujau <[email protected]> wrote:
> the config is the same config i am usually using, never gave me a
> headache, new options (due to new kernel version) were left to default in
> most cases. anyway - i've pulled again a recent tree, did
> "bk undo -a1.2463" again but this time i stripped down my .config (via
> menuconfig) to the absolute necessary things:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_a1.2463_take2
>
> ...and it did *NOT* oops:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops-2.6.10-rc1_a1.2463.txt
>
> i'll investigate further, building former -bk snapshots, using other
> configs before i'll fiddle around with bk again (to get the smaller
> changes). but this is a tomorrow thing, real life calls in :(

CONFIG_PREEMPT is one obvious candidate (you have that enabled in the
original config and disabled in the non-oopsing one).

Pekka

2004-11-09 12:33:33

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

this damn thread is far too long already...


Pekka Enberg wrote:
> CONFIG_PREEMPT is one obvious candidate (you have that enabled in the
> original config and disabled in the non-oopsing one).

i've disabled *only* CONFIG_PREEMPT in another .config but it still oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-2.6.10-rc1_no-preempt.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_no-preempt.txt

2.6.9 with preempt enabled does not oops:
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.9_preempt.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops_2.6.9_preempt.txt

i was a fool to test further -bk snapshots but it was kinda late yesterday
and i was confused:

patch-2.6.9.bz2 -> 19-Oct-2004
patch-2.6.10-rc1.bz2 -> 23-Oct-2004 00:12
patch-2.6.10-rc1-bk1.bz2 -> 23-Oct-2004 13:34

2.6.9 is not oopsing *here*, plain 2.6.10-rc1 is oopsing. so i can *not*
use -bk snapshots any more and i will go on with BK (undo the ChangeSets
Linus told me about) and use different .configs now. sorry for the
confusion and especially sorry to my bk mentor: we seem to be so close to
the right ChangeSet and then i started to use *snapshots* again.

Thanks,
Christian
--
BOFH excuse #76:

Unoptimized hard drive

2004-11-09 17:26:40

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Tue, 09 Nov 2004 13:33:20 +0100, Christian Kujau wrote
> i've disabled *only* CONFIG_PREEMPT in another .config but it
> still oopses:

at least i finally found the "bad" .config option: it's CONFIG_EDD.
when i disable this option (and only this option; i can use the same
.config as usual, only disabling this very option. diff is my witness.)
i can boot a current (!) 2.6.10-rc1-bk and get a working snd-ens1371!

i'll test with CONFIG_EDD=m later on. here a short summary:

2.6.9          CONFIG_EDD=y  -  OK
2.6.10-rc1-bk  CONFIG_EDD=y  -  OOPS!
2.6.10-rc1-bk  CONFIG_EDD=n  -  OK
2.6.10-rc1-bk  CONFIG_EDD=m  -  ??

yes, i'll continue to find out the ChangeSet but now i (and perhaps you
too, if you are as curious as me) will know where to look at.
i must admit that i was not entirely sure why i wanted to enable
CONFIG_EDD at all. if i had never enabled it, it'd have saved me a week
of bug chasing, but learning is fun, too.

thanks,
Christian.
--
BOFH excuse #209:

Only people with names beginning with 'A' are getting mail this week (a
la Microsoft)

2004-11-09 18:54:22

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1 (almost solved)



On Tue, 9 Nov 2004, Christian Kujau wrote:
>
> at least i finally found the "bad" .config option: it's CONFIG_EDD.
> when i disable this option (and only this option; i can use the same
> .config as usual, only disabling this very option. diff is my witness.)
> i can boot a current (!) 2.6.10-rc1-bk and get a working snd-ens1371!

Very strange. There's not a lot of stuff that affects EDD directly that I
can see, but there is:

[email protected], 2004-10-20 08:36:22-07:00, [email protected]
[PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

Some controller BIOSes have problems with the legacy int13 fn02 READ
SECTORS command. int13 fn42 EXTENDED READ is used in preference by most
boot loaders today, so lets use that. If EXTENDED READ fails or isn't
supported, fall back to READ SECTORS.

This hopefully resolves the three reports of BIOSes which would either
long-pause (30+ seconds) or hang completely on the legacy READ SECTORS
command.

This also adds CONFIG_EDD_SKIP_MBR to eliminate reading the MBR on each
BIOS-presented disk, in case there are further problems in this area.

Signed-off-by: Matt Domsch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

which might fit the bill.
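
A rough illustration of the try-then-fall-back order the changelog above describes. This is a plain C sketch, not the real-mode code the kernel uses for this; the callback type, the `sketch_` names and the two stub readers are made up for the example, and only the fn42/fn02 numbers come from the changelog.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FN_EXTENDED_READ 0x42  /* int13 fn42, per the changelog */
#define SKETCH_FN_READ_SECTORS  0x02  /* int13 fn02, per the changelog */

/* Hypothetical callback standing in for one BIOS call; returns 0 on success. */
typedef int (*sketch_bios_read)(unsigned char drive, unsigned char *sector);

/*
 * Try EXTENDED READ first and fall back to READ SECTORS if it fails or is
 * not available (NULL). Returns the function number that succeeded, or -1.
 */
int sketch_read_mbr(unsigned char drive, unsigned char *sector,
                    sketch_bios_read extended, sketch_bios_read legacy)
{
    assert(sector != NULL);

    if (extended != NULL && extended(drive, sector) == 0)
        return SKETCH_FN_EXTENDED_READ;
    if (legacy != NULL && legacy(drive, sector) == 0)
        return SKETCH_FN_READ_SECTORS;
    return -1;
}

/* Two trivial stand-ins used to exercise the fallback. */
int sketch_read_ok(unsigned char drive, unsigned char *sector)
{
    (void)drive;
    sector[0] = 0x55;
    return 0;
}

int sketch_read_fail(unsigned char drive, unsigned char *sector)
{
    (void)drive;
    (void)sector;
    return 1;
}
```

Calling `sketch_read_mbr(0x80, buf, sketch_read_fail, sketch_read_ok)` takes the fallback branch and reports 0x02. Note that nothing in this sketch counts devices; the miscounting discussed later in the thread would live in whatever loop drives calls like this one.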

However, even that would just change the EDD _data_, it doesn't change the
code that actually runs in the kernel. And I _really_ don't see what EDD
has got to do with anything.

I wonder if the EDD stuff corrupts the sysfs tree or something, and you're
just seeing some strange kobject interference. Greg, you'd likely still be
on the line for that one.

Christian, finding which change triggers this would be very good indeed. I
think the merge with greg is still a good place to start, although even
just doing the snapshot trees (from _before_ -rc1: ie the patches in
/pub/linux/kernel/v2.6/snapshots/old: patch-2.6.9-bk*.gz) is actually also
a good way to narrow things down.

Linus

2004-11-09 19:04:58

by Greg KH

Subject: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

This fixes a problem introduced in the previous set of driver model
changes that has been seen by a lot of people (most notably the greater
than 256 pty users, but others might also be hitting this without
realizing it.)

Also add a comment so we don't try to "fix" this again.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

--- a/lib/kobject.c 2004-11-05 10:06:33 -08:00
+++ b/lib/kobject.c 2004-11-08 23:58:02 -08:00
@@ -181,10 +181,10 @@ int kobject_add(struct kobject * kobj)
 
 	error = create_dir(kobj);
 	if (error) {
+		/* unlink does the kobject_put() for us */
 		unlink(kobj);
 		if (parent)
 			kobject_put(parent);
-		kobject_put(kobj);
 	} else {
 		kobject_hotplug(kobj, KOBJ_ADD);
 	}
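
For readers unfamiliar with the failure mode, here is a small user-space sketch of why the extra put matters. It is not kernel code: `struct sketch_obj` and every `sketch_` function are hypothetical stand-ins, and the assumption that the add takes one reference up front (which unlink then drops) is inferred from the patch comment rather than shown in the hunk.

```c
#include <assert.h>

/* Hypothetical stand-in for a reference-counted object such as a kobject. */
struct sketch_obj {
    int refcount;
    int released;   /* set once the last reference is dropped */
};

static void sketch_get(struct sketch_obj *o)
{
    o->refcount++;
}

static void sketch_put(struct sketch_obj *o)
{
    assert(o->refcount > 0);    /* dropping a reference nobody holds is a bug */
    if (--o->refcount == 0)
        o->released = 1;
}

/* Stand-in for unlink(): per the patch comment, it already does the put. */
static void sketch_unlink(struct sketch_obj *o)
{
    sketch_put(o);
}

/*
 * Simulates an add that takes a reference and then fails.
 * extra_put = 1 reproduces the old error path, 0 the patched one.
 * Returns the reference count left afterwards.
 */
int sketch_failed_add(struct sketch_obj *o, int extra_put)
{
    sketch_get(o);          /* reference taken at the start of the add */
    sketch_unlink(o);       /* error path: unlink drops that reference */
    if (extra_put)
        sketch_put(o);      /* the put that the patch removes */
    return o->refcount;
}
```

Starting from an object whose caller holds one reference, the patched path leaves the count at 1, while the old path drops it to 0 and releases the object even though the caller still thinks it owns it.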

2004-11-09 19:08:48

by Greg KH

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

On Tue, Nov 09, 2004 at 11:04:21AM -0800, Greg KH wrote:
> This fixes a problem introduced in the previous set of driver model
> changes that has been seen by a lot of people (most notably the greater
> than 256 pty users, but others might also be hitting this without
> realizing it.)
>
> Also add a comment so we don't try to "fix" this again.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

Christian, I don't know if this patch explicitly fixes your problem, but
it fixes problems other people have been having with the driver core
lately. I'd appreciate it if you could test it out and let me know if
it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
at all.

thanks,

greg k-h

2004-11-09 19:09:42

by Linus Torvalds

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()



On Tue, 9 Nov 2004, Greg KH wrote:
>
> This fixes a problem introduced in the previous set of driver model
> changes that has been seen by a lot of people (most notably the greater
> than 256 pty users, but others might also be hitting this without
> realizing it.)

Ahh.. Christian, pls test this one.

Linus

2004-11-09 20:19:40

by Pekka Enberg

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

Hi Greg,

On Tue, 9 Nov 2004 11:08:09 -0800, Greg KH <[email protected]> wrote:
> Christian, I don't know if this patch explicitly fixes your problem, but
> it fixes problems other people have been having with the driver core
> lately. I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.

The broken kobject_add fix is not in -rc1 proper which oopses on
Christian's machine. I don't think this patch has anything to do with
his problem.

Pekka

2004-11-09 21:21:46

by Christian Kujau

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

Greg KH wrote:
>
> Christian, I don't know if this patch explicitly fixes your problem, but
> it fixes problems other people have been having with the driver core
> lately. I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.
>

yes, i'll do so and test the patch. is this in current -BK yet? because
applying your patch [1] to 2.6.10-rc1 gives:

Hunk #1 FAILED at 181.
1 out of 1 hunk FAILED -- saving rejects to file lib/kobject.c.rej

i've done a few other things before, let me just post the results before i
go on with your suggestions:

i've compiled a recent (BK) 2.6.10-rc1 again with CONFIG_EDD=m|y|n

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_edd-modular.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_edd.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_no-edd.txt

the results:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-modular.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_no-edd.txt

the interesting thing (for me) was, that when CONFIG_EDD=m was set, my
sound card was working properly and i could do "modprobe edd" and "rmmod
edd" as i like:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/typescript-2.6.10-rc1_edd-modular.txt

again: i double checked and compiled on 2 different hosts, each having
its own -BK tree.

thanks,
Christian.

[1] http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/edd-fix.patch
--
BOFH excuse #22:

monitor resolution too high

2004-11-09 21:31:46

by Christian Kujau

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

Greg KH wrote:
> lately. I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.

please ignore my first mail (the part about not being able to patch), it's
already in BK i can see now, sorry.

compiling now...

--
BOFH excuse #22:

monitor resolution too high

2004-11-09 22:07:03

by Christian Kujau

Subject: Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

i'm sorry to say that it did not help:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd__kobject_put.txt

i'll go on and try to exclude

[email protected], 2004-10-20 08:36:22-07:00, [email protected]
[PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

(or just test /pub/linux/kernel/v2.6/snapshots/old/patch-2.6.9-bk*.gz ...)

thanks,
Christian.
--
BOFH excuse #200:

The monitor needs another box of pixels.

2004-11-09 23:34:44

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

Linus Torvalds wrote:
>
> Very strange. There's not a lot of stuff that affects EDD directly that I
> can see, but there is:
>
> [email protected], 2004-10-20 08:36:22-07:00, [email protected]
> [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

and i say: good catch! that does it!

i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
an oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt

excluding this single ChangeSet with "bk undo -r1.2118" does work with
CONFIG_EDD=y:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

(the filename here should really read "...r1.2118.txt" because that was
the number of the changeset representing the above [PATCH] *after* i did
"bk undo -a1.2000.5.108". right?)

> However, even that would just change the EDD _data_, it doesn't change the
> code that actually runs in the kernel. And I _really_ don't see what EDD
> has got to do with anything.

understanding a lot less of all this than you guys i also wonder why only
this single driver broke. i've always loaded a couple of drivers here,
maybe i could play around a bit e.g. CONFIG_SND_ENS1371=y instead of =m or
see if other hw drivers break too.

> I wonder if the EDD stuff corrupts the sysfs tree or something, and you're
> just seeing some strange kobject interference.

do userspace tools matter here? there is "sysfsutils-1.1.0-1" and
"libsysfs1-1.1.0-1" (both debian/unstable) installed here, /sys is mounted:

sysfs on /sys type sysfs (rw)

> Christian, finding which change triggers this would be very good indeed. I
> think the merge with greg is still a good place to start, although even

i'll look again over the -bk magic you told me about and see what it gives.

thanks so far to all involved here, i really enjoyed "working" with you.
first class support at no charge...it's just incredible.

you guys rock,
Christian.
--
BOFH excuse #112:

The monitor is plugged into the serial port

2004-11-09 23:45:03

by Matt Domsch

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Wed, Nov 10, 2004 at 12:30:21AM +0100, Christian Kujau wrote:
> > [email protected], 2004-10-20 08:36:22-07:00, [email protected]
> > [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR
>
> and i say: good catch! that does it!
>
> i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
> an oops:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt
>
> excluding this single ChangeSet with "bk undo -r1.2118" does work with
> CONFIG_EDD=y:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

OK, thanks, that helps. From the diff of those dmesg:

-BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
+BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

So with the latest EDD patch noted above, it's finding more disks than
before. How many disks do you actually have in the system?

I'll review the assembly again to see where I could have miscounted,
and see how that may affect the EDD sysfs exports. Likely no answer
from me before tomorrow though.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-11-10 00:16:47

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1

Linus Torvalds wrote:
>
> Now, if you want to get _really_ fancy, you can now look at each changeset
> that differed, with something like
>
> bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' -
>
> which is black magic that does a set operation and shows all the changes
> in between the sets of "bk at 1.2462" and "bk at 1.2463".
>
> (This is _not_ the same as "bk changes -r1.2462..1.2463", because that one
> just shows the single merge change that is on the direct _path_ from one
> changeset to another. The black magic thing shows the set difference of
> changesets that comes from the full graph at two points).

hm, i still fail to see the "magic" part here. from a current tree i get:

---------------
$ bk set -n -d -r1.2000.5.107 -r1.2000.5.108 | bk -R prs -h \
    -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' - | head -n5
<[email protected]>
[PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

Some controller BIOSes have problems with the legacy int13 fn02 READ
SECTORS command. int13 fn42 EXTENDED READ is used in preference by most
---------------

which looks similar to the next one, but with "bk changes" i get the
ChangeSet number again:

---------------
$ bk changes -r1.2000.5.108 | head -n5
[email protected], 2004-10-20 08:36:22-07:00, [email protected]
[PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

Some controller BIOSes have problems with the legacy int13 fn02 READ
SECTORS command. int13 fn42 EXTENDED READ is used in preference by most
---------------

...or was i supposed to alter your cmdline? i just copy'n'pasted it...
anyway, i've seen that i have a lot of "bk help" ahead of me, thanks for
the course, though ;)

greetings,
Christian.
--
BOFH excuse #297:

Too many interrupts

2004-11-10 00:21:31

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

Matt Domsch wrote:
>
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
>
> So with the latest EDD patch noted above, it's finding more disks than
> before. How many disks do you actually have in the system?

i have one scsi disk (sda) and two atapi cdrom drives:

hda: CRD-8483B, ATAPI CD/DVD-ROM drive
hdb: AOPEN CD-RW CRW3248 1.17 20020620, ATAPI CD/DVD-ROM drive
...
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
SCSI device sda: drive cache: write back

the "scsi0 : sym-2.1.18k" is on a pci card, the atapi devices are
connected onboard. if it helps:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-v.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-vv.txt

> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports. Likely no answer
> from me before tomorrow though.

that's ok, real life kicks in here too...

thanks,
Christian.

PS: do you have *any* idea how this could be related to the snd-ens1371
driver (which is producing the oops then)?
--
BOFH excuse #449:

greenpeace free'd the mallocs

2004-11-10 00:23:25

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1



On Wed, 10 Nov 2004, Christian Kujau wrote:
> >
> > bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P:@:HOST:>\n$each(:C:){\t(:C:)\n}\n' -
> >
> > which is black magic that does a set operation and shows all the changes
> > in between the sets of "bk at 1.2462" and "bk at 1.2463".
>
> hm, i still fail to see the "magic" part here. from a current tree i get:

You don't see any magic, unless there are merges involved. And you've
already narrowed the thing down to a single non-merge changeset, at which
point the "magic" way is just a very slow way of doing the same thing.

The magic hits you only when you have non-trivial merges, in which case
the set operation shows you more than the "just walk from one top-of-tree
to the other".
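
As a language-neutral picture of the "set operation" (this says nothing about how bk implements it), think of each tree state as the full set of changesets it contains and subtract one set from the other. The sketch below does that for two sorted arrays of integer IDs; the IDs and the `sketch_set_diff` name are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Writes to out[] every ID that is in after[] but not in before[].
 * Both inputs must be sorted ascending without duplicates, and out[]
 * must have room for nb entries. Returns the number of IDs written.
 */
size_t sketch_set_diff(const int *before, size_t na,
                       const int *after, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);

    while (j < nb) {
        if (i < na && before[i] < after[j]) {
            i++;                        /* only in "before": skip it */
        } else if (i < na && before[i] == after[j]) {
            i++;                        /* in both: not part of the difference */
            j++;
        } else {
            out[n++] = after[j++];      /* only in "after": report it */
        }
    }
    return n;
}
```

With a single ordinary changeset the difference has one element, which is why the two commands looked the same in the test above. After a merge, the difference also contains everything the merge pulled in, not just the merge changeset itself.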

Linus

2004-11-10 01:01:35

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1 (almost solved)



On Wed, 10 Nov 2004, Christian Kujau wrote:

> Matt Domsch wrote:
> >
> > -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> > +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> >
> > So with the latest EDD patch noted above, it's finding more disks than
> > before. How many disks do you actually have in the system?
>
> i have one scsi disk (sda) and two atapi cdrom drives:

Interestingly, "16" is also EDD_MBR_SIG_MAX, so my suspicion is that it
overflowed some EDD data area. edd_num_devices() (which is what reports
the above number) does

	min_t(unsigned char,
	      max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
	      max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));

where EDDMAXNR is 6, and EDD_MBR_SIG_MAX is the afore-mentioned 16, so we
know that either edd.edd_info_nr or edd.mbr_signature_nr is actually
_bigger_ than 16.

Which is clearly totally bogus. In fact, even your old "6 devices found"
thing looks suspiciously bogus.
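
The clamp can be tried out in user space with a small sketch. The two constants use the values quoted above (6 and 16); `sketch_num_devices` and the `min_uc`/`max_uc` helpers are hypothetical replacements for the kernel's `min_t`/`max_t` macros, so treat this as an illustration of the arithmetic only.

```c
#include <assert.h>

/* Values quoted in the thread; everything named sketch_* is hypothetical. */
#define EDDMAXNR        6
#define EDD_MBR_SIG_MAX 16

static unsigned char max_uc(unsigned char a, unsigned char b)
{
    return a > b ? a : b;
}

static unsigned char min_uc(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* The larger of the two counters, capped at the larger of the two limits. */
unsigned char sketch_num_devices(unsigned char edd_info_nr,
                                 unsigned char mbr_signature_nr)
{
    return min_uc(max_uc(edd_info_nr, mbr_signature_nr),
                  max_uc(EDD_MBR_SIG_MAX, EDDMAXNR));
}

/* A result equal to the cap only says that some input was at least 16. */
void sketch_self_check(void)
{
    assert(sketch_num_devices(3, 1) == 3);
    assert(sketch_num_devices(16, 2) == 16);
    assert(sketch_num_devices(200, 0) == 16);
}
```

Because the cap hides the real value, "16 devices found" is consistent with any raw counter of 16 or more, which is what makes an overflowed or garbage counter a plausible explanation on a machine with one disk.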

> PS: do you have *any* idea how this could be related to the snd-ens1371
> driver (which is producing the oops then)?

I bet it's overwriting some array, and just corrupting memory after it.
For example, the edd_info[] array only has 6 entries, and for example, the
EDD_MBR_SIG_BUFFER is quite close to where we save the E820MAP memory map
at bootup, so if something stomps on that, the kernel might be confused
about where PCI memory can be allocated or similar. Or it might have
overwritten some ACPI memory data, who knows.
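
A minimal sketch of the kind of guard that keeps a bogus count from walking past a fixed-size table. The 6-entry size is the EDDMAXNR value mentioned above; the struct layout, the `neighbour` field and the `sketch_store` helper are invented for the example and do not reflect the real boot-time data layout.

```c
#include <assert.h>
#include <stddef.h>

#define EDDMAXNR 6

/* Hypothetical, heavily simplified per-device record. */
struct sketch_edd_info {
    unsigned char device;
};

/* Hypothetical layout: the table followed by unrelated data. */
struct sketch_boot_data {
    struct sketch_edd_info edd_info[EDDMAXNR];
    unsigned char neighbour[4];   /* stand-in for whatever lives after the table */
};

/*
 * Stores one record only if the index fits in the table.
 * Returns 1 on success and 0 if the index was rejected.
 */
int sketch_store(struct sketch_boot_data *bd, size_t idx, unsigned char device)
{
    assert(bd != NULL);

    if (idx >= EDDMAXNR)
        return 0;               /* without this check, neighbour[] gets stomped */
    bd->edd_info[idx].device = device;
    return 1;
}
```

Index 5 is accepted and index 6 or 16 is rejected, leaving `neighbour` untouched. Without such a check, a count of 16 against a 6-entry table would overwrite whatever happens to follow it, which matches the "corrupting memory after it" theory.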

Linus

2004-11-11 22:48:18

by Matt Domsch

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Tue, Nov 09, 2004 at 05:40:54PM -0600, Matt Domsch wrote:
> OK, thanks, that helps. From the diff of those dmesg:
>
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

As Linus points out, those are the magic numbers in EDD for number of
device entries stored. Your BIOS seems to be reporting that it has
more devices than it does, or the EDD assembly is horked in a way I
have not yet deciphered.

> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports. Likely no answer
> from me before tomorrow though.

I haven't been able to find a solution to your problem yet, and given
some external time constraints I've got, won't be able to look into
this again for another week or more.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-11-11 22:58:50

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1 (almost solved)



On Thu, 11 Nov 2004, Matt Domsch wrote:
>
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

Matt, I'll revert the EXTENDED READ change for now, then. The random
behaviour of the problem it causes makes me really dislike this bug, and
I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
having known random stuff happen really disturbs me.

We can re-do it once it's more obvious why it broke..

Linus

2004-11-12 00:07:11

by Matt Domsch

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Thu, Nov 11, 2004 at 02:53:15PM -0800, Linus Torvalds wrote:
> Matt, I'll revert the EXTENDED READ change for now, then. The random
> behaviour of the problem it causes makes me really dislike this bug, and
> I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
> having known random stuff happen really disturbs me.
>
> We can re-do it once it's more obvious why it broke..

Good plan, thanks.

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-11-12 00:31:42

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

Matt Domsch wrote:
>
> As Linus points out, those are the magic numbers in EDD for number of
> device entries stored. Your BIOS seems to be reporting that it has
> more devices than it does, or the EDD assembly is horked in a way I
> have not yet deciphered.

actually, my BIOS is even too old for e.g. ACPI, with latest firmware
installed. i had no issues so far with the board/bios, but perhaps this is
no longer true. however, it's still strange that this thing is only
triggered with your change and CONFIG_EDD=y.

>
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

nevermind then. as nobody else seems to be bothered by this i am happy with
the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
get back to it when it's bothering more people (n>1)

thank you for your time,
Christian.
--
BOFH excuse #396:

Mail server hit by UniSpammer.

2004-11-12 00:56:04

by Linus Torvalds

Subject: Re: Oops in 2.6.10-rc1 (almost solved)



On Fri, 12 Nov 2004, Christian Kujau wrote:
>
> nevermind then. as nobody else seems to be bothered by this i am happy with
> the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
> get back to it when it's bothering more people (n>1)

The problem with that approach is that very few people are willing to
spend the time and effort to really try to figure out where the problem
triggers for them. Thanks again for testing lots of kernels, and different
configurations.

Basically, if it's a problem that only happens for a smallish percentage
of people, and an even smaller percentage of those is willing to dig down
and find it, it's not a problem we can afford to ignore. Ignoring it just
means that there will be "a few" error reports that we will either waste
time on, or (even worse) we'll dismiss as "known problems" and then
possibly miss _another_ bug.

This is why I take random unexplained (but pinpointed) problems so
seriously. If it wasn't as apparently random, we could file it under
"known problem" and decide to try to fix it later. As it is, it's filed
under "known cause", but since we don't know _why_, it might cause totally
different problems on another machine, and that just makes it too painful
for words.

So the changeset is reverted for now in the current -bk tree, and I'll
make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

Linus

2004-11-12 01:27:26

by Christian Kujau

Subject: Re: Oops in 2.6.10-rc1 (almost solved)

Linus Torvalds wrote:
>
> This is why I take random unexplained (but pinpointed) problems so
> seriously. If it wasn't as apparently random, we could file it under
> "known problem" and decide to try to fix it later. As it is, it's filed
> under "known cause", but since we don't know _why_, it might cause totally
> different problems on another machine, and that just makes it too painful
> for words.

just after sending my last mail i too (re)thought about this and i'd have
begged Matt to revert the patch if it was not *only* me having this issue.

but i can see your point here and i appreciate your decision.

> So the changeset is reverted for now in the current -bk tree, and I'll
> make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

yay!

thanks,
Christian.
--
BOFH excuse #96:

Vendor no longer supports the product