2004-11-13 03:50:38

by Chuck Ebbert

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <[email protected]> wrote:

> > PS: do you have *any* idea how this could be related to the snd-es1371
> > driver (which is producing the oops then)?
>
> I bet it's overwriting some array, and just corrupting memory after it.
> For example, the edd_info[] array only has 6 entries,

That's almost certainly the problem. There can be up to 16 EDD devices
as of the Jun 30 update to the EDD code.

And sound_class is the next item after edd_info[] in my System.map...


--Chuck Ebbert 12-Nov-04 22:21:27


2004-11-13 14:28:50

by Matt Domsch

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Fri, Nov 12, 2004 at 10:45:12PM -0500, Chuck Ebbert wrote:
> On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <[email protected]> wrote:
>
> > > PS: do you have *any* idea how this could be related to the snd-es1371
> > > driver (which is producing the oops then)?
> >
> > I bet it's overwriting some array, and just corrupting memory after it.
> > For example, the edd_info[] array only has 6 entries,
>
> That's almost certainly the problem. There can be up to 16 EDD devices
> as of the Jun 30 update to the EDD code.

Bingo... edd_devices[] was too short. When we keep more
than 6 signatures, it overruns the end. Also, I rewrote
edd_num_devices to be clearer about its goal.

This patch is necessary even after the last edd.S patch was reverted.

It still doesn't explain why Christian's BIOS reports more devices
than he has, that's still UI, so don't re-apply the edd.S patch just reverted.

Signed-off-by: Matt Domsch

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== drivers/firmware/edd.c 1.30 vs edited =====
--- 1.30/drivers/firmware/edd.c 2004-06-29 09:44:48 -05:00
+++ edited/drivers/firmware/edd.c 2004-11-13 07:56:00 -06:00
@@ -70,7 +70,7 @@
static int edd_dev_is_type(struct edd_device *edev, const char *type);
static struct pci_dev *edd_get_pci_dev(struct edd_device *edev);

-static struct edd_device *edd_devices[EDDMAXNR];
+static struct edd_device *edd_devices[EDD_MBR_SIG_MAX];

#define EDD_DEVICE_ATTR(_name,_mode,_show,_test) \
struct edd_attribute edd_attr_##_name = { \
@@ -728,9 +728,9 @@

static inline int edd_num_devices(void)
{
- return min_t(unsigned char,
- max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
- max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));
+ return max_t(unsigned char,
+ min_t(unsigned char, EDD_MBR_SIG_MAX, edd.mbr_signature_nr),
+ min_t(unsigned char, EDDMAXNR, edd.edd_info_nr));
}

/**

2004-11-13 18:56:20

by Matt Domsch

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Sat, Nov 13, 2004 at 08:28:35AM -0600, Matt Domsch wrote:
> On Fri, Nov 12, 2004 at 10:45:12PM -0500, Chuck Ebbert wrote:
> > On Tue, 9 Nov 2004 at 17:01:10 -0800 Linus Torvalds <[email protected]> wrote:
> >
> > > > PS: do you have *any* idea how this could be related to the snd-es1371
> > > > driver (which is producing the oops then)?
> > >
> > > I bet it's overwriting some array, and just corrupting memory after it.
> > > For example, the edd_info[] array only has 6 entries,
> >
> > That's almost certainly the problem. There can be up to 16 EDD devices
> > as of the Jun 30 update to the EDD code.
>
> Bingo... edd_devices[] was too short. When we keep more
> than 6 signatures, it overruns the end.

In particular, depending on your .config, with EDD=y it overwrites 40
bytes past the end of edd_devices (here I've already extended it by
the necessary amount, but the 40 bytes past its end are all subject to
be overwritten):
c043a880 b edd_devices
c043a8c0 b pci_bios_present
c043a8c4 B pci_mmcfg_base_addr
c043a8c8 b mmcfg_last_accessed_device
c043a8cc b called.0
c043a8d0 B pcibios_enable_irq
c043a8d4 b eisa_irq_mask.0
c043a8d8 b broken_hp_bios_irq9
c043a8dc b acer_tm360_irqrouting
c043a8e0 b pirq_table
c043a8e4 b pirq_router

hence the failure Christian saw and attributed to the sound drivers:

EIP is at 0xc15d5820
eax: 00000000 ebx: dff20400 ecx: c15d5820 edx: dff205c4
esi: ffffffed edi: dff20400 ebp: dff20400 esp: c17a3e58
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 178, threadinfo=c17a2000 task=dfcf05a0)
Stack: c01fa5c8 dff20400 000007ff dff20400 c01fa5ff dff20400 000007ff
c15ea400
e082729d dff20400 c15ea400 00000000 e08469df c15ea400 000001f8
000000d0
000000d0 df45ed14 00000000 c018e14e c15ea400 ffffffed dff20400
dff20400
Call Trace:
[<c01fa5c8>] pci_enable_device_bars+0x28/0x40
[<c01fa5ff>] pci_enable_device+0x1f/0x40
[<e082729d>] snd_ensoniq_create+0x1d/0x480 [snd_ens1371]
[<e08469df>] snd_card_new+0x1cf/0x2c0 [snd]
[<c018e14e>] sysfs_new_dirent+0x2e/0x90
[<e0827867>] snd_audiopci_probe+0x87/0x1e0 [snd_ens1371]
[<c01fb012>] pci_device_probe_static+0x52/0x70
[<c01fb05c>] __pci_device_probe+0x2c/0x30
[<c01fb08c>] pci_device_probe+0x2c/0x60
[<c0258f4f>] driver_probe_device+0x2f/0x80
[<c02590b2>] driver_attach+0x52/0xa0
[<c02595f8>] bus_add_driver+0x98/0xe0
[<c0259c5f>] driver_register+0x2f/0x40
[<c01fb340>] pci_register_driver+0x40/0x50
[<e08279cf>] alsa_card_ens137x_init+0xf/0x13 [snd_ens1371]
[<c0134279>] sys_init_module+0x169/0x240
[<c01041eb>] syscall_call+0x7/0xb


With CONFIG_EDD=m, there just wasn't anything interesting in memory
following edd_devices[] (thanks module loader for using whole pages I
believe).

-Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-11-14 02:59:53

by Matt Domsch

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Sat, Nov 13, 2004 at 08:28:35AM -0600, Matt Domsch wrote:
> It still doesn't explain why Christian's BIOS reports more devices
> than he has, that's still UI, so don't re-apply the edd.S patch just reverted.

Alexander van Heukelum noted to me that addw here modifies CF, so I
think something like should fix that. Christian, if you're in a
position to test this, I'd really appreciate it. You've been a
fantastic bug reporter / tester!

Not ready for Linus yet, and you'll need to re-apply the previous
edd.S patch which is now reverted in Linus's tree. As your BIOS
reports via CHECK EXTENSIONS PRESENT that you've got more devices than
you actually have, hopefully the int13 EXTENDED READ won't succeed for
non-existant devices anymore, and then neither will the READ SECTORS
call.

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== arch/i386/boot/edd.S 1.3 vs edited =====
--- 1.3/arch/i386/boot/edd.S 2004-10-20 03:37:11 -05:00
+++ edited/arch/i386/boot/edd.S 2004-11-13 20:31:58 -06:00
@@ -58,8 +58,12 @@
sti # work around buggy BIOSes
popw %dx
popw %si
- addw $EDD_DEV_ADDR_PACKET_LEN, %sp # remove packet from stack
- jnc edd_mbr_store_sig
+ pushfl # save EFLAGS into ebx
+ popl %ebx # because addw modifies CF
+ addw $EDD_DEV_ADDR_PACKET_LEN, %sp # remove packet from stack
+ pushl %ebx # get back right CF
+ popfl
+ jnc edd_mbr_store_sig
# otherwise, fall through to the legacy read function

edd_mbr_read_sectors:

2004-11-14 04:43:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)



On Sat, 13 Nov 2004, Matt Domsch wrote:
>
> Not ready for Linus yet

Indeed. Please don't use pushfl/popfl to save the carry flag. There are
tons of better ways.

For example, use "lea" instead of "add" to not write the flags (and add a
comment). Or save the carry flag in a register with

sbb %bx,%bx

ant test %bx later. Or any of a million other _standard_ ways to handle
this problem.

Linus

2004-11-14 11:45:58

by Christian Kujau

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

Matt Domsch wrote:
>
> Alexander van Heukelum noted to me that addw here modifies CF, so I
> think something like should fix that. Christian, if you're in a
> position to test this, I'd really appreciate it. You've been a

yes, i'll do so. right now i am off (and late) to sth. else, but i'll
test this in the evening.

thank you,
Christian.
--
BOFH excuse #318:

Your EMAIL is now being delivered by the USPS.

2004-11-14 20:05:31

by Christian Kujau

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

sorry, took me a bit longer to get to the testing.


Matt Domsch schrieb:
>
> Not ready for Linus yet, and you'll need to re-apply the previous
> edd.S patch which is now reverted in Linus's tree. As your BIOS

i've applied the patch to a pristine 2.6.10-rc1, so the (currently
reverted) EDD change is still there. tell me, if the patch had to be
applied to sth. else.

but for now i have to say, that it still oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-2.txt

...
BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
...

(oh, i've added an ide-disk yesterday, so hde will show up in dmesg.)

sorry,
Christian.
- --
BOFH excuse #401:

Sales staff sold a product we don't offer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBl7nZ+A7rjkF8z0wRAvuHAKCX8TWiDt5DP25OqBEWKecfM6x3HwCeNRoM
1IzHqKpcbWOABXWJ4vC4d1w=
=FiKX
-----END PGP SIGNATURE-----

2004-11-14 21:55:55

by Matt Domsch

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (almost solved)

On Sun, Nov 14, 2004 at 09:02:33PM +0100, Christian Kujau wrote:
> > Not ready for Linus yet, and you'll need to re-apply the previous
> > edd.S patch which is now reverted in Linus's tree. As your BIOS
>
> i've applied the patch to a pristine 2.6.10-rc1, so the (currently
> reverted) EDD change is still there. tell me, if the patch had to be
> applied to sth. else.
>
> but for now i have to say, that it still oopses:
>
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-2.txt

OK, the patch below (which Linus applied to his tree yesterday) should
fix the oopses.

> BIOS EDD facility v0.16 2004-Jun-25, 16 devices found

but the patch to edd.S doesn't resolve that EDD believes you've got 16
devices (I would expect it to report 2, as you have only 2 disks).

Thanks for the quick testing. Back to the drawing board though for
this second part.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== drivers/firmware/edd.c 1.30 vs edited =====
--- 1.30/drivers/firmware/edd.c 2004-06-29 09:44:48 -05:00
+++ edited/drivers/firmware/edd.c 2004-11-13 07:56:00 -06:00
@@ -70,7 +70,7 @@
static int edd_dev_is_type(struct edd_device *edev, const char *type);
static struct pci_dev *edd_get_pci_dev(struct edd_device *edev);

-static struct edd_device *edd_devices[EDDMAXNR];
+static struct edd_device *edd_devices[EDD_MBR_SIG_MAX];

#define EDD_DEVICE_ATTR(_name,_mode,_show,_test) \
struct edd_attribute edd_attr_##_name = { \
@@ -728,9 +728,9 @@

static inline int edd_num_devices(void)
{
- return min_t(unsigned char,
- max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
- max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));
+ return max_t(unsigned char,
+ min_t(unsigned char, EDD_MBR_SIG_MAX, edd.mbr_signature_nr),
+ min_t(unsigned char, EDDMAXNR, edd.edd_info_nr));
}

/**

2004-11-15 12:41:43

by Christian Kujau

[permalink] [raw]
Subject: Re: Oops in 2.6.10-rc1 (solved)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Domsch schrieb:
>
> OK, the patch below (which Linus applied to his tree yesterday) should
> fix the oopses.
>

so i've compiled a pristine 2.6.10-rc1-bk24 as your patch should be
included there (i've tried to apply your patch with --dry-run -> it did
not succeed, -R *would* have been successful) and finally it works!

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1-bk24.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1-bk24

snd_ens1371 is working fine, no oops, i can load/unload the drivers, no
problems ;-)

>
>>BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
>
> but the patch to edd.S doesn't resolve that EDD believes you've got 16
> devices (I would expect it to report 2, as you have only 2 disks).

but still:

BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

i have 2 disks now (1 ide, 1 scsi), 2 cdrom drives (ide). as you can see
from the dmesg, i have an additional ide-controller onboard:

PDC20265: chipset revision 2
PDC20265: ROM enabled at 0xdffe0000
PDC20265: 100% native mode on irq 10
PDC20265: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
ide2: BM-DMA at 0xb400-0xb407, BIOS settings: hde:DMA, hdf:DMA
ide3: BM-DMA at 0xb408-0xb40f, BIOS settings: hdg:pio, hdh:pio
Probing IDE interface ide2...
hde: ST320413A, ATA DISK drive
ide2 at 0xbc00-0xbc07,0xb802 on irq 10
Probing IDE interface ide3...
Probing IDE interface ide1...
Probing IDE interface ide3...
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
Probing IDE interface ide5...
ide5: Wait for ready failed before probe !

but there are only 4 ide channels on my board (Gigabyte GA7ZXR):
ide0 - with hda+hdb connected (2x cdrom)
ide1 - none
ide2 - with hde connected (ST320413A)
ide3 - none

so it's probing for a non-existent ide4+ide5! but it did that even in -bk4
times, so it's not "new behaviour", i guess.

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9-bk4.txt

anyway, it's working now, the oops is gone, but i can do further testing
regarding this EDD issue of course.

Thanks to all involved,
Christian.
- --
BOFH excuse #195:

We only support a 28000 bps connection.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBmKP4+A7rjkF8z0wRAooxAJ9dD5QEXsEPUJjlBNvtfhtPteGoNwCfdfCA
tsYq86N5Y/bpegSXYWS+nkw=
=kFOh
-----END PGP SIGNATURE-----