LinuxLists.cc - 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

2007-11-11 19:41:33

by Rafael J. Wysocki

Subject: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

[Note: Due to git.kernel.org not responding I'm unable to check which fixes
have already been merged since the last report.]

This message contains a list of some regressions from 2.6.23 which have been
reported since 2.6.24-rc1 was released and for which there are no fixes in the
mainline that I know of. If any of them have been fixed already, please let me
know.

If you know of any other unresolved regressions from 2.6.23, please let me know
as well and I'll add them to the list.

Subject : On 2.6.24-rc1-gc9927c2b BUG: unable to handle kernel paging request at virtual address 3d15b925
Submitter : Giacomo Catenazzi <[email protected]>
References : http://lkml.org/lkml/2007/10/24/487
http://bugzilla.kernel.org/show_bug.cgi?id=9246
Handled-By :
Patch :

Subject : Potential regression in -git15: can't resume stopped root shell?
Submitter : Theodore Tso <[email protected]>
References : http://lkml.org/lkml/2007/10/20/114
http://bugzilla.kernel.org/show_bug.cgi?id=9247
Handled-By : Serge Hallyn <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=13361&action=view
http://bugzilla.kernel.org/attachment.cgi?id=13375&action=view

Subject : irq 21: nobody cared 2.6.24-rc1
Submitter : Bongani Hlope <[email protected]>
References : http://lkml.org/lkml/2007/10/25/90
http://bugzilla.kernel.org/show_bug.cgi?id=9249
Handled-By :
Patch :

Subject : [BUG] panic after umount (biscted)
Submitter : Sebastian Siewior <[email protected]>
References : http://marc.info/?l=linux-kernel&m=119338387030335&w=2
http://bugzilla.kernel.org/show_bug.cgi?id=9250
Handled-By : Jens Axboe <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=119348520210349&w=2

Subject : 2.6.24-rc1 sysctl table check failed on PowerMac
Submitter : Mikael Pettersson <[email protected]>
References : http://marc.info/?l=linux-kernel&m=119350802331857&w=2
http://bugzilla.kernel.org/show_bug.cgi?id=9251
Handled-By : Alexey Dobriyan <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=119351015801660&w=2

Subject : 2.6.24-rc1: pata_acpi fails to activate DMA for DVD-ROM on ALi M5229 secondary channel
Submitter : Andrey Borzenkov <[email protected]>
References : http://marc.info/?l=linux-kernel&m=119342005216716&w=2
http://bugzilla.kernel.org/show_bug.cgi?id=9252
Handled-By : Alan Cox <[email protected]>
Patch :
Note : pata_acpi was not present in 2.6.23

Subject : 2.6.24-rc1 freezes on powerbook at first boot stage
Submitter : Elimar Riesebieter <[email protected]>
References : http://lkml.org/lkml/2007/10/24/205
http://bugzilla.kernel.org/show_bug.cgi?id=9254
Handled-By :
Patch :

Subject : build #286 failed for 2.6.24-rc1-gea45d15 in linux/arch/x86/kernel/setup_32.c
Submitter : Toralf Förster <[email protected]>
References : http://lkml.org/lkml/2007/10/28/110
http://bugzilla.kernel.org/show_bug.cgi?id=9256
Handled-By : "H. Peter Anvin" <[email protected]>
Patch : http://marc.info/[email protected]

Subject : 2.6.24-rc1 kills onboard r8169 (rtl8111b) NIC
Submitter : "Sergey S. Kostyliov" <[email protected]>
References : http://lkml.org/lkml/2007/10/28/144
http://bugzilla.kernel.org/show_bug.cgi?id=9257
Handled-By : Francois Romieu <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=13441&action=view

Subject : Commit "Hibernation: Enter platform hibernation state in a consistent way)" makes my system to resume instantly from S4
Submitter : Maxim Levitsky <[email protected]>
References : http://lkml.org/lkml/2007/10/27/66
http://bugzilla.kernel.org/show_bug.cgi?id=9258
Handled-By : "Rafael J. Wysocki" <[email protected]>
Patch :
Note : $subject commit apparently exposes a problem that existed previously

Subject : leds: ledtrig-timer calls sleeping function from invalid context
Submitter : Márton Németh <[email protected]>
References : http://bugzilla.kernel.org/show_bug.cgi?id=9264
Handled-By :
Patch :

Subject : Device mapper regression 2.6.23 vs. v2.6.23-6597-gcfa76f0
Submitter : Thomas Meyer <[email protected]>
References : http://lkml.org/lkml/2007/10/21/153
http://bugzilla.kernel.org/show_bug.cgi?id=9280
Handled-By :
Patch :

Subject : [2.6.24-rc1][BUG] Oops on battery removal
Submitter : Rolf Eike Beer <[email protected]>
References : http://lkml.org/lkml/2007/11/2/23
http://bugzilla.kernel.org/show_bug.cgi?id=9283
Handled-By : Alexey Starikovskiy <[email protected]>
Patch : http://lkml.org/lkml/2007/11/2/71

Subject : [2.6.24-rc1 regression] AC adapter state does not change after resume
Submitter : Andrey Borzenkov <[email protected]>
References : http://lkml.org/lkml/2007/10/30/427
http://bugzilla.kernel.org/show_bug.cgi?id=9284
Handled-By : Alexey Starikovskiy <[email protected]>
Patch : http://lkml.org/lkml/2007/10/31/44

Subject : 2.6.24-rc1 eat my photo SD card :-(
Submitter : Romano Giannetti <[email protected]>
References : http://lkml.org/lkml/2007/11/1/99
http://bugzilla.kernel.org/show_bug.cgi?id=9286
Handled-By : Nick Piggin <[email protected]>
Pierre Ossman <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=13450&action=view

Subject : 100% iowait on one of cpus in current -git
Submitter : Maxim Levitsky <[email protected]>
Thomas Schwarzgruber <[email protected]>
References : http://lkml.org/lkml/2007/10/22/20
http://lkml.org/lkml/2007/10/31/212
http://bugzilla.kernel.org/show_bug.cgi?id=9289
Handled-By : Fengguang Wu <[email protected]>
Patch :

Subject : pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
Submitter : Florin Iucha <[email protected]>
References : http://lkml.org/lkml/2007/10/28/65
http://bugzilla.kernel.org/show_bug.cgi?id=9291
Handled-By : Trond Myklebust <[email protected]>
Fengguang Wu <[email protected]>
Patch :

Subject : [regression] v2.6.24-rc1-497-gb1d08ac: kde battery icon gone
Submitter : Thomas Meyer <[email protected]>
References : http://lkml.org/lkml/2007/11/2/165
http://bugzilla.kernel.org/show_bug.cgi?id=9297
Handled-By : Andrey Borzenkov <[email protected]>
Ingo Molnar <[email protected]>
Patch :
Note : goes away if ACPI_PROCFS is set

Subject : Regression: libata: implement ata_wait_after_reset()
Submitter : Luca Tettamanti <[email protected]>
References : http://lkml.org/lkml/2007/11/3/66
http://bugzilla.kernel.org/show_bug.cgi?id=9298
Handled-By : Tejun Heo <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=13429&action=view

Subject : 2.6.24-rc1-g74521c28: oops during boot [<ffffffff881c03e4>] :power_supply:power_supply_show_property+0x94/0x150
Submitter : Thomas Bächler <[email protected]>
References : http://lkml.org/lkml/2007/11/3/35
http://bugzilla.kernel.org/show_bug.cgi?id=9299
Handled-By :
Patch :

Subject : Audigy 2 ZS Notebook prevents snd_emu10k1 module from loading/working
Submitter : [email protected]
References : http://bugzilla.kernel.org/show_bug.cgi?id=9304
Handled-By : Takashi Iwai <[email protected]>
Patch :

Subject : National characters are not displayed under console.
Submitter : Konrad Rzepecki <[email protected]>
References : http://bugzilla.kernel.org/show_bug.cgi?id=9319
Handled-By :
Patch :

Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
Submitter : Hans de Bruin <[email protected]>
References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
Handled-By : Robert Moore <[email protected]>
Patch :

Subject : net: skge breakage on 2.6.24-rc1
Submitter : Heikki Orsila <[email protected]>
References : http://lkml.org/lkml/2007/11/7/281
http://bugzilla.kernel.org/show_bug.cgi?id=9321
Handled-By :
Patch :

Subject : 2.6.24-rc1: pata_amd fails to detect 80-pin wire
Submitter : "Thomas Lindroth" <[email protected]>
References : http://lkml.org/lkml/2007/11/7/152
http://bugzilla.kernel.org/show_bug.cgi?id=9322
Handled-By :
Patch :

Subject : 2.6.24-rc1 - Regularly getting processes stuck in D state on startup
Submitter : David <[email protected]>
Confirmed-by : Stephen Rothwell <[email protected]>
References : http://lkml.org/lkml/2007/11/5/229
http://bugzilla.kernel.org/show_bug.cgi?id=9323
Handled-By : Peter Zijlstra <[email protected]>
Fengguang Wu <[email protected]>
Patch : http://lkml.org/lkml/2007/11/6/33

Subject : 2.6.24-rc2 breaks nVidia MCP51 High Definition Audio
Submitter : Gerhard Mack <[email protected]>
References : http://lkml.org/lkml/2007/11/7/318
http://bugzilla.kernel.org/show_bug.cgi?id=9324
Handled-By : Andrew Morton <[email protected]>
Patch :

Subject : 2.6.24-rc2 (esthetic?) regression: auto select interrupt bouncing
Submitter : Romano Giannetti <[email protected]>
References : http://bugzilla.kernel.org/show_bug.cgi?id=9327
Handled-By : Alexey Starikovskiy <[email protected]>
Patch :

Subject : libata: cdrw/dvdrom disabed after s2ram (2.6.24-rc2)
Submitter : Roberto Oppedisano <[email protected]>
References : http://lkml.org/lkml/2007/11/8/124
Handled-By : Andrew Morton <[email protected]>
Jeff Garzik <[email protected]>
Matthew Garrett <[email protected]>
Patch : http://lkml.org/lkml/2007/11/8/167

Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
Submitter : Roland Dreier <[email protected]>
References : http://lkml.org/lkml/2007/11/8/255
http://bugzilla.kernel.org/show_bug.cgi?id=9332
Handled-By :
Patch :

Subject : system hangs with blank screen after some time
Submitter : Marcus Better <[email protected]>
References : http://bugzilla.kernel.org/show_bug.cgi?id=9335
Handled-By : Andrew Morton <[email protected]>
Patch :

Subject : iozone write 50% regression in kernel 2.6.24-rc1
Submitter : "Zhang, Yanmin" <[email protected]>
References : http://lkml.org/lkml/2007/11/9/28
http://bugzilla.kernel.org/show_bug.cgi?id=9340
Handled-By : Peter Zijlstra <[email protected]>
Martin Knoblauch <[email protected]>
Patch :

Subject : 2.6.24 regression: hibernation hangs on "Suspending console" in low-battery condition
Submitter : Andrey Borzenkov <[email protected]>
References : http://lkml.org/lkml/2007/11/11/28
http://bugzilla.kernel.org/show_bug.cgi?id=9344
Handled-By : "Rafael J. Wysocki" <[email protected]>
Patch :

Subject : 2.6.24-rc2 STD with s2disk fails to activate suspended system after loading
Submitter : Chris Friedhoff <[email protected]>
References : http://lkml.org/lkml/2007/11/10/114
http://bugzilla.kernel.org/show_bug.cgi?id=9345
Handled-By :
Patch :

Subject : cd/dvd inaccessible in 2.6.24-rc2
Submitter : Will Trives <[email protected]>
References : http://lkml.org/lkml/2007/11/9/290
http://bugzilla.kernel.org/show_bug.cgi?id=9346
Handled-By : Alan Cox <[email protected]>
Jeff Garzik <[email protected]>
Patch :

Subject : [PATCH] x86: show cpuinfo only for online CPUs
Submitter : "Andreas Herrmann" <[email protected]>
References : http://lkml.org/lkml/2007/11/1/207
http://bugzilla.kernel.org/show_bug.cgi?id=9348
Handled-By : Glauber de Oliveira Costa <[email protected]>
"H. Peter Anvin" <[email protected]>
Patch : http://lkml.org/lkml/2007/11/1/246

Subject : 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
Submitter : Christoph Lameter <[email protected]>
References : http://lkml.org/lkml/2007/11/9/246
http://bugzilla.kernel.org/show_bug.cgi?id=9350
Handled-By : Nick Piggin <[email protected]>
Patch :

Subject : 2.6.24-rc1 on PPC64: machine check exception
Submitter : Vaidyanathan Srinivasan <[email protected]>
References : http://lkml.org/lkml/2007/11/5/92
http://bugzilla.kernel.org/show_bug.cgi?id=9351
Handled-By : Anton Blanchard <[email protected]>
Patch : http://patchwork.ozlabs.org/linuxppc/patch?id=14475

Subject : 2.6.24-rc1-gb4f5550 oops
Submitter : Grant Wilson <[email protected]>
References : http://lkml.org/lkml/2007/11/5/6
http://bugzilla.kernel.org/show_bug.cgi?id=9352
Handled-By : "Rafael J. Wysocki" <[email protected]>
Patch :

For details, please follow the links given in references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions from 2.6.23,
unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=9243

Please let me know if there are any Bugzilla entries that should be added to
the list in there.

Greetings,
Rafael

2007-11-11 20:10:15

by Alan Cox

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

> Subject : 2.6.24-rc1: pata_acpi fails to activate DMA for DVD-ROM on ALi M5229 secondary channel
> Submitter : Andrey Borzenkov <[email protected]>
> References : http://marc.info/?l=linux-kernel&m=119342005216716&w=2
> http://bugzilla.kernel.org/show_bug.cgi?id=9252
> Handled-By : Alan Cox <[email protected]>
> Patch :
> Note : pata_acpi was not present in 2.6.23

As I said before pata_acpi was not present in 2.6.23 -> Not a regression.
WONTFIX for 2.6.24. Not actually clear it is even a bug, the interactions
between using pata_acpi and simplex controllers are not documented
anywhere 8(

> Subject : 2.6.24-rc1: pata_amd fails to detect 80-pin wire
> Submitter : "Thomas Lindroth" <[email protected]>
> References : http://lkml.org/lkml/2007/11/7/152
> http://bugzilla.kernel.org/show_bug.cgi?id=9322

Tejun is looking into this - it's not trivial so may be 2.6.25 material.
Similar reports for some other controllers (notably VIA).

> Subject : cd/dvd inaccessible in 2.6.24-rc2
> Submitter : Will Trives <[email protected]>
> References : http://lkml.org/lkml/2007/11/9/290
> http://bugzilla.kernel.org/show_bug.cgi?id=9346
> Handled-By : Alan Cox <[email protected]>
> Jeff Garzik <[email protected]>

Not sure who is handling this now - seems to be an IRQ routing bug
introduced in -rc2. I've got a pile of similar breakage reports for
random ATA controllers.

Thanks

Alan

2007-11-11 20:17:29

by Rafael J. Wysocki

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

On Sunday, 11 of November 2007, Alan Cox wrote:
> > Subject : 2.6.24-rc1: pata_acpi fails to activate DMA for DVD-ROM on ALi M5229 secondary channel
> > Submitter : Andrey Borzenkov <[email protected]>
> > References : http://marc.info/?l=linux-kernel&m=119342005216716&w=2
> > http://bugzilla.kernel.org/show_bug.cgi?id=9252
> > Handled-By : Alan Cox <[email protected]>
> > Patch :
> > Note : pata_acpi was not present in 2.6.23
>
> As I said before pata_acpi was not present in 2.6.23 -> Not a regression.

OK, dropped.

2007-11-11 20:31:15

by Ingo Molnar

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

* Rafael J. Wysocki <[email protected]> wrote:

> Subject : [regression] v2.6.24-rc1-497-gb1d08ac: kde battery icon gone
> Submitter : Thomas Meyer <[email protected]>
> References : http://lkml.org/lkml/2007/11/2/165
> http://bugzilla.kernel.org/show_bug.cgi?id=9297
> Handled-By : Andrey Borzenkov <[email protected]>
> Ingo Molnar <[email protected]>
> Patch :
> Note : goes away if ACPI_PROCFS is set

should be fixed by acpi-make-acpi_procfs-default-to-y.patch in -mm. (not
yet upstream i think, but should before v2.6.24)

Ingo

2007-11-11 20:34:18

by Francois Romieu

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

Rafael J. Wysocki <[email protected]> :
> [Note: Due to git.kernel.org not responding I'm unable to check which fixes
> have already been merged since the last report.]
[...]
> Subject : 2.6.24-rc1 kills onboard r8169 (rtl8111b) NIC
> Submitter : "Sergey S. Kostyliov" <[email protected]>
> References : http://lkml.org/lkml/2007/10/28/144
> http://bugzilla.kernel.org/show_bug.cgi?id=9257
> Handled-By : Francois Romieu <[email protected]>
> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13441&action=view

Fixed and merged in Linus's tree as:
- 50d84c2dc00e48ff9ba018ed0dd23276cf79e566
- b9d04e2401bf308df921d3bbbdacab40fadc27bb

--
Ueimor

2007-11-11 22:16:34

by Bartlomiej Zolnierkiewicz

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

On Sunday 11 November 2007, Alan Cox wrote:

> > Subject : 2.6.24-rc1: pata_amd fails to detect 80-pin wire
> > Submitter : "Thomas Lindroth" <[email protected]>
> > References : http://lkml.org/lkml/2007/11/7/152
> > http://bugzilla.kernel.org/show_bug.cgi?id=9322

http://lkml.org/lkml/2007/10/12/537

The regression itself has been foreseen a month ago and it is quite
sad that it is still not fixed...

> Tejun is looking into this - it's not trivial so may be 2.6.25 material.
> Similar reports for some other controllers (notably VIA).

We may fix the regression in a bit different way for now and give Tejun
some more time for the complete rework of the cable detection code.

[PATCH] pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA timings

* Don't program UDMA timings when programming PIO or MWDMA modes.

This has also a nice side-effect of fixing regression added by commit
681c80b5d96076f447e8101ac4325c82d8dce508 ("libata: correct handling of
SRST reset sequences") (->set_piomode method for PIO0 is called before
->cable_detect method which checks UDMA timings to get the cable type).

* Bump driver version.

Signed-off-by: Bartlomiej Zolnierkiewicz <[email protected]>
---
Untested, please don't merge until it is confirmed to fix the problem.

 drivers/ata/pata_amd.c | 5 +++--
 drivers/ata/pata_via.c | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

Index: b/drivers/ata/pata_amd.c
===================================================================
--- a/drivers/ata/pata_amd.c
+++ b/drivers/ata/pata_amd.c
@@ -25,7 +25,7 @@
 #include <linux/libata.h>
 
 #define DRV_NAME "pata_amd"
-#define DRV_VERSION "0.3.9"
+#define DRV_VERSION "0.3.10"
 
 /**
  * timing_setup - shared timing computation and load
@@ -115,7 +115,8 @@ static void timing_setup(struct ata_port
 	}
 
 	/* UDMA timing */
-	pci_write_config_byte(pdev, offset + 0x10 + (3 - dn), t);
+	if (at.udma)
+		pci_write_config_byte(pdev, offset + 0x10 + (3 - dn), t);
 }
 
 /**
Index: b/drivers/ata/pata_via.c
===================================================================
--- a/drivers/ata/pata_via.c
+++ b/drivers/ata/pata_via.c
@@ -63,7 +63,7 @@
 #include <linux/dmi.h>
 
 #define DRV_NAME "pata_via"
-#define DRV_VERSION "0.3.2"
+#define DRV_VERSION "0.3.3"
 
 /*
  * The following comes directly from Vojtech Pavlik's ide/pci/via82cxxx
@@ -296,7 +296,7 @@ static void via_do_set_mode(struct ata_p
 	}
 
 	/* Set UDMA unless device is not UDMA capable */
-	if (udma_type) {
+	if (udma_type && t.udma) {
 		u8 cable80_status;
 
 		/* Get 80-wire cable detection bit */
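
The ordering problem named in the changelog (->set_piomode for PIO0 runs before ->cable_detect, which looks at the UDMA timings) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct model_port`, `model_set_mode()`, `model_cable_is_80wire()` and both register values are hypothetical and invented for illustration. The real drivers work on PCI config space and use different encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented register values, for illustration only. */
#define MODEL_UDMA_FAST 0xC4u	/* "firmware programmed a fast UDMA mode" */
#define MODEL_UDMA_NONE 0x03u	/* "no UDMA mode programmed" */

/* Hypothetical stand-in for one port's UDMA timing register. */
struct model_port {
	uint8_t udma_reg;
};

/*
 * Model of programming a transfer mode. udma_timing == 0 means the
 * requested mode is PIO or MWDMA. With decoupled == false the UDMA
 * register is rewritten on every call (the coupling the patch removes);
 * with decoupled == true it is written only for real UDMA modes.
 */
static void model_set_mode(struct model_port *port, uint8_t udma_timing,
			   bool decoupled)
{
	if (udma_timing != 0)
		port->udma_reg = udma_timing;
	else if (!decoupled)
		port->udma_reg = MODEL_UDMA_NONE;
}

/* Model of a cable detection hook that trusts the UDMA timing register. */
static bool model_cable_is_80wire(const struct model_port *port)
{
	return port->udma_reg >= MODEL_UDMA_FAST;
}

/* PIO0 is programmed first, then the cable type is probed. */
void model_demo(void)
{
	struct model_port coupled = { MODEL_UDMA_FAST };
	struct model_port decoupled = { MODEL_UDMA_FAST };

	model_set_mode(&coupled, 0, false);
	assert(!model_cable_is_80wire(&coupled));	/* hint was clobbered */

	model_set_mode(&decoupled, 0, true);
	assert(model_cable_is_80wire(&decoupled));	/* hint survives */
}
```

In the model, the firmware-provided timing value is the only cable hint, so any PIO setup that rewrites it before detection loses the 80-wire information.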

2007-11-11 22:47:38

by Alan Cox

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

> [PATCH] pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA timings
>
> * Don't program UDMA timings when programming PIO or MWDMA modes.
>
> This has also a nice side-effect of fixing regression added by commit
> 681c80b5d96076f447e8101ac4325c82d8dce508 ("libata: correct handling of
> SRST reset sequences") (->set_piomode method for PIO0 is called before
> ->cable_detect method which checks UDMA timings to get the cable type).

I'm not sure this helps as if the ACPI _GTF method is looking at the
flags and stuff but it has to be worth a try.

Works for me as a 2.6.24 band aid

2007-11-13 01:13:16

by Andrew Morton

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

On Sun, 11 Nov 2007 22:46:43 +0000 Alan Cox <[email protected]> wrote:

> > [PATCH] pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA timings
> >
> > * Don't program UDMA timings when programming PIO or MWDMA modes.
> >
> > This has also a nice side-effect of fixing regression added by commit
> > 681c80b5d96076f447e8101ac4325c82d8dce508 ("libata: correct handling of
> > SRST reset sequences") (->set_piomode method for PIO0 is called before
> > ->cable_detect method which checks UDMA timings to get the cable type).
>
> I'm not sure this helps as if the ACPI _GTF method is looking at the
> flags and stuff but it has to be worth a try.
>
>
> Works for me as a 2.6.24 band aid

I'm looking at that "Untested, please don't merge until it is confirmed to
fix the problem." comment..

Thomas, can you please give it a try, let us know?

Thanks

2007-11-13 14:09:19

by Thomas Lindroth

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

> > > [PATCH] pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA timings
> > >
> > > * Don't program UDMA timings when programming PIO or MWDMA modes.
> > >
> > > This has also a nice side-effect of fixing regression added by commit
> > > 681c80b5d96076f447e8101ac4325c82d8dce508 ("libata: correct handling of
> > > SRST reset sequences") (->set_piomode method for PIO0 is called before
> > > ->cable_detect method which checks UDMA timings to get the cable type).
> >
> > I'm not sure this helps as if the ACPI _GTF method is looking at the
> > flags and stuff but it has to be worth a try.
> >
> >
> > Works for me as a 2.6.24 band aid
>
> I'm looking at that "Untested, please don't merge until it is confirmed to
> fix the problem." comment..
>
> Thomas, can you please give it a try, let us know?
>
> Thanks
>

I can confirm that the patch "pata_amd/pata_via: de-couple programming
of PIO/MWDMA and UDMA timings" does fix my issue "pata_amd fails to
detect 80-pin wire".

2007-11-13 19:58:12

by Andrew Morton

Subject: Re: 2.6.24-rc2: Reported regressions from 2.6.23 (updated)

On Tue, 13 Nov 2007 14:34:19 +0100 "Thomas Lindroth" <[email protected]> wrote:

> >
> > On Sun, 11 Nov 2007 22:46:43 +0000 Alan Cox <[email protected]>
> > wrote:
> >
> > > > [PATCH] pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA
> > timings
> > > >
> > > > * Don't program UDMA timings when programming PIO or MWDMA modes.
> > > >
> > > > This has also a nice side-effect of fixing regression added by
> > commit
> > > > 681c80b5d96076f447e8101ac4325c82d8dce508 ("libata: correct handling
> > of
> > > > SRST reset sequences") (->set_piomode method for PIO0 is called
> > before
> > > > ->cable_detect method which checks UDMA timings to get the cable
> > type).
> > >
> > > I'm not sure this helps as if the ACPI _GTF method is looking at the
> > > flags and stuff but it has to be worth a try.
> > >
> > >
> > > Works for me as a 2.6.24 band aid
> >
> > I'm looking at that "Untested, please don't merge until it is confirmed to
> > fix the problem." comment..
> >
> > Thomas, can you please give it a try, let us know?
> >
> > Thanks
> >
>
> I can confirm that the patch "pata_amd/pata_via: de-couple programming of
> PIO/MWDMA and UDMA timings" does fix my issue "pata_amd fails to detect
> 80-pin wire".
>

Great, thanks for testing it.

I moved that patch to the "to send to maintainers as a 2.6.24 fix" queue.
I should get all that material sent out hopefully tomorrow, if I can manage
to get 2.6.24-rc2-mm1 to limp out the door.

2007-11-14 11:20:47

by Ingo Molnar

Subject: [bug] SLOB crash, 2.6.24-rc2

there's a new SLOB regression - the attached config crashes with:

[ 61.245190] rc.sysinit used greatest stack depth: 1680 bytes left
[ 61.386859] list_add corruption. prev->next should be next (407d973c), but was 418cf818. (prev=41877098).
[ 61.396328] ------------[ cut here ]------------
[ 61.400910] kernel BUG at lib/list_debug.c:33!
[ 61.405330] invalid opcode: 0000 [#1] DEBUG_PAGEALLOC

looks like memory corruption of some sort and it's reproducible. Picking
CONFIG_SLUB makes the crash go away. Booting v2.6.23 with the same
.config works fine.
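
The "list_add corruption" line in the log comes from a sanity check made before a doubly linked list insertion: the two neighbours must still point at each other. A minimal userspace sketch of that kind of check follows. The names `struct node`, `node_init()` and `checked_insert()` are hypothetical, and the function returns false where the kernel would hit a BUG; it is not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next, *prev;
};

/* An empty circular list is a head that points at itself. */
static void node_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert item between prev and next, but first verify that prev and
 * next really are neighbours. Returns false when the links disagree.
 */
static bool checked_insert(struct node *item, struct node *prev,
			   struct node *next)
{
	if (next->prev != prev) {
		fprintf(stderr, "next->prev should be prev (%p), but was %p\n",
			(void *)prev, (void *)next->prev);
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr, "prev->next should be next (%p), but was %p\n",
			(void *)next, (void *)prev->next);
		return false;
	}
	next->prev = item;
	item->next = next;
	item->prev = prev;
	prev->next = item;
	return true;
}

void checked_insert_demo(void)
{
	struct node head, a, b, stray;

	node_init(&head);
	node_init(&stray);
	assert(checked_insert(&a, &head, head.next));	/* healthy list */

	head.next = &stray;				/* simulate a stray write */
	assert(!checked_insert(&b, &head, &a));		/* prev->next != next */
}
```

The check only reports that a pointer was already wrong at insertion time. It says nothing about which code wrote the bad value, which is why the report alone does not identify the culprit.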

bisection is difficult due to networking related Kconfig problems: if i
put my .24 .config into .23 then the network drivers get lost. (we
should be more Kconfig-compatible between kernel releases, to ease
bisection efforts)

(full bootlog and config attached)

Ingo

Attachments:

(No filename) (841.00 B)
crash.log (162.23 kB)
crash.log.config (41.03 kB)

2007-11-14 17:37:29

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 12:20:01PM +0100, Ingo Molnar wrote:
>
> there's a new SLOB regression - the attached config crashes with:
>
> [ 61.245190] rc.sysinit used greatest stack depth: 1680 bytes left
> [ 61.386859] list_add corruption. prev->next should be next (407d973c), but was 418cf818. (prev=41877098).
> [ 61.396328] ------------[ cut here ]------------
> [ 61.400910] kernel BUG at lib/list_debug.c:33!
> [ 61.405330] invalid opcode: 0000 [#1] DEBUG_PAGEALLOC
>
> looks like memory corruption of some sort and it's reproducible. Picking
> CONFIG_SLUB makes the crash go away. Booting v2.6.23 with the same
> .config works fine.

Hmmm, the changes in SLOB since v2.6.23 are all trivial. I'll try to
reproduce it with your config, but it doesn't seem promising.

--
Mathematics is the supreme nostalgia of our time.

2007-11-14 18:41:54

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 11:36:11AM -0600, Matt Mackall wrote:
> On Wed, Nov 14, 2007 at 12:20:01PM +0100, Ingo Molnar wrote:
> >
> > there's a new SLOB regression - the attached config crashes with:
> >
> > [ 61.245190] rc.sysinit used greatest stack depth: 1680 bytes left
> > [ 61.386859] list_add corruption. prev->next should be next (407d973c), but was 418cf818. (prev=41877098).
> > [ 61.396328] ------------[ cut here ]------------
> > [ 61.400910] kernel BUG at lib/list_debug.c:33!
> > [ 61.405330] invalid opcode: 0000 [#1] DEBUG_PAGEALLOC
> >
> > looks like memory corruption of some sort and it's reproducible. Picking
> > CONFIG_SLUB makes the crash go away. Booting v2.6.23 with the same
> > .config works fine.
>
> Hmmm, the changes in SLOB since v2.6.23 are all trivial. I'll try to
> reproduce it with your config, but it doesn't seem promising.

Couldn't reproduce it here, let me know if you get anywhere with your bisect.

--
Mathematics is the supreme nostalgia of our time.

2007-11-14 19:05:46

by Ingo Molnar

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

* Matt Mackall <[email protected]> wrote:

> > > [ 61.245190] rc.sysinit used greatest stack depth: 1680 bytes left
> > > [ 61.386859] list_add corruption. prev->next should be next (407d973c), but was 418cf818. (prev=41877098).
> > > [ 61.396328] ------------[ cut here ]------------
> > > [ 61.400910] kernel BUG at lib/list_debug.c:33!
> > > [ 61.405330] invalid opcode: 0000 [#1] DEBUG_PAGEALLOC
> > >
> > > looks like memory corruption of some sort and it's reproducible. Picking
> > > CONFIG_SLUB makes the crash go away. Booting v2.6.23 with the same
> > > .config works fine.
> >
> > Hmmm, the changes in SLOB since v2.6.23 are all trivial. I'll try to
> > reproduce it with your config, but it doesn't seem promising.
>
> Couldn't reproduce it here, let me know if you get anywhere with your
> bisect.

the bug went away - and the only thing i did was a networking config
tweak. So maybe something in networking corrupts memory? I'm not sure i
can restore the old state. (i had lots of problems with net interface
renaming not working in .24)

Ingo

2007-11-14 19:43:50

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 08:05:01PM +0100, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > > > [ 61.245190] rc.sysinit used greatest stack depth: 1680 bytes left
> > > > [ 61.386859] list_add corruption. prev->next should be next (407d973c), but was 418cf818. (prev=41877098).
> > > > [ 61.396328] ------------[ cut here ]------------
> > > > [ 61.400910] kernel BUG at lib/list_debug.c:33!
> > > > [ 61.405330] invalid opcode: 0000 [#1] DEBUG_PAGEALLOC
> > > >
> > > > looks like memory corruption of some sort and it's reproducible. Picking
> > > > CONFIG_SLUB makes the crash go away. Booting v2.6.23 with the same
> > > > .config works fine.
> > >
> > > Hmmm, the changes in SLOB since v2.6.23 are all trivial. I'll try to
> > > reproduce it with your config, but it doesn't seem promising.
> >
> > Couldn't reproduce it here, let me know if you get anywhere with your
> > bisect.
>
> the bug went away - and the only thing i did was a networking config
> tweak. So maybe something in networking corrupts memory? I'm not sure i
> can restore the old state. (i had lots of problems with net interface
> renaming not working in .24)

Quite possible. SLOB is more sensitive to off by one bugs because it
doesn't have the power-of-two buckets that SLAB/SLUB have. IIRC,
SLAB/SLUB's debugging features won't detect when you request 28 bytes,
get 32, then overwrite byte 29. But that will damage other objects or
the free list in SLOB.

But this isn't the per-page SLOB list that's getting clobbered, this
is the list of pages held in struct page.
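
The 28-versus-32 example above can be put into numbers with a small sketch. The helper names, the smallest bucket of 8 bytes and the 2-byte rounding unit are assumptions chosen for illustration. The code models only the size arithmetic, not any real allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Size handed out by a power-of-two bucket allocator (8-byte minimum assumed). */
static size_t bucket_size(size_t request)
{
	size_t size = 8;

	while (size < request)
		size <<= 1;
	return size;
}

/* Size handed out by an allocator that rounds up to a small unit. */
static size_t granular_size(size_t request, size_t unit)
{
	return (request + unit - 1) / unit * unit;
}

/* Does a write at this zero-based byte index land outside the allocation? */
static bool write_escapes(size_t allocated, size_t index)
{
	return index >= allocated;
}

void slack_demo(void)
{
	size_t request = 28;
	size_t overrun_index = 28;	/* the 29th byte, one past the request */

	/* Bucketed: 4 bytes of slack absorb the overrun silently. */
	assert(bucket_size(request) == 32);
	assert(!write_escapes(bucket_size(request), overrun_index));

	/* Fine-grained: no slack, so the same write hits a neighbour. */
	assert(granular_size(request, 2) == 28);
	assert(write_escapes(granular_size(request, 2), overrun_index));
}
```

The same off-by-one write is harmless in one model and destructive in the other, which is the sense in which a tightly packed allocator is "more sensitive" to such bugs.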

--
Mathematics is the supreme nostalgia of our time.

2007-11-14 22:39:49

by David Miller

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

From: Ingo Molnar <[email protected]>
Date: Wed, 14 Nov 2007 20:05:01 +0100

> the bug went away - and the only thing i did was a networking config
> tweak. So maybe something in networking corrupts memory?

This wouldn't surprise me at all.

I think we can make some headway on this bug, the next time
you trigger it, if the list debugging was a little less terse.

For example, a backtrace and perhaps even feeding the bad list
pointers in question to the SLAB/SLUB debug helpers that can
identify a kmem cache from a given pointer would help.

Thanks.

2007-11-14 22:55:41

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 02:39:38PM -0800, David Miller wrote:
> From: Ingo Molnar <[email protected]>
> Date: Wed, 14 Nov 2007 20:05:01 +0100
>
> > the bug went away - and the only thing i did was a networking config
> > tweak. So maybe something in networking corrupts memory?
>
> This wouldn't surprise me at all.
>
> I think we can make some headway on this bug, the next time
> you trigger it, if the list debugging was a little less terse.
>
> For example, a backtrace and perhaps even feeding the bad list
> pointers in question to the SLAB/SLUB debug helpers that can
> identify a kmem cache from a given pointer would help.

He hit the bug using SLOB and there are no kmem (or any other) caches
in SLOB.

--
Mathematics is the supreme nostalgia of our time.

2007-11-14 23:10:26

by David Miller

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

From: Matt Mackall <[email protected]>
Date: Wed, 14 Nov 2007 16:53:36 -0600

> He hit the bug using SLOB and there are no kmem (or any other) caches
> in SLOB.

That's unfortunate, is there any user tracking facility at
all?

2007-11-14 23:37:54

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 03:10:13PM -0800, David Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Nov 2007 16:53:36 -0600
>
> > He hit the bug using SLOB and there are no kmem (or any other) caches
> > in SLOB.
>
> That's unfortunate, is there any user tracking facility at
> all?

No, the usual strategy for debugging problems -outside- SLOB is to
switch to another allocator with more extensive debugging facilities.

It is of course possible to add redzoning, last user, etc., but there
aren't many advantages to implementing these in SLOB compared to
switching allocators, unless the bug disappears in those other
allocators. In the case of random pointer fandango, such bugs are
likely to disappear when you turn on debugging anyway.

The most likely thing you'll hit in SLOB vs SLUB/SLAB is that SLOB
doesn't hand back power-of-two allocations for kmalloc. Instead, it
has 2-byte granularity on most machines. So small pointer overruns on
kmalloced objects will be somewhat more visible in SLOB than
SLAB/SLUB. I don't think SLAB/SLUB debugging can detect overruns
inside the not-requested-but-still-allocated region of objects.

I've implemented redzoning and various other debugging checks for
earlier versions of SLOB to find problems -in- the allocator, but
those won't apply to current SLOB (which can be considered v2).

--
Mathematics is the supreme nostalgia of our time.

2007-11-14 23:41:52

by David Miller

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

From: Matt Mackall <[email protected]>
Date: Wed, 14 Nov 2007 17:37:13 -0600

> No, the usual strategy for debugging problems -outside- SLOB is to
> switch to another allocator with more extensive debugging facilities.

Ok, so the thing we still can do is do a dump_stack() at the
list debugging assertion trigger points.

2007-11-15 00:10:48

by Matt Mackall

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Wed, Nov 14, 2007 at 03:41:43PM -0800, David Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Nov 2007 17:37:13 -0600
>
> > No, the usual strategy for debugging problems -outside- SLOB is to
> > switch to another allocator with more extensive debugging facilities.
>
> Ok, so the thing we still can do is do a dump_stack() at the
> list debugging assertion trigger points.

It's also pretty easy to add some debugging code to make SLOB walk all
its lists at alloc/free time.
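
As a sketch of what such a walk could check (userspace C with a minimal stand-in for struct list_head; list_check and its limit argument are made up for illustration, this is not the SLOB code): follow the next pointers from the head and verify that each one is mirrored by the matching prev pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Walk a circular list from `head` and return the number of entries,
 * or -1 if a link is inconsistent or the walk exceeds `limit` entries
 * (which suggests a cycle that no longer passes through head). */
static long list_check(const struct list_head *head, long limit)
{
	const struct list_head *pos = head;
	long n = 0;

	assert(head != NULL);
	do {
		/* every forward link must have a matching back link */
		if (!pos->next || pos->next->prev != pos)
			return -1;
		pos = pos->next;
		if (pos != head && ++n > limit)
			return -1;
	} while (pos != head);

	return n;
}
```

Calling something like this on each list at the top of the alloc and free paths would flag a broken link at the first operation after the corruption, rather than at whatever later point happens to trip over it.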

--
Mathematics is the supreme nostalgia of our time.

2007-11-15 10:44:19

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

* David Miller <[email protected]> wrote:

> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Nov 2007 17:37:13 -0600
>
> > No, the usual strategy for debugging problems -outside- SLOB is to
> > switch to another allocator with more extensive debugging facilities.
>
> Ok, so the thing we still can do is do a dump_stack() at the list
> debugging assertion trigger points.

ok, i'll first try to trigger it again.

it's a bzImage kernel with fixed order of eth0 and eth1 detection. What
i did was to twiddle the /etc/sysconfig/network-scripts/ifcfg-eth*
configs to address a network-does-not-show-up bug that .24 introduced.
The crash logs contain this:

VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 396k freed
Write protecting the kernel read-only data: 2056k
udev: renamed network interface eth1 to eth0
udev: renamed network interface eth0_rename to eth1
eth0: link down
ADDRCONF(NETDEV_UP): eth0: link is not ready
EXT3 FS on sda6, internal journal
kjournald starting. Commit interval 5 seconds

followed by the crash shortly afterwards (but not immediately). With the
non-crashing kernel i dont get those "renamed network interface"
messages.

network interface renaming has been a historic source of pain for me so
i frequently have to 'twiddle' the networking config to make it work
again on new kernels. Perhaps because i'm using bzImage kernels.
User-space is Fedora 8, so fairly recent.

Ingo

2007-11-15 10:52:15

by David Miller

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

From: Ingo Molnar <[email protected]>
Date: Thu, 15 Nov 2007 11:43:32 +0100

> The crash logs contain this:
>
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 396k freed
> Write protecting the kernel read-only data: 2056k
> udev: renamed network interface eth1 to eth0
> udev: renamed network interface eth0_rename to eth1
> eth0: link down
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> EXT3 FS on sda6, internal journal
> kjournald starting. Commit interval 5 seconds
>
> followed by the crash shortly afterwards (but not immediately). With the
> non-crashing kernel i dont get those "renamed network interface"
> messages.
>
> network interface renaming has been a historic source of pain for me so
> i frequently have to 'twiddle' the networking config to make it work
> again on new kernels. Perhaps because i'm using bzImage kernels.
> User-space is Fedora 8, so fairly recent.

Yeah I wish udev would just leave the damn devices alone.

It even does things like try to rename a network device to the same
name it already has, and other strange stuff.

But that log difference is a good clue.

Because udev can try to rename a network device stupidly to a name the
device already has we added a patch to just short circuit this case in
the networking. We did this because otherwise the generic device
layer gives an ugly stack backtrace via dev_rename().

Therefore, you might want to see if reverting that patch (attached
below) has some effect, once you are able to trigger it again.

Thanks Ingo.

commit c8d90dca3211966ba5189e0f3d4bccd558d9ae08
Author: Stephen Hemminger <[email protected]>
Date: Fri Oct 26 03:53:42 2007 -0700

    [NET] dev_change_name: ignore changes to same name

    Prevent error/backtrace from dev_rename() when changing
    name of network device to the same name. This is a common
    situation with udev and other scripts that bind addr to device.

    Signed-off-by: Stephen Hemminger <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

diff --git a/net/core/dev.c b/net/core/dev.c
index f1647d7..ddfef3b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -883,6 +883,9 @@ int dev_change_name(struct net_device *dev, char *newname)
 	if (!dev_valid_name(newname))
 		return -EINVAL;

+	if (strncmp(newname, dev->name, IFNAMSIZ) == 0)
+		return 0;
+
 	memcpy(oldname, dev->name, IFNAMSIZ);

 	if (strchr(newname, '%')) {

2007-11-15 10:58:20

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Thursday 15 November 2007 21:43, Ingo Molnar wrote:
> * David Miller <[email protected]> wrote:
> > From: Matt Mackall <[email protected]>
> > Date: Wed, 14 Nov 2007 17:37:13 -0600
> >
> > > No, the usual strategy for debugging problems -outside- SLOB is to
> > > switch to another allocator with more extensive debugging facilities.
> >
> > Ok, so the thing we still can do is do a dump_stack() at the list
> > debugging assertion trigger points.
>
> ok, i'll first try to trigger it again.

I had implemented SLOB in userspace, so I resynched and think I
found your problem. Sorry for the attachment format -- this mailer
isn't the best. I'm really computer illiterate when it comes to
userspace...

Anyway, I'm really happy to see you're testing and using SLOB
upstream :) Is there any particular reason that you're using it?

Thanks,
Nick

Attachments:

(No filename) (846.00 B)
slob-rotate-fix.patch (869.00 B)

2007-11-15 11:03:53

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

* David Miller <[email protected]> wrote:

> Yeah I wish udev would just leave the damn devices alone.
>
> It even does things like try to rename a network device to the same
> name it already has, and other strange stuff.
>
> But that log difference is a good clue.
>
> Because udev can try to rename a network device stupidly to a name the
> device already has we added a patch to just short circuit this case in
> the networking. We did this because otherwise the generic device
> layer gives an ugly stack backtrace via dev_rename().
>
> Therefore, you might want to see if reverting that patch (attached
> below) has some effect, once you are able to trigger it again.

just to confuse things, i just got a crash with the twiddled network
setup :-/ I have reverted the same-name optimization and have got a
similar crash again. So this angle is a red herring.

now that it's reproducible again i'll try more direct debugging.
(Networking might not even be the cause of this - that was just a quick
first impression that i had.)

Btw., the .config is the result of automated "make randconfig" x86
bootup testing QA, so there might be weird combinations in the .config.
That's how SLOB got randomly enabled in the first place, i dont normally
use SLOB kernels.

Ingo

2007-11-15 11:05:36

by David Miller

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

From: Ingo Molnar <[email protected]>
Date: Thu, 15 Nov 2007 12:03:25 +0100

> now that it's reproducible again i'll try more direct debugging.
> (Networking might not even be the cause of this - that was just a quick
> first impression that i had.)
>
> Btw., the .config is the result of automated "make randconfig" x86
> bootup testing QA, so there might be weird combinations in the .config.
> That's how SLOB got randomly enabled in the first place, i dont normally
> use SLOB kernels.

Check out Nick Piggin's SLOB bug fix, I think it is a good
lead :-)

2007-11-15 11:28:50

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

* Nick Piggin <[email protected]> wrote:

> On Thursday 15 November 2007 21:43, Ingo Molnar wrote:
> > * David Miller <[email protected]> wrote:
> > > From: Matt Mackall <[email protected]>
> > > Date: Wed, 14 Nov 2007 17:37:13 -0600
> > >
> > > > No, the usual strategy for debugging problems -outside- SLOB is to
> > > > switch to another allocator with more extensive debugging facilities.
> > >
> > > Ok, so the thing we still can do is do a dump_stack() at the list
> > > debugging assertion trigger points.
> >
> > ok, i'll first try to trigger it again.
>
> I had implemented SLOB in userspace, so I resynched and think I found
> your problem. Sorry for the attachment format -- this mailer isn't the
> best. I'm really computer illiterate when it comes to userspace...

thx, i'll try your fix in a minute.

> Anyway, I'm really happy to see you're testing and using SLOB upstream
> :) Is there any particular reason that you're using it?

i sometimes test SLOB for -rt, but this time it's the result of my
"automated random QA" effort, as part of arch/x86 maintenance/QA.

the main trick is to build and boot random "make randconfig"
bzImages. That finds build bugs and a good deal of boot hang and crash
bugs as well. (it also found a compiler bug already) I can build and
boot about 1000 random kernels in 24 hours, and it's all fully
automated. I usually run it overnight - when a kernel does not come up
due to a bootup hang or crash (or the kernel log signals any exception
condition) then the script stops and i can fix it in the morning.

The first step towards this was to get allyesconfig bzImage kernels to
build and boot fine. That effort took months (we had many problems in
this area) - i think you saw bugreports and fixes from me about that on
lkml.

Once that worked reasonably well i made a small Kconfig patch that
forcibly selects a "minimum set" of drivers and kernel subsystems that
are needed to boot up a testsystem. Once a "make allnoconfig" and a
"make allyesconfig" bzImage kernel boots up fine on the testbox all
randconfig configs "inbetween" are supposed to build and boot fine as
well.

I also have a patch that adds all the x86 boot options like nosmp,
maxcpus=1, nohz=off, hpet=disable to be selectable as .config options -
so those boot options are randomized as well.

I also have a small patch that disables half a dozen drivers/features
that are not expected to work out of box in a bzImage kernel. (such as
ISA drivers that assume the presence of hardware, or root filesystem
features such as NFSROOT)

the resulting make randconfig kernel still has 99% of the degrees of
freedom that a stock make randconfig kernel has, so by all practical
purposes it's a fully random kernel - it just happens to boot on my
testsystem all the time.

A successful bootup means the test system is able to boot up into a
stock Fedora 8 userspace and is able to bring up its network interfaces
and ssh out (automatically) to the build box to signal the completion of
a successful test cycle. The logs are also analyzed for lockdep
assertions (if lockdep is enabled - which it is in about 20% of the
randconfig kernels) and other kernel bugs.

(just in case you were wondering about one of the reasons why the
arch/x86 unification merge went so smoothly, with nary a regression ;-)
Thomas is doing other types of automated QA of the x86 queue as well.)

this method found the SG-list corruption bugs the following night after
Linus committed Jens' SG-list changes, so it's pretty good at finding
regressions as early as possible.

Ingo

2007-11-15 11:32:41

Subject: [patch] slob: fix memory corruption

* Ingo Molnar <[email protected]> wrote:

> > I had implemented SLOB in userspace, so I resynched and think I
> > found your problem. Sorry for the attachment format -- this mailer
> > isn't the best. I'm really computer illiterate when it comes to
> > userspace...
>
> thx, i'll try your fix in a minute.

that did the trick! Nick, find an updated patch below. (reference to the
bugzilla added.)

Ingo

-------------------->
Subject: slob: fix memory corruption
From: Nick Piggin <[email protected]>

Previously, it would be possible for prev->next to point to
&free_slob_pages, and thus we would try to move a list onto itself, and
bad things would happen.

It seems a bit hairy to be doing list operations with the list marker as
an entry, rather than a head, but...

this resolves the following crash:

http://bugzilla.kernel.org/show_bug.cgi?id=9379

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/slob.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/mm/slob.c
===================================================================
--- linux.orig/mm/slob.c
+++ linux/mm/slob.c
@@ -321,7 +321,8 @@ static void *slob_alloc(size_t size, gfp
 			/* Improve fragment distribution and reduce our average
 			 * search time by starting our next search here. (see
 			 * Knuth vol 1, sec 2.5, pg 449) */
-			if (free_slob_pages.next != prev->next)
+			if (prev != free_slob_pages.prev &&
+					free_slob_pages.next != prev->next)
 				list_move_tail(&free_slob_pages, prev->next);
 			break;
 		}
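
To see why moving the list marker onto itself is harmful, here is a userspace sketch. It assumes simplified copies of the kernel list helpers; rotate_old, rotate_fixed and list_is_consistent are names made up for this illustration, and the "pages" are bare list nodes. It models the state where prev is the last entry, so that prev->next is the marker itself.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copies of the relevant kernel list helpers (simplified). */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_insert(struct list_head *new, struct list_head *prev,
			struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	list_insert(new, head->prev, head);
}

/* Unlink `list` from wherever it is and re-add it just before `head`. */
static void list_move_tail(struct list_head *list, struct list_head *head)
{
	list->next->prev = list->prev;
	list->prev->next = list->next;
	list_add_tail(list, head);
}

/* Returns 1 if every forward link is mirrored by a back link. */
static int list_is_consistent(const struct list_head *head, int limit)
{
	const struct list_head *pos = head;

	assert(limit > 0);
	do {
		if (pos->next->prev != pos)
			return 0;
		pos = pos->next;
	} while (pos != head && --limit > 0);

	return pos == head;
}

/* The rotation as it was before the fix. */
static void rotate_old(struct list_head *marker, struct list_head *prev)
{
	if (marker->next != prev->next)
		list_move_tail(marker, prev->next);
}

/* The rotation with the extra check from the patch. */
static void rotate_fixed(struct list_head *marker, struct list_head *prev)
{
	if (prev != marker->prev && marker->next != prev->next)
		list_move_tail(marker, prev->next);
}
```

With a marker and two entries A and B, calling rotate_old(&marker, &B) ends up as list_move_tail(&marker, &marker): the marker is unlinked and then re-added relative to itself, leaving marker.next pointing at the marker (the list looks empty going forward) while marker.prev still points at B. rotate_fixed skips the move in that case and still rotates normally when prev is not the last entry.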

2007-11-15 11:39:31

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

On Thursday 15 November 2007 22:28, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:

> > Anyway, I'm really happy to see you're testing and using SLOB upstream
> >
> > :) Is there any particular reason that you're using it?
>
> i sometimes test SLOB for -rt, but this time it's the result of my
> "automated random QA" effort, as part of arch/x86 maintenance/QA.
>
> the main trick is to build and boot random "make randconfig"
> bzImages. That finds build bugs and a good deal of boot hang and crash
> bugs as well. (it also found a compiler bug already) I can build and
> boot about 1000 random kernels in 24 hours, and it's all fully
> automated. I usually run it overnight - when a kernel does not come up
> due to a bootup hang or crash (or the kernel log signals any exception
> condition) then the script stops and i can fix it in the morning.
>
> The first step towards this was to get allyesconfig bzImage kernels to
> build and boot fine. That effort took months (we had many problems in
> this area) - i think you saw bugreports and fixes from me about that on
> lkml.
>
> Once that worked reasonably well i made a small Kconfig patch that
> forcibly selects a "minimum set" of drivers and kernel subsystems that
> are needed to boot up a testsystem. Once a "make allnoconfig" and a
> "make allyesconfig" bzImage kernel boots up fine on the testbox all
> randconfig configs "inbetween" are supposed to build and boot fine as
> well.
>
> I also have a patch that adds all the x86 boot options like nosmp,
> maxcpus=1, nohz=off, hpet=disable to be selectable as .config options -
> so those boot options are randomized as well.
>
> I also have a small patch that disables half a dozen drivers/features
> that are not expected to work out of box in a bzImage kernel. (such as
> ISA drivers that assume the presence of hardware, or root filesystem
> features such as NFSROOT)
>
> the resulting make randconfig kernel still has 99% of the degrees of
> freedom that a stock make randconfig kernel has, so by all practical
> purposes it's a fully random kernel - it just happens to boot on my
> testsystem all the time.
>
> A successful bootup means the test system is able to boot up into a
> stock Fedora 8 userspace and is able to bring up its network interfaces
> and ssh out (automatically) to the build box to signal the completion of
> a successful test cycle. The logs are also analyzed for lockdep
> assertions (if lockdep is enabled - which it is in about 20% of the
> randconfig kernels) and other kernel bugs.
>
> (just in case you were wondering about one of the reasons why the
> arch/x86 unification merge went so smoothly, with nary a regression ;-)
> Thomas is doing other types of automated QA of the x86 queue as well.)

Well, my hat's off to you. Actually I was more wondering how it is
that you're catching SLOB bugs ;) so it seems your test setup is much
more useful than just to test the x86 arch code...

2007-11-15 12:40:29

by Dave Haywood

Subject: Re: [bug] SLOB crash, 2.6.24-rc2

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>> On Thursday 15 November 2007 21:43, Ingo Molnar wrote:
>>
>>> * David Miller <[email protected]> wrote:
>>>
>>>> From: Matt Mackall <[email protected]>
>>>> Date: Wed, 14 Nov 2007 17:37:13 -0600
>>>>
>>>>
>>>>> No, the usual strategy for debugging problems -outside- SLOB is to
>>>>> switch to another allocator with more extensive debugging facilities.
>>>>>
>>>> Ok, so the thing we still can do is do a dump_stack() at the list
>>>> debugging assertion trigger points.
>>>>
>>> ok, i'll first try to trigger it again.
>>>
>> I had implemented SLOB in userspace, so I resynched and think I found
>> your problem. Sorry for the attachment format -- this mailer isn't the
>> best. I'm really computer illiterate when it comes to userspace...
>>
>
> thx, i'll try your fix in a minute.
>
>
>> Anyway, I'm really happy to see you're testing and using SLOB upstream
>> :) Is there any particular reason that you're using it?
>>
>
> i sometimes test SLOB for -rt, but this time it's the result of my
> "automated random QA" effort, as part of arch/x86 maintenance/QA.
>
> the main trick is to build and boot random "make randconfig"
> bzImages. That finds build bugs and a good deal of boot hang and crash
> bugs as well. (it also found a compiler bug already) I can build and
> boot about 1000 random kernels in 24 hours, and it's all fully
> automated. I usually run it overnight - when a kernel does not come up
> due to a bootup hang or crash (or the kernel log signals any exception
> condition) then the script stops and i can fix it in the morning.
>
> The first step towards this was to get allyesconfig bzImage kernels to
> build and boot fine. That effort took months (we had many problems in
> this area) - i think you saw bugreports and fixes from me about that on
> lkml.
>
> Once that worked reasonably well i made a small Kconfig patch that
> forcibly selects a "minimum set" of drivers and kernel subsystems that
> are needed to boot up a testsystem. Once a "make allnoconfig" and a
> "make allyesconfig" bzImage kernel boots up fine on the testbox all
> randconfig configs "inbetween" are supposed to build and boot fine as
> well.
>
> I also have a patch that adds all the x86 boot options like nosmp,
> maxcpus=1, nohz=off, hpet=disable to be selectable as .config options -
> so those boot options are randomized as well.
>
> I also have a small patch that disables half a dozen drivers/features
> that are not expected to work out of box in a bzImage kernel. (such as
> ISA drivers that assume the presence of hardware, or root filesystem
> features such as NFSROOT)
>
> the resulting make randconfig kernel still has 99% of the degrees of
> freedom that a stock make randconfig kernel has, so by all practical
> purposes it's a fully random kernel - it just happens to boot on my
> testsystem all the time.
>
> A successful bootup means the test system is able to boot up into a
> stock Fedora 8 userspace and is able to bring up its network interfaces
> and ssh out (automatically) to the build box to signal the completion of
> a successful test cycle. The logs are also analyzed for lockdep
> assertions (if lockdep is enabled - which it is in about 20% of the
> randconfig kernels) and other kernel bugs.
>
> (just in case you were wondering about one of the reasons why the
> arch/x86 unification merge went so smoothly, with nary a regression ;-)
> Thomas is doing other types of automated QA of the x86 queue as well.)
>
> this method found the SG-list corruption bugs the following night after
> Linus committed Jens' SG-list changes, so it's pretty good at finding
> regressions as early as possible.
>
> Ingo
>

How complete is the QA testing? I was reading this interesting thread
and it occurred to me that this sounds like a useful distributed
computing application. ie a central server with all valid Kconfig
combinations (how many are there?) for a particular release (-rc or
otherwise) across all architectures. These are allocated to clients on
request to be built / booted etc. Any errors are fed back to the
central server. I guess this would be a useful resource for
developers. More importantly (and I don't know if this is the case
already!) a new Linux release (2.6.x) could be "certified" with some
level of testing on known hardware / architectures.

tbh, I feel sorry for Ingo's machine compiling 1000 random kernels in
24h! I'm surprised it hasn't called the Samaritans...

Dave.

2007-11-15 12:49:29

Subject: Re: [patch] slob: fix memory corruption

> From: Nick Piggin <[email protected]>

> - if (free_slob_pages.next != prev->next)
> + if (prev != free_slob_pages.prev &&
> + free_slob_pages.next != prev->next)
> list_move_tail(&free_slob_pages, prev->next);

btw., exactly how did you find this bug? User-space simulation of SLOB?

Ingo

2007-11-15 16:02:20

by Matt Mackall

Subject: Re: [patch] slob: fix memory corruption

On Thu, Nov 15, 2007 at 12:32:04PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > I had implemented SLOB in userspace, so I resynched and think I
> > > found your problem. Sorry for the attachment format -- this mailer
> > > isn't the best. I'm really computer illiterate when it comes to
> > > userspace...
> >
> > thx, i'll try your fix in a minute.
>
> that did the trick! Nick, find an updated patch below. (reference to the
> bugzilla added.)

Yes, good catch, Nick!

> Ingo
>
> -------------------->
> Subject: slob: fix memory corruption
> From: Nick Piggin <[email protected]>
>
> Previously, it would be possible for prev->next to point to
> &free_slob_pages, and thus we would try to move a list onto itself, and
> bad things would happen.
>
> It seems a bit hairy to be doing list operations with the list marker as
> an entry, rather than a head, but...
>
> this resolves the following crash:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9379
>
> Signed-off-by: Nick Piggin <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Matt Mackall <[email protected]>

Andrew, please queue this for 2.6.24 and -stable.

> ---
> mm/slob.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux/mm/slob.c
> ===================================================================
> --- linux.orig/mm/slob.c
> +++ linux/mm/slob.c
> @@ -321,7 +321,8 @@ static void *slob_alloc(size_t size, gfp
>  			/* Improve fragment distribution and reduce our average
>  			 * search time by starting our next search here. (see
>  			 * Knuth vol 1, sec 2.5, pg 449) */
> -			if (free_slob_pages.next != prev->next)
> +			if (prev != free_slob_pages.prev &&
> +					free_slob_pages.next != prev->next)
>  				list_move_tail(&free_slob_pages, prev->next);
>  			break;
>  		}

--
Mathematics is the supreme nostalgia of our time.

2007-11-15 21:03:29

Subject: Re: [patch] slob: fix memory corruption

On Thursday 15 November 2007 23:48, Ingo Molnar wrote:
> > From: Nick Piggin <[email protected]>
> >
> > - if (free_slob_pages.next != prev->next)
> > + if (prev != free_slob_pages.prev &&
> > + free_slob_pages.next != prev->next)
> > list_move_tail(&free_slob_pages, prev->next);
>
> btw., exactly how did you find this bug? User-space simulation of SLOB?

Yes. It was very useful in developing the improvements to the freelist
handling. The only reason why I don't release/run the code more often
is that my test harness work is pretty ugly (ie. it isn't just a simple
cp mm/slob.c ../blah/).

After that, it is just a loop of N iterations. Within each iteration there
is a chance of a single allocation of a random size, a single free of a
random outstanding allocation, a run of allocating MAX allocations, or
a run of freeing all previously allocated memory. It's a bit crude, but
it showed up your list head corruption in a second or two.
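
A loop of that shape could look roughly like the sketch below. This is not the actual harness: malloc/free stand in for the allocator under test, and the table size, size cap, probabilities, fill-pattern check and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OBJS 256	/* assumed cap on outstanding allocations */
#define MAX_SIZE 512	/* assumed largest request, in bytes */

struct obj {
	unsigned char *mem;
	size_t size;
	unsigned char fill;
};

static struct obj objs[MAX_OBJS];
static int nr_objs;
static unsigned long rng_state;

/* Small LCG so that a failing seed can be replayed exactly. */
static unsigned long rnd(void)
{
	rng_state = rng_state * 1103515245UL + 12345UL;
	return (rng_state >> 16) & 0x7fff;
}

/* Allocate one object of random size and fill it with a known byte.
 * Returns 0 on success or when the table is full, -1 if malloc fails. */
static int alloc_one(void)
{
	struct obj *o;

	if (nr_objs == MAX_OBJS)
		return 0;
	o = &objs[nr_objs];
	o->size = 1 + rnd() % MAX_SIZE;
	o->fill = (unsigned char)rnd();
	o->mem = malloc(o->size);
	if (!o->mem)
		return -1;
	memset(o->mem, o->fill, o->size);
	nr_objs++;
	return 0;
}

/* Check the fill pattern, free object i, and close the gap in the table.
 * Returns -1 if any byte of the object was overwritten. */
static int free_one(int i)
{
	struct obj *o = &objs[i];
	size_t k;
	int bad = 0;

	assert(i >= 0 && i < nr_objs);
	for (k = 0; k < o->size; k++)
		if (o->mem[k] != o->fill)
			bad = -1;
	free(o->mem);
	objs[i] = objs[--nr_objs];
	return bad;
}

/* Run the loop; returns 0 on a clean run, -1 on a pattern mismatch
 * or a failed malloc. Everything is freed before returning. */
static int stress(unsigned long seed, long iterations)
{
	int err = 0;

	rng_state = seed;
	nr_objs = 0;
	while (iterations-- > 0 && !err) {
		switch (rnd() % 16) {
		case 0:	/* run of allocations up to the cap */
			while (nr_objs < MAX_OBJS && !err)
				err = alloc_one();
			break;
		case 1:	/* free everything outstanding */
			while (nr_objs > 0)
				err |= free_one(nr_objs - 1);
			break;
		default: /* one random alloc or one random free */
			if (rnd() & 1)
				err = alloc_one();
			else if (nr_objs > 0)
				err = free_one((int)(rnd() % (unsigned long)nr_objs));
			break;
		}
	}
	while (nr_objs > 0)
		err |= free_one(nr_objs - 1);
	return err;
}
```

The fixed seed is a deliberate choice in this sketch: when a run fails, calling stress() again with the same seed and iteration count replays the same sequence of allocations and frees.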