LinuxLists.cc - 2.6.24-rc4-git5: Reported regressions from 2.6.23

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Fabio Comolli <[email protected]> wrote:

> <snip>
>
> > Subject : Battery shows up twice in kpowersave
> > Submitter : Rolf Eike Beer <[email protected]>
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=9494
> > Handled-By : Alexey Starikovskiy <[email protected]>
> > Patch :
> >
>
> I don't think that this is a regression: I reported on RedHat bugzilla
> when I switched from F7 to F8 and I was using 2.6.23.8 at that time.
> It looks to me an HAL regression, but of course I may be wrong :-) as
> the reported bisected to a bad commit.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=373041
>
> By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5
> custom kernel and Gnome desktop and the problem is still present, even
> with gnome-power-manager.

to me this looks like an ABI regression - utilities should work without
change. Something changed in /sys output that caused HAL to think that
there are two batteries:

| The output of lshal shows that there are two UDI's with
| info.capabilities = { 'battery' }:
|
| udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
| udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'

whether it's a HAL bug or a kernel bug, the original state should be
restored and it should be worked out without breaking users of older HAL
versions.

grumble: way too many times do various system utilities break when i
upgrade the kernel on my laptop. Maybe a new debug mechanism: we should
start fingerprinting the exact /sys and /proc output and enforce that
it's immutable across kernel releases as long as the hardware is
unmodified?

Ingo

2007-12-08 09:24:58

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 09:28:15 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Fabio Comolli <[email protected]> wrote:
>
> > <snip>
> >
> > > Subject : Battery shows up twice in kpowersave
> > > Submitter : Rolf Eike Beer <[email protected]>
> > > References : http://bugzilla.kernel.org/show_bug.cgi?id=9494
> > > Handled-By : Alexey Starikovskiy <[email protected]>
> > > Patch :
> > >
> >
> > I don't think that this is a regression: I reported on RedHat bugzilla
> > when I switched from F7 to F8 and I was using 2.6.23.8 at that time.
> > It looks to me an HAL regression, but of course I may be wrong :-) as
> > the reported bisected to a bad commit.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=373041
> >
> > By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5
> > custom kernel and Gnome desktop and the problem is still present, even
> > with gnome-power-manager.
>
> to me this looks like an ABI regression - utilities should work without
> change. Something changed in /sys output that caused HAL to think that
> there are two batteries:

Yep. Although HAL is of course a most special case of "userspace".

> | The output of lshal shows that there are two UDI's with
> | info.capabilities = { 'battery' }:
> |
> | udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
> | udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'
>
> whether it's a HAL bug or a kernel bug, the original state should be
> restored and it should be worked out without breaking users of older HAL
> versions.

"breaking users of older HAL versions" == "breaking machines".

The patch should be reverted. Do we know which one it was?

> grumble: way too many times do various system utilities break when i
> upgrade the kernel on my laptop. Maybe a new debug mechanism: we should
> start fingerprinting the exact /sys and /proc output and enforce that
> it's immutable across kernel releases as long as the hardware is
> unmodified?

That would be neat. It would need to be executed on a lot of different
machines.

I wonder if there's something sneaky we can do here. Install the script in
/lib/modules/$(uname -r) and then run it from the kernel when the fork
count reaches 1000 ;)

(hey, I've seen worse: /proc files which start with #!/bin/sh)

2007-12-08 09:29:56

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:

> This message contains a list of some regressions from 2.6.23 which have been
> reported since 2.6.24-rc1 was released and for which there are no fixes in the
> mainline that I know of. ?If any of them have been fixed already, please let me
> know.
>
> If you know of any other unresolved regressions from 2.6.23, please let me know
> either and I'll add them to the list.

Twenty nine, huh?

It would be useful if these records were sorted in date-of-reportage order
and had a date stamp so we could see how long they've been hanging about.
Something to think about for the post-2.6.24 regression if you'll be handling
those?

> Subject : leds: ledtrig-timer calls sleeping function from invalid context
> Submitter : M?rton N?meth <[email protected]>
> References : http://bugzilla.kernel.org/show_bug.cgi?id=9264
> Handled-By : Richard Purdie <[email protected]>
> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view

That patch has been merged (dc47206e552c0850ad11f7e9a1fca0a3c92f5d65) and
assuming M?rton has tested the latest git snapshot
(ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots) successfully we can
cross it off?

2007-12-08 09:31:24

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Rafael J. Wysocki <[email protected]> wrote:

> Subject : tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> Submitter : Ingo Molnar <[email protected]>
> References : http://lkml.org/lkml/2007/11/29/157
> http://bugzilla.kernel.org/show_bug.cgi?id=9497
> Handled-By : Matt Mackall <[email protected]>
> Patch : http://lkml.org/lkml/2007/11/29/387

Matt, the above bug is still occuring en masse during random bootups:

[ 67.828312] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[ 41.270905] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[ 82.378714] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[ 35.308126] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[ 35.484155] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()

and the patch does not seem to be upstream yet.

Ingo

2007-12-08 09:36:54

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:

> This message contains a list of some regressions from 2.6.23 which have been
> reported since 2.6.24-rc1 was released and for which there are no fixes in the
> mainline that I know of. ?If any of them have been fixed already, please let me
> know.
>
> If you know of any other unresolved regressions from 2.6.23, please let me know
> either and I'll add them to the list.
>
>
> Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
> Submitter : Hans de Bruin <[email protected]>
> References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
> Handled-By : Robert Moore <[email protected]>
> Tejun Heo <[email protected]>
> Fu Michael <[email protected]>
> Patch :
>

A number of other people are seeing the same thing and Tejun is putting in
a blacklist of machines which cannot use libata+acpi. That patch is not
yet in any git tree which I pull.

AFACIT the machines kepe working OK - there's just some nasty dmesg spew.

If any machines _are_ breaking then this could cause real problems and I'd
prefer that we either go for a whitelist or arrange to detect the condition
and fall back to non-acpi ata.

2007-12-08 09:44:09

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:

> This message contains a list of some regressions from 2.6.23 which have been
> reported since 2.6.24-rc1 was released and for which there are no fixes in the
> mainline that I know of. ?If any of them have been fixed already, please let me
> know.
>
> If you know of any other unresolved regressions from 2.6.23, please let me know
> either and I'll add them to the list.
>
> ...
>
> Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
> Submitter : Roland Dreier <[email protected]>
> References : http://lkml.org/lkml/2007/11/8/255
> http://bugzilla.kernel.org/show_bug.cgi?id=9332
> Handled-By :
> Patch :

Takashi had a patch and that has been merged. AFAIK this regression
has been fixed and we're left with a new but harmless warning.

However Roland reported other problems and it appears that the trail went
cold (http://lkml.org/lkml/2007/11/14/251)

Ted was hitting some of the same problems but that trail appears to also
have gone cold (http://lkml.org/lkml/2007/11/23/17).

Guys, can we have a status update on all of this please?

2007-12-08 09:47:04

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:

> This message contains a list of some regressions from 2.6.23 which have been
> reported since 2.6.24-rc1 was released and for which there are no fixes in the
> mainline that I know of. ?If any of them have been fixed already, please let me
> know.
>
> If you know of any other unresolved regressions from 2.6.23, please let me know
> either and I'll add them to the list.
>
>
> ..
>
> Subject : system hangs after a few minutes
> Submitter : Marcus Better <[email protected]>
> References : http://bugzilla.kernel.org/show_bug.cgi?id=9335
> Handled-By : Andrew Morton <[email protected]>
> Alan Stern <[email protected]>
> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13871&action=view
>

This one we have a confirmed fix from Alan but it doesn't appear to be in
anyone's tree.

There is a second bug in here, applicable to core x86: Marcus's machine
won't boot with nmi_watchdog=1.

2007-12-08 09:52:51

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:

> This message contains a list of some regressions from 2.6.23 which have been
> reported since 2.6.24-rc1 was released and for which there are no fixes in the
> mainline that I know of. ?If any of them have been fixed already, please let me
> know.
>
> If you know of any other unresolved regressions from 2.6.23, please let me know
> either and I'll add them to the list.
>
> ...
>
> Subject : cd/dvd inaccessible in 2.6.24-rc2
> Submitter : Will Trives <[email protected]>
> References : http://lkml.org/lkml/2007/11/9/290
> http://bugzilla.kernel.org/show_bug.cgi?id=9346
> Handled-By : Len Brown <[email protected]>
> Tejun Heo <[email protected]>
> Patch :
>

Nasty one. Tejun and several diligent reporters are doing sterling work
there and things have improved. I don't know whether any of Tejun's
patches have been merged yet, but we'll probably be OK on this one.

What is unclear (to me) is what actually caused those people's machines to
break?

2007-12-08 10:12:23

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007 10:30:39 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > Subject : tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> > Submitter : Ingo Molnar <[email protected]>
> > References : http://lkml.org/lkml/2007/11/29/157
> > http://bugzilla.kernel.org/show_bug.cgi?id=9497
> > Handled-By : Matt Mackall <[email protected]>
> > Patch : http://lkml.org/lkml/2007/11/29/387
>
> Matt, the above bug is still occuring en masse during random bootups:
>
> [ 67.828312] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> [ 41.270905] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> [ 82.378714] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> [ 35.308126] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> [ 35.484155] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
>
> and the patch does not seem to be upstream yet.
>

err, umm, OK, I'll go hunt it down and take a closer look at it.

2007-12-08 10:13:18

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hi,

On Sat, Dec 08, 2007 at 01:36:31AM -0800, Andrew Morton wrote:
> > Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
> > Submitter : Hans de Bruin <[email protected]>
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
> > Handled-By : Robert Moore <[email protected]>
> > Tejun Heo <[email protected]>
> > Fu Michael <[email protected]>
> > Patch :
> >
>
> A number of other people are seeing the same thing and Tejun is putting in
> a blacklist of machines which cannot use libata+acpi. That patch is not
> yet in any git tree which I pull.
>
> AFACIT the machines kepe working OK - there's just some nasty dmesg spew.
>
> If any machines _are_ breaking then this could cause real problems and I'd
> prefer that we either go for a whitelist or arrange to detect the condition
> and fall back to non-acpi ata.

Does this report now win me the lucky draw, pretty please? ;)

STD regression rc1 -> rc234, suspend fails completely, recovering is
pretty much useless since HDD is DEAD from this point on anyway.
Managed to capture -rc2 suspend logging via still-alive ssh session.

2.6.24-rc1 suspend/resume log, successful (well, a couple seconds delay, most likely due to
well-recovered AML failure):

swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created
Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Shrinking memory... done (0 pages freed)
Freed 0 kbytes in 0.02 seconds (0.00 MB/s)
Suspending console(s)
hub 4-0:1.0: hub_suspend
usb usb4: bus suspend
ehci_hcd 0000:00:10.3: suspend root hub
hub 3-0:1.0: hub_suspend
usb usb3: bus suspend
usb usb3: suspend_rh
hub 2-0:1.0: hub_suspend
usb usb2: bus suspend
usb usb2: suspend_rh
hub 1-0:1.0: hub_suspend
usb usb1: bus suspend
usb usb1: suspend_rh
sd 0:0:0:0: [sda] Synchronizing SCSI cache
parport_pc 00:09: disabled
serial 00:08: disabled
serial 00:07: disabled
ACPI: PCI interrupt for device 0000:00:11.5 disabled
ACPI handle has no context!
ACPI: PCI interrupt for device 0000:00:11.1 disabled
ACPI: PCI interrupt for device 0000:00:10.3 disabled
ehci_hcd 0000:00:10.3: --> PCI D3/wakeup
uhci_hcd 0000:00:10.2: uhci_suspend
ACPI: PCI interrupt for device 0000:00:10.2 disabled
uhci_hcd 0000:00:10.2: --> PCI D3
uhci_hcd 0000:00:10.1: uhci_suspend
ACPI: PCI interrupt for device 0000:00:10.1 disabled
uhci_hcd 0000:00:10.1: --> PCI D3
uhci_hcd 0000:00:10.0: uhci_suspend
ACPI: PCI interrupt for device 0000:00:10.0 disabled
uhci_hcd 0000:00:10.0: --> PCI D3
ACPI: PCI interrupt for device 0000:00:0d.0 disabled
ACPI handle has no context!
ACPI: PCI interrupt for device 0000:00:0c.0 disabled
ACPI handle has no context!
pci_set_power_state(): 0000:00:00.0: state=3, current state=5
swsusp: critical section:
swsusp: Need to copy 51195 pages
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
evxfevnt-0079 [00] enable : System is already in ACPI mode
ACPI: PCI Interrupt Link [ALKA] BIOS reported IRQ 0, using IRQ 20
ACPI: PCI Interrupt Link [ALKB] BIOS reported IRQ 0, using IRQ 21
ACPI: PCI Interrupt Link [ALKC] BIOS reported IRQ 0, using IRQ 22
ACPI: PCI Interrupt Link [ALKD] BIOS reported IRQ 0, using IRQ 23
evxfevnt-0079 [00] enable : System is already in ACPI mode
ACPI: Unable to turn cooling device [c180ff60] 'off'
PCI: Setting latency timer of device 0000:00:01.0 to 64
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[19] MMIO=[db140000-db1407ff] Max Packet=[2048] IR/IT contexts=[4/8]
ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 18 (level, low) -> IRQ 18
e100: eth-intel: e100_watchdog: link up, 100Mbps, full-duplex
PM: Writing back config space on device 0000:00:0d.0 at offset 1 (was 2100007, writing 2100003)
ACPI: PCI Interrupt 0000:00:0d.0[A] -> GSI 19 (level, low) -> IRQ 22
uhci_hcd 0000:00:10.0: PCI D0, from previous PCI D3
ACPI: PCI Interrupt 0000:00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
uhci_hcd 0000:00:10.0: uhci_resume
uhci_hcd 0000:00:10.0: uhci_check_and_reset_hc: cmd = 0x0000
uhci_hcd 0000:00:10.0: Performing full reset
usb usb1: root hub lost power or was reset
usb usb1: suspend_rh
uhci_hcd 0000:00:10.1: PCI D0, from previous PCI D3
ACPI: PCI Interrupt 0000:00:10.1[B] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
uhci_hcd 0000:00:10.1: uhci_resume
uhci_hcd 0000:00:10.1: uhci_check_and_reset_hc: cmd = 0x0000
uhci_hcd 0000:00:10.1: Performing full reset
usb usb2: root hub lost power or was reset
usb usb2: suspend_rh
uhci_hcd 0000:00:10.2: PCI D0, from previous PCI D3
ACPI: PCI Interrupt 0000:00:10.2[C] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
uhci_hcd 0000:00:10.2: uhci_resume
uhci_hcd 0000:00:10.2: uhci_check_and_reset_hc: cmd = 0x0000
uhci_hcd 0000:00:10.2: Performing full reset
usb usb3: root hub lost power or was reset
usb usb3: suspend_rh
ehci_hcd 0000:00:10.3: PCI D0, from previous PCI D3
ACPI: PCI Interrupt 0000:00:10.3[D] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
PM: Writing back config space on device 0000:00:10.3 at offset 3 (was 2008, writing 2010)
PM: Writing back config space on device 0000:00:10.3 at offset 1 (was 2100007, writing 2100017)
PM: Writing back config space on device 0000:00:11.1 at offset 1 (was 2900003, writing 2900007)
ACPI: PCI Interrupt 0000:00:11.1[A] -> Link [ALKA] -> GSI 20 (level, low) -> IRQ 17
ACPI: PCI Interrupt 0000:00:11.5[C] -> Link [ALKC] -> GSI 22 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:11.5 to 64
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
serial 00:07: activated
serial 00:08: activated
parport_pc 00:09: activated
i8042 aux 00:0a: activation failed
i8042 kbd 00:0b: activation failed
sd 0:0:0:0: [sda] Starting disk
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
ata1.01: _GTF evaluation failed (AE 0x300d)
ata1.01: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
ata1.01: _GTF evaluation failed (AE 0x300d)
ata1.01: ACPI on devcfg failed the second time, disabling (errno=-5)
ata1.01: revalidation failed (errno=1)
ata1: failed to recover some devices, retrying in 5 secs
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV0._GTF] (Node c180b840), AE_AML_PACKAGE_LIMIT
ata1.00: _GTF evaluation failed (AE 0x300d)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV0._GTF] (Node c180b840), AE_AML_PACKAGE_LIMIT
ata1.00: _GTF evaluation failed (AE 0x300d)
ata1.00: ACPI on devcfg failed the second time, disabling (errno=-5)
ata1.00: revalidation failed (errno=1)
ata1: failed to recover some devices, retrying in 5 secs
ata1.00: configured for UDMA/100
ata1.01: configured for UDMA/33
sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
usb usb1: usb resume
usb usb1: wakeup_rh
hub 1-0:1.0: trying to enable port power on non-switchable hub
usb usb2: usb resume
usb usb2: wakeup_rh
hub 2-0:1.0: trying to enable port power on non-switchable hub
usb usb3: usb resume
usb usb3: wakeup_rh
hub 3-0:1.0: trying to enable port power on non-switchable hub
usb usb4: usb resume
ehci_hcd 0000:00:10.3: resume root hub
hub 4-0:1.0: hub_resume
Restarting tasks ... <7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0006
uhci_hcd 0000:00:10.0: port 1 portsc 018a,00
hub 1-0:1.0: port 1, status 0300, change 0003, 1.5 Mb/s
done.
swsusp: Basic memory bitmaps freed
hub 1-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x300
uhci_hcd 0000:00:10.0: port 2 portsc 008a,00
hub 1-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
hub 1-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
hub 2-0:1.0: state 7 ports 2 chg 0000 evt 0006
uhci_hcd 0000:00:10.1: port 1 portsc 018a,00
hub 2-0:1.0: port 1, status 0300, change 0003, 1.5 Mb/s
hub 2-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x300
uhci_hcd 0000:00:10.1: port 2 portsc 008a,00
hub 2-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
hub 2-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0006
uhci_hcd 0000:00:10.2: port 1 portsc 008a,00
hub 3-0:1.0: port 1, status 0100, change 0003, 12 Mb/s
hub 3-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x100
uhci_hcd 0000:00:10.2: port 2 portsc 008a,00
hub 3-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
hub 3-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
hub 4-0:1.0: state 7 ports 6 chg 0000 evt 0000
hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
hub 2-0:1.0: state 7 ports 2 chg 0000 evt 0000
hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
usb usb1: suspend_rh (auto-stop)
usb usb2: suspend_rh (auto-stop)
usb usb3: suspend_rh (auto-stop)
agpgart: Found an AGP 2.0 compliant device at 0000:00:00.0.
agpgart: Putting AGP V2 device at 0000:00:00.0 into 4x mode
agpgart: Putting AGP V2 device at 0000:01:00.0 into 4x mode
[drm] Loading R200 Microcode

2.6.24-rc2 suspend log (one screenful), UNSUCCESSFUL:

serial 00:07: disabled
ACPI: PCI interrupt for device 0000:00:11.5 disabled
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTM_]
(Node c180b9a8), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN1._GTM] (Node c180b8d0), AE_AML_PACKAGE_LIMIT
ata2: ACPI get timing mode failed (AE 0x300d)
pci_device_suspend(): ata_pci_device_suspend+0x0/0x40() returns -22
suspend_device(): pci_device_suspend+0x0/0x70() returns -22
Could not suspend device 0000:00:11.1: error -22
ACPI: PCI Interrupt 0000:00:11.5[C] -> Link [ALKC] -> GSI 22 (level, low) -> IRQ 23
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
serial 00:07: activated
serial 00:08: activated
parport_pc 00:09: activated
i8042 aux 00:0a: activation failed
i8042 kbd 00:0b: activation failed
sd 0:0:0:0: [sda] Starting disk
sd 0:0:0:0: timing out command, waited 180s
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
Restarting tasks ... <7>hub 1-0:1.0: state 7 ports 6 chg 0000 evt 0000
done.
swsusp: Basic memory bitmaps freed
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created
Syncing filesystems ...

# lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
00:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
00:0a.0 Multimedia audio controller: Aureal Semiconductor Vortex 2 (rev fe)
00:0c.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0d.0 Multimedia audio controller: Aztech System Ltd 3328 Audio (rev 10)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon RV250 If [Radeon 9000] (rev 01)
01:00.1 Display controller: ATI Technologies Inc Radeon RV250 [Radeon 9000] (Secondary) (rev 01)

# dmidecode 2.9
SMBIOS 2.2 present.
39 structures occupying 1035 bytes.
Table at 0x000F0800.

Handle 0x0000, DMI type 0, 19 bytes
BIOS Information
Vendor: Award Software International, Inc.
Version: 6.00 PG
Release Date: 09/16/2003
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 512 kB
Characteristics:
ISA is supported
PCI is supported
PNP is supported
APM is supported
BIOS is upgradeable
BIOS shadowing is allowed
ESCD support is available
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/360 KB floppy services are supported (int 13h)
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 KB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
AGP is supported
LS-120 boot is supported
ATAPI Zip drive boot is supported

Handle 0x0001, DMI type 1, 25 bytes
System Information
Manufacturer: VIA Technologies, Inc.
Product Name: VT8367-8235
Version:
Serial Number:
UUID: Not Present
Wake-up Type: Power Switch

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
Manufacturer:
Product Name: VT8367-8235
Version:
Serial Number:

Handle 0x0003, DMI type 3, 13 bytes
Chassis Information
Manufacturer:
Type: Desktop
Lock: Not Present
Version:
Serial Number:
Asset Tag:
Boot-up State: Unknown
Power Supply State: Unknown
Thermal State: Unknown
Security Status: Unknown

Handle 0x0004, DMI type 4, 32 bytes
Processor Information
Socket Designation: Socket A
Type: Central Processor
Family: Duron
Manufacturer: AMD
ID: 81 06 00 00 FF FB 83 03
Signature: Family 6, Model 8, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
MMX (MMX technology supported)
FXSR (Fast floating-point save and restore)
SSE (Streaming SIMD extensions)
Version: AMD K7 processor
Voltage: 3.3 V
External Clock: 133 MHz
Max Speed: 1500 MHz
Current Speed: 1200 MHz
Status: Populated, Enabled
Upgrade: ZIF Socket
L1 Cache Handle: 0x000A
L2 Cache Handle: 0x000B
L3 Cache Handle: No L3 Cache

Handle 0x0005, DMI type 5, 24 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
None
Supported Interleave: One-way Interleave
Current Interleave: Four-way Interleave
Maximum Memory Module Size: 32 MB
Maximum Total Memory Size: 128 MB
Supported Speeds:
70 ns
60 ns
Supported Memory Types:
Standard
EDO
Memory Module Voltage: 5.0 V
Associated Memory Slots: 4
0x0006
0x0007
0x0008
0x0009
Enabled Error Correcting Capabilities: None

.
.
.

# hdparm -i /dev/sda

/dev/sda:

Model=WDC WD1200JB-00CRA1 , FwRev=17.07W17, SerialNo=WD-WCA8C4285629
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5

* signifies the current active mode

Athlon on EPOX 8K5A2+ board.

Again, 2.6.23 and 2.6.24-rc1 work, yet 2.6.24 -rc2, -rc3 and -rc4 FAIL.

Probably won't be able to do any reporting over the weekend (WOL is
inoperable ATM for some weird reason), let me know what you need.
Took too much time to gather this report already anyway ;)

Thanks,

Andreas Mohr

2007-12-08 10:20:49

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007 11:12:57 +0100 Andreas Mohr <[email protected]> wrote:

> Hi,
>
> On Sat, Dec 08, 2007 at 01:36:31AM -0800, Andrew Morton wrote:
> > > Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
> > > Submitter : Hans de Bruin <[email protected]>
> > > References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
> > > Handled-By : Robert Moore <[email protected]>
> > > Tejun Heo <[email protected]>
> > > Fu Michael <[email protected]>
> > > Patch :
> > >
> >
> > A number of other people are seeing the same thing and Tejun is putting in
> > a blacklist of machines which cannot use libata+acpi. That patch is not
> > yet in any git tree which I pull.
> >
> > AFACIT the machines kepe working OK - there's just some nasty dmesg spew.
> >
> > If any machines _are_ breaking then this could cause real problems and I'd
> > prefer that we either go for a whitelist or arrange to detect the condition
> > and fall back to non-acpi ata.
>
> Does this report now win me the lucky draw, pretty please? ;)

nah, you have to cc the acpi guys to get a prize ;)

Len&co, could you please take a look?

Andreas, please do separately report that WOL problem too..

Our list just reached 30.

> STD regression rc1 -> rc234, suspend fails completely, recovering is
> pretty much useless since HDD is DEAD from this point on anyway.
> Managed to capture -rc2 suspend logging via still-alive ssh session.
>
> 2.6.24-rc1 suspend/resume log, successful (well, a couple seconds delay, most likely due to
> well-recovered AML failure):
>
> swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
> swsusp: Basic memory bitmaps created
> Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.00 seconds) done.
> Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
> Shrinking memory... done (0 pages freed)
> Freed 0 kbytes in 0.02 seconds (0.00 MB/s)
> Suspending console(s)
> hub 4-0:1.0: hub_suspend
> usb usb4: bus suspend
> ehci_hcd 0000:00:10.3: suspend root hub
> hub 3-0:1.0: hub_suspend
> usb usb3: bus suspend
> usb usb3: suspend_rh
> hub 2-0:1.0: hub_suspend
> usb usb2: bus suspend
> usb usb2: suspend_rh
> hub 1-0:1.0: hub_suspend
> usb usb1: bus suspend
> usb usb1: suspend_rh
> sd 0:0:0:0: [sda] Synchronizing SCSI cache
> parport_pc 00:09: disabled
> serial 00:08: disabled
> serial 00:07: disabled
> ACPI: PCI interrupt for device 0000:00:11.5 disabled
> ACPI handle has no context!
> ACPI: PCI interrupt for device 0000:00:11.1 disabled
> ACPI: PCI interrupt for device 0000:00:10.3 disabled
> ehci_hcd 0000:00:10.3: --> PCI D3/wakeup
> uhci_hcd 0000:00:10.2: uhci_suspend
> ACPI: PCI interrupt for device 0000:00:10.2 disabled
> uhci_hcd 0000:00:10.2: --> PCI D3
> uhci_hcd 0000:00:10.1: uhci_suspend
> ACPI: PCI interrupt for device 0000:00:10.1 disabled
> uhci_hcd 0000:00:10.1: --> PCI D3
> uhci_hcd 0000:00:10.0: uhci_suspend
> ACPI: PCI interrupt for device 0000:00:10.0 disabled
> uhci_hcd 0000:00:10.0: --> PCI D3
> ACPI: PCI interrupt for device 0000:00:0d.0 disabled
> ACPI handle has no context!
> ACPI: PCI interrupt for device 0000:00:0c.0 disabled
> ACPI handle has no context!
> pci_set_power_state(): 0000:00:00.0: state=3, current state=5
> swsusp: critical section:
> swsusp: Need to copy 51195 pages
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#0.
> evxfevnt-0079 [00] enable : System is already in ACPI mode
> ACPI: PCI Interrupt Link [ALKA] BIOS reported IRQ 0, using IRQ 20
> ACPI: PCI Interrupt Link [ALKB] BIOS reported IRQ 0, using IRQ 21
> ACPI: PCI Interrupt Link [ALKC] BIOS reported IRQ 0, using IRQ 22
> ACPI: PCI Interrupt Link [ALKD] BIOS reported IRQ 0, using IRQ 23
> evxfevnt-0079 [00] enable : System is already in ACPI mode
> ACPI: Unable to turn cooling device [c180ff60] 'off'
> PCI: Setting latency timer of device 0000:00:01.0 to 64
> ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[19] MMIO=[db140000-db1407ff] Max Packet=[2048] IR/IT contexts=[4/8]
> ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 18 (level, low) -> IRQ 18
> e100: eth-intel: e100_watchdog: link up, 100Mbps, full-duplex
> PM: Writing back config space on device 0000:00:0d.0 at offset 1 (was 2100007, writing 2100003)
> ACPI: PCI Interrupt 0000:00:0d.0[A] -> GSI 19 (level, low) -> IRQ 22
> uhci_hcd 0000:00:10.0: PCI D0, from previous PCI D3
> ACPI: PCI Interrupt 0000:00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
> uhci_hcd 0000:00:10.0: uhci_resume
> uhci_hcd 0000:00:10.0: uhci_check_and_reset_hc: cmd = 0x0000
> uhci_hcd 0000:00:10.0: Performing full reset
> usb usb1: root hub lost power or was reset
> usb usb1: suspend_rh
> uhci_hcd 0000:00:10.1: PCI D0, from previous PCI D3
> ACPI: PCI Interrupt 0000:00:10.1[B] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
> uhci_hcd 0000:00:10.1: uhci_resume
> uhci_hcd 0000:00:10.1: uhci_check_and_reset_hc: cmd = 0x0000
> uhci_hcd 0000:00:10.1: Performing full reset
> usb usb2: root hub lost power or was reset
> usb usb2: suspend_rh
> uhci_hcd 0000:00:10.2: PCI D0, from previous PCI D3
> ACPI: PCI Interrupt 0000:00:10.2[C] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
> uhci_hcd 0000:00:10.2: uhci_resume
> uhci_hcd 0000:00:10.2: uhci_check_and_reset_hc: cmd = 0x0000
> uhci_hcd 0000:00:10.2: Performing full reset
> usb usb3: root hub lost power or was reset
> usb usb3: suspend_rh
> ehci_hcd 0000:00:10.3: PCI D0, from previous PCI D3
> ACPI: PCI Interrupt 0000:00:10.3[D] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 20
> PM: Writing back config space on device 0000:00:10.3 at offset 3 (was 2008, writing 2010)
> PM: Writing back config space on device 0000:00:10.3 at offset 1 (was 2100007, writing 2100017)
> PM: Writing back config space on device 0000:00:11.1 at offset 1 (was 2900003, writing 2900007)
> ACPI: PCI Interrupt 0000:00:11.1[A] -> Link [ALKA] -> GSI 20 (level, low) -> IRQ 17
> ACPI: PCI Interrupt 0000:00:11.5[C] -> Link [ALKC] -> GSI 22 (level, low) -> IRQ 23
> PCI: Setting latency timer of device 0000:00:11.5 to 64
> ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> serial 00:07: activated
> serial 00:08: activated
> parport_pc 00:09: activated
> i8042 aux 00:0a: activation failed
> i8042 kbd 00:0b: activation failed
> sd 0:0:0:0: [sda] Starting disk
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
> ata1.01: _GTF evaluation failed (AE 0x300d)
> ata1.01: revalidation failed (errno=-5)
> ata1: failed to recover some devices, retrying in 5 secs
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
> ata1.01: _GTF evaluation failed (AE 0x300d)
> ata1.01: ACPI on devcfg failed the second time, disabling (errno=-5)
> ata1.01: revalidation failed (errno=1)
> ata1: failed to recover some devices, retrying in 5 secs
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV0._GTF] (Node c180b840), AE_AML_PACKAGE_LIMIT
> ata1.00: _GTF evaluation failed (AE 0x300d)
> ata1.00: revalidation failed (errno=-5)
> ata1: failed to recover some devices, retrying in 5 secs
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV0._GTF] (Node c180b840), AE_AML_PACKAGE_LIMIT
> ata1.00: _GTF evaluation failed (AE 0x300d)
> ata1.00: ACPI on devcfg failed the second time, disabling (errno=-5)
> ata1.00: revalidation failed (errno=1)
> ata1: failed to recover some devices, retrying in 5 secs
> ata1.00: configured for UDMA/100
> ata1.01: configured for UDMA/33
> sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> usb usb1: usb resume
> usb usb1: wakeup_rh
> hub 1-0:1.0: trying to enable port power on non-switchable hub
> usb usb2: usb resume
> usb usb2: wakeup_rh
> hub 2-0:1.0: trying to enable port power on non-switchable hub
> usb usb3: usb resume
> usb usb3: wakeup_rh
> hub 3-0:1.0: trying to enable port power on non-switchable hub
> usb usb4: usb resume
> ehci_hcd 0000:00:10.3: resume root hub
> hub 4-0:1.0: hub_resume
> Restarting tasks ... <7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0006
> uhci_hcd 0000:00:10.0: port 1 portsc 018a,00
> hub 1-0:1.0: port 1, status 0300, change 0003, 1.5 Mb/s
> done.
> swsusp: Basic memory bitmaps freed
> hub 1-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x300
> uhci_hcd 0000:00:10.0: port 2 portsc 008a,00
> hub 1-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
> hub 1-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
> hub 2-0:1.0: state 7 ports 2 chg 0000 evt 0006
> uhci_hcd 0000:00:10.1: port 1 portsc 018a,00
> hub 2-0:1.0: port 1, status 0300, change 0003, 1.5 Mb/s
> hub 2-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x300
> uhci_hcd 0000:00:10.1: port 2 portsc 008a,00
> hub 2-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
> hub 2-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
> hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0006
> uhci_hcd 0000:00:10.2: port 1 portsc 008a,00
> hub 3-0:1.0: port 1, status 0100, change 0003, 12 Mb/s
> hub 3-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x100
> uhci_hcd 0000:00:10.2: port 2 portsc 008a,00
> hub 3-0:1.0: port 2, status 0100, change 0003, 12 Mb/s
> hub 3-0:1.0: debounce: port 2: total 100ms stable 100ms status 0x100
> hub 4-0:1.0: state 7 ports 6 chg 0000 evt 0000
> hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> hub 2-0:1.0: state 7 ports 2 chg 0000 evt 0000
> hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> usb usb1: suspend_rh (auto-stop)
> usb usb2: suspend_rh (auto-stop)
> usb usb3: suspend_rh (auto-stop)
> agpgart: Found an AGP 2.0 compliant device at 0000:00:00.0.
> agpgart: Putting AGP V2 device at 0000:00:00.0 into 4x mode
> agpgart: Putting AGP V2 device at 0000:01:00.0 into 4x mode
> [drm] Loading R200 Microcode
>
>
>
> 2.6.24-rc2 suspend log (one screenful), UNSUCCESSFUL:
>
> serial 00:07: disabled
> ACPI: PCI interrupt for device 0000:00:11.5 disabled
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTM_]
> (Node c180b9a8), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN1._GTM] (Node c180b8d0), AE_AML_PACKAGE_LIMIT
> ata2: ACPI get timing mode failed (AE 0x300d)
> pci_device_suspend(): ata_pci_device_suspend+0x0/0x40() returns -22
> suspend_device(): pci_device_suspend+0x0/0x70() returns -22
> Could not suspend device 0000:00:11.1: error -22
> ACPI: PCI Interrupt 0000:00:11.5[C] -> Link [ALKC] -> GSI 22 (level, low) -> IRQ 23
> ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> serial 00:07: activated
> serial 00:08: activated
> parport_pc 00:09: activated
> i8042 aux 00:0a: activation failed
> i8042 kbd 00:0b: activation failed
> sd 0:0:0:0: [sda] Starting disk
> sd 0:0:0:0: timing out command, waited 180s
> sd 0:0:0:0: [sda] START_STOP FAILED
> sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
> Restarting tasks ... <7>hub 1-0:1.0: state 7 ports 6 chg 0000 evt 0000
> done.
> swsusp: Basic memory bitmaps freed
> swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
> swsusp: Basic memory bitmaps created
> Syncing filesystems ...
>
>
>
> # lspci
> 00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
> 00:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
> 00:0a.0 Multimedia audio controller: Aureal Semiconductor Vortex 2 (rev fe)
> 00:0c.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
> 00:0d.0 Multimedia audio controller: Aztech System Ltd 3328 Audio (rev 10)
> 00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
> 00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
> 00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> 00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
> 00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
> 01:00.0 VGA compatible controller: ATI Technologies Inc Radeon RV250 If [Radeon 9000] (rev 01)
> 01:00.1 Display controller: ATI Technologies Inc Radeon RV250 [Radeon 9000] (Secondary) (rev 01)
>
>
> # dmidecode 2.9
> SMBIOS 2.2 present.
> 39 structures occupying 1035 bytes.
> Table at 0x000F0800.
>
> Handle 0x0000, DMI type 0, 19 bytes
> BIOS Information
> Vendor: Award Software International, Inc.
> Version: 6.00 PG
> Release Date: 09/16/2003
> Address: 0xE0000
> Runtime Size: 128 kB
> ROM Size: 512 kB
> Characteristics:
> ISA is supported
> PCI is supported
> PNP is supported
> APM is supported
> BIOS is upgradeable
> BIOS shadowing is allowed
> ESCD support is available
> Boot from CD is supported
> Selectable boot is supported
> BIOS ROM is socketed
> EDD is supported
> 5.25"/360 KB floppy services are supported (int 13h)
> 5.25"/1.2 MB floppy services are supported (int 13h)
> 3.5"/720 KB floppy services are supported (int 13h)
> 3.5"/2.88 MB floppy services are supported (int 13h)
> Print screen service is supported (int 5h)
> 8042 keyboard services are supported (int 9h)
> Serial services are supported (int 14h)
> Printer services are supported (int 17h)
> CGA/mono video services are supported (int 10h)
> ACPI is supported
> USB legacy is supported
> AGP is supported
> LS-120 boot is supported
> ATAPI Zip drive boot is supported
>
> Handle 0x0001, DMI type 1, 25 bytes
> System Information
> Manufacturer: VIA Technologies, Inc.
> Product Name: VT8367-8235
> Version:
> Serial Number:
> UUID: Not Present
> Wake-up Type: Power Switch
>
> Handle 0x0002, DMI type 2, 8 bytes
> Base Board Information
> Manufacturer:
> Product Name: VT8367-8235
> Version:
> Serial Number:
>
> Handle 0x0003, DMI type 3, 13 bytes
> Chassis Information
> Manufacturer:
> Type: Desktop
> Lock: Not Present
> Version:
> Serial Number:
> Asset Tag:
> Boot-up State: Unknown
> Power Supply State: Unknown
> Thermal State: Unknown
> Security Status: Unknown
>
> Handle 0x0004, DMI type 4, 32 bytes
> Processor Information
> Socket Designation: Socket A
> Type: Central Processor
> Family: Duron
> Manufacturer: AMD
> ID: 81 06 00 00 FF FB 83 03
> Signature: Family 6, Model 8, Stepping 1
> Flags:
> FPU (Floating-point unit on-chip)
> VME (Virtual mode extension)
> DE (Debugging extension)
> PSE (Page size extension)
> TSC (Time stamp counter)
> MSR (Model specific registers)
> PAE (Physical address extension)
> MCE (Machine check exception)
> CX8 (CMPXCHG8 instruction supported)
> APIC (On-chip APIC hardware supported)
> SEP (Fast system call)
> MTRR (Memory type range registers)
> PGE (Page global enable)
> MCA (Machine check architecture)
> CMOV (Conditional move instruction supported)
> PAT (Page attribute table)
> PSE-36 (36-bit page size extension)
> MMX (MMX technology supported)
> FXSR (Fast floating-point save and restore)
> SSE (Streaming SIMD extensions)
> Version: AMD K7 processor
> Voltage: 3.3 V
> External Clock: 133 MHz
> Max Speed: 1500 MHz
> Current Speed: 1200 MHz
> Status: Populated, Enabled
> Upgrade: ZIF Socket
> L1 Cache Handle: 0x000A
> L2 Cache Handle: 0x000B
> L3 Cache Handle: No L3 Cache
>
> Handle 0x0005, DMI type 5, 24 bytes
> Memory Controller Information
> Error Detecting Method: None
> Error Correcting Capabilities:
> None
> Supported Interleave: One-way Interleave
> Current Interleave: Four-way Interleave
> Maximum Memory Module Size: 32 MB
> Maximum Total Memory Size: 128 MB
> Supported Speeds:
> 70 ns
> 60 ns
> Supported Memory Types:
> Standard
> EDO
> Memory Module Voltage: 5.0 V
> Associated Memory Slots: 4
> 0x0006
> 0x0007
> 0x0008
> 0x0009
> Enabled Error Correcting Capabilities: None
>
> .
> .
> .
>
>
> # hdparm -i /dev/sda
>
> /dev/sda:
>
> Model=WDC WD1200JB-00CRA1 , FwRev=17.07W17, SerialNo=WD-WCA8C4285629
> Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
> RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
> BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=?16?
> CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
> IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
> PIO modes: pio0 pio1 pio2 pio3 pio4
> DMA modes: mdma0 mdma1 mdma2
> UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
> AdvancedPM=no WriteCache=enabled
> Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5
>
> * signifies the current active mode
>
>
>
> Athlon on EPOX 8K5A2+ board.
>
>
>
> Again, 2.6.23 and 2.6.24-rc1 work, yet 2.6.24 -rc2, -rc3 and -rc4 FAIL.
>
> Probably won't be able to do any reporting over the weekend (WOL is
> inoperable ATM for some weird reason), let me know what you need.
> Took too much time to gather this report already anyway ;)
>
> Thanks,
>
> Andreas Mohr

2007-12-08 10:38:18

by Matthew Garrett

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
> On Sat, 8 Dec 2007 11:12:57 +0100 Andreas Mohr <[email protected]> wrote:
> > ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> > ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
> > ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
> > ata1.01: _GTF evaluation failed (AE 0x300d)

037f6bb79f753c014bc84bca0de9bf98bb5ab169 ought to have fixed this?

--
Matthew Garrett | [email protected]

2007-12-08 10:45:55

by Richard Purdie

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 2007-12-08 at 03:40 +0100, Rafael J. Wysocki wrote:
> Subject : leds: ledtrig-timer calls sleeping function from invalid context
> Submitter : M?rton N?meth <[email protected]>
> References : http://bugzilla.kernel.org/show_bug.cgi?id=9264
> Handled-By : Richard Purdie <[email protected]>
> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view

The fix is now in mainline:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc47206e552c0850ad11f7e9a1fca0a3c92f5d65

Cheers,

Richard

2007-12-08 10:56:17

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hi,

On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
> > Does this report now win me the lucky draw, pretty please? ;)
>
> nah, you have to cc the acpi guys to get a prize ;)

Thought so shortly, but missed it.

> Andreas, please do separately report that WOL problem too..

Local setup issue only, at least this one *isn't* a 2.6.24-rc regression. ;)

> Our list just reached 30.

Oh, so this is in fact a separate issue? Wasn't sure, couldn't do
enough analysis of similar cases.

Will test any (already submitted!) suggestions ASAP.

Andreas Mohr

2007-12-08 15:49:35

by Alan Stern

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, 8 Dec 2007, Andrew Morton wrote:

> On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:
>
> > This message contains a list of some regressions from 2.6.23 which have been
> > reported since 2.6.24-rc1 was released and for which there are no fixes in the
> > mainline that I know of. ?If any of them have been fixed already, please let me
> > know.
> >
> > If you know of any other unresolved regressions from 2.6.23, please let me know
> > either and I'll add them to the list.
> >
> >
> > ..
> >
> > Subject : system hangs after a few minutes
> > Submitter : Marcus Better <[email protected]>
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=9335
> > Handled-By : Andrew Morton <[email protected]>
> > Alan Stern <[email protected]>
> > Patch : http://bugzilla.kernel.org/attachment.cgi?id=13871&action=view
> >
>
> This one we have a confirmed fix from Alan but it doesn't appear to be in
> anyone's tree.

An expanded version of that fix is in Greg's queue:

http://marc.info/?l=linux-usb-devel&m=119697043410947&w=2

Since he's away until Tuesday, nothing will happen for a few days.
However you might want to replace the old fix that got added to -mm.

> There is a second bug in here, applicable to core x86: Marcus's machine
> won't boot with nmi_watchdog=1.

Alan Stern

2007-12-08 16:38:50

by Matt Mackall

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, Dec 08, 2007 at 10:30:39AM +0100, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > Subject : tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
> > Submitter : Ingo Molnar <[email protected]>
> > References : http://lkml.org/lkml/2007/11/29/157
> > http://bugzilla.kernel.org/show_bug.cgi?id=9497
> > Handled-By : Matt Mackall <[email protected]>
> > Patch : http://lkml.org/lkml/2007/11/29/387
>
> Matt, the above bug is still occuring en masse during random bootups:
>

I was hoping for some discussion about whether it was the best fix.
The current kzalloc thing strikes me as a step backwards for all
allocators. We'd do better to have a single non-inline kzalloc
function rather than an extra branch in the normal kmalloc fast path.

But here's the patch again, with my sign-off:

Avoid calling page allocator with __GFP_ZERO, as we might be in atomic
context and this will make thing unhappy on highmem systems. Instead,
manually zero allocations from the page allocator.

Signed-off-by: Matt Mackall <[email protected]>

diff -r f7edf7226317 mm/slob.c
--- a/mm/slob.c Wed Dec 05 15:57:06 2007 -0600
+++ b/mm/slob.c Wed Dec 05 15:57:51 2007 -0600
@@ -223,6 +231,7 @@ static void *slob_new_page(gfp_t gfp, in
{
void *page;

+ gfp &= ~__GFP_ZERO;
#ifdef CONFIG_NUMA
if (node != -1)
page = alloc_pages_node(node, gfp, order);
@@ -457,6 +470,8 @@ void *__kmalloc_node(size_t size, gfp_t
page = virt_to_page(ret);
page->private = size;
}
+ if (unlikely((gfp & __GFP_ZERO) && ret))
+ memset(ret, 0, size);
return ret;
}
}

--
Mathematics is the supreme nostalgia of our time.

2007-12-08 17:49:18

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007, Matt Mackall wrote:
>
> Avoid calling page allocator with __GFP_ZERO, as we might be in atomic
> context and this will make thing unhappy on highmem systems. Instead,
> manually zero allocations from the page allocator.

I think this is fine, but didn't we fix the warning already? Calling page
allocators with __GFP_ZERO should be fine, as long as __GFP_HIGHMEM isn't
set, and slab/slub/slob/kmalloc cannot use GFP_HIGHMEM *anyway*.

But I'll apply it anyway, because it looks "obviously correct" from the
standpoint that the _other_ slob user already clears the end result
explicitly later on, and we simply should never pass down __GFP_ZERO to
the actual page allocator.

On that note, shouldn't we also do this for slub.c? Christoph?

Linus

---
mm/slub.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b9f37cb..9c1d9f3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1468,6 +1468,9 @@ static void *__slab_alloc(struct kmem_cache *s,
void **object;
struct page *new;

+ /* We handle __GFP_ZERO in the caller */
+ gfpflags &= ~__GFP_ZERO;
+
if (!c->page)
goto new_slab;

2007-12-08 17:55:26

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007, Linus Torvalds wrote:
>
> But I'll apply it anyway, because it looks "obviously correct" from the
> standpoint that the _other_ slob user already clears the end result
> explicitly later on, and we simply should never pass down __GFP_ZERO to
> the actual page allocator.

Actually, I take that back. The other slob users are different. They share
pages, this codepath does not.

So I think a more proper solution would be:
(a) Something like this patch (which includes my previous mm/slub.c
change)
(b) don't warn about atomic GFP_ZERO's - unless they have GFP_HIGHMEM set
*too*.

So which warning is it that triggers the bogus error?

Linus
---
mm/slob.c | 2 +-
mm/slub.c | 3 +++
2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/slob.c b/mm/slob.c
index ee2ef8a..773a7aa 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -330,7 +330,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)

/* Not enough space: must allocate a new page */
if (!b) {
- b = slob_new_page(gfp, 0, node);
+ b = slob_new_page(gfp & ~__GFP_ZERO, 0, node);
if (!b)
return 0;
sp = (struct slob_page *)virt_to_page(b);
diff --git a/mm/slub.c b/mm/slub.c
index b9f37cb..9c1d9f3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1468,6 +1468,9 @@ static void *__slab_alloc(struct kmem_cache *s,
void **object;
struct page *new;

+ /* We handle __GFP_ZERO in the caller */
+ gfpflags &= ~__GFP_ZERO;
+
if (!c->page)
goto new_slab;

2007-12-08 18:10:26

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007 09:54:06 -0800 (PST) Linus Torvalds <[email protected]> wrote:

> On Sat, 8 Dec 2007, Linus Torvalds wrote:
> >
> > But I'll apply it anyway, because it looks "obviously correct" from the
> > standpoint that the _other___slob user already clears the end result
> > explicitly later on, and we simply should never pass down __GFP_ZERO to
> > the actual page allocator.
>
> Actually, I take that back. The other slob users are different. They share
> pages, this codepath does not.
>
> So I think a more proper solution would be:
> (a) Something like this patch (which includes my previous mm/slub.c
> change)
> (b) don't warn about atomic GFP_ZERO's - unless they have GFP_HIGHMEM set
> *too*.
>
> So which warning is it that triggers the bogus error?

It's a kmap_atomic() debugging patch which I wrote ages ago and whcih Ingo
sucked into his tree. I don't _think_ this warning is present in your tree
at all.

http://lkml.org/lkml/2007/11/29/157 is where it starts.

I had a lenghty back-and-forth with Christoph on this within the past
couple of months and I cannot locate the thread and I don't recall what the
upshot was and Christoph is still offline.

Knocking out __GFP_ZERO at the point where the slab allocator(s) call the
page allocator seems like a good approach to me.

But I don't think we need to do anything for 2.6.24..

2007-12-08 18:25:26

by Robert Hancock

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Matthew Garrett wrote:
> On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
>> On Sat, 8 Dec 2007 11:12:57 +0100 Andreas Mohr <[email protected]> wrote:
>>> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
>>> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
>>> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
>>> ata1.01: _GTF evaluation failed (AE 0x300d)
>
> 037f6bb79f753c014bc84bca0de9bf98bb5ab169 ought to have fixed this?
>

I should think it should have.

I think we're too aggressive about disabling the libata ACPI support,
even. One of my laptop's _GTF commands on resume is a DEVICE
CONFIGURATION FREEZE LOCK command, which gets rejected by the drive
(maybe it worked on the original Hitachi disk, but I've upgraded it to a
newer Samsung). I'd say if the drive returns command aborted on one of
these, we should just ignore that command and continue to the next one
without trying to retry or disabling the ACPI support entirely.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2007-12-08 18:34:37

by Matt Mackall

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

2007-12-08 18:38:38

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007, Andrew Morton wrote:
>
> > So which warning is it that triggers the bogus error?
>
> It's a kmap_atomic() debugging patch which I wrote ages ago and whcih Ingo
> sucked into his tree. I don't _think_ this warning is present in your tree
> at all.

Ok, that explains it.

> Knocking out __GFP_ZERO at the point where the slab allocator(s) call the
> page allocator seems like a good approach to me.
>
> But I don't think we need to do anything for 2.6.24..

Good. Although we should perhaps look at that reported performance problem
with SLUB. It looks like SLUB will do a memclear() for the area twice
(first for the whole page, then for the thing it allocated) for the slow
case. Maybe that exacerbates the problem.

Linus

2007-12-08 18:58:16

by Roland Dreier

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> > Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
> > Submitter : Roland Dreier <[email protected]>
> > References : http://lkml.org/lkml/2007/11/8/255
> > http://bugzilla.kernel.org/show_bug.cgi?id=9332
> > Handled-By :
> > Patch :
>
> Takashi had a patch and that has been merged. AFAIK this regression
> has been fixed and we're left with a new but harmless warning.
>
> However Roland reported other problems and it appears that the trail went
> cold (http://lkml.org/lkml/2007/11/14/251)

A fix for the most likely cause of this problem was merged (7eba5c9d
"[ALSA] hda-codec - Check PINCAP only for PIN widgets") but it seems
that setting CONFIG_SND_HDA_POWER_SAVE can cause the "azx_get_response
timeout, switching to polling mode" message sometimes too. However
according to Takashi this is really just a cosmetic problem -- polling
mode is not so bad.

- R.

2007-12-08 19:01:18

by Matt Mackall

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, Dec 08, 2007 at 09:54:06AM -0800, Linus Torvalds wrote:
>
>
> On Sat, 8 Dec 2007, Linus Torvalds wrote:
> >
> > But I'll apply it anyway, because it looks "obviously correct" from the
> > standpoint that the _other_??slob user already clears the end result
> > explicitly later on, and we simply should never pass down __GFP_ZERO to
> > the actual page allocator.
>
> Actually, I take that back. The other slob users are different. They share
> pages, this codepath does not.
>
> So I think a more proper solution would be:
> (a) Something like this patch (which includes my previous mm/slub.c
> change)
> (b) don't warn about atomic GFP_ZERO's - unless they have GFP_HIGHMEM set
> *too*.
>
> So which warning is it that triggers the bogus error?

While I think the whole GFP_ZERO thing is a bit broken here, this is
an improvement on my original patch, so:

Acked-by: Matt Mackall <[email protected]>
> ---
> mm/slob.c | 2 +-
> mm/slub.c | 3 +++
> 2 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/slob.c b/mm/slob.c
> index ee2ef8a..773a7aa 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -330,7 +330,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
>
> /* Not enough space: must allocate a new page */
> if (!b) {
> - b = slob_new_page(gfp, 0, node);
> + b = slob_new_page(gfp & ~__GFP_ZERO, 0, node);
> if (!b)
> return 0;
> sp = (struct slob_page *)virt_to_page(b);
> diff --git a/mm/slub.c b/mm/slub.c
> index b9f37cb..9c1d9f3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1468,6 +1468,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> void **object;
> struct page *new;
>
> + /* We handle __GFP_ZERO in the caller */
> + gfpflags &= ~__GFP_ZERO;
> +
> if (!c->page)
> goto new_slab;

--
Mathematics is the supreme nostalgia of our time.

2007-12-08 19:41:05

by Theodore Ts'o

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, Dec 08, 2007 at 01:42:41AM -0800, Andrew Morton wrote:
> > Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
> > Submitter : Roland Dreier <[email protected]>
> > References : http://lkml.org/lkml/2007/11/8/255
> > http://bugzilla.kernel.org/show_bug.cgi?id=9332
> > Handled-By :
> > Patch :
>
> Takashi had a patch and that has been merged. AFAIK this regression
> has been fixed and we're left with a new but harmless warning.
>
> However Roland reported other problems and it appears that the trail went
> cold (http://lkml.org/lkml/2007/11/14/251)
>
> Ted was hitting some of the same problems but that trail appears to also
> have gone cold (http://lkml.org/lkml/2007/11/23/17).

Actually, not gone cold, but I stopped posting about it because it's
been solved and I thought agreement had been reached that it should be
pushed to mainline before 2.6.25.

I am very happily running with Ingo's "snd hda suspend latency:
shorten codec read" patch, which was originally intended to speed up
resuming from hibernation, but which as I discovered, also has the
nice side effect of eliminating the reported error.

On 11/23, Takashi replied to my note (http://lkml.org/lkml/2007/11/23/17)
and suggested that Jaroslav push this patch to Linus immediately
instead of waiting for 2.6.25, since it appearly solves two problems
with one stone. However, I just checked, as of Linus's public, and
Ingo's patch is *not* in mainline.

However, as far as I am concerned, Ingo's patch, first posted to LKML
here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
the above regression. Rafael, could you please make a note of this in
your regression list, and could we please get this patch pushed into
mainline?

Thanks!!

- Ted

2007-12-08 19:52:51

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Linus Torvalds <[email protected]> wrote:

> > But I don't think we need to do anything for 2.6.24..
>
> Good. Although we should perhaps look at that reported performance
> problem with SLUB. It looks like SLUB will do a memclear() for the
> area twice (first for the whole page, then for the thing it allocated)
> for the slow case. Maybe that exacerbates the problem.

i dont think the SLUB problem could be explained purely via a double
memset(). [which ought to be extremely fast anyway] We are talking about
a 10 times slowdown on a 64-way box of a workload that is fairly
common-sense. (tasks sending messages to each other via bog standard
means)

while i dont want to jump to conclusions without looking at some
profiles, i think the SLUB performance regression is indicative of the
following fallacy: "SLAB can be done significantly simpler while keeping
the same performance".

I couldnt point to any particular aspect of SLAB that i could
characterise as "needless bloat".

the SLUB concept is proudly outlined in init/Kconfig:

config SLUB
bool "SLUB (Unqueued Allocator)"
help
SLUB is a slab allocator that minimizes cache line usage
instead of managing queues of cached objects (SLAB approach).
Per cpu caching is realized using slabs of objects instead
of queues of objects. SLUB can use memory efficiently
and has enhanced diagnostics.

but that's not true anymore - the two concepts are pretty much
equivalent, after all the "performance tuning" that went on in SLUB.
(read: 'frantically try to catch up with SLAB in benchmarks')

so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and
SLUB have essentially the same linecount:

$ wc -l mm/slab.c mm/slub.c
4478 mm/slab.c
4125 mm/slub.c

(and while linecount != complexity, there is a strong relationship.)

With SLAB having 10 years more test coverage and tuning.

the messiest and most fragile aspect of SLAB that i can think of is its
bootstrap hacks - but that is an entirely unimportant detail in my
opinion. SLAB has been cleaned up significantly in the past few years by
Pekka Enberg & co, it's pretty pleasant and straightforward code these
days.

I think we should we make SLAB the default for v2.6.24 ...

Ingo

2007-12-08 19:56:39

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Theodore Tso <[email protected]> wrote:

> I am very happily running with Ingo's "snd hda suspend latency:
> shorten codec read" patch, which was originally intended to speed up
> resuming from hibernation, but which as I discovered, also has the
> nice side effect of eliminating the reported error.
>
> On 11/23, Takashi replied to my note
> (http://lkml.org/lkml/2007/11/23/17) and suggested that Jaroslav push
> this patch to Linus immediately instead of waiting for 2.6.25, since
> it appearly solves two problems with one stone. However, I just
> checked, as of Linus's public, and Ingo's patch is *not* in mainline.
>
> However, as far as I am concerned, Ingo's patch, first posted to LKML
> here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
> the above regression. Rafael, could you please make a note of this in
> your regression list, and could we please get this patch pushed into
> mainline?

ha! I'd never have expected _that_ to happen. Cool. Fixing a driver bug
by accident :-)

Ingo

2007-12-08 20:30:25

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Ingo Molnar <[email protected]> wrote:

> the SLUB concept is proudly outlined in init/Kconfig:
>
> config SLUB
> bool "SLUB (Unqueued Allocator)"
> help
> SLUB is a slab allocator that minimizes cache line usage
> instead of managing queues of cached objects (SLAB approach).
> Per cpu caching is realized using slabs of objects instead
> of queues of objects. SLUB can use memory efficiently
> and has enhanced diagnostics.
>
> but that's not true anymore - the two concepts are pretty much
> equivalent, after all the "performance tuning" that went on in SLUB.
> (read: 'frantically try to catch up with SLAB in benchmarks')

just to hammer this point home - and while Christoph Lameter isnt online
currently to defend SLUB, this issue did come up and i guess there must
be other MM hackers that are advocates of SLUB. (otherwise this code
couldnt be upstream)

I find SLUB a fantastically confusing concept, and that starts at the
name. "Unqueued" it says. Lets take a look at the actual code:

mm/slub.c:

static void __always_inline *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, void *addr)
{
...
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
...
else {
object = c->freelist;
c->freelist = object[c->offset];
}
local_irq_restore(flags);

so it has a "free list", which is clearly per cpu. Hang on! Isnt that
actually a per CPU queue? Which SLUB has not, we are told? The "U" in
SLUB. How on earth can an allocator in 2007 claim to have no queuing
(which is in essence caching)? Am i on crack with this? Did i miss
something really obvious?

Ingo

2007-12-08 21:52:58

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Saturday, 8 of December 2007, Andrew Morton wrote:
> On Sat, 8 Dec 2007 09:28:15 +0100 Ingo Molnar <[email protected]> wrote:
>
> >
> > * Fabio Comolli <[email protected]> wrote:
> >
> > > <snip>
> > >
> > > > Subject : Battery shows up twice in kpowersave
> > > > Submitter : Rolf Eike Beer <[email protected]>
> > > > References : http://bugzilla.kernel.org/show_bug.cgi?id=9494
> > > > Handled-By : Alexey Starikovskiy <[email protected]>
> > > > Patch :
> > > >
> > >
> > > I don't think that this is a regression: I reported on RedHat bugzilla
> > > when I switched from F7 to F8 and I was using 2.6.23.8 at that time.
> > > It looks to me an HAL regression, but of course I may be wrong :-) as
> > > the reported bisected to a bad commit.
> > >
> > > https://bugzilla.redhat.com/show_bug.cgi?id=373041
> > >
> > > By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5
> > > custom kernel and Gnome desktop and the problem is still present, even
> > > with gnome-power-manager.
> >
> > to me this looks like an ABI regression - utilities should work without
> > change. Something changed in /sys output that caused HAL to think that
> > there are two batteries:
>
> Yep. Although HAL is of course a most special case of "userspace".
>
> > | The output of lshal shows that there are two UDI's with
> > | info.capabilities = { 'battery' }:
> > |
> > | udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
> > | udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'
> >
> > whether it's a HAL bug or a kernel bug, the original state should be
> > restored and it should be worked out without breaking users of older HAL
> > versions.
>
> "breaking users of older HAL versions" == "breaking machines".
>
> The patch should be reverted. Do we know which one it was?
>
> > grumble: way too many times do various system utilities break when i
> > upgrade the kernel on my laptop. Maybe a new debug mechanism: we should
> > start fingerprinting the exact /sys and /proc output and enforce that
> > it's immutable across kernel releases as long as the hardware is
> > unmodified?
>
> That would be neat. It would need to be executed on a lot of different
> machines.

Hm, that wouldn't allow us to add new attributes ...

2007-12-08 21:58:53

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Saturday, 8 of December 2007, Andrew Morton wrote:
> On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:
>
> > This message contains a list of some regressions from 2.6.23 which have been
> > reported since 2.6.24-rc1 was released and for which there are no fixes in the
> > mainline that I know of. ?If any of them have been fixed already, please let me
> > know.
> >
> > If you know of any other unresolved regressions from 2.6.23, please let me know
> > either and I'll add them to the list.
>
> Twenty nine, huh?
>
> It would be useful if these records were sorted in date-of-reportage order
> and had a date stamp so we could see how long they've been hanging about.

They are sorted by the bugzilla number which reflects the date-of-reportage
order pretty well. For a techincal reason, it's easier to me if they're sorted
like this.

Adding date stamps should be easy, tough, I'll try to add them to the next
report.

> Something to think about for the post-2.6.24 regression if you'll be handling
> those?

Yes, I'm going to handle the post-2.6.24 regressions too (in the hope there
will be less of them ;-)).

> > Subject : leds: ledtrig-timer calls sleeping function from invalid context
> > Submitter : M?rton N?meth <[email protected]>
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=9264
> > Handled-By : Richard Purdie <[email protected]>
> > Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view
>
> That patch has been merged (dc47206e552c0850ad11f7e9a1fca0a3c92f5d65) and
> assuming M?rton has tested the latest git snapshot
> (ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots) successfully we can
> cross it off?

Yes, will drop.

Thanks,
Rafael

2007-12-08 22:12:16

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Saturday, 8 of December 2007, Theodore Tso wrote:
> On Sat, Dec 08, 2007 at 01:42:41AM -0800, Andrew Morton wrote:
> > > Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
> > > Submitter : Roland Dreier <[email protected]>
> > > References : http://lkml.org/lkml/2007/11/8/255
> > > http://bugzilla.kernel.org/show_bug.cgi?id=9332
> > > Handled-By :
> > > Patch :
> >
> > Takashi had a patch and that has been merged. AFAIK this regression
> > has been fixed and we're left with a new but harmless warning.
> >
> > However Roland reported other problems and it appears that the trail went
> > cold (http://lkml.org/lkml/2007/11/14/251)
> >
> > Ted was hitting some of the same problems but that trail appears to also
> > have gone cold (http://lkml.org/lkml/2007/11/23/17).
>
> Actually, not gone cold, but I stopped posting about it because it's
> been solved and I thought agreement had been reached that it should be
> pushed to mainline before 2.6.25.
>
> I am very happily running with Ingo's "snd hda suspend latency:
> shorten codec read" patch, which was originally intended to speed up
> resuming from hibernation, but which as I discovered, also has the
> nice side effect of eliminating the reported error.
>
> On 11/23, Takashi replied to my note (http://lkml.org/lkml/2007/11/23/17)
> and suggested that Jaroslav push this patch to Linus immediately
> instead of waiting for 2.6.25, since it appearly solves two problems
> with one stone. However, I just checked, as of Linus's public, and
> Ingo's patch is *not* in mainline.
>
> However, as far as I am concerned, Ingo's patch, first posted to LKML
> here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
> the above regression. Rafael, could you please make a note of this in
> your regression list,

Done, thanks.

> and could we please get this patch pushed into mainline?

2007-12-08 22:13:48

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Saturday, 8 of December 2007, Richard Purdie wrote:
> On Sat, 2007-12-08 at 03:40 +0100, Rafael J. Wysocki wrote:
> > Subject : leds: ledtrig-timer calls sleeping function from invalid context
> > Submitter : M?rton N?meth <[email protected]>
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=9264
> > Handled-By : Richard Purdie <[email protected]>
> > Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view
>
> The fix is now in mainline:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc47206e552c0850ad11f7e9a1fca0a3c92f5d65

Yes, already dropped.

Thanks,
Rafael

2007-12-09 02:15:56

by Theodore Ts'o

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sat, Dec 08, 2007 at 11:30:53PM +0100, Rafael J. Wysocki wrote:
> On Saturday, 8 of December 2007, Theodore Tso wrote:
> > However, as far as I am concerned, Ingo's patch, first posted to LKML
> > here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
> > the above regression. Rafael, could you please make a note of this in
> > your regression list,
>
> Done, thanks.

Great, thanks. I should add that technically this wasn't a regression
since I had been seeing this since before 2.6.23. Also, it isn't a
big deal, since aside from noise in the syslog, falling back to
polling more doesn't make any functional or user-visible difference
(although I guess it's less efficient).

Regardless of whether it is a regression, it would be nice to get the
patch applied and and this issue fixed for 2.6.25!

- Ted

2007-12-09 05:59:55

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Robert Hancock wrote:
> Matthew Garrett wrote:
>> On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
>>> On Sat, 8 Dec 2007 11:12:57 +0100 Andreas Mohr <[email protected]> wrote:
>>>> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index
>>>> (0FFFFFFFF) is beyond end of object [20070126]
>>>> ACPI Error (psparse-0537): Method parse/execution failed
>>>> [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
>>>> ACPI Error (psparse-0537): Method parse/execution failed
>>>> [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
>>>> ata1.01: _GTF evaluation failed (AE 0x300d)
>>
>> 037f6bb79f753c014bc84bca0de9bf98bb5ab169 ought to have fixed this?
>>
>
> I should think it should have.
>
> I think we're too aggressive about disabling the libata ACPI support,
> even. One of my laptop's _GTF commands on resume is a DEVICE
> CONFIGURATION FREEZE LOCK command, which gets rejected by the drive
> (maybe it worked on the original Hitachi disk, but I've upgraded it to a
> newer Samsung). I'd say if the drive returns command aborted on one of
> these, we should just ignore that command and continue to the next one
> without trying to retry or disabling the ACPI support entirely.

Yeap, my pending patchset does exactly that. It's currently being
tested by but reporters. I'll soon post the patchset.

Thanks.

--
tejun

2007-12-09 06:53:18

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hello,

Andrew Morton wrote:
>> Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
>> Submitter : Hans de Bruin <[email protected]>
>> References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
>> Handled-By : Robert Moore <[email protected]>
>> Tejun Heo <[email protected]>
>> Fu Michael <[email protected]>
>> Patch :
>>
>
> A number of other people are seeing the same thing and Tejun is
> putting in a blacklist of machines which cannot use libata+acpi.
> That patch is not yet in any git tree which I pull.
>
> AFACIT the machines kepe working OK - there's just some nasty dmesg
> spew.
>
> If any machines _are_ breaking then this could cause real problems
> and I'd prefer that we either go for a whitelist or arrange to
> detect the condition and fall back to non-acpi ata.

The pending patchset should make ATA ACPI quite resistant to failures.
Known bad boards can be blacklisted (currently only one is on the
list), ATA ACPI is disabled quicker if ACPI evalution fails, execution
errors are handled better and commands which are intended to help the
vendor instead of the user are filtered. So, I think we have enough
safety nets.

Thanks.

--
tejun

2007-12-09 07:00:21

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hello, (cc'ing Alan)

Andrew Morton wrote:
>> Subject : cd/dvd inaccessible in 2.6.24-rc2
>> Submitter : Will Trives <[email protected]>
>> References : http://lkml.org/lkml/2007/11/9/290
>> http://bugzilla.kernel.org/show_bug.cgi?id=9346
>> Handled-By : Len Brown <[email protected]>
>> Tejun Heo <[email protected]>
>> Patch :
>>
>
> Nasty one. Tejun and several diligent reporters are doing sterling
> work there and things have improved. I don't know whether any of
> Tejun's patches have been merged yet, but we'll probably be OK on
> this one.

I'm still trying to find out what's really going on. That drive is
quite peculiar.

> What is unclear (to me) is what actually caused those people's machines to
> break?

It's introduced by setting ATAPI transfer chunk size to actual
transfer size which is the right thing to do generally. However, with
the change, the ATAPI HSM should be ready to drain full extra transfer
chunks which libata HSM wasn't doing. With that part changed, most
regressions should go away.

Unfortunately, simply adding that doesn't fix the case in bug 9346 and
I'm still trying to find out why. The good news is that the drive
works fine with proposed more extensive improvements to libata ATAPI
which will probably be included into 2.6.25, so we at least have long
term solution.

If we fail to find out the solution in time, we always have the
alternative of backing out the ATAPI transfer chunk size update. This
will break some other cases which were fixed by the change but those
won't be regressions at least and we can add transfer chunk size
update with other changes to 2.6.25.

Thanks.

--
tejun

2007-12-09 07:59:38

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Andrew Morton <[email protected]> wrote:

> It's a kmap_atomic() debugging patch which I wrote ages ago and whcih
> Ingo sucked into his tree. I don't _think_ this warning is present in
> your tree at all.
>
> http://lkml.org/lkml/2007/11/29/157 is where it starts.
>
> I had a lenghty back-and-forth with Christoph on this within the past
> couple of months and I cannot locate the thread and I don't recall
> what the upshot was and Christoph is still offline.
>
> Knocking out __GFP_ZERO at the point where the slab allocator(s) call
> the page allocator seems like a good approach to me.
>
> But I don't think we need to do anything for 2.6.24..

ok - Rafael, please strike this off the regressions list, there is no
problem in .24 other than the double memset for some SLOB metadata.

Ingo

2007-12-09 08:20:30

by Pekka Enberg

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

Hi Ingo,

On Dec 8, 2007 10:29 PM, Ingo Molnar <[email protected]> wrote:
> so it has a "free list", which is clearly per cpu. Hang on! Isnt that
> actually a per CPU queue? Which SLUB has not, we are told? The "U" in
> SLUB. How on earth can an allocator in 2007 claim to have no queuing
> (which is in essence caching)? Am i on crack with this? Did i miss
> something really obvious?

I think you did. The difference is explained in Christoph's announcement:

"A particular concern was the complex management of the numerous
object queues in SLAB. SLUB has no such queues. Instead we dedicate a
slab for each allocating CPU and use objects from a slab directly
instead of queueing them up."

Which, I think, is where SLUB gets its name from (the "unqueued" part).

Now, while SLAB code is "pleasant and straightforward code" (thanks,
btw) for UMA, it's really hairy for NUMA plus the "alien caches" eat
tons of memory (which is why Christoph wrote SLUB in the first place,
the current code in SLAB is mostly unfixable due to its *queuing*
nature).

I don't object changing the default to CONFIG_SLAB but it's not really
a long term strategy unless we want to have three kmalloc's in the
kernel: one for embedded, one for UMA, and one NUMA.

Pekka

2007-12-09 08:33:20

by Pekka Enberg

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

Hi Linus,

On Dec 8, 2007 7:54 PM, Linus Torvalds <[email protected]> wrote:
> diff --git a/mm/slob.c b/mm/slob.c
> index ee2ef8a..773a7aa 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -330,7 +330,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
>
> /* Not enough space: must allocate a new page */
> if (!b) {
> - b = slob_new_page(gfp, 0, node);
> + b = slob_new_page(gfp & ~__GFP_ZERO, 0, node);
> if (!b)
> return 0;
> sp = (struct slob_page *)virt_to_page(b);
>
> diff --git a/mm/slub.c b/mm/slub.c
> index b9f37cb..9c1d9f3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1468,6 +1468,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> void **object;
> struct page *new;
>
> + /* We handle __GFP_ZERO in the caller */
> + gfpflags &= ~__GFP_ZERO;
> +
> if (!c->page)
> goto new_slab;

In case you didn't already merge this:

Reviewed-by: Pekka Enberg <[email protected]>

2007-12-09 08:51:15

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Pekka Enberg <[email protected]> wrote:

> Hi Ingo,
>
> On Dec 8, 2007 10:29 PM, Ingo Molnar <[email protected]> wrote:
> > so it has a "free list", which is clearly per cpu. Hang on! Isnt that
> > actually a per CPU queue? Which SLUB has not, we are told? The "U" in
> > SLUB. How on earth can an allocator in 2007 claim to have no queuing
> > (which is in essence caching)? Am i on crack with this? Did i miss
> > something really obvious?
>
> I think you did. The difference is explained in Christoph's
> announcement:
>
> "A particular concern was the complex management of the numerous
> object queues in SLAB. SLUB has no such queues. Instead we dedicate a
> slab for each allocating CPU and use objects from a slab directly
> instead of queueing them up."
>
> Which, I think, is where SLUB gets its name from (the "unqueued"
> part).

yes, i understand the initial announcement (and the Kconfig entry still
says the same), but that is not matched up by the reality i see in the
actual code - SLUB clearly uses a queue/list of objects (as cited in my
previous mail), for obvious performance reasons.

unless i'm missing something obvious (and i easily might), i see SLUB as
SLAB reimplemented with a different queueing model. Not "without
queueing".

> Now, while SLAB code is "pleasant and straightforward code" (thanks,
> btw) for UMA, it's really hairy for NUMA plus the "alien caches" eat
> tons of memory (which is why Christoph wrote SLUB in the first place,
> the current code in SLAB is mostly unfixable due to its *queuing*
> nature).

i'm curious about the real facts behind this "alien cache problem". I
heard about it and asked around and was told that there's some sort of
bad quadratic behavior of memory consumption on NUMA - but i cannot
actually see that in the code.

The alien caches feature of SLAB i see as a spread out clustered
index/cache of objects on other nodes. It's not increasing the average
per object memory consumption per se! The number of alien caches
increases with increasing number of nodes, but _of course_, as memory
size increases too so there's more stuff and a larger expected spreadout
of memory to keep track of. ("Fixing" that would be like reintroducing a
single runqueue for the scheduler, based on the argument that it's an
O(1) number of runqueues against O(N) number of runqueues - which would
be complete nonsense.)

so i see SLAB alien caches as an automatic self-partitioning mechanism
... which has complexities but which also has _obvious_ performance
benefits. Yes, it has some disadvantages like all caching schemes do -
there's more cached memory tied up in the allocator at any given moment
- but arguing against that would be like arguing against a 2MB L2 cache
purely on the basis that a 1MB L2 cache is smaller and hence more
space-efficient. Caches are there to cache stuff, and more caches ...
use more memory. It's all a question of proportion and tuning, but the
_design_ should be based on having as thorough caching as possible.

Ingo

2007-12-09 09:18:22

by Pekka Enberg

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

Hi Ingo,

On Dec 9, 2007 10:50 AM, Ingo Molnar <[email protected]> wrote:
> yes, i understand the initial announcement (and the Kconfig entry still
> says the same), but that is not matched up by the reality i see in the
> actual code - SLUB clearly uses a queue/list of objects (as cited in my
> previous mail), for obvious performance reasons.
>
> unless i'm missing something obvious (and i easily might), i see SLUB as
> SLAB reimplemented with a different queueing model. Not "without
> queueing".

Sure, the emphasis is on "no *such* queues."

On Dec 9, 2007 10:50 AM, Ingo Molnar <[email protected]> wrote:
> i'm curious about the real facts behind this "alien cache problem". I
> heard about it and asked around and was told that there's some sort of
> bad quadratic behavior of memory consumption on NUMA - but i cannot
> actually see that in the code.

I mostly live in the legacy 32-bit UMA/UP land still so I cannot verify this
myself but the kind folks at SGI claim the following (again from the
announcement):

"On our systems with 1k nodes / processors we have several gigabytes
just tied up for storing references to objects for those queues This does
not include the objects that could be on those queues. One fears that the
whole memory of the machine could one day be consumed by those
queues."

The problem is that for each cache, you have an "per-node alien queues"
for each node (see struct kmem_cache nodelists -> struct kmem_list3 alien).
Moving slab metadata to struct page solves this but now you can only have
one "queue" thats part of the same struct.

Pekka

2007-12-09 11:52:18

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Pekka Enberg <[email protected]> wrote:

> I mostly live in the legacy 32-bit UMA/UP land still so I cannot
> verify this myself but the kind folks at SGI claim the following
> (again from the announcement):
>
> "On our systems with 1k nodes / processors we have several gigabytes
> just tied up for storing references to objects for those queues This
> does not include the objects that could be on those queues. One fears
> that the whole memory of the machine could one day be consumed by
> those queues."

Yes, you can find gigs tied up on systems that have 100 GB of RAM, or
you can have gigs tied up if you over-size your caches. I'd like to see
an accurate calculation done on this.

> The problem is that for each cache, you have an "per-node alien
> queues" for each node (see struct kmem_cache nodelists -> struct
> kmem_list3 alien). Moving slab metadata to struct page solves this but
> now you can only have one "queue" thats part of the same struct.

yes, it's what i referred to as "distributed, per node cache". It has no
"quadratic overhead". It has SLAB memory spread out amongst nodes. I.e.
1 million pages are distributed amongst 1k nodes with 1000 pages per
node with each node having 1 page.

But that memory is not lost and it's disingenous to call it 'overhead'
and it very much comes handy for performance _IF_ there's global
workload that involves cross-node allocations. It's simply a cache that
is mis-sized and mis-constructed on large node count systems but i bet
it makes quite a performance difference on low-node-count systems.

On high node-count systems it might make sense to reduce the amount of
cross-node caching and to _structure_ the distributed NUMA SLAB cache in
an intelligent way (perhaps along cpuset boundaries) - but a total,
design level _elimination_ of this caching concept, using very
misleading arguments, just looks stupid to me ...

Ingo

2007-12-09 11:54:59

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-09 12:05:39

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Andrew Morton <[email protected]> wrote:

> Am trying to do a git-disect on it but it seems that someone has been
> screwing with ata Kconfig and I'm hitting a pile of
> cant-find-root-disk bisection points and I can't immediately work out
> why. I'll try to find time to look at it again next week.

the way i solve such bisection problems is to have the patch like below
applied by a "git-bisect run" scriptlet (and popped off after the test).
This way all must-have drivers and kernel features are selected for that
particular testbox, no matter what Kconfig complication there are.
(except outright config option renaming but those are rare)

Ingo

Index: linux/arch/x86/Kconfig.needed
===================================================================
--- /dev/null
+++ linux/arch/x86/Kconfig.needed
@@ -0,0 +1,88 @@
+config FORCE_MINIMAL_CONFIG
+ bool
+ default y
+
+select EXPERIMENTAL
+
+select EXT3_FS
+select EXT3_FS_XATTR
+select EXT3_FS_POSIX_ACL
+select EXT3_FS_SECURITY
+select BLOCK
+select HOTPLUG
+#select INOTIFY
+#select INOTIFY_USER
+
+# so that capset() works (sudo, etc.):
+select SECURITY
+select SECURITY_CAPABILITIES
+
+select BINFMT_ELF
+select MSDOS_PARTITION
+select PARTITION_ADVANCED
+select BSD_DISKLABEL
+
+select SYSFS
+select SYSFS_DEPRECATED
+select PROC_FS
+select FUTEX
+
+select ATA
+select SATA_AHCI
+select ATA_PIIX
+select PATA_AMD
+select PATA_OLDPIIX
+select BLK_DEV_SD
+
+select E100
+select E1000
+select NET_ETHERNET
+select NET_PCI
+select MII
+select CRC32
+
+select 8139TOO
+select FORCEDETH
+
+select PACKET
+
+select NETPOLL
+select NETCONSOLE
+select NET_POLL_CONTROLLER
+select INET
+select NET
+select UNIX
+select NETDEVICES
+
+select SERIAL_8250
+select SERIAL_8250_CONSOLE
+select MAGIC_SYSRQ
+
+select INPUT
+select INPUT_MOUSEDEV
+select INPUT_POLLDEV
+select INPUT_KEYBOARD
+select KEYBOARD_ATKBD
+select SERIO
+select SERIO_I8042
+
+select VT
+select VT_CONSOLE
+select HW_CONSOLE
+select VGA_CONSOLE
+select EARLY_PRINTK
+select PRINTK
+select UNIX98_PTYS
+
+select USB
+select USB_MOUSE
+select USB_EHCI_HCD
+select USB_OHCI_HCD
+select USB_UHCI_HCD
+select USB_SUPPORT
+
+select PCI
+
+select STANDALONE
+select PREVENT_FIRMWARE_BUILD
+
Index: linux/lib/Kconfig
===================================================================
--- linux.orig/lib/Kconfig
+++ linux/lib/Kconfig
@@ -142,3 +142,6 @@ config CHECK_SIGNATURE
bool

endmenu
+
+source "arch/x86/Kconfig.needed"
+
Index: linux/lib/Kconfig.debug

2007-12-09 12:35:17

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Ingo Molnar <[email protected]> wrote:

> > The problem is that for each cache, you have an "per-node alien
> > queues" for each node (see struct kmem_cache nodelists -> struct
> > kmem_list3 alien). Moving slab metadata to struct page solves this
> > but now you can only have one "queue" thats part of the same struct.
>
> yes, it's what i referred to as "distributed, per node cache". It has
> no "quadratic overhead". It has SLAB memory spread out amongst nodes.
> I.e. 1 million pages are distributed amongst 1k nodes with 1000 pages
> per node with each node having 1 page.
>
> But that memory is not lost and it's disingenous to call it 'overhead'
> and it very much comes handy for performance _IF_ there's global
> workload that involves cross-node allocations. It's simply a cache
> that is mis-sized and mis-constructed on large node count systems but
> i bet it makes quite a performance difference on low-node-count
> systems.
>
> On high node-count systems it might make sense to reduce the amount of
> cross-node caching and to _structure_ the distributed NUMA SLAB cache
> in an intelligent way (perhaps along cpuset boundaries) [...]

i think much of this could be achieved by delaying the creation of alien
caches up until the point a { node X } -> { node Y } alloc/free
relationship gets established. So if a system is partitioned along
cpusets, vast areas of the NxN matrix never gets populated with alien
caches.

Ingo

2007-12-09 13:48:06

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> If we fail to find out the solution in time, we always have the
> alternative of backing out the ATAPI transfer chunk size update. This

Which will break far more controllers and drives than it fixes, so
backing it out is nonsensical and not in the general good.

> will break some other cases which were fixed by the change but those
> won't be regressions at least and we can add transfer chunk size
> update with other changes to 2.6.25.

Great, make everyone else wait another three months for a working CD
drive. The one off regression appears far less harmful than a revert.

Tejun - instead of backing out important updates for 2.6.24 we should
just blacklist that specific drive for now and sort it nicely in 2.6.25,
not revert stuff and break everyone elses ATAPI devices.

Alan

2007-12-09 13:59:19

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sunday, 9 of December 2007, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > It's a kmap_atomic() debugging patch which I wrote ages ago and whcih
> > Ingo sucked into his tree. I don't _think_ this warning is present in
> > your tree at all.
> >
> > http://lkml.org/lkml/2007/11/29/157 is where it starts.
> >
> > I had a lenghty back-and-forth with Christoph on this within the past
> > couple of months and I cannot locate the thread and I don't recall
> > what the upshot was and Christoph is still offline.
> >
> > Knocking out __GFP_ZERO at the point where the slab allocator(s) call
> > the page allocator seems like a good approach to me.
> >
> > But I don't think we need to do anything for 2.6.24..
>
> ok - Rafael, please strike this off the regressions list, there is no
> problem in .24 other than the double memset for some SLOB metadata.

OK, dropped.

Thanks,
Rafael

2007-12-09 14:01:13

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sunday, 9 of December 2007, Tejun Heo wrote:
> Hello,
>
> Andrew Morton wrote:
> >> Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
> >> Submitter : Hans de Bruin <[email protected]>
> >> References : http://bugzilla.kernel.org/show_bug.cgi?id=9320
> >> Handled-By : Robert Moore <[email protected]>
> >> Tejun Heo <[email protected]>
> >> Fu Michael <[email protected]>
> >> Patch :
> >>
> >
> > A number of other people are seeing the same thing and Tejun is
> > putting in a blacklist of machines which cannot use libata+acpi.
> > That patch is not yet in any git tree which I pull.
> >
> > AFACIT the machines kepe working OK - there's just some nasty dmesg
> > spew.
> >
> > If any machines _are_ breaking then this could cause real problems
> > and I'd prefer that we either go for a whitelist or arrange to
> > detect the condition and fall back to non-acpi ata.
>
> The pending patchset should make ATA ACPI quite resistant to failures.

Are you going to push it for 2.6.24?

> Known bad boards can be blacklisted (currently only one is on the
> list), ATA ACPI is disabled quicker if ACPI evalution fails, execution
> errors are handled better and commands which are intended to help the
> vendor instead of the user are filtered. So, I think we have enough
> safety nets.

Sounds good. :-)

Thanks,
Rafael

2007-12-09 14:05:24

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sunday, 9 of December 2007, Andrew Morton wrote:
> On Sat, 8 Dec 2007 03:40:49 +0100 "Rafael J. Wysocki" <[email protected]> wrote:
>
> > This message contains a list of some regressions from 2.6.23 which have been
> > reported since 2.6.24-rc1 was released and for which there are no fixes in the
> > mainline that I know of.
>
> Here's one for you - I have a new Lenovo t61p with which to irritate
> everyone.
>
> suspend-to-ram is a wipeout, but suspend-to-disk works OK under
> 2.6.23.
>
> However under 2.6.24-rc1 and -rc4 the machine reboots right at the end of
> resume-from-disk.

It's http://bugzilla.kernel.org/show_bug.cgi?id=9258 , I think.

Does it do that if you unload ehci-hcd before the hibernation?

2007-12-09 15:09:41

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hello, Alan.

Alan Cox wrote:
>> will break some other cases which were fixed by the change but those
>> won't be regressions at least and we can add transfer chunk size
>> update with other changes to 2.6.25.
>
> Great, make everyone else wait another three months for a working CD
> drive. The one off regression appears far less harmful than a revert.

Newly broken ones will be regressions. How many do we fix by the
change? On SATA, setting the correct transfer chunk size doesn't seem
to fix many.

> Tejun - instead of backing out important updates for 2.6.24 we should
> just blacklist that specific drive for now and sort it nicely in 2.6.25,
> not revert stuff and break everyone elses ATAPI devices.

We'll need to blacklist setting transfer chunk size, eek, and let's
leave that as the last resort and hope that we find the solution soon.
Blacklist takes time to develop and temporary blacklist for just one
release doesn't sound like a good idea.

--
tejun

2007-12-09 15:11:25

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Rafael J. Wysocki wrote:
>>> If any machines _are_ breaking then this could cause real problems
>>> and I'd prefer that we either go for a whitelist or arrange to
>>> detect the condition and fall back to non-acpi ata.
>> The pending patchset should make ATA ACPI quite resistant to failures.
>
> Are you going to push it for 2.6.24?

Yeah, I'm hoping so. Maybe command filtering should wait till 2.6.25
but the rest, yeap.

--
tejun

2007-12-09 15:31:09

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> Newly broken ones will be regressions. How many do we fix by the
> change? On SATA, setting the correct transfer chunk size doesn't seem
> to fix many.

Regressions are not some kind of grand evil. Better to regress the odd
device than continue to break entire controllers.

> > Tejun - instead of backing out important updates for 2.6.24 we should
> > just blacklist that specific drive for now and sort it nicely in 2.6.25,
> > not revert stuff and break everyone elses ATAPI devices.
>
> We'll need to blacklist setting transfer chunk size, eek, and let's
> leave that as the last resort and hope that we find the solution soon.
> Blacklist takes time to develop and temporary blacklist for just one
> release doesn't sound like a good idea.

It seems to be sensible to me *if* it is just this one device we are
somehow confusing and that one device is holding up fixing everything
else.

2007-12-09 15:40:13

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Alan Cox wrote:
>> Newly broken ones will be regressions. How many do we fix by the
>> change? On SATA, setting the correct transfer chunk size doesn't seem
>> to fix many.
>
> Regressions are not some kind of grand evil. Better to regress the odd
> device than continue to break entire controllers.

We need to put more weight on regressions as it at least makes releases
predictable to users. Anyways, I wasn't saying it was some absolute
maxim. I was literally asking how many so that we can evaluate the
trade off.

>>> Tejun - instead of backing out important updates for 2.6.24 we should
>>> just blacklist that specific drive for now and sort it nicely in 2.6.25,
>>> not revert stuff and break everyone elses ATAPI devices.
>> We'll need to blacklist setting transfer chunk size, eek, and let's
>> leave that as the last resort and hope that we find the solution soon.
>> Blacklist takes time to develop and temporary blacklist for just one
>> release doesn't sound like a good idea.
>
> It seems to be sensible to me *if* it is just this one device we are
> somehow confusing and that one device is holding up fixing everything
> else.

Yeah, if it's this one device, I fully agree. Let's see how debugging
turns out.

--
tejun

2007-12-09 15:47:21

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Andreas Mohr wrote:
> Hi,
>
> On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
>>> Does this report now win me the lucky draw, pretty please? ;)
>> nah, you have to cc the acpi guys to get a prize ;)
>
> Thought so shortly, but missed it.
>
>> Andreas, please do separately report that WOL problem too..
>
> Local setup issue only, at least this one *isn't* a 2.6.24-rc regression. ;)
>
>> Our list just reached 30.
>
> Oh, so this is in fact a separate issue? Wasn't sure, couldn't do
> enough analysis of similar cases.
>
> Will test any (already submitted!) suggestions ASAP.

Please post full kernel boot log and the result of 'lspci -nn'.

Thanks.

--
tejun

2007-12-09 16:00:22

by Arjan van de Ven

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sun, 9 Dec 2007 10:20:19 +0200
"Pekka Enberg" <[email protected]> wrote:
\
> Now, while SLAB code is "pleasant and straightforward code" (thanks,
> btw) for UMA, it's really hairy for NUMA plus the "alien caches" eat
> tons of memory

.. and they make slab slower on numa systems for database workloads
(I'm sure they do fine for SGI's customers HPC load though))

>(which is why Christoph wrote SLUB in the first place,
> the current code in SLAB is mostly unfixable due to its *queuing*
> nature).

To be honest, a SLAB without the alien stuff might be one of the best
performers today .... ;)
(on database loads.. where slub is quite a disaster as everyone knows)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-12-09 18:37:04

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sun, 9 Dec 2007, Alan Cox wrote:
>
> > If we fail to find out the solution in time, we always have the
> > alternative of backing out the ATAPI transfer chunk size update. This
>
> Which will break far more controllers and drives than it fixes, so
> backing it out is nonsensical and not in the general good.

No.

Regressions are worse. It doesn't matter AT ALL if you think that it
breaks ten times more devices, if it's a regression and those devices
didn't work in the past, they simply DO NOT COUNT.

Linus

2007-12-09 18:42:00

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sun, 9 Dec 2007, Alan Cox wrote:
>
> Great, make everyone else wait another three months for a working CD
> drive. The one off regression appears far less harmful than a revert.

Btw, Alan, that "math" is total and utter BULLSH*T, and you should know
that.

"The one off regression" is likely the tip of an iceberg. If something
regresses for one person, for that one person who tested and noticed and
made a bug-report, there's probably a thousand people who haven't even
tested the development kernel, or who had problems and just went back to
the previous version.

In contrast, reverting something will be guaranteed to not have those
kinds of issues, since the only people who could notice are people for who
it never worked in the first place. There's no "silent mass of people"
that can be affected.

This is why regressions are so important. They don't trump _everything_,
but basically ignoring and letting them slide is *much* more painful than
just reverting it.

The biggest reason to ignore a regression is if nobody can even figure
out where it came from, or reverting simply isn't an option for some
really deep and fundamental issue. That doesn't seem to be the case here.

So we should revert unless there is some known acceptable real fix.

Linus

2007-12-09 19:59:34

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hi,

On Mon, Dec 10, 2007 at 12:46:57AM +0900, Tejun Heo wrote:
> Please post full kernel boot log and the result of 'lspci -nn'.

Done, on #9530.

Will try some of the promising patches/suggestions now, hopefully this will
show me what's up. Will add further results there.

Andreas Mohr

2007-12-09 21:36:56

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hi,

[ACPI _GTM suspend issue sorta fixed, read below]

On Sat, Dec 08, 2007 at 12:24:16PM -0600, Robert Hancock wrote:
> Matthew Garrett wrote:
>> On Sat, Dec 08, 2007 at 02:20:01AM -0800, Andrew Morton wrote:
>>> On Sat, 8 Dec 2007 11:12:57 +0100 Andreas Mohr <[email protected]> wrote:
>>>> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
>>>> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTF_] (Node c180b990), AE_AML_PACKAGE_LIMIT
>>>> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN0.DRV1._GTF] (Node c180b888), AE_AML_PACKAGE_LIMIT
>>>> ata1.01: _GTF evaluation failed (AE 0x300d)
>>
>> 037f6bb79f753c014bc84bca0de9bf98bb5ab169 ought to have fixed this?
>>
>
> I should think it should have.

Yup, the _GTF problem is certainly fixed, but this is a dead-end
since I showed the -rc1 vs. -rc2 behaviour, whereas I still have
failing suspend in -rc4 with this patch confirmed to be applied
(source does contain the changes) and confirmed to apparently be working
(no errors in dmesg any more).

IOW, what I'm concerned about is not a _GTF error on boot any more,
but a seemingly fatally handled _GTM error on suspend.

...OK, dug some more into this, and now I managed to get it to work,
and it was indeed _GTM which broke my whole suspend:

Since _GTM is failing on me (and the point is, it's failing
catastrophically, not "normally"!), ata_acpi_on_suspend()
calling ata_acpi_gtm() fails with -EINVAL instead of -ENOENT,
however ata_acpi_on_suspend() has the following correction only:

if (rc == -ENOENT)
rc = 0;

to make sure a suspend doesn't get aborted (fatal error) when
_GTM is simply empty.

Changing this into

if ((rc == -ENOENT) || (rc == -EINVAL))
rc = 0;

to additionally account for _invalid_ _GTM execution makes my suspend
(and resume!) work again on -rc4.

Now the question is whether this error code correction is ok, or whether a
catastrophically failing _GTM should have been truly registered on boot
already (where it does gtm to fetch cable timings) to subsequently avoid
doing any ATA ACPI things on suspend at all.

And the second, possibly much more lucrative, question would be
whether we're actually doing something wrong with our ACPI _GTM execution
which triggers the AE_AML_PACKAGE_LIMIT problem.

This might help here, perhaps (relevant snippets of AML dump):

Device (CHN0)
{
Name (_ADR, 0x00)
Method (_GTM, 0, NotSerialized)
{
Return (GTM (PMPT, PMUE, PMUT, PSPT, PSUE, PSUT))
}

Method (_STM, 3, NotSerialized)
{
Store (Arg0, TMD0)
Store (PMPT, GMPT)
Store (PMUE, GMUE)
Store (PMUT, GMUT)
Store (PSPT, GSPT)
Store (PSUE, GSUE)
Store (PSUT, GSUT)
STM ()
Store (GMPT, PMPT)
Store (GMUE, PMUE)
Store (GMUT, PMUT)
Store (GSPT, PSPT)
Store (GSUE, PSUE)
Store (GSUT, PSUT)
}

Device (CHN1)
{
Name (_ADR, 0x01)
Method (_GTM, 0, NotSerialized)
{
Return (GTM (SMPT, SMUE, SMUT, SSPT, SSUE, SSUT))
}

Method (_STM, 3, NotSerialized)
{
Store (Arg0, TMD0)
Store (SMPT, GMPT)
Store (SMUE, GMUE)
Store (SMUT, GMUT)
Store (SSPT, GSPT)
Store (SSUE, GSUE)
Store (SSUT, GSUT)
STM ()
Store (GMPT, SMPT)
Store (GMUE, SMUE)
Store (GMUT, SMUT)
Store (GSPT, SSPT)
Store (GSUE, SSUE)
Store (GSUT, SSUT)
}

Method (GTM, 6, Serialized)
{
Store (Ones, PIO0)
Store (Ones, PIO1)
Store (Ones, DMA0)
Store (Ones, DMA1)
Store (0x10, CHNF)
If (REGF) {}
Else
{
Return (TMD0)
}

Store (Match (DerefOf (Index (TIM0, 0x01)), MEQ, Arg0, MTR,
0x00, 0x00), Local6)
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x00)), Local6)
),
Local7)
Store (Local7, DMA0)
Store (Local7, PIO0)
Store (Match (DerefOf (Index (TIM0, 0x01)), MEQ, Arg3, MTR,
0x00, 0x00), Local6)
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x00)), Local6)
),
Local7)
Store (Local7, DMA1)
Store (Local7, PIO1)
If (Arg1)
{
If (A133 ())
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0D)),
Arg2)),
Local5)
}
Else
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0A)),
Arg2)),
Local5)
}

If (A133 ())
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0C)),
Local5)),
DMA0)
}
Else
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x04)),
Local5)),
DMA0)
}

Or (CHNF, 0x01, CHNF)
}

If (Arg4)
{
If (A133 ())
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0D)), Arg5)),
Local5)
}
Else
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0A)), Arg5)),
Local5)
}

If (A133 ())
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x0C)), Local5)),
DMA1)
}
Else
{
Store (DerefOf (Index (DerefOf (Index (TIM0, 0x04)), Local5)),
DMA1)
}

Or (CHNF, 0x04, CHNF)
}

Return (TMD0)
}

Reminder: issue tracked at #9530.

Thanks,

Andreas Mohr

2007-12-09 22:00:12

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> Regressions are worse. It doesn't matter AT ALL if you think that it
> breaks ten times more devices, if it's a regression and those devices
> didn't work in the past, they simply DO NOT COUNT.

Must be time for an -ac tree again.

Alan

2007-12-09 22:07:20

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> Btw, Alan, that "math" is total and utter BULLSH*T, and you should know
> that.

The one off regression is probably not one off, but this is IDE so
actually its quite probable its a single broken firmware.

The alternative is that you cripple just about every user of various
other standards compliant devices and controllers whose hardware we
finally fixed.

Finally you need to remember that the 'regression' is caused by the fact
we now do the _right_ thing both in terms of 'old IDE' and specs.

Believe it or not I did actually think in quite some detail about this
case, and the relative probabilities, and go back and re-review the old
IDE code (whose behaviour we now follow) and the spec. I spend a
measurable amount of my time reviewing code and weighing risks,
regressions and progress for an enterprise Linux vendor, so its something
I do every day of the week.

To blindly argue regressions are critical is sometimes (as in this case)
to argue that "this freeway is no longer compatible with a horse and
cart" means the freeway should be turned back into a dirt road.

The horse and cart happened to work by chance because the road was quiet
that day. We clearly need to add a horse & cart lane in the long term,
but for 2.6.24 it may well be the right thing to do just to blacklist
that specific drive back to old behaviour until we can tidy it more
nicely.

Alan

2007-12-09 22:51:33

by Ray Lee

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Dec 9, 2007 2:01 PM, Alan Cox <[email protected]> wrote:
> > Btw, Alan, that "math" is total and utter BULLSH*T, and you should know
> > that.
>
> To blindly argue regressions are critical is sometimes (as in this case)
> to argue that "this freeway is no longer compatible with a horse and
> cart" means the freeway should be turned back into a dirt road.

Honest question: If you allow regressions, then how does one guarantee
forward progress? (If it were a finite set of systems, all within one
group's control, then the answer is simple: count how many work.
However, in this case we only have a statistical sampling available to
us.)

Ray

2007-12-10 00:04:48

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Hi,

On Sun, Dec 09, 2007 at 10:36:42PM +0100, Andreas Mohr wrote:
> And the second, possibly much more lucrative, question would be
> whether we're actually doing something wrong with our ACPI _GTM execution
> which triggers the AE_AML_PACKAGE_LIMIT problem.
>
> This might help here, perhaps (relevant snippets of AML dump):

Indeed, after looking over this horrid ASL stuff for ages I'm now starting
to believe that our IDE controller state is wrong,
since the Match()ing etc. in this particular _GTM implementation
is heavily dependant on actual PCI values
(it references some PCI_Config OperationRegion:s),
and some indexing seems to go wrong due to this.

IOW, it seems very likely that _GTM on these BIOSes (VIA chipsets) isn't
actually wrongly implemented but simply expects IDE controller values
to have been set up ""differently"".

Or... one could possibly even infer from this that - maybe -
the _GTM invocation spot is wrong, it should be done somewhere
different during bootup. Or whatever.

This seems to tell me again that we're often quick to blacklist
or whitelist things left and right when instead fundamental problems
are hidden somewhere.

Still investigating,

Andreas Mohr

2007-12-10 00:50:17

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Mon, Dec 10, 2007 at 01:04:31AM +0100, Andreas Mohr wrote:
> IOW, it seems very likely that _GTM on these BIOSes (VIA chipsets) isn't
> actually wrongly implemented but simply expects IDE controller values
> to have been set up ""differently"".
>
>
> Or... one could possibly even infer from this that - maybe -
> the _GTM invocation spot is wrong, it should be done somewhere
> different during bootup. Or whatever.

"Whatever" indeed:

There's an ASL Match() for a "PMPT" (Primary Master PorT) PCI register,
and the possible register values are:

Package (0x04)
{
0x20,
0x31,
0x65,
0xA8
},

and from

OperationRegion (CFG2, PCI_Config, 0x40, 0x20)
Field (CFG2, DWordAcc, NoLock, Preserve)
{
Offset (0x08),?
SSPT, 8,?
SMPT, 8,?
PSPT, 8,?
PMPT, 8,?
Offset (0x10),?
...
we can infer that at PCI_Config offset 0x48 those values should be located.
However after bootup or resume there are:

# lspci -s 00:11.1 -xxx
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00: 06 11 71 05 07 00 90 02 06 8a 01 01 00 20 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 e4 00 00 00 00 00 00 00 00 00 00 06 11 71 05
30: 00 00 00 00 c0 00 00 00 00 00 00 00 ff 01 00 00
40: 0b 32 09 0a 18 1c c0 00 99 99 20 20 ff 00 a8 20
50: 07 07 f6 f1 14 03 00 00 a8 a8 a8 a8 00 00 00 00
60: 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
70: 02 01 00 00 00 00 00 00 82 01 00 00 00 00 00 00
80: 00 e0 a1 1f 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 06 00 71 05 06 11 71 05 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 07 00 00 00 00 00 00 00 00 00

As one can see, the relevant values for SSPT, SMPT, PSPT and PMPT are
99 99 20 20, which are not quite entirely valid judging from the array above,
and this is because the secondary port is unused, as can also be seen
from my bootup log:

scsi0 : pata_via
scsi1 : pata_via
ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xe400 irq 14
ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xe408 irq 15
ata1.00: ATA-5: WDC WD1200JB-00CRA1, 17.07W17, max UDMA/100
ata1.00: 234441648 sectors, multi 16: LBA
ata1.01: ATAPI: TOSHIBA DVD-ROM SD-M1612, 1004, max UDMA/33
Switched to high resolution mode on CPU 0
ata1.00: configured for UDMA/100
ata1.01: configured for UDMA/33
ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTM_] (Node df80b9a8), AE_AML_PACKAGE_LIM
IT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN1._GTM] (Node df80b8d0), AE_AML_PACKAG
E_LIMIT
ata2: ACPI get timing mode failed (AE 0x300d)

Manually tweaking the values to 20 20 20 20 truly does skip the _GTM failure message on suspend -
only to reappear right on resume due to 99 99 20 20 combo happening again.
If I don't tweak, I get _GTM failure at both suspend and resume.

As such one can conclude that this BIOS is rather very confused when being called for _GTM on an entirely
unused controller port. And this is either because the BIOS is dumb or because ACPI doesn't really
expect anyone to call _GTM on an unused physical port. I'd bet on the latter...
(however I haven't found ACPI 3.0b explicitly mentioning this somewhere yet)

Andreas Mohr

2007-12-10 01:29:23

by Robert Hancock

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Andreas Mohr wrote:
> On Mon, Dec 10, 2007 at 01:04:31AM +0100, Andreas Mohr wrote:
>> IOW, it seems very likely that _GTM on these BIOSes (VIA chipsets) isn't
>> actually wrongly implemented but simply expects IDE controller values
>> to have been set up ""differently"".
>>
>>
>> Or... one could possibly even infer from this that - maybe -
>> the _GTM invocation spot is wrong, it should be done somewhere
>> different during bootup. Or whatever.
>
> "Whatever" indeed:
>
> There's an ASL Match() for a "PMPT" (Primary Master PorT) PCI register,
> and the possible register values are:
>
> Package (0x04)
> {
> 0x20,
> 0x31,
> 0x65,
> 0xA8
> },
>
> and from
>
> OperationRegion (CFG2, PCI_Config, 0x40, 0x20)
> Field (CFG2, DWordAcc, NoLock, Preserve)
> {
> Offset (0x08),?
> SSPT, 8,?
> SMPT, 8,?
> PSPT, 8,?
> PMPT, 8,?
> Offset (0x10),?
> ...
> we can infer that at PCI_Config offset 0x48 those values should be located.
> However after bootup or resume there are:
>
> # lspci -s 00:11.1 -xxx
> 00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> 00: 06 11 71 05 07 00 90 02 06 8a 01 01 00 20 00 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 01 e4 00 00 00 00 00 00 00 00 00 00 06 11 71 05
> 30: 00 00 00 00 c0 00 00 00 00 00 00 00 ff 01 00 00
> 40: 0b 32 09 0a 18 1c c0 00 99 99 20 20 ff 00 a8 20
> 50: 07 07 f6 f1 14 03 00 00 a8 a8 a8 a8 00 00 00 00
> 60: 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
> 70: 02 01 00 00 00 00 00 00 82 01 00 00 00 00 00 00
> 80: 00 e0 a1 1f 00 00 00 00 00 00 00 00 00 00 00 00
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 06 00 71 05 06 11 71 05 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 07 00 00 00 00 00 00 00 00 00
>
>
> As one can see, the relevant values for SSPT, SMPT, PSPT and PMPT are
> 99 99 20 20, which are not quite entirely valid judging from the array above,
> and this is because the secondary port is unused, as can also be seen
> from my bootup log:
>
> scsi0 : pata_via
> scsi1 : pata_via
> ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xe400 irq 14
> ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xe408 irq 15
> ata1.00: ATA-5: WDC WD1200JB-00CRA1, 17.07W17, max UDMA/100
> ata1.00: 234441648 sectors, multi 16: LBA
> ata1.01: ATAPI: TOSHIBA DVD-ROM SD-M1612, 1004, max UDMA/33
> Switched to high resolution mode on CPU 0
> ata1.00: configured for UDMA/100
> ata1.01: configured for UDMA/33
> ACPI Exception (exoparg2-0442): AE_AML_PACKAGE_LIMIT, Index (0FFFFFFFF) is beyond end of object [20070126]
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.GTM_] (Node df80b9a8), AE_AML_PACKAGE_LIM
> IT
> ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.IDE0.CHN1._GTM] (Node df80b8d0), AE_AML_PACKAG
> E_LIMIT
> ata2: ACPI get timing mode failed (AE 0x300d)
>
>
> Manually tweaking the values to 20 20 20 20 truly does skip the _GTM failure message on suspend -
> only to reappear right on resume due to 99 99 20 20 combo happening again.
> If I don't tweak, I get _GTM failure at both suspend and resume.
>
>
> As such one can conclude that this BIOS is rather very confused when being called for _GTM on an entirely
> unused controller port. And this is either because the BIOS is dumb or because ACPI doesn't really
> expect anyone to call _GTM on an unused physical port. I'd bet on the latter...
> (however I haven't found ACPI 3.0b explicitly mentioning this somewhere yet)
>
> Andreas Mohr
>

Probably Windows doesn't call _GTM on a port with no devices connected,
and so the BIOS people never tested that case. Likely we can just avoid
doing this - if no devices are connected the timing settings for that
channel are irrelevant..

And you're quite right in your comment that we are often too quick to
blacklist hardware instead of looking into why it really is failing.
ACPI is one of those areas where we often just need to figure out how to
be bug-to-bug compatibile with what Windows is doing..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2007-12-10 01:58:44

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Sun, 9 Dec 2007, Alan Cox wrote:
>
> The one off regression is probably not one off, but this is IDE so
> actually its quite probable its a single broken firmware.
>
> The alternative is that you cripple just about every user of various
> other standards compliant devices and controllers whose hardware we
> finally fixed.

Alan, you're so full of shit that it's not even funny.

Have you even *read* the thread?

Tejun already reported that this apparently gets fixed _properly_ with the
more extensive cleanups and fixes that are pending for 2.6.25.

In other words, the stuff you call so critically important (yet we've been
able to live without it until now!) is apparently simply NOT YET READY.
It's breaking things.

In this case, Tejun seems to be right on the money. I also agree 100%
with him when he says

"Blacklist takes time to develop and temporary blacklist for just one
release doesn't sound like a good idea."

because if we create some blacklist for that one reported device, not only
is it likely going to be wrong (it's almost never just one firmware or one
chip that has a particular issue), but we tend to create thee blacklists
and later realize that we shouldn't have blacklisted things at all, we
should just have done things differently.

For examples of that, see the NCQ blacklist that was just _us_ doing
things wrong (over-reacting to things we shouldn't care about), and
there's currently another totally unrelated discussion on a very similar
thing wrt libata and the ACPI startup commands for an unused controller
port.

> Finally you need to remember that the 'regression' is caused by the fact
> we now do the _right_ thing both in terms of 'old IDE' and specs.

.. and what the hell does that matter? If the code doesn't work, it
doesn't work, and you might as well point to some random scribblings done
by a three-year-old on toilet paper rather than any "specs".

Real life matters more. Regressions matter more.

We apparently do have a full fix, but it seems to be too invasive for
2.6.24, which means that the thing that currently DOES NOT WORK and
causes regressions should be reverted, so that 2.6.24 is at least no worse
than 2.6.23 (and all earlier kernels) in this respect.

And then we should just hope that the more complete fix that Tejun has
doesn't cause any issues on its own. I would suggest that if you care so
deeply about this issue, you press Fedora into putting Tejun's tree into
Fedora testing, and get that thing tested out extensively.

So the fact is, we have a way forward, but we should *not* take steps
backwards just because you want to push something out that isn't quite
ready. We should revert the change that causes the current trouble, safe
in the knowledge (or at least "strong hope") that we have a way forward
that makes *both* 2.6.24 and 2.6.25 be continual improvements.

We used to allow regressions. It was really painful. It's hard to debug
things when things sometimes break. It's much better to have a nice
constant monotonic improvement.

It's better for users, but it's much better also for developers, even if
you may be frustrated right now because some new code effectively gets
shut down until it works for everybody.

Linus

2007-12-10 02:20:35

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Andreas Mohr wrote:
> As such one can conclude that this BIOS is rather very confused when being called for _GTM on an entirely
> unused controller port. And this is either because the BIOS is dumb or because ACPI doesn't really
> expect anyone to call _GTM on an unused physical port. I'd bet on the latter...
> (however I haven't found ACPI 3.0b explicitly mentioning this somewhere yet)

Thanks a lot for finding this out. One of the two reports in bug 9320
seems to be the same problem although the other doesn't seem to be. So,
it seems we'll have to check that both primary and secondary slots are
empty and skip _GTM if so. :-(

Also, right, there's no need to fail suspend on _GTM failure whatever
the error is. That was me being anal again. Will incorporate both into
the ACPI fixes patchset.

Thanks.

--
tejun

2007-12-10 02:25:32

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Robert Hancock wrote:
> And you're quite right in your comment that we are often too quick to
> blacklist hardware instead of looking into why it really is failing.
> ACPI is one of those areas where we often just need to figure out how to
> be bug-to-bug compatibile with what Windows is doing..

In the spirit of not blacklisting without looking deep into ACPI code,
can somebody familiar with ASL take a look at comment 11 of bug 9320?

http://bugzilla.kernel.org/show_bug.cgi?id=9320#c11

This is libata calling _GTM to find out how the BIOS configured the
device to determine cable type.

Thanks.

--
tejun

2007-12-10 03:20:28

by Robert Hancock

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Tejun Heo wrote:
> Robert Hancock wrote:
>> And you're quite right in your comment that we are often too quick to
>> blacklist hardware instead of looking into why it really is failing.
>> ACPI is one of those areas where we often just need to figure out how to
>> be bug-to-bug compatibile with what Windows is doing..
>
> In the spirit of not blacklisting without looking deep into ACPI code,
> can somebody familiar with ASL take a look at comment 11 of bug 9320?
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9320#c11
>
> This is libata calling _GTM to find out how the BIOS configured the
> device to determine cable type.
>
> Thanks.

I suspect it's somewhat similar (though perhaps a different cause), the
code is trying to lookup a value (presumably register contents) in a
table using Match, gets a value that's not in the table (which makes
Match return the ONES value FFFFFFFF meaning not found) and so the
lookup of the corresponding output value with that index fails. We'd
need the full ASL dump to know exactly what's going on there.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2007-12-10 03:35:13

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Its your kernel. Its your call, and your privilege to be wrong.

And anyone with ATAPI problems should probably test the -mm tree before
reporting anything.

Alan

2007-12-10 03:44:20

by Alan

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

> Have you even *read* the thread?

In detail, as it unfolds and while testing variants of Tejun's code on
the hardware I have access to - none of which has this bug making it
rather trickier to help.

> In other words, the stuff you call so critically important (yet we've been
> able to live without it until now!) is apparently simply NOT YET READY.
> It's breaking things.

And as I keep pointing out but you keep ignoring - not doing it breaks
even more things, by a factor of quite a lot.

> .. and what the hell does that matter? If the code doesn't work, it
> doesn't work, and you might as well point to some random scribblings done
> by a three-year-old on toilet paper rather than any "specs".

The code without the changes doesn't work either. So pick your toilet
paper.. by your argument both are toilet paper.

> causes regressions should be reverted, so that 2.6.24 is at least no worse
> than 2.6.23 (and all earlier kernels) in this respect.

Which as the distro bug lists for ATAPI will tell you - aint good. Still
distro vendors can ship patches.

> We used to allow regressions. It was really painful. It's hard to debug
> things when things sometimes break. It's much better to have a nice
> constant monotonic improvement.

Linus, the kernel regresses all over the place every release. If it
didn't do that you'd never get any changes in. Your kernel would
fossilize like RHEL or SLES and you'd be spending weeks analysing each
changeset for possible side effects, or - as happens by neccessity -
adding code paths so a fix vital to one driver ceases to share core code
with another driver - to reduce regression risk. Been there, done that
and its not the way progress happens.

> It's better for users, but it's much better also for developers, even if
> you may be frustrated right now because some new code effectively gets
> shut down until it works for everybody.

Have fun. I trust you'll be fixing the other 11 I think it was listed
regressions before 2.6.24 - or backing out every changeset that could be
responsible ?

No I thought not - because that wouldn't be sensible either.

Alan

2007-12-10 08:22:23

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Linus Torvalds <[email protected]> wrote:

> Tejun already reported that this apparently gets fixed _properly_ with
> the more extensive cleanups and fixes that are pending for 2.6.25.

btw., how extensive are those cleanups and fixes in reality, is there a
rollup somewhere one could take a look at? Those fixes and cleanups were
deferred to v2.6.25 in the knowledge of having the current code included
in v2.6.24 - but now that the current approach seems to regress, maybe
those cleanups are still safe enough. (compared to an outright revert)

Ingo

2007-12-10 08:27:49

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> Tejun already reported that this apparently gets fixed _properly_ with
>> the more extensive cleanups and fixes that are pending for 2.6.25.
>
> btw., how extensive are those cleanups and fixes in reality, is there a
> rollup somewhere one could take a look at? Those fixes and cleanups were
> deferred to v2.6.25 in the knowledge of having the current code included
> in v2.6.24 - but now that the current approach seems to regress, maybe
> those cleanups are still safe enough. (compared to an outright revert)

The following git tree contains patches pending review for 2.6.25.

http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=shortlog;h=improve-ATAPI-data-transfer-no-pio

And we're getting close to fixing the regression. I don't think there's
too much worry about this one. Just need a bit more time to test few
more things.

Thanks.

--
tejun

2007-12-10 08:42:22

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Tejun Heo <[email protected]> wrote:

> The following git tree contains patches pending review for 2.6.25.
>
> http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=shortlog;h=improve-ATAPI-data-transfer-no-pio
>
> And we're getting close to fixing the regression. I don't think
> there's too much worry about this one. Just need a bit more time to
> test few more things.

ah, i see, the joys of the kernel running BIOS written code (AML):

http://bugzilla.kernel.org/attachment.cgi?id=13932&action=view

cute!

Ingo

2007-12-10 15:39:59

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Mon, 10 Dec 2007, Alan Cox wrote:
>
> And as I keep pointing out but you keep ignoring - not doing it breaks
> even more things, by a factor of quite a lot.

But we've never done it before in libata, right?

So the "not doing it breaks" argument is about stuff that isn't
regressions.

Can you really not see the difference?

Linus

2007-12-10 20:43:06

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Rafael J. Wysocki <[email protected]> wrote:

> Subject : jiffies counter leaps in 2.6.24-rc3
> Submitter : Stefano Brivio <[email protected]>
> References : http://lkml.org/lkml/2007/11/24/53
> http://bugzilla.kernel.org/show_bug.cgi?id=9475
> Handled-By : Ingo Molnar <[email protected]>
> Patch : http://lkml.org/lkml/2007/12/7/132

Linus, Andrew, i need some help deciding what to do about this
regression. The fixes for this have been tested and resolve the
regression, but they change printk and other code that runs by default
and is thus rather invasive so late in the v2.6.24 cycle. This bug
should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug
option) - although some claimed effect was on udelay()/mdelay() too.

i've attached below the queue of 5 patches that fix this problem. They
have been build and boot tested with more than 1000 random kernels in
the past few days, so i certainly trust the core and x86 bits of this.

what do you think? Right now i've got them queued up for 2.6.25 in both
the scheduler-devel and the x86-devel git trees - but can submit them
for 2.6.24 if it's better if we did them there. I've got no strong
opinion either way.

Ingo

-------------------->
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <[email protected]>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [email protected]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++-----
arch/x86/kernel/tsc_64.c | 57 ++++++++++++++++++++++++++++++++++++++---------
include/asm-x86/timer.h | 23 ++++++++++++++----
3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
#include <linux/jiffies.h>
#include <linux/init.h>
#include <linux/dmi.h>
+#include <linux/percpu.h>

#include <asm/delay.h>
#include <asm/tsc.h>
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
*
* [email protected] "math is hard, lets go shopping!"
*/
-unsigned long cyc2ns_scale __read_mostly;

-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);

-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
{
- cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+ unsigned long flags, prev_scale, *scale;
+ unsigned long long tsc_now, ns_now;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ scale = &per_cpu(cyc2ns, cpu);
+
+ rdtscll(tsc_now);
+ ns_now = __cycles_2_ns(tsc_now);
+
+ prev_scale = *scale;
+ if (cpu_khz)
+ *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+ /*
+ * Start smoothly with the new frequency:
+ */
+ sched_clock_idle_wakeup_event(0);
+ local_irq_restore(flags);
}

/*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
- set_cyc2ns_scale(cpu_khz);
+ preempt_disable();
+ set_cyc2ns_scale(cpu_khz, smp_processor_id());
+ preempt_enable();
/*
* TSC based sched_clock turns
* to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_relia

void __init tsc_init(void)
{
+ int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;

@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);

- set_cyc2ns_scale(cpu_khz);
+ /*
+ * Secondary CPUs do not run through tsc_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+ for_each_possible_cpu(cpu)
+ set_cyc2ns_scale(cpu_khz, cpu);
+
use_tsc_delay();

/* Check and install the TSC clocksource */
Index: linux/arch/x86/kernel/tsc_64.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_64.c
+++ linux/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@

#include <asm/hpet.h>
#include <asm/timex.h>
+#include <asm/timer.h>

static int notsc __initdata = 0;

@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int tsc_khz;
EXPORT_SYMBOL(tsc_khz);

-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ * basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by [email protected]) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ * into a shift.
+ *
+ * We can use khz divisor instead of mhz to keep a better precision, since
+ * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ * ([email protected])
+ *
+ * [email protected] "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);

-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
{
- cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+ unsigned long flags, prev_scale, *scale;
+ unsigned long long tsc_now, ns_now;

-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
- return (cyc * cyc2ns_scale) >> NS_SCALE;
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ scale = &per_cpu(cyc2ns, cpu);
+
+ rdtscll(tsc_now);
+ ns_now = __cycles_2_ns(tsc_now);
+
+ prev_scale = *scale;
+ if (cpu_khz)
+ *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+ sched_clock_idle_wakeup_event(0);
+ local_irq_restore(flags);
}

unsigned long long sched_clock(void)
@@ -100,7 +133,9 @@ static int time_cpufreq_notifier(struct
mark_tsc_unstable("cpufreq changes");
}

- set_cyc2ns_scale(tsc_khz_ref);
+ preempt_disable();
+ set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+ preempt_enable();

return 0;
}
@@ -151,7 +186,7 @@ static unsigned long __init tsc_read_ref
void __init tsc_calibrate(void)
{
unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
- int hpet = is_hpet_enabled();
+ int hpet = is_hpet_enabled(), cpu;

local_irq_save(flags);

@@ -206,7 +241,9 @@ void __init tsc_calibrate(void)
}

tsc_khz = tsc2 / tsc1;
- set_cyc2ns_scale(tsc_khz);
+
+ for_each_possible_cpu(cpu)
+ set_cyc2ns_scale(tsc_khz, cpu);
}

/*
Index: linux/include/asm-x86/timer.h
===================================================================
--- linux.orig/include/asm-x86/timer.h
+++ linux/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
#define _ASMi386_TIMER_H
#include <linux/init.h>
#include <linux/pm.h>
+#include <linux/percpu.h>

#define TICK_SIZE (tick_nsec / 1000)

@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
#define calculate_cpu_khz() native_calculate_cpu_khz()
#endif

-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
* convert from cycles(64bits) => nanoseconds (64bits)
* basic equation:
* ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
* And since SC is a constant power of two, we can convert the div
* into a shift.
*
- * We can use khz divisor instead of mhz to keep a better percision, since
+ * We can use khz divisor instead of mhz to keep a better precision, since
* cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
* ([email protected])
*
* [email protected] "math is hard, lets go shopping!"
*/
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);

#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */

-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
{
- return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+ return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
}

+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+ unsigned long long ns;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ ns = __cycles_2_ns(cyc);
+ local_irq_restore(flags);
+
+ return ns;
+}

#endif
--------------->
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <[email protected]>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_32.c | 15 ++++++++++++---
arch/x86/kernel/process_64.c | 13 ++++++++++---
2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
smp_mb();

local_irq_disable();
- if (!need_resched())
+ if (!need_resched()) {
+ ktime_t t0, t1;
+ u64 t0n, t1n;
+
+ t0 = ktime_get();
+ t0n = ktime_to_ns(t0);
safe_halt(); /* enables interrupts racelessly */
- else
- local_irq_enable();
+ local_irq_disable();
+ t1 = ktime_get();
+ t1n = ktime_to_ns(t1);
+ sched_clock_idle_wakeup_event(t1n - t0n);
+ }
+ local_irq_enable();
current_thread_info()->status |= TS_POLLING;
} else {
/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
smp_mb();
local_irq_disable();
if (!need_resched()) {
- /* Enables interrupts one instruction before HLT.
- x86 special cases this so there is no race. */
- safe_halt();
+ ktime_t t0, t1;
+ u64 t0n, t1n;
+
+ t0 = ktime_get();
+ t0n = ktime_to_ns(t0);
+ safe_halt(); /* enables interrupts racelessly */
+ local_irq_disable();
+ t1 = ktime_get();
+ t1n = ktime_to_ns(t1);
+ sched_clock_idle_wakeup_event(t1n - t0n);
} else
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
--------------->
Subject: printk: make printk more robust by not allowing recursion
From: Ingo Molnar <[email protected]>

make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.

I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a well
controlled:

[ 41.057335] SysRq : <2>BUG: recent printk recursion!
[ 41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks

also do all this printk-debug logic with irqs disabled.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/printk.c | 48 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 10 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -628,30 +628,57 @@ asmlinkage int printk(const char *fmt, .
/* cpu currently holding logbuf_lock */
static volatile unsigned int printk_cpu = UINT_MAX;

+const char printk_recursion_bug_msg [] =
+ KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
asmlinkage int vprintk(const char *fmt, va_list args)
{
+ static int log_level_unknown = 1;
+ static char printk_buf[1024];
+
unsigned long flags;
- int printed_len;
+ int printed_len = 0;
+ int this_cpu;
char *p;
- static char printk_buf[1024];
- static int log_level_unknown = 1;

boot_delay_msec();

preempt_disable();
- if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
- /* If a crash is occurring during printk() on this CPU,
- * make sure we can't deadlock */
- zap_locks();
-
/* This stops the holder of console_sem just where we want him */
raw_local_irq_save(flags);
+ this_cpu = smp_processor_id();
+
+ /*
+ * Ouch, printk recursed into itself!
+ */
+ if (unlikely(printk_cpu == this_cpu)) {
+ /*
+ * If a crash is occurring during printk() on this CPU,
+ * then try to get the crash message out but make sure
+ * we can't deadlock. Otherwise just return to avoid the
+ * recursion and return - but flag the recursion so that
+ * it can be printed at the next appropriate moment:
+ */
+ if (!oops_in_progress) {
+ printk_recursion_bug = 1;
+ goto out_restore_irqs;
+ }
+ zap_locks();
+ }
+
lockdep_off();
spin_lock(&logbuf_lock);
- printk_cpu = smp_processor_id();
+ printk_cpu = this_cpu;

+ if (printk_recursion_bug) {
+ printk_recursion_bug = 0;
+ strcpy(printk_buf, printk_recursion_bug_msg);
+ printed_len = sizeof(printk_recursion_bug_msg);
+ }
/* Emit the output into the temporary buffer */
- printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+ printed_len += vscnprintf(printk_buf + printed_len,
+ sizeof(printk_buf), fmt, args);

/*
* Copy the output into log_buf. If the caller didn't provide
@@ -744,6 +771,7 @@ asmlinkage int vprintk(const char *fmt,
printk_cpu = UINT_MAX;
spin_unlock(&logbuf_lock);
lockdep_on();
+out_restore_irqs:
raw_local_irq_restore(flags);
}

--------------->
Subject: sched: fix CONFIG_PRINT_TIME's reliance on sched_clock()
From: Ingo Molnar <[email protected]>

Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:

http://bugzilla.kernel.org/show_bug.cgi?id=9475

fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
instead.

Reported-and-bisected-by: Stefano Brivio <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/printk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -707,7 +707,7 @@ asmlinkage int vprintk(const char *fmt,
loglev_char = default_message_loglevel
+ '0';
}
- t = printk_clock();
+ t = cpu_clock(printk_cpu);
nanosec_rem = do_div(t, 1000000000);
tlen = sprintf(tbuf,
"<%c>[%5lu.%06lu] ",
--------------->
Subject: sched: remove printk_clock()
From: Ingo Molnar <[email protected]>

printk_clock() is obsolete - it has been replaced with cpu_clock().

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/arm/kernel/time.c | 11 -----------
arch/ia64/kernel/time.c | 27 ---------------------------
kernel/printk.c | 5 -----
3 files changed, 43 deletions(-)

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
}
#endif

-/*
- * An implementation of printk_clock() independent from
- * sched_clock(). This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
- return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
- (1000000000 / HZ);
-}
-
static unsigned long next_rtc_update;

/*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
}
EXPORT_SYMBOL(udelay);

-static unsigned long long ia64_itc_printk_clock(void)
-{
- if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
- return sched_clock();
- return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
- return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
- (1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
- return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
- if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
- ia64_printk_clock = ia64_itc_printk_clock;
-}
-
/* IA64 doesn't cache the timezone */
void update_vsyscall_tz(void)
{
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char

__setup("time", printk_time_setup);

-__attribute__((weak)) unsigned long long printk_clock(void)
-{
- return sched_clock();
-}
-
/* Check if we have any console registered that can be called early in boot. */
static int have_callable_console(void)
{

2007-12-10 20:57:41

by Guillaume Chazarain

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Dec 10, 2007 9:42 PM, Ingo Molnar <[email protected]> wrote:
> although some claimed effect was on udelay()/mdelay() too.

Any specific report?
The jumping sched_clock on frequency change caused some
scheduling oddities for me, but CFS attenuated the effect.

Thanks.

--
Guillaume

2007-12-10 21:00:25

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Mon, 10 Dec 2007 21:42:12 +0100
Ingo Molnar <[email protected]> wrote:

> * Rafael J. Wysocki <[email protected]> wrote:
>
> > Subject : jiffies counter leaps in 2.6.24-rc3
> > Submitter : Stefano Brivio <[email protected]>
> > References : http://lkml.org/lkml/2007/11/24/53
> > http://bugzilla.kernel.org/show_bug.cgi?id=9475
> > Handled-By : Ingo Molnar <[email protected]>
> > Patch : http://lkml.org/lkml/2007/12/7/132
>
> Linus, Andrew, i need some help deciding what to do about this
> regression. The fixes for this have been tested and resolve the
> regression, but they change printk and other code that runs by default
> and is thus rather invasive so late in the v2.6.24 cycle. This bug
> should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug
> option) - although some claimed effect was on udelay()/mdelay() too.
>
> i've attached below the queue of 5 patches that fix this problem. They
> have been build and boot tested with more than 1000 random kernels in
> the past few days, so i certainly trust the core and x86 bits of this.
>
> what do you think? Right now i've got them queued up for 2.6.25 in both
> the scheduler-devel and the x86-devel git trees - but can submit them
> for 2.6.24 if it's better if we did them there. I've got no strong
> opinion either way.

printk_clock() doesn't seem terribly important but what's this stuff about
effects on udelay/mdelay? That can be serious if they're getting
shortened.

2007-12-10 22:45:37

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Andrew Morton <[email protected]> wrote:

> > what do you think? Right now i've got them queued up for 2.6.25 in
> > both the scheduler-devel and the x86-devel git trees - but can
> > submit them for 2.6.24 if it's better if we did them there. I've got
> > no strong opinion either way.
>
> printk_clock() doesn't seem terribly important but what's this stuff
> about effects on udelay/mdelay? That can be serious if they're
> getting shortened.

since udelay depends on loops_per_jiffy, which is fixed up
time_cpufreq_notifier(), i dont see how it could be affected by
frequency changes. (but that's the theory - practice might be different)

Ingo

2007-12-10 23:04:50

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Ingo Molnar <[email protected]> wrote:

> * Andrew Morton <[email protected]> wrote:
>
> > > what do you think? Right now i've got them queued up for 2.6.25 in
> > > both the scheduler-devel and the x86-devel git trees - but can
> > > submit them for 2.6.24 if it's better if we did them there. I've got
> > > no strong opinion either way.
> >
> > printk_clock() doesn't seem terribly important but what's this stuff
> > about effects on udelay/mdelay? That can be serious if they're
> > getting shortened.
>
> since udelay depends on loops_per_jiffy, which is fixed up
> time_cpufreq_notifier(), i dont see how it could be affected by
> frequency changes. (but that's the theory - practice might be
> different)

Stefano Brivio reported udelay()/mdelay() effects in the b43 driver.
(and it caused driver failures for him.)

Stefano, could you please try to sum up your experiences with that
issue? Is it reproducable, and the 5 patches i did fix it? (if yes,
could you try to re-do the mdelay verifications perhaps, to make sure
it's not some other effect interacting here. In theory sched-clock
scaling has no effect on udelay behavior.)

Ingo

2007-12-10 23:38:48

by Stefano Brivio

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Tue, 11 Dec 2007 00:04:25 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Andrew Morton <[email protected]> wrote:
> >
> > > > what do you think? Right now i've got them queued up for 2.6.25 in
> > > > both the scheduler-devel and the x86-devel git trees - but can
> > > > submit them for 2.6.24 if it's better if we did them there. I've got
> > > > no strong opinion either way.
> > >
> > > printk_clock() doesn't seem terribly important but what's this stuff
> > > about effects on udelay/mdelay? That can be serious if they're
> > > getting shortened.
> >
> > since udelay depends on loops_per_jiffy, which is fixed up
> > time_cpufreq_notifier(), i dont see how it could be affected by
> > frequency changes. (but that's the theory - practice might be
> > different)
>
> Stefano Brivio reported udelay()/mdelay() effects in the b43 driver.
> (and it caused driver failures for him.)
>
> Stefano, could you please try to sum up your experiences with that
> issue? Is it reproducable, and the 5 patches i did fix it? (if yes,
> could you try to re-do the mdelay verifications perhaps, to make sure
> it's not some other effect interacting here. In theory sched-clock
> scaling has no effect on udelay behavior.)

Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
delays isn't that good when using my crappy unstable TSC (mdelay(2000)
causes delays between 2 and 2.9 seconds) but it's not depending on frequency
changes anymore. So I'd say it's fixed, but please tell me if you want me
to do any other test so as to be sure it is.

--
Ciao
Stefano

2007-12-10 23:59:16

by Arjan van de Ven

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Tue, 11 Dec 2007 00:34:33 +0100
Stefano Brivio <[email protected]> wrote:

> On Tue, 11 Dec 2007 00:04:25 +0100
> Ingo Molnar <[email protected]> wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > * Andrew Morton <[email protected]> wrote:
> > >
> > > > > what do you think? Right now i've got them queued up for
> > > > > 2.6.25 in both the scheduler-devel and the x86-devel git
> > > > > trees - but can submit them for 2.6.24 if it's better if we
> > > > > did them there. I've got no strong opinion either way.
> > > >
> > > > printk_clock() doesn't seem terribly important but what's this
> > > > stuff about effects on udelay/mdelay? That can be serious if
> > > > they're getting shortened.
> > >
> > > since udelay depends on loops_per_jiffy, which is fixed up
> > > time_cpufreq_notifier(), i dont see how it could be affected by
> > > frequency changes. (but that's the theory - practice might be
> > > different)
> >
> > Stefano Brivio reported udelay()/mdelay() effects in the b43
> > driver. (and it caused driver failures for him.)
> >
> > Stefano, could you please try to sum up your experiences with that
> > issue? Is it reproducable, and the 5 patches i did fix it? (if yes,
> > could you try to re-do the mdelay verifications perhaps, to make
> > sure it's not some other effect interacting here. In theory
> > sched-clock scaling has no effect on udelay behavior.)
>
> Sorry for disappearing. Anyway, yes, those patches fixed it.
> Precision in delays isn't that good when using my crappy unstable TSC
> (mdelay(2000) causes delays between 2 and 2.9 seconds) but it's not
> depending on frequency changes anymore. So I'd say it's fixed, but
> please tell me if you want me to do any other test so as to be sure
> it is.
>
>
I'm still quite concerned about this in dual/quad core scenarios;
the frequency of both cores is the maximum of what linux sets each core to;
this means that if you're THIS sensitive to that there still is quite a nasty issue there.

I wonder if the various delay functions (maybe only in .25) should use the maximum observed loops_per_jiffie instead always (across cpus) to be super safe here.

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-12-11 00:01:19

by Guillaume Chazarain

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Stefano Brivio <[email protected]> wrote:

> Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
> delays isn't that good when using my crappy unstable TSC (mdelay(2000)
> causes delays between 2 and 2.9 seconds) but it's not depending on frequency
> changes anymore. So I'd say it's fixed, but please tell me if you want me
> to do any other test so as to be sure it is.

Ingo,

it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock()
based udelay), so how udelay can be affected by your proposed changes?

Thanks.

--
Guillaume

2007-12-11 00:08:50

by Guillaume Chazarain

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Arjan van de Ven <[email protected]> wrote:

> the frequency of both cores is the maximum of what linux sets each core to;

Do you mean that the cpufreq code can be confused about the actual
frequency of the cores? That sounds like a big problem.

Thanks for any insight.

--
Guillaume

2007-12-11 01:10:00

by Arjan van de Ven

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Tue, 11 Dec 2007 01:01:25 +0100
Guillaume Chazarain <[email protected]> wrote:

> Arjan van de Ven <[email protected]> wrote:
>
> > the frequency of both cores is the maximum of what linux sets each
> > core to;
>
> Do you mean that the cpufreq code can be confused about the actual
> frequency of the cores?

it means that cpufreq doesn't know the actual frequency (although bios sometimes tells us about the relationship, often the bios just lies through it's teeth); it only knows what it asks for, not what it gets. We know it'll get at least what it asks for, but it can get more than it asks for basically.

>That sounds like a big problem.

it'll get way worse going forward.
(but even on todays systems, the tsc no longer represents frequency, but is some fixed clock totally unrelated to cpu frequency)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-12-11 06:28:48

by Dave Jones

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote:

> so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and
> SLUB have essentially the same linecount:
>
> $ wc -l mm/slab.c mm/slub.c
> 4478 mm/slab.c
> 4125 mm/slub.c
>
> (and while linecount != complexity, there is a strong relationship.)
>
> With SLAB having 10 years more test coverage and tuning.

FWIW, the one thing slub does that slab doesn't that I find really nice
is being enable to enable debugging at boot time rather than compile time.

We don't get many people running benchmarks against the Fedora kernel,
so any scalability differences between slub/slab probably won't reach us
until we start shipping betas of the next RHEL based on the same kernel.

Which leaves my only other gripe. It broke slabtop.
There's an alternative implementation in Documentation/vm/slabinfo.c
(why there not say, util-linux, home of current slabtop?)

Dave

--
http://www.codemonkey.org.uk

2007-12-11 08:43:58

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Arjan van de Ven <[email protected]> wrote:

> > That sounds like a big problem.
>
> it'll get way worse going forward. (but even on todays systems, the
> tsc no longer represents frequency, but is some fixed clock totally
> unrelated to cpu frequency)

X86_FEATURE_CONSTANT_TSC CPUs (all modern Intel CPUs) should be fine -
we dont do any TSC frequency fixups for them. The loops_per_jiffy fixup
looks like this:

if (!(freq->flags & CPUFREQ_CONST_LOOPS))
cpu_data(freq->cpu).loops_per_jiffy =
cpufreq_scale(loops_per_jiffy_ref,
ref_freq, freq->new);

i.e. X86_FEATURE_CONSTANT_TSC excluded. The sched_clock() scaling factor
is modified like this:

if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
preempt_disable();
set_cyc2ns_scale(cpu_khz, smp_processor_id());

so here X86_FEATURE_CONSTANT_TSC is excluded again. So the whole
frequency scaling issue will become a pure legacy issue only with time.

Ingo

2007-12-11 08:49:35

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Guillaume Chazarain <[email protected]> wrote:

> Stefano Brivio <[email protected]> wrote:
>
> > Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
> > delays isn't that good when using my crappy unstable TSC (mdelay(2000)
> > causes delays between 2 and 2.9 seconds) but it's not depending on frequency
> > changes anymore. So I'd say it's fixed, but please tell me if you want me
> > to do any other test so as to be sure it is.
>
> Ingo,
>
> it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock()
> based udelay), so how udelay can be affected by your proposed changes?

was this needed for you to get stable udelay()? (that cpu_clock() based
udelay patch was buggy, i got the units wrong. udelay does wacky
conversions between various units. So i dropped it for the time being.)

the last rollup you tested didnt show udelay problems, and it didnt
include the sched_clock() based udelay patch.

so it would be nice if you could re-examine exactly what is needed.
Please try latest -git and the concatenation of the 4 patches below.
What would be the best info is to see which (if any!) patches are needed
against latest -git to get a stable udelay() on your box.

Ingo

--------------------------------------->
* Rafael J. Wysocki <[email protected]> wrote:

> Subject : jiffies counter leaps in 2.6.24-rc3
> Submitter : Stefano Brivio <[email protected]>
> References : http://lkml.org/lkml/2007/11/24/53
> http://bugzilla.kernel.org/show_bug.cgi?id=9475
> Handled-By : Ingo Molnar <[email protected]>
> Patch : http://lkml.org/lkml/2007/12/7/132

Linus, Andrew, i need some help deciding what to do about this
regression. The fixes for this have been tested and resolve the
regression, but they change printk and other code that runs by default
and is thus rather invasive so late in the v2.6.24 cycle. This bug
should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug
option) - although some claimed effect was on udelay()/mdelay() too.

i've attached below the queue of 5 patches that fix this problem. They
have been build and boot tested with more than 1000 random kernels in
the past few days, so i certainly trust the core and x86 bits of this.

what do you think? Right now i've got them queued up for 2.6.25 in both
the scheduler-devel and the x86-devel git trees - but can submit them
for 2.6.24 if it's better if we did them there. I've got no strong
opinion either way.

Ingo

-------------------->
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <[email protected]>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [email protected]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++-----
arch/x86/kernel/tsc_64.c | 57 ++++++++++++++++++++++++++++++++++++++---------
include/asm-x86/timer.h | 23 ++++++++++++++----
3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
#include <linux/jiffies.h>
#include <linux/init.h>
#include <linux/dmi.h>
+#include <linux/percpu.h>

#include <asm/delay.h>
#include <asm/tsc.h>
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
*
* [email protected] "math is hard, lets go shopping!"
*/
-unsigned long cyc2ns_scale __read_mostly;

-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);

-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
{
- cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+ unsigned long flags, prev_scale, *scale;
+ unsigned long long tsc_now, ns_now;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ scale = &per_cpu(cyc2ns, cpu);
+
+ rdtscll(tsc_now);
+ ns_now = __cycles_2_ns(tsc_now);
+
+ prev_scale = *scale;
+ if (cpu_khz)
+ *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+ /*
+ * Start smoothly with the new frequency:
+ */
+ sched_clock_idle_wakeup_event(0);
+ local_irq_restore(flags);
}

/*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
- set_cyc2ns_scale(cpu_khz);
+ preempt_disable();
+ set_cyc2ns_scale(cpu_khz, smp_processor_id());
+ preempt_enable();
/*
* TSC based sched_clock turns
* to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_relia

void __init tsc_init(void)
{
+ int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;

@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);

- set_cyc2ns_scale(cpu_khz);
+ /*
+ * Secondary CPUs do not run through tsc_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+ for_each_possible_cpu(cpu)
+ set_cyc2ns_scale(cpu_khz, cpu);
+
use_tsc_delay();

/* Check and install the TSC clocksource */
Index: linux/arch/x86/kernel/tsc_64.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_64.c
+++ linux/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@

#include <asm/hpet.h>
#include <asm/timex.h>
+#include <asm/timer.h>

static int notsc __initdata = 0;

@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int tsc_khz;
EXPORT_SYMBOL(tsc_khz);

-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ * basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by [email protected]) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ * into a shift.
+ *
+ * We can use khz divisor instead of mhz to keep a better precision, since
+ * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ * ([email protected])
+ *
+ * [email protected] "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);

-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
{
- cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+ unsigned long flags, prev_scale, *scale;
+ unsigned long long tsc_now, ns_now;

-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
- return (cyc * cyc2ns_scale) >> NS_SCALE;
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ scale = &per_cpu(cyc2ns, cpu);
+
+ rdtscll(tsc_now);
+ ns_now = __cycles_2_ns(tsc_now);
+
+ prev_scale = *scale;
+ if (cpu_khz)
+ *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+ sched_clock_idle_wakeup_event(0);
+ local_irq_restore(flags);
}

unsigned long long sched_clock(void)
@@ -100,7 +133,9 @@ static int time_cpufreq_notifier(struct
mark_tsc_unstable("cpufreq changes");
}

- set_cyc2ns_scale(tsc_khz_ref);
+ preempt_disable();
+ set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+ preempt_enable();

return 0;
}
@@ -151,7 +186,7 @@ static unsigned long __init tsc_read_ref
void __init tsc_calibrate(void)
{
unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
- int hpet = is_hpet_enabled();
+ int hpet = is_hpet_enabled(), cpu;

local_irq_save(flags);

@@ -206,7 +241,9 @@ void __init tsc_calibrate(void)
}

tsc_khz = tsc2 / tsc1;
- set_cyc2ns_scale(tsc_khz);
+
+ for_each_possible_cpu(cpu)
+ set_cyc2ns_scale(tsc_khz, cpu);
}

/*
Index: linux/include/asm-x86/timer.h
===================================================================
--- linux.orig/include/asm-x86/timer.h
+++ linux/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
#define _ASMi386_TIMER_H
#include <linux/init.h>
#include <linux/pm.h>
+#include <linux/percpu.h>

#define TICK_SIZE (tick_nsec / 1000)

@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
#define calculate_cpu_khz() native_calculate_cpu_khz()
#endif

-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
* convert from cycles(64bits) => nanoseconds (64bits)
* basic equation:
* ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
* And since SC is a constant power of two, we can convert the div
* into a shift.
*
- * We can use khz divisor instead of mhz to keep a better percision, since
+ * We can use khz divisor instead of mhz to keep a better precision, since
* cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
* ([email protected])
*
* [email protected] "math is hard, lets go shopping!"
*/
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);

#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */

-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
{
- return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+ return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
}

+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+ unsigned long long ns;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ ns = __cycles_2_ns(cyc);
+ local_irq_restore(flags);
+
+ return ns;
+}

#endif
--------------->
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <[email protected]>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_32.c | 15 ++++++++++++---
arch/x86/kernel/process_64.c | 13 ++++++++++---
2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
smp_mb();

local_irq_disable();
- if (!need_resched())
+ if (!need_resched()) {
+ ktime_t t0, t1;
+ u64 t0n, t1n;
+
+ t0 = ktime_get();
+ t0n = ktime_to_ns(t0);
safe_halt(); /* enables interrupts racelessly */
- else
- local_irq_enable();
+ local_irq_disable();
+ t1 = ktime_get();
+ t1n = ktime_to_ns(t1);
+ sched_clock_idle_wakeup_event(t1n - t0n);
+ }
+ local_irq_enable();
current_thread_info()->status |= TS_POLLING;
} else {
/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
smp_mb();
local_irq_disable();
if (!need_resched()) {
- /* Enables interrupts one instruction before HLT.
- x86 special cases this so there is no race. */
- safe_halt();
+ ktime_t t0, t1;
+ u64 t0n, t1n;
+
+ t0 = ktime_get();
+ t0n = ktime_to_ns(t0);
+ safe_halt(); /* enables interrupts racelessly */
+ local_irq_disable();
+ t1 = ktime_get();
+ t1n = ktime_to_ns(t1);
+ sched_clock_idle_wakeup_event(t1n - t0n);
} else
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
--------------->
Subject: printk: make printk more robust by not allowing recursion
From: Ingo Molnar <[email protected]>

make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.

I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a well
controlled:

[ 41.057335] SysRq : <2>BUG: recent printk recursion!
[ 41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks

also do all this printk-debug logic with irqs disabled.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/printk.c | 48 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 10 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -628,30 +628,57 @@ asmlinkage int printk(const char *fmt, .
/* cpu currently holding logbuf_lock */
static volatile unsigned int printk_cpu = UINT_MAX;

+const char printk_recursion_bug_msg [] =
+ KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
asmlinkage int vprintk(const char *fmt, va_list args)
{
+ static int log_level_unknown = 1;
+ static char printk_buf[1024];
+
unsigned long flags;
- int printed_len;
+ int printed_len = 0;
+ int this_cpu;
char *p;
- static char printk_buf[1024];
- static int log_level_unknown = 1;

boot_delay_msec();

preempt_disable();
- if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
- /* If a crash is occurring during printk() on this CPU,
- * make sure we can't deadlock */
- zap_locks();
-
/* This stops the holder of console_sem just where we want him */
raw_local_irq_save(flags);
+ this_cpu = smp_processor_id();
+
+ /*
+ * Ouch, printk recursed into itself!
+ */
+ if (unlikely(printk_cpu == this_cpu)) {
+ /*
+ * If a crash is occurring during printk() on this CPU,
+ * then try to get the crash message out but make sure
+ * we can't deadlock. Otherwise just return to avoid the
+ * recursion and return - but flag the recursion so that
+ * it can be printed at the next appropriate moment:
+ */
+ if (!oops_in_progress) {
+ printk_recursion_bug = 1;
+ goto out_restore_irqs;
+ }
+ zap_locks();
+ }
+
lockdep_off();
spin_lock(&logbuf_lock);
- printk_cpu = smp_processor_id();
+ printk_cpu = this_cpu;

+ if (printk_recursion_bug) {
+ printk_recursion_bug = 0;
+ strcpy(printk_buf, printk_recursion_bug_msg);
+ printed_len = sizeof(printk_recursion_bug_msg);
+ }
/* Emit the output into the temporary buffer */
- printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+ printed_len += vscnprintf(printk_buf + printed_len,
+ sizeof(printk_buf), fmt, args);

/*
* Copy the output into log_buf. If the caller didn't provide
@@ -744,6 +771,7 @@ asmlinkage int vprintk(const char *fmt,
printk_cpu = UINT_MAX;
spin_unlock(&logbuf_lock);
lockdep_on();
+out_restore_irqs:
raw_local_irq_restore(flags);
}

--------------->
Subject: sched: fix CONFIG_PRINT_TIME's reliance on sched_clock()
From: Ingo Molnar <[email protected]>

Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:

http://bugzilla.kernel.org/show_bug.cgi?id=9475

fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
instead.

Reported-and-bisected-by: Stefano Brivio <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/printk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -707,7 +707,7 @@ asmlinkage int vprintk(const char *fmt,
loglev_char = default_message_loglevel
+ '0';
}
- t = printk_clock();
+ t = cpu_clock(printk_cpu);
nanosec_rem = do_div(t, 1000000000);
tlen = sprintf(tbuf,
"<%c>[%5lu.%06lu] ",

2007-12-11 08:53:24

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Dave Jones <[email protected]> wrote:

> On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote:
>
> > so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and
> > SLUB have essentially the same linecount:
> >
> > $ wc -l mm/slab.c mm/slub.c
> > 4478 mm/slab.c
> > 4125 mm/slub.c
> >
> > (and while linecount != complexity, there is a strong relationship.)
> >
> > With SLAB having 10 years more test coverage and tuning.
>
> FWIW, the one thing slub does that slab doesn't that I find really
> nice is being enable to enable debugging at boot time rather than
> compile time.

yes, but that's largely due to "dont change SLAB because we've got SLUB"
resistence to SLAB patches. It's a 2 minute hack to implement this for
SLAB.

> We don't get many people running benchmarks against the Fedora kernel,
> so any scalability differences between slub/slab probably won't reach
> us until we start shipping betas of the next RHEL based on the same
> kernel.
>
> Which leaves my only other gripe. It broke slabtop.

that's actually a _bad_ ABI regression. Rafael, could you please add
this to the regressions list?

> There's an alternative implementation in Documentation/vm/slabinfo.c
> (why there not say, util-linux, home of current slabtop?)

the kernel should output /proc/slabinfo data with the same formatting,
period.

Ingo

2007-12-11 09:02:03

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

* Stefano Brivio <[email protected]> wrote:

> > Stefano, could you please try to sum up your experiences with that
> > issue? Is it reproducable, and the 5 patches i did fix it? (if yes,
> > could you try to re-do the mdelay verifications perhaps, to make
> > sure it's not some other effect interacting here. In theory
> > sched-clock scaling has no effect on udelay behavior.)
>
> Sorry for disappearing. Anyway, yes, those patches fixed it. Precision
> in delays isn't that good when using my crappy unstable TSC
> (mdelay(2000) causes delays between 2 and 2.9 seconds) but it's not
> depending on frequency changes anymore. So I'd say it's fixed, but
> please tell me if you want me to do any other test so as to be sure it
> is.

ok, just to make sure we are all synced up. I made 8 patches related to
this problem category (and all the trickle effects). 3 are upstream
already, 5 are pending for v2.6.25. One out of those 5 is an immaterial
cleanup patch - which leaves us 4 patches to sort out.

So i'd suggest for you to try latest -git - that will tell us whether
udelay() is acceptable on your box right now.

i've attached those 4 patches:

x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch
x86-cpu-clock-idle-event.patch
sched-printk-recursion-fix.patch
sched-printk-clock-fix.patch

none of them is _supposed_ to have any effect on udelay(), but the
interactions in this area are weird.

[ note: CONFIG_PRINTK_TIME will be broken and only fixed in v2.6.25, so
use some other time metric for determining mdelay quality. ]

plus then there's this patch:

http://lkml.org/lkml/2007/12/7/100

is it perhaps this one that fixed udelay for you? [ which would be much
more expected, as this patch changes udelay ;-) ]

Ingo

Attachments:

(No filename) (1.75 kB)
x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch (7.24 kB)
x86-cpu-clock-idle-event.patch (2.01 kB)
sched-printk-recursion-fix.patch (3.11 kB)
sched-printk-clock-fix.patch (943.00 B)
Download all attachments

2007-12-11 19:03:45

by Peter Zijlstra

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Tue, 2007-12-11 at 09:52 +0100, Ingo Molnar wrote:
> * Dave Jones <[email protected]> wrote:

> > Which leaves my only other gripe. It broke slabtop.
>
> that's actually a _bad_ ABI regression. Rafael, could you please add
> this to the regressions list?
>
> > There's an alternative implementation in Documentation/vm/slabinfo.c
> > (why there not say, util-linux, home of current slabtop?)
>
> the kernel should output /proc/slabinfo data with the same formatting,
> period.

I can remember it being said that would be one of the things to-do
before SLUB could become default.

2007-12-11 21:14:48

by Stefano Brivio

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

On Tue, 11 Dec 2007 10:01:20 +0100
Ingo Molnar <[email protected]> wrote:

> ok, just to make sure we are all synced up. I made 8 patches related to
> this problem category (and all the trickle effects). 3 are upstream
> already, 5 are pending for v2.6.25. One out of those 5 is an immaterial
> cleanup patch - which leaves us 4 patches to sort out.
>
> So i'd suggest for you to try latest -git - that will tell us whether
> udelay() is acceptable on your box right now.

Yes, it is (msleep(2000), as said, gives delays between 2 and 2.9s on my
box, and drivers are happy).

The commit which fixed this (it seems) is
fa2dd441df28b9fdfc68f84ae66f1b507cfff0e4. I'll bisect and tell you more in the
next days.

> i've attached those 4 patches:
>
> x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch
> x86-cpu-clock-idle-event.patch
> sched-printk-recursion-fix.patch
> sched-printk-clock-fix.patch
>
> none of them is _supposed_ to have any effect on udelay(), but the
> interactions in this area are weird.

No effects here IIRC.

> [ note: CONFIG_PRINTK_TIME will be broken and only fixed in v2.6.25, so
> use some other time metric for determining mdelay quality. ]
>
> plus then there's this patch:
>
> http://lkml.org/lkml/2007/12/7/100
>
> is it perhaps this one that fixed udelay for you? [ which would be much
> more expected, as this patch changes udelay ;-) ]

Will try it ASAP, again, in the next few days anyway.

--
Ciao
Stefano

2007-12-13 11:58:56

by Takashi Iwai

[permalink] [raw]

Subject: Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

[Sorry for the late response as I've been on vacation]

At Sat, 8 Dec 2007 21:15:44 -0500,
Theodore Tso wrote:
>
> On Sat, Dec 08, 2007 at 11:30:53PM +0100, Rafael J. Wysocki wrote:
> > On Saturday, 8 of December 2007, Theodore Tso wrote:
> > > However, as far as I am concerned, Ingo's patch, first posted to LKML
> > > here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
> > > the above regression. Rafael, could you please make a note of this in
> > > your regression list,
> >
> > Done, thanks.
>
> Great, thanks. I should add that technically this wasn't a regression
> since I had been seeing this since before 2.6.23. Also, it isn't a
> big deal, since aside from noise in the syslog, falling back to
> polling more doesn't make any functional or user-visible difference
> (although I guess it's less efficient).
>
> Regardless of whether it is a regression, it would be nice to get the
> patch applied and and this issue fixed for 2.6.25!

You mean 2.6.24 ? ;-)

Yes, if it solves the problem, not only improves the latency, it's
definitely nice to have now. I was just too conservative to mark it
for 2.6.24 merge although it looks safe.

Jaroslav, could you prepare this for the push? It corresponds to
alsa-kernel HG changeset 5557.

thanks,

Takashi

2007-12-13 22:04:13

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007, Linus Torvalds wrote:

> On that note, shouldn't we also do this for slub.c? Christoph?

SLUB does not pass __GFP_ZERO to the page allocator.

2007-12-13 22:07:28

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sun, 9 Dec 2007, Ingo Molnar wrote:

> unless i'm missing something obvious (and i easily might), i see SLUB as
> SLAB reimplemented with a different queueing model. Not "without
> queueing".

The "queue" that you are talking about is the freelist of a slab. It exist
for each slab. SLAB uses a table of free entries there. The freelist is
limited to a slab and thus the locality of the freelist is bound to
the same page which avoids NUMA locality checks.

2007-12-14 04:08:17

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

On Sat, 8 Dec 2007, Ingo Molnar wrote:

>
> > Good. Although we should perhaps look at that reported performance
> > problem with SLUB. It looks like SLUB will do a memclear() for the
> > area twice (first for the whole page, then for the thing it allocated)
> > for the slow case. Maybe that exacerbates the problem.
>
> i dont think the SLUB problem could be explained purely via a double
> memset(). [which ought to be extremely fast anyway] We are talking about
> a 10 times slowdown on a 64-way box of a workload that is fairly
> common-sense. (tasks sending messages to each other via bog standard
> means)
>
> while i dont want to jump to conclusions without looking at some
> profiles, i think the SLUB performance regression is indicative of the
> following fallacy: "SLAB can be done significantly simpler while keeping
> the same performance".

Well this is double crap. First of all SLUB does not do memclear twice.
There is no reason to assume that SLUB has the problem just because SLOB
hat that. A "fix" for that nonexistent problem went into Linus tree. WTH
is going on?

SLUB was done because of a series of problem with the basic concepts of
SLAB that treaten it usability in the future.

> I couldnt point to any particular aspect of SLAB that i could
> characterise as "needless bloat".

I agree, SLABs architecture is pretty tight and I was one of those who
helped it along to be that way.

However, SLAB is just fundamentally wrong for todays machine. The key
problem today is cacheline fetch latency and that problem will increase
significantly in the future. Sure under some circumstances that exploit
the fact that SLAB sometimes gets its guesses on the cpu cache right SLAB
can still win but the more processors and nodes we get the more it will
become difficult to keep SLAB around and the more it will become
difficult to establish what cachelines are in the cpu cache.

> I think we should we make SLAB the default for v2.6.24 ...

If you guarantee that all the regression of SLAB vs. SLUB are addressed
then thats fine but AFAICT that is not possible.

Here is a list of some of the benefits of SLUB just in case we forgot:

- SLUB is performance wise much faster than SLAB. This can be more than a
factor of 10 (case of concurrent allocations / frees on multiple
processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB

- Remote freeing of objectcs in a NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. Requires recompile of the kernel
and the resulting output is difficult to interpret. SLUB can apply
debugging options to a subset of the slabcaches in order to allow
the system to work with maximum speed. This is necessary to detect
difficult to reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem
gets worse the more processors and NUMA nodes are in the system. The
amount of memory limits the number of per cpu objects one can configure.

- SLAB requires a pass through all slab caches every 2 seconds to
expire objects. This is a problem both for realtime and MPI jobs
that cannot take such a processor outage.

- SLAB does not have a sophisticated slabinfo tool to report the
state of slab objects on the system. Can provide details of
object use.

- SLAB requires the update of two words for freeing
and allocation. SLUB can do that by updating a single
word which allows to avoid enabling and disabling interrupts if
the processor supports an atomic instruction for that purpose.
This is important for realtime kernels where special measures
may have to be implemented if one wants to disable interrupts.

- SLAB requires memory to be set aside for queues (processors
times number of slabs times queue size). SLUB requires none of that.

- SLUB merges slab caches with similar characteristics to
reduce the memory footprint even further.

- SLAB performs object level NUMA management which creates
a complex allocator complexity. SLUB manages NUMA on the level of
slab pages reducing object management overhead.

- SLUB allows remote node defragmentation to avoid the buildup
of large partial lists on a single node.

- SLUB can actively reduce the fragmentation of slabs through
slab cache specific callbacks (not merged yet)

- SLUB has resiliency features that allow it to isolate a problem
object and continue after diagnostics have been performed.

- SLUB creates rarely used DMA caches on demand instead of creating
them all on bootup (SLAB).

2007-12-14 07:17:25

by Eric Dumazet

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

Christoph Lameter a ?crit :
> On Sat, 8 Dec 2007, Ingo Molnar wrote:
>
>>> Good. Although we should perhaps look at that reported performance
>>> problem with SLUB. It looks like SLUB will do a memclear() for the
>>> area twice (first for the whole page, then for the thing it allocated)
>>> for the slow case. Maybe that exacerbates the problem.
>> i dont think the SLUB problem could be explained purely via a double
>> memset(). [which ought to be extremely fast anyway] We are talking about
>> a 10 times slowdown on a 64-way box of a workload that is fairly
>> common-sense. (tasks sending messages to each other via bog standard
>> means)
>>
>> while i dont want to jump to conclusions without looking at some
>> profiles, i think the SLUB performance regression is indicative of the
>> following fallacy: "SLAB can be done significantly simpler while keeping
>> the same performance".
>
> Well this is double crap. First of all SLUB does not do memclear twice.
> There is no reason to assume that SLUB has the problem just because SLOB
> hat that. A "fix" for that nonexistent problem went into Linus tree. WTH
> is going on?
>
> SLUB was done because of a series of problem with the basic concepts of
> SLAB that treaten it usability in the future.
>
>> I couldnt point to any particular aspect of SLAB that i could
>> characterise as "needless bloat".
>
> I agree, SLABs architecture is pretty tight and I was one of those who
> helped it along to be that way.
>
> However, SLAB is just fundamentally wrong for todays machine. The key
> problem today is cacheline fetch latency and that problem will increase
> significantly in the future. Sure under some circumstances that exploit
> the fact that SLAB sometimes gets its guesses on the cpu cache right SLAB
> can still win but the more processors and nodes we get the more it will
> become difficult to keep SLAB around and the more it will become
> difficult to establish what cachelines are in the cpu cache.
>
>> I think we should we make SLAB the default for v2.6.24 ...
>
> If you guarantee that all the regression of SLAB vs. SLUB are addressed
> then thats fine but AFAICT that is not possible.
>
> Here is a list of some of the benefits of SLUB just in case we forgot:
>
>
> - SLUB is performance wise much faster than SLAB. This can be more than a
> factor of 10 (case of concurrent allocations / frees on multiple
> processors). See http://lkml.org/lkml/2007/10/27/245
>
> - Single threaded allocation speed is up to double that of SLAB
>
> - Remote freeing of objectcs in a NUMA systems is typically 30% faster.
>
> - Debugging on SLAB is difficult. Requires recompile of the kernel
> and the resulting output is difficult to interpret. SLUB can apply
> debugging options to a subset of the slabcaches in order to allow
> the system to work with maximum speed. This is necessary to detect
> difficult to reproduce race conditions.
>
> - SLAB can capture huge amounts of memory in its queues. The problem
> gets worse the more processors and NUMA nodes are in the system. The
> amount of memory limits the number of per cpu objects one can configure.
>
> - SLAB requires a pass through all slab caches every 2 seconds to
> expire objects. This is a problem both for realtime and MPI jobs
> that cannot take such a processor outage.
>
> - SLAB does not have a sophisticated slabinfo tool to report the
> state of slab objects on the system. Can provide details of
> object use.
>
> - SLAB requires the update of two words for freeing
> and allocation. SLUB can do that by updating a single
> word which allows to avoid enabling and disabling interrupts if
> the processor supports an atomic instruction for that purpose.
> This is important for realtime kernels where special measures
> may have to be implemented if one wants to disable interrupts.
>
> - SLAB requires memory to be set aside for queues (processors
> times number of slabs times queue size). SLUB requires none of that.
>
> - SLUB merges slab caches with similar characteristics to
> reduce the memory footprint even further.
>
> - SLAB performs object level NUMA management which creates
> a complex allocator complexity. SLUB manages NUMA on the level of
> slab pages reducing object management overhead.
>
> - SLUB allows remote node defragmentation to avoid the buildup
> of large partial lists on a single node.
>
> - SLUB can actively reduce the fragmentation of slabs through
> slab cache specific callbacks (not merged yet)
>
> - SLUB has resiliency features that allow it to isolate a problem
> object and continue after diagnostics have been performed.
>
> - SLUB creates rarely used DMA caches on demand instead of creating
> them all on bootup (SLAB).
>

Yes, SLUB should be the way to go, but some issues are not yet solved.

I had to switch back to SLAB on a production NUMA server, with 2 nodes and 8GB
ram. Using a lot of sockets, so a large part of memory was used by kernel.

SLUB kernel was hitting OOM after 2 or 3 days of uptime.
SLAB kernel never hit this.

Unfortunatly I dont have a test machine to reproduce the setup.

Maybe the problem is not related to SLUB at all, but an underlying VM/NUMA bug.

The /proc/buddyinfo showed that :

Node 0 contained two zones (DMA and DMA32) total 4 GB
Node 1 contained one zone (Normal) total 4 GB

So Node 0 contained no (Normal) zone

part of /proc/meminfo

Slab: 3338512 kB
SReclaimable: 789716 kB
SUnreclaim: 2548796 kB

I remember network interrupts were taken by CPU 1, so most allocations were
done by CPU 1 (node 1), and many freeing were done on CPU 0

Hope this helps

2007-12-14 12:49:42

[permalink] [raw]

Subject: Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

* Christoph Lameter <[email protected]> wrote:

> > I think we should we make SLAB the default for v2.6.24 ...
>
> If you guarantee that all the regression of SLAB vs. SLUB are
> addressed then thats fine but AFAICT that is not possible.

huh? You got the ordering wrong ;-) SLUB needs to resolve all
regressions relative to SLAB. (or at least have a really good
explanation about why it regresses)

> Here is a list of some of the benefits of SLUB just in case we forgot:
>
> - SLUB is performance wise much faster than SLAB. This can be more than a
> factor of 10 (case of concurrent allocations / frees on multiple
> processors). See http://lkml.org/lkml/2007/10/27/245

which is of little help if it regresses on other workloads. As we've
seen it, SLUB can be more than 10 times slower on hackbench. You can
tune SLUB to use 2MB pages but of course that's not a production level
system. OTOH, have you tried to tune SLAB in the above benchmark?

> - Single threaded allocation speed is up to double that of SLAB

link?

> - Remote freeing of objectcs in a NUMA systems is typically 30% faster.
>
> - Debugging on SLAB is difficult. Requires recompile of the kernel
> and the resulting output is difficult to interpret. SLUB can apply
> debugging options to a subset of the slabcaches in order to allow
> the system to work with maximum speed. This is necessary to detect
> difficult to reproduce race conditions.

that's not a fundamental property of SLAB. It would be an about 10 lines
hack to enable SLAB debugging switchable-on runtime, with the boot flag
defaulting to 'off'.

> - SLAB can capture huge amounts of memory in its queues. The problem
> gets worse the more processors and NUMA nodes are in the system. The
> amount of memory limits the number of per cpu objects one can
> configure.

well that's the nature of caches, but it could be improved: restrict
alien caches along cpusets and demand-allocate them.

> - SLAB requires a pass through all slab caches every 2 seconds to
> expire objects. This is a problem both for realtime and MPI jobs
> that cannot take such a processor outage.

the moment you start capturing more memory in SLUB's per cpu queues
(which do exist), you will have the same sort of problem.

> - SLAB does not have a sophisticated slabinfo tool to report the
> state of slab objects on the system. Can provide details of object
> use.

again, not a fundamental property of SLAB.

> - SLAB requires the update of two words for freeing
> and allocation. SLUB can do that by updating a single word which
> allows to avoid enabling and disabling interrupts if the processor
> supports an atomic instruction for that purpose. This is important
> for realtime kernels where special measures may have to be
> implemented if one wants to disable interrupts.

i do appreciate that :-) SLUB was rather easy to "port" to PREEMPT_RT:
it did not need a single line of change. The SLAB portion is a lot
scarier:

dione:~linux-rt.q> diffstat patches/rt-slab-new.patch
1 file changed, 319 insertions(+), 177 deletions(-)

> - SLAB requires memory to be set aside for queues (processors
> times number of slabs times queue size). SLUB requires none of that.
>
> - SLUB merges slab caches with similar characteristics to
> reduce the memory footprint even further.
>
> - SLAB performs object level NUMA management which creates
> a complex allocator complexity. SLUB manages NUMA on the level of
> slab pages reducing object management overhead.
>
> - SLUB allows remote node defragmentation to avoid the buildup
> of large partial lists on a single node.
>
> - SLUB can actively reduce the fragmentation of slabs through
> slab cache specific callbacks (not merged yet)
>
> - SLUB has resiliency features that allow it to isolate a problem
> object and continue after diagnostics have been performed.

all of these are neat.

How about renaming it to SLAB2 instead of SLUB? The "unqueued" bit is
just stupid NIH syndrome. It's _of course_ queued because it has to. "It
does not have _THAT_ queue as SLAB used to have" is just a silly excuse.

> - SLUB creates rarely used DMA caches on demand instead of creating
> them all on bootup (SLAB).

actually, this might be a bug. the DMA caches should be created right
away and filled with a small amount of objects due to stupid 16MB
limitations with certain hardware. Later on a GFP_DMA request might not
be fulfillable. (because that zone is filled up pretty quickly)

Ingo

2007-12-17 20:14:36