2006-11-11 18:01:06

by Andrew Morton

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sat, 11 Nov 2006 03:29:32 -0800
[email protected] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=7495
>
> Summary: Kernel periodically hangs.
> Kernel Version: Linux version 2.6.18.2 (root@pub) (gcc version 3.4.6)
> #13 SMP Fr
> Status: NEW
> Severity: blocking
> Owner: [email protected]
> Submitter: [email protected]
>
>
> [42587.676000] BUG: unable to handle kernel NULL pointer dereference at
> virtual address 0000003c
> [42587.680000] printing eip:
> [42587.680000] 781610e7
> [42587.680000] *pde = 00000000
> [42587.680000] Oops: 0000 [#1]
> [42587.684000] SMP
> [42587.684000] Modules linked in: sata_promise sk98lin 8250_pnp 8250
> i2c_nforce2 ehci_hcd serial_core sata_nv ahci i2c_core ohci_hcd forcedeth
> libata
> [42587.688000] CPU: 1
> [42587.688000] EIP: 0060:[<781610e7>] Not tainted VLI
> [42587.688000] EFLAGS: 00010286 (2.6.18.2 #13)
> [42587.692000] EIP is at clear_inode+0x96/0xce
> [42587.692000] eax: 00000000 ebx: c0102240 ecx: f7f278d4 edx: f510d400
> [42587.692000] esi: c0102384 edi: f7e6dec0 ebp: 00000070 esp: f7e6de98
> [42587.696000] ds: 007b es: 007b ss: 0068
> [42587.696000] Process kswapd0 (pid: 230, ti=f7e6c000 task=f7c03560
> task.ti=f7e6c000)
> [42587.696000] Stack: c0102248 c0102240 7816116a da7b4af0 da7b4af8 00000000
> 00000080 781614a2
> [42587.700000] 00000080 00000080 c01023f8 ef78dca8 00000000 00009858
> 00000083 f7fee560
> [42587.700000] 781614c8 7813a643 00261600 00000000 00009858 00000005
> 00000000 00000000
> [42587.700000] Call Trace:
> [42587.704000] [<7816116a>] dispose_list+0x4b/0xc1
> [42587.708000] [<781614a2>] prune_icache+0x17c/0x18e
> [42587.708000] [<781614c8>] shrink_icache_memory+0x14/0x2b
> [42587.708000] [<7813a643>] shrink_slab+0x130/0x18c
> [42587.712000] [<7813b75a>] balance_pgdat+0x1ea/0x2dd
> [42587.712000] [<7813b933>] kswapd+0xe6/0xe8
> [42587.716000] [<781261dc>] kthread+0x7d/0xa1
> [42587.716000] [<78100e05>] kernel_thread_helper+0x5/0xb

I've seen three or four reports of oopses like this in 2.6.18. I have a
suspicion we broke something.
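
To compare several reports like this one, it can help to reduce each oops to the list of symbols it names. What follows is a minimal sketch in Python, assuming the symbol+0xoffset/0xsize notation printed in the trace above. The names parse_frames and signature are made up for illustration and are not part of any kernel tooling.

```python
import re

# "clear_inode+0x96/0xce" reads as: 0x96 bytes into clear_inode, which is 0xce bytes long.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][A-Za-z0-9_.]*)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_frames(oops_text):
    """Return (symbol, offset, size) for every symbol+0xoff/0xsize in the text, in order."""
    return [
        (m.group("symbol"), int(m.group("offset"), 16), int(m.group("size"), 16))
        for m in FRAME_RE.finditer(oops_text)
    ]


def signature(oops_text):
    """Hypothetical helper: only the symbol names, for comparing two reports."""
    return tuple(symbol for symbol, _offset, _size in parse_frames(oops_text))


# Lines copied from the report quoted above.
sample = """\
[42587.692000] EIP is at clear_inode+0x96/0xce
[42587.704000] [<7816116a>] dispose_list+0x4b/0xc1
[42587.708000] [<781614a2>] prune_icache+0x17c/0x18e
[42587.708000] [<781614c8>] shrink_icache_memory+0x14/0x2b
"""

for symbol, offset, size in parse_frames(sample):
    print(f"{symbol}: offset {offset:#x} of {size:#x}")
```

Two reports whose signatures start with the same symbols (here clear_inode reached from dispose_list and prune_icache) are candidates for being the same bug, though matching names alone does not prove it.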


> Kernel started with noapic option, because it hangs on load without this option.

Him and a million other people. I know we broke APIC. Around 2.6.9, I
think.


2006-11-11 18:10:18

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

> > Kernel started with noapic option, because it hangs on load without this option.
>
> Him and a million other people. I know we broke APIC. Around 2.6.9, I
> think.


is that when the "enable apic even on UP so that distro kernels can
install on the ibm x44*" patches went in?

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-11-11 18:20:18

by Andrew Morton

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sat, 11 Nov 2006 19:10:03 +0100
Arjan van de Ven <[email protected]> wrote:

> > > Kernel started with noapic option, because it hangs on load without this option.
> >
> > Him and a million other people. I know we broke APIC. Around 2.6.9, I
> > think.
>
>
> is that when the "enable apic even on UP so that distro kernels can
> install on the ibm x44*" patches went in?
>

I don't know. In fact I forget how I worked out that it worsened in
2.6.early.

google(noapic) gets 232,000 hits.

I don't think it really matters when or why it happened. If we take the
approach of fixing one machine at a time, we'll only need to fix a few
individual machines to improve the situation for a lot of people.

2006-11-12 11:50:50

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.


> I don't know. In fact I forget how I worked out that it worsened in
> 2.6.early.
>
> google(noapic) gets 232,000 hits.

is there a way to ask google "only stuff in the last year"?
Asking because "noapic" in 2.4 was the standard "try this" answer when
people had a bios that had busted MPS (but good ACPI)...


> I don't think it really matters when or why it happened.

well to some degree it does; if it's one patch causing it narrowing it
down at least somewhat in time would help ;)

> If we take the
> approach of fixing one machine at a time, we'll only need to fix a few
> individual machines to improve the situation for a lot of people.

alternative is that more new machines showed up that need it somehow, eg
not really a regression just something else. Different approach is
needed for hunting that down. But to be realistic we need to narrow
things down a bit, which means

1) Only care about SMP machines. APIC on true UP (no
Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
doesn't use it) and is just too likely to trip up SMM and other bad BIOS
stuff.
* exception is probably people who don't WANT to use apic but where it
somehow gets used anyway; if that happens we probably have the magic
bullet that causes the regression :)
2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
this, but most vendors hardly maintain those anymore at all and they are
generally just /dev/random nowadays
3) Ignore overclocking; if you overclock using the FSB the apic busses
run out of spec as well; can be a huge timewaster in debug time.
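
The three narrowing rules above can be sketched as a small filter over incoming reports. This is only an illustration in Python: the NoapicReport fields and the worth_chasing function are hypothetical names, and applying the ACPI and overclocking rules to the UP exception as well is an assumption, not something stated in the list.

```python
from dataclasses import dataclass


@dataclass
class NoapicReport:
    """Hypothetical summary of one 'needs noapic' report."""
    smp: bool                            # SMP, hyperthreaded or dual-core machine
    uses_acpi: bool                      # kernel uses ACPI rather than MPS tables
    overclocked: bool                    # FSB overclocking runs the APIC bus out of spec
    apic_used_unrequested: bool = False  # APIC got used although the user did not want it


def worth_chasing(report):
    """Apply rules 1-3 from the list above, including the exception to rule 1."""
    if report.overclocked:
        return False  # rule 3: ignore overclocked systems
    if not report.uses_acpi:
        return False  # rule 2: only ACPI-using kernels
    # Rule 1: SMP only, except UP machines where the APIC is used unasked.
    return report.smp or report.apic_used_unrequested


reports = [
    NoapicReport(smp=True, uses_acpi=True, overclocked=False),
    NoapicReport(smp=False, uses_acpi=True, overclocked=False),
    NoapicReport(smp=False, uses_acpi=True, overclocked=False, apic_used_unrequested=True),
    NoapicReport(smp=True, uses_acpi=True, overclocked=True),
]
print([worth_chasing(r) for r in reports])
```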




2006-11-12 12:53:55

by Adrian Bunk

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 12:50:37PM +0100, Arjan van de Ven wrote:
>
> > I don't know. In fact I forget how I worked out that it worsened in
> > 2.6.early.
> >
> > google(noapic) gets 232,000 hits.
>
> is there a way to ask google "only stuff in the last year"?
> Asking because "noapic" in 2.4 was the standard "try this" answer when
> people had a bios that had busted MPS (but good ACPI)...

Some APIC-related bugs in the kernel Bugzilla that have been reported or
confirmed during the last 12 months (I only looked at "apic" in the
subject, there might be more related bugs in the Bugzilla):

#5038 Fast running system clock with IO-APIC enabled
#5303 AMD64 Erratum: Should not enable C2 when using APIC
#5565 Guess of i386 APIC PTE area scribble
#6404 APIC error on CPU0: 40(40)
#6748 Clock drifts by 30% for SMP kernel w/APIC
#6859 Linux kernel won't work without "nolapic" passed
#6890 Kernel boot freezes when APIC is enabled & SATA is used

> > I don't think it really matters when or why it happened.
>
> well to some degree it does; if it's one patch causing it narrowing it
> down at least somewhat in time would help ;)
>
> > If we take the
> > approach of fixing one machine at a time, we'll only need to fix a few
> > individual machines to improve the situation for a lot of people.
>
> alternative is that more new machines showed up that need it somehow, eg
> not really a regression just something else. Different approach is
> needed for hunting that down. But to be realistic we need to narrow
> things down a bit, which means
>
> 1) Only care about SMP machines. APIC on true UP (no
> Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> stuff.
> * exception is probably people who don't WANT to use apic but where it
> somehow gets used anyway; if that happens we probably have the magic
> bullet that causes the regression :)

On i386, it's a kernel configuration option.

On x86_64, the APIC is currently always enabled even when configuring a
UP kernel.

> 2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
> this, but most vendors hardly maintain those anymore at all and they are
> generally just /dev/random nowadays

What about non-ACPI SMP?

> 3) Ignore overclocking; if you overclock using the FSB the apic busses
> run out of spec as well; can be a huge timewaster in debug time.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-11-12 13:16:33

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.


> Some APIC-related bugs in the kernel Bugzilla that have been reported or
> confirmed during the last 12 months (I only looked at "apic" in the
> subject, there might be more related bugs in the Bugzilla):
>
> #5038 Fast running system clock with IO-APIC enabled

This is a UP machine. NotInteresting(tm) wrt APIC.

> #5303 AMD64 Erratum: Should not enable C2 when using APIC

This is clearly not a linux issue but a hardware bug, as the title says

> #5565 Guess of i386 APIC PTE area scribble
this is only on one machine and a "special case"; not ruling out
anything fundamental but..

> #6404 APIC error on CPU0: 40(40)

This bug is a mess though; many different people seeing a symptom of an
apic error, and all jumping in assuming they see the same problem...
Also it's afaik only a message and not (yet) fatal in any way.
Sometimes apics do this a few times a day, esp when things are getting
hot in the box. Afaik there is then just a resend of the message and
nothing is lost.

> #6748 Clock drifts by 30% for SMP kernel w/APIC

this looks like a totally weird hardware case that probably just wants
to be blacklisted.

> #6859 Linux kernel won't work without "nolapic" passed
weird one, probably a bios issue but it's the opposite of "noapic", and
also this is about local apic not about ioapic. Although they share 4
letters they're entirely different animals.

> #6890 Kernel boot freezes when APIC is enabled & SATA is used

seems to be UP as well but asked for confirmation in the bug (lack of
lots of information here!).

If this isn't UP this could be the first real case of "noapic" in your
entire list...... which isn't too useful.
Maybe we need to get more/any people who see "need noapic on SMP" to
file a bug (and provide a reasonable amount of info)

> >
> > 1) Only care about SMP machines. APIC on true UP (no
> > Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> > doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> > stuff.
> > * exception is probably people who don't WANT to use apic but where it
> > somehow gets used anyway; if that happens we probably have the magic
> > bullet that causes the regression :)
>
> On i386, it's a kernel configuration option.

yes but it's generally a bad idea to set it; it only works on some
machines. (and it can't be fixed)
>
> On x86_64, the APIC is currently always enabled even when configuring a
> UP kernel.

I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
cause it to be turned off automatically most of the time.

>
> > 2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
> > this, but most vendors hardly maintain those anymore at all and they are
> > generally just /dev/random nowadays
>
> What about non-ACPI SMP?

if the machine is new enough to run ACPI I don't care about the non-ACPI
case; just enable it. Really. On newish machines (and that is 7 years
old or newer) MPS tables are NOT getting much if any attention by the
bios guys. So Linux should use ACPI, and if you deliberately disable
ACPI and THEN hit a problem to a large degree you asked for the problem
in the first place.

Older machines, different story.


2006-11-12 13:37:57

by Adrian Bunk

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
>
> > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > confirmed during the last 12 months (I only looked at "apic" in the
> > subject, there might be more related bugs in the Bugzilla):
> >
> > #5038 Fast running system clock with IO-APIC enabled
>
> This is a UP machine. NotInteresting(tm) wrt APIC.
>...

Currently it's a supported configuration.

We must either handle such cases or explicitly disable the APIC on all
UP machines (BTW: Is there any way to handle this when installing a
distribution kernel with CONFIG_HOTPLUG_CPU=y on a UP machine?).

> > > 1) Only care about SMP machines. APIC on true UP (no
> > > Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> > > doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> > > stuff.
> > > * exception is probably people who don't WANT to use apic but where it
> > > somehow gets used anyway; if that happens we probably have the magic
> > > bullet that causes the regression :)
> >
> > On i386, it's a kernel configuration option.
>
> yes but it's generally a bad idea to set it; it only works on some
> machines. (and it can't be fixed)
> >
> > On x86_64, the APIC is currently always enabled even when configuring a
> > UP kernel.
>
> I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> cause it to be turned off automatic most of the time.

I'd doubt the latter. Even on my cheap Asus board running an i386
AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
problems.

> > > 2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
> > > this, but most vendors hardly maintain those anymore at all and they are
> > > generally just /dev/random nowadays
> >
> > What about non-ACPI SMP?
>
> if the machine is new enough to run ACPI I don't care about the non-ACPI
> case; just enable it. Really. On newish machines (and that is 7 years
> old or newer) MPS tables are NOT getting much if any attention by the
> bios guys. So Linux should use ACPI, and if you deliberately disable
> ACPI and THEN hit a problem to a large degree you asked for the problem
> in the first place.
>
> Older machines, different story.

My point was regarding the latter ones...

cu
Adrian


2006-11-12 13:57:55

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, 2006-11-12 at 14:37 +0100, Adrian Bunk wrote:
> On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
> >
> > > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > > confirmed during the last 12 months (I only looked at "apic" in the
> > > subject, there might be more related bugs in the Bugzilla):
> > >
> > > #5038 Fast running system clock with IO-APIC enabled
> >
> > This is a UP machine. NotInteresting(tm) wrt APIC.
> >...
>
> Currently it's a supported configuration.

define "supported"; we have code to try it and it's great if it works.
But if it doesn't... you're out of luck.

We KNOW it can't work on a sizable number of machines. This is why it
is a config option; you can enable it if YOUR machine is KNOWN to work,
and you get some gains. But it's also understood that it often won't
work. So any sensible distro (since they have to aim for a wide
audience) disables this option ...

>
> We must either handle such cases or explicitely disable the APIC on all
> UP machines

that'd be the same as setting the config option off...
> > I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> > cause it to be turned off automatic most of the time.
>
> I'd doubt the latter. Even on my cheap Asus board running an i386
> AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
> problems.

"it works on my one machine so it works for everyone". That's simply not
true. We KNOW it can't work everywhere on UP, especially on i386. SMM
assumptions; people gluing the apic pins to the reset line, we've seen
it all.
That it works for you is great. But that doesn't mean it automatically
works for everyone.




2006-11-12 14:10:14

by Adrian Bunk

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 02:57:48PM +0100, Arjan van de Ven wrote:
> On Sun, 2006-11-12 at 14:37 +0100, Adrian Bunk wrote:
> > On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
> > >
> > > > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > > > confirmed during the last 12 months (I only looked at "apic" in the
> > > > subject, there might be more related bugs in the Bugzilla):
> > > >
> > > > #5038 Fast running system clock with IO-APIC enabled
> > >
> > > This is a UP machine. NotInteresting(tm) wrt APIC.
> > >...
> >
> > Currently it's a supported configuration.
>
> define "supported"; we have code to try it and it's great if it works.
> But if it doesn't... you're out of luck.
>
> We KNOW it can't work on a sizable amount of machines. This is why it
> is a config option; you can enable it if YOUR machine is KNOWN to work,
> and you get some gains. But it's also understood that it often it won't
> work. So any sensible distro (since they have to aim for a wide
> audience) disables this option ...

Nowadays, many distributions only ship CONFIG_SMP=y kernels...

> > We must either handle such cases or explicitely disable the APIC on all
> > UP machines
>
> that'd be the same as setting the config option off...

Except for the common case of CONFIG_SMP=y kernels on UP machines...

> > > I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> > > cause it to be turned off automatic most of the time.
> >
> > I'd doubt the latter. Even on my cheap Asus board running an i386
> > AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
> > problems.
>
> "it works on my one machine so it works for everyone". That's simply not
> true. We KNOW it can't work everywhere on UP, especially on i386. SMM
> assumptions; people gluing the apic pins to the reset line, we've seen
> it all.
> That it works for you is great. But that doesn't mean it automatically
> works for everyone.

You miss my point.

You said you'd suspect it to be turned off automatically most of the time,
and that's the point where I think you might be wrong.

cu
Adrian


2006-11-12 14:16:49

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.


> > We KNOW it can't work on a sizable amount of machines. This is why it
> > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > and you get some gains. But it's also understood that it often it won't
> > work. So any sensible distro (since they have to aim for a wide
> > audience) disables this option ...
>
> Nowadays, many distributions only ship CONFIG_SMP=y kernels...

that's a calculated risk on their side (and they know that); they're
balancing not functioning on a set of machines off against needing more
kernels.


> You miss my point.
>
> You said you'd suspect it to be turned off automatic most of the time,
> and that's the point I think you might be wrong at.

it won't be turned off on machines that support dual core processors
etc, since those DO get validated and designed for APIC use.. even if
you only stick a single core processor in. So yes you're right, that
nowadays is a pretty large group. But it's the safe group I guess:)


2006-11-12 15:21:51

by Adrian Bunk

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
>
> > > We KNOW it can't work on a sizable amount of machines. This is why it
> > > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > > and you get some gains. But it's also understood that it often it won't
> > > work. So any sensible distro (since they have to aim for a wide
> > > audience) disables this option ...
> >
> > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
>
> that's a calculated risk on their side (and they know that); they're
> balancing not functioning on a set of machines off against needing more
> kernels.

This might soon affect the majority of Linux users, so it's a case that
has to be handled...

> > You miss my point.
> >
> > You said you'd suspect it to be turned off automatic most of the time,
> > and that's the point I think you might be wrong at.
>
> it won't be turned off on machines that support dual core processors
> etc, since those DO get validated and designed for APIC use.. even if
> you only stick a single core processor in. So yes you're right, that
> nowadays is a pretty large group. But it's the safe group I guess:)

But if the APIC is used even on my more than 1 year old 40 Euro Socket A
board (AFAIK there have never been dual core Socket A processors, there
were no Socket A hyperthreading CPUs, it's not an SMP board, and the
VIA KT600 is not an SMP chipset), then it's not in what you call the "safe
group", and I don't see any reason why my board should behave differently
in this respect from all of the millions of other UP Socket A boards.

Googling shows that your claim "APIC on true UP (no
Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
doesn't use it)" earlier in this thread may have been wrong. Looking at
e.g. [1], it seems Windows does use the APIC even on UP.

cu
Adrian

[1] http://www.microsoft.com/whdc/system/sysperf/IO-APIC.mspx


2006-11-12 15:50:15

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.


> But if APIC is even used on my more than 1 year old 40 Euro Socket A

one sparrow does not a summer make.


now can we get constructive again? If you find a real case where noapic
is needed on an SMP machine, preferably one where it wasn't needed
earlier in 2.6, let us know; it's worthwhile to chase those down
since we know it's a decent use case and it's not flaky hardware.


2006-11-12 16:00:06

by Patrick McFarland

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sunday 12 November 2006 10:21, Adrian Bunk wrote:
> On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > > We KNOW it can't work on a sizable amount of machines. This is why
> > > > it is a config option; you can enable it if YOUR machine is KNOWN to
> > > > work, and you get some gains. But it's also understood that it often
> > > > it won't work. So any sensible distro (since they have to aim for a
> > > > wide audience) disables this option ...
> > >
> > > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
> >
> > that's a calculated risk on their side (and they know that); they're
> > balancing not functioning on a set of machines off against needing more
> > kernels.
>
> This might soon affect the majority of Linux users, so it's a case that
> has to be handled...

I actually agree here. Linux needs to be easier for people to use, not harder.
Isn't there a way for bootloaders or the kernel to figure out early on whether
the machine supports SMP, and if it doesn't, load a uniprocessor kernel instead?

> > > You miss my point.
> > >
> > > You said you'd suspect it to be turned off automatic most of the time,
> > > and that's the point I think you might be wrong at.
> >
> > it won't be turned off on machines that support dual core processors
> > etc, since those DO get validated and designed for APIC use.. even if
> > you only stick a single core processor in. So yes you're right, that
> > nowadays is a pretty large group. But it's the safe group I guess:)
>
> But if APIC is even used on my more than 1 year old 40 Euro Socket A
> board (AFAIK there have never been dual core Socket A processors, there
> were no Socket A hyperthreading CPUs, it's not an SMP board, and the
> VIA KT600 is not an SMP chipset) it's not in what you call "safe group",
> and I don't see any reason why my board should behave different in this
> respect from all of the millions of other UP Socket A boards.
>
> Googling show that it could be that your claim "APIC on true UP (no
> Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> doesn't use it)" earlier in this thread was wrong. Looking at e.g. [1],
> it seems Windows does use the APIC even on UP.

Socket A CPUs are also ungodly common. They're as common as Slot 1/Socket 370
Pentium 3s, and, at least with my old P3 board, trying to use APIC on UP
caused lockups. My 1 GHz Duron laptop does the same thing. (Booting
either one with noapic fixes it.)

So yeah, if distros make stupid choices like these, then we're pretty screwed.

> cu
> Adrian
>
> [1] http://www.microsoft.com/whdc/system/sysperf/IO-APIC.mspx

--
Patrick McFarland || http://AdTerrasPerAspera.com
"Computer games don't affect kids; I mean if Pac-Man affected us as kids,
we'd all be running around in darkened rooms, munching magic pills and
listening to repetitive electronic music." -- Kristian Wilson, Nintendo,
Inc, 1989

2006-11-12 16:07:20

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, 2006-11-12 at 10:59 -0500, Patrick McFarland wrote:
> On Sunday 12 November 2006 10:21, Adrian Bunk wrote:
> > On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > > > We KNOW it can't work on a sizable amount of machines. This is why
> > > > > it is a config option; you can enable it if YOUR machine is KNOWN to
> > > > > work, and you get some gains. But it's also understood that it often
> > > > > it won't work. So any sensible distro (since they have to aim for a
> > > > > wide audience) disables this option ...
> > > >
> > > > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
> > >
> > > that's a calculated risk on their side (and they know that); they're
> > > balancing not functioning on a set of machines off against needing more
> > > kernels.
> >
> > This might soon affect the majority of Linux users, so it's a case that
> > has to be handled...
>
> I actually agree here. Linux needs to be easier for people to use, not harder.
> Isn't there a way for bootloaders or the kernel early on figure out if the
> machine supports SMP, and if it doesnt, load a uniproc kernel instead?

this is what OS installers have been doing for a decade or so.
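
As a rough illustration of that kind of install-time check, the sketch below counts the logical CPUs listed in the text of /proc/cpuinfo and picks a kernel flavour from the result. It is written in Python and assumes the x86 layout with one "processor : N" entry per logical CPU. The function names and the "smp"/"up" labels are hypothetical, and this is not a description of how any particular installer does it.

```python
def count_logical_cpus(cpuinfo_text):
    """Count 'processor : N' entries in the text of /proc/cpuinfo (x86 layout)."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":", 1)[0].strip() == "processor"
    )


def pick_kernel_flavour(cpuinfo_text):
    """Hypothetical installer policy: 'smp' if more than one logical CPU is visible."""
    return "smp" if count_logical_cpus(cpuinfo_text) > 1 else "up"


# Made-up sample input, not taken from any machine in this thread.
one_cpu = "processor\t: 0\nmodel name\t: example CPU\n"
two_cpus = one_cpu + "\nprocessor\t: 1\nmodel name\t: example CPU\n"

print(pick_kernel_flavour(one_cpu), pick_kernel_flavour(two_cpus))
```

The count only reflects the CPUs that the running kernel brought up, so the kernel doing the check would itself have to be SMP-capable for the answer to mean anything.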


2006-11-12 16:47:13

by Adrian Bunk

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 10:59:55AM -0500, Patrick McFarland wrote:
>...
> Socket A CPUs are also ungodly common. They're as common as slot 1/socket 370
> Pentium 3s, and, at least with my old P3 board, trying to use APIC on UP
> caused lockups. My Duron 1ghz laptop also does the same thing. (Booting
> either with noapic fixes it).
>...

It might depend on the age of your computer.

Microsoft mandates the presence of an APIC implemented per MADT and all
hardware interrupts connected to an IOAPIC for all servers and desktops
with a "Designed for Windows XP" sticker.

This implies more or less that a working APIC is present in all
non-laptop x86 UP systems manufactured during the last 5 years.

cu
Adrian


2006-11-12 19:19:16

by Ingo Oeser

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

Hi there,

On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> If this isn't UP this could be the first real case of "noapic" in your
> entire list...... which isn't too useful.
> Maybe we need to get more/any people who see "need noapic on SMP" to
> file a bug (and provide a reasonable amount of info)

I have needed noapic forever (5 years!) to get my USB controller running.
Without noapic it doesn't get any interrupts for some reason.

If now is the time to fix those bugs, I would be happy to try a new kernel
and get you the dmesg + result of plugging in a USB mass storage device
and reading from it on a DAILY basis.

If you need anything else to resolve the issue, I would be happy to help
out here.

Maybe a pattern can be detected, which could help others.
If you would like to blacklist this machine by DMI, that would also
help me.

Many Thanks!

Best Regards

Ingo Oeser



2006-11-12 19:38:54

by Andrew Morton

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, 12 Nov 2006 20:18:51 +0100
Ingo Oeser <[email protected]> wrote:

> Hi there,
>
> On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> > If this isn't UP this could be the first real case of "noapic" in your
> > entire list...... which isn't too useful.
> > Maybe we need to get more/any people who see "need noapic on SMP" to
> > file a bug (and provide a reasonable amount of info)
>
> I need noapic since ever (5 years!) to get my USB controller running.
> Without noapic it doesn't get any interrupts for some reason.
>
> If now is the time to fix those bugs, I would be happy to try a new kernel
> and get you the dmesg + result of plugging in an usb mass storage device
> and reading from it on a DAILY basis.

Yes, please send those. It'd be best to get the info into bugzilla too -
this doesn't look like a quick-fix scenario.


2006-11-12 20:32:47

by Arjan van de Ven

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, 2006-11-12 at 20:18 +0100, Ingo Oeser wrote:
> Hi there,
>
> On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> > If this isn't UP this could be the first real case of "noapic" in your
> > entire list...... which isn't too useful.
> > Maybe we need to get more/any people who see "need noapic on SMP" to
> > file a bug (and provide a reasonable amount of info)
>
> I need noapic since ever (5 years!) to get my USB controller running.
> Without noapic it doesn't get any interrupts for some reason.

so it never worked? (that's important to know versus regression)

Also does this machine use ACPI for interrupt routing?
That's also important, because if you're NOT using ACPI, "noapic" means
that you're using the PIRQ for irq routing and not MPS, so you're not
"just" changing apic behavior, you're actually using a different BIOS
table. (and to be honest, a buggy bios table is more likely the
cause ... ;)
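
One way to see which interrupt controller a given boot actually ended up using is to look at the controller column of /proc/interrupts. The Python sketch below assumes the 2.6-era i386 layout, where that column reads IO-APIC-edge, IO-APIC-level or XT-PIC. The function name and both sample listings are made up for illustration. It does not tell you whether the routing came from ACPI, MPS or PIRQ tables; that still has to be read from the boot messages.

```python
def interrupt_controllers(proc_interrupts_text):
    """Return the controller families ('IO-APIC', 'XT-PIC') named in /proc/interrupts."""
    found = set()
    for line in proc_interrupts_text.splitlines():
        for token in line.split():
            if token.startswith("IO-APIC"):
                found.add("IO-APIC")
            elif token.startswith("XT-PIC"):
                found.add("XT-PIC")
    return found


# Hypothetical listings, shaped like a boot without and with "noapic".
with_apic = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
 16:       4242          0   IO-APIC-level  ohci_hcd:usb1
"""
with_noapic = """\
           CPU0
  0:     123456          XT-PIC  timer
 11:          0          XT-PIC  ohci_hcd:usb1
"""

print(interrupt_controllers(with_apic), interrupt_controllers(with_noapic))
```

Comparing the two listings from the same machine, together with the interrupt counts for the USB controller, is the sort of information that makes a report like the one above easier to act on.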




2006-11-12 21:47:26

by Dave Jones

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.

On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
>
> > > We KNOW it can't work on a sizable amount of machines. This is why it
> > > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > > and you get some gains. But it's also understood that it often it won't
> > > work. So any sensible distro (since they have to aim for a wide
> > > audience) disables this option ...
> >
> > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
>
> that's a calculated risk on their side (and they know that); they're
> balancing not functioning on a set of machines off against needing more
> kernels.

Andi has a nice patch in the suse kernel which adds heuristics to disable
apic on systems where it isn't likely to work. It DTRT in at least
one problem case that I know of. The actual fall-out from enabling
'run SMP kernels on UP i686' for FC6 has mostly been a non-event.
Literally a handful of cases, that will likely all get caught and worked
around by Andi's patch or similar.

Dave

--
http://www.codemonkey.org.uk

2006-11-13 02:08:23

by Andi Kleen

Subject: Re: [Bugme-new] [Bug 7495] New: Kernel periodically hangs.


> Andi has a nice patch in the suse kernel which adds heuristics to disable
> apic on systems where it isn't likely to work. It DTRT in at least
> one problem case that I know of. The actual fall-out from enabling
> 'run SMP kernels on UP i686' for FC6 has mostly been a non-event.
> Literally a handful of cases, that will likely all get caught and worked
> around by Andi's patch or similar.

I haven't pushed that recently because I was busy with other things, but
it needs to be revisited, yes.

One broken case that still happens is that the patch assumes working
SMBIOS. When there is no year in SMBIOS it will turn off APIC because
it assumes it is a very old system. But sometimes new systems that would
like APIC have an invalid or broken SMBIOS year. On very new systems it isn't
a problem again because those tend to have multiple cores.

That could probably be a bit more clever. It's always difficult to
navigate around all kinds of BIOS bugs.
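
The behaviour described above can be sketched roughly as follows. This is not the actual SUSE patch, only a Python illustration of the decision as described in this mail: the function names, the MM/DD/YYYY date format and the cutoff year of 2001 are all assumptions made for the example.

```python
import re

# A plausible four-digit year anywhere in the SMBIOS BIOS date string.
_YEAR_RE = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def bios_year(bios_date):
    """Return the year found in an SMBIOS BIOS date string, or None if there is none."""
    match = _YEAR_RE.search(bios_date or "")
    return int(match.group(1)) if match else None


def apic_default(bios_date, cpu_count, cutoff_year=2001):
    """Hypothetical heuristic: should the APIC be enabled by default on this machine?"""
    if cpu_count > 1:
        return True   # multi-core systems are built and validated for APIC use
    year = bios_year(bios_date)
    if year is None:
        return False  # the broken case: no usable year is treated as "very old"
    return year >= cutoff_year  # cutoff_year is an assumed value, not from the patch


print(apic_default("04/28/2005", 1), apic_default("", 1), apic_default("", 2))
```

The second call shows the problem case from the mail: a new single-core board with an empty or invalid SMBIOS date is handled like an old machine and loses the APIC, while the multi-core check in the third call sidesteps the date entirely.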

-Andi