SMP box, compiled for UP with CONFIG_LOCAL_APIC=y freezes on boot with
last lines:
POSIX conformance testing by UNIFIX
masked ExtINT on CPU#0
ESR value before enabling vector: 00000008
[ Freeze here ]
With SMP, it boots fine (then freezes mysteriously a few mins after
boot, which is what I am still chasing):
masked ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
...
Don't know exactly what kernel this first happened, I usually run SMP.
Clues?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
> SMP box, compiled for UP with CONFIG_LOCAL_APIC=y freezes on boot with
> last lines:
>
> POSIX conformance testing by UNIFIX
> masked ExtINT on CPU#0
> ESR value before enabling vector: 00000008
> [ Freeze here ]
>
> With SMP, it boots fine (then freezes mysteriously a few mins after
> boot, which is what I am still chasing):
>
> masked ExtINT on CPU#0
> ESR value before enabling vector: 00000000
> ESR value after enabling vector: 00000000
> ...
>
> Don't know exactly what kernel this first happened, I usually run SMP.
I put an esr_disable flag in there a while back ... does that workaround it?
M.
In message mbligh wrote:
>> I put an esr_disable flag in there a while back ... does that workaround it?
On Wed, Feb 26, 2003 at 06:14:42PM +1100, Rusty Russell wrote:
> Yes. Hmm. Wonder if that helps my SMP wierness, too.
It shouldn't be set on anything but NUMA-Q and "bigsmp".
-- wli
In message <9530000.1046238665@[10.10.2.4]> you write:
> > SMP box, compiled for UP with CONFIG_LOCAL_APIC=y freezes on boot with
> > last lines:
> >
> > POSIX conformance testing by UNIFIX
> > masked ExtINT on CPU#0
> > ESR value before enabling vector: 00000008
> > [ Freeze here ]
> I put an esr_disable flag in there a while back ... does that workaround it?
Yes. Hmm. Wonder if that helps my SMP wierness, too.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Tue, 25 Feb 2003, William Lee Irwin III wrote:
> In message mbligh wrote:
> >> I put an esr_disable flag in there a while back ... does that workaround it?
>
> On Wed, Feb 26, 2003 at 06:14:42PM +1100, Rusty Russell wrote:
> > Yes. Hmm. Wonder if that helps my SMP wierness, too.
>
> It shouldn't be set on anything but NUMA-Q and "bigsmp".
Hmm.. Why is it right on those, but not on normal machines? The APIC is
the same, and if the big machines need it, apparently at least _one_ small
machine needs it too..
Also, if we find that the ESR value was non-zero, it sounds a bit stupid
to enable error delivery at bootup. We already know there was an error, we
don't need to be told.
Linus
>> >> I put an esr_disable flag in there a while back ... does that
>> >> workaround it?
>>
>> On Wed, Feb 26, 2003 at 06:14:42PM +1100, Rusty Russell wrote:
>> > Yes. Hmm. Wonder if that helps my SMP wierness, too.
>>
>> It shouldn't be set on anything but NUMA-Q and "bigsmp".
>
> Hmm.. Why is it right on those, but not on normal machines? The APIC is
> the same, and if the big machines need it, apparently at least _one_
> small machine needs it too..
>
> Also, if we find that the ESR value was non-zero, it sounds a bit stupid
> to enable error delivery at bootup. We already know there was an error,
> we don't need to be told.
There's bugs in the APIC that mean that once we get an error, we never
manage to get rid of it, I believe. I don't believe Intel have ever
acknowledged that or worked around it, but I think Sequent engineers spent
a lot of time with bus analysers, etc looking at this, and that was the
ultimate conclusion. It's not like we do anything with the error anyway, so
disabling it seemed like the prudent thing to do.
Now in the case Rusty has, would be nice to find why it's changed, this was
just a workaround. On the NUMA-Qs, this always happened, so it's not so
interesting ;-)
M.
[ I added Mikael and Asit to the Cc, since they might be able to confirm
or deny the ESR issue ]
On Wed, 26 Feb 2003, Martin J. Bligh wrote:
>
> Now in the case Rusty has, would be nice to find why it's changed, this was
> just a workaround. On the NUMA-Qs, this always happened, so it's not so
> interesting ;-)
Well, after having re-read the ESR description in the Intel manuals, and
looking at our code, I think the code has always been crap.
The manual clearly states that you clear the ESR by doing back-to-back
writes, and that you read it by writing to it and then reading it. That's
not at _all_ what we do at bootup. It's _also_ not what we do at
"clear_local_apic()".
>From the description in the Intel docs, it really looks like the ESR works
as follows:
- reading ESR reads the "cached" register (which is nothing but a latch,
and has no real "meaning" except as exactly that)
- writing to ESR moves the current real hardware register into the
latched one and it resets the real hardware one.
In other words, you have three cases:
- "read previous state" (which is only really useful for software that
doesn't want to carry the state with it, ie things like status
printing stuff in the error handler):
value = apic_read(APIC_ESR)
This has _nothing_ to do with whatever the current ESR state is, and
only reads the value as it was at the previous latch event.
- latch current state and read it:
apic_write(0, APIC_ESR); // I doubt the value matters
value = apic_read(APIC_ESR);
This reads the real "current state", leaving it in the latch.
- clear and read current state:
apic_write(0, APIC_ESR);
value = apic_read(APIC_ESR);
apic_write(0, APIC_ESR);
(You probably might as well do this sequence to clear the thing even if
you don't actually care what the old value was, if only to make sure
there are no write-buffer bugs).
Also, I would _assume_ that the error interrupt is active based on the
bit-wise "or" of both the latched and the real value, since the docs
clearly say that it must be cleared by sw by back-to-back writes
(rationale: if the error interrupt is only triggered by the latch it is
obviously useless, since we wouldn't get an interrupt for new events. If
the error interrupt is only triggered by the real value, we'd only need a
single write to clear it).
Anyway, the above is clearly not what we're doing with the ESR right now.
Martin: in the esr disable case you clearly write the ESR multiple times
("over the head with a big hammer"), and you must do that because you
noticed that a single write was insufficient. Why four? Did you just
decide that as long as you're doing multiple writes, you might as well
just do "several". Or did four writes work and two didn't?
Asit, can you confirm/deny (or know who can)?
Linus
> Well, after having re-read the ESR description in the Intel manuals, and
> looking at our code, I think the code has always been crap.
Personally I'd point the finger more in the direction of the hardware for
that remark, but still ;-)
> The manual clearly states that you clear the ESR by doing back-to-back
> writes, and that you read it by writing to it and then reading it. That's
> not at _all_ what we do at bootup. It's _also_ not what we do at
> "clear_local_apic()".
Right ... not sure what they were smoking when they wrote / implemented
that, but yes, that's what it says.
> Also, I would _assume_ that the error interrupt is active based on the
> bit-wise "or" of both the latched and the real value, since the docs
> clearly say that it must be cleared by sw by back-to-back writes
> (rationale: if the error interrupt is only triggered by the latch it is
> obviously useless, since we wouldn't get an interrupt for new events. If
> the error interrupt is only triggered by the real value, we'd only need a
> single write to clear it).
>
> Anyway, the above is clearly not what we're doing with the ESR right now.
>
> Martin: in the esr disable case you clearly write the ESR multiple times
> ("over the head with a big hammer"), and you must do that because you
> noticed that a single write was insufficient. Why four? Did you just
> decide that as long as you're doing multiple writes, you might as well
> just do "several". Or did four writes work and two didn't?
The latter, IIRC, 2 writes worked most of the time, but never really fixed
it. Using any kind of logical analysis never seemed to work on that chip
... brute force, trial and error, and 3 months of tearing my hair out was
the only thing that suceeded in the end. A time I have no wish to revisit
;-)
cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
some pointers to hair-restorer during the Linux timeframe.
M.
On Wed, 26 Feb 2003 18:14:42 +1100, Rusty Russell <[email protected]> wrote:
> In message <9530000.1046238665@[10.10.2.4]> you write:
>> > SMP box, compiled for UP with CONFIG_LOCAL_APIC=y freezes on boot with
>> > last lines:
>> >
>> > POSIX conformance testing by UNIFIX
>> > masked ExtINT on CPU#0
>> > ESR value before enabling vector: 00000008
>> > [ Freeze here ]
I suspect the ESR is a red herring. The problem is that the kernel
assumes that the boot CPU is always CPU#0, and it also misprograms the
boot CPU's APIC.
What kind of SMP box is it (Intel/AMD)?
>> I put an esr_disable flag in there a while back ... does that workaround it?
>
> Yes. Hmm. Wonder if that helps my SMP wierness, too.
Does this patch (from Mikael) help? It fixed my problem on a dual AMD
box running a UP + local APIC kernel.
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
--------------------------
--- linux-2.4.21-pre4/arch/i386/kernel/apic.c.~1~ 2003-02-23 15:55:31.000000000 +0100
+++ linux-2.4.21-pre4/arch/i386/kernel/apic.c 2003-02-23 16:03:50.000000000 +0100
@@ -649,7 +649,7 @@
}
set_bit(X86_FEATURE_APIC, &boot_cpu_data.x86_capability);
mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- boot_cpu_physical_apicid = 0;
+ boot_cpu_physical_apicid = -1U;
if (nmi_watchdog != NMI_NONE)
nmi_watchdog = NMI_LOCAL_APIC;
@@ -1169,8 +1169,8 @@
connect_bsp_APIC();
- phys_cpu_present_map = 1;
- apic_write_around(APIC_ID, boot_cpu_physical_apicid);
+ BUG_ON(boot_cpu_physical_apicid != GET_APIC_ID(apic_read(APIC_ID)));
+ phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
apic_pm_init2();
> I suspect the ESR is a red herring. The problem is that the kernel
> assumes that the boot CPU is always CPU#0, and it also misprograms the
> boot CPU's APIC.
>
> What kind of SMP box is it (Intel/AMD)?
The boot cpu *is* always CPU#0. It may not be physical apicid 0, but that
matters not. as long as the mpstables are correct. And we should bug out if
it's not (which is pretty stupid anyway ... we know what the boot cpu ID
is, we should just warn). This is how I fixed it for kexec:
diff -urpN -X /home/fletch/.diff.exclude virgin/arch/i386/kernel/smpboot.c
nonzero_apicid/arch/i386/kernel/smpboot.c
--- virgin/arch/i386/kernel/smpboot.c Sat Feb 15 16:11:40 2003
+++ nonzero_apicid/arch/i386/kernel/smpboot.c Wed Feb 26 13:02:10 2003
@@ -951,6 +951,7 @@ static void __init smp_boot_cpus(unsigne
print_cpu_info(&cpu_data[0]);
boot_cpu_logical_apicid = logical_smp_processor_id();
+ boot_cpu_physical_apicid = hard_smp_processor_id();
current_thread_info()->cpu = 0;
smp_tune_scheduling();
On Wed, 26 Feb 2003, Martin J. Bligh wrote:
> The boot cpu *is* always CPU#0. It may not be physical apicid 0, but that
> matters not. as long as the mpstables are correct. And we should bug out if
> it's not (which is pretty stupid anyway ... we know what the boot cpu ID
> is, we should just warn). This is how I fixed it for kexec:
>
> diff -urpN -X /home/fletch/.diff.exclude virgin/arch/i386/kernel/smpboot.c
> nonzero_apicid/arch/i386/kernel/smpboot.c
> --- virgin/arch/i386/kernel/smpboot.c Sat Feb 15 16:11:40 2003
> +++ nonzero_apicid/arch/i386/kernel/smpboot.c Wed Feb 26 13:02:10 2003
> @@ -951,6 +951,7 @@ static void __init smp_boot_cpus(unsigne
> print_cpu_info(&cpu_data[0]);
>
> boot_cpu_logical_apicid = logical_smp_processor_id();
> + boot_cpu_physical_apicid = hard_smp_processor_id();
>
> current_thread_info()->cpu = 0;
> smp_tune_scheduling();
But this patch is for smpboot.c, which is not even compiled in for a UP
kernel...
Both Rusty and I had problems with a UP+APIC kernel running on an SMP box.
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
> But this patch is for smpboot.c, which is not even compiled in for a UP
> kernel...
>
> Both Rusty and I had problems with a UP+APIC kernel running on an SMP box.
Urg. OK, that's too wierd to think about ;-)
M.
On Wed, 26 Feb 2003, Ion Badulescu wrote:
> > diff -urpN -X /home/fletch/.diff.exclude virgin/arch/i386/kernel/smpboot.c
> > nonzero_apicid/arch/i386/kernel/smpboot.c
> > --- virgin/arch/i386/kernel/smpboot.c Sat Feb 15 16:11:40 2003
> > +++ nonzero_apicid/arch/i386/kernel/smpboot.c Wed Feb 26 13:02:10 2003
> > @@ -951,6 +951,7 @@ static void __init smp_boot_cpus(unsigne
> > print_cpu_info(&cpu_data[0]);
> >
> > boot_cpu_logical_apicid = logical_smp_processor_id();
> > + boot_cpu_physical_apicid = hard_smp_processor_id();
> >
> > current_thread_info()->cpu = 0;
> > smp_tune_scheduling();
>
> But this patch is for smpboot.c, which is not even compiled in for a UP
> kernel...
>
> Both Rusty and I had problems with a UP+APIC kernel running on an SMP box.
What about detect_init_APIC()?
That one currently does an unconditional
boot_cpu_physical_apicid = 0;
and as far as I can see, this is the path taken by the UP case (because it
doesn't even _try_ to find the SMP config tables).
What happens if you just remove that line (which means that the code
further on will do
*/
if (boot_cpu_physical_apicid == -1U)
boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
which might be correct.
Linus
On Wed, 26 Feb 2003, Linus Torvalds wrote:
> What about detect_init_APIC()?
>
> That one currently does an unconditional
>
> boot_cpu_physical_apicid = 0;
Mikael's patch (included in the previous message) changes this to
boot_cpu_physical_apicid = -1U;
which is the same thing indeed.
> What happens if you just remove that line (which means that the code
> further on will do
>
> */
> if (boot_cpu_physical_apicid == -1U)
> boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
>
> which might be correct.
It's not enough. There are two other problems, further down in
APIC_init_uniprocessor():
1) apic_write_around(APIC_ID, boot_cpu_physical_apicid) places the APIC
value in the lower 8 bits of APIC_ID, when it should be in the upper 8. As
as result, it effectively forces the APIC id to always be 0 for the boot
CPU, which is fatal on SMP AMD boxes.
2) phys_cpu_present_map = 1 means we always set bit 0, but later on
in setup_local_APIC() we do
if (!clustered_apic_mode &&
!test_bit(GET_APIC_ID(apic_read(APIC_ID)), &phys_cpu_present_map))
BUG();
and the bug is triggered if the APIC_ID is not zero.
Here's Mikael's patch again -- it's quite obviously correct, it fixes the
problem on my SMP AMD boxes and doesn't break anything else I've thrown at
it. Applies cleanly to both 2.4 and 2.5.latest.
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
-----------------------------------------------
--- linux-2.4.21-pre4/arch/i386/kernel/apic.c.~1~ 2003-02-23 15:55:31.000000000 +0100
+++ linux-2.4.21-pre4/arch/i386/kernel/apic.c 2003-02-23 16:03:50.000000000 +0100
@@ -649,7 +649,7 @@
}
set_bit(X86_FEATURE_APIC, &boot_cpu_data.x86_capability);
mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- boot_cpu_physical_apicid = 0;
+ boot_cpu_physical_apicid = -1U;
if (nmi_watchdog != NMI_NONE)
nmi_watchdog = NMI_LOCAL_APIC;
@@ -1169,8 +1169,8 @@
connect_bsp_APIC();
- phys_cpu_present_map = 1;
- apic_write_around(APIC_ID, boot_cpu_physical_apicid);
+ BUG_ON(boot_cpu_physical_apicid != GET_APIC_ID(apic_read(APIC_ID)));
+ phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
apic_pm_init2();
On Wed, 26 Feb 2003, Ion Badulescu wrote:
>
> Mikael's patch (included in the previous message) changes this to
>
> boot_cpu_physical_apicid = -1U;
>
> which is the same thing indeed.
Yeah, I'd rather remove it if this is it.
> It's not enough. There are two other problems, further down in
> APIC_init_uniprocessor():
>
> 1) apic_write_around(APIC_ID, boot_cpu_physical_apicid) places the APIC
> value in the lower 8 bits of APIC_ID, when it should be in the upper 8. As
> as result, it effectively forces the APIC id to always be 0 for the boot
> CPU, which is fatal on SMP AMD boxes.
Wouldn't it be nicer to just fix the write instead? I can see the
potential to actually want to change the APIC ID - in particular, if the
SMP MP tables say that the APIC ID for the BP should be X, maybe we should
actually write X to it instead of just using what is there.
In particular, Mikaels patch will BUG() if the MP tables don't match the
APIC ID. I think that's extremely rude: we should select one of the two
and just run with it, instead of unconditionally failing.
> 2) phys_cpu_present_map = 1 means we always set bit 0, but later on
> in setup_local_APIC() we do
> if (!clustered_apic_mode &&
> !test_bit(GET_APIC_ID(apic_read(APIC_ID)), &phys_cpu_present_map))
> BUG();
> and the bug is triggered if the APIC_ID is not zero.
Yeah, there's no question something is wrong. However:
> Here's Mikael's patch again -- it's quite obviously correct, it fixes the
> problem on my SMP AMD boxes and doesn't break anything else I've thrown at
> it. Applies cleanly to both 2.4 and 2.5.latest.
I disagree with the "obviously correct", due to the above issue of
mismatches between MP tables and actual APIC contents. I think it is more
correct than what we have now, but..
Linus
> From: Linus Torvalds [mailto:[email protected]]
> Wouldn't it be nicer to just fix the write instead? I can see the
> potential to actually want to change the APIC ID - in
> particular, if the
> SMP MP tables say that the APIC ID for the BP should be X,
> maybe we should
> actually write X to it instead of just using what is there.
OK so we have a redundancy. You can get the same info from MPS and from
the lapic itself.
The fact that ACPI's boot tables does not include the lapic id (just its
address) suggests strongly to me that we should similarly query the
lapic for its address instead of writing in a new value when using the
MPS tables, as well.
> In particular, Mikaels patch will BUG() if the MP tables
> don't match the
> APIC ID. I think that's extremely rude: we should select one
> of the two
> and just run with it, instead of unconditionally failing.
Agree.
Regards -- Andy
> Wouldn't it be nicer to just fix the write instead? I can see the
> potential to actually want to change the APIC ID - in particular, if the
> SMP MP tables say that the APIC ID for the BP should be X, maybe we
> should actually write X to it instead of just using what is there.
That strikes me as the wrong way around. We *are* booting on this CPU,
so the MPS tables are wrong. Why reprogram reality to fit the MPS tables?
Also, you don't know what the other CPUs phys ID's really are if
everything is inconsistent anyway ... if the MPS tables say boot is phys 0,
and phys 1 also exists, and we find we're on phys 1, what now?
Also, if we kexec from a non-original boot cpu, we come back with that CPU
as the boot CPU, which is correct, but doesn't match the MPS tables .. the
one liner I sent before will fix that.
Wouldn't it be better to just leave the phys apicids as is, and just print
a warning of the MPS tables boot flags don't match up with reality? Seems
far less drastic to me.
> In particular, Mikaels patch will BUG() if the MP tables don't match the
> APIC ID. I think that's extremely rude: we should select one of the two
> and just run with it, instead of unconditionally failing.
OK, that's bad. SMP does that currently too, but it's easy to fix without
going to the extent of reprogramming things that we don't really understand
the state of (because we have inconsistent info)
M.
Linus Torvalds writes:
> > 1) apic_write_around(APIC_ID, boot_cpu_physical_apicid) places the APIC
> > value in the lower 8 bits of APIC_ID, when it should be in the upper 8. As
> > as result, it effectively forces the APIC id to always be 0 for the boot
> > CPU, which is fatal on SMP AMD boxes.
>
> Wouldn't it be nicer to just fix the write instead? I can see the
> potential to actually want to change the APIC ID - in particular, if the
> SMP MP tables say that the APIC ID for the BP should be X, maybe we should
> actually write X to it instead of just using what is there.
>
> In particular, Mikaels patch will BUG() if the MP tables don't match the
> APIC ID. I think that's extremely rude: we should select one of the two
> and just run with it, instead of unconditionally failing.
Yes, but that was just a test patch to test my suspicions about why
a UP_APIC kernel failed on an SMP K7 box. (CPU #1 was BSP and was
programmed with CPU #0's local APIC ID.)
I assume that an SMP machine once booted by the BIOS will have
unique local APIC IDs. They're typically either hardwired or programmed
by the BIOS. If the MP table then disagrees with what's in the
CPUs' local APIC IDs, who do you trust: the MP table or the CPUs?
I personally would trust the CPUs and leave the local APIC IDs alone,
in particular since writing to them always risks collisions, especially
in the UP-kernel-on-SMP-HW case.
So I think the BUG should be a warning, but we shouldn't clobber APIC_ID.
/Mikael
Ok, can people agree on this simplified version of Mikaels patch, which
doesn't BUG_ON(), and doesn't reset 'boot_cpu_physical_apicid'
unnecessarily..
Does this work for people?
Linus
---
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1087 -> 1.1088
# arch/i386/kernel/apic.c 1.33 -> 1.34
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/02/26 [email protected] 1.1088
# [PATCH] APIC ID fixes
#
# 1) apic_write_around(APIC_ID, boot_cpu_physical_apicid) places the APIC
# value in the lower 8 bits of APIC_ID, when it should be in the upper 8. As
# as result, it effectively forces the APIC id to always be 0 for the boot
# CPU, which is fatal on SMP AMD boxes.
#
# Fix: don't do the write at all. The APIC_ID value should be right already.
#
# 2) phys_cpu_present_map = 1 means we always set bit 0, but later on
# in setup_local_APIC() we do
# if (!clustered_apic_mode &&
# !test_bit(GET_APIC_ID(apic_read(APIC_ID)), &phys_cpu_present_map))
# BUG();
# and the bug is triggered if the APIC_ID is not zero.
#
# Fix: initialize 'phys_cpu_present_map' correctly.
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
--- a/arch/i386/kernel/apic.c Wed Feb 26 16:00:18 2003
+++ b/arch/i386/kernel/apic.c Wed Feb 26 16:00:18 2003
@@ -665,7 +665,6 @@
}
set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- boot_cpu_physical_apicid = 0;
if (nmi_watchdog != NMI_NONE)
nmi_watchdog = NMI_LOCAL_APIC;
@@ -1154,8 +1153,7 @@
connect_bsp_APIC();
- phys_cpu_present_map = 1;
- apic_write_around(APIC_ID, boot_cpu_physical_apicid);
+ phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
apic_pm_init2();
On Wednesday 26 February 2003 08:52 am, Martin J. Bligh wrote:
[ Snip! ]
> >
> > Anyway, the above is clearly not what we're doing with the ESR right now.
> >
> > Martin: in the esr disable case you clearly write the ESR multiple times
> > ("over the head with a big hammer"), and you must do that because you
> > noticed that a single write was insufficient. Why four? Did you just
> > decide that as long as you're doing multiple writes, you might as well
> > just do "several". Or did four writes work and two didn't?
>
> The latter, IIRC, 2 writes worked most of the time, but never really fixed
> it. Using any kind of logical analysis never seemed to work on that chip
> ... brute force, trial and error, and 3 months of tearing my hair out was
> the only thing that succeeded in the end. A time I have no wish to revisit
> ;-)
>
> cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
> some pointers to hair-restorer during the Linux timeframe.
>
> M.
> -
You want _that_ story, eh? 8^)
* * * * *
Yeah we had ESR problems on the original NUMA-Q boxes with P6 CPUs. On system
shutdown, CPU 0 on one or more secondary nodes would occasionally spasm with
an infinite stream of APIC error interrupts claiming invalid message. A
couple hardware guys and I spent a lot of time looking at the APIC bus with
special APIC bus analyzers, etc. We _never_ caught a malformed message on
the APIC bus.
Once a CPU started weirding out like this, it was impossible to make it shut
up. We could clear the error status, and it would show cleared in the ESR,
but the local APIC would reissue the same error interrupt as soon as we
returned from the error handler.
In fact, with kernel printf turned off we would get about a million of them
per second, faster than most APIC messages could be sent over the APIC bus.
(This was a 16.6667 MHz two bit wide bus. Messages were about 10 to 40
frames long.)
Thus, I concluded that it was some weird error state in the local APIC. We
never got any answer back from Intel on how to clear this state, let alone
admission that it existed, so we just turned off the APIC error IRQ. Since
we were shutting down the system anyway, this seemed an adequate kludge.
Writing 0 to the ESR four times was done out of paranoia, and a desire to
grind the clear deeper into the local APIC's state machine. I have no
evidence that it ever really fixed this bug. Nothing did.
Maybe this weirdness was fixed in P2s or later CPUs. Maybe. Intel never did
say anything about it to us. Regardless, the four writes to ESR is still
enshrined in Dynix/PTX's APIC error handler, and will remain a hidden
testimony to this bug for as long as IBM maintains PTX support.
--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com
>
> connect_bsp_APIC();
>
> - phys_cpu_present_map = 1;
> - apic_write_around(APIC_ID, boot_cpu_physical_apicid);
> + phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
>
> apic_pm_init2();
If I'm reading this correctly, this is called out of APIC_init_uniprocessor
from smp_init ... isn't that after people have finished using
phys_cpu_present_map (eg setup_ioapic_ids_from_mpc)?
maybe change this bit in trap_init:
@@ -665,7 +665,6 @@
}
set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- boot_cpu_physical_apicid = 0;
if (nmi_watchdog != NMI_NONE)
nmi_watchdog = NMI_LOCAL_APIC;
to do:
boot_cpu_physical_apicid = hard_smp_processor_id();
phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
or something like that? On the other hand, I can't see how this works right
now (maybe it doesn't), so the above may be utterly wrong.
M.
On Wed, 26 Feb 2003, Martin J. Bligh wrote:
> >
> > connect_bsp_APIC();
> >
> > - phys_cpu_present_map = 1;
> > - apic_write_around(APIC_ID, boot_cpu_physical_apicid);
> > + phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
> >
> > apic_pm_init2();
>
> If I'm reading this correctly, this is called out of APIC_init_uniprocessor
> from smp_init ... isn't that after people have finished using
> phys_cpu_present_map (eg setup_ioapic_ids_from_mpc)?
No, look further down in apic.c: APIC_init_uniprocessor() calls
setup_IO_APIC() which then calls setup_ioapic_ids_from_mpc(). This happens
after the code quoted above runs.
Additionally, in 2.4.x setup_local_APIC() also uses
boot_cpu_physical_apicid, although only as a sanity check. That code
is gone from (or never made it into) 2.5.
> maybe change this bit in trap_init:
>
> @@ -665,7 +665,6 @@
> }
> set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
> mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
> - boot_cpu_physical_apicid = 0;
> if (nmi_watchdog != NMI_NONE)
> nmi_watchdog = NMI_LOCAL_APIC;
>
> to do:
>
> boot_cpu_physical_apicid = hard_smp_processor_id();
> phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
But is it necessarily true that hard_smp_processor_id() equals the APIC
id?
Anyway, after applying Linus' patch, the first thing that touches
boot_cpu_physical_apicid becomes init_apic_mappings() which does:
if (boot_cpu_physical_apicid == -1U)
boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
and that's just about as good as it gets, certainly the CPU knows best
what its APIC id is.
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
On Wed, 26 Feb 2003, Linus Torvalds wrote:
> Ok, can people agree on this simplified version of Mikaels patch, which
> doesn't BUG_ON(), and doesn't reset 'boot_cpu_physical_apicid'
> unnecessarily..
>
> Does this work for people?
Works for me, but I'm curious if it works for Rusty. We're still not sure
if he's hitting the same bug... but I guess it's rather early morning in
.au. :-)
Oh, and 2.4 needs the same fix -- and if Mikael's original BUG_ON() is
undesirable then we should probably also remove this code from 2.4's
apic.c:setup_local_APIC()
if (!clustered_apic_mode &&
!test_bit(GET_APIC_ID(apic_read(APIC_ID)), &phys_cpu_present_map))
BUG();
because it's essentially the same test as the BUG_ON(), at least for the
UP case.
Thanks,
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
> No, look further down in apic.c: APIC_init_uniprocessor() calls
> setup_IO_APIC() which then calls setup_ioapic_ids_from_mpc(). This
> happens after the code quoted above runs.
OK, sounds good - I might be twisting it with the SMP code, which I'm much
more familiar with.
>> boot_cpu_physical_apicid = hard_smp_processor_id();
>> phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
>
> But is it necessarily true that hard_smp_processor_id() equals the APIC
> id?
Well it should be. Except that someone did this for some odd reason
#define hard_smp_processor_id() 0
on non-SMP rather than non-local-apic.
But we could just substitute the real code:
static __inline int hard_smp_processor_id(void)
{
/* we don't want to mark this access volatile - bad code generation
*/
return GET_APIC_ID(*(unsigned long *)(APIC_BASE+APIC_ID));
}
which is in smp.h, but wrapped in #ifdef CONFIG_X86_LOCAL_APIC.
All very twisted. Probably not worth it.
> and that's just about as good as it gets, certainly the CPU knows best
> what its APIC id is.
Absolutely ;-)
M.
> Works for me, but I'm curious if it works for Rusty. We're still not sure
> if he's hitting the same bug... but I guess it's rather early morning in
> .au. :-)
>
> Oh, and 2.4 needs the same fix -- and if Mikael's original BUG_ON() is
> undesirable then we should probably also remove this code from 2.4's
> apic.c:setup_local_APIC()
>
> if (!clustered_apic_mode &&
> !test_bit(GET_APIC_ID(apic_read(APIC_ID)),
> &phys_cpu_present_map)) BUG();
>
> because it's essentially the same test as the BUG_ON(), at least for the
> UP case.
Also from smpboot.c / mpparse.c. Untested patch below (sorry, don't have a
box to hand that uses phys apicids to boot).
diff -urpN -X /home/fletch/.diff.exclude virgin/arch/i386/kernel/mpparse.c
nonzero_apicid2/arch/i386/kernel/mpparse.c
--- virgin/arch/i386/kernel/mpparse.c Tue Feb 25 23:03:43 2003
+++ nonzero_apicid2/arch/i386/kernel/mpparse.c Wed Feb 26 17:39:39 2003
@@ -162,7 +162,8 @@ void __init MP_processor_info (struct mp
if (m->mpc_cpuflag & CPU_BOOTPROCESSOR) {
Dprintk(" Bootup CPU\n");
- boot_cpu_physical_apicid = m->mpc_apicid;
+ if (m->mpc_apicid != hard_smp_processor_id())
+ printk "Warning: MP table lies about boot cpu\n";
boot_cpu_logical_apicid = apicid;
}
diff -urpN -X /home/fletch/.diff.exclude virgin/arch/i386/kernel/smpboot.c
nonzero_apicid2/arch/i386/kernel/smpboot.c
--- virgin/arch/i386/kernel/smpboot.c Sat Feb 15 16:11:40 2003
+++ nonzero_apicid2/arch/i386/kernel/smpboot.c Wed Feb 26 17:37:12 2003
@@ -951,6 +951,7 @@ static void __init smp_boot_cpus(unsigne
print_cpu_info(&cpu_data[0]);
boot_cpu_logical_apicid = logical_smp_processor_id();
+ boot_cpu_physical_apicid = hard_smp_processor_id();
current_thread_info()->cpu = 0;
smp_tune_scheduling();
@@ -1008,9 +1009,6 @@ static void __init smp_boot_cpus(unsigne
connect_bsp_APIC();
setup_local_APIC();
map_cpu_to_logical_apicid();
-
- if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_physical_apicid)
- BUG();
setup_portio_remap();
> In message <9530000.1046238665@[10.10.2.4]> I wrote:
> Yes. Hmm. Wonder if that helps my SMP wierness, too.
Didn't get that far. Booted, then, froze later in UP with esr_disable
#defined to 1, too.
I'm stumped. 2.4 works fine, 2.5 has trouble lasting 10 minutes.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <[email protected]> you wri
te:
>
> Ok, can people agree on this simplified version of Mikaels patch, which
> doesn't BUG_ON(), and doesn't reset 'boot_cpu_physical_apicid'
> unnecessarily..
>
> Does this work for people?
Yes!
Wow, now I feel sorry about breaking module.c on you.
Thanks!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Martin J. Bligh writes:
> maybe change this bit in trap_init:
>
> @@ -665,7 +665,6 @@
> }
> set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
> mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
> - boot_cpu_physical_apicid = 0;
> if (nmi_watchdog != NMI_NONE)
> nmi_watchdog = NMI_LOCAL_APIC;
>
> to do:
>
> boot_cpu_physical_apicid = hard_smp_processor_id();
> phys_cpu_present_map = 1 << boot_cpu_physical_apicid;
I assume you meant detect_init_APIC().
No, you can't do any apic reads (which is what hard_smp_processor_id() does)
at this point since the local APIC isn't mapped yet. That happens shortly
after this, in init_apic_mappings()' set_fixmap_nocache() call.
/Mikael
On Wed, 26 Feb 2003, Linus Torvalds wrote:
> On Wed, 26 Feb 2003, Martin J. Bligh wrote:
> >
> > Now in the case Rusty has, would be nice to find why it's changed, this was
> > just a workaround. On the NUMA-Qs, this always happened, so it's not so
> > interesting ;-)
>
> Well, after having re-read the ESR description in the Intel manuals, and
> looking at our code, I think the code has always been crap.
>
> The manual clearly states that you clear the ESR by doing back-to-back
> writes, and that you read it by writing to it and then reading it. That's
> not at _all_ what we do at bootup. It's _also_ not what we do at
> "clear_local_apic()".
That is true for the P6+ APICs. For the Pentium one, the semantics is a
bit different. And there is an erratum causing the loss of the ESR
contents upon a write on early revisions.
> >From the description in the Intel docs, it really looks like the ESR works
> as follows:
> - reading ESR reads the "cached" register (which is nothing but a latch,
> and has no real "meaning" except as exactly that)
> - writing to ESR moves the current real hardware register into the
> latched one and it resets the real hardware one.
>
> In other words, you have three cases:
>
> - "read previous state" (which is only really useful for software that
> doesn't want to carry the state with it, ie things like status
> printing stuff in the error handler):
>
> value = apic_read(APIC_ESR)
>
> This has _nothing_ to do with whatever the current ESR state is, and
> only reads the value as it was at the previous latch event.
For Pentium it reads and clears the current value of the real ESR.
> - latch current state and read it:
>
> apic_write(0, APIC_ESR); // I doubt the value matters
> value = apic_read(APIC_ESR);
>
> This reads the real "current state", leaving it in the latch.
For Pentium a write is meant to be a noop (and mentioned in the manual to
be executed anyway for future compatibility), but due to the mentioned
erratum, it clears the ESR. Thus the read always yields zero. That's why
there is that "if (maxlvt > 3)" statement in the code -- which essentially
means "if (CPU > Pentium)".
> - clear and read current state:
>
> apic_write(0, APIC_ESR);
> value = apic_read(APIC_ESR);
> apic_write(0, APIC_ESR);
>
> (You probably might as well do this sequence to clear the thing even if
> you don't actually care what the old value was, if only to make sure
> there are no write-buffer bugs).
For Pentium you always read a zero as above.
Of course all three sequences do clear ESR for Pentia properly.
> Also, I would _assume_ that the error interrupt is active based on the
> bit-wise "or" of both the latched and the real value, since the docs
> clearly say that it must be cleared by sw by back-to-back writes
I believe only the real value matters.
> (rationale: if the error interrupt is only triggered by the latch it is
> obviously useless, since we wouldn't get an interrupt for new events. If
> the error interrupt is only triggered by the real value, we'd only need
> a single write to clear it).
But it only refers to actions required upon a system initialization, when
the error interrupt is enabled in the LVT.
> Anyway, the above is clearly not what we're doing with the ESR right now.
What we are doing is based on the Pentium specification. Certainly
clear_local_APIC(), setup_local_APIC() and smp_error_interrupt() do their
things right for Pentia (the two latter read ESR twice merely for
debugging). The code probably wasn't updated for P6 processors, because
the Pentium specification supposed to be upward compatible. Basically
they stated to properly handle an error interrupt you need to execute:
1. Do a write to the ESR. The value doesn't matter. For Pentium it is a
noop, but future APICs will require it.
2. Read from the ESR. This will provide the current state of error bits
and clear them at the same time.
Then the mentioned erratum (3AP in the Pentium spec update) is worked
around by skipping step #1 entirely for Pentium APICs as that's the
easiest way. Why a second write would be needed after #2 for P6 isn't
clear to me and the doc part that you refer to, isn't written very well,
either. I used to assume that another write to ESR would simply trigger a
next update of the software-visible latch and it would only be zero if no
other error happened meanwhile.
I don't have the manual handy, but I can dig it from my archive if anyone
is interested.
And finally, such docs need to be taken with a grain of salt and should
always be verified with real hardware if possible. Unfortunately it's
hard to debug ESR problems without special equipment.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +
Linus,
Your interpretation is correct. The algorithm in the documentation (PRM
vol3) is defined to make it work for both Pentium and P6 and above
family of processors.
As Maciej mentioned, on Pentium any read of ESR will clear the ESR and
writes to ESR have no effect except the errata #3AP that was fixed in
C-stepping Pentium processors.
On P6 and above processors a write is needed to make the current error
bits to be visible in the ESR (the latch as you mentioned). Also this
write will make the current error bits 0 (clear). So, this will provide
the current status of the errors:
> > - latch current state and read it:
> >
> > apic_write(0, APIC_ESR); // I doubt the value matters
> > value = apic_read(APIC_ESR);
> >
> > This reads the real "current state", leaving it in the latch.
To clear the ESR (the latch) you need to do a back-to-back write as the
first write will clear the current error bits and the 2nd write will
move the cleared bits (form previous write) to readable ESR. So, your
algorithm should work:
> > - clear and read current state:
> >
> > apic_write(0, APIC_ESR);
> > value = apic_read(APIC_ESR);
<<<====== another error can
occur
> > apic_write(0, APIC_ESR);
> >
However, there is a window where another error could be generated after
the first write and read. In this case, a read after the last write will
see a non-zero value and the ESR will not be cleared. You can handle
this by doing write and read in a loop until the ESR read value becomes
0.
> > Also, I would _assume_ that the error interrupt is active based on
the
> > bit-wise "or" of both the latched and the real value, since the docs
> > clearly say that it must be cleared by sw by back-to-back writes
>
> I believe only the real value matters.
It is the real value. Error interrupt is generated when any bit is set
in the real value (error bits) and does not use visible ESR. However,
the ESR (latch) bits are cumulative and if the ESR is not cleared (using
2 writes) when we handle the interrupt the read of ESR status will also
contain the errors for the previous error. So, the interrupt handler
also should use the clear and read current state as you mentioned.
I do not know the problem that James mentioned but it will be good to
know the kind of errors that were causing the problem. Also, if someone
can provide a test case we will be glad to look at this.
Thanks,
Asit
I want to correct the cumulative part:
>
> It is the real value. Error interrupt is generated when any
> bit is set in the real value (error bits) and does not use
> visible ESR. However, the ESR (latch) bits are cumulative and
> if the ESR is not cleared (using 2 writes) when we handle the
> interrupt the read of ESR status will also contain the errors
> for the previous error. So, the interrupt handler also should
> use the clear and read current state as you mentioned.
The error interrupt is generated based on the real error bits and
readable ESR bits does not affect the interrupt generation (did verify
with the architects). We need the back-to-back write only to make the
readable ESR to get 0 on a read. So, the interrupt handler should be
able to use the write to ESR and read of ESR to get the current error
status.
Thanks,
Asit