by James Bottomley

Subject: Re: [PATCH 03/15] x86: remove early_gdt_descr reference

On Mon, 2008-06-09 at 14:23 -0300, Glauber Costa wrote:
> James Bottomley wrote:
> > On Mon, 2008-06-09 at 12:49 -0300, Glauber Costa wrote:
> >> James Bottomley wrote:
> >>> On Mon, 2008-06-09 at 11:16 -0300, Glauber Costa wrote:
> >>>> since we use switch_to_new_gdt, there is no point
> >>>> in assigning early_gdt_descr except for the first
> >>>> assignment, which is done manually.
> >>> What makes you think you can do this? If you don't update the early
> >>> boot gdt, they all end up using the Boot CPU one. The problem with this
> >>> is that there's a time from start_secondary to switch_to_new_gdt where
> >>> the per cpu selector (%fs) and the pda selector (%gs) are those of the
> >>> boot CPU. The former isn't a problem but the CPU number is in the
> >>> latter, and it's used in that path before we get to the initialisation.
> >> You are right, I missed it.
> >>
> >> However, it only seems to be used in cpu_init, and very early. Sure there
> >> are some users _before_ we load the new gdt, but nothing prevents them
> >> from being moved after it. (Of course, this patch is wrong anyway).
> >>
> >> And if we do that, we can even take the %fs loading out of head_32.S
> >> Of course, it's only valid if those are indeed the only early users of it.
> >>
> >> Is there any other use I'm missing?
> >
> > Well, %fs loading there is done for the boot CPU. To eliminate that you
> > have to not only verify that start_secondary doesn't use anything in
> > per_cpu areas, but also verify that nothing in start_kernel() up until
> > boot_cpu_init() does ... That's a lot of smp_processor_id() references
> > to convert.
> Yes, after a second look, it would be tricky indeed. But only for cpu0.
> For all the others, I still think we could get rid of the problem by
> switching to the new gdt earlier in cpu_init.
>
> What do you think?

Operating a CPU with a bogus GDT is very fragile. You can fix all the
current issues with the secondary CPUs, but it gives a critical section
within which none of the per_cpu operations will work. It only takes
one patch violating this rule and we have a very subtle bug introduced.

It looks to me like the better fix might be just to initialise the gdt
completely and properly in do_boot_cpu and just have the single switch
in head_32.S be the correct one. That way there's no problematic critical
region.
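
To make that window concrete, here is a toy Python model (not kernel code) of
the ordering problem. The Gdt class, the pda_cpu field and the
cpu_seen_before_switch function are hypothetical names invented for this
sketch; the only assumption is that whichever GDT the early descriptor points
at decides which CPU number a secondary reads before switch_to_new_gdt runs.

```python
from dataclasses import dataclass


@dataclass
class Gdt:
    """Toy stand-in for a per-CPU GDT; only the PDA's CPU number is modelled."""
    pda_cpu: int


def cpu_seen_before_switch(cpu, gdts, early_descr_updated):
    """Return the CPU number a secondary would read through the PDA
    between start_secondary and switch_to_new_gdt in this model."""
    # By default the early descriptor still points at the boot CPU's GDT.
    early_gdt = gdts[0]
    if early_descr_updated:
        # Modelled do_boot_cpu: point the early descriptor at this CPU's
        # own, fully initialised GDT before the CPU is started.
        early_gdt = gdts[cpu]
    return early_gdt.pda_cpu


gdts = [Gdt(pda_cpu=n) for n in range(4)]
# Stale early descriptor: CPU 2 reads the boot CPU's number.
print(cpu_seen_before_switch(2, gdts, early_descr_updated=False))
# Descriptor set up in advance: CPU 2 reads its own number.
print(cpu_seen_before_switch(2, gdts, early_descr_updated=True))
```

With the descriptor set up in advance, the model has no interval in which the
secondary sees another CPU's number, which is the point of doing a single,
correct switch.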

James

On Tue, Jun 10, 2008 at 5:29 PM, Maciej W. Rozycki <[email protected]> wrote:
> On Tue, 10 Jun 2008, Yinghai Lu wrote:
>
>> ExtINT is routed to ioapic pin0, but the dst is set to 0,
>> and the system has multiple sockets with quad-core CPUs, so the APIC ID
>> of the boot CPU is set to 4 instead of 0
>
> Thanks for the info. Let me understand the situation better: local APIC
> IDs are preassigned by the firmware based on their "socket address" and
> the socket where the lowest numbered quad would be is empty.
> Nevertheless the firmware sets the destination ID of the ExtINTA interrupt
> in the I/O APIC to 0 rather than the ID of the bootstrap CPU. Is that
> correct?

Yes.

After I asked the BIOS engineer to change the destination APIC ID to 4, the
error is gone.

before clear_IO_APIC()
number of MP IRQ sources: 15.
number of IO-APIC #0 registers: 24.
number of IO-APIC #1 registers: 7.
number of IO-APIC #2 registers: 7.
number of IO-APIC #3 registers: 24.
testing the IO APIC.......................

IO APIC #0......
.... register #00: 00000000
....... : physical APIC id: 00
.... register #01: 00170011
....... : max redirection entries: 0017
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 004 0 0 0 0 0 0 7 00
01 000 1 0 0 0 0 0 0 00
02 000 1 0 0 0 0 0 0 00
03 000 1 0 0 0 0 0 0 00
04 000 1 0 0 0 0 0 0 00
05 000 1 0 0 0 0 0 0 00
06 000 1 0 0 0 0 0 0 00
07 000 1 0 0 0 0 0 0 00
08 000 1 0 0 0 0 0 0 00
09 000 1 0 0 0 0 0 0 00
0a 000 1 0 0 0 0 0 0 00
0b 000 1 0 0 0 0 0 0 00
0c 000 1 0 0 0 0 0 0 00
0d 000 1 0 0 0 0 0 0 00
0e 000 1 0 0 0 0 0 0 00
0f 000 1 0 0 0 0 0 0 00
10 000 1 0 0 0 0 0 0 00
11 000 1 0 0 0 0 0 0 00
12 000 1 0 0 0 0 0 0 00
13 000 1 0 0 0 0 0 0 00
14 000 1 0 0 0 0 0 0 00
15 000 1 0 0 0 0 0 0 00
16 000 1 0 0 0 0 0 0 00
17 000 1 0 0 0 0 0 0 00
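
For readers unfamiliar with the dump format, the pin 0 row can be read as a
64-bit redirection table entry. The Python sketch below assumes the standard
82093AA-style layout (vector in bits 0-7, delivery mode in bits 8-10,
destination mode in bit 11, mask in bit 16, destination in bits 56-63). The
function and dictionary names are made up for illustration, and the raw value
is reconstructed from the row above rather than read from hardware.

```python
DELIVERY_MODES = {
    0: "Fixed",
    1: "LowestPriority",
    2: "SMI",
    4: "NMI",
    5: "INIT",
    7: "ExtINT",
}


def decode_rte(raw):
    """Split a 64-bit I/O APIC redirection table entry into its fields."""
    mode = (raw >> 8) & 0x7
    return {
        "vector": raw & 0xFF,
        "delivery_mode": DELIVERY_MODES.get(mode, "reserved(%d)" % mode),
        "dest_mode": "logical" if (raw >> 11) & 1 else "physical",
        "masked": bool((raw >> 16) & 1),
        "dest": (raw >> 56) & 0xFF,
    }


# Pin 0 as dumped above: Dst 004, Mask 0, Dmod 0, Deli 7, Vect 00.
pin0 = (0x04 << 56) | (7 << 8)
entry = decode_rte(pin0)
print(entry)
```

Decoded this way, pin 0 is an unmasked ExtINT entry in physical mode whose
destination is APIC ID 4, matching the boot CPU after the BIOS change.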

>
> But it would mean the Virtual Wire interrupt delivery would not work, or
> is the I/O APIC setup redundant and the local APIC of the bootstrap CPU is
> set up for ExtINTA delivery as well?

It doesn't need Virtual Wire for the timer (ExtINT); the timer is the HPET
and is routed to ioapic pin2.

Just so you know, at first the BIOS engineer refused to change that to 4,
because another OS is not happy with it.

YH

2008-06-11 13:00:04

by Maciej W. Rozycki

Subject: Re: [PATCH 11/15] x86: move enabling of io_apic to prepare_cpus

On Tue, 10 Jun 2008, Yinghai Lu wrote:

> > Thanks for the info. Let me understand the situation better: local APIC
> > IDs are preassigned by the firmware based on their "socket address" and
> > the socket where the lowest numbered quad would be is empty.
> > Nevertheless the firmware sets the destination ID of the ExtINTA interrupt
> > in the I/O APIC to 0 rather than the ID of the bootstrap CPU. Is that
> > correct?
>
> Yes
>
> After I asked the BIOS engineer to change the destination APIC ID to 4, the
> error is gone.

Thanks for the clarification.

> > But it would mean the Virtual Wire interrupt delivery would not work, or
> > is the I/O APIC setup redundant and the local APIC of the bootstrap CPU is
> > set up for ExtINTA delivery as well?
>
> It doesn't need Virtual Wire for the timer (ExtINT); the timer is the HPET
> and is routed to ioapic pin2.

That's not what I asked about -- the timer does not matter here. The
Virtual Wire mode is a configuration, where one input of one APIC in the
system is set up for the ExtINTA mode and acts transparently with the
system software having no need to know about it. Instead a pair of
legacy 8259A chips is used to deliver interrupts, including claiming the
INTA cycles, providing vectors and prioritising sources, as defined in the
PC/AT architecture.

Many pieces of software rely on the 8259A PICs, either because they
predate the APIC or because they have no means to make use of
multiprocessor features anyway. They include various versions of DOS
together with software run in that environment (as DOS programs quite
frequently drive hardware at the low level), many versions of the
Microsoft Windows system as well as other software. For these a legacy
mode, either the Virtual Wire mode, or a mode where 8259A interrupts are
delivered directly to one processor's INT line has to be implemented as
mandated both by the Multiprocessor Specification and the Advanced
Configuration and Power Interface Specification.

Coming back to my question -- how is such a mode implemented in the
affected system? Clearly the route through the I/O APIC is not going to
work and moreover, the chip clutters the system with broken interrupt
messages each time the 8259A signals an interrupt.

Please note Linux can use the Virtual Wire mode in the APIC/SMP mode too,
if requested by specifying the "noapic" command-line option -- have you
verified the option works correctly with the affected system?

> Just so you know, at first the BIOS engineer refused to change that to 4,
> because another OS is not happy with it.

Well, this is just a confirmation that my attitude is correct -- such problems
should not be papered over, because vendors will deny their existence
then. At least a complaint message should be printed so that users have
an opportunity to see it and ask their hardware supplier for an
explanation.
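
A check of the kind suggested here could look roughly like the following
Python sketch. It is not the kernel's code: check_extint_dest and the wording
of its message are hypothetical, and it assumes physical destination mode,
where the destination field of the entry should equal the local APIC ID of
the bootstrap CPU.

```python
def check_extint_dest(rte_dest, rte_delivery_mode, rte_masked, bsp_apic_id):
    """Return a complaint string if an unmasked ExtINT entry targets an
    APIC ID other than the bootstrap CPU's, else None."""
    EXTINT = 7  # delivery mode encoding for ExtINT
    if rte_masked or rte_delivery_mode != EXTINT:
        return None  # not an active Virtual Wire route; nothing to check
    if rte_dest == bsp_apic_id:
        return None
    return ("BIOS bug: ExtINT routed to APIC ID %d, but boot CPU has "
            "APIC ID %d; please report to your hardware vendor"
            % (rte_dest, bsp_apic_id))


# The affected system before the BIOS fix: destination 0, boot CPU ID 4.
print(check_extint_dest(0, 7, False, bsp_apic_id=4))
# After the fix the destination matches and no complaint is produced.
print(check_extint_dest(4, 7, False, bsp_apic_id=4))
```

The sketch only reports the mismatch; whether to also work around it is a
separate decision.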

In this case, a workaround for the 64-bit mode happens to be quite cheap,
but it should be extended to cover the 32-bit mode as well. The only
solution for the 32-bit mode I have in mind would lead to a waste of
resources for many users of good hardware. And all this because of somebody's
sloppiness -- as I have written, this had better be well justified.

Unless you have precise means to identify this system -- in that case I
think reconfiguring the bootstrap processor's local APIC ID to 0 would be
the right approach. Have you tried it?

Maciej