2001-10-12 09:28:08

by Sean Cavanaugh

Subject: P4 SMP load balancing

I posted this a while back on linux-smp (which seems to be a dead list?).

I have several P4 Xeon SMP systems (Supermicro P4DCE, Intel i860
chipset), and CPU0 appears to handle all of the device interrupts:

ovendev:~# cat /proc/interrupts
CPU0 CPU1
0: 6348212 0 IO-APIC-edge timer
1: 2 0 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
8: 1 0 IO-APIC-edge rtc
9: 0 0 IO-APIC-edge acpi
16: 92620 0 IO-APIC-level eth0
18: 5085 0 IO-APIC-level aic7xxx, aic7xxx
NMI: 0 0
LOC: 6348388 6348427
ERR: 0
MIS: 0


How much of a problem is this really? The programs I am
running on these systems (I have 9 of them) seem to do OK right now.
Currently the jobs running on them are heavily CPU bound and don't do
any I/O, but this is going to change when I link them up over a private
network so they can work together on some distributable jobs. I am
running 2.4.10 on most of them, and 2.4.10-ac10 on my developer system
in the farm. The only difference the newer kernel seems to have made
is that there is only one 'warning unexpected IO-APIC' message at
startup instead of two.
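
To put a number on the imbalance, here is a minimal sketch that turns /proc/interrupts output into per-CPU shares. The function names (parse_interrupts, cpu_share) are made up for illustration, and the sample text is simply the listing above pasted in; on a live box the same text would come from reading /proc/interrupts.

```python
def parse_interrupts(text):
    """Return (cpu_names, {label: [count per CPU]}) from /proc/interrupts text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()
    counts = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        per_cpu = []
        for field in rest.split()[:len(cpus)]:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        # Rows such as ERR: and MIS: carry one global count, not one per CPU.
        if len(per_cpu) == len(cpus):
            counts[label.strip()] = per_cpu
    return cpus, counts


def cpu_share(per_cpu):
    """Fraction of an interrupt source handled by each CPU."""
    total = sum(per_cpu)
    return [c / total if total else 0.0 for c in per_cpu]


SAMPLE = """\
CPU0 CPU1
0: 6348212 0 IO-APIC-edge timer
1: 2 0 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
8: 1 0 IO-APIC-edge rtc
9: 0 0 IO-APIC-edge acpi
16: 92620 0 IO-APIC-level eth0
18: 5085 0 IO-APIC-level aic7xxx, aic7xxx
NMI: 0 0
LOC: 6348388 6348427
ERR: 0
MIS: 0
"""

cpus, counts = parse_interrupts(SAMPLE)
for label, per_cpu in counts.items():
    if sum(per_cpu):
        shares = ", ".join(
            f"{name} {share:.1%}" for name, share in zip(cpus, cpu_share(per_cpu))
        )
        print(f"{label:>4}: {shares}")
```

Only the LOC (local APIC timer) row is split between the two CPUs; every IO-APIC routed source shows up entirely under CPU0.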


Snippet from dmesg:

CPU1: Intel(R) Xeon(TM) CPU 1700MHz stepping 0a
Total of 2 processors activated (6723.99 BogoMIPS).
ENABLING IO-APIC IRQs
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-5, 2-10, 2-11, 2-12, 2-17, 2-20, 2-21, 2-22
not connected.
..TIMER: vector=0x31 pin1=2 pin2=0
number of MP IRQ sources: 18.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 00178020
....... : max redirection entries: 0017
....... : PRQ implemented: 1
....... : IO APIC version: 0020
WARNING: unexpected IO-APIC, please mail
to [email protected]
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
<snip>



- Sean


2001-10-12 18:05:32

by Martin J. Bligh

Subject: Re: P4 SMP load balancing

> ovendev:~# cat /proc/interrupts
> CPU0 CPU1
> 0: 6348212 0 IO-APIC-edge timer
> 1: 2 0 IO-APIC-edge keyboard
> 2: 0 0 XT-PIC cascade
> 8: 1 0 IO-APIC-edge rtc
> 9: 0 0 IO-APIC-edge acpi
> 16: 92620 0 IO-APIC-level eth0
> 18: 5085 0 IO-APIC-level aic7xxx, aic7xxx
> NMI: 0 0
> LOC: 6348388 6348427
> ERR: 0
> MIS: 0

I don't think this should happen. In the event of both processors having
equal priority (Linux never changes them, so they always do), we should fall
back to the arbitration priority of the local APIC. Whether you have 1 or 2
I/O APICs working shouldn't make a difference.

The arbitration priority of the local APIC should change whenever a message
is sent (see the Intel docs on developer.intel.com), so we effectively
get round robin. For instance, a 4-way box looks like this:

CPU0 CPU1 CPU2 CPU3
0: 1608606 1595657 2168078 1575546 IO-APIC-edge timer
1: 0 0 0 2 IO-APIC-edge keyboard
2: 0 0 0 0 XT-PIC cascade
4: 76 52 62 48 IO-APIC-edge serial
23: 7983 8263 8286 8306 IO-APIC-level qlogicisp
39: 0 0 0 0 IO-APIC-level eth1
40: 6247 6216 6894 6325 IO-APIC-level eth0
NMI: 0 0 0 0
LOC: 6947876 6947859 6947873 6947874
ERR: 0
MIS: 0

Which isn't perfectly balanced, but it looks a damned sight better than
yours does ;-) Do you have something in the log that looks like this?

Oct 11 15:35:04 elm3b76 kernel: IO APIC #13......
Oct 11 15:35:04 elm3b76 kernel: .... register #00: 0D000000
Oct 11 15:35:04 elm3b76 kernel: ....... : physical APIC id: 0D
Oct 11 15:35:04 elm3b76 kernel: .... register #01: 00170011
Oct 11 15:35:04 elm3b76 kernel: ....... : max redirection entries: 0017
Oct 11 15:35:04 elm3b76 kernel: ....... : IO APIC version: 0011
Oct 11 15:35:04 elm3b76 kernel: .... register #02: 00000000
Oct 11 15:35:04 elm3b76 kernel: ....... : arbitration: 00
Oct 11 15:35:04 elm3b76 kernel: .... IRQ redirection table:
Oct 11 15:35:04 elm3b76 kernel: NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
Oct 11 15:35:04 elm3b76 kernel: 00 000 00 1 0 0 0 0 0 0 00
Oct 11 15:35:05 elm3b76 kernel: 01 00F 0F 0 0 0 0 0 0 1 39
Oct 11 15:35:05 elm3b76 kernel: 02 00F 0F 0 0 0 0 0 0 1 31
Oct 11 15:35:05 elm3b76 kernel: 03 00F 0F 0 0 0 0 0 0 1 41
Oct 11 15:35:05 elm3b76 kernel: 04 00F 0F 0 0 0 0 0 0 1 49
Oct 11 15:35:05 elm3b76 kernel: 05 00F 0F 0 0 0 0 0 0 1 51
Oct 11 15:35:05 elm3b76 kernel: 06 00F 0F 0 0 0 0 0 0 1 59
Oct 11 15:35:05 elm3b76 kernel: 07 00F 0F 1 1 0 1 0 0 1 61
Oct 11 15:35:05 elm3b76 kernel: 08 00F 0F 1 1 0 0 0 0 1 69
Oct 11 15:35:05 elm3b76 kernel: 09 00F 0F 0 0 0 0 0 0 1 71
Oct 11 15:35:05 elm3b76 kernel: 0a 00F 0F 0 0 0 0 0 0 1 79
Oct 11 15:35:05 elm3b76 kernel: 0b 00F 0F 1 1 0 1 0 0 1 81
Oct 11 15:35:05 elm3b76 kernel: 0c 00F 0F 0 0 0 0 0 0 1 89
Oct 11 15:35:05 elm3b76 kernel: 0d 00F 0F 1 1 0 1 0 0 1 91
Oct 11 15:35:05 elm3b76 kernel: 0e 00F 0F 0 0 0 0 0 0 1 99
Oct 11 15:35:05 elm3b76 kernel: 0f 00F 0F 1 1 0 1 0 0 1 A1
Oct 11 15:35:05 elm3b76 kernel: 10 00F 0F 1 1 0 1 0 0 1 A9
Oct 11 15:35:05 elm3b76 kernel: 11 00F 0F 1 1 0 1 0 0 1 B1
Oct 11 15:35:05 elm3b76 kernel: 12 00F 0F 1 1 0 1 0 0 1 B9
Oct 11 15:35:05 elm3b76 kernel: 13 00F 0F 1 1 0 1 0 0 1 C1
Oct 11 15:35:05 elm3b76 kernel: 14 00F 0F 1 1 0 1 0 0 1 C9
Oct 11 15:35:05 elm3b76 kernel: 15 00F 0F 1 1 0 1 0 0 1 D1
Oct 11 15:35:05 elm3b76 kernel: 16 00F 0F 1 1 0 1 0 0 1 D9
Oct 11 15:35:05 elm3b76 kernel: 17 00F 0F 1 1 0 1 0 0 1 E1
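
For comparing dumps, register #01 can be decoded by hand. The sketch below assumes the usual I/O APIC version register layout (version in bits 0-7, PRQ flag in bit 15, maximum redirection entry in bits 16-23), which matches what the kernel prints in both dumps; the function name is made up for illustration. The two raw values are the 00170011 from the dump above and the 00178020 from the original post.

```python
def decode_ioapic_reg01(value):
    """Split the raw I/O APIC register #01 value into its printed fields."""
    return {
        "version": value & 0xFF,            # bits 0-7
        "prq": (value >> 15) & 0x1,         # bit 15
        "max_redir": (value >> 16) & 0xFF,  # bits 16-23, index of the last entry
    }


for name, raw in (("4-way box", 0x00170011), ("P4 Xeon box", 0x00178020)):
    f = decode_ioapic_reg01(raw)
    print(
        f"{name}: version 0x{f['version']:02X}, PRQ {f['prq']}, "
        f"{f['max_redir'] + 1} redirection entries"
    )
```

Both parts have 24 redirection entries; they differ in the version field (0x11 against 0x20) and in the PRQ bit.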

You might have to tweak syslog.conf to log the debug messages, and
possibly increase LOG_BUF_LEN in kernel/printk.c to something sensible
(65536?)

M.


2001-10-12 18:38:59

by Manfred Spraul

Subject: Re: P4 SMP load balancing



> > ovendev:~# cat /proc/interrupts
> > CPU0 CPU1
> > 0: 6348212 0 IO-APIC-edge timer
> > 1: 2 0 IO-APIC-edge keyboard
> > 2: 0 0 XT-PIC cascade
> > 8: 1 0 IO-APIC-edge rtc
> > 9: 0 0 IO-APIC-edge acpi
> > 16: 92620 0 IO-APIC-level eth0
> > 18: 5085 0 IO-APIC-level aic7xxx, aic7xxx
> > NMI: 0 0
> > LOC: 6348388 6348427
> > ERR: 0
> > MIS: 0
>
> I don't think this should happen. In the event of both procs having equal
> priority (linux never changes them, so they always do), we should fall back
> to the arbitration priority of the lapic. Whether you have 1 or 2 I/O apics
> working shouldn't make a difference.

The P4 has a new APIC, and lowest-priority delivery doesn't work
anymore.

<<<<<<< Chapter 7.6.10 of 24547202.pdf
In operating systems that use the lowest priority interrupt delivery mode
but do not update the TPR, the TPR information saved in the chipset will
potentially cause the interrupt to be always delivered to the same
processor from the logical set. This behavior is functionally backward
compatible with the P6 family processor but may result in unexpected
performance implications.
<<<<<<< (search for 245472 on google for the pdf file)
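
A toy model of what that paragraph implies, not a description of the real hardware. Assumptions: ties on the P6 are broken by a rotating arbitration order, ties on the P4 always resolve to the lowest-numbered CPU, and the "OS updates the TPR" policy (bump the priority of whichever CPU just took an interrupt) is purely hypothetical. All function names are invented for the sketch.

```python
from collections import Counter


def deliver_p6_style(tprs, n):
    """Equal priorities: a rotating arbitration order gives round robin."""
    counts = Counter()
    ncpu = len(tprs)
    start = 0
    for _ in range(n):
        lowest = min(tprs)
        # Scan from the rotating start point for the first lowest-priority CPU.
        for offset in range(ncpu):
            cpu = (start + offset) % ncpu
            if tprs[cpu] == lowest:
                break
        counts[cpu] += 1
        start = (cpu + 1) % ncpu
    return counts


def deliver_p4_static(tprs, n):
    """Chipset uses the saved TPRs; ties always pick the same CPU (assumed: index 0)."""
    counts = Counter()
    for _ in range(n):
        counts[tprs.index(min(tprs))] += 1
    return counts


def deliver_p4_tpr_updates(ncpu, n):
    """Same chipset rule, but a hypothetical OS raises the TPR of the CPU it interrupts."""
    counts = Counter()
    tprs = [0] * ncpu
    for _ in range(n):
        cpu = tprs.index(min(tprs))
        counts[cpu] += 1
        tprs[cpu] += 1  # stand-in for "this CPU is now busier than the others"
    return counts


print("P6, static TPR:  ", dict(deliver_p6_style([0, 0], 1000)))
print("P4, static TPR:  ", dict(deliver_p4_static([0, 0], 1000)))
print("P4, TPR updated: ", dict(deliver_p4_tpr_updates(2, 1000)))
```

With the TPR never touched, the P4-style rule sends everything to one CPU, which is the pattern in the /proc/interrupts listing; any policy that makes the TPRs differ over time spreads the load again.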


--
Manfred

2001-10-12 21:41:48

by Martin J. Bligh

Subject: Re: P4 SMP load balancing

>> > ovendev:~# cat /proc/interrupts
>> > CPU0 CPU1
>> > 0: 6348212 0 IO-APIC-edge timer
>> > 1: 2 0 IO-APIC-edge keyboard
>> > 2: 0 0 XT-PIC cascade
>> > 8: 1 0 IO-APIC-edge rtc
>> > 9: 0 0 IO-APIC-edge acpi
>> > 16: 92620 0 IO-APIC-level eth0
>> > 18: 5085 0 IO-APIC-level aic7xxx, aic7xxx
>> > NMI: 0 0
>> > LOC: 6348388 6348427
>> > ERR: 0
>> > MIS: 0
>>
>> I don't think this should happen. In the event of both procs having equal
>> priority (linux never changes them, so they always do), we should fall back
>> to the arbitration priority of the lapic. Whether you have 1 or 2 I/O apics
>> working shouldn't make a difference.
>
> The P4 has a new APIC, and lowest-priority delivery doesn't work
> anymore.
>
> <<<<<<< Chapter 7.6.10 of 24547202.pdf
> In operating systems that use the lowest priority interrupt delivery mode
> but do not update the TPR, the TPR information saved in the chipset will
> potentially cause the interrupt to be always delivered to the same
> processor from the logical set. This behavior is functionally backward
> compatible with the P6 family processor but may result in unexpected
> performance implications.
> <<<<<<< (search for 245472 on google for the pdf file)

Ick. Thanks for pointing this out ... will go read the P4 docs more closely.

Someone here has patches to set the TPR properly, but they weren't
giving the performance gain we'd hoped for. In light of this, they'd
probably help out much more on the P4. I'll see if I can persuade them
to publish ...

M.

2001-10-14 09:07:24

by Sean Cavanaugh

Subject: RE: P4 SMP load balancing

> From: Martin J. Bligh [mailto:[email protected]]
> Sent: Friday, October 12, 2001 1:00 PM
> To: Sean Cavanaugh
> Cc: [email protected]
> Subject: Re: P4 SMP load balancing
>
>
> Which isn't perfectly balanced, but it looks a damned sight better
> than yours does ;-) Do you have something in the log that looks like
> this?
>
> Oct 11 15:35:04 elm3b76 kernel: IO APIC #13......
> Oct 11 15:35:04 elm3b76 kernel: .... register #00: 0D000000
> Oct 11 15:35:04 elm3b76 kernel: ....... : physical APIC id: 0D
> Oct 11 15:35:04 elm3b76 kernel: .... register #01: 00170011
> Oct 11 15:35:04 elm3b76 kernel: ....... : max redirection entries: 0017
> Oct 11 15:35:04 elm3b76 kernel: ....... : IO APIC version: 0011
> Oct 11 15:35:04 elm3b76 kernel: .... register #02: 00000000
> Oct 11 15:35:04 elm3b76 kernel: ....... : arbitration: 00
> Oct 11 15:35:04 elm3b76 kernel: .... IRQ redirection table:
> Oct 11 15:35:04 elm3b76 kernel: NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
> Oct 11 15:35:04 elm3b76 kernel: 00 000 00 1 0 0 0 0 0 0 00
> Oct 11 15:35:05 elm3b76 kernel: 01 00F 0F 0 0 0 0 0 0 1 39

<snip>



Relevant: APIC info from dmesg (with some lead-in/lead-out):


Calibrating delay loop... 3368.55 BogoMIPS
CPU: Before vendor init, caps: 3febfbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 12K, L1 D cache: 8K
CPU: L2 cache: 256K
CPU: After vendor init, caps: 3febfbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 3febfbff 00000000 00000000 00000000
CPU: Common caps: 3febfbff 00000000 00000000 00000000
CPU1: Intel(R) Xeon(TM) CPU 1700MHz stepping 0a
Total of 2 processors activated (6723.99 BogoMIPS).
ENABLING IO-APIC IRQs
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-5, 2-10, 2-11, 2-12, 2-17, 2-20, 2-21, 2-22
not connected.
..TIMER: vector=0x31 pin1=2 pin2=0
number of MP IRQ sources: 18.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 00178020
....... : max redirection entries: 0017
....... : PRQ implemented: 1
....... : IO APIC version: 0020
WARNING: unexpected IO-APIC, please mail
to [email protected]
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 000 00 1 0 0 0 0 0 0 00
01 003 03 0 0 0 0 0 1 1 39
02 003 03 0 0 0 0 0 1 1 31
03 003 03 0 0 0 0 0 1 1 41
04 003 03 0 0 0 0 0 1 1 49
05 000 00 1 0 0 0 0 0 0 00
06 003 03 0 0 0 0 0 1 1 51
07 003 03 0 0 0 0 0 1 1 59
08 003 03 0 0 0 0 0 1 1 61
09 003 03 0 0 0 0 0 1 1 69
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 003 03 0 0 0 0 0 1 1 71
0e 003 03 0 0 0 0 0 1 1 79
0f 003 03 0 0 0 0 0 1 1 81
10 003 03 1 1 0 1 0 1 1 89
11 000 00 1 0 0 0 0 0 0 00
12 003 03 1 1 0 1 0 1 1 91
13 003 03 1 1 0 1 0 1 1 99
14 000 00 1 0 0 0 0 0 0 00
15 000 00 1 0 0 0 0 0 0 00
16 000 00 1 0 0 0 0 0 0 00
17 003 03 1 1 0 1 0 1 1 A1
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 0:16
IRQ18 -> 0:18
IRQ19 -> 0:19
IRQ23 -> 0:23
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1685.2574 MHz.
..... host bus clock speed is 99.1326 MHz.
cpu: 0, clocks: 991326, slice: 330442
CPU0<T0:991312,T1:660864,D:6,S:330442,C:991326>
cpu: 1, clocks: 991326, slice: 330442
CPU1<T0:991312,T1:330416,D:12,S:330442,C:991326>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfb3e0, last bus=4
PCI: Using configuration type 1
PCI: Probing PCI hardware
Unknown bridge resource 0: assuming transparent
Unknown bridge resource 1: assuming transparent
Unknown bridge resource 2: assuming transparent
Unknown bridge resource 2: assuming transparent
Unknown bridge resource 2: assuming transparent
Unknown bridge resource 2: assuming transparent
PCI: Using IRQ router PIIX [8086/2440] at 00:1f.0
PCI->APIC IRQ transform: (B0,I31,P3) -> 19
PCI->APIC IRQ transform: (B0,I31,P1) -> 19
PCI->APIC IRQ transform: (B0,I31,P2) -> 23
PCI->APIC IRQ transform: (B3,I4,P0) -> 18
PCI->APIC IRQ transform: (B3,I4,P1) -> 18
PCI->APIC IRQ transform: (B4,I4,P0) -> 16
PCI->APIC IRQ transform: (B4,I7,P0) -> 16
isapnp: Scanning for PnP cards...
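
For reading the redirection table above, here is a rough sketch that decodes one row. It assumes the columns mean what their headers suggest (Dest = destination mode, 1 for logical; Deli = delivery mode, 1 for lowest priority; Trig = 1 for level) and that the logical destination is a flat bitmask with bit n standing for CPU n. The function names are made up for illustration.

```python
FIELDS = ("nr", "log", "phy", "mask", "trig", "irr", "pol", "stat", "dest", "deli", "vect")


def parse_entry(line):
    """Decode one 'NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect' row."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(parts)}")
    raw = dict(zip(FIELDS, parts))
    return {
        "pin": int(raw["nr"], 16),
        "logical_dest": int(raw["log"], 16),
        "masked": raw["mask"] == "1",
        "level_triggered": raw["trig"] == "1",
        "logical_mode": raw["dest"] == "1",      # assumed: 1 = logical addressing
        "lowest_priority": raw["deli"] == "1",   # assumed: 1 = lowest priority
        "vector": int(raw["vect"], 16),
    }


def target_cpus(entry):
    """CPUs named by a flat logical destination mask (assumed: bit n = CPU n)."""
    return [bit for bit in range(8) if (entry["logical_dest"] >> bit) & 1]


row = parse_entry("10 003 03 1 1 0 1 0 1 1 89")  # pin 0x10, i.e. IRQ16 (eth0)
print(row)
print("eligible CPUs:", target_cpus(row))
```

Read that way, the connected pins are programmed for lowest-priority delivery to a logical set containing both CPUs, so the table itself does not pin anything to CPU0.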



- Sean