2004-11-02 14:19:12

by Daniel Egger

Subject: 2.6.8 and 2.6.9 Dual Opteron glitches

Hija,

I've a few glitches with my brand-new dual Opteron system which I'd
like to share with you. First of all, all of these problems seem to
be present in both 2.6.8.1 and 2.6.9, but once that seemed to be the
case I moved on to 2.6.9 and didn't investigate 2.6.8.1 any further,
so some of the issues might apply only to 2.6.9.

1) 32 bit kernel HPET calibration hang: If the kernel is compiled
with HPET support, the kernel will hang on boot while
calibrating the timer. The problem goes away if HPET support is
not compiled in. I've no idea what information to provide to help
debug this.

2) 64 bit kernel vgettimeofday panic: The kernel panics in
arch/x86_64/kernel/vsyscall.c:169 on boot.

static int __init vsyscall_init(void)
{
	if ((unsigned long) &vgettimeofday !=
	    VSYSCALL_ADDR(__NR_vgettimeofday))
		panic("vgettimeofday link addr broken");

Replacing those panic() calls with printk() makes the machine boot just
fine, and it also works (seemingly) without any problems under load.

3) Interrupt distribution, 32 bit vs. 64 bit: Below is a copy of the
current interrupt distribution for the 64 bit kernel, which shows
a huge shift towards CPU1. In a 32 bit kernel the distribution is
reversed and even more pronounced: after days of operation, fewer than
100 interrupts in total have been handled by CPU1. The 64 bit kernel
has all relevant options for K8 (irq balancing, NUMA support, etc.)
enabled.

           CPU0       CPU1
  0:      15260    4196668    IO-APIC-edge  timer
  9:          0          0   IO-APIC-level  acpi
169:          0          5   IO-APIC-level  ehci_hcd
177:          0          3   IO-APIC-level  uhci_hcd, ohci1394
185:       1999     934839   IO-APIC-level  uhci_hcd, eth0
NMI:       2698       2817
LOC:    4211263    4211263
ERR:          0
MIS:          0

4) ACPI power management (32 bit and 64 bit): No matter which ACPI options
I choose in the BIOS, ACPI only partially handles the first CPU and leaves
the second CPU alone. I'd love to have some simple power management
because the system gets quite warm even when idle, and warm == loud
because the fans (which are barely noticeable when the system is cold)
kick into gear quite fast.

processor id:            0
acpi id:                 1
bus mastering control:   no
power management:        no
throttling control:      yes
limit interface:         yes
active limit:            P0:T0
user limit:              P0:T0
thermal limit:           P0:T0
active state:            C1
default state:           C1
bus master activity:     00000000
states:
   *C1:                  promotion[--] demotion[--] latency[000] usage[00000000]
    C2:                  <not supported>
    C3:                  <not supported>
state count:             8
active state:            T0
states:
   *T0:                  00%
    T1:                  12%
    T2:                  25%
    T3:                  37%
    T4:                  50%
    T5:                  62%
    T6:                  75%
    T7:                  87%

processor id:            1
acpi id:                 2
bus mastering control:   no
power management:        no
throttling control:      no
limit interface:         no
<not supported>
active state:            C1
default state:           C1
bus master activity:     00000000
states:
   *C1:                  promotion[--] demotion[--] latency[000] usage[00000000]
    C2:                  <not supported>
    C3:                  <not supported>
<not supported>

Ask me for info, you'll get it. ;)

Servus,
Daniel


2004-11-02 17:05:57

by Thomas Zehetbauer

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

Hi,

I am using a not-so-new Tyan Thunder K8W S2885 based Dual Opteron
System.

On Tue, 2004-11-02 at 14:59 +0100, Daniel Egger wrote:
> 1) 32 bit kernel HPET calibration hang: If the kernel is compiled

I can't tell, as I am using a 64-bit kernel without HPET. Can someone
tell me which applications actually use HPET so far?

> 2) 64 bit kernel vgettimeofday panic: The kernel panics in

I cannot confirm this; both 2.6.8.1 and 2.6.9 boot OK.

> 3) Interrupt distribution 32 bit vs. 64 bit. Below is a copy of the

I cannot confirm this; interrupts seem to be almost equally distributed
with a 64-bit kernel and irqbalance running. Did you notice that x86_64
does not provide in-kernel IRQ balancing?

           CPU0       CPU1
  0:    2921345    2988327    IO-APIC-edge  timer
  1:       5767       5414    IO-APIC-edge  i8042
  3:          2        147    IO-APIC-edge  serial
  4:      23806      21183    IO-APIC-edge  serial
  8:          2         37    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:      77847      72327    IO-APIC-edge  ide0
 15:      21317      29959    IO-APIC-edge  ide1
 16:     216766     217251   IO-APIC-level  EMU10K1, mga@pci:0000:05:00.0
 17:          0          0   IO-APIC-level  AMD AMD8111
 19:     182493     182216   IO-APIC-level  ohci_hcd, ohci_hcd
 24:     317611       1085   IO-APIC-level  eth0
NMI:          0          0
LOC:    5908168    5908259
ERR:          0
MIS:          0

> 4) ACPI powermanagement (32bit and 64bit): No matter which ACPI options

AFAIK power management is almost unsupported on SMP systems.

Tom

--
T h o m a s Z e h e t b a u e r ( TZ251 )
PGP encrypted mail preferred - KeyID 96FFCB89
finger [email protected] for key

Quantum Mechanics is God's version of "Trust me."

2004-11-02 17:55:52

by Tomas Carnecky

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

> AFAIK power management is almost unsupported on SMP systems.
>

Even on Pentium 4 processors with HT? They show up as two CPUs and you
can enable SMP. Does power management not work on those processors? HT is
quite common on today's processors; I don't know if notebook CPUs have
HT, but most of the new Intel desktop CPUs have it.

tom

2004-11-02 21:58:15

by Thomas Zehetbauer

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On Tue, 2004-11-02 at 18:55 +0100, Tomas Carnecky wrote:
> Even on Pentium 4 Processors with HT?

Of course not. Hyper-Threading (HT) is just the operating system and the
processor agreeing on the lie that there are two processors. Maybe this
is still advantageous for operating systems that still do not support
real preemptive multitasking.

Tom

--
T h o m a s Z e h e t b a u e r ( TZ251 )
PGP encrypted mail preferred - KeyID 96FFCB89
finger [email protected] for key

"Memory is like gasoline. You use it up when you are running. Of course you
get it all back when you reboot..."
Microsoft Helpdesk

2004-11-02 22:13:02

by Rafael J. Wysocki

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

Hi,

On Tuesday 02 of November 2004 14:59, Daniel Egger wrote:
> Hija,
>
> I've a few glitches with my brandnew dual Opteron System which I'd
> like to share with you. First of all, all those problems seem to
> be there with 2.6.8.1 and 2.6.9 but since this seemed to be the
> case I moved on with 2.6.9 and hadn't investigated any further
> on 2.6.8.1 so some of the issues might only apply to 2.6.9.

I'm using 2.6.10-rc1-mm2 currently, on a dual Opteron w/ Tyan Thunder K8W.

> 1) 32 bit kernel HPET calibration hang: If the kernel is compiled
> with HPET support, the kernel will hang on boot while
> calibrating the timer. The problem goes away if HPET support is
> not compiled in. I've no idea what information to provide to help
> debug this.

I can't confirm this. I've just set CONFIG_HPET and friends (except for
CONFIG_HPET_RTC_IRQ) and nothing wrong happens.

> 2) 64 bit kernel vgettimeofday panic: The kernel panics in
> arch/x86_64/kernel/vsyscall.c:169 on boot.

This does not happen on my system.

>
> static int __init vsyscall_init(void)
> {
> if ((unsigned long) &vgettimeofday !=
> VSYSCALL_ADDR(__NR_vgettimeofday))
> panic("vgettimeofday link addr broken");
>
> Replacing those panic(s) by printk make the machine boot just fine
> and also work (seemingly) without any problems under load.
>
> 3) Interrupt distribution 32 bit vs. 64 bit. Below is a copy of the
> current interrupt distribution for the 64 bit kernel which shows
> a huge shift towards CPU1. In a 32 bit kernel the distribution is
> reversed and even more visible than here since in total <100
> interupts will be handled by CPU1 after days of operation. The 64
> bit kernel has all relevant options for K8 (irq balancing,
> NUMA support, etc.) enabled.
>
> CPU0 CPU1
> 0: 15260 4196668 IO-APIC-edge timer
> 9: 0 0 IO-APIC-level acpi
> 169: 0 5 IO-APIC-level ehci_hcd
> 177: 0 3 IO-APIC-level uhci_hcd, ohci1394
> 185: 1999 934839 IO-APIC-level uhci_hcd, eth0
> NMI: 2698 2817
> LOC: 4211263 4211263
> ERR: 0
> MIS: 0

I see this effect too, but I'd attribute it to the fact that on my board all
of the I/O is attached to one of the processors (CPU1 in this case, I'd bet).

> 4) ACPI powermanagement (32bit and 64bit): No matter which ACPI options
> I choose in the BIOS, ACPI will only handle the first CPU somewhat
> and leave the second CPU alone. I'd love to have some simple
> powermanagement because the system will get quite warm, even when
> idle, and warm == loud because the fans (which are barely noticeable
> when the system is cold) kick into gear quite fast.
>
> processor id: 0
> acpi id: 1
> bus mastering control: no
> power management: no
> throttling control: yes
> limit interface: yes
> active limit: P0:T0
> user limit: P0:T0
> thermal limit: P0:T0
> active state: C1
> default state: C1
> bus master activity: 00000000
> states:
> *C1: promotion[--] demotion[--] latency[000]
> usage[00000000]
> C2: <not supported>
> C3: <not supported>
> state count: 8
> active state: T0
> states:
> *T0: 00%
> T1: 12%
> T2: 25%
> T3: 37%
> T4: 50%
> T5: 62%
> T6: 75%
> T7: 87%
>
> processor id: 1
> acpi id: 2
> bus mastering control: no
> power management: no
> throttling control: no
> limit interface: no
> <not supported>
> active state: C1
> default state: C1
> bus master activity: 00000000
> states:
> *C1: promotion[--] demotion[--] latency[000]
> usage[00000000]
> C2: <not supported>
> C3: <not supported>
> <not supported>

This happens on my system too.

Greets,
RJW

--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2004-11-02 22:22:26

by Jesse Pollard

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On Tuesday 02 November 2004 07:59, Daniel Egger wrote:
>
>
> 2) 64 bit kernel vgettimeofday panic: The kernel panics in
> arch/x86_64/kernel/vsyscall.c:169 on boot.
>
> static int __init vsyscall_init(void)
> {
> if ((unsigned long) &vgettimeofday !=
> VSYSCALL_ADDR(__NR_vgettimeofday))
> panic("vgettimeofday link addr broken");
>
> Replacing those panic(s) by printk make the machine boot just fine
> and also work (seemingly) without any problems under load.
>

This may be all wet, but...
> if ((unsigned long) &vgettimeofday !=
       ^^^^^^^^^^^^^ this is a 32 bit value
> VSYSCALL_ADDR(__NR_vgettimeofday))
  ^^^^^^^^^^^^^ and I think this is a 64 bit value.
> panic("vgettimeofday link addr broken");

And promoting an unsigned 32 bit value to 64 bits will not match under any circumstances.

I bet it would work if "(unsigned long)" were "(void *)".

2004-11-02 22:51:56

by Daniel Egger

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On 02.11.2004, at 17:58, Thomas Zehetbauer wrote:

> I am using a not-so-new Tyan Thunder K8W S2885 based Dual Opteron
> System.

Mine is a Tyan Tiger K8W. :)

>> 2) 64 bit kernel vgettimeofday panic: The kernel panics in

> Cannot confirm this, both 2.6.8.1 and 2.6.9 boot OK.

Could be the compiler; I'm using a gcc HEAD snapshot from yesterday.

However, since I don't have any problems with the panics replaced by
printk, I have trouble understanding what they mean.

>> 3) Interrupt distribution 32 bit vs. 64 bit. Below is a copy of the

> Cannot confirm this, interrupts seem to be almost equally distributed
> with 64-bit kernel and irqbalance running. Did you note that x86_64
> does
> not provide in-kernel IRQ balancing.

Fair enough. Thanks for the pointer.

>> 4) ACPI powermanagement (32bit and 64bit): No matter which ACPI options

> AFAIK power management is almost unsupported on SMP systems.

Strange. The ACPI tables seem to be filled with valuable information,
which I can configure in a fairly fine-grained way in the BIOS, and I even
seem to get somewhat useful options for the first CPU.

Also, /proc/cpuinfo mentions power management:
...
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts ttp

Whatever ts and ttp may mean.

I'd really love to have this machine running and use its power on demand
instead of having to think about more sophisticated airflow to keep the
temperature (of idle CPUs), and thus the noise level, down.

Servus,
Daniel


2004-11-03 05:10:12

by Andi Kleen

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

Daniel Egger <[email protected]> writes:

> 2) 64 bit kernel vgettimeofday panic: The kernel panics in
> arch/x86_64/kernel/vsyscall.c:169 on boot.
>
> static int __init vsyscall_init(void)
> {
> if ((unsigned long) &vgettimeofday !=
> VSYSCALL_ADDR(__NR_vgettimeofday))
> panic("vgettimeofday link addr broken");
>
> Replacing those panic(s) by printk make the machine boot just fine
> and also work (seemingly) without any problems under load.

Can you print the two values? I've never seen such a problem.
If it works then they must be identical, otherwise user space would
break very quickly.

-Andi

2004-11-03 10:53:33

by Daniel Egger

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On 03.11.2004, at 06:06, Andi Kleen wrote:

>> Replacing those panic(s) by printk make the machine boot just fine
>> and also work (seemingly) without any problems under load.

> Can you print the two values? I've never seen such a problem.
> If it works then they must be identical, otherwise user space would
> break very quickly.

printk("%p %p %p\n", (unsigned long) &vgettimeofday, &vgettimeofday,
VSYSCALL_ADDR(__NR_vgettimeofday));

ffffffffff600000 ffffffffff600000 ffffffffff600000

I've no idea why it still triggers. The next check fires too, BTW:
vtime link addr brokenIA32

The compiler is: gcc version 3.4.0 20040111 (experimental)

Servus,
Daniel


2004-11-03 11:17:27

by Andi Kleen

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On Wed, Nov 03, 2004 at 11:53:05AM +0100, Daniel Egger wrote:
> On 03.11.2004, at 06:06, Andi Kleen wrote:
>
> >> Replacing those panic(s) by printk make the machine boot just fine
> >> and also work (seemingly) without any problems under load.
>
> >Can you print the two values? I've never seen such a problem.
> >If it works then they must be identical, otherwise user space would
> >break very quickly.
>
> printk("%p %p %p\n", (unsigned long) &vgettimeofday, &vgettimeofday,
> VSYSCALL_ADDR(__NR_vgettimeofday));
>
> ffffffffff600000 ffffffffff600000 ffffffffff600000
>
> I've no idea why it still triggers. Also the next one BTW:
> vtime link addr brokenIA32
>
> The compiler is: gcc version 3.4.0 20040111 (experimental)

Looks like a compiler bug. I would talk to the gcc people.

-Andi

2004-11-03 15:19:56

by Jesse Pollard

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On Wednesday 03 November 2004 05:05, Andi Kleen wrote:
> On Wed, Nov 03, 2004 at 11:53:05AM +0100, Daniel Egger wrote:
> > On 03.11.2004, at 06:06, Andi Kleen wrote:
> > >> Replacing those panic(s) by printk make the machine boot just fine
> > >> and also work (seemingly) without any problems under load.
> > >
> > >Can you print the two values? I've never seen such a problem.
> > >If it works then they must be identical, otherwise user space would
> > >break very quickly.
> >
> > printk("%p %p %p\n", (unsigned long) &vgettimeofday, &vgettimeofday,
> > VSYSCALL_ADDR(__NR_vgettimeofday));
> >
> > ffffffffff600000 ffffffffff600000 ffffffffff600000
> >
> > I've no idea why it still triggers. Also the next one BTW:
> > vtime link addr brokenIA32
> >
> > The compiler is: gcc version 3.4.0 20040111 (experimental)
>
> Looks like a compiler bug. I would talk to the gcc people.

Personally, I think it is a type error: unsigned long is 32 bits.

It appears to be comparing it to an address, which is 64 bits.

There is no sign extension from 32 bit to 64 bit for an unsigned number.

This same problem occurred in Kerberos when compiled on AMD in
64 bit mode. The solution was to use (void *).

2004-11-06 23:06:12

by Christopher E. Brown

Subject: Re: 2.6.8 and 2.6.9 Dual Opteron glitches

On Tue, 2 Nov 2004, Rafael J. Wysocki wrote:

> Hi,
>
> On Tuesday 02 of November 2004 14:59, Daniel Egger wrote:
> > Hija,
> >
> > I've a few glitches with my brandnew dual Opteron System which I'd
> > like to share with you. First of all, all those problems seem to
> > be there with 2.6.8.1 and 2.6.9 but since this seemed to be the
> > case I moved on with 2.6.9 and hadn't investigated any further
> > on 2.6.8.1 so some of the issues might only apply to 2.6.9.
>
> I'm using 2.6.10-rc1-mm2 currently, on a dual Opteron w/ Tyan Thunder K8W.
>
> > 1) 32 bit kernel HPET calibration hang: If the kernel is compiled
> > with HPET support, the kernel will hang on boot while
> > calibrating the timer. The problem goes away if HPET support is
> > not compiled in. I've no idea what information to provide to help
> > debug this.
>
> I can't confirm this. I've just set CONFIG_HPET and friends (except for
> CONFIG_HPET_RTC_IRQ) and nothing wrong happens.


Running 2.6.9 32bit here on a Tyan Thunder K8W with BIOS 2.02.

It is a new system. I started with BIOS 1.02 and built the kernel with
CONFIG_HPET and CONFIG_HPET_RTC_IRQ enabled: no issues, but no messages
about HPET during boot either.

After updating to the current BIOS (which, among other things, adds an
enable/disable HPET option) I have the same issue: the system hangs on
HPET calibration, with an oops several minutes later. If I disable the
HPET in the BIOS, the system boots just fine (but without HPET).