2004-09-16 04:50:26

by Sergei Haller

Subject: lost memory on a 4GB amd64


Hello,

A friend of mine has a new Opteron based machine (Tyan Tiger K8W with two
Opteron 24?) and 4GB main memory.

The problem is that about 512 MB of that memory is lost (AGP aperture and
so on), although everything else works perfectly.
As far as I understand, all the PCI/AGP hardware uses the top end of the
4GB address range to access its memory, so those addresses overlap with
main memory and only the remaining 3.5 GB are available.


Now there is an option in the BIOS called "Adjust Memory" which puts a
certain amount of memory (several choices between 64MB and 2GB) above the
4GB address range. I tried the 2GB setting which results in 2GB main
memory at addresses 0-2GB and 2GB memory at addresses 4GB-6GB.

The problem is that with this option enabled the kernel (2.6.3-9mdksmp and
vanilla 2.6.8.1) crashes as soon as some memory-hungry program is run
(e.g. X).

I've seen some postings on the net talking about some "kernel patch" for
some "memory split", but nothing more specific. Do I just need a certain
patch to get it working or is there more to it?



BTW, the memory map displayed at boot is

BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000d3ff0000 (usable)
BIOS-e820: 00000000d3ff0000 - 00000000d3fff000 (ACPI data)
BIOS-e820: 00000000d3fff000 - 00000000d4000000 (ACPI NVS)
BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)

if I leave the 4GB memory in one chunk and

BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
BIOS-e820: 000000007fff0000 - 000000007ffff000 (ACPI data)
BIOS-e820: 000000007ffff000 - 0000000080000000 (ACPI NVS)
BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)

if I enable the "adjust memory" option and split the memory in two 2GB
blocks.

Thanks in advance,

Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


2004-09-16 07:01:38

by Andi Kleen

Subject: Re: lost memory on a 4GB amd64

Sergei Haller <[email protected]> writes:

> the problem is that about 512 MB of that memory is lost (AGP aperture and
> stuff). Although everything is perfect otherwise.
> As far as I understand, all the PCI/AGP hardware uses the top end of the
> 4GB address range to access their memory and there is just an
> "overlapping" of the addresses. thus only the remaining 3.5 GB are
> available.

It's a BIOS issue. Nothing the kernel can do about it. You are
talking to the wrong people.

-Andi


2004-09-16 12:16:35

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Thu, 16 Sep 2004, Andi Kleen (AK) wrote:

AK> Sergei Haller <[email protected]> writes:
AK>
AK> > the problem is that about 512 MB of that memory is lost (AGP aperture and
AK> > stuff). Although everything is perfect otherwise.
AK> > As far as I understand, all the PCI/AGP hardware uses the top end of the
AK> > 4GB address range to access their memory and there is just an
AK> > "overlapping" of the addresses. thus only the remaining 3.5 GB are
AK> > available.
AK>
AK> It's a BIOS issue. Nothing the kernel can do about it. You are
AK> talking to the wrong people.

Hi Andi,

I am sure you did read the rest of my mail, didn't you? I mean the part
where I describe that there is an option in the BIOS for that but the
kernel crashes if I enable it.

If it is still a problem of the BIOS, could you please be more specific
about what exactly is the problem with the BIOS?


Thanks,
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-16 12:37:43

by Andi Kleen

Subject: Re: lost memory on a 4GB amd64

On Thu, Sep 16, 2004 at 10:15:22PM +1000, Sergei Haller wrote:
> I am sure you did read the rest of my mail, didn't you? I mean the part

I did.

> where I describe that there is an option in the BIOS for that but the
> kernel crashes if I enable it.

It means the memory was not correctly configured. If you don't trust the kernel
you can use memtest86 to confirm it.

> If it is still a problem of the BIOS, could you please be more specific
> about what exactly is the problem with the BIOS?

That it doesn't supply usable memory to the kernel with that option.

-Andi

2004-09-16 13:30:18

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

Hi Sergei,

I have the same board with 4Gb ram running a 64bit 2.6.8.1 and solved it by
changing something in the bios. Let me reboot and I'll take a note of what I
did...

On Thursday 16 Sep 2004 05:48, Sergei Haller wrote:
> Hello,
>
> A friend of mine has a new Opteron based machine (Tyan Tiger K8W with two
> Opteron 24?) and 4GB main memory.
>
> the problem is that about 512 MB of that memory is lost (AGP aperture and
> stuff). Although everything is perfect otherwise.
> As far as I understand, all the PCI/AGP hardware uses the top end of the
> 4GB address range to access their memory and there is just an
> "overlapping" of the addresses. thus only the remaining 3.5 GB are
> available.
>
>
> Now there is an option in the BIOS called "Adjust Memory" which puts a
> certain amount of memory (several choices between 64MB and 2GB) above the
> 4GB address range. I tried the 2GB setting which results in 2GB main
> memory at addresses 0-2GB and 2GB memory at addresses 4GB-6GB.
>
> the problem is that the kernel (2.6.3-9mdksmp and vanilla 2.6.8.1) crashes
> if this option is enabled as soon as some memory expensive program is run
> (e.g. X)
>
> I've seen some postings on the net talking about some "kernel patch" for
> some "memory split", but nothing more specific. Do I just need a certain
> patch to get it working or is there more to it?
>
>
>
> BTW, the memory map displayed at boot is
>
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000d3ff0000 (usable)
> BIOS-e820: 00000000d3ff0000 - 00000000d3fff000 (ACPI data)
> BIOS-e820: 00000000d3fff000 - 00000000d4000000 (ACPI NVS)
> BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
>
> if I leave the 4GB memory in one chunk and
>
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
> BIOS-e820: 000000007fff0000 - 000000007ffff000 (ACPI data)
> BIOS-e820: 000000007ffff000 - 0000000080000000 (ACPI NVS)
> BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
>
> if I enable the "adjust memory" option and split the memory in two 2GB
> blocks.
>
> Thanks in advance,
>
> Sergei

2004-09-16 13:48:46

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Thursday 16 Sep 2004 05:48, Sergei Haller wrote:
> Hello,
>
> A friend of mine has a new Opteron based machine (Tyan Tiger K8W with two
> Opteron 24?) and 4GB main memory.

Typo? Tyan Thunder?

>
> the problem is that about 512 MB of that memory is lost (AGP aperture and
> stuff). Although everything is perfect otherwise.
> As far as I understand, all the PCI/AGP hardware uses the top end of the
> 4GB address range to access their memory and there is just an
> "overlapping" of the addresses. thus only the remaining 3.5 GB are
> available.
>
>
> Now there is an option in the BIOS called "Adjust Memory" which puts a
> certain amount of memory (several choices between 64MB and 2GB) above the
> 4GB address range. I tried the 2GB setting which results in 2GB main
> memory at addresses 0-2GB and 2GB memory at addresses 4GB-6GB.
>

Ok;

Assuming bios version 2.02. (upgrade if you haven't already);

The option you mention should be set to 'Auto'

Chipset->Northbridge->Memory Configuration->Adjust Memory = Auto

but set

Advanced->Cpu Configuration->MTRR Mapping = Continuous

That fixed it for me if I remember correctly :)

Andrew Walrond

2004-09-16 14:12:59

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Thu, 16 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Thursday 16 Sep 2004 05:48, Sergei Haller wrote:
AW> > Hello,
AW> >
AW> > A friend of mine has a new Opteron based machine (Tyan Tiger K8W with two
AW> > Opteron 24?) and 4GB main memory.
AW>
AW> Typo? Tyan Thunder?

no, it's a tiger: http://www.tyan.com/products/html/tigerk8w.html

AW> > the problem is that about 512 MB of that memory is lost (AGP aperture and
AW> > stuff). Although everything is perfect otherwise.
AW> > As far as I understand, all the PCI/AGP hardware uses the top end of the
AW> > 4GB address range to access their memory and there is just an
AW> > "overlapping" of the addresses. thus only the remaining 3.5 GB are
AW> > available.
AW> >
AW> >
AW> > Now there is an option in the BIOS called "Adjust Memory" which puts a
AW> > certain amount of memory (several choices between 64MB and 2GB) above the
AW> > 4GB address range. I tried the 2GB setting which results in 2GB main
AW> > memory at addresses 0-2GB and 2GB memory at addresses 4GB-6GB.
AW> >
AW>
AW> Ok;
AW>
AW> Assuming bios version 2.02.

yes

AW> The option you mention should be set to 'Auto'
AW>
AW> Chipset->Northbridge->Memory Configuration->Adjust Memory = Auto
AW>
AW> but set
AW>
AW> Advanced->Cpu Configuration->MTRR Mapping = Continuous

I had "MTRR Mapping = Continuous" set all the time and tried "Adjust
Memory" in all three modes (Auto/manual/disabled) and manual with 1 and
2gb size.

today I discovered the MTRR option and changed it to "discrete".
tried "Adjust Memory" manually at 2gb.

the only working (but with loss of memory) combination seems to be "Adjust
Memory = disabled", independent of "MTRR Mapping".

The only combination I didn't try is "MTRR Mapping=Discrete"+"Adjust
Memory= Auto". Will try tomorrow morning.


AW> That fixed it for me if I remember correctly :)

do you have a thunder or a tiger? And could you check in your BIOS setup
which options you used?



thanks,
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-16 14:28:25

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Thursday 16 Sep 2004 15:09, you wrote:
> AW> Typo? Tyan Thunder?
>
> no, it's a tiger: http://www.tyan.com/products/html/tigerk8w.html

Ah - ok; I thought the Tiger was their dual athlon board. Didn't realise they
had a dual opteron version.

I have a Thunder K8W.

>
> AW> The option you mention should be set to 'Auto'
> AW>
> AW> Chipset->Northbridge->Memory Configuration->Adjust Memory = Auto
> AW>
> AW> but set
> AW>
> AW> Advanced->Cpu Configuration->MTRR Mapping = Continuous
>
> I had "MTRR Mapping = Continuous" set all the time and tried "Adjust
> Memory" in all three modes (Auto/manual/disabled) and manual with 1 and
> 2gb size.
>
> today I had discovered the MTRR option and changed it to "discrete".
> tried "Adjust Memory" manually at 2gb.
>
> the only working (but with loss of memory) combination seems to be "Adjust
> Memory = disabled", independent of "MTRR Mapping".
>
> The only combination I didn't try is "MTRR Mapping=Discrete"+"Adjust
> Memory= Auto". Will try tomorrow morning.
>

On further investigation, The settings I mentioned, 'Auto' and 'Continuous'
only work when running a 64bit kernel. Are you running a 32bit kernel?

Andrew

2004-09-16 14:58:07

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Thu, 16 Sep 2004, Andrew Walrond (AW) wrote:

AW>
AW> On further investigation, The settings I mentioned, 'Auto' and 'Continuous'
AW> only work when running a 64bit kernel. Are you running a 32bit kernel?

it's a 64bit one. the precise setting for the processor is
"AMD-Opteron/Athlon64". Should I try "Generic-x86-64"?

BTW, 32bit processors are not offered at all.


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-16 15:35:17

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Thursday 16 Sep 2004 15:56, Sergei Haller wrote:
> On Thu, 16 Sep 2004, Andrew Walrond (AW) wrote:
>
> AW>
> AW> On further investigation, The settings I mentioned, 'Auto' and 'Continuous'
> AW> only work when running a 64bit kernel. Are you running a 32bit kernel?
>
> it's a 64bit one. the precise setting for the processor is
> "AMD-Opteron/Athlon64". Should I try "Generic-x86-64"?

No - that's what I use. Do you have MTRR support enabled?

I'll send you my .config file; Perhaps you could try that.

Andrew

2004-09-16 15:57:15

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Thu, 16 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Thursday 16 Sep 2004 15:56, Sergei Haller wrote:
AW> > On Thu, 16 Sep 2004, Andrew Walrond (AW) wrote:
AW> >
AW> > AW>
AW> > AW> On further investigation, The settings I mentioned, 'Auto' and 'Continuous'
AW> > AW> only work when running a 64bit kernel. Are you running a 32bit kernel?
AW> >
AW> > it's a 64bit one. the precise setting for the processor is
AW> > "AMD-Opteron/Athlon64". Should I try "Generic-x86-64"?
AW>
AW> No - thats what I use. Do you have MTRR support enabled?

yes.

AW> I'll send you my .config file; Perhaps you could try that.

I just had a look at it. tomorrow morning I'll try out some of the
options. If you like, I can send you my .config, so you can tell me which
options are more likely to affect memory handling.

c ya
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-18 14:08:00

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Thu, 16 Sep 2004, Andi Kleen (AK) wrote:

AK> On Thu, Sep 16, 2004 at 10:15:22PM +1000, Sergei Haller wrote:
AK>
AK> > where I describe that there is an option in the BIOS for that but the
AK> > kernel crashes if I enable it.
AK>
AK> It means the memory was not correct configured. If you don't trust the kernel
AK> you can use memtest86 to confirm it.

memtest86 is happy with the memory.


c ya
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-18 14:19:45

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 17 Sep 2004, Sergei Haller (SH) wrote:

SH> AW> No - thats what I use. Do you have MTRR support enabled?
SH>
SH> yes.
SH>
SH> AW> I'll send you my .config file; Perhaps you could try that.
SH>
SH> I just had a look at it. tomorrow morning I'll try out some of the
SH> options.

I tried out many configurations of the kernel config, nothing helped.

now I switched off SMP and it runs stable! So what am I to do about it?

that's the summary:

* if the memory is configured in one chunk (0-4gb) then the SMP kernel
works, but I only have about 3.4 gb main memory. (I know why)

* if the memory configuration is as follows: the first 3gb are at the
normal address range, the fourth gb is at the address range 4-5gb.
then all 4gb are available (not quite -- a few mb are missing, but
that's ok) and
- the SMP kernel panics as soon as I start X or allocate about 1.6gb of
memory (maybe less would trigger that as well, that was the only test
I ran). Kernel compilation, however, runs fine.
- the non-SMP kernel runs stable.
- memtest86 runs fine

all kernels I mention are 2.6.8.1 vanilla.

What do you think? Is there anything I can do?



Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-19 20:01:27

by Jon Masters

Subject: Re: lost memory on a 4GB amd64

On Sun, 19 Sep 2004 00:18:38 +1000 (EST), Sergei Haller

> * if the memory configuration is as follows: the first 3gb are at the
> normal address range, the fourth gb is at the address range 4-5gb.
> then all 4gb are available (not quite -- a few mb are missing, but
> that's ok) and
> - the SMP kernel panics as soon as I start X

Just out of interest - can you say what tests you ran here - for
example whether you tried allocating large amounts of memory from a
userspace process without running X and/or touching bits of memory
mapped hardware? You say a kernel compile works fine so can you rule
out this being X taking down the system (your previous mail seemed
somewhat unclear).

Jon.

2004-09-19 21:48:10

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Sun, 19 Sep 2004, Jon Masters (JM) wrote:

JM> On Sun, 19 Sep 2004 00:18:38 +1000 (EST), Sergei Haller
JM>
JM> > * if the memory configuration is as follows: the first 3gb are at the
JM> > normal address range, the fourth gb is at the address range 4-5gb.
JM> > then all 4gb are available (not quite -- a few mb are missing, but
JM> > that's ok) and
JM> > - the SMP kernel panics as soon as I start X
JM>
JM> Just out of interest - can you say what tests you ran here - for
JM> example whether you tried allocating large amounts of memory from a
JM> userspace process without running X and/or touching bits of memory
JM> mapped hardware? You say a kernel compile works fine so can you rule
JM> out this being X taking down the system (your previous mail seemed
JM> somewhat unclear).

- as soon as I start X, the machine is gone.
- if I compile the kernel (without X of course) it's ok.

the machine is supposed to be for scientific calculations and we run magma
on it. The test I ran is just a one-liner which creates a matrix of size
20000x20000 with zeros in it. so basically it just allocates 1.6gb of
memory and writes zeros in it. the SMP kernel with the above memory
configuration crashes immediately.

I guess I should write a simple C-program using malloc or something to
reproduce the crash in the simplest possible way, shouldn't I?

Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-19 22:00:15

by Jon Masters

Subject: Re: lost memory on a 4GB amd64

On Mon, 20 Sep 2004 07:47:20 +1000 (EST), Sergei Haller

> I guess I should write a simple C-program using malloc or something to
> reproduce the crash in the simplest possible way, shouldn't I?

You've answered your own question Sergei. Thing is - you mentioned the
AGP aperture settings in your original post and then got tied up
thinking there's a bug in the kernel but we have to rule out stuff
like X getting very unhappy trying to play in the wrong place or
something. Try a simple test case and then see if you can give any
other handy details on your situation.

Jon.

2004-09-19 22:39:10

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Sun, 19 Sep 2004, Jon Masters (JM) wrote:

JM> On Mon, 20 Sep 2004 07:47:20 +1000 (EST), Sergei Haller
JM>
JM> > I guess I should write a simple C-program using malloc or something to
JM> > reproduce the crash in the simplest possible way, shouldn't I?
JM>
JM> You've answered your own question Sergei. Thing is - you mentioned the
JM> AGP aperture settings in your original post [...]

well, AGP and PCI I mentioned only as "the bad guys stealing the memory"
and as the reason why I wanted to split the main memory in two 2gb
blocks or in 3gb+1gb, one block being at the normal address range and the
other at the addresses >4gb

JM> but we have to rule out stuff like X

I guess you're right about this.

JM> Try a simple test case and then see if you can give any other handy
JM> details on your situation.


thanks so far,

Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-20 10:26:31

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Mon, 20 Sep 2004, Sergei Haller (SH) wrote:

SH> On Sun, 19 Sep 2004, Jon Masters (JM) wrote:
SH>
SH> JM> On Mon, 20 Sep 2004 07:47:20 +1000 (EST), Sergei Haller
SH> JM>
SH> JM> > I guess I should write a simple C-program using malloc or something to
SH> JM> > reproduce the crash in the simplest possible way, shouldn't I?


here we go again. that's the program I wrote:
--------------------------------------------------------------
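#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    unsigned long long int bytes;
    char *mem;

    /* size to allocate: first argument in bytes, default 1gb */
    if (argc < 2)
        bytes = 0x40000000; // 1gb
    else
        bytes = strtoull(argv[1], NULL, 10);

    printf("allocate %llu: ", bytes);

    if ((mem = malloc(bytes)) != NULL)
    {
        printf("ok\n");
        printf("set them to 0... ");

        /* writing to every byte makes the kernel actually provide the pages */
        memset(mem, 0, bytes);

        printf("done\n");

        free(mem);
    }
    else
        printf("not ok\n");

    return 0;
}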
--------------------------------------------------------------

and that's the log:

fang ~sergei> ./memtest 10000
allocate 10000: ok
set them to 0... done
fang ~sergei> ./memtest 100000
allocate 100000: ok
set them to 0... done
fang ~sergei> ./memtest 1000000
allocate 1000000: ok
set them to 0... done
fang ~sergei> ./memtest 10000000
allocate 10000000: ok
set them to 0... done
fang ~sergei> ./memtest 100000000
allocate 100000000: ok
set them to 0... done
fang ~sergei> ./memtest 1000000000
allocate 1000000000: ok

Message from syslogd@fang at Mon Sep 20 18:03:16 2004 ...
fang kernel: Oops: 0000 [1] PREEMPT SMP


Attached is the full OOPS excerpt from /var/log/messages


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


Attachments:
messages (5.92 kB)

2004-09-24 04:38:51

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64


I just discovered the linux-smp list and decided to summarize the topic
and take the opportunity to cross-post to there.

an archive of the discussion can be found e.g. here:
http://marc.theaimsgroup.com/?t=109525952600004&r=1&w=4


The machine at hand is the Tyan Tiger K8W with two Opterons 246
(http://www.tyan.com/products/html/tigerk8w.html) and 4GB of memory

we are using vanilla 2.6.8.1 kernel.

* if the memory is set up in the ordinary way in the BIOS, then
approximately 512MB are lost (PCI/AGP addressing and stuff), but
everything is stable
* if the memory is set up in the BIOS to be in two chunks (e.g. 3GB at
the address range 0-3GB and 1GB at 4-5GB address range), then
- memtest86 tells everything is fine.
- if we run a non-SMP kernel, everything is stable
- if we run an SMP kernel, it crashes as soon as approx. 1GB of memory
is allocated and set to 0 (see the test case C-program in my
previous mail)

Is there anything we can do? Any logs I can provide? Something to try out?

Thanks,
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-24 08:16:01

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Monday 20 Sep 2004 11:26, Sergei Haller wrote:
>
> fang ~sergei> ./memtest 1000000000
> allocate 1000000000: ok
>
> Message from syslogd@fang at Mon Sep 20 18:03:16 2004 ...
> fang kernel: Oops: 0000 [1] PREEMPT SMP
>

Works fine on my 4Gb Tyan thunder K8W machine, even running from an xterm:
andrew@orac ~ $ uname -a
Linux orac.walrond.org 2.6.8.1 #3 SMP Sun Aug 29 17:36:49 BST 2004 x86_64
unknown unknown GNU/Linux

andrew@orac ~ $ free -m
             total       used       free     shared    buffers     cached
Mem:          4008        263       3744          0          0         66
-/+ buffers/cache:        195       3812
Swap:         3827         25       3802

andrew@orac ~ $ ./memtest 1000000000
allocate 1000000000: ok
set them to 0... done
andrew@orac ~ $ ./memtest 4000000000
allocate 4000000000: ok
set them to 0... done
andrew@orac ~ $ ./memtest 5000000000
allocate 5000000000: ok
set them to 0... done
andrew@orac ~ $

The last one took a while (using 1Gb swap) but it still worked fine.

Without swap:

andrew@orac ~ $ sudo swapoff -a
andrew@orac ~ $ free -m
             total       used       free     shared    buffers     cached
Mem:          4008        237       3770          0          1         54
-/+ buffers/cache:        181       3826
Swap:            0          0          0
andrew@orac ~ $ ./memtest 1000000000
allocate 1000000000: ok
set them to 0... done
andrew@orac ~ $ ./memtest 2000000000
allocate 2000000000: ok
set them to 0... done
andrew@orac ~ $

Still fine.

Andrew Walrond

2004-09-24 08:23:33

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 24 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Monday 20 Sep 2004 11:26, Sergei Haller wrote:
AW> >
AW> > fang ~sergei> ./memtest 1000000000
AW> > allocate 1000000000: ok
AW> >
AW> > Message from syslogd@fang at Mon Sep 20 18:03:16 2004 ...
AW> > fang kernel: Oops: 0000 [1] PREEMPT SMP
AW> >
AW>
AW> Works fine on my 4Gb Tyan thunder K8W machine, even running from an xterm:
AW> andrew@orac ~ $ uname -a
AW> Linux orac.walrond.org 2.6.8.1 #3 SMP Sun Aug 29 17:36:49 BST 2004 x86_64
AW> unknown unknown GNU/Linux
AW>
AW> andrew@orac ~ $ free -m
AW>              total       used       free     shared    buffers     cached
AW> Mem:          4008        263       3744          0          0         66
AW> -/+ buffers/cache:        195       3812
AW> Swap:         3827         25       3802
AW>
AW> andrew@orac ~ $ ./memtest 1000000000
AW> allocate 1000000000: ok
AW> set them to 0... done
AW> andrew@orac ~ $ ./memtest 4000000000
AW> allocate 4000000000: ok
AW> set them to 0... done
AW> andrew@orac ~ $ ./memtest 5000000000
AW> allocate 5000000000: ok
AW> set them to 0... done
AW> andrew@orac ~ $
AW>
AW> The last one took a while (using 1Gb swap) but it still worked fine.

Hi Andrew,

thanks for your report.

It's the same for me if I use the non-SMP version of the kernel.
but the SMP one seems to be panicking for some reason.


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-24 08:31:54

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Friday 24 Sep 2004 09:23, Sergei Haller wrote:
> It's the same for me if I use the non-SMP version of the kernel.
> but the SMP one seems to be panicking for some reason.
>

Just a thought; How are the memory modules arranged on the board?
I have 2 x 1Gb modules in each cpu-specific bank, rather than all four in
cpu1's bank. How are yours arranged?


2004-09-24 08:58:21

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 24 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Friday 24 Sep 2004 09:23, Sergei Haller wrote:
AW> > It's the same for me if I use the non-SMP version of the kernel.
AW> > but the SMP one seems to be panicking for some reason.
AW> >
AW>
AW> Just a thought; How are the memory modules arranged on the board?
AW> I have 2 x 1Gb modules in each cpu-specific bank, rather than all four in
AW> cpu1's bank. How are yours arranged?

my board has only four banks, each of them has a 1GB module sitting.
(page 26 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf)


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-24 09:26:13

by Rafael J. Wysocki

Subject: Re: lost memory on a 4GB amd64

On Friday 24 of September 2004 10:57, Sergei Haller wrote:
> On Fri, 24 Sep 2004, Andrew Walrond (AW) wrote:
>
> AW> On Friday 24 Sep 2004 09:23, Sergei Haller wrote:
> AW> > It's the same for me if I use the non-SMP version of the kernel.
> AW> > but the SMP one seems to be panicking for some reason.
> AW> >
> AW>
> AW> Just a thought; How are the memory modules arranged on the board?
> AW> I have 2 x 1Gb modules in each cpu-specific bank, rather than all four in
> AW> cpu1's bank. How are yours arranged?
>
> my board has only four banks, each of them has a 1GB module sitting.
> (page 26 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf)

Which is what makes the difference, I think. IMO, the problem is that _both_
CPUs use the same memory bank that is physically attached to only one of them
which leads to conflicts, apparently (the CPU with memory has also
PCI/AGP/whatever attached to it via HyperTransport so I can imagine there may
be issues with overlapping address spaces etc.). I'd bet that there's
something wrong either with the BIOS or with the board design itself and I
don't think there's anything that the kernel can do about it (usual
disclaimer applies).

Out of curiosity: have you tried to run the kernel with K8 NUMA enabled?

Greets,
RJW

--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2004-09-24 09:41:18

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Friday 24 Sep 2004 10:27, Rafael J. Wysocki wrote:
>
> > AW> cpu1's bank. How are yours arranged?
> >
> > my board has only four banks, each of them has a 1GB module sitting.
> > (page 26 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf)
>
> Which is what makes the difference, I think. IMO, the problem is that
> _both_ CPUs use the same memory bank that is physically attached to only
> one of them which leads to conflicts, apparently (the CPU with memory has
> also PCI/AGP/whatever attached to it via HyperTransport so I can imagine
> there may be issues with overlapping address spaces etc.). I'd bet that
> there's something wrong either with the BIOS or with the board design
> itself and I don't think there's anything that the kernel can do about it
> (usual disclaimer applies).
>
> Out of curiosity: have you tried to run the kernel with K8 NUMA enabled?
>

Actually, the block diagram on page 9 of the manual suggests that this is
_not_ a NUMA board, since all DIMMS are connected to cpu1. The block diagram
for my thunder k8w specifically shows DIMMS associated with individual
processors.

Which suggests that NUMA should be _disabled_ in the kernel config.

Have you tried it with NUMA disabled? I think I remember it being on in
the .config you sent me.

Andrew

2004-09-24 11:42:49

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 24 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Friday 24 Sep 2004 10:27, Rafael J. Wysocki wrote:
AW> >
AW> > > AW> cpu1's bank. How are yours arranged?
AW> > >
AW> > > my board has only four banks, each of them has a 1GB module sitting.
AW> > > (page 26 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf)
AW> >
AW> > Which is what makes the difference, I think. IMO, the problem is that
AW> > _both_ CPUs use the same memory bank that is physically attached to only
AW> > one of them which leads to conflicts, apparently (the CPU with memory has
AW> > also PCI/AGP/whatever attached to it via HyperTransport so I can imagine
AW> > there may be issues with overlapping address spaces etc.). I'd bet that
AW> > there's something wrong either with the BIOS or with the board design
AW> > itself and I don't think there's anything that the kernel can do about it
AW> > (usual disclaimer applies).
AW> >
AW> > Out of curiosity: have you tried to run the kernel with K8 NUMA enabled?
AW> >

yes.

AW> Actually, the block diagram on page 9 of the manual suggests that this is
AW> _not_ a NUMA board, since all DIMMS are connected to cpu1. The block diagram
AW> for my thunder k8w specifically shows DIMMS associated with individual
AW> processors.
AW>
AW> Which suggests that NUMA should be _disabled_ in the kernel config.

hmm.

AW> Have you tried it with NUMA disabled? I think I remember it being on in
AW> the .config you sent me.

NUMA was enabled all the time (at least most of the time). I don't know if
I ever ran it without NUMA. I'll certainly try that.

Unfortunately, I won't be able to do any reboots during the next one or
two weeks since the machine has gone into stable operation tonight. (with
some loss of memory for now)

If it is of interest, this is what dmesg says about NUMA:

BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
BIOS-e820: 00000000bfff0000 - 00000000bffff000 (ACPI data)
BIOS-e820: 00000000bffff000 - 00000000c0000000 (ACPI NVS)
BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
Scanning NUMA topology in Northbridge 24
Number of nodes 2 (10010)
Node 0 MemBase 0000000000000000 Limit 000000013fffffff
Skipping disabled node 1
Using node hash shift of 24
Bootmem setup node 0 0000000000000000-000000013fffffff
No mptable found.
On node 0 totalpages: 1310719
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 1306623 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1

So it actually looks like the kernel does notice that only one processor
has direct access to the memory here.


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-24 11:50:38

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 24 Sep 2004, Rafael J. Wysocki (RW) wrote:

RW> > my board has only four banks, each of them has a 1GB module sitting.
RW> > (page 26 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf)
RW>
RW> Which is what makes the difference, I think. IMO, the problem is that _both_
RW> CPUs use the same memory bank that is physically attached to only one of them
RW> which leads to conflicts, apparently (the CPU with memory has also
RW> PCI/AGP/whatever attached to it via HyperTransport so I can imagine there may
RW> be issues with overlapping address spaces etc.). I'd bet that there's
RW> something wrong either with the BIOS or with the board design itself and I
RW> don't think there's anything that the kernel can do about it (usual
RW> disclaimer applies).

I got the impression that the core of the problem is that the
kernel is getting some wrong information about the memory configuration.

Is there any way to find out exactly which piece of information is wrong and
leads to the error, and then to see where it comes from: whether the
BIOS is lying or the kernel is misinterpreting something?



Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-09-24 12:15:47

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Friday 24 Sep 2004 12:42, you wrote:
>
> NUMA was enabled all the time (at least most of the time). I don't know if
> I ever ran it without NUMA. I'll certainly try that.
>
> Unfortunately, I won't be able to do any reboots during the next one or
> two weeks since the machine has gone into stable operation tonight. (with
> some loss of memory for now)
>
> if it is of some interest, that's what dmesg tells about NUMA:
>
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
> BIOS-e820: 00000000bfff0000 - 00000000bffff000 (ACPI data)
> BIOS-e820: 00000000bffff000 - 00000000c0000000 (ACPI NVS)
> BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
> Scanning NUMA topology in Northbridge 24
> Number of nodes 2 (10010)
> Node 0 MemBase 0000000000000000 Limit 000000013fffffff
> Skipping disabled node 1
> Using node hash shift of 24
> Bootmem setup node 0 0000000000000000-000000013fffffff
> No mptable found.
> On node 0 totalpages: 1310719
> DMA zone: 4096 pages, LIFO batch:1
> Normal zone: 1306623 pages, LIFO batch:16
> HighMem zone: 0 pages, LIFO batch:1
>
> So actually it looks like the kernel well notices that only one processor
> has access to the memory here.
>

Intriguing. If it works with NUMA disabled, it would strongly indicate a bug
in the NUMA kernel code.

Definitely worth a try as soon as you can afford to take the machine down for
a few minutes.

Andrew

2004-10-22 09:05:54

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 24 Sep 2004, Andrew Walrond (AW) wrote:

AW> On Friday 24 Sep 2004 12:42, you wrote:
AW> >
AW> > NUMA was enabled all the time (at least most of the time). I don't know if
AW> > I ever ran it without NUMA. I'll certainly try that.
AW> >
AW> > Unfortunately, I won't be able to do any reboots during the next one or
AW> > two weeks since the machine has gone into stable operation tonight. (with
AW> > some loss of memory for now)
AW> >
AW> > if it is of some interest, that's what dmesg tells about NUMA:
AW> >
AW> > BIOS-provided physical RAM map:
AW> > BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
AW> > BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
AW> > BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
AW> > BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
AW> > BIOS-e820: 00000000bfff0000 - 00000000bffff000 (ACPI data)
AW> > BIOS-e820: 00000000bffff000 - 00000000c0000000 (ACPI NVS)
AW> > BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
AW> > BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
AW> > Scanning NUMA topology in Northbridge 24
AW> > Number of nodes 2 (10010)
AW> > Node 0 MemBase 0000000000000000 Limit 000000013fffffff
AW> > Skipping disabled node 1
AW> > Using node hash shift of 24
AW> > Bootmem setup node 0 0000000000000000-000000013fffffff
AW> > No mptable found.
AW> > On node 0 totalpages: 1310719
AW> > DMA zone: 4096 pages, LIFO batch:1
AW> > Normal zone: 1306623 pages, LIFO batch:16
AW> > HighMem zone: 0 pages, LIFO batch:1
AW> >
AW> > So actually it looks like the kernel well notices that only one processor
AW> > has access to the memory here.
AW> >
AW>
AW> Intriguing. If it works with NUMA disabled, it would strongly indicate a bug
AW> in the NUMA kernel code.

Now I have some good news (that is, I hope that this is good news).

If I disable NUMA in 2.6.8.1, it runs stably!

The same goes for 2.6.9, which has been out for a few days: if NUMA is
disabled, everything's fine; if NUMA is enabled, the kernel crashes (as
described in previous mails).

What is this NUMA, by the way? Is it OK to live without it?

If you need some additional output, let me know. I can't promise to be
fast, though (as I said, this machine is in production use now).


Thanks for all help,

Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-10-22 09:34:32

by Andrew Walrond

Subject: Re: lost memory on a 4GB amd64

On Friday 22 Oct 2004 09:59, Sergei Haller wrote:
>
> If I disable NUMA in 2.6.8.1, it works stable!
>
> The same with 2.6.9, which is out for a few days: if NUMA is disabled,
> everything's fine, if NUMA is enabled, the kernel crashes (as described in
> previous mails)
>
> What is this NUMA by the way? Is it OK to live without?
>

Non-uniform memory architecture. Basically, some of the RAM is attached to
cpu1 and some to cpu2. They can still access each other's RAM using
HyperTransport, but can access their own RAM faster. I'm no expert, but I
guess that the NUMA code tries to achieve persistent process CPU affinity and
keep all of a process's memory in the relevant CPU's RAM bank. Or something :)

Your board isn't NUMA, in the sense that all the RAM is attached to one CPU,
but I don't think that it should break when NUMA is enabled.

I've cc'ed Andi Kleen (x86_64 supremo) who might have some insights, but I'm
guessing he'll say "Bios problem - tough luck". I might be wrong ;)

Andrew Walrond

2004-10-22 18:25:14

by Andi Kleen

Subject: Re: lost memory on a 4GB amd64

> I've cc'ed Andi Kleen (x86_64 supremo) who might have some insights, but I'm
> guessing he'll say "Bios problem - tough luck". I might be wrong ;)

Is there a full boot log of the system?
-Andi

2004-10-23 10:03:04

by Andreas Klein

Subject: Re: lost memory on a 4GB amd64


Hello,

I have the same problem with 45 Tyan S2885 boards, but I have one working
sample. The one working machine has the following configuration:

- Tyan S2885 pre-production model with a 1.01 pre-release BIOS
6 GB RAM (4x512 MB, 4x1 GB)
The machine is running SuSE Linux Enterprise Server 8 (32-bit).
We have used this machine as our primary mail server without problems for
over a year.

- Now we have ordered 45 Tyan S2885 and 4 S2875S boards.
Neither board type runs stably with more than 2 GB of usable RAM.
The full 4 GB is only recognized if the MTRR setting is set to Continuous and
the Adjust Memory setting is set to Auto.
If the BIOS is configured this way and two 1 GB RAM modules are installed
for each CPU on the S2885, the machine will not even load and unpack a
SLES 9 kernel. Memtest sees 0-2GB as usable and 4-6GB as unusable (it
complains about every memory address).
If all four modules are installed for CPU0, then memtest seems to work
without problems (0-2GB, 4-6GB), but SLES 9 will crash during boot-up.
If all four modules are installed for CPU1, then memtest seems to work
without problems too. SLES 9 will run for a few minutes before it crashes.
I will try to install SLES 8 (32-bit) on the new boxes to see if it runs
stably. If yes, there is something broken in the 2.6 kernels for amd64; if
not, the pre-production BIOS is better than the final ones.

Bye,

On Fri, 22 Oct 2004, Andi Kleen wrote:

> > I've cc'ed Andi Kleen (x86_64 supremo) who might have some insights, but I'm
> > guessing he'll say "Bios problem - tough luck". I might be wrong ;)
>
> Is there a full boot log of the system?
> -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


-- Andreas Klein
[email protected]
root / webmaster @cip.physik.uni-wuerzburg.de
root / webmaster @http://www.physik.uni-wuerzburg.de
_____________________________________
| |
| Long live our gracious AMIGA! |
|___________________________________|

2004-10-23 10:34:17

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Fri, 22 Oct 2004, Andi Kleen (AK) wrote:

AK> > I've cc'ed Andi Kleen (x86_64 supremo) who might have some insights, but I'm
AK> > guessing he'll say "Bios problem - tough luck". I might be wrong ;)
AK>
AK> Is there a full boot log of the system?

yes. attached two files:

dmesg-2.6.9-smp-NUMA (crashing one)
dmesg-2.6.9-smp-noNUMA (working one)


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


Attachments:
dmesg-2.6.9-smp-NUMA (14.18 kB)
dmesg-2.6.9-smp-noNUMA (14.22 kB)

2004-10-23 16:49:24

by Andi Kleen

Subject: Re: lost memory on a 4GB amd64

On Sat, Oct 23, 2004 at 12:26:38PM +0200, Sergei Haller wrote:
> On Fri, 22 Oct 2004, Andi Kleen (AK) wrote:
>
> AK> > I've cc'ed Andi Kleen (x86_64 supremo) who might have some insights, but I'm
> AK> > guessing he'll say "Bios problem - tough luck". I might be wrong ;)
> AK>
> AK> Is there a full boot log of the system?
>
> yes. attached two files:
>
> dmesg-2.6.9-smp-NUMA (crashing one)
> dmesg-2.6.9-smp-noNUMA (working one)

I bet that if you fill all memory on the non-NUMA setup
it will crash too.

E.g. run something like this:

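#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Size of all currently free physical memory, minus a 10-page margin. */
    unsigned long len = (sysconf(_SC_AVPHYS_PAGES) - 10) * getpagesize();
    char *mem = malloc(len);

    if (mem == NULL) {
        perror("malloc");
        return 1;
    }
    /* Touch every byte over and over so every page stays in use. */
    for (;;) {
        memset(mem, 0xff, len);
        printf(".");
    }
}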


-Andi

2004-10-23 16:44:02

by Andi Kleen

Subject: Re: lost memory on a 4GB amd64

[cc'ed to [email protected] for future reference. If you find
this message in google and you have the same problem, talk
to your BIOS vendor, not to your Linux vendor]

On Sat, Oct 23, 2004 at 12:02:10PM +0200, Andreas Klein wrote:
> - Tyan S2885 pre-production model with a 1.01 pre-release bios
> 6 GB RAM (4x512 MB, 4x1 GB)
> The machine is running SuSE Linux Enterprise Server 8 (32bit).
> We use this machine as our primary mail-server without problems for over a
> year.
>
> - Now we have ordered 45 Tyan S2885 and 4 S2875S boards.
> Neither board type runs stably with more than 2 GB of usable RAM.
> 4GB will only be recognized if the MTRR setting is set to Continuous and
> the Adjust Memory setting is set to Auto.
> If the bios is configured this way and two 1gb ram modules are installed
> for each CPU on the 2885, the machine will not even load and unpack a
> SLES 9 kernel. Memtest sees 0-2GB mem usable and 4-6GB unusable (complains
> about each memory address).
> If all four modules are installed for CPU0, then memtest seems to work
> without problems (0-2GB, 4-6GB), but SLES9 will crash during boot-up.
> If all four modules are installed for CPU1, then memtest seems to work
> without problems too. SLES 9 will run a few minutes before a crash.
> I will try to install SLES 8 (32bit) on the new boxes to see if it runs
> stable. If yes, there is something broken in the 2.6 kernels for amd64, if
> not, the pre-production BIOS is better than the final ones.

It all sounds very much like a BIOS problem. I doubt 2.4 will
run stably on this setup: if memtest86 doesn't like the memory, Linux
won't like it either. All I can recommend is to talk to Tyan or
live with the lost memory.

-Andi

2004-10-24 09:54:48

by Sergei Haller

Subject: Re: lost memory on a 4GB amd64

On Sat, 23 Oct 2004, Andi Kleen (AK) wrote:

AK> > dmesg-2.6.9-smp-noNUMA (working one)
AK>
AK> I bet that if you fill all memory on the non NUMA setup
AK> it will crash too.
AK>
AK> e.g. run something like this
AK>
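AK> #include <stdlib.h>
AK> #include <string.h>
AK> #include <unistd.h>
AK> #include <stdio.h>
AK>
AK> int main(void)
AK> {
AK>     /* Size of all currently free physical memory, minus a 10-page margin. */
AK>     unsigned long len = (sysconf(_SC_AVPHYS_PAGES) - 10) * getpagesize();
AK>     char *mem = malloc(len);
AK>
AK>     if (mem == NULL) {
AK>         perror("malloc");
AK>         return 1;
AK>     }
AK>     /* Touch every byte over and over so every page stays in use. */
AK>     for (;;) {
AK>         memset(mem, 0xff, len);
AK>         printf(".");
AK>     }
AK> }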

What's the difference from the program I posted (here is a link to an archive:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109567610824746&w=4),

other than that yours fills the memory with 0xFF and mine with 0x00, and
mine does it only once and yours continuously?

BTW: I added an fflush(stdout) after the printf, and after two lines of
dots on a 150-column terminal I just stopped the program. This is with
2.6.9-smp-noNUMA.

On 2.6.9-smp-NUMA, running my program crashes the kernel immediately.


c ya
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-10-26 11:29:57

by Andreas Klein

Subject: Re: lost memory on a 4GB amd64


Hello,

You are right. I have now tried to install SLES 8 (32-bit), and the system
hangs when initialising CPU0. Our working pre-production sample was
installed from the same CD and with exactly the same setup.
So something is really broken in the final BIOS or in the final board
layout.
The only thing I haven't tried is flashing the pre-production BIOS onto the
final boards.

Bye,

On Sat, 23 Oct 2004, Andi Kleen wrote:

> [cc'ed to [email protected] for future reference. If you find
> this message in google and you have the same problem, talk
> to your BIOS vendor, not to your Linux vendor]
>
> On Sat, Oct 23, 2004 at 12:02:10PM +0200, Andreas Klein wrote:
> > - Tyan S2885 pre-production model with a 1.01 pre-release bios
> > 6 GB RAM (4x512 MB, 4x1 GB)
> > The machine is running SuSE Linux Enterprise Server 8 (32bit).
> > We use this machine as our primary mail-server without problems for over a
> > year.
> >
> > - Now we have ordered 45 Tyan S2885 and 4 S2875S boards.
> > Neither board type runs stably with more than 2 GB of usable RAM.
> > 4GB will only be recognized if the MTRR setting is set to Continuous and
> > the Adjust Memory setting is set to Auto.
> > If the bios is configured this way and two 1gb ram modules are installed
> > for each CPU on the 2885, the machine will not even load and unpack a
> > SLES 9 kernel. Memtest sees 0-2GB mem usable and 4-6GB unusable (complains
> > about each memory address).
> > If all four modules are installed for CPU0, then memtest seems to work
> > without problems (0-2GB, 4-6GB), but SLES9 will crash during boot-up.
> > If all four modules are installed for CPU1, then memtest seems to work
> > without problems too. SLES 9 will run a few minutes before a crash.
> > I will try to install SLES 8 (32bit) on the new boxes to see if it runs
> > stable. If yes, there is something broken in the 2.6 kernels for amd64, if
> > not, the pre-production BIOS is better than the final ones.
>
> It all sounds very much like a BIOS problem. I doubt 2.4 will
> run stable on this setup - if memtest86 doesn't like the memory, Linux
> won't like it either. All I can recommend is to talk to Tyan or
> live with the lost memory.
>
> -Andi


-- Andreas Klein
[email protected]
root / webmaster @cip.physik.uni-wuerzburg.de
root / webmaster @http://www.physik.uni-wuerzburg.de
_____________________________________
| |
| Long live our gracious AMIGA! |
|___________________________________|