2004-10-27 15:46:40

by Sergei Haller

Subject: Re: solution Re: lost memory on a 4GB amd64

On Wed, 27 Oct 2004, Andreas Klein (AK) wrote:

AK> Hello,
AK>
AK> the problem has been verified by Tyan. It is definitely a hardware issue.
AK> The Tyan and AMD engineers are developing a solution (BIOS) for the
AK> problem.
AK> They will fix the problem for the S2885, as well as for the S2875 boards.

Are you sure this is the same problem that I have? You found
problems with memtest86:

> Memtest sees 0-2GB mem usable and 4-6GB unusable (complains
> about each memory address).

I didn't:

> memtest86 is happy with the memory.

The next difference:
You have the S2885 (thunder K8W) and S2875S (tiger K8W single processor)
boards and I have a S2875 (tiger K8W double processor)


I summarize (again) my problems:

Independently of the memory settings in the BIOS:
- non-SMP Kernel is stable
- memtest86 does not report any errors

If the memory (4GB) is set up in one block (0-4GB) in the BIOS, then
- SMP Kernel is stable

If the memory (4GB) is set up in two blocks (e.g. 0-3GB, 4-5GB) in the
BIOS, then
- SMP Kernel is stable _if_and_only_if_ NUMA is _disabled_.
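
(Editorial sketch, not part of the original message.) The two BIOS layouts
described above can be compared with a little arithmetic. The helper below is
hypothetical; the block boundaries are the example values from the text
(0-4GB, and 0-3GB plus 4-5GB). It only shows that both layouts expose the same
4GB of RAM and that the split layout leaves a 1GB hole below 4GB.

```python
GIB = 1 << 30  # bytes per GiB

def summarize(blocks):
    """Return (total_bytes, holes) for a list of (start, end) byte ranges.

    `end` is exclusive. `holes` lists the gaps between consecutive blocks.
    This is an illustrative helper, not a kernel or BIOS interface.
    """
    blocks = sorted(blocks)
    total = sum(end - start for start, end in blocks)
    holes = [
        (prev_end, start)
        for (_, prev_end), (start, _) in zip(blocks, blocks[1:])
        if start > prev_end
    ]
    return total, holes

# Layouts taken from the message above (values in GB are from the text).
one_block = [(0, 4 * GIB)]
two_blocks = [(0, 3 * GIB), (4 * GIB, 5 * GIB)]

if __name__ == "__main__":
    for name, layout in (("one block", one_block), ("two blocks", two_blocks)):
        total, holes = summarize(layout)
        print(name, "total GiB:", total // GIB,
              "holes (GiB):", [(a // GIB, b // GIB) for a, b in holes])
```

Both layouts add up to the same amount of memory; the only difference the
sketch captures is the gap between 3GB and 4GB in the split layout.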


BTW.: I won't be able to flash a new BIOS to our machine, since it is in
production use and runs _rock_stable_ with _full_memory_ after we
_disabled_ NUMA support in the kernel. (see the two small programs posted
by me and by Andi Kleen)

What I COULD do is run some tests if needed (e.g. to check if the
Board/BIOS is lying about its capabilities).
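
(Editorial sketch, not part of the original message.) One simple check of this
kind is to count how many NUMA nodes the running kernel exposes. The sketch
below assumes a kernel that publishes per-node directories named nodeN under
/sys/devices/system/node; that path and the function name are assumptions for
illustration, and a kernel built without NUMA support may not provide the
directory at all.

```python
import os
import re

def count_numa_nodes(base="/sys/devices/system/node"):
    """Count entries named node<N> under `base`; return 0 if `base` is missing.

    The default path is an assumption about where the kernel exposes NUMA
    nodes through sysfs.
    """
    try:
        entries = os.listdir(base)
    except OSError:
        return 0
    return sum(1 for name in entries if re.fullmatch(r"node\d+", name))

if __name__ == "__main__":
    print("NUMA nodes visible to the kernel:", count_numa_nodes())
```

A result of 0 or 1 on a two-socket board would suggest the kernel is treating
the machine as non-NUMA, whatever the BIOS setting claims.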


Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


2004-10-27 16:07:03

by Andreas Klein

Subject: Re: solution Re: lost memory on a 4GB amd64


Hello,

On Wed, 27 Oct 2004, Sergei Haller wrote:

> On Wed, 27 Oct 2004, Andreas Klein (AK) wrote:
>
> Are you sure this is the same problem that I have? You found
> problems with memtest86:
>
> > Memtest sees 0-2GB mem usable and 4-6GB unusable (complains
> > about each memory address).
>
> I didn't:
>
> > memtest86 is happy with the memory.

Memtest is happy with my memory too, if all 4 modules are installed in the
slots belonging to CPU1. If I install 2 modules for each CPU, memtest86 is
not happy anymore.

>
> The next difference:
> You have the S2885 (thunder K8W) and S2875S (tiger K8W single processor)
> boards and I have a S2875 (tiger K8W double processor)

The boards are nearly identical (on-board LAN is different, and your
memory slots are connected to one CPU).
If all memory modules are installed for one CPU, I have your problems.
Additionally, there are some other problems that only occur when the
modules are installed one pair for each CPU.

Since I have a pre-production board and BIOS which is running solid as a
rock, regardless of whether an SMP or non-SMP kernel is installed, I hope
that they will fix all problems.

> I summarize (again) my problems:
>
> Independently of the memory settings in the BIOS:
> - non-SMP Kernel is stable
> - memtest86 does not report any errors
>
> If the memory (4GB) is set up in one block (0-4GB) in the BIOS, then
> - SMP Kernel is stable
>
> If the memory (4GB) is set up in two blocks (e.g. 0-3GB, 4-5GB) in the
> BIOS, then
> - SMP Kernel is stable _if_and_only_if_ NUMA is _disabled_.
>
>
> BTW.: I won't be able to flash a new BIOS to our machine, since it is in
> production use and runs _rock_stable_ with _full_memory_ after we
> _disabled_ NUMA support in the kernel. (see the two small programs posted
> by me and by Andi Kleen)
>
> What I COULD do is run some tests if needed (e.g. to check if the
> Board/BIOS is lying about its capabilities).
>
>
> Sergei
> --
> -------------------------------------------------------------------- -?)
> eMail: [email protected] /\\
> -------------------------------------------------------------------- _\_V
> Be careful of reading health books, you might die of a misprint.
> -- Mark Twain


-- Andreas Klein
[email protected]
root / webmaster @cip.physik.uni-wuerzburg.de
root / webmaster @http://www.physik.uni-wuerzburg.de
_____________________________________
| |
| Long live our gracious AMIGA! |
|___________________________________|

2004-10-27 16:29:15

by Sergei Haller

Subject: Re: solution Re: lost memory on a 4GB amd64

On Wed, 27 Oct 2004, Andreas Klein (AK) wrote:

AK> > The next difference:
AK> > You have the S2885 (thunder K8W) and S2875S (tiger K8W single processor)
AK> > boards and I have a S2875 (tiger K8W double processor)
AK>
AK> The boards are nearly identical (on-board LAN is different, and your
AK> memory slots are connected to one CPU).

I don't think that LAN is of any importance for our problems.

But I was told (see previous messages) that the fact that all memory slots
are connected to one CPU makes the S2875 a non-NUMA board.

AK> If all memory modules are installed for one CPU, I have your problems.

I see.

AK> Additionally, there are some other problems that only occur when the
AK> modules are installed one pair for each CPU.

IIRC [I might be wrong], Andrew is running this board (S2885) with exactly
this memory configuration without problems.



c ya
Sergei
--
-------------------------------------------------------------------- -?)
eMail: [email protected] /\\
-------------------------------------------------------------------- _\_V
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2004-10-27 16:54:31

by Richard B. Johnson

Subject: Re: solution Re: lost memory on a 4GB amd64

On Wed, 27 Oct 2004, Andreas Klein wrote:

>
> Hello,
>
> On Wed, 27 Oct 2004, Sergei Haller wrote:
>
>> On Wed, 27 Oct 2004, Andreas Klein (AK) wrote:
>>
>> Are you sure this is the same problem that I have? You found
>> problems with memtest86:
>>
>>> Memtest sees 0-2GB mem usable and 4-6GB unusable (complains
>>> about each memory address).
>>
>> I didn't:
>>
>>> memtest86 is happy with the memory.
>
> Memtest is happy with my memory too, if all 4 modules are installed in the
> slots belonging to CPU1. If I install 2 modules for each CPU, memtest86 is
> not happy anymore.
>

Could you please explain how memory is connected to only one
CPU? I don't think this is possible.

Is this board for some new multiple-CPU specification? It can't
work for SMP (symmetrical multiprocessor specification) unless
both CPUs can access the same RAM.

[SNIPPED...]


Cheers,
Dick Johnson
Penguin : Linux version 2.6.8 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached and reviewed by John Ashcroft.
98.36% of all statistics are fiction.

2004-10-27 17:56:29

by Andrew Walrond

Subject: Re: solution Re: lost memory on a 4GB amd64

On Wednesday 27 Oct 2004 17:44, linux-os wrote:
>
> Could you please explain how memory is connected to only one
> CPU? I don't think this is possible.

The DIMMs are connected to cpu1. cpu2 accesses the RAM over the
HyperTransport bus.

See page 9 of ftp://ftp.tyan.com/manuals/m_s2875_102.pdf


>
> Is this board for some new multiple-CPU specification? It can't
> work for SMP (symmetrical multiprocessor specification) unless
> both CPUs can access the same RAM.

They can. It's a sort of castrated NUMA board :)

Andrew Walrond

2004-10-28 13:33:26

by Andreas Klein

Subject: Re: solution Re: lost memory on a 4GB amd64


Hello,


On Wed, 27 Oct 2004, Sergei Haller wrote:

> On Wed, 27 Oct 2004, Andreas Klein (AK) wrote:
>
> AK> Additionally, there are some other problems that only occur when the
> AK> modules are installed one pair for each CPU.
>
> IIRC [I might be wrong], Andrew is running this board (S2885) with exactly
> this memory configuration without problems.

Yes, I have read that. I also have one board running without problems and
45 boards which have the problem. The working one was bought over a
year ago. The 45 boards which are not working were bought a few weeks
ago.
So I think either they have made a bad change to the board design since the
prototype board, which is running perfectly, or there is a bug in the CPUs
(maybe in the memory controller or HyperTransport). The working board has
CPUs with stepping 1 installed; the failing ones have CPU stepping 10.

Bye,

-- Andreas Klein
[email protected]
root / webmaster @cip.physik.uni-wuerzburg.de
root / webmaster @http://www.physik.uni-wuerzburg.de
_____________________________________
| |
| Long live our gracious AMIGA! |
|___________________________________|