Hi,
I have a question concerning the kernel's memory detection at boot
time, and a proper way of clearing the recognized memory before
proceeding further.
Background:
I have been toying with the ECC (error checking and correction)
module for Linux for some time.
ECC module URL:
http://www.anime.net/~goemon/linux-ecc/
I have found that the particular motherboard (and memory sticks)
that I use at home tends to generate bogus memory-problem
warning messages when I use the ecc module.
The motherboard is a Gigabyte 7XIE4, which uses the AMD751.
(Yes, AMD now provides the AMD76x series chipsets for
newer CPUs.)
I say "bogus" because I have tested the hardware
many times using memtest86 and found that it doesn't
detect any memory errors even
if I let the test run overnight.
Also, from what I found, it seems this bogus message
may not appear if I somehow use/touch/write to
memory under win98 by running memory-hungry application
such as mozilla, internet explorer, etc. and THEN
reboot the PC into linux.
After following the discussion on the ECC mailing list,
I now realize that proper ECC support
would possibly require the BIOS to write 0's, or
for that matter, any value, to
ALL the known memory locations before
the main kernel starts and we insert the ecc module.
My use of the said PC under win98, possibly writing
to all the available memory area using mozilla, etc.,
may explain the disappearance of the bogus ecc messages under
linux afterward.
Now, as many are aware, not many BIOSes on
low-end motherboards seem to write to
all the memory locations.
[ I now know that
the BIOSes on Tyan motherboards for dual AMD CPU operation
do this. But the writing seems to be very slow
for some reason. There is a replacement BIOS project
going on. Please see the following URL, but I digress.
http://www.acl.lanl.gov/linuxbios/
]
Please understand this writing is NOT the simple
writing of a byte or a few bytes per 1K/MB boundaries
to detect the presence of memory chips.
We need the clearing or writing of values to ALL
the bytes in the memory.
Instead of relying on the BIOS to do the clearing,
we can possibly do a similar writing of 0's
at kernel boot time.
So my question now is whether we can
clear (or write a known value) to
known memory locations before the main portion of
kernel starts up.
Let us focus the question to x86 linux kernel startup code
for now.
So the question boils down to the following, I think:
In the following file, the kernel
recognizes the BIOS-supplied, and/or user-supplied
memory regions and sets up internal data structure.
/usr/src/linux/arch/i386/kernel/setup.c
>static void __init add_memory_region(unsigned long long start,
This routine takes the arguments of the form
start, length, type
and builds the list of triplets.
>static void __init parse_mem_cmdline (char ** cmdline_p)
parse_mem_cmdline seems to handle the user-supplied
memory info.
There are other routines in the file
to handle the memory-related functions, especially
to read BIOS-supplied memory info.
Specific questions:
So my guess is that
probably at the finish of one of the memory routines
in setup.c, I can add a loop to
clear (write zero's) to all the known locations.
But I am not sure about a few things.
- Firstly, what routine would be the best to
add such clearing code ?
(That is, what function is the last one called among
the functions in setup.c during boot time.)
- Secondly, I don't want to overwrite the
kernel startup code, or other data/code entities
already present in memory.
Where (or what region, or what type in the
triplet's third argument to add_memory_region) should
I avoid clearing ?
My first guess would be the lower 640K region where
this setup.c code resides (correct?), but
are there others?
- Thirdly, can I use a simple direct addressing to
access the memory at this stage of booting?
That is, can I simply do a loop like the following.
I am using a char pointer since I am not even sure
if I can write to the memory region(s) using
long. (Oh well, I probably can on the x86 architecture,
since even if `start' is an odd value, x86 won't
see an alignment error.)
	int i;
	unsigned long long l;

	for (i = 0; i < e820.nr_map; i++) {
		char *a;

		/* necessary checking for regions that I should not clear */
		if (dontwrite_foobar(i))
			continue;

		/* now clear; the address is used directly, which is the open question */
		a = (char *)(unsigned long)e820.map[i].addr;
		for (l = 0; l < e820.map[i].size; l++)
			*a++ = 0;
	}
Obvious optimizations can wait until later.
- Is there a special macro that I can use
to access high memory regions if simple direct
addressing won't work at this stage?
Or is there a place I can possibly add
such clearing code beside setup.c?
But then again, the clearing ought to be done
as early as possible for ECC purposes.
(I suspect that I may not be able to
access memory beyond certain upper limit
at this stage of booting. I wonder in what mode
the CPU is. My plan of modifying setup.c
requires more work if I need to wait until CPU
changes mode to enable full 32-bit linear addressing, etc..)
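One way to try out the clearing loop asked about above, without touching
a real kernel, is to run it against a simulated memory array in user space.
The following is only a sketch under stated assumptions: struct sim_region,
SIM_E820_RAM, SIM_E820_RESERVED and sim_clear_ram() are hypothetical
stand-ins invented for illustration, not the kernel's e820 structures, and
the protected range [keep_start, keep_end) merely stands in for whatever
area holds the kernel image, its stack and the boot parameters. It shows one
possible shape of the "what should I avoid clearing" logic: skip every
non-RAM region and carve the protected range out of the RAM regions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's e820 map; not the real structures. */
#define SIM_E820_RAM      1
#define SIM_E820_RESERVED 2

struct sim_region {
	unsigned long start;	/* offset into the simulated memory */
	unsigned long size;	/* length in bytes */
	int type;		/* SIM_E820_RAM or SIM_E820_RESERVED */
};

/*
 * Zero every byte of every RAM-type region, except the bytes that fall
 * inside the protected range [keep_start, keep_end).
 * Returns the number of bytes cleared.
 */
unsigned long sim_clear_ram(unsigned char *mem,
			    const struct sim_region *map, int nr_map,
			    unsigned long keep_start, unsigned long keep_end)
{
	unsigned long cleared = 0;
	int i;

	for (i = 0; i < nr_map; i++) {
		unsigned long lo = map[i].start;
		unsigned long hi = map[i].start + map[i].size;

		if (map[i].type != SIM_E820_RAM)
			continue;	/* never touch non-RAM areas */

		/* Part of the region below the protected range. */
		if (lo < keep_start) {
			unsigned long end = hi < keep_start ? hi : keep_start;

			memset(mem + lo, 0, end - lo);
			cleared += end - lo;
		}
		/* Part of the region above the protected range. */
		if (hi > keep_end) {
			unsigned long begin = lo > keep_end ? lo : keep_end;

			memset(mem + begin, 0, hi - begin);
			cleared += hi - begin;
		}
	}
	return cleared;
}
```

For a 64-byte simulated memory with RAM at offsets 0..31 and a reserved
region at 32..63, calling sim_clear_ram(mem, map, 2, 8, 16) clears bytes
0..7 and 16..31 and leaves everything else alone. Whether the real kernel
can reach all of physical memory this early is exactly the open question
above; the sketch does not answer it.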
TIA.
On Sat, 2002-07-13 at 11:08, Ishikawa wrote:
> I have found that the particular motherboard (and memory sticks)
> that I use at home tends to generate bogus memory problem
> warning messages when I use ecc module.
> Motherboard is Gigabyte 7XIE4 that uses AMD751.
> (Yes, AMD has now provides AMD76x series chipset for
> newer CPUs.)
>
> I say "bogus" because I have tested the hardware
> many times using memtest86 and found that it doesn't
> detect any memory errors even
memtest86 isnt (except on the very very latest versions) aware of ECC.
It sees the memory after the ECC rescues minor errors so if the RAM has
errors but ECC just about saves you it will show up clean
>> I have found that the particular motherboard (and memory sticks)
>> that I use at home tends to generate bogus memory problem
>> warning messages when I use ecc module.
>> Motherboard is Gigabyte 7XIE4 that uses AMD751.
>> (Yes, AMD has now provides AMD76x series chipset for
>> newer CPUs.)
>>
>> I say "bogus" because I have tested the hardware
>> many times using memtest86 and found that it doesn't
>> detect any memory errors even
>
>memtest86 isnt (except on the very very latest versions) aware of ECC.
>It sees the memory after the ECC rescues minor errors so if the RAM has
>
>errors but ECC just about saves you it will show up clean
>
Thank you for the info on the latest memtest86.
I will check it out.
It might as well be the case that memtest86 (previous versions)
was not quite ECC-aware.
I was not clear on the type of error messages
I received from ecc.o module.
I got both SBE (single bit error detected and corrected)
and MBE (multiple bits error detected,
which presumably was not correctable!).
My point was that there is something amiss
if memtest86 doesn't report errors because of the
underlying (hardware) ECC fix,
but the ecc.o module does.
In any case, I will run memtest86 (the latest version).
Thank you again for the info.
In the meantime, after posting the previous message
I read Documentation/i386/boot.txt, etc.
It seems that by the time the code in setup.c is called,
the kernel is already in 32-bit linear addressing mode.
The table I mentioned as BIOS-supplied is not
quite directly supplied by the BIOS.
The table is supplied by the routines that collect
BIOS data, and those routines are in setup.S.
The collected information is placed in a memory area
which is passed from the early boot code to the
following boot pass: it seems that the triplet table
I mentioned is kept in the so-called zero-page during
the whole boot process.
What I would probably need to do is hack boot/setup.S.
If that risks making setup.S too large (and
I think it does, considering what setup.c does to clean up
bogus BIOS-supplied memory info), I would need to
tinker with one of the two hooks mentioned in boot.txt
by letting the loader (in my case loadlin.exe under DOS)
prepare a hook to do the kind of writing I have in mind.
But this is not going to be the weekend job that I hoped it would
be. I would need to tinker with two different programs: the loader and
the kernel startup code.
But any comments on my original post and on other avenues
to achieve a similar result are welcome.
(Maybe the high reliability computing people
have a better idea short of replacement BIOS or
even have some prototype code working on this?
Hmm. Come to think of it, maybe I can take
part of the free BIOS and see whether it would
make setup.S too large, etc. But thinking of
various proprietary chipsets, I would hope that
I can insert a short C routine somewhere in the
boot chain, preferably on the kernel side, to
accomplish my objective.)
Alan Cox wrote:
> On Sat, 2002-07-13 at 11:08, Ishikawa wrote:
> > I have found that the particular motherboard (and memory sticks)
> > that I use at home tends to generate bogus memory problem
> > warning messages when I use ecc module.
> > Motherboard is Gigabyte 7XIE4 that uses AMD751.
> > (Yes, AMD has now provides AMD76x series chipset for
> > newer CPUs.)
> >
> > I say "bogus" because I have tested the hardware
> > many times using memtest86 and found that it doesn't
> > detect any memory errors even
>
> memtest86 isnt (except on the very very latest versions) aware of ECC.
> It sees the memory after the ECC rescues minor errors so if the RAM has
> errors but ECC just about saves you it will show up clean
A friend of mine had "subtle memory problems". He tested the memory
using memtest86. No errors running for hours.
Then he did:
bank1 = malloc (bignum);
bank2 = malloc (bignum);
srand (seed);
bash (bank1);
srand (seed);
bash (bank2);
if (memcmp (bank1, bank2, bignum) != 0)
Memory error!
and quickly found a memory error on the first pass.
It turns out you can memtest all you want, but different access
patterns may cause different errors. In some cases memtest86 doesn't
detect problems, while other stuff does.....
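A runnable user-space version of the test outlined above might look like
the following sketch. The helper names bash() and compare_banks(), the
buffer length parameter and the use of rand() as the fill pattern are
assumptions made for illustration; the description above only gives the
outline. Since malloc() hands out virtual memory, which physical pages get
exercised is up to the kernel, so this complements rather than replaces a
systematic tester.

```c
#include <stdlib.h>
#include <string.h>

/* Fill a buffer with the pseudo-random byte sequence selected by seed. */
static void bash(unsigned char *bank, size_t len, unsigned int seed)
{
	size_t i;

	srand(seed);
	for (i = 0; i < len; i++)
		bank[i] = (unsigned char)(rand() & 0xff);
}

/*
 * Fill two separately allocated banks with the same sequence and compare.
 * Returns 0 if both banks hold identical data, 1 if they differ
 * (a possible memory error), or -1 if allocation failed.
 */
int compare_banks(size_t len, unsigned int seed)
{
	unsigned char *bank1 = malloc(len);
	unsigned char *bank2 = malloc(len);
	int result = -1;

	if (bank1 && bank2) {
		bash(bank1, len, seed);
		bash(bank2, len, seed);
		result = memcmp(bank1, bank2, len) != 0;
	}
	free(bank1);
	free(bank2);
	return result;
}
```

Calling compare_banks() repeatedly with a large length and varying seeds
approximates the "first pass found it" experiment; on healthy memory every
call should return 0.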
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
Thank you for the feedback.
While I would not rule out the possibility
of an actual memory error occurring,
I think it is highly unlikely. I will explain the reasoning below, but
before proceeding, here is a question to the list in general.
Q: Does anyone know for sure
- what memory area is used by the kernel code and
its data structures, including the stack, during
the calls to the routines in arch/i386/kernel/setup.c?
[If there is a formula to calculate the upper limit boundary using
a variable or two, it is better :-) ]
I am trying to see if I can simply add
a memory clearing routine to arch/i386/kernel/setup.c
instead of hacking arch/i386/boot/setup.S.
Knowing the used area makes it possible to
clear the known available memory space with, say, 0's.
This only permits me to clear the area not used by
the kernel code and stuff at the time of call, but this is close enough
and much better than the current state of affairs
for my ECC experimentation...
Now back to why I think the memory problem warning from ecc.o on my PC
is bogus.
I am speaking from my experience of having had
a real memory error in the past, when I could notice the problem
rather quickly from erratic system behavior.
If we have a memory problem, sooner or later we will see that
- the system may lock up,
- kernel recompilation may fail,
- recompilation of large software package such as xfree86 may fail, or
- running many copies of bonnie and the similar test programs may
fail.
But nothing like this has happened on my PC for some time.
I did experience the above-mentioned problems when
there WAS a memory problem with a memory board. After figuring out
that there was a hardware problem I ran memtest86.
Memtest86 detected errors after only a few minutes of
running and I was impressed.
Now, your friend's use of rand() to test memory is a good idea,
although I do think the systematic approach of memtest86 is necessary when
I consider the "typical" error patterns in memory chips. (I know
companies who make memory chips have special in-house memory test
programs. I wonder if any of these programs will be made publicly
available sometime in the future. Caveat: one company's test program
may not be effective on another's memory chips. Internal memory cell
layout and wiring patterns have much to do with the "typical" error
patterns of one vendor's make.) In any case, it might be a good
idea to add such random patterns as the last entry in the test patterns
of memtest86, although the chance of finding errors not found by the
existing patterns may be small indeed.
Testing memory problems is a fairly complex
problem and I want to see a kernel support for ECC
module and that is why I have been toying with
ecc.o module and pondering on the memory clearing issues.
Back to my PC's ECC handling.
Now, after running memtest86 version 3.0, downloaded from
http://www.memtest86.com
which indeed has ECC memory controller support now, for
a few hours, I found that it didn't find any errors.
Considering that the amd751 controller is supported
by both memtest86 v3 and the linux ecc support project's ecc.c
module (although I use a locally modified version of
ecc.c), it is a little strange that one reports the error
and the other doesn't.
I am not excluding that the real world memory usage
patterns may cause an error which memtest86
doesn't discover using its test patterns,
but from the experience
I explained above, I am looking for other causes currently.
- Maybe my local ecc.c hack is incorrect.
But it is unlikely.
(I have checked this out. I compared what ecc.c and controller.c of
memtest86 do.
I found there is a slight difference in handling the error output from
the AMD751.
This is about locating the bank # of the memory chip where the error
occurs, but otherwise, the code seems to be logically identical.
I have downloaded the AMD751 pdf file again to make sure
the bank # calculation is OK.
I then found the difference is cosmetic. My ecc.c reports
only the bank # while memtest86 tries to infer the starting memory
address of that bank. Fair enough. So the ecc.c code is OK. Both
memtest86 controller.c and ecc.c follow the AMD documentation. So
we are OK unless there is a typo in the AMD doc.)
- Opportunistic memory access of AMD CPU.
The opportunistic memory access of AMD CPU might
cause the CPU to read a byte or two which the
CPU is not supposed to read. (Maybe
it is reading a memory mapped I/O area or something, say, unmapped
memory area??? by this opportunistic memory read mechanism?)
After following the discussion on ECC mailing list where
it was suggested that boot command line option
mem=nopentium would disable the
opportunistic read of AMD CPU and might solve the bogus ecc.o warning
message, I have added mem=nopentium on the boot command line, i.e.,
mem=nopentium devfs=mount drm=debug root=/dev/sda6 ro \
scsihosts=sym53c8xx:tmscsim BOOT_IMAGE=lin2418.img
But the bogus message still appears (as if it depended on
the phase of the moon. After I boot into win98 first and
use memory using mozilla, and then reboot into linux, the
warning disappears. Some of you might be saying, "transient error!".
Well, if I boot under linux and see the warning message and
run mozilla and others, and THEN reboot into linux again,
it is likely that I see warning messages again. So I think
there is linux-specific thingy about this warning message.)
Can it be that mem=nopentium support is not quite working on
linux kernel 2.4.18?
I am not sure what "mem=nopentium" does, though.
A quick search through the linux 2.4.18 source tree reveals
references to the C macro X86_FEATURE_PSE,
which mem=nopentium clears in the CPU capability settings.
/usr/src/linux/include/asm-i386/processor.h:#define cpu_has_pse (test_bit(X86_FEATURE_PSE, boot_cpu_data.x86_capability))
/usr/src/linux/include/asm-i386/cpufeature.h:#define X86_FEATURE_PSE (0*32+ 3) /* Page Size Extensions */
/usr/src/linux/include/asm-i386/cpufeature.h:#define X86_FEATURE_PSE36 (0*32+17) /* 36-bit PSEs */
/usr/src/linux/arch/i386/kernel/setup.c: clear_bit(X86_FEATURE_PSE, &boot_cpu_data.x86_capability);
I wonder why there are two macros, X86_FEATURE_PSE, and
X86_FEATURE_PSE36, but I digress.
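The numbers in those definitions follow a word*32+bit encoding: the
feature number selects a 32-bit capability word and a bit within it. PSE
and PSE36 are two separate CPUID feature flags (bits 3 and 17 of the first
word), which is why there are two macros. The following user-space sketch
shows the encoding; feature_test() and feature_clear() are hypothetical
helpers written for illustration and only mimic what test_bit() and
clear_bit() do to the capability words, they are not the kernel's
implementations.

```c
#include <stdint.h>

/* Same encoding as the grep output above: word * 32 + bit. */
#define FEATURE_PSE    (0*32 + 3)	/* Page Size Extensions */
#define FEATURE_PSE36  (0*32 + 17)	/* 36-bit PSEs */

/* Return 1 if the feature bit is set in the capability words, else 0. */
int feature_test(const uint32_t *caps, int feature)
{
	return (caps[feature / 32] >> (feature % 32)) & 1u;
}

/* Clear one feature bit, in the way mem=nopentium clears PSE. */
void feature_clear(uint32_t *caps, int feature)
{
	caps[feature / 32] &= ~(UINT32_C(1) << (feature % 32));
}
```

Clearing FEATURE_PSE this way leaves FEATURE_PSE36 untouched, so code that
tests only cpu_has_pse sees the feature as absent while the other bit
stays as reported.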
OK, so the cpu_has_pse macro should be used somewhere.
A quick search revealed these references.
/usr/src/linux/include/asm-i386/processor.h:#define cpu_has_pse (test_bit(X86_FEATURE_PSE, boot_cpu_data.x86_capability))
/usr/src/linux/arch/i386/mm/init.c: if (cpu_has_pse) {
I read init.c but am not quite sure yet what it does as far as
the opportunistic read of the AMD CPU is concerned.
(Does memory page size have something to do with eliminating
opportunistic read of AMD CPU?)
arch/i386/kernel/setup.c contains numerous comments about
tricky business of these CPU registers. Is it possible somehow
AMD memory access mechanism is not handled quite right under 2.4.18
as far as opportunistic memory access is concerned?
(Or maybe I have a buggy Duron CPU. :-).)
(Sorry, searching through the kernel source files for "opportunistic"
ended in two references totally unrelated to the subject at hand.
/usr/src/linux/fs/coda/upcall.c: The statements below are part of the Coda opportunistic
/usr/src/linux/mm/swapfile.c: * work, but we opportunistically check whether
)
Anyway, if someone can answer my question above
>Q: Does anyone know for sure
>
> - what the memory area used by the kernel code and
> its data structure including stack during
> the call to routines to arch/i386/kernel/setup.c.
I would be able to test if clearing memory with 0's first
might help.
(Or I may hack memtest86 and loadlin to
run memory test first and then load linux kernel as was
suggested on ECC mailing list.)
cf.
Part of my .config file:
Hmm... Maybe I should build in the following options for
the correct AMD cpu mem=nopentium usage?
>CONFIG_X86_MSR=m
>CONFIG_X86_CPUID=m
??? Well, these only offer support for reading the
privileged registers via a device file, and
should not matter.
#
# Processor type and features
#
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
CONFIG_MK7=y
# CONFIG_MELAN is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_USE_3DNOW=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_SMP is not set
# CONFIG_X86_UP_APIC is not set
# CONFIG_X86_UP_IOAPIC is not set
Chiaki <[email protected]> writes:
> >> I have found that the particular motherboard (and memory sticks)
> >> that I use at home tends to generate bogus memory problem warning messages
> >> when I use ecc module.
> >> Motherboard is Gigabyte 7XIE4 that uses AMD751.
> >> (Yes, AMD has now provides AMD76x series chipset for
> >> newer CPUs.)
> >> I say "bogus" because I have tested the hardware
> >> many times using memtest86 and found that it doesn't
> >> detect any memory errors even
> >
> >memtest86 isnt (except on the very very latest versions) aware of ECC.
> >It sees the memory after the ECC rescues minor errors so if the RAM has
> >
> >errors but ECC just about saves you it will show up clean
> >
> Thank you for the info on the latest memtest86.
> I will check out.
>
> It might as well be the case that memtest86 (previous versions)
> was not quite ECC-aware.
>
> I was not clear on the type of error messages
> I received from ecc.o module.
> I got both SBE (single bit error detected and corrected)
> and MBE (multiple bits error detected,
> which presumably was not correctable!).
>
> My point was that there is something amiss
> if memtest86 doesn't report errors due to
> underlying (hardware) ECC fix,
> but the why ecc.o module does.
>
> In any case, I will run memtest86 (the latest version).
> Thank you again for the info.
Note. The hardware ECC support in memtest86 3.0
is limited, so I would check to make certain your chipset
is supported..
> But any comment to my original post and other avenue
> to achieve similar result welcome.
> (Maybe the high reliability computing people
> have a better idea short of replacement BIOS or
> even have some prototype code working on this?
> Hmm. Come to think of it, maybe I can take
> the part of free BIOS and see if it will not
> enlarge setup.S too large, etc.. But thinking of
> various proprietary chipsets, I would hope that
> I can insert a short C routine somewhere in the
> boot chain, preferably on the kernel side, to
> accomplish my objective.)
Your objective is misguided. Even with a bios that
is slightly buggy in initializing the ECC bits, what you want is
scrubbing. Then if the error disappears after 5 minutes of uptime
you can ignore it. And if it comes back you know you really have
something to worry about. At least for single bit errors this should
fix the whole problem with something that is useful for other
purposes.
Eric
"Eric W. Biederman" wrote:
> Note. The hardware ECC support in memtest86 3.0
> is limited, so I would check to make certain your chipset
> is supported..
>
Well, I read controller.c source file, and
found that amd751 is supported according to the info
in AMD751 documentation.
(This is the same documentation I used when I hacked
the early ecc.c source file.)
I ran memtest v3 on the motherboard, and
it didn't find any errors when I ran it overnight.
> Your objective is misguided. Even with a bios that
> is slightly buggy in initializing the ECC bits, what you want is
> scrubbing. Then if the error disappears after 5 minutes of uptime
> you can ignore it.
I see. Then the problem would boil down to whether
the AMD751 supports scrubbing at the hardware level.
It reports that it saw a correctable single bit error,
and I take it to mean that the chip itself
has fixed the single bit error.
Am I too optimistic to expect this?
The problem, though, is this.
According to the AMD751 documentation, we
can clear the memory error info in the chip register,
by writing to a certain location of the PCI space.
However, when an error gets reported on my motherboard,
it would NOT go away and I think there is something amiss
about this.
I am not sure if the AMD doc is correct about this now.
From controller.c of memtest86:
	/* Clear the error status */
	pci_conf_write(ctrl.bus, ctrl.dev, ctrl.fn, 0x58, 2, 0);
From locally hacked ecc.c:
	/*
	 * clear error flag bits that were set by writing 0 to them
	 * we hope the error was a fluke or something :)
	 */
	/* int value = eccstat & 0xFCFF; */
	/* pci_write_config_word(bridge, 0x58, value); */
	pci_write_config_byte(bridge, 0x58, 0x0);
	pci_write_config_byte(bridge, 0x59, 0x0);
BTW, I tried both the byte and word write to 0x58, but
it never seemed to clear the error status.
(It is possible that the error is real, but again
the error is always reported in the first bank even if
I rotate the memory sticks...)
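To tell "the write did not clear the flags" apart from "the flags were
cleared and then set again by the next read of the bad location", it may
help to read the register back right after the write and test only the
flag bits. The sketch below works on plain 16-bit values. The mask 0xFCFF
is taken from the commented-out line above, and treating bits 8 and 9 as
the error flags is an assumption inferred from that mask alone, not from
the AMD documentation. ecc_status_cleared() and ecc_status_stuck() are
hypothetical helper names.

```c
#include <stdint.h>

/* Mask from the commented-out line: keeps everything except bits 8 and 9. */
#define ECC_STATUS_KEEP_MASK 0xFCFFu

/* Value that a read-modify-write would store to clear only bits 8 and 9. */
uint16_t ecc_status_cleared(uint16_t eccstat)
{
	return (uint16_t)(eccstat & ECC_STATUS_KEEP_MASK);
}

/* Returns 1 if either of bits 8-9 is set in a value read back, else 0. */
int ecc_status_stuck(uint16_t eccstat)
{
	return (eccstat & (uint16_t)~ECC_STATUS_KEEP_MASK) != 0;
}
```

If a read-back taken immediately after the write shows the flags clear but
a later one shows them set, the register write works and the error is being
reported again; if the very first read-back still shows them set, the
clearing method itself is in doubt.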
> And if it comes back you know you really have
> something to worry about. At least for single bit errors this should
> fix the whole problem with something that is useful for other
> purposes.
>
> Eric
Thank you for your feedback. I noticed that you
contributed to linuxBIOS and memtest86.
Please keep the good work going!
Chiaki
PS: I am wondering when we can have a reasonably good
ECC support on desktop PC...
(A la Sun pizza boxes, that is.)
Ishikawa <[email protected]> writes:
> "Eric W. Biederman" wrote:
>
> > Note. The hardware ECC support in memtest86 3.0
> > is limited, so I would check to make certain your chipset
> > is supported..
> >
>
> Well, I read controller.c source file, and
> found that amd751 is supported according to the info
> in AMD751 documentation.
> (This is the same documentation I used when I hacked
> early ecc.c source file.)
>
> I ran memtest v3 on the motherboard, and
> it didn't find any errors when I ran it overnight.
I wasn't able to verify it so the AMD751 was disabled by default.
>
> > Your objective is misguided. Even with a bios that
> > is slightly buggy in initializing the ECC bits, what you want is
> > scrubbing. Then if the error disappears after 5 minutes of uptime
> > you can ignore it.
>
> I see. Then the problem would boil down whether
> AMD751 supports scrubbing at the hardware level.
> It reports that it saw correctable single bit error,
> and I take it to mean that the chip itself
> has fixed the single bit error.
> Am I too optimistic to expect this?
The single bit error has been corrected in only one direction,
in the data going to the cpu. The data remaining in memory
was not corrected.
> The problem, though, is this.
> According to the AMD751 documentation, we
> can clear the memory error info in the chip register,
> by writing to a certain location of the PCI space.
> However, when one error get reported on my motherboard,
> it would NOT go away and I think there is something amiss
> about this.
That would be because the location is frequently read and there is no
hardware scrubbing (writing the correct value back to ram).
> I am not sure if the AMD doc is correct about this now.
> from controller.c of memtest86.
>
> /* Clear the error status */
> pci_conf_write(ctrl.bus, ctrl.dev, ctrl.fn, 0x58, 2, 0);
>
> From locally hacked ecc.c:
> /*
> * clear error flag bits that were set by writing 0 to
> them
> * we hope the error was a fluke or something :)
> */
> /* int value = eccstat & 0xFCFF; */
> /* pci_write_config_word(bridge, 0x58, value); */
> pci_write_config_byte(bridge, 0x58, 0x0);
> pci_write_config_byte(bridge, 0x59, 0x0);
Both of which match.
> BTW, I tried both the byte and word write to 0x58, but
> it never seemed to clear the error status.
> (It is possible that the error is real, but again
> the error is reported always in the first bank even if
> I rotate the memory sticks...)
Sounds like a questionable bit of documentation.
> > And if it comes back you know you really have
> > something to worry about. At least for single bit errors this should
> > fix the whole problem with something that is useful for other
> > purposes.
> >
> > Eric
>
> Thank you for your feedback. I noticed that you
> contributed to linuxBIOS and memtest86.
> Please keep the good work going!
Hopefully we can get the problems well enough understood in the community
that we can actually get some of this code fixed up and working well.
Eric
>
>
>>
>> I ran memtest v3 on the motherboard, and
>> it didn't find any errors when I ran it overnight.
>
>I wasn't able to verify it so the AMD751 was disabled by default.
>
I modified the controller.c source file and enabled AMD751, but
it didn't change anything at all. memtest86 didn't find
any errors.
>>
>> > Your objective is misguided. Even with a bios that
>> > is slightly buggy in initializing the ECC bits, what you want is
>> > scrubbing. Then if the error disappears after 5 minutes of uptime
>> > you can ignore it.
>>
>> I see. Then the problem would boil down whether
>> AMD751 supports scrubbing at the hardware level.
>> It reports that it saw correctable single bit error,
>> and I take it to mean that the chip itself
>> has fixed the single bit error.
>> Am I too optimistic to expect this?
>
>The single bit error has been corrected in only one direction,
>in the data going to the cpu. The data remaining in memory
>was not corrected.
>
After reading some e-mails on the ECC mailing list and
your comment above, I have a feeling that
the particular kernel I use (compiled for AMD K7) may
be reading an uninitialized region during some operation,
and that this read may trigger the error report; and
since the AMD751 doesn't do hardware scrubbing,
the error looks as if it got stuck.
(AMD's later 76x chipset seems to support hardware scrubbing.)
I am a little disappointed to find that the low-level hardware
doesn't support scrubbing. Incidentally, I noticed from a google search
that many workstations have had an explicit software scrubber program to
check the available memory every now and then to
look for soft ECC errors that can be corrected.
But I think this software aims at correcting soft errors before
multiple-bit errors emerge that ECC at the hardware level can
no longer fix.
In any case, at the granularity reported by
the AMD751, it is a little awkward to do software scrubbing.
Also, an effective software scrubber, as in the workstation
software offerings (which check memory by just reading it
so that ECC-correctable errors in physical memory get fixed,
probably by means of an underlying hardware scrubbing function),
ought to run in kernel context as a thread, and it probably
needs to be aware of interference from DMA, etc., so it is not
easy to write, I have to admit.
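The core of a software scrubber for hardware that corrects only on the
read path can be sketched in a few lines: read each word, then write the
same value back so that the corrected data and fresh check bits land in
RAM. The following is a user-space sketch only, under loud assumptions:
scrub_range() is a hypothetical name, and the code ignores everything that
makes a real scrubber hard (CPU caches absorbing the accesses, DMA or
another CPU changing a word between the read and the write-back, and
mapping all of physical memory).

```c
#include <stddef.h>

/*
 * Read every word and write the same value back. Returns the sum of the
 * words read, so that a caller can check the contents were not altered.
 */
unsigned long scrub_range(volatile unsigned long *start, size_t nwords)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		unsigned long v = start[i];	/* read: ECC corrects on the way in */

		start[i] = v;			/* write back the value just read */
		sum += v;
	}
	return sum;
}
```

The volatile qualifier only keeps the compiler from dropping the apparently
redundant store; it does nothing about the cache or DMA issues mentioned
above.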
>
>> The problem, though, is this.
>> According to the AMD751 documentation, we
>> can clear the memory error info in the chip register,
>> by writing to a certain location of the PCI space.
>> However, when one error get reported on my motherboard,
>> it would NOT go away and I think there is something amiss
>> about this.
>
>That the location is frequently read and there is no hardware
>scrubbing (writing the correct value back to ram).
>
>
Sounds plausible.
>> I am not sure if the AMD doc is correct about this now.
>> from controller.c of memtest86.
>>
>> /* Clear the error status */
>> pci_conf_write(ctrl.bus, ctrl.dev, ctrl.fn, 0x58, 2, 0);
>>
>> From locally hacked ecc.c:
>> /*
>> * clear error flag bits that were set by writing 0 to
>> them
>> * we hope the error was a fluke or something :)
>> */
>> /* int value = eccstat & 0xFCFF; */
>> /* pci_write_config_word(bridge, 0x58, value); */
>> pci_write_config_byte(bridge, 0x58, 0x0);
>> pci_write_config_byte(bridge, 0x59, 0x0);
>
>Both of which match.
>
Glad to hear that.
>
>> BTW, I tried both the byte and word write to 0x58, but
>> it never seemed to clear the error status.
>> (It is possible that the error is real, but again
>> the error is reported always in the first bank even if
>> I rotate the memory sticks...)
>
>Sounds like a questionable bit of documentation.
>
I wish someone who is familiar with AMD chipset design
would speak up.
>
>> > And if it comes back you know you really have
>> > something to worry about. At least for single bit errors this should
>> > fix the whole problem with something that is useful for other
>> > purposes.
>> >
>> > Eric
>>
>> Thank you for your feedback. I noticed that you
>> contributed to linuxBIOS and memtest86.
>> Please keep the good work going!
>
>Hopefully we can get the problems well enough understood in the community
>that we can actually get some of this code fixed up and working well.
>
>
>
I hope so.
Currently I have a few plans to attack this topic.
- recompile the kernel for a CPU type that is less aggressive in terms of
memory access by choosing, say, AMD K5 or something.
(This might fix the bogus warning messages...)
- consider incorporating part of the memtest86 code
into the boot loader steps somewhere so that
all the memory is written to at least once during booting.
- Or, as someone suggested in private e-mail, I might add
aggressive memory copying code to memtest86 to see
if that would make memtest86 catch more subtle errors...
Thank you again for the feedback.