Hi there. I'm seeing a really strange problem on my system lately and I
am not really sure that it has anything to do with the kernels.
I would appreciate any guidance with my problems. Any help is welcome.
My desktop has a Duron 1.3GHz (but, for some reason, it runs only at
1.1GHz) and an Asus A7V motherboard, with chipset VIA KT133 (not the
enhanced version KT133A).
Its memory modules are all PC133 and it had a 128MB module + a 256MB
module. Then, I decided that it wasn't the right time for a new computer
and just bought myself two newer 512MB modules.
Now, the motherboard has 512MB + 512MB + 256MB (all slots filled). I
then recompiled the kernels with HIGHMEM support (the 4GB version) and
I've been seeing some strangeness since then.
The first thing that I noticed came when I ran a file integrity
program (debsums, which checks the md5sum signatures of the packages
that I have installed) to check the state of my system. I discovered
that some packages didn't have their signatures matching the originals.
Then, I reinstalled said packages and ran debsums again. I got some
*other* packages with md5sum mismatches. Thinking that it could be
something related to the memory of my system, I decided to run
memtest86+ for some time.
After running for 6 hours, it could not find anything wrong with the
1.25GB of memory installed, which left me quite puzzled.
I then tried using the system again, but, still puzzled by the md5sum
mismatches, I tried to verify them again and I got some other packages
with problems.
At the same time, I was trying to stress test the machine a little bit
and decompressing the kernel tree from a tar.bz2 file, since a friend of
mine asked me to compile him a kernel >= 2.6.12 so that he could use
udev.
In the middle of the untarring, bzip2 stopped and said that it found
inconsistencies and that I should run bzip2recover on the file. I
removed the entire tree and tried uncompressing the tarball again and
the same result happened.
I then decided to reboot the machine, since I was fed up with this
strangeness (that I had never seen occurring before), and after the
boot, I tried running memtest86+ again for some minutes. It didn't find
anything.
Then, I booted back into Linux (at the time I was using 2.6.14-rc2) and
*succeeded* in uncompressing the tar.bz2 file that was "corrupted". At
this point in time, I did not understand anything.
I then left my computer running on memtest86+ while I went to work and
16 hours later, no problem was found and it was still running fine.
I then thought that it could be something with the harddisk and tried to
play with smartctl. I ran one long/off-line test on my HD, but it
succeeded (I conjectured that the drive could be running out of spare
sectors).
I also tried running the kernel with highmem=0K, but the symptoms of
corruption repeated themselves. I even thought that maybe Linux hasn't
been exposed very much to systems with HIGHMEM on older hardware
(like mine), so I then left the machine with just a 512MB module and it
still has problems.
I have voluntary preempt enabled, but I had it before and didn't notice
anything strange. I am now back to kernel 2.6.13.2 (avoiding all the
niceness that is in the 2.6.14-rc's), just to be sure. I can't see many
other things to try, except disabling voluntary preempt (which hasn't
given me any problems with earlier kernels and even -mm kernels).
Other than that, I am stuck and without any ideas. Please, any help
would be much more than welcome.
Thank you very much for any suggestions, Rogério Brito.
P.S.: If anybody knows of a live CD with memtest86+ and cpuburn and
other things so that I could test my system, I would be highly
interested to know.
I sincerely don't know if I have a software or a hardware problem here.
P.P.S.: I am using a Debian testing system and the most demanding thing
that I do with my system is to compress some files to MP3 and to type
some texts in LaTeX with Emacs under Fluxbox with the Minimal style
(which is quite easy on the machine---I have not yet dared to use any
heavy desktop environment).
If any other information is desired, please let me know. I will gladly
help you to help me, as I am almost desperate. Thanks.
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
On Tue, 27 Sep 2005 08:10:39 -0300,
Rogério Brito <[email protected]> wrote:
> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.
You don't say which filesystem you are using. Have you tried running fsck?
On Tue, 27 Sep 2005 08:10:39 -0300, Rogério Brito <[email protected]> wrote:
>Hi there. I'm seeing a really strange problem on my system lately and I
>am not really sure that it has anything to do with the kernels.
Probably not. I had a similar problem recently and, as a test case,
copied a .iso image file and then compared it to the original (cp + cmp).
It turned out to be bad memory, and yes, memtest86 did not find the
problem. Check the mobo datasheet to see whether 2+ double-sided memory
modules are allowed; you may need to stay at 1GB to reduce bus loading.
Cheers,
Grant.
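For anyone who wants to repeat the cp + cmp test described above, here
is a minimal sketch in Python. The function names and the 1 MiB chunk
size are illustrative assumptions, not something taken from this thread.
It copies a file, then compares the copy against the original and
reports the offset of the first differing byte. A file much larger than
RAM is probably the most useful input, since a small file may be
compared entirely from the page cache.

```python
import shutil
import tempfile
from pathlib import Path

CHUNK = 1 << 20  # compare 1 MiB at a time (arbitrary choice)


def first_difference(path_a, path_b, chunk=CHUNK):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                # Locate the exact byte inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:  # both reads hit end of file at the same point
                return None
            offset += len(a)


def copy_and_compare(source, workdir=None):
    """Copy `source` into a temporary directory, then compare the two files."""
    with tempfile.TemporaryDirectory(dir=workdir) as tmp:
        copy = Path(tmp) / "copy.bin"
        shutil.copyfile(source, copy)
        return first_difference(source, copy)
```

Calling copy_and_compare() on a large image file several times in a row
and getting anything other than None would point at corruption somewhere
between disk, memory and CPU, although it cannot say which one.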
Hi, Diego. Thank you very much for your reply.
On Sep 27, 2005, at 8:34 AM, Diego Calleja wrote:
> You don't say what filesystem are you using. Have you tried running
> fsck?
Oh, sure. I forgot to mention that. I am using ext3 with ACL/xattrs
and with hashed B-Trees (I optimized the filesystem with option -D of
fsck.ext2). Would one of these things be a possible cause for the
strange behaviour that I am seeing?
Again, thank you very much for your interest.
Oops! I forgot to answer your question completely.
On Sep 27, 2005, at 8:58 AM, Rogério Brito wrote:
> On Sep 27, 2005, at 8:34 AM, Diego Calleja wrote:
>> You don't say what filesystem are you using. Have you tried
>> running fsck?
>
> Oh, sure. I forgot to mention that. I am using ext3 with ACL/xattrs
> and with hashed B-Trees (I optimized the filesystem with option -D
> of fsck.ext2). Would one of these things be a possible cause for
> the strange behaviour that I am seeing?
Yes, I did run fsck. Twice now, in a row (shutdown -r -F now).
Nothing was found, unfortunately. :-( I'm really running out of
ideas. :-(
Thanks, Rogério Brito.
Hi, Grzegorz. Thank you for your response.
On Sep 27, 2005, at 8:43 AM, Grzegorz Kulewski wrote:
> What is your southbridge?
The southbridge is a VIA VT82C686.
> Maybe there are some problems there with DMA or cables.
Humm, cables. I forgot to check that. I will check that as soon as I
wake up. I spent the entire night trying to fix this, but of course,
I gave up after some days of effort and decided to ask for help.
> Anything in logs?
Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now
that you mention it, I remember that I also made my Matrox G400 use
speed 4x. I will try slowing it down to see if there is any influence
on what I see.
> Maybe the southbridge or northbridge is simply overheating? Maybe you
> have a bad power supply? What are the readings of temperatures and
> voltages in BIOS after some heavy disk-memory activities?
I don't know, because lmsensors doesn't give accurate measurements,
unfortunately. :-(
> You can use http://pyropus.ca/software/memtester/ to check your
> memory in linux. You can run cpuburn at the same time. And you can
> do some disk activity at the same time (for example dd if=/dev/hda
> bs=200M | md5sum several times to check if it will give the same
> results).
I had already tried using memtester, but I guess that I was too
ambitious with the amount of memory that I tried to make it allocate. I
will try this, but with my filesystem in read-only mode, as I cannot
afford to lose what I have (and Debian's mondo/mindi isn't working
right now---I already filed a bug report that is shared by others).
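The "read the same data several times and compare the md5sums" idea
quoted above can also be sketched in Python. This is only an
illustration under assumptions: the function names and the optional
`limit` parameter are made up here, and reading a raw device such as
/dev/hda normally requires root. It only reads, so it should be safe on
a filesystem mounted read-only.

```python
import hashlib


def md5_of(path, limit=None, chunk=1 << 20):
    """MD5 of the first `limit` bytes of `path` (whole file if limit is None)."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk if remaining is None else min(chunk, remaining)
            block = f.read(want)
            if not block:  # end of file or device
                break
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def repeated_md5(path, runs=3, limit=None):
    """Hash the same data `runs` times and return the list of digests."""
    return [md5_of(path, limit) for _ in range(runs)]


def is_stable(digests):
    """True if every run produced the same digest."""
    return len(set(digests)) == 1
```

If the digests differ between runs on data that nobody is writing to,
something in the read path is flipping bits. Note that repeated reads of
a small region may be served from the page cache, so reading more data
than fits in RAM is likely a better test of the disk path.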
> I will bet that you have some hardware problem there. You can try
> to remove the 256MB DDR module and turn HIGHMEM off. You can also
> try to check each module separately.
I already checked each module separately, but I didn't see any
corruption. I guess that I maybe wasn't paying too much attention. I
will try it again. Thanks for the suggestion.
> And the best choice will be probably to buy new mb (for example
> Abit KW7 or KV7) because your is very old and it can start to
> silently break after so many years... Today mbs are very short
> living parts - 3-4 years and they are broken...
Yes, I was just trying to avoid getting a new system now, with all
the transitions going on (i386 -> x86_64 CPUs, PATA -> SATA etc). But
my time is also costing me some nights of sleep... :-( It sucks not
to be in the US, where things are cheaper. :-(
Thank you very much, Rogério.
On Tue, 27 Sep 2005, Rogério Brito wrote:
> Hi, Grzegorz. Thank you for your response.
Hi, no problem.
> On Sep 27, 2005, at 8:43 AM, Grzegorz Kulewski wrote:
>> What is your southbridge?
>
> The southbridge is a VIA VT82C686.
I know. I had the same southbridge in my Abit KG7 but I don't know if you
have version A or version B. I had version B and it has several disk
problems fixed. For version A there are some workarounds in the kernel.
>> Maybe there are some problems there with DMA or cables.
>
> Humm, cables. I forgot to check that. I will check that as soon as I wake up.
> I spent the entire night trying to fix this, but of course, I gave up after
> some days of effort and decided to ask for help.
>
>> Anything in logs?
>
> Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now that
I don't think that there will be any oops or something like that. But
maybe some IDE messages - like failed commands or something. But if there
are no such messages then chance is that this is some memory/mb problem.
> you mention it, I remember that I also made my Matrox G400 use speed 4x. I
> will try slowing it down to see if there is any influence on what I see.
Yes, slowing down your graphics card could help.
>> Maybe the southbridge or northbridge is simply overheating? Maybe you have a
>> bad power supply? What are the readings of temperatures and voltages in BIOS
>> after some heavy disk-memory activities?
>
> I don't know, because lmsensors doesn't give accurate measurements,
> unfortunately. :-(
So after burning, reboot fast and check the BIOS measurements. Temperatures
will not change that much in a minute or two. If your system is overheating,
they will be high for at least 5 minutes after reboot.
>> You can use http://pyropus.ca/software/memtester/ to check your memory in
>> linux. You can run cpuburn at the same time. And you can do some disk
>> activity at the same time (for example dd if=/dev/hda bs=200M | md5sum
>> several times to check if it will give the same results).
>
> I had already tried using memtester, but I guess that I was too ambitious
> with the amount of memory that I tried it to allocate. I will try this, but
> with my filesystem in read-only mode, as I cannot afford to lose what I have
> (and Debian's mondo/mindi isn't working right now---I already filed a bug
> report that is shared by others).
>
>> I will bet that you have some hardware problem there. You can try to remove
>> the 256MB DDR module and turn HIGHMEM off. You can also try to check each
>> module separately.
>
> I already checked each module separately, but I didn't see any corruption. I
> guess that I maybe wasn't paying too much attention. I will try it again.
> Thanks for the suggestion.
Hmm... What did you change before the system started not working? Maybe
try with only 256MB module installed if that was the working
configuration...
>> And the best choice will be probably to buy new mb (for example Abit KW7 or
>> KV7) because your is very old and it can start to silently break after so
>> many years... Today mbs are very short living parts - 3-4 years and they
>> are broken...
>
> Yes, I was just trying to avoid getting a new system now, with all the
> transitions going on (i386 -> x86_64 CPUs, PATA -> SATA etc). But my time is
Yeah, I am waiting for a stable and better x86_64 too. But I replaced my KG7
with a KW7 in the meantime just to be sure I have something before I
buy x86_64. :-)
> also costing me some nights of sleep... :-( It sucks not to be in the US,
> where things are cheaper. :-(
Yeah, it sucks. I live in Poland and we have really big prices for
computer parts here. :-(
> Thank you very much, Rogério.
No problem.
Grzegorz Kulewski
Grant Coady wrote:
> On Tue, 27 Sep 2005 08:10:39 -0300, Rogério Brito <[email protected]> wrote:
>
>
>>Hi there. I'm seeing a really strange problem on my system lately and I
>>am not really sure that it has anything to do with the kernels.
>
>
> Probably not, I had a similar problem recently and for a test case
> copied a .iso image file then compared it to original (cp + cmp),
> turned out to be bad memory, and yes, memtest86 did not find the
> problem. Check mobo datasheet if 2+ double-sided memory allowed,
> you may need to stay at 1GB to reduce bus loading.
I work a lot with hardware and my experience is that memtest is not very
good at detecting errors. I have a Socket 7 board somewhere with bad L2
cache - it was unstable but memtest was unable to find anything.
However, GoldMemory found some errors - they disappeared after disabling
L2 cache and crashes disappeared too. It's not free but at least
shareware - you can find it at http://www.goldmemory.cz/ The older
version (IIRC 5.07) was better, I had problems with some of the newer
ones on perfectly OK hardware (when the test should start, it rebooted
instead).
--
Ondrej Zary
On Tue, Sep 27, 2005 at 09:57:52PM +1000, Grant Coady wrote:
> Probably not, I had a similar problem recently and for a test case
> copied a .iso image file then compared it to original (cp + cmp),
> turned out to be bad memory, and yes, memtest86 did not find the
> problem. Check mobo datasheet if 2+ double-sided memory allowed,
> you may need to stay at 1GB to reduce bus loading.
The board is allowed 1.5GB using 3 x 512M. I believe the 512M modules
must be double sided to work but I am not 100% sure of that.
It is also generally unstable if set to anything over PC100 memory speed
in my experience (my machine has the same board). The memory speed
detection doesn't work properly. I have found it perfectly stable when
set to PC100 in bios and using PC133 memory. It seems to prefer having
the extra margin.
I have never personally had more than 2 x 256M on mine.
Len Sorensen
Hi
On Tue, 27 Sep 2005, Grzegorz Kulewski wrote:
> On Tue, 27 Sep 2005, Rogério Brito wrote:
>
> > The southbridge is a VIA VT82C686.
>
> I know. I had the same southbridge in my Abit KG7 but I don't know if you have
> version A or version B. I had version B and it has several disk problems
> fixed. For version A there are some workarounds in the kernel.
Version B here. It first had only 128MB, worked fine, I added 256MB,
system became unstable, memtest86 found "bad memory" around the last
megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
Every module alone works, but not together. But in my case memtest86 did
find errors. Try removing the 256MB module?...
Thanks
Guennadi
---
Guennadi Liakhovetski
On Tue, Sep 27, 2005 at 09:42:44PM +0200, Guennadi Liakhovetski wrote:
> Version B here. It first had only 128MB, worked fine, I added 256MB,
> system became unstable, memtest86 found "bad memory" around the last
> megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
> Every module alone works, but not together. But in my case memtest86 did
> find errors. Try removing the 256MB module?...
FWIW, some VIA based chipsets only take a single DDR400 module, not
two. The manuals are a bit vague about it.
Erik
--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
On Tue, 27 Sep 2005, Erik Mouw wrote:
> On Tue, Sep 27, 2005 at 09:42:44PM +0200, Guennadi Liakhovetski wrote:
> > Version B here. It first had only 128MB, worked fine, I added 256MB,
> > system became unstable, memtest86 found "bad memory" around the last
> > megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
> > Every module alone works, but not together. But in my case memtest86 did
> > find errors. Try removing the 256MB module?...
>
> FWIW, some VIA based chipsets only take a single DDR400 module, not
> two. The manuals are a bit vague about it.
My manual says "2". And it's an A7VI-VM, so, unfortunately, no DDR400, just
PC133/VC133.
Thanks
Guennadi
---
Guennadi Liakhovetski
Hi Rogerio.
On Tue, 2005-09-27 at 21:10, Rogério Brito wrote:
> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.
I've seen the thread mostly following the hardware line. I'd like to
enquire down the kernel path because I've seen occasional, impossible to
reproduce problems too.
Can I ask first a few questions:
1) Are you using vanilla kernels, or do you have other patches applied?
2) Are you using ext3 only?
3) Is the corruption only ever in memory, or seen on disk too?
4) Is the corruption only in one filesystem or spread across several (if
applicable)? (ie in / but not /home or others?)
Regards,
Nigel
On Tue, Sep 27, 2005 at 08:10:39AM -0300, you [Rogério Brito] wrote:
> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.
>
> I would appreciate any guidance with my problems. Any help is welcome.
>
> My desktop has a Duron 1.3GHz (but, for some reason, it runs only at
> 1.1GHz) and an Asus A7V motherboard, with chipset VIA KT133 (not the
> enhanced version KT133A).
You may be running into this problem:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0207.2/0574.html
http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/1727.html
http://www.cs.helsinki.fi/linux/linux-kernel/2002-01/1048.html
http://marc.theaimsgroup.com/?l=linux-kernel&m=99889965423508&w=2
(A google search will turn up more.)
I had enormous trouble with VIA KT133 and IDE.
Moving the network card to a different PCI slot helped somewhat, as did
upgrading the BIOS.
I NEVER got the board stable, and ended up ditching it.
It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
is utter crap period.
When browsing the viaarena.com forums, I found a huge number of problem
reports about KT133 corrupting DMA transfers with sound cards, video
editing cards and IDE. It seemed to me it just can't get DMA right when it
is under heavy load. The reports were mostly from Windows users, btw.
-- v --
[email protected]
On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> I NEVER got the board stable, and ended up ditching it.
>
> It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> is utter crap period.
It was a FIFO bug, but the kernel knows about it and it should handle
this correctly. Is the hard disk running UDMA133 ?
On Thu, Sep 29, 2005 at 12:23:28AM +0100, you [Alan Cox] wrote:
> On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> > I NEVER got the board stable, and ended up ditching it.
> >
> > It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> > is utter crap period.
>
> It was a FIFO bug, but the kernel knows about it and it should handle
> this correctly.
Interesting. Since which version?
> Is the hard disk running UDMA133 ?
The hardware has long since been ditched for good after months of wasted
effort to get it working, but I think the HPT370 on the KT7 supports UDMA100 at
maximum, and the disks were likely UDMA66.
-- v --
[email protected]
On Thu, 2005-09-29 at 09:29 +0300, Ville Herva wrote:
> On Thu, Sep 29, 2005 at 12:23:28AM +0100, you [Alan Cox] wrote:
> > On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> > > I NEVER got the board stable, and ended up ditching it.
> > >
> > > It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> > > is utter crap period.
> >
> > It was a FIFO bug, but the kernel knows about it and it should handle
> > this correctly.
>
> Interesting. Since which version?
Some fixes went in early 2.4 and they got refined later on (see the
function quirk_vialatency). The first URL listed above still has a
brief summary. Essentially the chip has a flaw where it can lose a
transfer.
If people see this behaviour on a KT133 can you please check the quirk
is being run and displaying
printk(KERN_INFO "Applying VIA southbridge workaround.\n");
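A quick way to check for that message is to search the kernel log for
it. The following Python sketch is only an illustration: the function
names are hypothetical, the message text is copied from the printk
quoted above, and on a machine with a long uptime the line may already
have scrolled out of the kernel ring buffer.

```python
import subprocess

# Message text copied from the printk quoted above (KERN_INFO prefix dropped).
WORKAROUND_MSG = "Applying VIA southbridge workaround."


def workaround_lines(log_text, needle=WORKAROUND_MSG):
    """Return the log lines that contain the workaround message."""
    return [line for line in log_text.splitlines() if needle in line]


def read_kernel_log():
    """Return the output of `dmesg`; this may need root on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the affected machine, workaround_lines(read_kernel_log()) returning
an empty list would suggest that the quirk did not run (or that the
message is no longer in the buffer); a non-empty list shows the matching
lines.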
Hi, Grzegorz. Thank you again for your response.
I haven't kept up with linux-kernel lately, since I have been experimenting
with my motherboard to see if I could make it stable.
On Sep 27 2005, Grzegorz Kulewski wrote:
> On Tue, 27 Sep 2005, Rogério Brito wrote:
> >The southbridge is a VIA VT82C686.
>
> I know. I had the same southbridge in my Abit KG7 but I don't know if
> you have version A or version B. I had version B and it has several
> disk problems fixed. For version A there are some workarounds in the
> kernel.
Didn't know that until I saw the following in the dmesg log:
- - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -
rbrito@dumont:~$ dmesg | grep -i via
Disabling VIA memory write queue (PCI ID 0305, rev 02): [55] 89 & 1f -> 09
PCI: Disabling Via external APIC routing
agpgart: Detected VIA Twister-K/KT133x/KM133 chipset
parport_pc: VIA 686A/8231 detected
parport_pc: VIA parallel port: io=0x378, irq=7
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
- - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -
This also answers the question of my motherboard having the revision A
of the southbridge.
> >Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now
> >that
>
> I don't think that there will be any oops or something like that. But
> maybe some IDE messages - like failed commands or something. But if there
> are no such messages then chance is that this is some memory/mb
> problem.
Yes, I found some of them. See below.
> >you mention it, I remember that I also made my Matrox G400 use speed
> >4x. I will try slowing it down to see if there is any influence on
> >what I see.
>
> Yes, slowing down your graphics card could help.
This is something that I still have not tried, because I lost a good
amount of time using Gold Memory (already mentioned in this thread) to
scan for bad memory.
Even though GM is shareware and limits its tests to the "quick
tests", it did a *much* better job than memtest86+ at finding errors (i.e.,
Gold Memory found errors with my system even when memtest86+ didn't).
Perhaps some of those tests could be included in memtest86+.
Oh, and the fact that we have both memtest86{,+} doesn't help one when
choosing what to use. :-(
> >>I will bet that you have some hardware problem there. You can try to
> >>remove the 256MB DDR module and turn HIGHMEM off. You can also try to
> >>check each module separately.
> >
> >I already checked each module separately, but I didn't see any corruption.
> >I guess that I maybe wasn't paying too much attention. I will try it
> >again. Thanks for the suggestion.
>
> Hmm... What did you change before the system started not working?
It had 256MB + 128MB running at PC100 speed (even though both were rated
to work at PC133 speeds).
> Maybe try with only 256MB module installed if that was the working
> configuration...
The catch is that the problem seems to be transient and not that easy to
reproduce. For instance, I had 2 x 512MB + 256MB installed and it
"worked" (meaning that it booted Linux and the system was useable, even
though I saw some problems with md5sums on my system).
Then, just removing the 256MB module made the computer not even POST
anymore! Weird, isn't it? Beyond anything that I can explain yet.
> >It sucks not to be in the US, where things are cheaper. :-(
>
> Yeah, it sucks. I live in Poland and we have really big prices for
> computer parts here. :-(
So, you know what I am talking about when I want to keep what I have
just for the moment.
Regards,
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
Hi, Guennadi.
On Sep 27 2005, Guennadi Liakhovetski wrote:
> Version B here. It first had only 128MB, worked fine, I added 256MB,
> system became unstable, memtest86 found "bad memory" around the last
> megabytes.
This is *quite* similar to what I am seeing.
> Then I bought 512MB, hoping to use it with 256MB - no way.
Again, similar to what I see.
> Every module alone works, but not together. But in my case memtest86
> did find errors.
This is something puzzling: when I first installed the modules to get
1.25GB, things "worked", but I had problems with memtest86+ (not
memtest86).
I changed things (removing modules), got frustrated having only 512MB on
the system with all the other modules lying around here and put them
back.
This second time, I relaxed the memory timings in the BIOS from 2-2-2 to 3-3-3
and it booted and memtest86+ didn't find any errors. Yet, I saw some
corruption, which was what prompted me to send the original mail to
linux-kernel (since I didn't know if it was a hardware or a software
problem, as memtest86+ had not found any errors).
> Try removing the 256MB module?...
Right now, I'm only using one 512MB module, even though I have already
paid for the second one, and it wasn't cheap. :-(
I suspect that the system is stable now, but I am not sure. If I
reinstall some packages with apt, it still gets some problems with the
md5sum signatures of *other* packages, which is highly weird. But I
don't see any other problems.
Puzzling, huh? I already ran a SMART offline/long self-test on the disk
(to rule it out as the problem) and it passed with flying colors. I
also already used badblocks on this very disk (but in read-only mode),
and it also didn't find any problems.
I have a Quantum FIREBALLlct15 drive here.
Thanks,
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
Hi, Ondrej and others,
On Sep 27 2005, Ondrej Zary wrote:
> I have a Socket 7 board somewhere with bad L2 cache - it was unstable
> but memtest was unable to find anything.
Right.
> However, GoldMemory found some errors - they disappeared after
> disabling L2 cache and crashes disappeared too.
I have not yet tried disabling the cache in my case (since both L1 and
L2 caches here are integrated into the processor). May be a possibility,
though.
> It's not free but at least shareware - you can find it at
> http://www.goldmemory.cz/
Thank you very much for this hint. It indeed found problems that
memtest86+ didn't find. I think that it would be nice to have some of
those tests integrated in memtest86+.
Thanks again,
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
On Sep 27 2005, Lennart Sorensen wrote:
> The board is allowed 1.5GB using 3 x 512M. I believe the 512M modules
> must be double sided to work but I am not 100% sure of that.
Right now, I'm using just a single 512MB module, but it is single-sided
(I guess that by double-sided you guys mean that it has chips on both
sides of the module, right?). The only double-sided module that I have
here is the 256MB module.
OTOH, with just one 512MB everything *seems* to be working fine, but,
honestly, I'm not sure.
> It is also generally unstable if set to anything over PC100 memory speed
> in my experience (my machine has the same board).
Hummm, nice to see that you have also experienced this. With 256 + 128,
I had to use PC100 to have it work stably.
> The memory speed detection doesn't work properly. I have found it
> perfectly stable when set to PC100 in bios and using PC133 memory. It
> seems to prefer having the extra margin.
I'd obviously prefer to have everything working at PC133 speed, but
wouldn't mind running at PC100 speed if I could use everything, since I
sometimes need to use some large programs (for some dynamic programming
problems).
Thanks for sharing your experiences,
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
On Sep 28 2005, Nigel Cunningham wrote:
> Hi Rogerio.
Hi, Nigel.
> On Tue, 2005-09-27 at 21:10, Rogério Brito wrote:
> > Hi there. I'm seeing a really strange problem on my system lately and I
> > am not really sure that it has anything to do with the kernels.
>
> I've seen the thread mostly following the hardware line. I'd like to
> enquire down the kernel path because I've seen occasional, impossible
> to reproduce problems too.
Nice. I also don't want to rule out anything before I really understand
what's going on.
> Can I ask first a few questions:
Of course.
> 1) Are you using vanilla kernels, or do you have other patches applied?
Yes, all the kernels that I use are just plain vanilla kernels taken
straight from kernel.org. No other patches applied.
> 2) Are you using ext3 only?
Yes, I am.
> 3) Is the corruption only ever in memory, or seen on disk too?
I have noticed the problem mostly on disk. One strange situation was
when I was untarring a kernel tree (compressed with bzip2) and in the
middle of the extraction, bzip2 complained that the thing was
corrupted.
I removed what was extracted right away and tried again to extract the
tree (at this point, suspecting even that something in software had
problems). The problem with bzip2 occurred again. Then, I rebooted the
system and the problem magically went away.
> 4) Is the corruption only in one filesystem or spread across several
> (if applicable)? (ie in / but not /home or others?)
I only have one filesystem right now, but given the difficulties that
I'm seeing, I do plan to go back to a multiple filesystem setup (which I
always used but thought was overkill---nothing like time to teach
us what is safest).
If you want to know anything else, don't hesitate to ask.
Regards,
--
Rogério Brito : [email protected] : http://www.ime.usp.br/~rbrito
On Sat, 1 Oct 2005 18:36:55 -0300, Rogério Brito <[email protected]> wrote:
>
>I have noticed the problem mostly on disk. One strange situation was
>when I was untarring a kernel tree (compressed with bzip2) and in the
>middle of the extraction, bzip2 complained that the thing was
>corrupted.
>
>I removed what was extracted right away and tried again to extract the
>tree (at this point, suspecting even that something in software had
>problems). The problem with bzip2 occurred again. Then, I rebooted the
>system and the problem magically went away.
This rings a bell, recently I reported a problem:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0508.1/1332.html
Turned out to be bad memory stick :o)
Cheers,
Grant.
Rogério Brito <[email protected]> wrote:
> On Sep 28 2005, Nigel Cunningham wrote:
>> 3) Is the corruption only ever in memory, or seen on disk too?
>
> I have noticed the problem mostly on disk. One strange situation was
> when I was untarring a kernel tree (compressed with bzip2) and in the
> middle of the extraction, bzip2 complained that the thing was
> corrupted.
>
> I removed what was extracted right away and tried again to extract the
> tree (at this point, suspecting even that something in software had
> problems). The problem with bzip2 occurred again. Then, I rebooted the
> system and the problem magically went away.
I have a similar problem:
It's a corruption while reading data from the HDD into the cache.
The affected page will contain (pseudo?)random data in the first four
bytes (at least on my system it did).
If you waited long enough, the cache page would be discarded and the next
read from the disk would be correct. However, if it happens e.g. in an
inode block, the corruption may find its way to the disk and/or fubar
your data.
This happens mostly if there are concurrent DMA transfers like playing
sound or watching TV on bttv cards. I'm affected by the latter cause,
setting no_overlay reduced it.
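One way to look for this pattern, as a minimal sketch: compare a suspect copy of a file against a known-good copy page by page and note where the differing bytes sit inside each page. The 4096-byte page size and the function names are assumptions made for illustration; nothing here is taken from the kernel.

```python
PAGE_SIZE = 4096  # assumed page size (typical for i386); adjust if needed


def differing_pages(good: bytes, suspect: bytes, page_size: int = PAGE_SIZE):
    """Return a list of (page_index, offsets) for pages that differ.

    offsets holds the byte positions, relative to the start of the page,
    at which the two buffers disagree. Only the common length is compared.
    """
    result = []
    length = min(len(good), len(suspect))
    for start in range(0, length, page_size):
        a = good[start:start + page_size]
        b = suspect[start:start + page_size]
        if a != b:
            offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
            result.append((start // page_size, offsets))
    return result


def only_first_four_bytes(offsets) -> bool:
    """True if every differing byte lies in the first four bytes of the page."""
    return all(off < 4 for off in offsets)
```

Both buffers can be read with `open(path, "rb").read()`. If every reported page passes `only_first_four_bytes`, the damage matches the four-byte pattern described above; anything else points to a different kind of corruption.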
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
Hi, Grant, Nigel and others following this thread.
On Oct 02 2005, Grant Coady wrote:
> On Sat, 1 Oct 2005 18:36:55 -0300, Rogério Brito <[email protected]> wrote:
> >I removed what was extracted right away and tried again to extract
> >the tree (at this point, suspecting even that something in software
> >had problems). The problem with bzip2 occurred again. Then, I
> >rebooted the system and the problem magically went away.
>
> This rings a bell, recently I reported a problem:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0508.1/1332.html
Thanks for the information. I am on-and-off experimenting with
goldmemory and memtest86+ to see if I can find something with more
than 512MB that is stable.
I am, right now, using 512MB + 256MB slowed down to PC100 speeds. It
seems to be stable with this configuration (having survived some memory
tests, the decoding of lots of FLAC files in a row and using the machine
as usual---with low consumption things like mutt and browsing with
lynx).
> Turned out to be bad memory stick :o)
The thing is that no single stick on its own seems to cause a problem.
The problem shows up only when they are used simultaneously.
I will test it more to see what may be wrong with my setup. :-( I still
have not isolated and understood the problem completely. :-(
Thanks for the feedback, Rogério.
Hi Rogerio,
On Mon, Oct 03, 2005 at 01:17:19AM -0300, Rogério Brito wrote:
(...)
> The thing is that any stick alone doesn't seem to generate a problem.
> Only when they are used simultaneously
>
> I will test it more to see what may be wrong with my setup. :-( I still
> have not isolated and understood the problem completely. :-(
This is a common problem caused by flaky motherboards and/or poor
power supplies. You should first take a look at your motherboard's
manual to see if it *really* supports your configuration. Often,
they won't support several dual-side sticks simply because there
are too many chips connected to each signal pin. For instance, my
mobo (A7M266-D) has a lot of trouble if I use more than 2 sticks,
and it is documented that I need registered RAM to do this.
Also, sometimes your mobo will not have been carefully tested by
the maker with every combination of memory sticks. It might be
your case. Sometimes it helps to increase the RAM voltage (you
might have a jumper for this on the mobo or may be able to do
this in the BIOS). In my case, it helped to set the RAM to 2.7V,
but that was not enough to get a stable setup.
Last possible trouble may come from the power supply. If it's
not strong enough to maintain a perfect voltage output during
slightly higher intensity peaks, it can cause what you observe.
Hoping this helps,
Willy
On Oct 02 2005, Bodo Eggert wrote:
> Rogério Brito <[email protected]> wrote:
> > I removed what was extracted right away and tried again to extract
> > the tree (at this point, suspecting even that something in software
> > had problems). The problem with bzip2 occurred again. Then, I
> > rebooted the system and the problem magically went away.
>
> I have a similar problem:
I am still investigating the problem. I am not planning on resting right
now. I really want to understand what's going on with this system.
Too bad that I am quite naïve and don't understand much about hardware
in general. :-(
> This happens mostly if there are concurrent DMA transfers like playing
> sound or watching TV on bttv cards. I'm affected by the latter cause,
> setting no_overlay reduced it.
Humm, I think that I may have seen something like this in the past: I
have two CD readers here (both with DMA turned on) and I was once
extracting audio to be converted to MP3 and I noticed one strange
corruption that I have not been able to reproduce again:
Bits of what was extracted from one disc appeared in the output from the
other, and the result was something like a mix of static and alternation
between the two music sources. Weird, huh?
Thanks for the concern, Rogério Brito.
P.S.: I will reboot my system and force an fsck as soon as I can, just
in case.
Hi.
On Sun, 2005-10-02 at 07:36, Rogério Brito wrote:
> On Sep 28 2005, Nigel Cunningham wrote:
> > Hi Rogerio.
>
> Hi, Nigel.
>
> > On Tue, 2005-09-27 at 21:10, Rogério Brito wrote:
> > > Hi there. I'm seeing a really strange problem on my system lately and I
> > > am not really sure that it has anything to do with the kernels.
> >
> > I've seen the thread mostly following the hardware line. I'd like to
> > enquire down the kernel path because I've seen occasional, impossible
> > to reproduce problems too.
>
> Nice. I also don't want to rule out anything before I really understand
> what's going on.
>
> > Can I ask first a few questions:
>
> Of course.
>
> > 1) Are you using vanilla kernels, or do you have other patches applied?
>
> Yes, all the kernels that I use are just plain vanilla kernels taken
> straight from kernel.org. No other patches applied.
Ok. That's helpful.
> > 2) Are you using ext3 only?
>
> Yes, I am.
>
> > 3) Is the corruption only ever in memory, or seen on disk too?
>
> I have noticed the problem mostly on disk. One strange situation was
> when I was untarring a kernel tree (compressed with bzip2) and in the
> middle of the extraction, bzip2 complained that the thing was
> corrupted.
>
> I removed what was extracted right away and tried again to extract the
> tree (at this point, suspecting even that something in software had
> problems). The problem with bzip2 occurred again. Then, I rebooted the
> system and the problem magically went away.
If you see it in a form where you can see the amount of corruption, can
you see if it is just four bytes?
I'm asking because I have recently started seeing
impossible-to-reliably-reproduce corruption here, which seems to be only
four bytes at a time, in memory originally but possibly also appearing
on disk (probably because of syncing). I originally wondered if it might
be Suspend2 related (in the first instance, assume I messed up :)), but
I haven't been sure. The corruption I'm seeing only affects the root
filesystem. None of this makes much sense if I assume it's a Suspend2
bug. I could have a bad pointer access somewhere, but the rest is just
confusing.
Regards,
Nigel
> > 4) Is the corruption only in one filesystem or spread across several
> > (if applicable)? (ie in / but not /home or others?)
>
> I only have one filesystem right now, but given the difficulties that
> I'm seeing, I do plan to go back to a multiple filesystem setup (which I
> always used but thought that was overkill---nothing like time to teach
> us something what is safest).
>
> If you want to know anything else, don't hesitate to ask.
>
>
> Regards,
--
On Sep 29 2005, Alan Cox wrote:
> Some fixes went in early 2.4 and they got refined later on (see the
> function quirk_vialatency). There is a brief summary at the first URL
> listed still. Essentially the chip has a flaw where it can lose a
> transfer.
>
> If people see this behaviour on a KT133 can you please check the quirk
> is being run and displaying
>
> printk(KERN_INFO "Applying VIA southbridge workaround.\n");
Just for information, I get the following messages on my system:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
rbrito@dumont:~$ dmesg | grep -i via
Disabling VIA memory write queue (PCI ID 0305, rev 02): [55] 89 & 1f -> 09
PCI: Disabling Via external APIC routing
agpgart: Detected VIA Twister-K/KT133x/KM133 chipset
parport_pc: VIA 686A/8231 detected
parport_pc: VIA parallel port: io=0x378, irq=7
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
Netfilter messages via NETLINK v0.30.
rbrito@dumont:~$ dmesg | grep -i memor
Memory: 775776k/786352k available (1847k kernel code, 10076k reserved, 733k data, 148k init, 0k highmem)
Disabling VIA memory write queue (PCI ID 0305, rev 02): [55] 89 & 1f -> 09
Non-volatile memory driver v1.2
Freeing unused kernel memory: 148k freed
rbrito@dumont:~$
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Is this what is supposed to appear when one is using a 2.6.1x kernel?
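A small sketch of the check Alan asks for: scan the kernel log text for the workaround message quoted above. The function name and the sample text (a few lines copied from the dmesg output above) are illustrative; the only string taken from the kernel is the printk text quoted in Alan's message.

```python
EXPECTED = "Applying VIA southbridge workaround."


def quirk_applied(dmesg_text: str, expected: str = EXPECTED) -> bool:
    """Return True if any line of the dmesg text contains the expected message."""
    return any(expected in line for line in dmesg_text.splitlines())


# A few lines copied from the dmesg output shown above.
sample = """\
Disabling VIA memory write queue (PCI ID 0305, rev 02): [55] 89 & 1f -> 09
PCI: Disabling Via external APIC routing
agpgart: Detected VIA Twister-K/KT133x/KM133 chipset
"""

print(quirk_applied(sample))
```

For the lines shown here the check comes back negative, since none of them contains the southbridge message. To run it against a live system, the text could be taken from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, assuming the user is allowed to read the kernel log and the early boot messages have not already scrolled out of the ring buffer.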
Thanks for any hints, Rogério Brito.
Hi, Ville.
On Sep 28 2005, Ville Herva wrote:
> You may be running into this problem:
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0207.2/0574.html
> http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/1727.html
> http://www.cs.helsinki.fi/linux/linux-kernel/2002-01/1048.html
> http://marc.theaimsgroup.com/?l=linux-kernel&m=99889965423508&w=2
>
> (A google search will turn up more.)
Thank you very much for these links. It seems that I may not be alone
here, unfortunately. :-(
> Placing network card to a different PCI slot helped somewhat as did
> upgrading the bios.
I have not played with the network cards, but I have already upgraded
the BIOS firmware to the latest version that I could find (in the hope
that I could get the Duron 1.3GHz being actually identified as such,
instead of operating at 1.1GHz).
> It seemed to be a KT133 Northbridge DMA issue. My impression is that
> KT133 is utter crap period.
Well, is this a problem particular to the KT133 or is it a generic thing
with VIA chipsets?
I'm interested because I don't know the other chipset options that are
Open Source friendly---it seems that Nvidia-based ones have to have
reverse-engineered drivers (e.g., forcedeth), which is quite bad, IMO.
I'm intending to get another system as soon as the dust settles and
x86_64 and SATA drives become mainstream enough to be readily available
here in Brazil for reasonable prices.
But, then, I'd be concerned about getting a chipset from a company that
plays nice with Linux (and the *BSDs too, for that matter). Opinions are
more than welcome.
Thanks, Rogério Brito.
On Sat, Oct 01, 2005 at 06:28:06PM -0300, Rogério Brito wrote:
> Right now, I'm using just a single 512MB module, but it is single-sided
> (I guess that by double-sided you guys mean that it has chips on both
> sides of the module, right?). The only double-sided module that I have
> here is the 256MB module.
>
> OTOH, with just one 512MB everything *seems* to be working fine, but,
> honestly, I'm not sure.
Well maybe a single sided 512M can still have the same interface as a
double sided. Depends how it is wired I suppose.
> Hummm, nice to see that you have also experienced this. With 256 + 128,
> I had to use PC100 to have it work stably.
>
> I'd obviously prefer to have everything working at PC133 speed, but
> wouldn't mind running at PC100 speed if I could use everything, since I
> sometimes need to use some large programs (for some dynamic programming
> problems).
Actually you probably DON'T want the ram to run PC133 since at PC133 the
latency is a bit higher (in clock counts) than at PC100, so overall the
latency stays about the same. On the other hand running the ram
asynchronously from the front side bus of the cpu makes getting memory
access aligned more complicated and inserts different delays. So most
likely the system really runs fastest when the ram matches the cpu bus
speed which on an A7V is 100MHz (since it never did actually support any
133FSB cpus; you needed the fixed KT133A chipset for that, which the A7V-E
had on it). I also only run a 700MHz cpu so heat isn't a problem. I
know the 1GHz cpu made a lot of heat and really needed good cooling. I
don't remember what cpu speed you have.
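The latency argument can be put in numbers with a short sketch. The CL2 and CL3 figures below are assumed, illustrative timings, not values read from any of the modules discussed in this thread; real timings depend on the specific sticks and BIOS settings.

```python
def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a latency given in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# Assumed, illustrative timings: CL2 at 100 MHz versus CL3 at 133 MHz.
pc100 = latency_ns(2, 100.0)
pc133 = latency_ns(3, 133.0)
print(f"PC100 CL2: {pc100:.1f} ns")
print(f"PC133 CL3: {pc133:.1f} ns")
```

With these assumed numbers the two come out at roughly 20 ns and 22.6 ns, which is the sense in which a higher clock with a higher cycle count leaves the absolute latency about the same.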
Len Sorensen
Rogério Brito wrote (ao):
> On Sep 28 2005, Nigel Cunningham wrote:
> > 3) Is the corruption only ever in memory, or seen on disk too?
>
> I have noticed the problem mostly on disk. One strange situation was
> when I was untarring a kernel tree (compressed with bzip2) and in the
> middle of the extraction, bzip2 complained that the thing was
> corrupted.
>
> I removed what was extracted right away and tried again to extract the
> tree (at this point, suspecting even that something in software had
> problems). The problem with bzip2 occurred again. Then, I rebooted the
> system and the problem magically went away.
That would mean the corruption existed in memory only. The kernel
tarball got sucked into memory and got corrupted. On reboot, the tarball
gets read in again, and this time no corruption. The on disk tarball was
OK, it seems.
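A rough way to test this reading, sketched under assumptions: hash the tarball and compare the result with a known-good checksum. If the digest is wrong before a reboot and right afterwards, the copy on disk was intact and the damage happened on the way into memory. The function names and the choice of SHA-256 are illustrative; any digest for which a trusted reference value exists would do.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_good(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a trusted reference value."""
    return file_digest(path) == expected_hex.strip().lower()
```

Note that hashing the same file twice in a row proves little, because the second read is usually served from the same cached pages; the comparison against an independent reference value is what carries the information.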
If you run memtest86+ (latest version) for at least 24 hours it _should_
find something.
Kind regards, Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
Hi
On Tue, 2005-10-04 at 20:28, Sander wrote:
> Rogério Brito wrote (ao):
> > On Sep 28 2005, Nigel Cunningham wrote:
> > > 3) Is the corruption only ever in memory, or seen on disk too?
> >
> > I have noticed the problem mostly on disk. One strange situation was
> > when I was untarring a kernel tree (compressed with bzip2) and in the
> > middle of the extraction, bzip2 complained that the thing was
> > corrupted.
> >
> > I removed what was extracted right away and tried again to extract the
> > tree (at this point, suspecting even that something in software had
> > problems). The problem with bzip2 occurred again. Then, I rebooted the
> > system and the problem magically went away.
>
> That would mean the corruption existed in memory only. The kernel
> tarball got sucked into memory and got corrupted. On reboot, the tarball
> gets read in again, and this time no corruption. The on disk tarball was
> OK, it seems.
>
> If you run memtest86+ (latest version) for at least 24 hours it _should_
> find something.
Assuming that it really is a memory issue. Don't discount the
possibility of a kernel bug too quickly, especially when it apparently
worked fine in the past.
Just my 2c, feel free to discount anyway :)
Regards,
Nigel
Hi Rogério,
Sorry, was away for a week.
On Sat, 1 Oct 2005, Rogério Brito wrote:
> > Try removing the 256MB module?...
>
> Right now, I'm only using one 512MB module, but after I have already
> paid for the second one, and it wasn't cheap. :-(
Wasn't it 512 + 512 + 256 MB modules that you had? I just suggested
removing only the one 256MB module and testing with 2 x 512MB. On the one
hand, that wouldn't be as bad as having only 512MB, and on the other hand
it would be just for a test...
Good luck
Guennadi
---
Guennadi Liakhovetski
Hi, Guennadi.
On Oct 09 2005, Guennadi Liakhovetski wrote:
> Sorry, was away for a week.
No problems. I've been quite busy also.
> On Sat, 1 Oct 2005, Rogério Brito wrote:
>
> > > Try removing the 256MB module?...
> >
> > Right now, I'm only using one 512MB module, but after I have already
> > paid for the second one, and it wasn't cheap. :-(
>
> Wasn't it 512 + 512 + 256 MB modules that you had?
Exactly, but I didn't manage to get the 2x512MB modules useable in my
machine. In fact, sometimes the machine wouldn't even POST with the two
modules, but as soon as I removed any one of them, the machine was back
to normal.
> I just suggested removing only one 256MB module and testing with 2 x
> 512MB. Which on the one hand wouldn't be that bad as only having
> 512MB, and on the other hand just for a test...
Right now, I am using 512 + 256 running at PC100 speeds, with latencies
all set to 3-3-3. Now, it seems to run stably, but is slower than what I
would like it to run, of course.
I will still keep trying some combinations, but some of them seem
definitely ruled out (like having both 512 MB modules at the same time).
Thank you very much for your comments, Rogério Brito.