Hello all,
I have been telling this story to a few people, and nobody seems to have a
clue about what is going on... Alan suggested that I post a description of
the problem to this list, so that is what I am doing.
So, I had a Dell Inspiron 5000 which worked great for a while. It was
running a more-or-less stock Red Hat 6.1 with its stock kernel. At some
point, the hard drive in that machine failed, so I had to buy a new one.
The new drive was an IBM Travelstar 20G.
I installed a Debian system on it, with a reiserfs root partition, which
was the only partition besides an ext2 /boot partition. Everything seemed
to work fine, but after a while I started getting massive metadata
corruption on it. Whenever I did an apt-get dist-upgrade, something weird
happened, such as files that could be neither stat()ed nor unlink()ed and
directories that would make the kernel oops nicely if written to.
I could never figure out what was wrong with it. The reiserfs people
seemed to have no clue about what was going on there.
In the meantime, I got a new machine. An IBM Thinkpad T21, which is now
my main machine. After the previous experience, I decided to not trust
reiserfs this time, so I installed using ext2. Again, I installed Debian
Woody. I needed to rebuild the kernel myself as I needed the soundcard to
work, and the stock Debian one didn't even seem to have APM working, so I
installed the 2.2.18 source from Debian, configured it, and
compiled/installed using make-kpkg and dpkg.
Unfortunately, after importing 15k mail messages or so into Gnus (which is
a pretty disk-intensive activity -- I use nnml so every mail goes into a
separate file) and apt-get upgrading a couple of times, I started getting
file system corruption again. /tmp/.X0-lock turned into a weird file with
an abnormal length that couldn't be removed, so I manually forced a fsck.
It reported a lot of problems and left 656 files in lost+found. (Some of
them are files from the Gnus mail repository, and others seem to come from
TeX.)
So, this looks pretty interesting to me. I got these metadata corruption
problems (no data corruption that I know of) on two different machines
with different hardware and different file systems. Maybe it's a kernel
bug?
Another interesting thing is that both machines use a Travelstar 20G
drive. Maybe the drive's firmware is to blame, but I know at least two
more people who have been using that same drive on Thinkpads for quite a
long time and have had no problems at all with it. (Using both XFS and
ext2.)
Some system information: (I don't have the Inspiron at hand anymore, so I
can only be detailed about the Thinkpad)
milkplus:~# /sbin/lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX - 82443BX/ZX AGP bridge (rev 03)
00:02.0 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:02.1 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:03.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 09)
00:03.1 Serial controller: Xircom: Unknown device 000c
00:05.0 Multimedia audio controller: Cirrus Logic CS 4614/22/24 [CrystalClear SoundFusion Audio Accelerator] (rev 01)
00:07.0 Bridge: Intel Corporation 82371AB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB PIIX4 ACPI (rev 03)
01:00.0 VGA compatible controller: S3 Inc. 86C270-294 Savage/MX-/IX (rev 13)
milkplus:~# dmesg | grep hda
ide0: BM-DMA at 0x1850-0x1857, BIOS settings: hda:DMA, hdb:pio
hda: IBM-DJSA-220, ATA DISK drive
hda: IBM-DJSA-220, 19077MB w/1874kB Cache, CHS=2584/240/63, UDMA
hda: hda1 hda3 < hda5 hda6 > hda4
milkplus:~# hdparm /dev/hda
/dev/hda:
multcount = 0 (off)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 2584/240/63, sectors = 39070080, start = 0
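As a quick sanity check on the figures above, the CHS geometry, the sector count and the capacity reported by dmesg can be compared with each other. The sketch below only does that arithmetic; the 512-byte sector size is an assumption (it is the usual ATA value, but it is not stated in the output), and agreement between the numbers says nothing about whether DMA transfers are reliable.

```python
# Figures copied from the hdparm and dmesg output above.
CYLINDERS, HEADS, SECTORS_PER_TRACK = 2584, 240, 63
REPORTED_SECTORS = 39070080
SECTOR_BYTES = 512  # assumption: standard ATA sector size, not shown in the output

# Sector count implied by the CHS geometry.
chs_sectors = CYLINDERS * HEADS * SECTORS_PER_TRACK

# Capacity in units of 1024*1024 bytes, rounded down.
capacity_mib = REPORTED_SECTORS * SECTOR_BYTES // (1024 * 1024)

print("geometry matches sector count:", chs_sectors == REPORTED_SECTORS)
print("capacity in MiB:", capacity_mib)
```

If the two sector counts disagreed, that would point at a geometry translation problem rather than at DMA or power management.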
Any idea? What am I doing wrong?
Thanks in advance,
--
Ettore
(I am not subscribed to the list, so please reply to my own address too.)
>milkplus:~# hdparm /dev/hda
>/dev/hda:
> multcount = 0 (off)
> I/O support = 0 (default 16-bit)
> unmaskirq = 0 (off)
> using_dma = 1 (on)
> keepsettings = 0 (off)
> nowerr = 0 (off)
> readonly = 0 (off)
> readahead = 8 (on)
> geometry = 2584/240/63, sectors = 39070080, start = 0
>
>Any idea? What am I doing wrong?
You could try turning off DMA (rebuild your kernel again, and turn off "use
DMA by default"). UDMA is known to work reliably only with a (reasonably
broad) subset of chipsets, and it is likely that laptop chipsets get the
least testing. If turning off DMA fixes the problem for you, we at least
know where to start looking.
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]
The key to knowledge is not to rely on people to teach you it.
Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/
-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----
> You could try turning off DMA (rebuild your kernel again, and turn off "use
> DMA by default").
Would this be in any way different from just `hdparm -d0 /dev/hda'?
> UDMA is known to work reliably only with a (reasonably
> broad) subset of chipsets, and it is likely that laptop chipsets get the
> least testing. If turning off DMA fixes the problem for you, we at least
> know where to start looking.
Sure, I can try this, although it's hard to say for certain whether the
problem is fixed, as it's not reliably reproducible.
BTW, the Inspiron seemed to work just fine with DMA turned on, before the
drive was replaced, with the 2.2.16 kernel that Red Hat ships. (I always
had DMA turned on, and that was for about six months, without any problems
ever.)
Also, I have some friends using T20s with the same drive without any
problems, with DMA turned on.
Is there any kind of IDE DMA test I could run to see if it works reliably?
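No specific test tool is named in this thread. One rough way to exercise the disk is to write a batch of files with known contents, read them back and compare checksums. The sketch below does that in Python; the function name `write_and_verify`, the file names and the sizes are all made up for illustration. Two caveats: the read-back is likely to be served from the page cache unless the data set is larger than RAM or the file system is remounted between the two phases, and the check covers file data only, so metadata damage of the kind described here would still need a fsck to show up.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, file_count=64, file_size=64 * 1024):
    """Write files of random data, read them back, return the paths that differ."""
    expected = {}
    for i in range(file_count):
        data = os.urandom(file_size)
        path = os.path.join(directory, "chunk%04d.bin" % i)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data out to the drive
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                mismatches.append(path)
    return mismatches


if __name__ == "__main__":
    # Scratch directory on the file system under test (here: the default temp dir).
    with tempfile.TemporaryDirectory() as scratch:
        bad = write_and_verify(scratch)
        print("mismatched files:", len(bad))
```

Running it once with DMA on and once after `hdparm -d0 /dev/hda` would give a crude comparison, though a clean run proves little for a problem that is not reliably reproducible.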
--
Ettore
Hi Ettore,
I have no idea if this is related to your problem since you didn't mention
that key part, but with the same drive, I managed to trash my root partition
incredibly badly by trying to use DMA and then do APM suspend or hibernate.
On wakeup, I'd get an 'hda: lost interrupt' but then things would appear to
carry on.
The fix for me was to rebuild the kernel and make sure CONFIG_APM_ALLOW_INTS
was enabled. So, do you ever use power management and is this similar, or do
you have a completely different problem ?
Tim
--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
On 06 Mar 2001 17:01:02 -0800, Tim Wright wrote:
> Hi Ettore,
> I have no idea if this is related to your problem since you didn't mention
> that key part, but with the same drive, I managed to trash my root partition
> incredibly badly by trying to use DMA and then do APM suspend or hibernate.
> On wakeup, I'd get an 'hda: lost interrupt' but then things would appear to
> carry on.
>
> The fix for me was to rebuild the kernel and make sure CONFIG_APM_ALLOW_INTS
> was enabled. So, do you ever use power management and is this similar, or do
> you have a completely different problem ?
Wow, this sounds like this might be the problem. I just checked my
`.config' and indeed `CONFIG_APM_ALLOW_INTS' is not enabled. And indeed
I have been suspending/resuming the machine a few times before the
partition got corrupted.
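For anyone wanting to make the same check on their own build tree, a minimal sketch follows. It assumes the usual kernel `.config` layout, where an enabled option appears as `NAME=y` (or `NAME=m`) and a disabled one as a `# NAME is not set` comment. The helper name `config_option_enabled` and the sample text are hypothetical, not taken from the actual `.config` discussed here.

```python
import re


def config_option_enabled(config_text, option):
    """Return True if `option` is set to y or m in kernel .config text."""
    pattern = re.compile(r"^%s=(y|m)\s*$" % re.escape(option), re.MULTILINE)
    return pattern.search(config_text) is not None


# Made-up sample resembling a .config with APM built in but
# CONFIG_APM_ALLOW_INTS left disabled.
sample = """\
CONFIG_APM=y
# CONFIG_APM_ALLOW_INTS is not set
"""

print("CONFIG_APM:", config_option_enabled(sample, "CONFIG_APM"))
print("CONFIG_APM_ALLOW_INTS:", config_option_enabled(sample, "CONFIG_APM_ALLOW_INTS"))
```

To use it on a real tree, read the file with `open(".config").read()` and pass the text in place of `sample`.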
So, does DMA work correctly on your system after setting this option?
I have now disabled DMA completely as a safety measure (as suggested
by somebody else on this list), and I have not had any more trouble
so far. (I have been forcing a fsck every day before turning
the machine off.)
Thanks a lot for the hint! I will now rebuild my kernel with that
option turned on.
--
Ettore
On Tue, Mar 06, 2001 at 08:10:10PM -0500, Ettore Perazzoli wrote:
> On 06 Mar 2001 17:01:02 -0800, Tim Wright wrote:
[...]
> > The fix for me was to rebuild the kernel and make sure CONFIG_APM_ALLOW_INTS
> > was enabled. So, do you ever use power management and is this similar, or do
> > you have a completely different problem ?
>
> Wow, this sounds like this might be the problem. I just checked my
> `.config' and indeed `CONFIG_APM_ALLOW_INTS' is not enabled. And indeed
> I have been suspending/resuming the machine a few times before the
> partition got corrupted.
>
> So, does DMA work correctly on your system after setting this option?
Yes, it does. I have the drive running in UDMA mode 2, and get ~16MB/s from
'hdparm -t -T'. I have the "use DMA automatically" option turned on in the
kernel, so I inherit the BIOS settings which are correct.
I've used standby and hibernation with complete success since.
Regards,
Tim
On 07 Mar 2001 12:22:22 -0800, Tim Wright wrote:
> On Tue, Mar 06, 2001 at 08:10:10PM -0500, Ettore Perazzoli wrote:
> > On 06 Mar 2001 17:01:02 -0800, Tim Wright wrote:
> Yes, it does. I have the drive running in UDMA mode 2, and get ~16MB/s from
> 'hdparm -t -T'. I have the "use DMA automatically" option turned on in the
> kernel, so I inherit the BIOS settings which are correct.
>
> I've used standby and hibernation with complete success since.
This seems to have fixed the problem for me as well. I have had DMA turned
on since then, and I have not experienced any file system corruption.
Thanks!
Maybe the help message for this kernel option (CONFIG_APM_ALLOW_INTS)
should say in big blocky letters that disabling it might cause major
data loss with some drive/BIOS combinations? I was not aware that I
was touching such a sensitive parameter when I rebuilt the kernel, and
the help message didn't warn me in any way.
--
Ettore