2000-11-13 18:53:43

by Ville Herva

[permalink] [raw]
Subject: 2.2.18pre19 and HP DAT40i: mysterious medium error

The problem:
When writing to /dev/st0, I get

st0: Error with sense data: [valid=0] Info fld=0x0, Deferred
st09:00: sense key Medium Error
Additional sense indicates Track following error
st0: Error with sense data: Info fld=0x0, Current st09:00: sense key
Medium Error
Additional sense indicates Sequential positioning error
st0: Error on write filemark.

after some gigs. The writing stops and tar (or whatever I'm using)
reports an error. The test command I usually use is

dd if=/dev/sda of=/dev/st0 bs=65536

but others fail as well (see below).

The hardware:
o Tyan S1832 DL Tiger 100 motherboard (440BX)
BIOS upgraded to 1.18 (newer ones (2.x) do not support PII)
o 2x PII 300Mhz Klamath
o 2x 128MB DIMM
o Seagate Barracuda 9GB 7200RPM SCSI-levy (narrow-scsi)
o 2x IBM 34 GXP 34GB 7200rpm IDE
o floppy drive
o Matrox Millennium G200 display adapter
o 3COM 905B 100/10 NIC
o Adaptec 2940-UW Pro SCSI adapter, newest firmware
o HP DAT40i (internal, wide) C5683A Firmware rev: C908
(2940 is the SCSI adapter HP recommends to be used with this tape drive.)

The software:
x kernel 2.2.18pre19 + RAID-2.2.17-A0 + IDE-2.2.18-17.all.20001027
+ e2compr-0.4.38 + lm_sensors
o tar-1.13.17
o dd 4.0p
o glibc-2.1.3-21
o redhat-6.2 + updates in general

What I have tried (I hope I remember everything):
o tar from fs. Same result.
x dd if=/dev/hdc etc. Same problem.
o MANY different values for dd bs=X/tar --block-factor=X.
o Cleansing tape, dozens of times. That's not it.
o Tens of different tapes: 40GB DATs, 24GB DATs, ones that came with
the drive, ones from different batch, different brand. Always the
same error.
o Different tape drive: HP 24i C1537A internal, narrow. Same
problem (The problem originally appeared with this combo.)
x Upgrading from kernel-2.2.15pre17 + raid + ide + e2compr to
2.2.18pre19. The problem persist. (The problem originally
appeared with 2.2.15pre17.)
o kernel-2.2.15pre17 + newest st.c and aic7xxx.c. Problem
persists.
o Different scsi cable. Didn't help.
o Different order of the tape drive and the hard disk on the
cable. No cure.
o Upgraded tape drive firmware (this was only possible on Windows
with the HP whizbangard tape admin program). No help.
o Upgraded the scsi adapter firmware/bios. No help.
o Kai M?kisara suggested that this could be a heat or power supply
problem. I tried putting the drive on an external scsi rack
that has its own power supply. The rack was open and the drive
was the only device in it. I'm sure it had enough power and cool
air. Didn't help.
x Different scsi adapter: Adaptec 29160. Same problem. On the 29160,
the tape drive and the harddisk were on different buses. The
termination was surely correct as it was just as in the user's guide
of the tape drive: single device on the last connector of a
terminated cable.
x I also tried putting the harddisk on the 2940UW Pro and the tape
drive 29160. Still the same message.
x Different machines: on a Celeron 466, i815 and 2.2.18pre18 the same
scsi adapter and tape drive *do work*. I ran the test almost ten
times. With the same kernel (2.2.15pre16 + patches) it also worked
many many times, but failed ONCE when reading the tape with
*different* medium error. That could have been a cleansing problem,
since the problem went away after running the cleansing tape once.
I also verified the 2.2.18pre19 kernel image I use on the problematic
machine works with this machine.
x 2x200Ppro (440FX), 2.2.18pre18SMP. Different 2940 Adaptec, and a
almost similar 9GB scsi disk (difference being that this one is wide
whereas the one on the problematic machine is narrow). Works like a
charm.
x HP's TapeWare backup software with latest patches. (The setup blindly
installs wrong version of the sgm.o kernel module the software
includes (anyone know what this sgm.o is for?). Even after
installing the latest service pack and recompiling the kernel module,
TapeWare tends to hang often requiring a reboot.) When it works, it
has a nice GUI, though, and I *was* able to produce the problem with
it, too. Different error message, of course.
x And last... (drum roll)... I tried calling HP support. Despite my
_very_ low expectations I was able to contact someone who knows what
a tape drive (and linux!) is. They suggested trying TapeAssure (diag
program of some sort), but there's no Linux version. They did even
call me back and suggest different SCSI adapter settings (lower
transfer rate etc). That didn't help. They said they were going to
bring the issue up "with their engineers" and I haven't heard of them
since.

(Kai: for your convenience, I've marked the points that I think I haven't
yet reported to you with 'x').

The place where the writing fails varies from 10MB to 20GB. On successive
trials with the same data, the point of failure *tends to be the same*,
even with different tapes! On fifth run or so, however, the writing
process may go much farther.

This could of course be caused by the RAID, IDE and/or e2compr patch, but
it seems a bit far fetched since the "dd if=/dev/sda of=/dev/st0" should
touch none of them (the two ide disks are md'd together, and the only
e2compr'd fs is on that device.)

I'm also a bit confused of the fact that the said dd command sometimes
yields transfer rate of ~3MB/s (for multiple gigas of same data) but
sometimes only ~1MB/s (with the same data, tape drive and scsi
adapter). What's also weird is that it seems that the failure rate has
gone up during these months of debugging. At first, it seemed to fail
maybe every second or third time, but now it happens practically every
time. On the other machines, it still works.

Originally I thought the problem was caused by too low input data rate
when tarring from the compressed fs. dd'ing the scsi disk should however
be fast enough.

There are absolutely no other problems with the machine or the kernel.
I've ran memtest's, kernel compilations (a la -j10), seti@homes etc and
even before installing more fans (before the tape drive was installed)
when the machine ran at >60C ambient temperature(!), there were NO
stability problems. The temperature inside the box and and on CPU's
remains well below 35C nowadays.

I'm _very_ low on ides here. The only thing I've not tried is installing
NT on the problematic machine, but I a somewhat hasty to do that since the
machine is in production use. The machine is known, though, to have worked
well with a 8GB scsi drive and the 2940UW in its previous life as an NT
workstation.

If anybody can suggest anything I could try, please do so... I'm also
interested in success/failure reports on similar hardware. If any
information is missing from this post, please ask.

I also want to thank Kai M?kisara for his forbearing efforts with this
problems -- even though a solution is yet to be found.


--
Ville Herva [email protected] +358-50-5164500
Viasys Oy Pohjantie 3 FIN-02100 Espoo +358-9-4301460
PGP key available: http://www.iki.fi/v/pgp.html fax +358-9-4301221