Date: Mon, 13 Nov 2000 20:53:05 +0200
From: Ville Herva <vherva@viasys.com>
To: linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org
Cc: kai.makisara@metla.fi, skarkkai@woods.iki.fi, mkp@viasys.com,
        a-it@viasys.com
Subject: 2.2.18pre19 and HP DAT40i: mysterious medium error
Message-ID: <20001113205305.A1361@viasys.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org

The problem: 
  When writing to /dev/st0, I get 

  st0: Error with sense data: [valid=0] Info fld=0x0, Deferred
  st09:00: sense key Medium Error
  Additional sense indicates Track following error
  st0: Error with sense data: Info fld=0x0, Current st09:00: sense key
   Medium Error
  Additional sense indicates Sequential positioning error
  st0: Error on write filemark.

  after some gigs. The writing stops and tar (or whatever I'm using)
  reports an error. The test command I usually use is

  dd if=/dev/sda of=/dev/st0 bs=65536

  but others fail as well (see below).
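
For reference, the two "Additional sense" messages in the log correspond to
the standard SCSI codes ASC/ASCQ 09h/00h (track following error) and 3Bh/00h
(sequential positioning error), both reported under sense key 3h (medium
error). The sketch below shows how a fixed-format sense buffer maps to such
messages. It is only an illustration: the raw sense bytes are not part of
this report, decode_fixed_sense is a hypothetical helper name, and the
tables cover only the codes seen above.

```python
# Partial tables: only the codes that appear in the log above.
SENSE_KEYS = {0x3: "Medium Error"}
ADDITIONAL_SENSE = {
    (0x09, 0x00): "Track following error",
    (0x3B, 0x00): "Sequential positioning error",
}


def decode_fixed_sense(buf):
    """Decode a fixed-format SCSI sense buffer (response code 0x70 or 0x71)."""
    if len(buf) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response = buf[0] & 0x7F          # low 7 bits: response code
    if response not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = buf[2] & 0x0F               # low nibble of byte 2: sense key
    asc, ascq = buf[12], buf[13]      # additional sense code and qualifier
    return {
        "valid": bool(buf[0] & 0x80),       # the "[valid=0]" in the log
        "deferred": response == 0x71,       # 0x70 = current, 0x71 = deferred
        "sense_key": SENSE_KEYS.get(key, "key 0x%x" % key),
        "additional": ADDITIONAL_SENSE.get(
            (asc, ascq), "ASC 0x%02x ASCQ 0x%02x" % (asc, ascq)
        ),
    }
```

A deferred error refers to an earlier, already acknowledged command, which
would fit a drive that buffers writes before putting them on tape.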

The hardware:
        o Tyan S1832 DL Tiger 100 motherboard (440BX)
	  BIOS upgraded to 1.18 (newer ones (2.x) do not support PII)
        o 2x PII 300MHz Klamath
        o 2x 128MB DIMM
        o Seagate Barracuda 9GB 7200RPM SCSI disk (narrow SCSI)
        o 2x IBM 34 GXP 34GB 7200rpm IDE
        o floppy drive
        o Matrox Millennium G200 display adapter
        o 3COM 905B 100/10 NIC
        o Adaptec 2940-UW Pro SCSI adapter, newest firmware
        o HP DAT40i (internal, wide) C5683A  Firmware rev: C908
(2940 is the SCSI adapter HP recommends to be used with this tape drive.)

The software:
        x kernel 2.2.18pre19 + RAID-2.2.17-A0 + IDE-2.2.18-17.all.20001027
	  + e2compr-0.4.38 + lm_sensors
	o tar-1.13.17
	o dd 4.0p
	o glibc-2.1.3-21
	o redhat-6.2 + updates in general

What I have tried (I hope I remember everything):
        o tar from fs. Same result.
	x dd if=/dev/hdc etc. Same problem.
	o MANY different values for dd bs=X/tar --block-factor=X.
        o Cleaning tape, dozens of times. That's not it.
	o Tens of different tapes: 40GB DATs, 24GB DATs, ones that came with 
	  the drive, ones from different batch, different brand. Always the 
	  same error.
        o Different tape drive: HP 24i C1537A internal, narrow. Same 
	  problem (The problem originally appeared with this combo.)
        x Upgrading from kernel-2.2.15pre17 + raid + ide + e2compr to
	  2.2.18pre19. The problem persists. (The problem originally 
	  appeared with 2.2.15pre17.)
        o kernel-2.2.15pre17 + newest st.c and aic7xxx.c. Problem
	  persists.
	o Different scsi cable. Didn't help.
	o Different order of the tape drive and the hard disk on the
	  cable. No cure.
	o Upgraded tape drive firmware (this was only possible on Windows
	  with the HP whizbangard tape admin program). No help.
	o Upgraded the scsi adapter firmware/bios. No help.
	o Kai Mäkisara suggested that this could be a heat or power supply
	  problem. I tried putting the drive on an external scsi rack 
	  that has its own power supply. The rack was open and the drive
	  was the only device in it. I'm sure it had enough power and cool 
	  air. Didn't help.
	x Different scsi adapter: Adaptec 29160. Same problem. On the 29160, 
	  the tape drive and the harddisk were on different buses. The
	  termination was surely correct as it was just as in the user's guide
	  of the tape drive: single device on the last connector of a 
	  terminated cable.
	x I also tried putting the harddisk on the 2940UW Pro and the tape 
	  drive 29160. Still the same message.
	x Different machines: on a Celeron 466, i815 and 2.2.18pre18 the same
	  scsi adapter and tape drive *do work*. I ran the test almost ten
	  times. With the same kernel (2.2.15pre16 + patches) it also worked
	  many many times, but failed ONCE when reading the tape with a
	  *different* medium error. That could have been a cleaning problem,
	  since the problem went away after running the cleaning tape once. 
	  I also verified the 2.2.18pre19 kernel image I use on the problematic
	  machine works with this machine.
	x 2x PPro 200 (440FX), 2.2.18pre18 SMP. A different Adaptec 2940, and an
	  almost identical 9GB scsi disk (the difference being that this one is wide
	  whereas the one on the problematic machine is narrow). Works like a
	  charm.
	x HP's TapeWare backup software with latest patches. (The setup blindly
	  installs the wrong version of the sgm.o kernel module the software
	  includes (anyone know what this sgm.o is for?). Even after 
	  installing the latest service pack and recompiling the kernel module,
	  TapeWare tends to hang often, requiring a reboot.) When it works, it 
	  has a nice GUI, though, and I *was* able to reproduce the problem with
	  it, too. Different error message, of course.
	x And last... (drum roll)... I tried calling HP support. Despite my
  	  _very_ low expectations I was able to contact someone who knows what
	  a tape drive (and linux!) is. They suggested trying TapeAssure (diag
	  program of some sort), but there's no Linux version. They even
	  called me back and suggested different SCSI adapter settings (lower
	  transfer rate etc). That didn't help. They said they were going to
	  bring the issue up "with their engineers" and I haven't heard from them
	  since.

(Kai: for your convenience, I've marked the points that I think I haven't
yet reported to you with 'x').

The place where the writing fails varies from 10MB to 20GB. On successive
trials with the same data, the point of failure *tends to be the same*,
even with different tapes! On the fifth run or so, however, the writing
process may go much farther.
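
To compare the point of failure between runs more precisely than dd's
summary allows, the copy could be done block by block while counting the
bytes written. The following is a minimal sketch under stated assumptions,
not the command used for the tests above: copy_blocks is a hypothetical
helper, and src and dst are assumed to be unbuffered binary file objects
(for example open("/dev/st0", "wb", buffering=0)) so that every write() is
issued as one block.

```python
def copy_blocks(src, dst, block_size=65536):
    """Copy src to dst in fixed-size blocks.

    Returns (bytes_written, error). error is None if the copy completed,
    otherwise the OSError raised by the failing write. bytes_written only
    counts blocks whose write() returned without raising.
    """
    written = 0
    while True:
        block = src.read(block_size)
        if not block:                 # end of input
            return written, None
        try:
            dst.write(block)
        except OSError as exc:        # e.g. EIO on a medium error
            return written, exc
        written += len(block)
```

Since the drive reported a deferred error, the offset obtained this way is
only an upper bound: the block that actually failed on tape may have been
accepted by the drive some time before the write() that returned the error.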

This could of course be caused by the RAID, IDE and/or e2compr patch, but
it seems a bit far fetched since the "dd if=/dev/sda of=/dev/st0" should
touch none of them (the two ide disks are md'd together, and the only
e2compr'd fs is on that device.)

I'm also a bit puzzled by the fact that the said dd command sometimes
yields a transfer rate of ~3MB/s (for multiple gigabytes of the same data) but
sometimes only ~1MB/s (with the same data, tape drive and scsi
adapter). What's also weird is that it seems that the failure rate has
gone up during these months of debugging. At first, it seemed to fail
maybe every second or third time, but now it happens practically every
time. On the other machines, it still works.
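
One way to see whether the ~1MB/s runs are slow from the start or drop at
some point would be to sample the rate during the copy. This is a rough
sketch, not something used for the measurements above: copy_with_rate_log
is a hypothetical helper, the 10 second interval is arbitrary, and MB is
taken to mean 10^6 bytes.

```python
import time


def copy_with_rate_log(src, dst, block_size=65536, interval=10.0,
                       clock=time.monotonic):
    """Copy src to dst and sample the write rate.

    Returns a list of (total_bytes, mb_per_s) tuples, one for each period
    of at least `interval` seconds, where mb_per_s is the average rate
    over that period only.
    """
    log = []
    total = 0
    mark_time, mark_bytes = clock(), 0
    while True:
        block = src.read(block_size)
        if not block:                 # end of input
            break
        dst.write(block)
        total += len(block)
        now = clock()
        if now - mark_time >= interval:
            rate = (total - mark_bytes) / (now - mark_time) / 1e6
            log.append((total, rate))
            mark_time, mark_bytes = now, total
    return log
```

The clock parameter exists only so the function can be exercised with a
fake clock; in normal use the default is fine.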

Originally I thought the problem was caused by too low an input data rate
when tarring from the compressed fs. dd'ing the scsi disk should however
be fast enough.

There are absolutely no other problems with the machine or the kernel.
I've run memtests, kernel compilations (a la -j10), seti@home etc., and
even before installing more fans (before the tape drive was installed),
when the machine ran at >60C ambient temperature(!), there were NO
stability problems. The temperature inside the box and on the CPUs
remains well below 35C nowadays.

I'm _very_ low on ideas here. The only thing I've not tried is installing
NT on the problematic machine, but I am somewhat hesitant to do that since the
machine is in production use. The machine is known, though, to have worked
well with an 8GB scsi drive and the 2940UW in its previous life as an NT
workstation.

If anybody can suggest anything I could try, please do so... I'm also
interested in success/failure reports on similar hardware. If any
information is missing from this post, please ask.

I also want to thank Kai Mäkisara for his patient efforts with this
problem -- even though a solution is yet to be found.


--
Ville Herva            vhe@viasys.com                +358-50-5164500
Viasys Oy              Pohjantie 3  FIN-02100 Espoo  +358-9-4301460
PGP key available: http://www.iki.fi/v/pgp.html  fax +358-9-4301221