Message-ID: <44E44B3E.10708@tls.msk.ru>
Date: Thu, 17 Aug 2006 14:55:58 +0400
From: Michael Tokarev <mjt@tls.msk.ru>
User-Agent: Mail/News 1.5 (X11/20060318)
MIME-Version: 1.0
To: linux-kernel@vger.kernel.org
CC: linux-scsi@vger.kernel.org
Subject: Random scsi disk disappearing
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2152
Lines: 60

An old problem, very annoying.

From time to time, an scsi disk just disappears from
the bus, without any [error] messages whatsoever.
The only relevant stuff in dmesg is logging from the md
(softraid) layer, about "error updating superblock"
and later "giving up and removing the disk from the
array" - not even an error number.

When I try to access such a disk (/dev/sdX device),
I get a "No such device or address" error back.

It's still listed in /sys/block and /proc/scsi/scsi,
but any access to the device gives this error.

But the disk is here, I know it is.  Deleting it from
the kernel:

  echo y > /sys/block/sdX/device/delete

and adding it back:

  echo scsi add-single-device host channel id lun > /proc/scsi/scsi

works just fine, linux finds "new" scsi device and it
happily works again.
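
The workaround above can be scripted.  What follows is only a rough
sketch, not a tested tool: the function names are made up for
illustration, and it assumes that /sys/block/sdX/device is a symlink
whose last path component is the host:channel:id:lun address, and
that add-single-device takes those four numbers separated by spaces.
It only prints the two commands, it does not run them.

```shell
#!/bin/sh
# Sketch only: print the delete / re-add commands for one disk.
# All function names here are hypothetical helpers.

# Turn an address like "0:0:1:0" into "0 0 1 0".
hcil_to_args() {
    echo "$1" | tr ':' ' '
}

# Look up host:channel:id:lun for a disk still listed in sysfs.
# Assumes /sys/block/<disk>/device is a symlink ending in H:C:I:L.
disk_hcil() {
    basename "$(readlink "/sys/block/$1/device")"
}

# Print the two commands for a disk name and its H:C:I:L address.
readd_cmds() {
    disk=$1
    hcil=$2
    echo "echo y > /sys/block/$disk/device/delete"
    echo "echo scsi add-single-device $(hcil_to_args "$hcil") > /proc/scsi/scsi"
}
```

For a disk sdb one would call readd_cmds sdb "$(disk_hcil sdb)",
check the printed lines, and then run them by hand as root.  The
address has to be looked up before the delete, since the sysfs
entry goes away afterwards.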

This happens on a lot of different machines, with different
disk drives (ok, most of them are from Seagate, but not
all).  I can't say for sure that it happens on different
scsi controllers - at least the majority of them are Adaptecs,
using the aic7xxx or aic79xx driver.

I suspected the disks were running too hot - nope, according to
smartctl, the temperature is far from bad (typically about
25..35 Celsius, and it does not change much).
Bad cables, bad power supply, bad anything else?  Not sure
either, and I can't guess more: the machines are
really different, some have good, under-loaded power supplies
(and server chassis/motherboards/all the rest), some have less
good ones - it makes no difference.  The disappearances are also
really sporadic, and don't depend on current load or time of day
(eg, during nights there's no one on site, so no one to touch
cables etc).  Well, I just can't think of any reason at all.

But one thing bothers me most: there's NO LOGGING from scsi
layer.  None, zero, not at all.

Has anyone else seen something similar?  Any pointers on how
to debug the issue?

Thanks.

/mjt