2007-02-27 14:06:46

by Eric Buddington

[permalink] [raw]
Subject: USB misbehavior causes system hang

2.6.20-mm2 #1 Mon Feb 26 13:16:04 EST 2007 i686 unknown

I have an external USB drive (WD MyBook 5000YS), which I use for backups.

When I try to back up to it, it works for a while, but inevitably
starts resetting like mad, gives I/O errors, and then (here's the
problem), the softdog module reboots the system.

---------------------------
scsi1 : SCSI emulation for USB Mass Storage devices
input: Western Digital External HDD as /class/input/input8
input: USB HID v1.11 Device [Western Digital External HDD] on usb-0000:00:03.2-6
.2
scsi 1:0:0:0: Direct-Access WD 5000YS External 106a PQ: 0 ANSI: 4
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: assuming drive cache: write through
sda: sda1 sda2
sd 1:0:0:0: Attached scsi disk sda
sd 1:0:0:0: Attached scsi generic sg0 type 0
reiser4[pdflush(194)]: disable_write_barrier (fs/reiser4/wander.c:234)[zam-1055]
:
NOTICE: md1 does not support write barriers, using synchronous write instead.
reiser4: sda2: found disk format 4.0.0.

---- (works fine here for a while, then:) ----------

usb 1-6.2: reset high speed USB device using ehci_hcd and address 36
usb 1-6.2: device descriptor read/64, error -110
usb 1-6.2: device descriptor read/64, error -110
usb 1-6.2: reset high speed USB device using ehci_hcd and address 36
usb 1-6.2: device descriptor read/64, error -110
usb 1-6.2: device descriptor read/64, error -110
usb 1-6.2: reset high speed USB device using ehci_hcd and address 36
usb 1-6.2: device descriptor read/8, error -110
usb 1-6.2: device descriptor read/8, error -110
usb 1-6.2: reset high speed USB device using ehci_hcd and address 36
usb 1-6.2: device descriptor read/8, error -110
usb 1-6.2: device descriptor read/8, error -110
sd 1:0:0:0: scsi: Device offlined - not ready after error recovery
sd 1:0:0:0: SCSI error: return code = 0x00050000
end_request: I/O error, dev sda, sector 919931588
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 919931828
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
...
SoftDog: Initiating system reboot.

-----------------------------------------------------

Now, the USB problem may well be a device or cabling issue, but I
don't think that this drive failure should trigger a reboot - I assume
the drive failure is somehow constipating the entire disk I/O system,
and preventing my softdog-patting script from running.

My softdog script loops 'date >/tmp/$logfile;echo 1 > /dev/watchdog; sleep
30'. Nothing in it accesses the USB disk. /tmp is tmpfs.

Should the USB failure cause this problem?

-Eric



2007-02-27 14:17:11

by Oliver Neukum

[permalink] [raw]
Subject: Re: USB misbehavior causes system hang

Am Dienstag, 27. Februar 2007 15:06 schrieb Eric Buddington:

Exactly this portion of the log would hold the reason for the reset.

> usb 1-6.2: reset high speed USB device using ehci_hcd and address 36
> usb 1-6.2: device descriptor read/64, error -110
> usb 1-6.2: device descriptor read/64, error -110
[..]
> end_request: I/O error, dev sda, sector 919931828
> sd 1:0:0:0: rejecting I/O to offline device
> sd 1:0:0:0: rejecting I/O to offline device
> sd 1:0:0:0: rejecting I/O to offline device

By that point USB is no longer involved. These requests error out
in the SCSI layer.

> My softdog script loops 'date >/tmp/$logfile;echo 1 > /dev/watchdog; sleep
> 30'. Nothing in it accesses the USB disk. /tmp is tmpfs.

If you've written to the USB disk a lot, you are likely to cause writes
to it as soon as you are under memory pressure.

> Should the USB failure cause this problem?

No.

Regards
Oliver

2007-02-27 21:51:39

by Pete Zaitcev

[permalink] [raw]
Subject: Re: USB misbehavior causes system hang

On Tue, 27 Feb 2007 09:06:21 -0500, Eric Buddington <[email protected]> wrote:

> sd 1:0:0:0: rejecting I/O to offline device
> ...
> SoftDog: Initiating system reboot.

> Now, the USB problem may well be a device or cabling issue, but I
> don't think that this drive failure should trigger a reboot - I assume
> the drive failure is somehow constipating the entire disk I/O system,
> and preventing my softdog-patting script from running.

Have you tried ub? In theory, its threadless design is supposed to
help with just this kind of a problem. Please let me know, I'm very
curous.

However, the main issue here is the OOM with all the dirty data.
We saw that before. For some weird reason, ext3 is especially good
at producing the immense amounts of write-out. Are you on ext3 or
VFAT on that drive?

Please try to find the CPU traces by hitting SysRq-w, SysRq-p. CPU
is looping under a lock somewhere and eventually it cases the watchdog
to trigger. It may be a USB issue, maybe a VM issue. I can't tell
until we get stack traces.

This does not help you to deal with the unreliable drive, I'm afraid,
but it would be a great service if you pinned down the reason of looping.

-- Pete