2009-04-21 11:11:22

by Rogério Brito

Subject: [2.6.30-rc2] usb reset during big file transfer and ext3 error

Hi there.

I have an amd64 system running Debian's sid distribution and I installed
Linux 2.6.30-rc2 on it, as a way to get improvements for the i915
subsystem.

Unfortunately, when I was transferring the contents of 2 DVDs from the
main IDE HD to a USB external HD, I got errors from the USB host, the
writes to the external HD started failing, and the ext3 filesystem there
entered error mode and went read-only.

I eventually lose the access to the device (i.e., the /dev/sd??? device
isn't there anymore) and I then have to re-run fsck on the given
filesystem.

This has happened 2 or 3 times already and I observed that it
only occurs when there is high traffic---if I am, say, compiling the
kernel on that external HD, I don't see any problems.

Attached is part of the dmesg log that shows the problem. I put the
whole dmesg at <http://rb.doesntexist.org/linux/>.

As always, if any further information is needed, please let me know.


Thanks, Rogério Brito.

--
Rogério Brito : rbrito@{mackenzie,ime.usp}.br : GPG key 1024D/7C2CAEB8
http://www.ime.usp.br/~rbrito : http://meusite.mackenzie.com.br/rbrito
Projects: algorithms.berlios.de : lame.sf.net : vrms.alioth.debian.org


Attachments:
dmesg-usb-reset-and-ext3-error.txt (28.35 kB)

2009-04-22 00:41:22

by Robert Hancock

Subject: Re: [2.6.30-rc2] usb reset during big file transfer and ext3 error

(ccing linux-usb)

Rogério Brito wrote:
> Hi there.
>
> I have an amd64 system running Debian's sid distribution and I installed
> Linux 2.6.30-rc2 on it, as a way to get improvements for the i915
> subsystem.
>
> Unfortunately, when I was transferring the contents of 2 DVDs from the
> main IDE HD to a USB external HD, I got errors from the USB host, the
> writes to the external HD started failing, and the ext3 filesystem there
> entered error mode and went read-only.
>
> I eventually lose the access to the device (i.e., the /dev/sd??? device
> isn't there anymore) and I then have to re-run fsck on the given
> filesystem.
>
> This has happened 2 or 3 times already and I observed that it
> only occurs when there is high traffic---if I am, say, compiling the
> kernel on that external HD, I don't see any problems.
>
> Attached is part of the dmesg log that shows the problem. I put the
> whole dmesg at <http://rb.doesntexist.org/linux/>.
>
> As always, if any further information is needed, please let me know.

You're seeing these:

[103051.265045] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 1
[103051.265156] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 2
[103051.265281] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 3
[103051.265406] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 4

According to the EHCI spec, XactErr is "Set to a one by the Host
Controller during status update in the case where the host did not
receive a valid response from the device (Timeout, CRC, Bad PID, etc.)"
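
For readers who want to see where that bit lives: the XactErr flag is one
of the status bits in the token word of an EHCI queue element transfer
descriptor (qTD). The following is only a sketch of the bit layout as I
understand it from the EHCI spec; the helper name decode_token and the
sample token value are made up for illustration and are not taken from the
log in this thread.

```python
# Status bits in the low byte of the qTD token, per the EHCI specification.
QTD_STS_ACTIVE = 1 << 7   # transaction still owned by the host controller
QTD_STS_HALT   = 1 << 6   # queue halted after a serious error
QTD_STS_DBE    = 1 << 5   # data buffer error
QTD_STS_BABBLE = 1 << 4   # babble detected
QTD_STS_XACT   = 1 << 3   # transaction error (XactErr)
QTD_STS_MMF    = 1 << 2   # missed micro-frame


def decode_token(token):
    """Hypothetical helper: pull the interesting fields out of a qTD token."""
    return {
        "active": bool(token & QTD_STS_ACTIVE),
        "halted": bool(token & QTD_STS_HALT),
        "xacterr": bool(token & QTD_STS_XACT),
        "cerr": (token >> 10) & 0x3,            # error counter, bits 11:10
        "bytes_left": (token >> 16) & 0x7FFF,   # total bytes to transfer, bits 30:16
    }


# Made-up example: halted with XactErr, error counter exhausted,
# 2560 bytes still untransferred.
example = (2560 << 16) | QTD_STS_HALT | QTD_STS_XACT
fields = decode_token(example)
print(fields)
```

A token that is halted with XactErr set and an error counter of zero is
the state in which the hardware has used up its own retries.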

Quite likely this is some kind of hardware problem - maybe the USB port
doesn't quite provide enough power for the drive, etc. A lot of these
USB enclosure devices are also rather poor quality in general.
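
If it helps when comparing runs, the XactErr lines can be tallied from a
saved dmesg with a few lines of Python. This is just a sketch: the regular
expression is my own guess at the message format based on the four lines
quoted above, and the function name summarize_xacterr is hypothetical.

```python
import re

# Pattern modelled on the lines quoted in this thread; \s+ also tolerates
# the line wrap that mail clients insert before "retry N".
XACTERR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ehci_hcd\s+(?P<dev>\S+):\s+"
    r"detected XactErr len (?P<done>\d+)/(?P<total>\d+)\s+retry (?P<retry>\d+)"
)


def summarize_xacterr(log_text):
    """Return the highest retry number seen for each host controller."""
    worst = {}
    for m in XACTERR_RE.finditer(log_text):
        dev = m.group("dev")
        worst[dev] = max(worst.get(dev, 0), int(m.group("retry")))
    return worst


SAMPLE = """\
[103051.265045] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096
retry 1
[103051.265156] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096
retry 2
[103051.265281] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096
retry 3
[103051.265406] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096
retry 4
"""

print(summarize_xacterr(SAMPLE))
```

A retry count that keeps climbing on the same controller, as it does here,
suggests the transfer never recovered rather than hitting a one-off glitch.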

2009-04-22 22:08:14

by Rogério Brito

Subject: Re: [2.6.30-rc2] usb reset during big file transfer and ext3 error

Hi, Robert.

On Apr 21 2009, Robert Hancock wrote:
> (ccing linux-usb)

Ok.

> Rogério Brito wrote:
(...)
>> Unfortunately, when I was transferring the contents of 2 DVDs from the
>> main IDE HD to a USB external HD, I got errors from the USB host, the
>> writes to the external HD started failing, and the ext3 filesystem there
>> entered error mode and went read-only.
>>
>> I eventually lose the access to the device (i.e., the /dev/sd??? device
>> isn't there anymore) and I then have to re-run fsck on the given
>> filesystem.
>>
>> This has happened 2 or 3 times already and I observed that it
>> only occurs when there is high traffic---if I am, say, compiling the
>> kernel on that external HD, I don't see any problems.

I just saw it reoccur once more, this time inducing a stacktrace related
to ext3. :-(

>> Attached is part of the dmesg log that shows the problem. I put the
>> whole dmesg at <http://rb.doesntexist.org/linux/>.
>>
>> As always, if any further information is needed, please let me know.
>
> You're seeing these:
>
> [103051.265045] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 1
> [103051.265156] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 2
> [103051.265281] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 3
> [103051.265406] ehci_hcd 0000:00:1d.7: detected XactErr len 1536/4096 retry 4

Precisely.

> According to the EHCI spec, XactErr is "Set to a one by the Host
> Controller during status update in the case where the host did not
> receive a valid response from the device (Timeout, CRC, Bad PID,
> etc.)"

Is there any way of controlling the number of retries in the host
controller? Or, perhaps, of controlling the time between retries so that
the device can shape it up again?

> Quite likely this is some kind of hardware problem - maybe the USB
> port doesn't quite provide enough power for the drive, etc.

I see. The first thing I thought about when I saw this comment of yours
was that there could be some heat issue and the drive not cooling
down.

In this particular case, the USB enclosure is externally powered and it
contains a SATA drive. I also had never seen it occur before when
connected to an EHCI port on another system, even while transferring
more data.

> A lot of these USB enclosure devices are also rather poor quality in
> general.

Agreed. Not everybody does things correctly by the book. OTOH, these are
the devices present in "the real world". Would there be workarounds for
such situations?


Thanks, Rogério Brito.

--
Rogério Brito : rbrito@{mackenzie,ime.usp}.br : GPG key 1024D/7C2CAEB8
http://www.ime.usp.br/~rbrito : http://meusite.mackenzie.com.br/rbrito
Projects: algorithms.berlios.de : lame.sf.net : vrms.alioth.debian.org

2009-04-23 02:54:05

by Alan Stern

Subject: Re: [2.6.30-rc2] usb reset during big file transfer and ext3 error

On Wed, 22 Apr 2009, Rogério Brito wrote:

> > According to the EHCI spec, XactErr is "Set to a one by the Host
> > Controller during status update in the case where the host did not
> > receive a valid response from the device (Timeout, CRC, Bad PID,
> > etc.)"
>
> Is there any way of controlling the number of retries in the host
> controller? Or, perhaps, of controlling the time between retries so that
> the device can shape it up again?

It's not all that simple. The host controller allows the OS to set the
number of hardware retries to 1, 2, 3, or unlimited. Linux uses 3;
those XactErr debugging messages in your log show that the driver was
extending the number of retries in software.

It's not possible to change the time interval between retries done by
the hardware. While it is possible in theory to change the interval
between retries done by the driver, it would be rather difficult and
so ehci-hcd doesn't attempt it.

The software retries were introduced to solve one particular problem:
Many EHCI controllers will generate a transaction error if a data
transfer is occurring on one port at the same time as a device is being
unplugged on another port. This is clearly a hardware bug, and the
software retries were intended to work around it. In practice only a
couple of software retries are needed; if the transfer hasn't succeeded
by that point then it's never going to succeed. I set the upper limit
to 32 retries just to be conservative.
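
The two layers of retrying described above can be pictured with a toy
model. This is not the ehci-hcd code, only a rough simulation under stated
assumptions: the hardware gives up after 3 failed attempts, the driver then
re-arms the transfer, and it does so at most 32 times. The names
run_transfer and attempt_ok are invented for the sketch, and the exact
off-by-one behaviour of the real driver may differ.

```python
HW_CERR = 3        # failed attempts the hardware tolerates per arming (Linux uses 3)
SW_RETRY_MAX = 32  # upper limit on software retries mentioned above


def run_transfer(attempt_ok):
    """Simulate one transfer.

    attempt_ok(n) returns True if the n-th bus attempt (0-based) gets a
    valid response from the device. Returns (result, attempts, sw_retries).
    """
    attempts = 0
    sw_retries = 0
    while True:
        # Hardware layer: retry back-to-back until the error counter runs out.
        for _ in range(HW_CERR):
            ok = attempt_ok(attempts)
            attempts += 1
            if ok:
                return ("ok", attempts, sw_retries)
        # Hardware halted with XactErr; the driver decides whether to re-arm.
        if sw_retries >= SW_RETRY_MAX:
            return ("failed", attempts, sw_retries)
        sw_retries += 1


# A brief glitch (first 4 attempts fail) is cured by one software retry.
print(run_transfer(lambda n: n >= 4))
# A device that never answers exhausts every retry and still fails.
print(run_transfer(lambda n: False))
```

In this model a short disturbance is absorbed by the first software retry
or two, while a device that has stopped responding fails no matter how
large the limit is, which is the point made above.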

Delaying longer in order to allow the device to shape itself up is
generally hopeless. I haven't seen more than one or two cases where
that would work -- and it's quite possible that those cases would have
worked out okay if the software retry mechanism had existed back when
they occurred. If transaction errors aren't caused by noise in the
cable then they are almost always caused by bugs or failures in the
device. Once a device's firmware has crashed, it doesn't magically fix
itself.

Alan Stern