2008-10-07 00:39:31

by L A Walsh

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Ok, this is my "latest" theory about why my SATA disks have been acting
strange.

Normally I have the drives set to go into standby after 30 minutes of
inactivity. This "can" work -- unless (and this may be obvious to some
people, but it's not entirely intuitive) ...unless you query the drive's
temperature with smartctl periodically.

So..._using_ the "-n standby" on smartctl doesn't have an effect unless
the drive is already on standby -- but if it is *not* on standby, then
it counts as drive activity and resets the "goto sleep timer". This
isn't the worst problem -- more of an annoyance. I didn't try to keep
track of all the drives' temperatures until I started having the 2nd
problem which is decidedly "nastier"...

Second problem -- if a drive is in standby, then if smartctl or
smartd try to run the short or long self-tests, the kernel starts
issuing time-out errors, and the drive is eventually, _logically_
removed from the system. It never comes back from standby.

If I *access* the drive (do an 'ls' of a directory on the drive that
isn't in the cache buffers), then after a ~20 second pause, the drive
has spun up and all is good. But, for some reason, the "smart" test
functionality isn't causing the drive to wake up. Instead the kernel
views the drive as OTL (OutToLunch) and removes it from the device
table. This is, IMO, the more serious problem and is a regression
compared to PATA disk functionality.

The bit of periodically checking temps resetting the activity timer --
that isn't something I normally was trying to do -- I only started that
to try to debug why the drives were going offline (didn't know if temps
were related, among other reasons). But in the process of checking the
temps, I was also (I am guessing about the functionality based on
observation) resetting the inactivity timer.

So the real problem is why issuing a smart command isn't re-starting
the drive -- or bringing it back from standby. Whereas a "normal" disk
read seems to bring it back to normal functioning just fine (and can
then do the smart-test).

Does this give anyone ideas about where the problem might be? Also
sorta explains why my hangs have been infrequent, because I've been
periodically polling the temps of all the drives -- and only when I stop
the polling would the drive timeout, then die the next morning when
smartd tried to run a short test between 1 and 2 am.



2008-10-07 01:09:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Linda Walsh wrote:
> So the real problem is why issuing a smart command isn't re-starting
> the drive -- or bringing it back from standby. Whereas a "normal" disk
> read seems to bring it back to normal functioning just fine (and can
> then do the smart-test).
>
> Does this give anyone ideas about where the problem might be? Also
> sorta explains why my hangs have been infrequent, because I've been
> periodically polling the temps of all the drives -- and only when I stop
> the polling would the drive timeout, then die the next morning when
> smartd tried to run a short test between 1 and 2 am.

Sounds like a firmware problem to me. Issuing ATA_CMD_VERIFY on block
0 before issuing test commands should work around the problem. Also,
which controller are you using? Can you post the failing kernel log?

--
tejun

2008-10-07 01:36:25

by Linda Walsh

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Controller is a Promise TX4/300
Is this what you were looking for?:

Oct 6 16:59:14 ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x6 frozen
Oct 6 16:59:14 ish kernel: ata2.00: cmd
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct 6 16:59:14 ish kernel: res
40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 6 16:59:14 ish kernel: ata2.00: status: { DRDY }
Oct 6 16:59:20 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 16:59:24 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 16:59:30 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 16:59:34 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 16:59:40 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:00:09 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:09 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:00:14 ish dhcpd: Forward map from ns1.sc.tlinx.org to
192.168.3.242 FAILED: Has an A record but no DHCID, not mine.
Oct 6 17:00:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:00:15 ish kernel: ata2.00: disabled
Oct 6 17:00:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t4
Oct 6 17:00:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:00:20 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:00:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:30 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:00:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:40 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:01:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:01:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:01:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t3
Oct 6 17:01:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:01:20 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:01:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:30 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:01:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:40 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:02:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:02:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:02:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t2
Oct 6 17:02:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:02:21 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:02:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:31 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:02:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:41 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:03:01 ish sshd[4020]: error: channel 0: chan_read_failed for
istate 3
Oct 6 17:03:10 ish syslog-ng[13177]: last message repeated 2 times
Oct 6 17:03:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:03:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:03:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t1
Oct 6 17:03:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:03:21 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:03:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:31 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:03:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:41 ish kernel: ata2: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:04:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:04:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:04:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:04:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:04:15 ish kernel: ata2: EH pending after 5 tries, giving up
Oct 6 17:04:15 ish kernel: sd 2:0:0:0: rejecting I/O to offline device
Oct 6 17:04:15 ish kernel: program smartctl is using a deprecated SCSI
ioctl, please convert it to SG_IO
Oct 6 17:04:15 ish kernel: sd 2:0:0:0: [sdc] START_STOP FAILED
Oct 6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:06:31 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:07:30 ish syslog-ng[13177]: last message repeated 2 times
Oct 6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:08:32 ish kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x6 frozen
Oct 6 17:08:32 ish kernel: ata1.00: cmd
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct 6 17:08:32 ish kernel: res
40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 6 17:08:32 ish kernel: ata1.00: status: { DRDY }
Oct 6 17:08:38 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:08:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:08:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:08:48 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:08:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:08:58 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:09:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:09:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:09:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:32 ish kernel: ata1: reset failed, giving up
Oct 6 17:09:32 ish kernel: ata1.00: disabled
Oct 6 17:09:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t4
Oct 6 17:09:32 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:09:38 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:09:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:48 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:09:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:57 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:09:58 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:10:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:10:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:32 ish kernel: ata1: reset failed, giving up
Oct 6 17:10:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t3
Oct 6 17:10:32 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:10:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:10:38 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:10:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:48 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:10:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:58 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:11:09 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:11:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:11:33 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:33 ish kernel: ata1: reset failed, giving up
Oct 6 17:11:33 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0xe frozen t2
Oct 6 17:11:33 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:11:38 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:11:43 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Oct 6 17:11:48 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:11:53 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:58 ish kernel: ata1: link is slow to respond, please be
patient (ready=-19)
Oct 6 17:12:17 ish xinetd[2021]: Exiting...
Oct 6 17:12:17 ish kernel: nfsd: last server has exited
Oct 6 17:12:17 ish kernel: nfsd: unexporting all filesystems
Oct 6 17:12:17 ish apcupsd[1989]: apcupsd exiting, signal 15
Oct 6 17:12:17 ish apcupsd[1989]: apcupsd shutdown succeeded
Oct 6 17:12:17 ish rpc.statd[2074]: Caught signal 15, un-registering
and exiting.
Oct 6 17:12:17 ish mountd[2075]: Caught signal 15, un-registering and
exiting.
Oct 6 17:12:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5
returned.
Tejun Heo wrote:
> Linda Walsh wrote:
>
>> So the real problem is why issuing a smart command isn't re-starting
>> the drive -- or bringing it back from standby. Whereas a "normal" disk
>> read seems to bring it back to normal functioning just fine (and can
>> then do the smart-test).
>>
>> Does this give anyone ideas about where the problem might be? Also
>> sorta explains why my hangs have been infrequent, because I've been
>> periodically polling the temps of all the drives -- and only when I stop
>> the polling would the drive timeout, then die the next morning when
>> smartd tried to run a short test between 1 and 2 am.
>>
>
> Sounds like a firmware problem to me. Issuing ATA_CMD_VERIFY on block
> 0 before issuing test commands should work around the problem. Also,
> which controller are you using? Can you post the failing kernel log?
>
>

2008-10-07 01:44:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Linda Walsh wrote:
> Controller is a Promise TX4/300

Yeap. After the drive goes offline, does unplugging and replugging
the power cable to the harddrive makes it come back?

--
tejun

2008-10-07 10:16:32

by Linda Walsh

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Controller is a Promise TX4/300
>>
>
> Yeap. After the drive goes offline, does unplugging and replugging
> the power cable to the harddrive makes it come back?
>
>
That's not easy to do. It's an internal drive ... will have to find
some time
to take the system down and apart for that type of testing..

If I powercycle the whole machine it comes back up ...but that's
probably not what you mean...:-/

2008-10-07 22:28:34

by Linda Walsh

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Controller is a Promise TX4/300
>> Yeap. After the drive goes offline, does unplugging and replugging
>> the power cable to the harddrive makes it come back?
>>
----
No. It hangs the computer. about 2-3 seconds after plugging the
drives back in. Did it twice to verify it wasn't a fluke. Verified
drives removed from /dev, then
plugged them back in -- was able to do about 1-2 ls commands on /dev, then
keyboard goes dead.

First time I tried unplugging the power cables and replugging --
that hung...
2nd time tried unplugging a sata cable and replugging -- that hung too.

Hopefully you won't need any more tests of this exact nature...? :-)


2008-10-08 00:02:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Linda Walsh wrote:
> Tejun Heo wrote:
>> Linda Walsh wrote:
>>
>>> Controller is a Promise TX4/300
>>> Yeap. After the drive goes offline, does unplugging and replugging
>>> the power cable to the harddrive makes it come back?
>>>
> ----
> No. It hangs the computer. about 2-3 seconds after plugging the
> drives back in. Did it twice to verify it wasn't a fluke. Verified
> drives removed from /dev, then
> plugged them back in -- was able to do about 1-2 ls commands on /dev, then
> keyboard goes dead.
>
> First time I tried unplugging the power cables and replugging --
> that hung...
> 2nd time tried unplugging a sata cable and replugging -- that hung too.

Ah.. okay, so the controller went bonkers then. Any chance you can
shell out ~15 bucks and try a sil SATA controller?

> Hopefully you won't need any more tests of this exact nature...? :-)

Wasn't it fun and empowering? :-P

--
tejun

2008-10-22 03:41:38

by Linda Walsh

[permalink] [raw]
Subject: re: Promise SATA-standby +selftest=hungdrive; Sil works...

This is with 2.6.26.5 (there are multiple other problems with 2.6.27[.0]).

The problem with the drive going "offline" doesn't happen with a
sil_sata(3124) controller -- so no need to unplug and replug...


I.e. when the drives are in standby, if smartd or a smartctl command
attempts to run a drive self-test (short), I get timeout errors from the
Promise controller (which hangs the sys if I try unplugging/replugging
the cable to the hung drive).

The drives correctly spin up to speed and perform the short-test with
the sil controller.

It would seem there is a problem with the Promise controller or driver?


Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Tejun Heo wrote:
>>
>>> Linda Walsh wrote:
>>>
>>>
>>>> Controller is a Promise TX4/300
>>>> Yeap. After the drive goes offline, does unplugging and replugging
>>>> the power cable to the harddrive makes it come back?
>>>>
>>>>
>> ----
>> No. It hangs the computer. about 2-3 seconds after plugging the
>> drives back in. Did it twice to verify it wasn't a fluke. Verified
>> drives removed from /dev, then
>> plugged them back in -- was able to do about 1-2 ls commands on /dev, then
>> keyboard goes dead.
>>
>> First time I tried unplugging the power cables and replugging --
>> that hung...
>> 2nd time tried unplugging a sata cable and replugging -- that hung too.
>>
>
> Ah.. okay, so the controller went bonkers then. Any chance you can
> shell out ~15 bucks and try a sil SATA controller?
>
>
>> Hopefully you won't need any more tests of this exact nature...? :-)
>>
>
> Wasn't it fun and empowering? :-P
>
>

2008-10-22 04:13:48

by Tejun Heo

[permalink] [raw]
Subject: Re: Promise SATA-standby +selftest=hungdrive; Sil works...

Linda Walsh wrote:
> This is with 2.6.26.5 (there are multiple other problems with 2.6.27[.0]).
>
> The problem with the drive going "offline" doesn't happen with a
> sil_sata(3124) controller -- so no need to unplug and replug...
>
>
> I.e. when the drives are in standby, if smartd or a smartctl command
> attempts to run a drive self-test (short), I get timeout errors from the
> Promise controller (which hangs the sys if I try unplugging/replugging
> the cable to the hung drive).
> The drives correctly spin up to speed and perform the short-test with
> the sil controller.
>
> It would seem there is a problem with the Promise controller or driver?

Yeah, Mikael found out that hardreset requires controller reset before
it. Hopefully, it will be fixed soon.

Thanks.

--
tejun