2008-10-08 00:36:43

by Christian Mueller

[permalink] [raw]
Subject: RE: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Hi Tejun,

The messages I saw on the screen.


Thank you,
Christian

-----Original Message-----
From: Tejun Heo [mailto:[email protected]]
Sent: Tuesday, October 07, 2008 5:14 PM
To: Christian Mueller
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in
standby or sleep, PATA models did. (fwd)

Christian Mueller wrote:
> Hi Tejun,
>
> I am using an LSI 3080-X. I have tried dmesg and it does not show
smartmon
> output.

Then, what made you believe you're having the same problem? Also, can
you please revive the cc list?

Thanks.

--
tejun


2008-10-08 00:30:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

(cc'ing linux-ide and Mikael Pettersson)

Christian Mueller wrote:
> The messages I saw on the screen.

If the drive doesn't wake up and times out the smart command while
suspended (which probably is a bug in the firmware but it also can be
the controller's fault), then after the timeout, the driver should kick
in and reset the drive which will wake up the device and retry the
command. It's not the prettiest picture but it's still gonna work. In
Linda's case, it looks like the controller (sata_promise) went bonkers
on hardreset and requires power cycle to get back to working state.
That's why I asked Linda to try another controller.

Anyways, if you're seeing a similar problem, the driver or controller
probably can't do proper reset after timeout and I can't really help
with SAS driver on FreeBSD. :-P

Mikael, the original thread is the following.

http://article.gmane.org/gmane.linux.utilities.smartmontools/5842

Any ideas why hardreset doesn't work after SMART command timed out?

Thanks.

--
tejun

2008-10-12 14:56:20

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo writes:
> (cc'ing linux-ide and Mikael Pettersson)
>
> Christian Mueller wrote:
> > The messages I saw on the screen.
>
> If the drive doesn't wake up and times out the smart command while
> suspended (which probably is a bug in the firmware but it also can be
> the controller's fault), then after the timeout, the driver should kick
> in and reset the drive which will wake up the device and retry the
> command. It's not the prettiest picture but it's still gonna work. In
> Linda's case, it looks like the controller (sata_promise) went bonkers
> on hardreset and requires power cycle to get back to working state.
> That's why I asked Linda to try another controller.
>
> Anyways, if you're seeing a similar problem, the driver or controller
> probably can't do proper reset after timeout and I can't really help
> with SAS driver on FreeBSD. :-P
>
> Mikael, the original thread is the following.
>
> http://article.gmane.org/gmane.linux.utilities.smartmontools/5842
>
> Any ideas why hardreset doesn't work after SMART command timed out?

I've looked at the above posting and tried to reproduce the problem.

Linda wrote that "smartctl -n standby -A <device>" sometimes wakes
the device up, even though it shouldn't. On my Promise test box
(FC6 user-space with kernel 2.6.27 and smartmontools-5.37-1.2.fc6,
I put my test disks in standby with "hdparm -y /dev/sdb", and then
ran numerous "smartctl -n standby -A /dev/sdb -d ata" commands.
smartctl always noticed the standby state and never woke any disk.

I then dropped the "-n standby" to observe the wakeup behaviour.
On one disk (Seagate Barracuda 7200.9) the wakeups were completely
reliable with no signs of timeouts or libata EH activity.
On another disk (Hitachi Deskstar HDS722525VLSA80) the first one or
two times I ran "smartctl -A /dev/sdb -d ata" when the disk was in
standby it would wake up ok, but the next time it would suffer from
timeouts, failed COMRESETs, and eventually libata would disable the port.
When reloading sata_promise that port would fail detection but other
ports would be ok.

So I suspect two issues:
- smartctl -n standby waking Linda's disks from standby is probably
a disk firmware issue or a smartctl issue. I see no evidence that
sata_promise is to blame for that.
- hardreset in sata_promise seems broken. I'll take a closer look
at that in about a week's time (I'll be busy with other work the
next couple of days).

/Mikael

2008-10-13 05:18:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Mikael Pettersson wrote:
> - smartctl -n standby waking Linda's disks from standby is probably
> a disk firmware issue or a smartctl issue. I see no evidence that
> sata_promise is to blame for that.

/me agrees. I think it's most likely a problem in the firmware.

> - hardreset in sata_promise seems broken. I'll take a closer look
> at that in about a week's time (I'll be busy with other work the
> next couple of days).

This looks like a rather serious problem, so please take a look at
this. Also, after the drive went down, does unplugging and replugging
the signal cable fix the problem, if not does doing the same thing
with the power cable make any difference?

Thanks.

--
tejun

2008-10-13 07:04:30

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo writes:
> Mikael Pettersson wrote:
> > - smartctl -n standby waking Linda's disks from standby is probably
> > a disk firmware issue or a smartctl issue. I see no evidence that
> > sata_promise is to blame for that.
>
> /me agrees. I think it's most likely a problem in the firmware.
>
> > - hardreset in sata_promise seems broken. I'll take a closer look
> > at that in about a week's time (I'll be busy with other work the
> > next couple of days).
>
> This looks like a rather serious problem, so please take a look at
> this. Also, after the drive went down, does unplugging and replugging
> the signal cable fix the problem, if not does doing the same thing
> with the power cable make any difference?

I couldn't bring the port back up with any combination of signal/power
cable hotplugging. Even rmmod sata_promise; modprobe sata_promise failed
to bring the port back up. Hence my suspicion that hardreset is borked
or doesn't kick in.

2008-10-13 07:10:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Mikael Pettersson wrote:
> Tejun Heo writes:
> > > - hardreset in sata_promise seems broken. I'll take a closer look
> > > at that in about a week's time (I'll be busy with other work the
> > > next couple of days).
> >
> > This looks like a rather serious problem, so please take a look at
> > this. Also, after the drive went down, does unplugging and replugging
> > the signal cable fix the problem, if not does doing the same thing
> > with the power cable make any difference?
>
> I couldn't bring the port back up with any combination of signal/power
> cable hotplugging.

Heh.. that rules out drive firmware stuck in wonderland.

> Even rmmod sata_promise; modprobe sata_promise failed to bring the
> port back up. Hence my suspicion that hardreset is borked or doesn't
> kick in.

It looks like the controller requires harder or more proper kick in
the ass to return to sane state. :-)

--
tejun

2008-10-19 23:41:00

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

On Mon, 13 Oct 2008 14:16:24 +0900, Tejun Heo wrote:
>Mikael Pettersson wrote:
>> - hardreset in sata_promise seems broken. I'll take a closer look
>> at that in about a week's time (I'll be busy with other work the
>> next couple of days).
>
>This looks like a rather serious problem, so please take a look at
>this.

I've done more tests now, and the problem is that errors detected
outside of sata_promise itself, typically timeouts, don't trigger
the pdc_reset_port() call needed to bring the ATA engine behind the
port back to sanity.

And the reason no Promise-specific reset is done on timeouts is
that libata-eh freezes the port before calling ->error_handler.
sata_promise's error_handler only does a reset if the port is
non-frozen, and I think that's because we don't want to destroy
error status bits needed by EH autopsy.

The solution I've been testing is the straightforward one of
overriding ->prereset with code which calls pdc_reset_port()
before calling the default prereset. (See patch below.)
(Promise's own driver issues resets whenever there's a sign
of a problem.)

One of my test disks will often trigger a timeout if smartctl
accesses it when it's spun down. Previously the port would not
recover from that, but now it's just a brief reset/detect and
then it's up again.

/Mikael

--- linux-2.6.27/drivers/ata/sata_promise.c.~1~ 2008-07-14 10:22:36.000000000 +0200
+++ linux-2.6.27/drivers/ata/sata_promise.c 2008-10-20 00:20:58.000000000 +0200
@@ -153,6 +153,7 @@ static void pdc_freeze(struct ata_port *
static void pdc_sata_freeze(struct ata_port *ap);
static void pdc_thaw(struct ata_port *ap);
static void pdc_sata_thaw(struct ata_port *ap);
+static int pdc_prereset(struct ata_link *link, unsigned long deadline);
static void pdc_error_handler(struct ata_port *ap);
static void pdc_post_internal_cmd(struct ata_queued_cmd *qc);
static int pdc_pata_cable_detect(struct ata_port *ap);
@@ -175,6 +176,7 @@ static const struct ata_port_operations
.sff_irq_clear = pdc_irq_clear,

.post_internal_cmd = pdc_post_internal_cmd,
+ .prereset = pdc_prereset,
.error_handler = pdc_error_handler,
};

@@ -691,6 +693,12 @@ static void pdc_sata_thaw(struct ata_por
readl(host_mmio + hotplug_offset); /* flush */
}

+static int pdc_prereset(struct ata_link *link, unsigned long deadline)
+{
+ pdc_reset_port(link->ap);
+ return ata_sff_prereset(link, deadline);
+}
+
static void pdc_error_handler(struct ata_port *ap)
{
if (!(ap->pflags & ATA_PFLAG_FROZEN))

2008-10-21 04:20:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Hello, Mikael.

> --- linux-2.6.27/drivers/ata/sata_promise.c.~1~ 2008-07-14 10:22:36.000000000 +0200
> +++ linux-2.6.27/drivers/ata/sata_promise.c 2008-10-20 00:20:58.000000000 +0200
> @@ -153,6 +153,7 @@ static void pdc_freeze(struct ata_port *
> static void pdc_sata_freeze(struct ata_port *ap);
> static void pdc_thaw(struct ata_port *ap);
> static void pdc_sata_thaw(struct ata_port *ap);
> +static int pdc_prereset(struct ata_link *link, unsigned long deadline);
> static void pdc_error_handler(struct ata_port *ap);
> static void pdc_post_internal_cmd(struct ata_queued_cmd *qc);
> static int pdc_pata_cable_detect(struct ata_port *ap);
> @@ -175,6 +176,7 @@ static const struct ata_port_operations
> .sff_irq_clear = pdc_irq_clear,
>
> .post_internal_cmd = pdc_post_internal_cmd,
> + .prereset = pdc_prereset,
> .error_handler = pdc_error_handler,
> };
>
> @@ -691,6 +693,12 @@ static void pdc_sata_thaw(struct ata_por
> readl(host_mmio + hotplug_offset); /* flush */
> }
>
> +static int pdc_prereset(struct ata_link *link, unsigned long deadline)
> +{
> + pdc_reset_port(link->ap);

I would put this into ->hardreset itself as the controller can also
get out of sync with reality during reset. Other than that, looks
like the correct approach.

Thanks.

--
tejun

2008-10-21 07:58:20

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo writes:
> Hello, Mikael.
>
> > --- linux-2.6.27/drivers/ata/sata_promise.c.~1~ 2008-07-14 10:22:36.000000000 +0200
> > +++ linux-2.6.27/drivers/ata/sata_promise.c 2008-10-20 00:20:58.000000000 +0200
> > @@ -153,6 +153,7 @@ static void pdc_freeze(struct ata_port *
> > static void pdc_sata_freeze(struct ata_port *ap);
> > static void pdc_thaw(struct ata_port *ap);
> > static void pdc_sata_thaw(struct ata_port *ap);
> > +static int pdc_prereset(struct ata_link *link, unsigned long deadline);
> > static void pdc_error_handler(struct ata_port *ap);
> > static void pdc_post_internal_cmd(struct ata_queued_cmd *qc);
> > static int pdc_pata_cable_detect(struct ata_port *ap);
> > @@ -175,6 +176,7 @@ static const struct ata_port_operations
> > .sff_irq_clear = pdc_irq_clear,
> >
> > .post_internal_cmd = pdc_post_internal_cmd,
> > + .prereset = pdc_prereset,
> > .error_handler = pdc_error_handler,
> > };
> >
> > @@ -691,6 +693,12 @@ static void pdc_sata_thaw(struct ata_por
> > readl(host_mmio + hotplug_offset); /* flush */
> > }
> >
> > +static int pdc_prereset(struct ata_link *link, unsigned long deadline)
> > +{
> > + pdc_reset_port(link->ap);
>
> I would put this into ->hardreset itself as the controller can also
> get out of sync with reality during reset.

The only thing I see going on between prereset and (hard/soft)reset
is an optional freeze, so I don't see why moving the pdc_reset_port()
into the beginning of hardreset() would make any difference.

sata_promise currently uses the ->hardreset and ->softreset inherited
from ata_sff_port_ops, so it would need to override both to ensure that
we always do pdc_reset_port() before libata does its thing. That's why
I felt doing that in ->prereset would be the right solution.

/Mikael

2008-10-21 08:01:29

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

On Mon, 20 Oct 2008 01:40:21 +0200, Mikael Pettersson wrote:
>On Mon, 13 Oct 2008 14:16:24 +0900, Tejun Heo wrote:
>>Mikael Pettersson wrote:
>>> - hardreset in sata_promise seems broken. I'll take a closer look
>>> at that in about a week's time (I'll be busy with other work the
>>> next couple of days).
>>
>>This looks like a rather serious problem, so please take a look at
>>this.
>
>I've done more tests now, and the problem is that errors detected
>outside of sata_promise itself, typically timeouts, don't trigger
>the pdc_reset_port() call needed to bring the ATA engine behind the
>port back to sanity.

I did a round of regression tests which identified another
related but different problem.

In kernels 2.6.24 and 2.6.25 sata_promise would actually recover
from the timeouts, while in kernels 2.6.26 and 2.6.27 it would not.
Before 2.6.26 sata_promise explicitly used sata_std_hardreset, but
in the "make reset related methods proper port operations" commit
(a1efdaba2dbd6fb89e23a87b66d3f4dd92c9f5af), Tejun changed sata_promise
to use the hardreset it now inherits from ata_sff_port_ops, namely
sata_sff_hardreset. This change looks accidental. The main difference
between these two procedures is that the sff version will poll the
legacy status register until the port becomes ready.

Changing sata_promise to use sata_std_hardreset in 2.6.26/.27
makes the EH after the timeouts much more reliable. Not as
tidy as with the previous ->prereset fix, but still working.

/Mikael

2008-10-21 08:57:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Hello, Mikael.
> I did a round of regression tests which identified another
> related but different problem.
>
> In kernels 2.6.24 and 2.6.25 sata_promise would actually recover
> from the timeouts, while in kernels 2.6.26 and 2.6.27 it would not.
> Before 2.6.26 sata_promise explicitly used sata_std_hardreset, but
> in the "make reset related methods proper port operations" commit
> (a1efdaba2dbd6fb89e23a87b66d3f4dd92c9f5af), Tejun changed sata_promise
> to use the hardreset it now inherits from ata_sff_port_ops, namely
> sata_sff_hardreset. This change looks accidental. The main difference
> between these two procedures is that the sff version will poll the
> legacy status register until the port becomes ready.

Hmm... it's quite likely that I've introduced the regression but that
commit ain't it (I actually wrote a script to verify the inheritance
change doesn't actually change the function table). What used to be
sata_std_hardreset() is now sata_sff_hardreset(). The change was made
while separating out SFF as [S]ATA as per the standard doesn't have
any way to wait for device readiness. The TF-polling is SFF specific
and thus moved out to sata_sff_hardreset().

So, in both 2.6.24 and 2.6.25, sata_promise did wait for device
readiness after hardreset as does 2.6.26 or any later version.

> Changing sata_promise to use sata_std_hardreset in 2.6.26/.27
> makes the EH after the timeouts much more reliable. Not as
> tidy as with the previous ->prereset fix, but still working.

The only behavior change between 2.6.25 and 2.6.26 as far as
sata_promise is concerned is that hardrset is preferred over
softreset. Here's what I think is going on.

Previously, after a timeout, libata-eh will invoke softreset and if
that works all should have been fine whether hardreset actually worked
or not. Now, after something goes wrong, libata EH calls hardreset
and as hardreset doesn't work properly without the controller reset so
it fails. So, the libata core layer change exposed a bug in
hardreset, which was one of the reasons why the change was made -
hardreset being the last line of defense, using it occasionally
doesn't make sense test-coverage-wise.

I agree the core layer changes can be quite confusing but they were
necessary to keep the core layer scalable. libata now has oodles of
different types of low level drivers and things were and still are
getting quite treacherous for drivers living on the edge.

Anyways, so, please fix hardreset. If it can't wait for device
readiness reliably, the right thing to do is to use
sata_std_hardreset() which will trigger follow-up softreset to wait
for device readiness and classify devices but if adding the controller
reset makes the hardreset more reliable, please do so.

Thanks.

--
tejun

2008-10-21 09:04:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Hello,

Mikael Pettersson wrote:
> Tejun Heo writes:
> > Hello, Mikael.
> > I would put this into ->hardreset itself as the controller can also
> > get out of sync with reality during reset.
>
> The only thing I see going on between prereset and (hard/soft)reset
> is an optional freeze, so I don't see why moving the pdc_reset_port()
> into the beginning of hardreset() would make any difference.
>
> sata_promise currently uses the ->hardreset and ->softreset inherited
> from ata_sff_port_ops, so it would need to override both to ensure that
> we always do pdc_reset_port() before libata does its thing. That's why
> I felt doing that in ->prereset would be the right solution.

Hmm.. reset sequence goes on like the following.

1. prereset
2. hardreset, if fail, retry
3. follow-up softreset if requested, if fail, goto #2
4. postreset, if successful

So, if some PHY event happens while the reset is waiting for device
readiness and makes the controller state go out of sync with the
drive. ->prereset() will NOT be called for the following retry.

As a rule, ->hardreset should be able to reset the controller from all
possible situations. ->prereset can be used to smooth out initial
reset tries (ie. during initial probing, waiting for device readiness
before SRST for SFF controllers w/o hardreset) but at best its
function is advisory. When things go wrong, ->hardreset should be
able to provide solution whatever state the controller is in.

If both hard and soft resets work better with the controller reset
added, I think it would be best to override both.

Thanks.

--
tejun

2008-10-21 09:35:53

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

Tejun Heo writes:
> Hello,
>
> Mikael Pettersson wrote:
> > Tejun Heo writes:
> > > Hello, Mikael.
> > > I would put this into ->hardreset itself as the controller can also
> > > get out of sync with reality during reset.
> >
> > The only thing I see going on between prereset and (hard/soft)reset
> > is an optional freeze, so I don't see why moving the pdc_reset_port()
> > into the beginning of hardreset() would make any difference.
> >
> > sata_promise currently uses the ->hardreset and ->softreset inherited
> > from ata_sff_port_ops, so it would need to override both to ensure that
> > we always do pdc_reset_port() before libata does its thing. That's why
> > I felt doing that in ->prereset would be the right solution.
>
> Hmm.. reset sequence goes on like the following.
>
> 1. prereset
> 2. hardreset, if fail, retry
> 3. follow-up softreset if requested, if fail, goto #2
> 4. postreset, if successful
>
> So, if some PHY event happens while the reset is waiting for device
> readiness and makes the controller state go out of sync with the
> drive. ->prereset() will NOT be called for the following retry.

I see. Ok, then I'll forget about ->prereset and bind ->hardreset to
code which does pdc_reset_port() before invoking sata_sff_hardreset().

Thanks for your input.

/Mikael