Calling ap->ops->set_piomode(ap, dev) on a device/controller which got
already removed, locks the system hard. Reproducibly on an X60 attached to
a dock station containing a cdrom device with doing
$ echo 1 > /sys/devices/platform/dock.0/undock && echo 123 > /dev/sr0
This calls ata_eh_reset(...) which in turn tries to force PIO mode 0. But
the device is already gone.
Bisecting revealed the following commit as culprit:
commit cdeab1140799f09c5f728a5ff85e0bdfa5679cd2
Author: Tejun Heo <[email protected]>
Date: Mon Oct 29 16:41:09 2007 +0900
libata: relocate forcing PIO0 on reset
Forcing PIO0 on reset was done inside ata_bus_softreset(), which is a
bit out of place as it should be applied to all resets - hard, soft
and implementation which don't use ata_bus_softreset(). Relocate it
such that...
* For new EH, it's done in ata_eh_reset() before calling prereset.
* For old EH, it's done before calling ap->ops->phy_reset() in
ata_bus_probe().
This makes PIO0 forced after all resets. Another difference is that
reset itself is done after PIO0 is forced.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Alan Cox <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>
ATTENTION! The following patch solves the problem on my system, but please
be aware that I don't really know what I'm doing because I don't have the
big picture. There's surely a better way to check if the device/controller
is still functional than calling ata_link_{online,offline}.
Signed-off-by: Holger Macht <[email protected]>
---
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 4e31071..d6a7c57 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2131,23 +2131,25 @@ int ata_eh_reset(struct ata_link *link, int classify,
ata_eh_about_to_do(link, NULL, ehc->i.action & ATA_EH_RESET_MASK);
- ata_link_for_each_dev(dev, link) {
- /* If we issue an SRST then an ATA drive (not ATAPI)
- * may change configuration and be in PIO0 timing. If
- * we do a hard reset (or are coming from power on)
- * this is true for ATA or ATAPI. Until we've set a
- * suitable controller mode we should not touch the
- * bus as we may be talking too fast.
- */
- dev->pio_mode = XFER_PIO_0;
+ if (ata_link_online(link) != ata_link_offline(link)) {
+ ata_link_for_each_dev(dev, link) {
+ /* If we issue an SRST then an ATA drive (not ATAPI)
+ * may change configuration and be in PIO0 timing. If
+ * we do a hard reset (or are coming from power on)
+ * this is true for ATA or ATAPI. Until we've set a
+ * suitable controller mode we should not touch the
+ * bus as we may be talking too fast.
+ */
+ dev->pio_mode = XFER_PIO_0;
- /* If the controller has a pio mode setup function
- * then use it to set the chipset to rights. Don't
- * touch the DMA setup as that will be dealt with when
- * configuring devices.
- */
- if (ap->ops->set_piomode)
- ap->ops->set_piomode(ap, dev);
+ /* If the controller has a pio mode setup function
+ * then use it to set the chipset to rights. Don't
+ * touch the DMA setup as that will be dealt with when
+ * configuring devices.
+ */
+ if (ap->ops->set_piomode)
+ ap->ops->set_piomode(ap, dev);
+ }
}
/* Determine which reset to use and record in ehc->i.action.
Hello,
Holger Macht wrote:
> Calling ap->ops->set_piomode(ap, dev) on a device/controller which got
> already removed, locks the system hard. Reproducibly on an X60 attached to
> a dock station containing a cdrom device with doing
>
> $ echo 1 > /sys/devices/platform/dock.0/undock && echo 123 > /dev/sr0
>
> This calls ata_eh_reset(...) which in turn tries to force PIO mode 0. But
> the device is already gone.
>
> Bisecting revealed the following commit as culprit:
>
> commit cdeab1140799f09c5f728a5ff85e0bdfa5679cd2
> Author: Tejun Heo <[email protected]>
> Date: Mon Oct 29 16:41:09 2007 +0900
>
> libata: relocate forcing PIO0 on reset
>
> Forcing PIO0 on reset was done inside ata_bus_softreset(), which is a
> bit out of place as it should be applied to all resets - hard, soft
> and implementation which don't use ata_bus_softreset(). Relocate it
> such that...
>
> * For new EH, it's done in ata_eh_reset() before calling prereset.
>
> * For old EH, it's done before calling ap->ops->phy_reset() in
> ata_bus_probe().
>
> This makes PIO0 forced after all resets. Another difference is that
> reset itself is done after PIO0 is forced.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Alan Cox <[email protected]>
> Signed-off-by: Jeff Garzik <[email protected]>
>
>
> ATTENTION! The following patch solves the problem on my system, but please
> be aware that I don't really know what I'm doing because I don't have the
> big picture. There's surely a better way to check if the device/controller
> is still functional than calling ata_link_{online,offline}.
In the above example, even the reset sequence itself can cause hang if
the hardware is implemented slightly differently. The reason why
set_piomode() locks up but reset sequence doesn't is simple dumb luck.
I think the proper fix is to tell libata to detach the cdrom before
undocking.
> Signed-off-by: Holger Macht <[email protected]>
NACK.
--
tejun
On Mon 11. Feb - 11:37:15, Tejun Heo wrote:
> Hello,
>
> Holger Macht wrote:
> > Calling ap->ops->set_piomode(ap, dev) on a device/controller which got
> > already removed, locks the system hard. Reproducibly on an X60 attached to
> > a dock station containing a cdrom device with doing
> >
> > $ echo 1 > /sys/devices/platform/dock.0/undock && echo 123 > /dev/sr0
> >
> > This calls ata_eh_reset(...) which in turn tries to force PIO mode 0. But
> > the device is already gone.
> >
> > Bisecting revealed the following commit as culprit:
> >
> > commit cdeab1140799f09c5f728a5ff85e0bdfa5679cd2
> > Author: Tejun Heo <[email protected]>
> > Date: Mon Oct 29 16:41:09 2007 +0900
> >
> > libata: relocate forcing PIO0 on reset
> >
> > Forcing PIO0 on reset was done inside ata_bus_softreset(), which is a
> > bit out of place as it should be applied to all resets - hard, soft
> > and implementation which don't use ata_bus_softreset(). Relocate it
> > such that...
> >
> > * For new EH, it's done in ata_eh_reset() before calling prereset.
> >
> > * For old EH, it's done before calling ap->ops->phy_reset() in
> > ata_bus_probe().
> >
> > This makes PIO0 forced after all resets. Another difference is that
> > reset itself is done after PIO0 is forced.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Acked-by: Alan Cox <[email protected]>
> > Signed-off-by: Jeff Garzik <[email protected]>
> >
> >
> > ATTENTION! The following patch solves the problem on my system, but please
> > be aware that I don't really know what I'm doing because I don't have the
> > big picture. There's surely a better way to check if the device/controller
> > is still functional than calling ata_link_{online,offline}.
>
> In the above example, even the reset sequence itself can cause hang if
> the hardware is implemented slightly differently. The reason why
> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
> I think the proper fix is to tell libata to detach the cdrom before
> undocking.
Wouldn't the proper fix be to call ata_acpi_handle_hotplug _somewhere_?
(which is currently called nowhere AFAICS)
Anyway, kernel hackers keep telling me that the kernel should just do the
right thing. Shouldn't userspace never be able to freeze the system?
It's completely ok for me to handle this from userspace, if that's the
position of the libata developers.
In this case, we should change the dock driver to default to
immediate_undock=false, because otherwise it's far too risky to freeze the
system.
Regards,
Holger
On Mon 11. Feb - 11:37:15, Tejun Heo wrote:
> Hello,
>
> Holger Macht wrote:
> > Calling ap->ops->set_piomode(ap, dev) on a device/controller which got
> > already removed, locks the system hard. Reproducibly on an X60 attached to
> > a dock station containing a cdrom device with doing
> >
> > $ echo 1 > /sys/devices/platform/dock.0/undock && echo 123 > /dev/sr0
> >
> > This calls ata_eh_reset(...) which in turn tries to force PIO mode 0. But
> > the device is already gone.
> >
> > Bisecting revealed the following commit as culprit:
> >
> > commit cdeab1140799f09c5f728a5ff85e0bdfa5679cd2
> > Author: Tejun Heo <[email protected]>
> > Date: Mon Oct 29 16:41:09 2007 +0900
> >
> > libata: relocate forcing PIO0 on reset
> >
> > Forcing PIO0 on reset was done inside ata_bus_softreset(), which is a
> > bit out of place as it should be applied to all resets - hard, soft
> > and implementation which don't use ata_bus_softreset(). Relocate it
> > such that...
> >
> > * For new EH, it's done in ata_eh_reset() before calling prereset.
> >
> > * For old EH, it's done before calling ap->ops->phy_reset() in
> > ata_bus_probe().
> >
> > This makes PIO0 forced after all resets. Another difference is that
> > reset itself is done after PIO0 is forced.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Acked-by: Alan Cox <[email protected]>
> > Signed-off-by: Jeff Garzik <[email protected]>
> >
> >
> > ATTENTION! The following patch solves the problem on my system, but please
> > be aware that I don't really know what I'm doing because I don't have the
> > big picture. There's surely a better way to check if the device/controller
> > is still functional than calling ata_link_{online,offline}.
>
> In the above example, even the reset sequence itself can cause hang if
> the hardware is implemented slightly differently. The reason why
> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
Another thing, whether it's poor luck or not, it worked like this than at
least 2.6.22. And it was _heavily_ tested. The above commit broke it
between 2.6.24-rc1 and 2.6.24-rc2, which is at least a regression.
Regards,
Holger
Hello,
Holger Macht wrote:
>> In the above example, even the reset sequence itself can cause hang if
>> the hardware is implemented slightly differently. The reason why
>> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
>
> Another thing, whether it's poor luck or not, it worked like this than at
> least 2.6.22. And it was _heavily_ tested. The above commit broke it
> between 2.6.24-rc1 and 2.6.24-rc2, which is at least a regression.
Yes, in a sense but the change is in the area where there's no defined
behavior, so I don't think it's a good idea to revert the change here.
Let's concentrate on finding the proper solution.
Thanks.
--
tejun
Hello,
Holger Macht wrote:
>> In the above example, even the reset sequence itself can cause hang if
>> the hardware is implemented slightly differently. The reason why
>> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
>> I think the proper fix is to tell libata to detach the cdrom before
>> undocking.
>
> Wouldn't the proper fix be to call ata_acpi_handle_hotplug _somewhere_?
> (which is currently called nowhere AFAICS)
It should be called via ata_acpi_{ap|dev}_notify() callbacks installed
via acpi_install_notify_handler(). Can you add dump_stack() in the
function and verify that it actually is being called? It could be that
the method is called too late or libata takes too long to actually
unplug the device. Hmmm... It seems what ata_acpi_handle_hotplug() does
isn't enough for undock. It probably should request detaching the
device instead of just notifying hotplug event. Anyways, please lemme
know whether and when the function is called.
> Anyway, kernel hackers keep telling me that the kernel should just do the
> right thing. Shouldn't userspace never be able to freeze the system?
Yeah, I think most things should be done automatically but it's true
that somethings are a bit awkward to handle in kernel. Also, if you're
root, you can almost always crash the machine from userland.
> It's completely ok for me to handle this from userspace, if that's the
> position of the libata developers.
Let's see whether we can fix the ACPI handler first.
> In this case, we should change the dock driver to default to
> immediate_undock=false, because otherwise it's far too risky to freeze the
> system.
I'm not too familiar with how docks work. Can you please explain
briefly what immediate_undock is?
Thanks.
--
tejun
On Mon, Feb 11, 2008 at 11:04:46AM +0100, Holger Macht wrote:
> Wouldn't the proper fix be to call ata_acpi_handle_hotplug _somewhere_?
> (which is currently called nowhere AFAICS)
I think so. The T61 at least generated ACPI dock and undock messages for
IDE master/slaves and we can use those.
On Mon 11. Feb - 20:16:38, Tejun Heo wrote:
> Hello,
>
> Holger Macht wrote:
> >> In the above example, even the reset sequence itself can cause hang if
> >> the hardware is implemented slightly differently. The reason why
> >> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
> >> I think the proper fix is to tell libata to detach the cdrom before
> >> undocking.
> >
> > Wouldn't the proper fix be to call ata_acpi_handle_hotplug _somewhere_?
> > (which is currently called nowhere AFAICS)
>
> It should be called via ata_acpi_{ap|dev}_notify() callbacks installed
> via acpi_install_notify_handler(). Can you add dump_stack() in the
> function and verify that it actually is being called? It could be that
> the method is called too late or libata takes too long to actually
> unplug the device. Hmmm... It seems what ata_acpi_handle_hotplug() does
> isn't enough for undock. It probably should request detaching the
> device instead of just notifying hotplug event. Anyways, please lemme
> know whether and when the function is called.
I already checked, it's never called AFAICS. And I couldn't find a place
where it should be installed, otherwise, I would have sent a patch. The
dock driver already calls the notify methods on devices in the dock
station before doing the real undock.
> > Anyway, kernel hackers keep telling me that the kernel should just do the
> > right thing. Shouldn't userspace never be able to freeze the system?
>
> Yeah, I think most things should be done automatically but it's true
> that somethings are a bit awkward to handle in kernel. Also, if you're
> root, you can almost always crash the machine from userland.
>
> > It's completely ok for me to handle this from userspace, if that's the
> > position of the libata developers.
>
> Let's see whether we can fix the ACPI handler first.
Yes,
> > In this case, we should change the dock driver to default to
> > immediate_undock=false, because otherwise it's far too risky to freeze the
> > system.
>
> I'm not too familiar with how docks work. Can you please explain
> briefly what immediate_undock is?
immediate_undock=1:
User presses undock button on the dock station, dock driver calls ACPI
undock method immediately.
immediate_undock=0:
User presses undock button on the dock station, dock driver throws uevent
and waits for userland to undock the system via sysfs.
immediate_undock is currently set to 1 by default.
Regards,
Holger
On Mon, Feb 11, 2008 at 11:29:07AM +0100, Holger Macht wrote:
> > In the above example, even the reset sequence itself can cause hang if
> > the hardware is implemented slightly differently. The reason why
> > set_piomode() locks up but reset sequence doesn't is simple dumb luck.
>
> Another thing, whether it's poor luck or not, it worked like this than at
> least 2.6.22. And it was _heavily_ tested. The above commit broke it
> between 2.6.24-rc1 and 2.6.24-rc2, which is at least a regression.
Not neccessarily. It may just be timing chance on your box.
I agree however we should be doing the reset after the PIO0 switch if it
turns out that the precise order of the two events matters.
Alan
> In the above example, even the reset sequence itself can cause hang if
> the hardware is implemented slightly differently. The reason why
> set_piomode() locks up but reset sequence doesn't is simple dumb luck.
> I think the proper fix is to tell libata to detach the cdrom before
> undocking.
Just a thought but which ICH variant is this. For the older ones we have
to turn IORDY off in software before an eject and they cannot fully
support hot unplug
Alan
Holger Macht wrote:
>> It should be called via ata_acpi_{ap|dev}_notify() callbacks installed
>> via acpi_install_notify_handler(). Can you add dump_stack() in the
>> function and verify that it actually is being called? It could be that
>> the method is called too late or libata takes too long to actually
>> unplug the device. Hmmm... It seems what ata_acpi_handle_hotplug() does
>> isn't enough for undock. It probably should request detaching the
>> device instead of just notifying hotplug event. Anyways, please lemme
>> know whether and when the function is called.
>
> I already checked, it's never called AFAICS. And I couldn't find a place
> where it should be installed, otherwise, I would have sent a patch. The
> dock driver already calls the notify methods on devices in the dock
> station before doing the real undock.
ata_acpi_associate() calls acpi_install_notify_handler() for each
device. Isn't that enough?
> immediate_undock=1:
> User presses undock button on the dock station, dock driver calls ACPI
> undock method immediately.
>
> immediate_undock=0:
> User presses undock button on the dock station, dock driver throws uevent
> and waits for userland to undock the system via sysfs.
>
> immediate_undock is currently set to 1 by default.
Thanks for the explanation.
--
tejun
On Mon 11. Feb - 22:11:32, Tejun Heo wrote:
> Holger Macht wrote:
> >> It should be called via ata_acpi_{ap|dev}_notify() callbacks installed
> >> via acpi_install_notify_handler(). Can you add dump_stack() in the
> >> function and verify that it actually is being called? It could be that
> >> the method is called too late or libata takes too long to actually
> >> unplug the device. Hmmm... It seems what ata_acpi_handle_hotplug() does
> >> isn't enough for undock. It probably should request detaching the
> >> device instead of just notifying hotplug event. Anyways, please lemme
> >> know whether and when the function is called.
> >
> > I already checked, it's never called AFAICS. And I couldn't find a place
> > where it should be installed, otherwise, I would have sent a patch. The
> > dock driver already calls the notify methods on devices in the dock
> > station before doing the real undock.
>
> ata_acpi_associate() calls acpi_install_notify_handler() for each
> device. Isn't that enough?
It should be. I tried once more and noticed that
ata_acpi_handle_hotplug(...) is called when the cdrom is about to be
removed via the bay driver (just removing the device, not the whole dock
station). Actually there is a connection between the bay and the dock
driver, and one of them should notice that the cdrom/bay device is
dependent on the dock, but I don't know what's going wrong here. Kristen
(CC) should definitely know more about this interaction...
On a related note, shouldn't ata_acpi_handle_hotplug delete the device
like what is done when doing
echo 1 > /sys/devices/.../block/sr0/device/delete
?
Regards,
Holger
Holger Macht wrote:
> On a related note, shouldn't ata_acpi_handle_hotplug delete the device
> like what is done when doing
>
> echo 1 > /sys/devices/.../block/sr0/device/delete
Yeap, when the ata_acpi_handle_hotplug() was added, the focus was
supporting hotplug when the controller itself doesn't have hotplug
notification. Removing device for undocking would need explicitly
killing it.
--
tejun
On Mon 11. Feb - 22:11:32, Tejun Heo wrote:
> Holger Macht wrote:
> >> It should be called via ata_acpi_{ap|dev}_notify() callbacks installed
> >> via acpi_install_notify_handler(). Can you add dump_stack() in the
> >> function and verify that it actually is being called? It could be that
> >> the method is called too late or libata takes too long to actually
> >> unplug the device. Hmmm... It seems what ata_acpi_handle_hotplug() does
> >> isn't enough for undock. It probably should request detaching the
> >> device instead of just notifying hotplug event. Anyways, please lemme
> >> know whether and when the function is called.
> >
> > I already checked, it's never called AFAICS. And I couldn't find a place
> > where it should be installed, otherwise, I would have sent a patch. The
> > dock driver already calls the notify methods on devices in the dock
> > station before doing the real undock.
>
> ata_acpi_associate() calls acpi_install_notify_handler() for each
> device. Isn't that enough?
I don't think it is. When the device/bay is inside a dock station, we need
to register for dock events, too. I've sent a patch in a different mail:
http://lkml.org/lkml/2008/2/14/123
Regards,
Holger