2008-08-29 21:12:43

by Elias Oltmanns

[permalink] [raw]
Subject: [RFC] Disk shock protection in GNU/Linux (take 2)

[ Resending with correct address for lkml, sorry. ]

Hi all,

this is the second version of the patch series I posted a month ago.
There are the following changes:

- ata_id_has_unload() checks for the major version of the ATA spec the
drive claims to comply with.
- A disk head unload request issued to a device will effectively cause
the same to be executed for all devices on the same port and stop all
I/O to that port. Tejun told me that modern CD/DVD writers should have
no difficulties to recover from buffer under-runs caused by such a
behaviour (I haven't had a chance to put it to the test).
- As for the part dealing with libata, I have been following Tejun's
advice to rely on EH for the purposes of serialisation and in order to
prevent spurious resets. Hopefully, this has turned out
satisfactorily. As a nice side effect I don't have to touch any scsi
stuff at all due to this approach.
- Various minor changes intended to optimise the code or simply to make
more compliant with kernel coding conventions.

Unless there are any immediate objections from anyone, could the
subsystem maintainers please voice their opinion whether these patches
are likely to make it into 2.6.28? For obvious reasons, I'd like to make
sure that the changes to libata and ide are introduced at the same time
even though they don't depend on each other technically. Does that make
the patch set a candidate for the mm tree, or should the patches go
through the libata and ide tree respectively?

Here are the short descriptions of the four patches (based on
next-20080829):

1. This is a small patch to ata.h in order to provide a simple check for
support of the unload feature as indicated in a device's ID.
2. Here disk head unloading is implemented in the libata subsystem.
3. The same for ide.
4. A little bit of documentation.

Regards,

Elias


2008-08-29 21:17:50

by Elias Oltmanns

[permalink] [raw]
Subject: [PATCH 1/4] Introduce ata_id_has_unload()

Add a function to check an ATA device's id for head unload support as
specified in ATA-7.

Signed-off-by: Elias Oltmanns <[email protected]>
---

include/linux/ata.h | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/include/linux/ata.h b/include/linux/ata.h
index 80364b6..d9a94bd 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -707,6 +707,23 @@ static inline int ata_id_has_dword_io(const u16 *id)
return 0;
}

+static inline int ata_id_has_unload(const u16 *id)
+{
+ /*
+ * ATA-7 specifies two places to indicate unload feature
+ * support. Since I don't really understand the difference,
+ * I'll just check both and only return zero if none of them
+ * indicates otherwise.
+ */
+ if (ata_id_major_version(id) >= 7
+ && (((id[ATA_ID_CFSSE] & 0xC000) == 0x4000
+ && id[ATA_ID_CFSSE] & (1 << 13))
+ || ((id[ATA_ID_CSF_DEFAULT] & 0xC000) == 0x4000
+ && (id[ATA_ID_CSF_DEFAULT] & (1 << 13)))))
+ return 1;
+ return 0;
+}
+
static inline int ata_id_current_chs_valid(const u16 *id)
{
/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command

2008-08-29 21:21:44

by Elias Oltmanns

[permalink] [raw]
Subject: [PATCH 2/4] libata: Implement disk shock protection support

On user request (through sysfs), the IDLE IMMEDIATE command with UNLOAD
FEATURE as specified in ATA-7 is issued to the device and processing of
the request queue is stopped thereafter until the speified timeout
expires or user space asks to resume normal operation. This is supposed
to prevent the heads of a hard drive from accidentally crashing onto the
platter when a heavy shock is anticipated (like a falling laptop
expected to hit the floor). In fact, the whole port stops processing
commands until the timeout has expired in order to avoid any resets due
to failed commands on another device.

Signed-off-by: Elias Oltmanns <[email protected]>
---

drivers/ata/ahci.c | 2
drivers/ata/ata_piix.c | 7 ++
drivers/ata/libata-core.c | 8 ++
drivers/ata/libata-eh.c | 51 +++++++++++
drivers/ata/libata-scsi.c | 205 +++++++++++++++++++++++++++++++++++++++++++++
drivers/ata/libata.h | 9 ++
include/linux/libata.h | 5 +
7 files changed, 287 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index c729e69..78281af 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -316,6 +316,8 @@ static struct device_attribute *ahci_shost_attrs[] = {

static struct device_attribute *ahci_sdev_attrs[] = {
&dev_attr_sw_activity,
+ &dev_attr_unload_feature,
+ &dev_attr_unload_heads,
NULL
};

diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index b1d08a8..9b42f8d 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -298,8 +298,15 @@ static struct pci_driver piix_pci_driver = {
#endif
};

+static struct device_attribute *piix_sdev_attrs[] = {
+ &dev_attr_unload_feature,
+ &dev_attr_unload_heads,
+ NULL
+};
+
static struct scsi_host_template piix_sht = {
ATA_BMDMA_SHT(DRV_NAME),
+ .sdev_attrs = piix_sdev_attrs,
};

static struct ata_port_operations piix_pata_ops = {
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 79e3a8e..f1e036f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5267,6 +5267,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
init_timer_deferrable(&ap->fastdrain_timer);
ap->fastdrain_timer.function = ata_eh_fastdrain_timerfn;
ap->fastdrain_timer.data = (unsigned long)ap;
+ ap->park_timer.function = ata_scsi_park_timeout;
+ init_timer(&ap->park_timer);

ap->cbl = ATA_CBL_NONE;

@@ -6138,6 +6140,11 @@ static int __init ata_init(void)
if (!ata_aux_wq)
goto free_wq;

+ if (ata_scsi_register_pm_notifier()) {
+ destroy_workqueue(ata_aux_wq);
+ goto free_wq;
+ }
+
printk(KERN_DEBUG "libata version " DRV_VERSION " loaded.\n");
return 0;

@@ -6153,6 +6160,7 @@ static void __exit ata_exit(void)
kfree(ata_force_tbl);
destroy_workqueue(ata_wq);
destroy_workqueue(ata_aux_wq);
+ ata_scsi_unregister_pm_notifier();
}

subsys_initcall(ata_init);
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index c1db2f2..af75d59 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2446,6 +2446,51 @@ int ata_eh_reset(struct ata_link *link, int classify,
goto retry;
}

+static void ata_eh_park_devs(struct ata_port *ap, int park)
+{
+ struct ata_link *link;
+ struct ata_device *dev;
+ struct ata_taskfile tf;
+ struct request_queue *q;
+ unsigned int err_mask;
+
+ ata_port_for_each_link(link, ap) {
+ ata_link_for_each_dev(dev, link) {
+ if (!dev->sdev)
+ continue;
+ ata_tf_init(dev, &tf);
+ q = dev->sdev->request_queue;
+ spin_lock_irq(q->queue_lock);
+ if (park) {
+ blk_stop_queue(q);
+ tf.command = ATA_CMD_IDLEIMMEDIATE;
+ tf.feature = 0x44;
+ tf.lbal = 0x4c;
+ tf.lbam = 0x4e;
+ tf.lbah = 0x55;
+ } else {
+ blk_start_queue(q);
+ tf.command = ATA_CMD_CHK_POWER;
+ }
+ spin_unlock(q->queue_lock);
+ spin_lock(ap->lock);
+ if (dev->flags & ATA_DFLAG_NO_UNLOAD) {
+ spin_unlock_irq(ap->lock);
+ continue;
+ }
+ spin_unlock_irq(ap->lock);
+
+ tf.flags |= ATA_TFLAG_DEVICE | ATA_TFLAG_ISADDR;
+ tf.protocol |= ATA_PROT_NODATA;
+ err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE,
+ NULL, 0, 0);
+ if ((err_mask || tf.lbal != 0xc4) && park)
+ ata_dev_printk(dev, KERN_ERR,
+ "head unload failed\n");
+ }
+ }
+}
+
static int ata_eh_revalidate_and_attach(struct ata_link *link,
struct ata_device **r_failed_dev)
{
@@ -2829,6 +2874,12 @@ int ata_eh_recover(struct ata_port *ap, ata_prereset_fn_t prereset,
}
}

+ if (ap->link.eh_context.i.action & ATA_EH_PARK) {
+ ata_eh_park_devs(ap, 1);
+ wait_event(ata_scsi_park_wq, !timer_pending(&ap->park_timer));
+ ata_eh_park_devs(ap, 0);
+ }
+
/* the rest */
ata_port_for_each_link(link, ap) {
struct ata_eh_context *ehc = &link->eh_context;
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 4d066ad..ffcc016 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -46,6 +46,7 @@
#include <linux/libata.h>
#include <linux/hdreg.h>
#include <linux/uaccess.h>
+#include <linux/suspend.h>

#include "libata.h"

@@ -113,6 +114,77 @@ static struct scsi_transport_template ata_scsi_transport_template = {
.user_scan = ata_scsi_user_scan,
};

+DECLARE_WAIT_QUEUE_HEAD(ata_scsi_park_wq);
+
+#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION)
+static atomic_t ata_scsi_park_count = ATOMIC_INIT(0);
+
+static int ata_scsi_pm_notifier(struct notifier_block *nb, unsigned long val,
+ void *null)
+{
+ switch (val) {
+ case PM_SUSPEND_PREPARE:
+ atomic_dec(&ata_scsi_park_count);
+ wait_event(ata_scsi_park_wq,
+ atomic_read(&ata_scsi_park_count) == -1);
+ break;
+ case PM_POST_SUSPEND:
+ atomic_inc(&ata_scsi_park_count);
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block ata_scsi_pm_notifier_block = {
+ .notifier_call = ata_scsi_pm_notifier,
+};
+
+int ata_scsi_register_pm_notifier(void)
+{
+ return register_pm_notifier(&ata_scsi_pm_notifier_block);
+}
+
+int ata_scsi_unregister_pm_notifier(void)
+{
+ return unregister_pm_notifier(&ata_scsi_pm_notifier_block);
+}
+
+static inline void ata_scsi_signal_unpark(void)
+{
+ atomic_dec(&ata_scsi_park_count);
+ wake_up_all(&ata_scsi_park_wq);
+}
+
+static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
+ unsigned long timeout)
+{
+ if (unlikely(atomic_inc_and_test(&ata_scsi_park_count))) {
+ ata_scsi_signal_unpark();
+ return -EBUSY;
+ }
+ if (mod_timer(timer, timeout)) {
+ atomic_dec(&ata_scsi_park_count);
+ return 1;
+ }
+
+ return 0;
+}
+#else /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
+static inline void ata_scsi_signal_unpark(void)
+{
+ wake_up_all(&ata_scsi_park_wq);
+}
+
+static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
+ unsigned long timeout)
+{
+ return mod_timer(timer, timeout);
+}
+#endif /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
+

static const struct {
enum link_pm value;
@@ -183,6 +255,136 @@ DEVICE_ATTR(link_power_management_policy, S_IRUGO | S_IWUSR,
ata_scsi_lpm_show, ata_scsi_lpm_put);
EXPORT_SYMBOL_GPL(dev_attr_link_power_management_policy);

+static ssize_t ata_scsi_park_show(struct device *device,
+ struct device_attribute *attr, char *buf)
+{
+ struct scsi_device *sdev = to_scsi_device(device);
+ struct ata_port *ap;
+ unsigned int seconds;
+
+ ap = ata_shost_to_port(sdev->host);
+
+ spin_lock_irq(ap->lock);
+ if (timer_pending(&ap->park_timer))
+ /*
+ * Adding 1 in order to guarantee nonzero value until timer
+ * has actually expired.
+ */
+ seconds = jiffies_to_msecs(ap->park_timer.expires - jiffies)
+ / 1000 + 1;
+ else
+ seconds = 0;
+ spin_unlock_irq(ap->lock);
+
+ return snprintf(buf, 20, "%u\n", seconds);
+}
+
+static ssize_t ata_scsi_park_store(struct device *device,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+#define MAX_PARK_TIMEOUT 30
+ struct scsi_device *sdev = to_scsi_device(device);
+ struct ata_port *ap;
+ struct ata_device *dev;
+ unsigned long seconds;
+ int rc;
+
+ rc = strict_strtoul(buf, 10, &seconds);
+ if (rc || seconds > MAX_PARK_TIMEOUT)
+ return -EINVAL;
+
+ ap = ata_shost_to_port(sdev->host);
+ dev = ata_scsi_find_dev(ap, sdev);
+ if (unlikely(!dev))
+ return -ENODEV;
+
+ spin_lock_irq(ap->lock);
+ if (dev->flags & ATA_DFLAG_NO_UNLOAD) {
+ rc = -EOPNOTSUPP;
+ goto unlock;
+ }
+
+ if (seconds) {
+ rc = ata_scsi_mod_park_timer(&ap->park_timer,
+ msecs_to_jiffies(seconds * 1000)
+ + jiffies);
+ if (!rc) {
+ ap->link.eh_info.action |= ATA_EH_PARK;
+ ata_port_schedule_eh(ap);
+ } else if (rc == 1)
+ rc = 0;
+ } else {
+ if (del_timer(&ap->park_timer))
+ ata_scsi_signal_unpark();
+ }
+unlock:
+ spin_unlock_irq(ap->lock);
+
+ return rc ? rc : len;
+}
+DEVICE_ATTR(unload_heads, S_IRUGO | S_IWUSR,
+ ata_scsi_park_show, ata_scsi_park_store);
+EXPORT_SYMBOL_GPL(dev_attr_unload_heads);
+
+static ssize_t ata_scsi_unload_feature_show(struct device *device,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct scsi_device *sdev = to_scsi_device(device);
+ struct ata_port *ap = ata_shost_to_port(sdev->host);
+ struct ata_device *dev = ata_scsi_find_dev(ap, sdev);
+ int val;
+
+ if (!dev)
+ return -ENODEV;
+ if (dev->class != ATA_DEV_ATA && dev->class != ATA_DEV_ATAPI)
+ return -EOPNOTSUPP;
+ spin_lock_irq(ap->lock);
+ val = !(dev->flags & ATA_DFLAG_NO_UNLOAD);
+ spin_unlock_irq(ap->lock);
+
+ return snprintf(buf, 4, "%u\n", val);
+}
+
+static ssize_t ata_scsi_unload_feature_store(struct device *device,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct scsi_device *sdev = to_scsi_device(device);
+ struct ata_port *ap;
+ struct ata_device *dev;
+ int val;
+
+ val = buf[0] - '0';
+ if ((val != 0 && val != 1) || (buf[1] != '\0' && buf[1] != '\n')
+ || buf[2] != '\0')
+ return -EINVAL;
+ ap = ata_shost_to_port(sdev->host);
+ dev = ata_scsi_find_dev(ap, sdev);
+ if (!dev)
+ return -ENODEV;
+ if (dev->class != ATA_DEV_ATA && dev->class != ATA_DEV_ATAPI)
+ return -EOPNOTSUPP;
+
+ spin_lock_irq(ap->lock);
+ if (val == 1)
+ dev->flags &= ~ATA_DFLAG_NO_UNLOAD;
+ else
+ dev->flags |= ATA_DFLAG_NO_UNLOAD;
+ spin_unlock_irq(ap->lock);
+
+ return len;
+}
+DEVICE_ATTR(unload_feature, S_IRUGO | S_IWUSR,
+ ata_scsi_unload_feature_show, ata_scsi_unload_feature_store);
+EXPORT_SYMBOL_GPL(dev_attr_unload_feature);
+
+void ata_scsi_park_timeout(unsigned long data)
+{
+ ata_scsi_signal_unpark();
+}
+
static void ata_scsi_set_sense(struct scsi_cmnd *cmd, u8 sk, u8 asc, u8 ascq)
{
cmd->result = (DRIVER_SENSE << 24) | SAM_STAT_CHECK_CONDITION;
@@ -954,6 +1156,9 @@ static int atapi_drain_needed(struct request *rq)
static int ata_scsi_dev_config(struct scsi_device *sdev,
struct ata_device *dev)
{
+ if (!ata_id_has_unload(dev->id))
+ dev->flags |= ATA_DFLAG_NO_UNLOAD;
+
/* configure max sectors */
blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);

diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index ade5c75..a486577 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -148,6 +148,15 @@ extern void ata_scsi_hotplug(struct work_struct *work);
extern void ata_schedule_scsi_eh(struct Scsi_Host *shost);
extern void ata_scsi_dev_rescan(struct work_struct *work);
extern int ata_bus_probe(struct ata_port *ap);
+extern wait_queue_head_t ata_scsi_park_wq;
+void ata_scsi_park_timeout(unsigned long data);
+#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION)
+extern int ata_scsi_register_pm_notifier(void);
+extern int ata_scsi_unregister_pm_notifier(void);
+#else
+static inline int ata_scsi_register_pm_notifier(void) { return 0; }
+static inline int ata_scsi_unregister_pm_notifier(void) { return 0; }
+#endif

/* libata-eh.c */
extern unsigned long ata_internal_cmd_timeout(struct ata_device *dev, u8 cmd);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 225bfc5..4b5e073 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -146,6 +146,7 @@ enum {
ATA_DFLAG_SPUNDOWN = (1 << 14), /* XXX: for spindown_compat */
ATA_DFLAG_SLEEPING = (1 << 15), /* device is sleeping */
ATA_DFLAG_DUBIOUS_XFER = (1 << 16), /* data transfer not verified */
+ ATA_DFLAG_NO_UNLOAD = (1 << 17), /* device doesn't support unload */
ATA_DFLAG_INIT_MASK = (1 << 24) - 1,

ATA_DFLAG_DETACH = (1 << 24),
@@ -319,6 +320,7 @@ enum {
ATA_EH_RESET = ATA_EH_SOFTRESET | ATA_EH_HARDRESET,
ATA_EH_ENABLE_LINK = (1 << 3),
ATA_EH_LPM = (1 << 4), /* link power management action */
+ ATA_EH_PARK = (1 << 5), /* unload heads and stop I/O */

ATA_EH_PERDEV_MASK = ATA_EH_REVALIDATE,

@@ -452,6 +454,8 @@ enum link_pm {
MEDIUM_POWER,
};
extern struct device_attribute dev_attr_link_power_management_policy;
+extern struct device_attribute dev_attr_unload_heads;
+extern struct device_attribute dev_attr_unload_feature;
extern struct device_attribute dev_attr_em_message_type;
extern struct device_attribute dev_attr_em_message;
extern struct device_attribute dev_attr_sw_activity;
@@ -714,6 +718,7 @@ struct ata_port {
int *pm_result;
enum link_pm pm_policy;

+ struct timer_list park_timer;
struct timer_list fastdrain_timer;
unsigned long fastdrain_cnt;


2008-08-29 21:27:43

by Elias Oltmanns

[permalink] [raw]
Subject: [PATCH 3/4] ide: Implement disk shock protection support

On user request (through sysfs), the IDLE IMMEDIATE command with UNLOAD
FEATURE as specified in ATA-7 is issued to the device and processing of
the request queue is stopped thereafter until the speified timeout
expires or user space asks to resume normal operation. This is supposed
to prevent the heads of a hard drive from accidentally crashing onto the
platter when a heavy shock is anticipated (like a falling laptop
expected to hit the floor). In fact, the whole port stops processing
commands until the timeout has expired in order to avoid resets due to
failed commands on another device.

Signed-off-by: Elias Oltmanns <[email protected]>
---

drivers/ide/ide-io.c | 30 +++++
drivers/ide/ide-probe.c | 3
drivers/ide/ide-taskfile.c | 10 +-
drivers/ide/ide.c | 287 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/ide.h | 17 ++-
5 files changed, 341 insertions(+), 6 deletions(-)

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index d0579f1..657c0d8 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -675,7 +675,33 @@ EXPORT_SYMBOL_GPL(ide_devset_execute);

static ide_startstop_t ide_special_rq(ide_drive_t *drive, struct request *rq)
{
+ ide_hwif_t *hwif = drive->hwif;
+ ide_task_t task;
+ struct ide_taskfile *tf = &task.tf;
+
+ memset(&task, 0, sizeof(task));
switch (rq->cmd[0]) {
+ case REQ_PARK_HEADS: {
+ struct completion *waiting = rq->end_io_data;
+
+ drive->sleep = drive->hwif->park_timer.expires;
+ drive->dev_flags |= IDE_DFLAG_SLEEPING;
+ complete(waiting);
+ if (drive->dev_flags & IDE_DFLAG_NO_UNLOAD) {
+ ide_end_request(drive, 1, 0);
+ return ide_stopped;
+ }
+ tf->command = ATA_CMD_IDLEIMMEDIATE;
+ tf->feature = 0x44;
+ tf->lbal = 0x4c;
+ tf->lbam = 0x4e;
+ tf->lbah = 0x55;
+ task.tf_flags |= IDE_TFLAG_CUSTOM_HANDLER;
+ break;
+ }
+ case REQ_UNPARK_HEADS:
+ tf->command = ATA_CMD_CHK_POWER;
+ break;
case REQ_DEVSET_EXEC:
{
int err, (*setfunc)(ide_drive_t *, int) = rq->special;
@@ -695,6 +721,10 @@ static ide_startstop_t ide_special_rq(ide_drive_t *drive, struct request *rq)
ide_end_request(drive, 0, 0);
return ide_stopped;
}
+ task.tf_flags |= IDE_TFLAG_TF | IDE_TFLAG_DEVICE;
+ task.rq = rq;
+ hwif->data_phase = task.data_phase = TASKFILE_NO_DATA;
+ return do_rw_taskfile(drive, &task);
}

static void ide_check_pm_state(ide_drive_t *drive, struct request *rq)
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index b5e54d2..789390b 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -842,6 +842,9 @@ static void ide_port_tune_devices(ide_hwif_t *hwif)

if (hwif->dma_ops)
ide_set_dma(drive);
+
+ if (!ata_id_has_unload(drive->id))
+ drive->dev_flags |= IDE_DFLAG_NO_UNLOAD;
}
}

diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
index a4c2d91..7f89127 100644
--- a/drivers/ide/ide-taskfile.c
+++ b/drivers/ide/ide-taskfile.c
@@ -152,7 +152,15 @@ static ide_startstop_t task_no_data_intr(ide_drive_t *drive)

if (!custom)
ide_end_drive_cmd(drive, stat, ide_read_error(drive));
- else if (tf->command == ATA_CMD_SET_MULTI)
+ else if (tf->command == ATA_CMD_IDLEIMMEDIATE) {
+ drive->hwif->tp_ops->tf_read(drive, task);
+ if (tf->lbal != 0xc4) {
+ printk(KERN_ERR "%s: head unloading failed!\n",
+ drive->name);
+ ide_tf_dump(drive->name, tf);
+ }
+ ide_end_drive_cmd(drive, stat, ide_read_error(drive));
+ } else if (tf->command == ATA_CMD_SET_MULTI)
drive->mult_count = drive->mult_req;

return ide_stopped;
diff --git a/drivers/ide/ide.c b/drivers/ide/ide.c
index a498245..75914aa 100644
--- a/drivers/ide/ide.c
+++ b/drivers/ide/ide.c
@@ -59,6 +59,7 @@
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/device.h>
+#include <linux/suspend.h>


/* default maximum number of failures */
@@ -77,6 +78,165 @@ DEFINE_MUTEX(ide_cfg_mtx);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(ide_lock);
EXPORT_SYMBOL(ide_lock);

+#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION)
+static atomic_t ide_park_count = ATOMIC_INIT(0);
+DECLARE_WAIT_QUEUE_HEAD(ide_park_wq);
+
+static int ide_pm_notifier(struct notifier_block *nb, unsigned long val,
+ void *null)
+{
+ switch (val) {
+ case PM_SUSPEND_PREPARE:
+ atomic_dec(&ide_park_count);
+ wait_event(ide_park_wq, atomic_read(&ide_park_count) == -1);
+ break;
+ case PM_POST_SUSPEND:
+ atomic_inc(&ide_park_count);
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block ide_pm_notifier_block = {
+ .notifier_call = ide_pm_notifier,
+};
+
+static inline int ide_register_pm_notifier(void)
+{
+ return register_pm_notifier(&ide_pm_notifier_block);
+}
+
+static inline int ide_unregister_pm_notifier(void)
+{
+ return unregister_pm_notifier(&ide_pm_notifier_block);
+}
+
+static inline void signal_unpark(void)
+{
+ atomic_dec(&ide_park_count);
+ wake_up_all(&ide_park_wq);
+}
+
+static inline int ide_mod_park_timer(struct timer_list *timer,
+ unsigned long timeout)
+{
+ if (unlikely(atomic_inc_and_test(&ide_park_count))) {
+ signal_unpark();
+ return -EBUSY;
+ }
+ if (mod_timer(timer, timeout)) {
+ signal_unpark();
+ return 1;
+ }
+
+ return 0;
+}
+#else /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
+static inline int ide_register_pm_notifier(void) { return 0; }
+
+static inline int ide_unregister_pm_notifier(void) { return 0; }
+
+static inline void signal_unpark(void) { }
+
+static inline int ide_mod_park_timer(struct timer_list *timer,
+ unsigned long timeout)
+{
+ return mod_timer(timer, timeout);
+}
+#endif /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
+
+static int issue_park_cmd(ide_drive_t *drive, struct completion *wait,
+ u8 op_code)
+{
+ ide_drive_t *odrive = drive;
+ ide_hwif_t *hwif = drive->hwif;
+ ide_hwgroup_t *hwgroup = hwif->hwgroup;
+ struct request_queue *q;
+ struct request *rq;
+ gfp_t gfp_mask = (op_code == REQ_PARK_HEADS) ? __GFP_WAIT : GFP_NOWAIT;
+ int count = 0;
+
+ do {
+ q = drive->queue;
+ if (drive->dev_flags & IDE_DFLAG_SLEEPING
+ && op_code == REQ_PARK_HEADS) {
+ drive->sleep = hwif->park_timer.expires;
+ goto next_step;
+ }
+
+ if (unlikely(drive->dev_flags & IDE_DFLAG_NO_UNLOAD
+ && op_code == REQ_UNPARK_HEADS))
+ goto resume;
+
+ spin_unlock_irq(&ide_lock);
+ rq = blk_get_request(q, READ, gfp_mask);
+ spin_lock_irq(&ide_lock);
+ if (unlikely(!rq))
+ goto resume;
+
+ rq->cmd[0] = op_code;
+ rq->cmd_len = 1;
+ rq->cmd_type = REQ_TYPE_SPECIAL;
+ rq->cmd_flags |= REQ_SOFTBARRIER;
+ __elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 0);
+ if (op_code == REQ_PARK_HEADS) {
+ rq->end_io_data = wait;
+ blk_stop_queue(q);
+ q->request_fn(q);
+ count++;
+ } else {
+resume:
+ drive->dev_flags &= ~IDE_DFLAG_SLEEPING;
+ if (hwgroup->sleeping) {
+ del_timer(&hwgroup->timer);
+ hwgroup->sleeping = 0;
+ hwgroup->busy = 0;
+ }
+ blk_start_queue(q);
+ }
+
+next_step:
+ do {
+ drive = drive->next;
+ } while (drive->hwif != hwif);
+ } while (drive != odrive);
+
+ return count;
+}
+
+static void unpark_work(struct work_struct *work)
+{
+ ide_hwif_t *hwif = container_of(work, ide_hwif_t, unpark_work);
+ ide_drive_t *drive;
+
+ mutex_lock(&ide_setting_mtx);
+ spin_lock_irq(&ide_lock);
+ if (unlikely(!hwif->present || timer_pending(&hwif->park_timer)))
+ goto done;
+
+ drive = hwif->hwgroup->drive;
+ while (drive->hwif != hwif)
+ drive = drive->next;
+
+ issue_park_cmd(drive, NULL, REQ_UNPARK_HEADS);
+done:
+ signal_unpark();
+ spin_unlock_irq(&ide_lock);
+ mutex_unlock(&ide_setting_mtx);
+ put_device(&hwif->gendev);
+}
+
+static void park_timeout(unsigned long data)
+{
+ ide_hwif_t *hwif = (ide_hwif_t *)data;
+
+ /* FIXME: Which work queue would be the right one? */
+ kblockd_schedule_work(NULL, &hwif->unpark_work);
+}
+
static void ide_port_init_devices_data(ide_hwif_t *);

/*
@@ -100,6 +260,11 @@ void ide_init_port_data(ide_hwif_t *hwif, unsigned int index)

hwif->tp_ops = &default_tp_ops;

+ INIT_WORK(&hwif->unpark_work, unpark_work);
+ hwif->park_timer.function = park_timeout;
+ hwif->park_timer.data = (unsigned long)hwif;
+ init_timer(&hwif->park_timer);
+
ide_port_init_devices_data(hwif);
}

@@ -581,6 +746,118 @@ static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
return sprintf(buf, "%s\n", (char *)&drive->id[ATA_ID_SERNO]);
}

+static ssize_t park_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ ide_drive_t *drive = to_ide_device(dev);
+ ide_hwif_t *hwif = drive->hwif;
+ unsigned int seconds;
+
+ spin_lock_irq(&ide_lock);
+ if (!(drive->dev_flags & IDE_DFLAG_PRESENT)) {
+ spin_unlock_irq(&ide_lock);
+ return -ENODEV;
+ }
+
+ if (timer_pending(&hwif->park_timer))
+ /*
+ * Adding 1 in order to guarantee nonzero value until timer
+ * has actually expired.
+ */
+ seconds = jiffies_to_msecs(hwif->park_timer.expires - jiffies)
+ / 1000 + 1;
+ else
+ seconds = 0;
+ spin_unlock_irq(&ide_lock);
+
+ return snprintf(buf, 20, "%u\n", seconds);
+}
+
+static ssize_t park_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+#define MAX_PARK_TIMEOUT 30
+ ide_drive_t *drive = to_ide_device(dev);
+ ide_hwif_t *hwif = drive->hwif;
+ DECLARE_COMPLETION_ONSTACK(wait);
+ unsigned long timeout;
+ int rc, count = 0;
+
+ rc = strict_strtoul(buf, 10, &timeout);
+ if (rc || timeout > MAX_PARK_TIMEOUT)
+ return -EINVAL;
+
+ mutex_lock(&ide_setting_mtx);
+ spin_lock_irq(&ide_lock);
+ if (unlikely(!(drive->dev_flags & IDE_DFLAG_PRESENT))) {
+ rc = -ENODEV;
+ goto unlock;
+ }
+ if (drive->dev_flags & IDE_DFLAG_NO_UNLOAD) {
+ rc = -EOPNOTSUPP;
+ goto unlock;
+ }
+
+ if (timeout) {
+ timeout = msecs_to_jiffies(timeout * 1000) + jiffies;
+ rc = ide_mod_park_timer(&hwif->park_timer, timeout);
+ if (unlikely(rc < 0))
+ goto unlock;
+ else if (rc)
+ rc = 0;
+ else
+ get_device(&hwif->gendev);
+ count = issue_park_cmd(drive, &wait, REQ_PARK_HEADS);
+ } else {
+ if (del_timer(&hwif->park_timer)) {
+ issue_park_cmd(drive, NULL, REQ_UNPARK_HEADS);
+ signal_unpark();
+ put_device(&hwif->gendev);
+ }
+ }
+
+unlock:
+ spin_unlock_irq(&ide_lock);
+
+ for (; count; count--)
+ wait_for_completion(&wait);
+ mutex_unlock(&ide_setting_mtx);
+
+ return rc ? rc : len;
+}
+
+ide_devset_rw_flag(no_unload, IDE_DFLAG_NO_UNLOAD);
+
+static ssize_t unload_feature_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ ide_drive_t *drive = to_ide_device(dev);
+ unsigned int val;
+
+ spin_lock_irq(&ide_lock);
+ val = !get_no_unload(drive);
+ spin_unlock_irq(&ide_lock);
+
+ return snprintf(buf, 4, "%u\n", val);
+}
+
+static ssize_t unload_feature_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ ide_drive_t *drive = to_ide_device(dev);
+ int val;
+
+ val = buf[0] - '0';
+ if ((val != 0 && val != 1)
+ || (buf[1] != '\0' && buf[1] != '\n') || buf[2] != '\0')
+ return -EINVAL;
+
+ val = ide_devset_execute(drive, &ide_devset_no_unload, !val);
+
+ return val ? val : len;
+}
+
static struct device_attribute ide_dev_attrs[] = {
__ATTR_RO(media),
__ATTR_RO(drivename),
@@ -588,6 +865,8 @@ static struct device_attribute ide_dev_attrs[] = {
__ATTR_RO(model),
__ATTR_RO(firmware),
__ATTR(serial, 0400, serial_show, NULL),
+ __ATTR(unload_feature, 0644, unload_feature_show, unload_feature_store),
+ __ATTR(unload_heads, 0644, park_show, park_store),
__ATTR_NULL
};

@@ -844,6 +1123,12 @@ static int __init ide_init(void)
goto out_port_class;
}

+ ret = ide_register_pm_notifier();
+ if (ret) {
+ class_destroy(ide_port_class);
+ goto out_port_class;
+ }
+
proc_ide_create();

return 0;
@@ -858,6 +1143,8 @@ static void __exit ide_exit(void)
{
proc_ide_destroy();

+ ide_unregister_pm_notifier();
+
class_destroy(ide_port_class);

bus_unregister(&ide_bus_type);
diff --git a/include/linux/ide.h b/include/linux/ide.h
index 3eece03..5e1ee98 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -156,6 +156,8 @@ enum {
*/
#define REQ_DRIVE_RESET 0x20
#define REQ_DEVSET_EXEC 0x21
+#define REQ_PARK_HEADS 0x22
+#define REQ_UNPARK_HEADS 0x23

/*
* Check for an interrupt and acknowledge the interrupt status
@@ -571,6 +573,8 @@ enum {
/* retrying in PIO */
IDE_DFLAG_DMA_PIO_RETRY = (1 << 25),
IDE_DFLAG_LBA = (1 << 26),
+ /* don't unload heads */
+ IDE_DFLAG_NO_UNLOAD = (1 << 27),
};

struct ide_drive_s {
@@ -818,6 +822,9 @@ typedef struct hwif_s {
unsigned sharing_irq: 1; /* 1 = sharing irq with another hwif */
unsigned sg_mapped : 1; /* sg_table and sg_nents are ready */

+ struct timer_list park_timer; /* protected by queue_lock */
+ struct work_struct unpark_work;
+
struct device gendev;
struct device *portdev;

@@ -950,6 +957,11 @@ __IDE_DEVSET(_name, 0, get_##_func, set_##_func)
#define ide_ext_devset_rw_sync(_name, _func) \
__IDE_DEVSET(_name, DS_SYNC, get_##_func, set_##_func)

+#define ide_devset_rw_flag(_name, _field) \
+ide_devset_get_flag(_name, _field); \
+ide_devset_set_flag(_name, _field); \
+IDE_DEVSET(_name, DS_SYNC, get_##_name, set_##_name)
+
#define ide_decl_devset(_name) \
extern const struct ide_devset ide_devset_##_name

@@ -969,11 +981,6 @@ ide_devset_get(_name, _field); \
ide_devset_set(_name, _field); \
IDE_DEVSET(_name, DS_SYNC, get_##_name, set_##_name)

-#define ide_devset_rw_flag(_name, _field) \
-ide_devset_get_flag(_name, _field); \
-ide_devset_set_flag(_name, _field); \
-IDE_DEVSET(_name, DS_SYNC, get_##_name, set_##_name)
-
struct ide_proc_devset {
const char *name;
const struct ide_devset *setting;

2008-08-29 21:29:40

by Elias Oltmanns

[permalink] [raw]
Subject: [PATCH 4/4] Add documentation for hard disk shock protection interface

Put some information (and pointers to more) into the kernel's doc tree,
describing briefly the interface to the kernel's disk head unloading
facility. Information about how to set up a complete shock protection
system under GNU/Linux can be found on the web and is referenced
accordingly.

Signed-off-by: Elias Oltmanns <[email protected]>
---

Documentation/laptops/disk-shock-protection.txt | 131 +++++++++++++++++++++++
1 files changed, 131 insertions(+), 0 deletions(-)
create mode 100644 Documentation/laptops/disk-shock-protection.txt

diff --git a/Documentation/laptops/disk-shock-protection.txt b/Documentation/laptops/disk-shock-protection.txt
new file mode 100644
index 0000000..bd483a3
--- /dev/null
+++ b/Documentation/laptops/disk-shock-protection.txt
@@ -0,0 +1,131 @@
+Hard disk shock protection
+==========================
+
+Author: Elias Oltmanns <[email protected]>
+Last modified: 2008-08-28
+
+
+0. Contents
+-----------
+
+1. Intro
+2. The interface
+3. References
+4. CREDITS
+
+
+1. Intro
+--------
+
+ATA/ATAPI-7 specifies the IDLE IMMEDIATE command with unload feature.
+Issuing this command should cause the drive to switch to idle mode and
+unload disk heads. This feature is being used in modern laptops in
+conjunction with accelerometers and appropriate software to implement
+a shock protection facility. The idea is to stop all I/O operations on
+the internal hard drive and park its heads on the ramp when critical
+situations are anticipated. The desire to have such a feature
+available on GNU/Linux systems has been the original motivation to
+implement a generic disk head parking interface in the Linux kernel.
+Please note, however, that other components have to be set up on your
+system in order to get disk shock protection working (see section
+3. Referneces below for pointers to more information about that).
+
+
+2. The interface
+----------------
+
+The interface works as follows: Writing an integer value to
+/sys/block/*/device/unload_heads will take the heads of the respective
+drive off the platter and block all I/O operations for the specified
+number of seconds. When the timeout expires and no further disk head
+park request has been issued in the meantime, normal operation will be
+resumed. The maximal value accepted for a timeout is 30 seconds.
+However, you can always reset a running timer to any value between 0
+and 30 by issuing a subsequent head park request before the timer of
+the previous one has expired. In particular, the total timeout can
+exceed 30 seconds and, more importantly, you can abort a timer and
+resume normal operation immediately by specifying a timeout of 0.
+Reading from /sys/block/*/device/unload_heads will report zero if no
+timer is running and the number of seconds until the timer expires
+otherwise.
+
+There is a technical detail of this implementation that may cause some
+confusion and should be discussed here. When a head park request has
+been issued to a device successfully, all I/O operations on the
+controller port this device is attached to will be deferred. That is
+to say, any other device that may be connected to the same port will
+be affected too. For that reason, head parking requests will be sent
+to all devices that support this feature sharing the same port before
+that port is taken offline, as it were. As far as PATA (old style IDE)
+configurations are concerned, there can only be two devices attached
+to any single port. In SATA world we have port multipliers which means
+that a user issued head parking request to one device may actually
+result in stopping I/O to a whole bunch of deviices. Hwoever, since
+this feature is supposed to be used on laptops and does not seem to be
+very useful in any other environment, there will be mostly one device
+per port. Even if the CD/DVD writer happens to be connected to the
+same port as the hard drive, it generally *should* recover just fine
+from the occasional buffer under-run incurred by a head park request
+to the HD.
+
+Write access to /sys/block/*/device/unload_heads is denied with
+-EOPNOTSUPP if the device does not support the unload feature. Read
+access, on the other hand, is granted on all devices, so it is easy to
+find out whether two devices share the same port and are subject to
+the limitation described in the previous paragraph. Just do, for
+example:
+
+# echo 30 > /sys/block/sda/device/unload_heads
+
+and check whether
+
+# cat /sys/block/device/sdb/unload_heads
+
+gives you a nonzero value (assuming, of course, there actually are
+devices sda and sdb up and running in your system).
+
+Finally, there are some hard drives that only comply with an earlier
+version of the ATA standard than ATA-7, but do support the unload
+feature nonetheless. Unfortunately, there is no safe way Linux can
+detect these devices, so you won't be able to write to the
+unload_heads attribute. If you know that your device really does
+support the unload feature (for instance, because the vendor of your
+laptop or the hard drive itself told you so), the you can tell the
+kernel to enable the usage of this feature for that drive by means of
+the unload_feature attribute:
+
+# echo 1 > /sys/block/*/device/unload_feature
+
+will enable the feature on that particular device, and giving 0
+instead of 1 will disable it again.
+
+
+3. References
+-------------
+
+There are several laptops from different brands featuring shock
+protection capabilities. As manufacturers have refused to support open
+source development of the required software components so far, Linux
+support for shock protection varies considerably between different
+hardware implementations. Ideally, this section should contain a list
+of pointers at different projects aiming at an implementation of shock
+protection on different systeems. Unfortunately, I only know of a
+single project which, although still considered experimental, is fit
+for use. Please feel free to add projects that have been the victims
+of my ignorance.
+
+- http://www.thinkwiki.org/wiki/HDAPS
+ See this page for information about Linux support of the hard disk
+ active protection system as implemented in IBM/Lenovo Thinkpads.
+ (FIXME: The information there will have to be updated once this
+ patch has been approved or the user interface has been agreed upon
+ at least.)
+
+
+4. CREDITS
+----------
+
+This implementation of disk head parking has been based on a patch
+originally published by Jon Escombe <[email protected]>. Assisted by
+various kernel developers, the author of this document has rewritten
+the original patch in order to make it fit for upstream submission.

2008-08-30 09:34:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Hello,

Elias Oltmanns wrote:
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index c729e69..78281af 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -316,6 +316,8 @@ static struct device_attribute *ahci_shost_attrs[] = {
>
> static struct device_attribute *ahci_sdev_attrs[] = {
> &dev_attr_sw_activity,
> + &dev_attr_unload_feature,
> + &dev_attr_unload_heads,
> NULL

Ehhh... This really should be in libata core layer. Please create the
default attrs and let ahci define its own.

> index b1d08a8..9b42f8d 100644
> --- a/drivers/ata/ata_piix.c
> +++ b/drivers/ata/ata_piix.c
> @@ -298,8 +298,15 @@ static struct pci_driver piix_pci_driver = {
> #endif
> };
>
> +static struct device_attribute *piix_sdev_attrs[] = {
> + &dev_attr_unload_feature,
> + &dev_attr_unload_heads,
> + NULL
> +};
> +
> static struct scsi_host_template piix_sht = {
> ATA_BMDMA_SHT(DRV_NAME),
> + .sdev_attrs = piix_sdev_attrs,
> };

Which would make this unnecessary and make disk unloading available to
all libata drivers.

> @@ -5267,6 +5267,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
> init_timer_deferrable(&ap->fastdrain_timer);
> ap->fastdrain_timer.function = ata_eh_fastdrain_timerfn;
> ap->fastdrain_timer.data = (unsigned long)ap;
> + ap->park_timer.function = ata_scsi_park_timeout;
> + init_timer(&ap->park_timer);

Why do you need a timeout when you can just msleep()?

> +static void ata_eh_park_devs(struct ata_port *ap, int park)
> +{
> + struct ata_link *link;
> + struct ata_device *dev;
> + struct ata_taskfile tf;
> + struct request_queue *q;
> + unsigned int err_mask;
> +
> + ata_port_for_each_link(link, ap) {
> + ata_link_for_each_dev(dev, link) {
> + if (!dev->sdev)
> + continue;

You probably want to do if (dev->class != ATA_DEV_ATA) here.

> + ata_tf_init(dev, &tf);
> + q = dev->sdev->request_queue;
> + spin_lock_irq(q->queue_lock);
> + if (park) {
> + blk_stop_queue(q);

Queue is already plugged when EH is entered. No need for this.

> + tf.command = ATA_CMD_IDLEIMMEDIATE;
> + tf.feature = 0x44;
> + tf.lbal = 0x4c;
> + tf.lbam = 0x4e;
> + tf.lbah = 0x55;
n> + } else {
> + blk_start_queue(q);

Neither this.

> + tf.command = ATA_CMD_CHK_POWER;
> + }
> + spin_unlock(q->queue_lock);
> + spin_lock(ap->lock);

And no need to play with locks at all.

> + if (dev->flags & ATA_DFLAG_NO_UNLOAD) {
> + spin_unlock_irq(ap->lock);
> + continue;
> + }
> + spin_unlock_irq(ap->lock);
> +
> + tf.flags |= ATA_TFLAG_DEVICE | ATA_TFLAG_ISADDR;
> + tf.protocol |= ATA_PROT_NODATA;
> + err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE,
> + NULL, 0, 0);
> + if ((err_mask || tf.lbal != 0xc4) && park)
> + ata_dev_printk(dev, KERN_ERR,
> + "head unload failed\n");
> + }
> + }
> +}
> +
> static int ata_eh_revalidate_and_attach(struct ata_link *link,
> struct ata_device **r_failed_dev)
> {
> @@ -2829,6 +2874,12 @@ int ata_eh_recover(struct ata_port *ap, ata_prereset_fn_t prereset,
> }
> }
>
> + if (ap->link.eh_context.i.action & ATA_EH_PARK) {
> + ata_eh_park_devs(ap, 1);
> + wait_event(ata_scsi_park_wq, !timer_pending(&ap->park_timer));

I would just msleep() here.

> + ata_eh_park_devs(ap, 0);

And does the device need this explicit wake up? It will wake up when
it's necessary.

> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 4d066ad..ffcc016 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -46,6 +46,7 @@
> #include <linux/libata.h>
> #include <linux/hdreg.h>
> #include <linux/uaccess.h>
> +#include <linux/suspend.h>
>
> #include "libata.h"
>
> @@ -113,6 +114,77 @@ static struct scsi_transport_template ata_scsi_transport_template = {
> .user_scan = ata_scsi_user_scan,
> };
>
> +DECLARE_WAIT_QUEUE_HEAD(ata_scsi_park_wq);
> +
> +#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION)
> +static atomic_t ata_scsi_park_count = ATOMIC_INIT(0);
> +
> +static int ata_scsi_pm_notifier(struct notifier_block *nb, unsigned long val,
> + void *null)
> +{
> + switch (val) {
> + case PM_SUSPEND_PREPARE:
> + atomic_dec(&ata_scsi_park_count);
> + wait_event(ata_scsi_park_wq,
> + atomic_read(&ata_scsi_park_count) == -1);
> + break;
> + case PM_POST_SUSPEND:
> + atomic_inc(&ata_scsi_park_count);
> + break;
> + default:
> + return NOTIFY_DONE;
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block ata_scsi_pm_notifier_block = {
> + .notifier_call = ata_scsi_pm_notifier,
> +};
> +
> +int ata_scsi_register_pm_notifier(void)
> +{
> + return register_pm_notifier(&ata_scsi_pm_notifier_block);
> +}
> +
> +int ata_scsi_unregister_pm_notifier(void)
> +{
> + return unregister_pm_notifier(&ata_scsi_pm_notifier_block);
> +}

Why are these PM notifiers necessary?

> +static inline void ata_scsi_signal_unpark(void)
> +{
> + atomic_dec(&ata_scsi_park_count);
> + wake_up_all(&ata_scsi_park_wq);
> +}
> +
> +static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
> + unsigned long timeout)
> +{
> + if (unlikely(atomic_inc_and_test(&ata_scsi_park_count))) {
> + ata_scsi_signal_unpark();
> + return -EBUSY;
> + }
> + if (mod_timer(timer, timeout)) {
> + atomic_dec(&ata_scsi_park_count);
> + return 1;
> + }
> +
> + return 0;
> +}
>
> +#else /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
> +static inline void ata_scsi_signal_unpark(void)
> +{
> + wake_up_all(&ata_scsi_park_wq);
> +}
> +
> +static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
> + unsigned long timeout)
> +{
> + return mod_timer(timer, timeout);
> +}
> +#endif /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */

And these all can go. If you're worried about recurring events you
can just update timestamp from the sysfs write function and do...

deadline = last_timestamp + delay;
while ((now = jiffies) < deadline) {
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(deadline - now);
set_current_state(TASK_RUNNING);
}

> +static ssize_t ata_scsi_unload_feature_store(struct device *device,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct scsi_device *sdev = to_scsi_device(device);
> + struct ata_port *ap;
> + struct ata_device *dev;
> + int val;
> +
> + val = buf[0] - '0';
> + if ((val != 0 && val != 1) || (buf[1] != '\0' && buf[1] != '\n')
> + || buf[2] != '\0')
> + return -EINVAL;
> + ap = ata_shost_to_port(sdev->host);
> + dev = ata_scsi_find_dev(ap, sdev);
> + if (!dev)
> + return -ENODEV;
> + if (dev->class != ATA_DEV_ATA && dev->class != ATA_DEV_ATAPI)
> + return -EOPNOTSUPP;
> +
> + spin_lock_irq(ap->lock);
> + if (val == 1)
> + dev->flags &= ~ATA_DFLAG_NO_UNLOAD;
> + else
> + dev->flags |= ATA_DFLAG_NO_UNLOAD;
> + spin_unlock_irq(ap->lock);
> +
> + return len;
> +}
> +DEVICE_ATTR(unload_feature, S_IRUGO | S_IWUSR,
> + ata_scsi_unload_feature_show, ata_scsi_unload_feature_store);
> +EXPORT_SYMBOL_GPL(dev_attr_unload_feature);

Hmmm.... Maybe you can just disable it by echoing -1 to the unload file?

Thanks.

--
tejun

2008-08-30 11:57:04

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH 1/4] Introduce ata_id_has_unload()

Hello.

Elias Oltmanns wrote:

> Add a function to check an ATA device's id for head unload support as
> specified in ATA-7.
>
> Signed-off-by: Elias Oltmanns <[email protected]>
>
[...]
> diff --git a/include/linux/ata.h b/include/linux/ata.h
> index 80364b6..d9a94bd 100644
> --- a/include/linux/ata.h
> +++ b/include/linux/ata.h
> @@ -707,6 +707,23 @@ static inline int ata_id_has_dword_io(const u16 *id)
> return 0;
> }
>
> +static inline int ata_id_has_unload(const u16 *id)
> +{
> + /*
> + * ATA-7 specifies two places to indicate unload feature
> + * support. Since I don't really understand the difference,
> + * I'll just check both and only return zero if none of them
> + * indicates otherwise.
>

If you read the comments to the words 82:84 and 85:87, they say that
the former indicate the supported features, and the latter indicate the
enabed features AND in case a feature can't be disabled, the latter
words will have the corresponding bit set. So it should be sufficient to
check only one word.

> + */
> + if (ata_id_major_version(id) >= 7
> + && (((id[ATA_ID_CFSSE] & 0xC000) == 0x4000
> + && id[ATA_ID_CFSSE] & (1 << 13))
> + || ((id[ATA_ID_CSF_DEFAULT] & 0xC000) == 0x4000
> + && (id[ATA_ID_CSF_DEFAULT] & (1 << 13)))))
>


I think that it's preferrable to leave the operator on the same line
with the first operand...

WBR, Sergei

2008-08-30 17:30:19

by Elias Oltmanns

[permalink] [raw]
Subject: Re: [PATCH 1/4] Introduce ata_id_has_unload()

Sergei Shtylyov <[email protected]> wrote:
> Hello.
>
> Elias Oltmanns wrote:
>
>> Add a function to check an ATA device's id for head unload support as
>> specified in ATA-7.
>>
>> Signed-off-by: Elias Oltmanns <[email protected]>
>>
> [...]
>> diff --git a/include/linux/ata.h b/include/linux/ata.h
>> index 80364b6..d9a94bd 100644
>> --- a/include/linux/ata.h
>> +++ b/include/linux/ata.h
>> @@ -707,6 +707,23 @@ static inline int ata_id_has_dword_io(const u16 *id)
>> return 0;
>> }
>> +static inline int ata_id_has_unload(const u16 *id)
>> +{
>> + /*
>> + * ATA-7 specifies two places to indicate unload feature
>> + * support. Since I don't really understand the difference,
>> + * I'll just check both and only return zero if none of them
>> + * indicates otherwise.
>>
>
> If you read the comments to the words 82:84 and 85:87, they say that
> the former indicate the supported features, and the latter indicate
> the enabed features AND in case a feature can't be disabled, the
> latter words will have the corresponding bit set. So it should be
> sufficient to check only one word.

Yes, I tend to agree with you and, in fact, I have been leaning in this
direction myself. However, there is something that really bothers me.
Both entries describing bit 13 of word 87 and 84 are worded alike. In
particular, it says *supported* in both places, whereas in the case of the
other features it would say enabled in one and supported in the other
place.

Well, I'm willing to drop the check for word 87 since I don't like it
myself. Due to my lack of personal experience with inexplicable
implemenations of ATA standards in hardware though, I have to take your
word that this is safe.

>
>> + */
>> + if (ata_id_major_version(id) >= 7
>> + && (((id[ATA_ID_CFSSE] & 0xC000) == 0x4000
>> + && id[ATA_ID_CFSSE] & (1 << 13))
>> + || ((id[ATA_ID_CSF_DEFAULT] & 0xC000) == 0x4000
>> + && (id[ATA_ID_CSF_DEFAULT] & (1 << 13)))))
>>
>
>
> I think that it's preferrable to leave the operator on the same line
> with the first operand...

Not having too strong an opinion about it, I just thought that an
operator at the beginning of the line was another indication (apart from
indentation) that this still belongs to the condition. Still, I can
change it for the next series round.

Regards,

Elias

2008-08-30 18:01:22

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH 1/4] Introduce ata_id_has_unload()

Elias Oltmanns wrote:

>>>diff --git a/include/linux/ata.h b/include/linux/ata.h
>>>index 80364b6..d9a94bd 100644
>>>--- a/include/linux/ata.h
>>>+++ b/include/linux/ata.h
>>>@@ -707,6 +707,23 @@ static inline int ata_id_has_dword_io(const u16 *id)
>>> return 0;
>>> }
>>> +static inline int ata_id_has_unload(const u16 *id)
>>>+{
>>>+ /*
>>>+ * ATA-7 specifies two places to indicate unload feature
>>>+ * support. Since I don't really understand the difference,
>>>+ * I'll just check both and only return zero if none of them
>>>+ * indicates otherwise.

>> If you read the comments to the words 82:84 and 85:87, they say that
>>the former indicate the supported features, and the latter indicate
>>the enabed features AND in case a feature can't be disabled, the
>>latter words will have the corresponding bit set. So it should be
>>sufficient to check only one word.

> Yes, I tend to agree with you and, in fact, I have been leaning in this
> direction myself. However, there is something that really bothers me.
> Both entries describing bit 13 of word 87 and 84 are worded alike. In
> particular, it says *supported* in both places, whereas in the case of the
> other features it would say enabled in one and supported in the other
> place.

I think it says "supported" where the feature can't be disabled and
"enabled" where it can. Otherwise, this would make a little sense indeed.
Hm, I even found a quote in ATA/PI-7 rev. 4b backing this claim (should've
pasted it into previous mail):

6.17.43 Words (87:85): Features/command sets enabled

Words (87:85) shall indicate features/command sets enabled. If a defined bit
is cleared to zero, the indicated features/command set is not enabled. If a
supported features/command set is supported and cannot be disabled, it is
defined as supported and the bit shall be set to one.

>>>+ */
>>>+ if (ata_id_major_version(id) >= 7
>>>+ && (((id[ATA_ID_CFSSE] & 0xC000) == 0x4000
>>>+ && id[ATA_ID_CFSSE] & (1 << 13))
>>>+ || ((id[ATA_ID_CSF_DEFAULT] & 0xC000) == 0x4000
>>>+ && (id[ATA_ID_CSF_DEFAULT] & (1 << 13)))))
>>>

>> I think that it's preferrable to leave the operator on the same line
>>with the first operand...

> Not having too strong an opinion about it, I just thought that an
> operator at the beginning of the line was another indication (apart from
> indentation) that this still belongs to the condition. Still, I can

Do we need *another* indication? :-)

> change it for the next series round.

> Regards,

> Elias

MBR, Sergei

2008-08-30 23:40:28

by Elias Oltmanns

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Tejun Heo <[email protected]> wrote:
> Hello,
>
> Elias Oltmanns wrote:
>> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
>> index c729e69..78281af 100644
>> --- a/drivers/ata/ahci.c
>> +++ b/drivers/ata/ahci.c
>> @@ -316,6 +316,8 @@ static struct device_attribute *ahci_shost_attrs[] = {
>>
>> static struct device_attribute *ahci_sdev_attrs[] = {
>> &dev_attr_sw_activity,
>> + &dev_attr_unload_feature,
>> + &dev_attr_unload_heads,
>> NULL
>
> Ehhh... This really should be in libata core layer. Please create the
> default attrs and let ahci define its own.

Right, will do.

[...]
>> @@ -5267,6 +5267,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
>> init_timer_deferrable(&ap->fastdrain_timer);
>> ap->fastdrain_timer.function = ata_eh_fastdrain_timerfn;
>> ap->fastdrain_timer.data = (unsigned long)ap;
>> + ap->park_timer.function = ata_scsi_park_timeout;
>> + init_timer(&ap->park_timer);
>
> Why do you need a timeout when you can just msleep()?

Maybe I'm missing something but I don't see how I could use msleep()
here, see below.

>
>> +static void ata_eh_park_devs(struct ata_port *ap, int park)
>> +{
>> + struct ata_link *link;
>> + struct ata_device *dev;
>> + struct ata_taskfile tf;
>> + struct request_queue *q;
>> + unsigned int err_mask;
>> +
>> + ata_port_for_each_link(link, ap) {
>> + ata_link_for_each_dev(dev, link) {
>> + if (!dev->sdev)
>> + continue;
>
> You probably want to do if (dev->class != ATA_DEV_ATA) here.

Well, I really am concerned about dev->sdev. So far, I haven't quite
figured out yet whether under which circumstances I can safely assume
that the scsi counter part of dev including the block layer request
queue has been completely set up and configured so there won't be any
null pointer dereferences. However, if you think that I needn't bother
with stopping the request queue anyway, checking for ATA_DEV_ATA (what
about ATA_DEV_ATAPI?) should definitely be enough.

>
>> + ata_tf_init(dev, &tf);
>> + q = dev->sdev->request_queue;
>> + spin_lock_irq(q->queue_lock);
>> + if (park) {
>> + blk_stop_queue(q);
>
> Queue is already plugged when EH is entered. No need for this.

Quite right. It's just that it will be un- and replugged every
(3 * HZ) / 1000, so I thought it might be worthwhile to stop the queue
anyway. Perhaps it really isn't worth bothering and the code would
certainly be nicer to look at.

>
>> + tf.command = ATA_CMD_IDLEIMMEDIATE;
>> + tf.feature = 0x44;
>> + tf.lbal = 0x4c;
>> + tf.lbam = 0x4e;
>> + tf.lbah = 0x55;
> n> + } else {
>> + blk_start_queue(q);
>
> Neither this.
>
>> + tf.command = ATA_CMD_CHK_POWER;
>> + }
>> + spin_unlock(q->queue_lock);
>> + spin_lock(ap->lock);
>
> And no need to play with locks at all.

Just to be sure, are you just referring to the queue lock, or to the host
lock as well? Am I wrong in thinking that we have to protect all access
to dev->flags because bit operations are performed non atomically
virtually at any time?

>
>> + if (dev->flags & ATA_DFLAG_NO_UNLOAD) {
>> + spin_unlock_irq(ap->lock);
>> + continue;
>> + }
>> + spin_unlock_irq(ap->lock);
>> +
>> + tf.flags |= ATA_TFLAG_DEVICE | ATA_TFLAG_ISADDR;
>> + tf.protocol |= ATA_PROT_NODATA;
>> + err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE,
>> + NULL, 0, 0);
>> + if ((err_mask || tf.lbal != 0xc4) && park)
>> + ata_dev_printk(dev, KERN_ERR,
>> + "head unload failed\n");
>> + }
>> + }
>> +}
>> +
>> static int ata_eh_revalidate_and_attach(struct ata_link *link,
>> struct ata_device **r_failed_dev)
>> {
>> @@ -2829,6 +2874,12 @@ int ata_eh_recover(struct ata_port *ap, ata_prereset_fn_t prereset,
>> }
>> }
>>
>> + if (ap->link.eh_context.i.action & ATA_EH_PARK) {
>> + ata_eh_park_devs(ap, 1);
>> + wait_event(ata_scsi_park_wq, !timer_pending(&ap->park_timer));
>
> I would just msleep() here.

Again, see below.

>
>> + ata_eh_park_devs(ap, 0);
>
> And does the device need this explicit wake up? It will wake up when
> it's necessary.

Probably, I should insert a comment somewhere. The problem is that
device internal power management will be disabled until the next command
is received. If you have laptop mode enabled and the device has received
the unload command while spinning with no more commands in the queue to
follow, the device may keep spinning for quite a while and won't go into
standby which rather defeats the purpose of laptop mode. This behaviour
is compliant with the specs and I can observe it on my system.

>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index 4d066ad..ffcc016 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -46,6 +46,7 @@
>> #include <linux/libata.h>
>> #include <linux/hdreg.h>
>> #include <linux/uaccess.h>
>> +#include <linux/suspend.h>
>>
>> #include "libata.h"
>>
>> @@ -113,6 +114,77 @@ static struct scsi_transport_template ata_scsi_transport_template = {
>> .user_scan = ata_scsi_user_scan,
>> };
>>
>> +DECLARE_WAIT_QUEUE_HEAD(ata_scsi_park_wq);
>> +
>> +#if defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION)
>> +static atomic_t ata_scsi_park_count = ATOMIC_INIT(0);
>> +
>> +static int ata_scsi_pm_notifier(struct notifier_block *nb, unsigned long val,
>> + void *null)
>> +{
>> + switch (val) {
>> + case PM_SUSPEND_PREPARE:
>> + atomic_dec(&ata_scsi_park_count);
>> + wait_event(ata_scsi_park_wq,
>> + atomic_read(&ata_scsi_park_count) == -1);
>> + break;
>> + case PM_POST_SUSPEND:
>> + atomic_inc(&ata_scsi_park_count);
>> + break;
>> + default:
>> + return NOTIFY_DONE;
>> + }
>> +
>> + return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block ata_scsi_pm_notifier_block = {
>> + .notifier_call = ata_scsi_pm_notifier,
>> +};
>> +
>> +int ata_scsi_register_pm_notifier(void)
>> +{
>> + return register_pm_notifier(&ata_scsi_pm_notifier_block);
>> +}
>> +
>> +int ata_scsi_unregister_pm_notifier(void)
>> +{
>> + return unregister_pm_notifier(&ata_scsi_pm_notifier_block);
>> +}
>
> Why are these PM notifiers necessary?

Since it's a user process that controls when we have to keep the heads
off the platter, a suspend operation has to be blocked *before* process
freezing when we happen to be in a precarious situation.

>
>> +static inline void ata_scsi_signal_unpark(void)
>> +{
>> + atomic_dec(&ata_scsi_park_count);
>> + wake_up_all(&ata_scsi_park_wq);
>> +}
>> +
>> +static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
>> + unsigned long timeout)
>> +{
>> + if (unlikely(atomic_inc_and_test(&ata_scsi_park_count))) {
>> + ata_scsi_signal_unpark();
>> + return -EBUSY;
>> + }
>> + if (mod_timer(timer, timeout)) {
>> + atomic_dec(&ata_scsi_park_count);
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>>
>> +#else /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
>> +static inline void ata_scsi_signal_unpark(void)
>> +{
>> + wake_up_all(&ata_scsi_park_wq);
>> +}
>> +
>> +static inline int ata_scsi_mod_park_timer(struct timer_list *timer,
>> + unsigned long timeout)
>> +{
>> + return mod_timer(timer, timeout);
>> +}
>> +#endif /* defined(CONFIG_PM_SLEEP) || defined(CONFIG_HIBERNATION) */
>
> And these all can go. If you're worried about recurring events you
> can just update timestamp from the sysfs write function and do...
>
> deadline = last_timestamp + delay;
> while ((now = jiffies) < deadline) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> schedule_timeout(deadline - now);
> set_current_state(TASK_RUNNING);
> }

Ah, I can see that this while loop can replace my call to wait_event() in
the eh sequence earlier on. However, I wonder how I am to replace the
call to mod_timer(). As you can see, we perform different actions
depending on whether the timer has merely been updated, or a new timer
has been started. Only in the latter case we want to schedule eh. In
order to achieve the equivalent in your setting while preventing any
races, I'd have to protect the deadline field in struct ata_port by the
host lock, i.e. something like:

spin_lock_irq(ap->lock);
now = jiffies;
rc = now > ap->deadline;
ap->deadline = now + timeout;
if (rc) {
ap->link.eh_info.action |= ATA_EH_PARK;
ata_port_schedule_eh(ap);
}
...
spin_unlock_irq(ap->lock);

and in the eh code a modified version of your loop:

spin_lock_irq(ap->lock);
while ((now = jiffies) < deadline) {
spin_unlock_irq(ap->lock);
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(deadline - now);
set_current_state(TASK_RUNNING);
spin_lock_irq(ap->lock);
}
spin_unlock_irq(ap->lock);

Is it worth all that or am I missing something? On the other hand, a
deadline field would occupy less space in the ata_port structure than a
timer_list field. What are your thoughts?

>
>> +static ssize_t ata_scsi_unload_feature_store(struct device *device,
>> + struct device_attribute *attr,
>> + const char *buf, size_t len)
>> +{
>> + struct scsi_device *sdev = to_scsi_device(device);
>> + struct ata_port *ap;
>> + struct ata_device *dev;
>> + int val;
>> +
>> + val = buf[0] - '0';
>> + if ((val != 0 && val != 1) || (buf[1] != '\0' && buf[1] != '\n')
>> + || buf[2] != '\0')
>> + return -EINVAL;
>> + ap = ata_shost_to_port(sdev->host);
>> + dev = ata_scsi_find_dev(ap, sdev);
>> + if (!dev)
>> + return -ENODEV;
>> + if (dev->class != ATA_DEV_ATA && dev->class != ATA_DEV_ATAPI)
>> + return -EOPNOTSUPP;
>> +
>> + spin_lock_irq(ap->lock);
>> + if (val == 1)
>> + dev->flags &= ~ATA_DFLAG_NO_UNLOAD;
>> + else
>> + dev->flags |= ATA_DFLAG_NO_UNLOAD;
>> + spin_unlock_irq(ap->lock);
>> +
>> + return len;
>> +}
>> +DEVICE_ATTR(unload_feature, S_IRUGO | S_IWUSR,
>> + ata_scsi_unload_feature_show, ata_scsi_unload_feature_store);
>> +EXPORT_SYMBOL_GPL(dev_attr_unload_feature);
>
> Hmmm.... Maybe you can just disable it by echoing -1 to the unload file?

Even though disabling it may be desirable in some cases, it's typically
*enabling* it that users will care about. Still, we can always accept -1
and -2 and I have to say I rather like the idea. Thanks for the
suggestion. Indeed, thank you very much for the thorough review.

Regards,

Elias

2008-08-31 09:26:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Hello,

Elias Oltmanns wrote:
> Tejun Heo <[email protected]> wrote:
>>> + ata_port_for_each_link(link, ap) {
>>> + ata_link_for_each_dev(dev, link) {
>>> + if (!dev->sdev)
>>> + continue;
>> You probably want to do if (dev->class != ATA_DEV_ATA) here.
>
> Well, I really am concerned about dev->sdev. So far, I haven't quite
> figured out yet whether under which circumstances I can safely assume
> that the scsi counter part of dev including the block layer request
> queue has been completely set up and configured so there won't be any
> null pointer dereferences. However, if you think that I needn't bother
> with stopping the request queue anyway, checking for ATA_DEV_ATA (what
> about ATA_DEV_ATAPI?) should definitely be enough.

Ah.. you need to part ATAPI too? If so, just test for
ata_dev_enabled(). One way or the other, there's no need to care
about SCSI or block layer when you're in EH.

>>> + ata_tf_init(dev, &tf);
>>> + q = dev->sdev->request_queue;
>>> + spin_lock_irq(q->queue_lock);
>>> + if (park) {
>>> + blk_stop_queue(q);
>> Queue is already plugged when EH is entered. No need for this.
>
> Quite right. It's just that it will be un- and replugged every
> (3 * HZ) / 1000, so I thought it might be worthwhile to stop the queue
> anyway. Perhaps it really isn't worth bothering and the code would
> certainly be nicer to look at.

While the EH is running, nothing gets through other than commands
issued by EH itself, so no need to worry about how upper layers would
behave.

>>> + spin_unlock(q->queue_lock);
>>> + spin_lock(ap->lock);
>> And no need to play with locks at all.
>
> Just to be sure, are you just referring to the queue lock, or to the host
> lock as well? Am I wrong in thinking that we have to protect all access
> to dev->flags because bit operations are performed non atomically
> virtually at any time?

Yes, when modifying the flags. You don't need to when testing a
feature.

>>> + ata_eh_park_devs(ap, 0);
>> And does the device need this explicit wake up? It will wake up when
>> it's necessary.
>
> Probably, I should insert a comment somewhere. The problem is that
> device internal power management will be disabled until the next command
> is received. If you have laptop mode enabled and the device has received
> the unload command while spinning with no more commands in the queue to
> follow, the device may keep spinning for quite a while and won't go into
> standby which rather defeats the purpose of laptop mode. This behaviour
> is compliant with the specs and I can observe it on my system.

Ah.. Okay. I somehow thought the command would spin down the disk.
It's just unloading the head. Please cross this one.

>>> +int ata_scsi_register_pm_notifier(void)
>>> +{
>>> + return register_pm_notifier(&ata_scsi_pm_notifier_block);
>>> +}
>>> +
>>> +int ata_scsi_unregister_pm_notifier(void)
>>> +{
>>> + return unregister_pm_notifier(&ata_scsi_pm_notifier_block);
>>> +}
>> Why are these PM notifiers necessary?
>
> Since it's a user process that controls when we have to keep the heads
> off the platter, a suspend operation has to be blocked *before* process
> freezing when we happen to be in a precarious situation.

Can you please elaborate a bit? The reloading is done by the kernel
after a timeout, right? What kind of precarious situation can the
kernel get into regarding suspend?

>> And these all can go. If you're worried about recurring events you
>> can just update timestamp from the sysfs write function and do...
>>
>> deadline = last_timestamp + delay;
>> while ((now = jiffies) < deadline) {
>> set_current_state(TASK_UNINTERRUPTIBLE);
>> schedule_timeout(deadline - now);
>> set_current_state(TASK_RUNNING);
>> }
>
> Ah, I can see that this while loop can replace my call to wait_event() in
> the eh sequence earlier on. However, I wonder how I am to replace the
> call to mod_timer(). As you can see, we perform different actions
> depending on whether the timer has merely been updated, or a new timer
> has been started. Only in the latter case we want to schedule eh. In
> order to achieve the equivalent in your setting while preventing any
> races, I'd have to protect the deadline field in struct ata_port by the
> host lock, i.e. something like:
>
> spin_lock_irq(ap->lock);
> now = jiffies;
> rc = now > ap->deadline;
> ap->deadline = now + timeout;
> if (rc) {
> ap->link.eh_info.action |= ATA_EH_PARK;
> ata_port_schedule_eh(ap);
> }
> ...
> spin_unlock_irq(ap->lock);

You can just do

spin_lock_irq(ap->lock);
ap->deadline = jiffies + timeout;
ap->link.eh_info.action |= ATA_EH_PARK;
ata_port_schedule_eh(ap);
spin_unlock_irq(ap->lock);

> and in the eh code a modified version of your loop:
>
> spin_lock_irq(ap->lock);
> while ((now = jiffies) < deadline) {
> spin_unlock_irq(ap->lock);
> set_current_state(TASK_UNINTERRUPTIBLE);
> schedule_timeout(deadline - now);
> set_current_state(TASK_RUNNING);
> spin_lock_irq(ap->lock);
> }
> spin_unlock_irq(ap->lock);

Yeah, this looks about right, but you can make it a bit simpler...

while (time_before((now = jiffies), ap->deadline))
schedule_timeout_uninterruptible(ap->deadline - now);

As locking on reader side doesn't mean much in cases like this. The
deadline update which triggered EH is guaranteed to be visible and the
the window between waking up from schedule_timeout and ap->deadline
dereference is way too small to be any meaningful.

> Is it worth all that or am I missing something? On the other hand, a
> deadline field would occupy less space in the ata_port structure than a
> timer_list field. What are your thoughts?

Is the above code more complex? I think it's simpler, no?

>>> +DEVICE_ATTR(unload_feature, S_IRUGO | S_IWUSR,
>>> + ata_scsi_unload_feature_show, ata_scsi_unload_feature_store);
>>> +EXPORT_SYMBOL_GPL(dev_attr_unload_feature);
>> Hmmm.... Maybe you can just disable it by echoing -1 to the unload file?
>
> Even though disabling it may be desirable in some cases, it's typically
> *enabling* it that users will care about. Still, we can always accept -1
> and -2 and I have to say I rather like the idea. Thanks for the
> suggestion. Indeed, thank you very much for the thorough review.

Oh.. what I meant was whether we need a separate sysfs node to
indicate whether unload feature is enabled or not but now I come to
think about it, that is per-device and the timer is per-port. :-)

Thanks.

--
tejun

2008-08-31 12:09:06

by Elias Oltmanns

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Tejun Heo <[email protected]> wrote:
> Hello,
>
> Elias Oltmanns wrote:
>> Tejun Heo <[email protected]> wrote:
>>>> + ata_port_for_each_link(link, ap) {
>>>> + ata_link_for_each_dev(dev, link) {
>>>> + if (!dev->sdev)
>>>> + continue;
>>> You probably want to do if (dev->class != ATA_DEV_ATA) here.
>>
>> Well, I really am concerned about dev->sdev. So far, I haven't quite
>> figured out yet whether under which circumstances I can safely assume
>> that the scsi counter part of dev including the block layer request
>> queue has been completely set up and configured so there won't be any
>> null pointer dereferences. However, if you think that I needn't bother
>> with stopping the request queue anyway, checking for ATA_DEV_ATA (what
>> about ATA_DEV_ATAPI?) should definitely be enough.
>
> Ah.. you need to part ATAPI too? If so, just test for
> ata_dev_enabled().

Well, I'm not quite sure really. Perhaps you are right and I'd better
leave ATAPI alone, especially given the problem that the unload command
might mess up a CD/DVD write operation. As long as no laptop HDs are
identified as ATAPI devices, there should be no problem with that.

> One way or the other, there's no need to care about SCSI or block
> layer when you're in EH.
>
>>>> + ata_tf_init(dev, &tf);
>>>> + q = dev->sdev->request_queue;
>>>> + spin_lock_irq(q->queue_lock);
>>>> + if (park) {
>>>> + blk_stop_queue(q);
>>> Queue is already plugged when EH is entered. No need for this.
>>
>> Quite right. It's just that it will be un- and replugged every
>> (3 * HZ) / 1000, so I thought it might be worthwhile to stop the queue
>> anyway. Perhaps it really isn't worth bothering and the code would
>> certainly be nicer to look at.
>
> While the EH is running, nothing gets through other than commands
> issued by EH itself, so no need to worry about how upper layers would
> behave.

Alright, I'll drop it.

>
>>>> + spin_unlock(q->queue_lock);
>>>> + spin_lock(ap->lock);
>>> And no need to play with locks at all.
>>
>> Just to be sure, are you just referring to the queue lock, or to the host
>> lock as well? Am I wrong in thinking that we have to protect all access
>> to dev->flags because bit operations are performed non atomically
>> virtually at any time?
>
> Yes, when modifying the flags. You don't need to when testing a
> feature.

Right.

[...]
>>>> +int ata_scsi_register_pm_notifier(void)
>>>> +{
>>>> + return register_pm_notifier(&ata_scsi_pm_notifier_block);
>>>> +}
>>>> +
>>>> +int ata_scsi_unregister_pm_notifier(void)
>>>> +{
>>>> + return unregister_pm_notifier(&ata_scsi_pm_notifier_block);
>>>> +}
>>> Why are these PM notifiers necessary?
>>
>> Since it's a user process that controls when we have to keep the heads
>> off the platter, a suspend operation has to be blocked *before* process
>> freezing when we happen to be in a precarious situation.
>
> Can you please elaborate a bit? The reloading is done by the kernel
> after a timeout, right? What kind of precarious situation can the
> kernel get into regarding suspend?

Sorry, I haven't expressed myself very clearly there, it seems. The user
space process detects some acceleration and starts writing timeouts to
the sysfs file. This causes the unload command to be issued to the
device and stops all I/O until the user space daemon decides that the
danger has passed, writes 0 to the sysfs file and leaves it alone
afterwards. Now, if the daemon happens to request head parking right at
the beginning of a suspend sequence, this means that we are in danger of
falling, i.e. we have to make sure that I/O is stopped until that user
space daemon gives the all-clear. However, suspending implies freezing
all processes which means that the daemon can't keep checking and
signalling to the kernel. The last timeout received before the daemon
has been frozen will expire and the suspend procedure goes ahead. By
means of the notifiers we can make sure that suspend is blocked until
the daemon says that everything is alright.

>
>>> And these all can go. If you're worried about recurring events you
>>> can just update timestamp from the sysfs write function and do...
>>>
>>> deadline = last_timestamp + delay;
>>> while ((now = jiffies) < deadline) {
>>> set_current_state(TASK_UNINTERRUPTIBLE);
>>> schedule_timeout(deadline - now);
>>> set_current_state(TASK_RUNNING);
>>> }
>>
>> Ah, I can see that this while loop can replace my call to wait_event() in
>> the eh sequence earlier on. However, I wonder how I am to replace the
>> call to mod_timer(). As you can see, we perform different actions
>> depending on whether the timer has merely been updated, or a new timer
>> has been started. Only in the latter case we want to schedule eh. In
>> order to achieve the equivalent in your setting while preventing any
>> races, I'd have to protect the deadline field in struct ata_port by the
>> host lock, i.e. something like:
>>
>> spin_lock_irq(ap->lock);
>> now = jiffies;
>> rc = now > ap->deadline;
>> ap->deadline = now + timeout;
>> if (rc) {
>> ap->link.eh_info.action |= ATA_EH_PARK;
>> ata_port_schedule_eh(ap);
>> }
>> ...
>> spin_unlock_irq(ap->lock);
>
> You can just do
>
> spin_lock_irq(ap->lock);
> ap->deadline = jiffies + timeout;
> ap->link.eh_info.action |= ATA_EH_PARK;
> ata_port_schedule_eh(ap);
> spin_unlock_irq(ap->lock);

Please note that I want to schedule EH (and thus the head unload -
check power command sequence) only once in the event of overlapping
timeouts. For instance, when the daemon sets a timeout of 2 seconds and
does so again after one second has elapsed, I want the following to
happen:

[ Daemon writes 2 to the unload_heads file ]
1. Set timer / deadline;
2. eh_info.action |= ATA_EH_PARK;
3. schedule EH;
4. execute EH and sleep, waiting for the timeout to expire;

[ Daemon writes 2 to the unload_heads file before the previous timeout
has expired. ]

5. update timer / deadline;
6. the EH thread keeps blocking until the new timeout has expired.

In particular, I don't want to reschedule EH in response to the second
write to the unload_heads file. Also, we have to consider the case where
the daemon signals to resume I/O prematurely by writing a timeout of 0.
In this case, the EH thread should be woken up immediately.

>
>> and in the eh code a modified version of your loop:
>>
>> spin_lock_irq(ap->lock);
>> while ((now = jiffies) < deadline) {
>> spin_unlock_irq(ap->lock);
>> set_current_state(TASK_UNINTERRUPTIBLE);
>> schedule_timeout(deadline - now);
>> set_current_state(TASK_RUNNING);
>> spin_lock_irq(ap->lock);
>> }
>> spin_unlock_irq(ap->lock);
>
> Yeah, this looks about right, but you can make it a bit simpler...
>
> while (time_before((now = jiffies), ap->deadline))
> schedule_timeout_uninterruptible(ap->deadline - now);
>
> As locking on reader side doesn't mean much in cases like this. The
> deadline update which triggered EH is guaranteed to be visible and the
> the window between waking up from schedule_timeout and ap->deadline
> dereference is way too small to be any meaningful.

Well, I'm persuaded as far as locking is concerned. Still, the problems
described above remain.

[...]
>>>> +DEVICE_ATTR(unload_feature, S_IRUGO | S_IWUSR,
>>>> + ata_scsi_unload_feature_show, ata_scsi_unload_feature_store);
>>>> +EXPORT_SYMBOL_GPL(dev_attr_unload_feature);
>>> Hmmm.... Maybe you can just disable it by echoing -1 to the unload file?
>>
>> Even though disabling it may be desirable in some cases, it's typically
>> *enabling* it that users will care about. Still, we can always accept -1
>> and -2 and I have to say I rather like the idea. Thanks for the
>> suggestion. Indeed, thank you very much for the thorough review.
>
> Oh.. what I meant was whether we need a separate sysfs node to
> indicate whether unload feature is enabled or not but now I come to
> think about it, that is per-device and the timer is per-port. :-)

That's no problem. The unload_heads node is per-device too even though
the timer is per port. I really don't think we need the extra node.

Regards,

Elias

2008-08-31 13:04:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Hello,

Elias Oltmanns wrote:
>> Ah.. you need to part ATAPI too? If so, just test for
>> ata_dev_enabled().
>
> Well, I'm not quite sure really. Perhaps you are right and I'd better
> leave ATAPI alone, especially given the problem that the unload command
> might mess up a CD/DVD write operation. As long as no laptop HDs are
> identified as ATAPI devices, there should be no problem with that.

Hmm... I think it would be safer to stick with ATA for the time being.

>> Can you please elaborate a bit? The reloading is done by the kernel
>> after a timeout, right? What kind of precarious situation can the
>> kernel get into regarding suspend?
>
> Sorry, I haven't expressed myself very clearly there, it seems. The user
> space process detects some acceleration and starts writing timeouts to
> the sysfs file. This causes the unload command to be issued to the
> device and stops all I/O until the user space daemon decides that the
> danger has passed, writes 0 to the sysfs file and leaves it alone
> afterwards. Now, if the daemon happens to request head parking right at
> the beginning of a suspend sequence, this means that we are in danger of
> falling, i.e. we have to make sure that I/O is stopped until that user
> space daemon gives the all-clear. However, suspending implies freezing
> all processes which means that the daemon can't keep checking and
> signalling to the kernel. The last timeout received before the daemon
> has been frozen will expire and the suspend procedure goes ahead. By
> means of the notifiers we can make sure that suspend is blocked until
> the daemon says that everything is alright.

Is it really worth protecting against that? What if the machine
started to fall after the userland tasks have been frozen? And how
long the usual timeout would be? If the machine has been falling for
10 secs, there really isn't much point in trying to protect anything
unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
revision.

In libata, as with any other exceptions, suspend/resume are handled by
EH so while emergency head unload is in progress, suspend won't
commence which is about the same effect as the posted code sans the
timeout extension part. I don't really think there's any significant
danger in not being able to extend timeout while suspend is in
progress. It's not a very big window after all. If you're really
worried about it, you can also let libata reject suspend if head
unload is in progress.

Also, the suspend operation is unloading the head and spin down the
drive which sound like a good thing to do before crashing. Maybe we
can modify the suspend sequence such that it always unload the head
first and then issue spindown. That will ensure the head is in safe
position as soon as possible. If it's done this way, it'll be
probably a good idea to split unloading and loading operations and do
loading only when EH is being finished and the disk is not spun down.

To me, much more serious problem seems to be during hibernation. The
kernel is actively writing memory image to userland and it takes quite
a while and there's no protection whatsoever during that time.

>> spin_lock_irq(ap->lock);
>> ap->deadline = jiffies + timeout;
>> ap->link.eh_info.action |= ATA_EH_PARK;
>> ata_port_schedule_eh(ap);
>> spin_unlock_irq(ap->lock);
>
> Please note that I want to schedule EH (and thus the head unload -
> check power command sequence) only once in the event of overlapping
> timeouts. For instance, when the daemon sets a timeout of 2 seconds and
> does so again after one second has elapsed, I want the following to
> happen:
>
> [ Daemon writes 2 to the unload_heads file ]
> 1. Set timer / deadline;
> 2. eh_info.action |= ATA_EH_PARK;
> 3. schedule EH;
> 4. execute EH and sleep, waiting for the timeout to expire;
>
> [ Daemon writes 2 to the unload_heads file before the previous timeout
> has expired. ]
>
> 5. update timer / deadline;
> 6. the EH thread keeps blocking until the new timeout has expired.
>
> In particular, I don't want to reschedule EH in response to the second
> write to the unload_heads file. Also, we have to consider the case where
> the daemon signals to resume I/O prematurely by writing a timeout of 0.
> In this case, the EH thread should be woken up immediately.

Whether EH is scheduled multiple times or not doesn't matter at all.
EH can be happily scheduled without any actual action to do and that
does happen from time to time due to asynchronous nature of events.
libata EH doesn't have any problem with that. The only thing that's
required is there's at least one ata_schedule_eh() after the latest
EH-worthy event. So, the simpler code might enter EH one more time
once in the blue moon, but it won't do any harm. EH will just look
around and realize that there's nothing much to do and just exit.

Thanks.

--
tejun

Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support


Hi,

On Sunday 31 August 2008, Tejun Heo wrote:
> Hello,
>
> Elias Oltmanns wrote:
> >> Ah.. you need to part ATAPI too? If so, just test for
> >> ata_dev_enabled().
> >
> > Well, I'm not quite sure really. Perhaps you are right and I'd better
> > leave ATAPI alone, especially given the problem that the unload command
> > might mess up a CD/DVD write operation. As long as no laptop HDs are
> > identified as ATAPI devices, there should be no problem with that.
>
> Hmm... I think it would be safer to stick with ATA for the time being.

Seconded. To be honest I also don't like the change to issue UNLOAD to
all devices on the port (it only needlessly increases complexity right now
since the _only_ use case at the moment is ThinkPad w/ hdaps + 1 HD).

[ I really hoped for the minimal initial implementation. ]

> >> Can you please elaborate a bit? The reloading is done by the kernel
> >> after a timeout, right? What kind of precarious situation can the
> >> kernel get into regarding suspend?
> >
> > Sorry, I haven't expressed myself very clearly there, it seems. The user
> > space process detects some acceleration and starts writing timeouts to
> > the sysfs file. This causes the unload command to be issued to the
> > device and stops all I/O until the user space daemon decides that the
> > danger has passed, writes 0 to the sysfs file and leaves it alone
> > afterwards. Now, if the daemon happens to request head parking right at
> > the beginning of a suspend sequence, this means that we are in danger of
> > falling, i.e. we have to make sure that I/O is stopped until that user
> > space daemon gives the all-clear. However, suspending implies freezing
> > all processes which means that the daemon can't keep checking and
> > signalling to the kernel. The last timeout received before the daemon
> > has been frozen will expire and the suspend procedure goes ahead. By
> > means of the notifiers we can make sure that suspend is blocked until
> > the daemon says that everything is alright.
>
> Is it really worth protecting against that? What if the machine
> started to fall after the userland tasks have been frozen? And how
> long the usual timeout would be? If the machine has been falling for
> 10 secs, there really isn't much point in trying to protect anything
> unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
> revision.
>
> In libata, as with any other exceptions, suspend/resume are handled by
> EH so while emergency head unload is in progress, suspend won't
> commence which is about the same effect as the posted code sans the
> timeout extension part. I don't really think there's any significant
> danger in not being able to extend timeout while suspend is in
> progress. It's not a very big window after all. If you're really
> worried about it, you can also let libata reject suspend if head
> unload is in progress.
>
> Also, the suspend operation is unloading the head and spin down the
> drive which sound like a good thing to do before crashing. Maybe we
> can modify the suspend sequence such that it always unload the head
> first and then issue spindown. That will ensure the head is in safe
> position as soon as possible. If it's done this way, it'll be
> probably a good idea to split unloading and loading operations and do
> loading only when EH is being finished and the disk is not spun down.
>
> To me, much more serious problem seems to be during hibernation. The
> kernel is actively writing memory image to userland and it takes quite
> a while and there's no protection whatsoever during that time.

Which also brings again the question whether it is really the best to
use user-space solution instead of kernel thread?

After taking the look into the deamon program and hdaps driver I tend
to "Nope." answer. The kernel-space solution would be more reliable,
should result in significatly less code and would free us from having
a special purpose libata/ide interfaces. It should also make the
maintainance and future enhancements (i.e. making hibernation unload
friendly) a lot easier.

I imagine that this comes a bit late but can we at least give it an
another thought, please?

Thanks,
Bart

2008-08-31 16:15:07

by Elias Oltmanns

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Tejun Heo <[email protected]> wrote:
> Hello,
>
> Elias Oltmanns wrote:
[...]
>> Sorry, I haven't expressed myself very clearly there, it seems. The user
>> space process detects some acceleration and starts writing timeouts to
>> the sysfs file. This causes the unload command to be issued to the
>> device and stops all I/O until the user space daemon decides that the
>> danger has passed, writes 0 to the sysfs file and leaves it alone
>> afterwards. Now, if the daemon happens to request head parking right at
>> the beginning of a suspend sequence, this means that we are in danger of
>> falling, i.e. we have to make sure that I/O is stopped until that user
>> space daemon gives the all-clear. However, suspending implies freezing
>> all processes which means that the daemon can't keep checking and
>> signalling to the kernel. The last timeout received before the daemon
>> has been frozen will expire and the suspend procedure goes ahead. By
>> means of the notifiers we can make sure that suspend is blocked until
>> the daemon says that everything is alright.
>
> Is it really worth protecting against that? What if the machine
> started to fall after the userland tasks have been frozen? And how
> long the usual timeout would be? If the machine has been falling for
> 10 secs, there really isn't much point in trying to protect anything
> unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
> revision.

We are not just protecting against free fall. Think of a series of minor
shocks in close succession as it might occasionally happen on the train.

>
> In libata, as with any other exceptions, suspend/resume are handled by
> EH so while emergency head unload is in progress, suspend won't
> commence which is about the same effect as the posted code sans the
> timeout extension part. I don't really think there's any significant
> danger in not being able to extend timeout while suspend is in
> progress. It's not a very big window after all. If you're really
> worried about it, you can also let libata reject suspend if head
> unload is in progress.

Personaaly, I'm very much against plain rejection because of the odd
head unload. A delay in the suspend procedure, on the other hand, is
perfectly acceptable to me.

>
> Also, the suspend operation is unloading the head and spin down the
> drive which sound like a good thing to do before crashing. Maybe we
> can modify the suspend sequence such that it always unload the head
> first and then issue spindown. That will ensure the head is in safe
> position as soon as possible. If it's done this way, it'll be
> probably a good idea to split unloading and loading operations and do
> loading only when EH is being finished and the disk is not spun down.

Well, scsi midlayer will also issue a flush cache command. Besides, with
previous implementations I have observed occasional lock ups when
suspending while the unload timer was running. Once we have settled the
timer vs deadline issue, I'm willing to do some more investigation in
this area if you really insist that pm_notifiers should be avoided. But
then I am still not too sure about your reasoning and do feel happier
with these notifiers anyway.

>
> To me, much more serious problem seems to be during hibernation. The
> kernel is actively writing memory image to userland and it takes quite
> a while and there's no protection whatsoever during that time.

That's right. The first requirement to protect against this problem is
to have the policy all in kernel space which isn't going to happen for
some time yet. This really is a *best effort* solution rather than a
perfect one.

>
>>> spin_lock_irq(ap->lock);
>>> ap->deadline = jiffies + timeout;
>>> ap->link.eh_info.action |= ATA_EH_PARK;
>>> ata_port_schedule_eh(ap);
>>> spin_unlock_irq(ap->lock);
>>
>> Please note that I want to schedule EH (and thus the head unload -
>> check power command sequence) only once in the event of overlapping
>> timeouts. For instance, when the daemon sets a timeout of 2 seconds and
>> does so again after one second has elapsed, I want the following to
>> happen:
>>
>> [ Daemon writes 2 to the unload_heads file ]
>> 1. Set timer / deadline;
>> 2. eh_info.action |= ATA_EH_PARK;
>> 3. schedule EH;
>> 4. execute EH and sleep, waiting for the timeout to expire;
>>
>> [ Daemon writes 2 to the unload_heads file before the previous timeout
>> has expired. ]
>>
>> 5. update timer / deadline;
>> 6. the EH thread keeps blocking until the new timeout has expired.
>>
>> In particular, I don't want to reschedule EH in response to the second
>> write to the unload_heads file. Also, we have to consider the case where
>> the daemon signals to resume I/O prematurely by writing a timeout of 0.
>> In this case, the EH thread should be woken up immediately.
>
> Whether EH is scheduled multiple times or not doesn't matter at all.
> EH can be happily scheduled without any actual action to do and that
> does happen from time to time due to asynchronous nature of events.
> libata EH doesn't have any problem with that. The only thing that's
> required is there's at least one ata_schedule_eh() after the latest
> EH-worthy event. So, the simpler code might enter EH one more time
> once in the blue moon, but it won't do any harm. EH will just look
> around and realize that there's nothing much to do and just exit.

The whole EH machinery is a very complex beast. Any user of the
emergency head park facility has a particular interest that the system
spends as little time as possible in the EH code even if it's real error
recovery that matters most. Perhaps we could agree on the following
compromise:

spin_lock_irq(ap->lock);
old_deadline = ap->deadline;
ap->deadline = jiffies + timeout;
if (old_deadline < jiffies) {
ap->link.eh_info.action |= ATA_EH_PARK;
ata_port_schedule_eh(ap);
}
spin_unlock_irq(ap->lock);

There is still a race but it is very unlikely to trigger.

Still, you have dismissed my point about the equivalent of stopping a
running timer by specifying a 0 timeout. In fact, whenever the new
deadline is *before* the old deadline, we have to signal the sleeping EH
thread to wake up in time. This way we end up with something like
wait_event().

Regards,

Elias

2008-08-31 17:08:17

by Elias Oltmanns

[permalink] [raw]
Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> Hi,
>
> On Sunday 31 August 2008, Tejun Heo wrote:
>> Hello,
>>
>> Elias Oltmanns wrote:
>> >> Ah.. you need to part ATAPI too? If so, just test for
>> >> ata_dev_enabled().
>> >
>> > Well, I'm not quite sure really. Perhaps you are right and I'd better
>> > leave ATAPI alone, especially given the problem that the unload command
>> > might mess up a CD/DVD write operation. As long as no laptop HDs are
>> > identified as ATAPI devices, there should be no problem with that.
>>
>> Hmm... I think it would be safer to stick with ATA for the time being.
>
> Seconded. To be honest I also don't like the change to issue UNLOAD to
> all devices on the port (it only needlessly increases complexity right now
> since the _only_ use case at the moment is ThinkPad w/ hdaps + 1 HD).

Admittedly, I don't know very much about it myself but I seem to
remember that there are other vendors now shipping similar technology.
Even in the Thinkpad case I don't *know* that all relevant models have
HD and optical drives on seperate ports although I'm willing to believe
it if somebody tells me so.

Anyway, I've added Henrique to the Cc: list since he knows far more
about Thinkpads than I do and possibly about other notebooks too.

>
> [ I really hoped for the minimal initial implementation. ]

There is a lot to be said for the per-port solution as far as libata is
concerned. For the sake of consistency I tried to mimic the same
behaviour in ide but I agree that it makes things more complex there.

>
>> >> Can you please elaborate a bit? The reloading is done by the kernel
>> >> after a timeout, right? What kind of precarious situation can the
>> >> kernel get into regarding suspend?
>> >
>> > Sorry, I haven't expressed myself very clearly there, it seems. The user
>> > space process detects some acceleration and starts writing timeouts to
>> > the sysfs file. This causes the unload command to be issued to the
>> > device and stops all I/O until the user space daemon decides that the
>> > danger has passed, writes 0 to the sysfs file and leaves it alone
>> > afterwards. Now, if the daemon happens to request head parking right at
>> > the beginning of a suspend sequence, this means that we are in danger of
>> > falling, i.e. we have to make sure that I/O is stopped until that user
>> > space daemon gives the all-clear. However, suspending implies freezing
>> > all processes which means that the daemon can't keep checking and
>> > signalling to the kernel. The last timeout received before the daemon
>> > has been frozen will expire and the suspend procedure goes ahead. By
>> > means of the notifiers we can make sure that suspend is blocked until
>> > the daemon says that everything is alright.
>>
>> Is it really worth protecting against that? What if the machine
>> started to fall after the userland tasks have been frozen? And how
>> long the usual timeout would be? If the machine has been falling for
>> 10 secs, there really isn't much point in trying to protect anything
>> unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
>> revision.
>>
>> In libata, as with any other exceptions, suspend/resume are handled by
>> EH so while emergency head unload is in progress, suspend won't
>> commence which is about the same effect as the posted code sans the
>> timeout extension part. I don't really think there's any significant
>> danger in not being able to extend timeout while suspend is in
>> progress. It's not a very big window after all. If you're really
>> worried about it, you can also let libata reject suspend if head
>> unload is in progress.
>>
>> Also, the suspend operation is unloading the head and spin down the
>> drive which sound like a good thing to do before crashing. Maybe we
>> can modify the suspend sequence such that it always unload the head
>> first and then issue spindown. That will ensure the head is in safe
>> position as soon as possible. If it's done this way, it'll be
>> probably a good idea to split unloading and loading operations and do
>> loading only when EH is being finished and the disk is not spun down.
>>
>> To me, much more serious problem seems to be during hibernation. The
>> kernel is actively writing memory image to userland and it takes quite
>> a while and there's no protection whatsoever during that time.
>
> Which also brings again the question whether it is really the best to
> use user-space solution instead of kernel thread?
>
> After taking the look into the deamon program and hdaps driver I tend
> to "Nope." answer. The kernel-space solution would be more reliable,
> should result in significatly less code and would free us from having
> a special purpose libata/ide interfaces. It should also make the
> maintainance and future enhancements (i.e. making hibernation unload
> friendly) a lot easier.
>
> I imagine that this comes a bit late but can we at least give it an
> another thought, please?

Right, I'll try to give a concise statement of the problem. First
though, I absolutely agree that with regard to the suspend / hibernate
problem, an in kernel solution would ultimately be the safest option.
However, the way I see it, we would need a module with the following
characteristics:

- Policy: logic to decide when to park / unpark disks and an interface
to export tunables to user space.
- Input: capability to recognise and register with acceleration sensors
in the system and to gather data in an efficient manner. Since this is
kernel space, we have to make it bulletproof and account for the
possibility that there may be more than one such sensor installed in
the system (think: plug and play).
- Action: find all rotating media in the system and decide which of them
to protect how. Probably, some tunables for the user to fiddle with
are required here too. Remember that we have docking stations and the
like so more than one HD may show up on the bus.

All these corner cases that most users don't care or even tink about
won't hurt anyone as long as the daemon is in user space. This way, we
have a very simple solution for all of them: The user decides for each
instance of the daemon which accelerometer it gets its data from and
which HD it is supposed to protect. I don't like giving impressively
high percentages when all I'm doing is intelligent guess work, but the
vast majority of users will have only one daemon running, getting its
data from one accelerometer and protecting exactly one HD. However, it
is hard to imagine anything disastrous to happen *if* somebody should
happen to install a second accelerometer or connect to the docking
station. In kernel space we would have to take care of the oddest things
because a system supposed to increase security would suffer under a
reputation of locking the machine for good.

Regards,

Elias

Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

On Sunday 31 August 2008, Elias Oltmanns wrote:
> Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > Hi,
> >
> > On Sunday 31 August 2008, Tejun Heo wrote:
> >> Hello,
> >>
> >> Elias Oltmanns wrote:
> >> >> Ah.. you need to part ATAPI too? If so, just test for
> >> >> ata_dev_enabled().
> >> >
> >> > Well, I'm not quite sure really. Perhaps you are right and I'd better
> >> > leave ATAPI alone, especially given the problem that the unload command
> >> > might mess up a CD/DVD write operation. As long as no laptop HDs are
> >> > identified as ATAPI devices, there should be no problem with that.
> >>
> >> Hmm... I think it would be safer to stick with ATA for the time being.
> >
> > Seconded. To be honest I also don't like the change to issue UNLOAD to
> > all devices on the port (it only needlessly increases complexity right now
> > since the _only_ use case at the moment is ThinkPad w/ hdaps + 1 HD).
>
> Admittedly, I don't know very much about it myself but I seem to
> remember that there are other vendors now shipping similar technology.
> Even in the Thinkpad case I don't *know* that all relevant models have
> HD and optical drives on seperate ports although I'm willing to believe
> it if somebody tells me so.
>
> Anyway, I've added Henrique to the Cc: list since he knows far more
> about Thinkpads than I do and possibly about other notebooks too.
>
> >
> > [ I really hoped for the minimal initial implementation. ]
>
> There is a lot to be said for the per-port solution as far as libata is
> concerned. For the sake of consistency I tried to mimic the same
> behaviour in ide but I agree that it makes things more complex there.
>
> >
> >> >> Can you please elaborate a bit? The reloading is done by the kernel
> >> >> after a timeout, right? What kind of precarious situation can the
> >> >> kernel get into regarding suspend?
> >> >
> >> > Sorry, I haven't expressed myself very clearly there, it seems. The user
> >> > space process detects some acceleration and starts writing timeouts to
> >> > the sysfs file. This causes the unload command to be issued to the
> >> > device and stops all I/O until the user space daemon decides that the
> >> > danger has passed, writes 0 to the sysfs file and leaves it alone
> >> > afterwards. Now, if the daemon happens to request head parking right at
> >> > the beginning of a suspend sequence, this means that we are in danger of
> >> > falling, i.e. we have to make sure that I/O is stopped until that user
> >> > space daemon gives the all-clear. However, suspending implies freezing
> >> > all processes which means that the daemon can't keep checking and
> >> > signalling to the kernel. The last timeout received before the daemon
> >> > has been frozen will expire and the suspend procedure goes ahead. By
> >> > means of the notifiers we can make sure that suspend is blocked until
> >> > the daemon says that everything is alright.
> >>
> >> Is it really worth protecting against that? What if the machine
> >> started to fall after the userland tasks have been frozen? And how
> >> long the usual timeout would be? If the machine has been falling for
> >> 10 secs, there really isn't much point in trying to protect anything
> >> unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
> >> revision.
> >>
> >> In libata, as with any other exceptions, suspend/resume are handled by
> >> EH so while emergency head unload is in progress, suspend won't
> >> commence which is about the same effect as the posted code sans the
> >> timeout extension part. I don't really think there's any significant
> >> danger in not being able to extend timeout while suspend is in
> >> progress. It's not a very big window after all. If you're really
> >> worried about it, you can also let libata reject suspend if head
> >> unload is in progress.
> >>
> >> Also, the suspend operation is unloading the head and spin down the
> >> drive which sound like a good thing to do before crashing. Maybe we
> >> can modify the suspend sequence such that it always unload the head
> >> first and then issue spindown. That will ensure the head is in safe
> >> position as soon as possible. If it's done this way, it'll be
> >> probably a good idea to split unloading and loading operations and do
> >> loading only when EH is being finished and the disk is not spun down.
> >>
> >> To me, much more serious problem seems to be during hibernation. The
> >> kernel is actively writing memory image to userland and it takes quite
> >> a while and there's no protection whatsoever during that time.
> >
> > Which also brings again the question whether it is really the best to
> > use user-space solution instead of kernel thread?
> >
> > After taking the look into the deamon program and hdaps driver I tend
> > to "Nope." answer. The kernel-space solution would be more reliable,
> > should result in significatly less code and would free us from having
> > a special purpose libata/ide interfaces. It should also make the
> > maintainance and future enhancements (i.e. making hibernation unload
> > friendly) a lot easier.
> >
> > I imagine that this comes a bit late but can we at least give it an
> > another thought, please?
>
> Right, I'll try to give a concise statement of the problem. First
> though, I absolutely agree that with regard to the suspend / hibernate
> problem, an in kernel solution would ultimately be the safest option.

Not only that, IIRC there were some concerns regarding having bigger
power consumption with user/kernel-space solution.

> However, the way I see it, we would need a module with the following
> characteristics:
>
> - Policy: logic to decide when to park / unpark disks and an interface
> to export tunables to user space.
> - Input: capability to recognise and register with acceleration sensors
> in the system and to gather data in an efficient manner. Since this is
> kernel space, we have to make it bulletproof and account for the
> possibility that there may be more than one such sensor installed in
> the system (think: plug and play).
> - Action: find all rotating media in the system and decide which of them
> to protect how. Probably, some tunables for the user to fiddle with
> are required here too. Remember that we have docking stations and the
> like so more than one HD may show up on the bus.
>
> All these corner cases that most users don't care or even tink about
> won't hurt anyone as long as the daemon is in user space. This way, we
> have a very simple solution for all of them: The user decides for each
> instance of the daemon which accelerometer it gets its data from and
> which HD it is supposed to protect. I don't like giving impressively
> high percentages when all I'm doing is intelligent guess work, but the
> vast majority of users will have only one daemon running, getting its
> data from one accelerometer and protecting exactly one HD. However, it
> is hard to imagine anything disastrous to happen *if* somebody should
> happen to install a second accelerometer or connect to the docking
> station. In kernel space we would have to take care of the oddest things
> because a system supposed to increase security would suffer under a
> reputation of locking the machine for good.

We may attack the problem from the different angle in which we won't
have to worry about any odd corner cases at all:

- Add disk_shock module with the needed logic, keeping track of "system
accelerometer" & "system disk" objects, responsible for polling and also
(optionally) exporting tunables.

- When ATA devices are initialized check if they support UNLOAD
command and if yes advertise such capability to the block layer
(like we do it with flush cache currently). We can also solve
the problem of forcing UNLOAD support with using kernel parameters.

- Add [un]register_system_accelerometer() interface to disk_shock
and make accelerometer drivers decide whether to use it (currently
only hdaps driver will use it). Also add some standard methods for
obtaining data from accelerometer drivers. We may even glue the
new disk_shock with hdaps for now.

- Simlarly add [un]register_system_disk() interface (getting us a
access to disk queue) and make storage drivers decide whether to
use it (it is actually easier than in case of system accelerometer
devices since an extra UNLOAD command on shock is not a problem,
while false shock alert is).

- On shock disk_shock will queue the special REQ_PARK_HEADS request
and later it will queue REQ_UNPARK_HEADS one (this may need minor
tweaks in block layer as we needed for PM support in ide, which is
done in very similar way).

Given that at the moment we need to only handle _1_ accelerometer
we may start _really_ small and get things working. Later we can
extend the functionality and interfaces as needed (like allowing
user to specify arbitrary system accelerometer(s)/disk(s) mappings).

[ It is also entirely possible that we will never need to extend it! ]

It may sound as we would need to start from scratch and throw out
the current solution. This is not true, majority of code can be
nicely-recycled (i.e. logic from daemon, libata/ide UNLOAD support).

There is also one big pro of simplified kernel solution from user POV,
she/he doesn't have to worry about setting up _anything_. The feature
just "magically" starts working with the next kernel upgrade.

PS please note that I'm not NACK-ing the current solution, I'm just
thinking loudly if we can and should put some extra effort which will
results in better long-term solution (and of course less maintenance
work for me :)

Thanks,
Bart

Subject: Re: [PATCH 2/4] libata: Implement disk shock protection support

On Sun, 31 Aug 2008, Elias Oltmanns wrote:
> Admittedly, I don't know very much about it myself but I seem to
> remember that there are other vendors now shipping similar technology.

Apple, HP.

> Even in the Thinkpad case I don't *know* that all relevant models have
> HD and optical drives on seperate ports although I'm willing to believe
> it if somebody tells me so.

All the models I know of have them on separate ports/channels.

> Anyway, I've added Henrique to the Cc: list since he knows far more
> about Thinkpads than I do and possibly about other notebooks too.

I know Apple does it, and they need exactly the same queue freeze + ATA
unload immediate to have APS working. But they will want both a kernel
interface (just using the firmware) and the userspace interface (to use
something better than whatever is in Apple's firmware).

> There is a lot to be said for the per-port solution as far as libata is
> concerned. For the sake of consistency I tried to mimic the same
> behaviour in ide but I agree that it makes things more complex there.

Frankly? I doubt anybody really cares about the old ide driver for this
particular functionality.

How many systems have an accelerometer, are portable enough to need APS, and
have a SATA or PATA bridge that is not better driven by libata instead of
ide?

> > Which also brings again the question whether it is really the best to
> > use user-space solution instead of kernel thread?

Choice is good here. A really good imminent-shock predictor needs to do
some fairly decent ammount of digital signal processing (in 2d or 3d,
depending on the sensor). That stuff is a lot easier to do if you have
floating point math available to you, for example. That means userspace.

And some firmware can tell you "please do APS NOW!", so, for those you want
a kernel interface.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh