2006-10-09 11:32:57

by Elias Oltmanns

Subject: Debugging strange system lockups possibly triggered by ATA commands

Hi there,

recently, I've adapted the hdaps_protect patch to make it work with
kernel 2.6.18. This patch adds a file "protect" to the queue directory
of block devices managed by ide or libata in sysfs. Depending on the
drive's capabilities or a module/kernel parameter, an IDLE IMMEDIATE
with UNLOAD feature or a STANDBY IMMEDIATE command is issued whenever
a positive value is written to this "protect" file. After completion
of this command, the request queue of the respective device is stopped
in order to prevent it from performing further IO operations. The
queue is started again once a timeout has elapsed, namely as many
seconds as the positive number originally written to the "protect"
file.

The purpose of this patch is to provide an interface to unload the
disk heads on request from user space, e.g., in order to minimise the
chance for the heads to hit the platter in certain situations like
a laptop sliding off the lap. This makes it imperative to insert the
unload command at the head of the request queue.

While testing the patch, I experienced some nasty system lockups which
I cannot reproduce quite reliably, let alone explain. Since these
lockups occurred on my machine regardless of whether I used the ide
piix driver in vanilla 2.6.18 or the
ata_piix driver with pata support enabled in Jeff Garzik's git tree
(upstream-linus as of 2006-09-29), and since the ide related part of
the patch had to be changed very little from 2.6.17 to 2.6.18, there
seem to be two options: Either I've missed an important change in the
way io requests and the request queue have to be handled in 2.6.18, or
the patch just demonstrates a flaw somewhere else in the kernel. The
former seems quite likely considering that I'm only superficially
acquainted with the relevant API. The latter does not seem entirely
unlikely either, as the problem occurs with ide as well as libata.

Unfortunately, my system just froze without displaying a panic
message. Moreover, the lockup appears to be hard to reproduce. Here
are some details about some of the tests I've performed so far:


1. vanilla 2.6.18:
------------------
I used my standard configuration for self compiled kernels and make
oldconfig to adjust it to 2.6.18. Basically, that means a highly
modularised kernel with ramdisk and initrd support compiled in - by
that time I hadn't realised yet that ramdisk support isn't needed for
initramfs support anymore. Amongst the modules: ide-core, ide-disk,
ide-generic, piix, no sata support. With the hdaps_protect patch applied, I
could reliably reproduce the system freeze by the following steps:
Boot into single user mode
# modprobe ibm-acpi
# while true; do echo -n 1 > /sys/block/hda/queue/protect; \
> echo -n 0 > /sys/block/hda/queue/protect; done
The system freezes and there is no way to revive it except a cold
reset. Note that there was no freeze without ibm-acpi being loaded;
even running modprobe ibm-acpi; modprobe -r ibm-acpi before the while
loop did not lead to a freeze. However, switching to the external
monitor and back again after loading ibm-acpi prevents the system from
freezing as well, which makes the whole thing even more difficult.
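
The loop above runs until the machine locks up. A bounded variant that
reports its progress might look like the following minimal sketch. The
function name toggle_protect and the PROTECT_FILE variable are
illustrative assumptions, not part of the patch; only the sysfs path is
taken from the steps above.

```shell
#!/bin/sh
# Bounded variant of the toggle loop. Only the default path comes from
# the reproduction steps; the names here are hypothetical.
PROTECT_FILE="${PROTECT_FILE:-/sys/block/hda/queue/protect}"

# toggle_protect COUNT: write 1 and then 0 to the attribute COUNT times.
# The iteration number is printed before each pair of writes, so the
# last line visible on the console shows how far the loop got.
toggle_protect() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "iteration $i"
        printf '1' > "$PROTECT_FILE" || return 1
        printf '0' > "$PROTECT_FILE" || return 1
        i=$((i + 1))
    done
}
```

Calling toggle_protect 1000 as root would then perform 1000 write
pairs and return, provided the system survives that long.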


2. Branch upstream-linus from Jeff Garzik's git tree as of 2006-09-29:
----------------------------------------------------------------------
Here I used almost the identical configuration except that I disabled
ide support completely and enabled sata support and the module
ata_piix. Besides, #define ATA_ENABLE_PATA was set in
include/linux/libata.h.
With this setup the system showed the same behavior as described above.


3. Vanilla 2.6.18 with stripped configuration:
----------------------------------------------
In the hope of providing a minimal test case, I stripped the
configuration considerably, disabling several subsystems like scsi, a
lot of networking stuff, and so on. Additionally, I disabled
ide-generic and ramdisk support, as I'm using initramfs anyway. The
module ibm_acpi was still included.
Regrettably, the freeze was not reproducible anymore.

4. Branch upstream-linus from Jeff Garzik's git tree as of 2006-10-09:
----------------------------------------------------------------------
Exactly the same config as in 2. As in 3., the problem is not
reproducible, and I'm currently working on this system.


Admittedly, I'm completely lost at this point. That's why I'm asking
you for advice and suggestions on how to debug this problem. If you want
to have a look at the patch in question, please see:
1. applying to vanilla 2.6.18
<http://www.uni-bonn.de/~oltmanns/linux/hdaps_protect-2.6.18-20060922-3.patch>
2. applying to Jeff's git tree as in examples 2. and 4. above:
<http://www.uni-bonn.de/~oltmanns/linux/hdaps_protect-2.6.18-20060922-pata-2.patch>

A slightly stripped version of the patch is available too, which has
been verified to trigger the described problem in exactly the same way
as the original but lacks the IDLE IMMEDIATE feature (leaving the
STANDBY IMMEDIATE option only) in order to make it (hopefully) more
readable and easier to understand. You can find this version of the
patch which applies to vanilla 2.6.18 here:
<http://www.uni-bonn.de/~oltmanns/linux/hdaps_protect-stripped-2.6.18-1.patch>

Kind regards and thanks for your help in advance,

Elias


2006-10-09 12:12:22

by Shem Multinymous

Subject: Re: [Hdaps-devel] Debugging strange system lockups possibly triggered by ATA commands

Hi Elias,

On 10/9/06, Elias Oltmanns wrote:

> A slightly stripped version of the patch is available too, which has
> been verified to trigger the described problem in exactly the same way
> as the original but lacks the IDLE IMMEDIATE feature (leaving the
> STANDBY IMMEDIATE option only) in order to make it (hopefully) more
> readable and easier to understand.

What happens if you strip away *all* the head parking code, leaving
only the queue freeze code? Conversely, what happens if you issue the
head park command but don't freeze the queue?

Shem

2006-10-10 18:59:49

by Elias Oltmanns

Subject: Re: Debugging strange system lockups possibly triggered by ATA commands

Hi Shem,

"Shem Multinymous" wrote:
> Hi Elias,
>
> On 10/9/06, Elias Oltmanns wrote:
>
>> A slightly stripped version of the patch is available too, which has
>> been verified to trigger the described problem in exactly the same way
>> as the original but lacks the IDLE IMMEDIATE feature (leaving the
>> STANDBY IMMEDIATE option only) in order to make it (hopefully) more
>> readable and easier to understand.
>
> What happens if you strip away *all* the head parking code, leaving
> only the queue freeze code? Conversely, what happens if you issue the
> head park command but don't freeze the queue?

In all setups the loop
# while true; do hdparm -q -y /dev/hda; hdparm -q -C /dev/hda; done
didn't trigger the freeze. This should be very close to the effect of
the patch with all queue freezing stuff removed.

Regarding your question, it seems rather difficult to test the right
thing. Just issuing the commands and leaving the queue alone is easy
to implement and I couldn't reproduce the problem running such a
kernel. Testing the queue handling stuff without actually issuing any
commands seems rather difficult to me as it's the callback mechanism
used to freeze the queue after command completion which I'd really
like to test. If I don't issue any command, I don't know how to test
whether the callback procedure and all the rest works as expected.

Regards,

Elias

2006-10-10 18:59:51

by Elias Oltmanns

Subject: Re: Debugging strange system lockups possibly triggered by ATA commands

Hi again,

here is some additional information and further test results:

Elias Oltmanns wrote:
[...]
> Unfortunately, my system just froze without displaying a panic
> message. Moreover, the lockup appears to be hard to reproduce.

I've been made aware that this might be a sign of all sorts of flaky
hardware. Admittedly, the test case presented, which involves a very
tight while true loop, means a lot of stress for the hardware. Let me
point out, however, that this is just my best approach to trigger the
problem as fast and reliably as possible. It was only after I had
experienced such lockups during normal operation that I developed
this particular test case. "Normal operation" in this context means
running the hdapsd daemon which writes a positive number to the sysfs
protect attribute whenever it detects an unusual condition from
reading data from an acceleration sensor. As soon as hdapsd thinks
that everything is alright again, it writes a 0 to the protect
attribute.

This means that in practice a very short sequence of writes to the
protect attribute suffices to freeze the system under certain
conditions. Please note that repeated writes of 1 to the protect
attribute less than one second apart issue the park command to the
disk only once; each further write merely updates the unfreeze timer.
Once the writes stop and the timeout expires, the request queue is
started again.
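
The coalescing of repeated writes described above can be illustrated
with a small user-space model. This is only a sketch of the behaviour
as stated in the paragraph, not code from the patch: the names
write_protect, parks and frozen_until are hypothetical, times are in
milliseconds, and writes of 0 are not modelled.

```shell
#!/bin/sh
# User-space model only; nothing here touches a real disk.
frozen_until=0   # time (ms) at which the simulated queue restarts
parks=0          # number of simulated park commands issued

# write_protect NOW SECONDS: model a write of SECONDS (> 0) at time NOW.
write_protect() {
    now=$1
    secs=$2
    if [ "$now" -ge "$frozen_until" ]; then
        # Assumption: the queue is running, so a park command is issued.
        parks=$((parks + 1))
    fi
    # In either case the unfreeze timer is restarted from NOW.
    frozen_until=$((now + secs * 1000))
}

# Four writes of 1, each half a second after the previous one.
for t in 0 500 1000 1500; do
    write_protect "$t" 1
done
echo "parks=$parks frozen_until=$frozen_until"
# prints "parks=1 frozen_until=2500"
```

In this model a further write at t=3000 would count as a second park
command, because the timer set by the last write has already expired.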

> Here are some details about some of the tests I've performed so far:
>
>
> 1. vanilla 2.6.18:
> ------------------
> I used my standard configuration for self compiled kernels and make
> oldconfig to adjust it to 2.6.18. Basically, that means a highly
> modularised kernel with ramdisk and initrd support compiled in - by
> that time I hadn't realised yet that ramdisk support isn't needed for
> initramfs support anymore. Amongst the modules: ide-core, ide-disk,
> ide-generic, piix, no sata support. With the hdaps_protect patch applied, I
> could reliably reproduce the system freeze by the following steps:
> Boot into single user mode
> # modprobe ibm-acpi
> # while true; do echo -n 1 > /sys/block/hda/queue/protect; \
> > echo -n 0 > /sys/block/hda/queue/protect; done
> The system freezes and there is no way to reactivate it, except a cold
> reset. Note that there was no freeze without ibm-acpi being loaded,
> even modprobe ibm-acpi; modprobe -r ibm-acpi and the while loop did
> not lead to a freeze. However, switching to the external monitor and
> back again after loading ibm-acpi prevents the system from freezing
> too which makes the whole thing even more difficult.
[...]

The freezes have been observed in all four setups described in my
original post. The problem can be reproduced with the while loop as
described above even without loading ibm-acpi; it seems to be
sufficient that the disk is performing some I/O at the time. Running
ls /usr/sbin instead of modprobe ibm-acpi and starting the while loop
shortly afterwards works as well. At least that makes much more sense
than a connection between this problem and ibm-acpi. It also indicates
that the problem is not as configuration dependent as implied before.
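
One way to combine disk activity with the toggling is sketched below.
It is an untested illustration under stated assumptions: busy_toggle,
READ_DIR and PROTECT_FILE are hypothetical names, and a recursive
listing of /usr stands in for the ls /usr/sbin mentioned above. Data
already in the page cache will not cause any disk reads, so this is
most likely to generate real I/O shortly after boot.

```shell
#!/bin/sh
# Sketch: keep the disk busy with reads while toggling the attribute.
PROTECT_FILE="${PROTECT_FILE:-/sys/block/hda/queue/protect}"
READ_DIR="${READ_DIR:-/usr}"   # assumption: a large tree on the disk under test

# busy_toggle COUNT: start a recursive listing in the background,
# then write 1 and 0 to the attribute COUNT times.
busy_toggle() {
    ls -R "$READ_DIR" > /dev/null 2>&1 &
    reader=$!
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '1' > "$PROTECT_FILE"
        printf '0' > "$PROTECT_FILE"
        i=$((i + 1))
    done
    # Wait for the background reader; ignore its exit status.
    wait "$reader" || true
}
```

Run as root, busy_toggle 100 would overlap 100 write pairs with the
background listing.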

Regards,

Elias

2006-10-11 13:20:01

by Shem Multinymous

Subject: Re: Debugging strange system lockups possibly triggered by ATA commands

Hi Elias,

On 10/10/06, Elias Oltmanns wrote:
> Testing the queue handling stuff without actually issuing any
> commands seems rather difficult to me as it's the callback mechanism
> used to freeze the queue after command completion which I'd really
> like to test. If I don't issue any command, I don't know how to test
> whether the callback procedure and all the rest works as expected.

Try doing a dump_stack(), and nothing else, in the callback. This may
provide some hint.

(Remember to increase console_loglevel so you can see the result on
the console before the hang.)
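
From user space, the console log level can be inspected and raised
through /proc/sys/kernel/printk, whose first field is the current
console log level. The helper names below are hypothetical and the
snippet is only a sketch; writing to the file requires root.

```shell
#!/bin/sh
# The first field of this file is the console log level.
PRINTK_FILE="${PRINTK_FILE:-/proc/sys/kernel/printk}"

# console_loglevel: print the current console log level.
console_loglevel() {
    read -r level rest < "$PRINTK_FILE"
    echo "$level"
}

# raise_console_loglevel: set the console log level to 8, so that
# kernel messages of every priority reach the console.
raise_console_loglevel() {
    echo 8 > "$PRINTK_FILE"
}
```

Running raise_console_loglevel before starting the test loop should
make the dump_stack() output appear on the console; dmesg -n 8 is an
equivalent alternative.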

Shem