2007-09-02 11:35:21

by Raphael Manfredi

Subject: [2.6.22.5] Machine freeze when smartd launches long scans

The last messages I have on the netconsole are:

hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x61
hda: DMA timeout error

After that, the machine no longer responds to pings and the keyboard
is unresponsive. I could not see what the local console said, because
my X session had blanked the screen and ctrl-F1 had no effect.

There was no OOPS logged despite the NMI watchdog running.

The funny thing is that when I launch:

smartctl -t long /dev/hda

myself, everything goes fine and the test completes. It's only
when the tests are launched every Sunday morning by smartd that
the machine freezes.

This has been happening with 2.6.22.1 and 2.6.22.5 for the last
4 weeks. Before that, I was running 2.4.31 and everything was fine:
I never got a hang.

I don't think the hardware is at fault (I tried changing the disk,
same thing), and the fact that the machine freezes after a single
DMA timeout error is not normal.

I'm running an SMP version on a P4 with hyper-threading.

Any idea of what could be wrong in the IDE timeout error recovery?

Raphael