2000-12-08 10:18:47

by Andre Hedrick

[permalink] [raw]
Subject: FS Corruption Again, with known cause, and solution...

Okay gang, here is the skinny and it is fat and ugly!

Sometime ago some braindead engineers, at an unamed company that developed
the first DMA engine, thought it was a good idea to wack and stop the DMA
engine if there was a transaction delay of 1 micro-second or more;
regardless if the transfer was done! It is possible that with everyone
following their lead, this may be a global ASIC error. This would explain
the rise-fall-rise of the reports.

Rise == the arrive of the hardware error
Fall == Andrea A. elevator working to fake a scatter-gather (IMHO)
Fall == attempt to address the timeouts with expiry function (may keep)
Rise == newer larger drives that can produce out-of-bounds seeking.

How to kill it.....

Basically, issue the abort of the transfer and requeue the request, and
try again.

Problem, if the seek repeats the process and we continuily requeue the
request then we get into other issues.

Workaround 1, maintain a barf counter and after N barfs perform the task
in a forced PIO regardless (HUGE peformance hit, but good data).

Workaround 2, slice the request after first barf and make it smaller and
retry (nasty-dirty and we have to dump dma-buffers and make new-ones).

Workaround 3, give up and goto bed for now....

Now who wants to jump on this and work with me on the re-insert of the
request and try again, before we do tests on WA 1 & 2? I am brain dead
from reading the code sent to me by the vender because it is been hacked
and abused to death! They do not even like to show how fugly it is!


Andre Hedrick
Linux ATA Development