2004-09-03 14:52:16

by Jens Axboe

Subject: Re: Nasty IDE crasher in 2.6.9rc1


(suse.dk is not related to suse.de, and it helpfully eats all messages
sent to unknown users. Not so great :( )

On Tue, Aug 31 2004, Alan Cox wrote:
> You never, never issue unknown commands to drives. That's how Mandrake destroyed
> CD-ROM drives. I knew this was in -mm and supposed to be getting sorted; I was
> somewhat horrified to find it in 2.6.9rc1.
>
> This patch crashes two of my CF cards (one so badly you have to reformat it
> to get it back) and anything attached to an IT8212 controller. The correct
> fix is to do what the standard actually says and always check for cache
> flush. Contrary to the comment in the patch, drives do report this correctly;
> it's just that some of them nop unknown commands.
>
> Please fix this patch segment for rc2; it's not just wrong, it's dangerous.

Ugh, that's bad. I agree with the change, thanks. Linus passed it on.
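
For reference, the check described above (only issue a cache flush when the
drive's IDENTIFY DEVICE data advertises it) can be sketched roughly as
follows. This is a minimal illustration, not the code from the patch under
discussion: the helper name pick_flush_command() and the macro names are
hypothetical, and preferring the EXT variant whenever it is advertised is a
simplifying assumption. The word and bit positions and the opcodes follow the
ATA command set definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IDENTIFY DEVICE word 83: command sets supported. */
#define ID_WORD_CMD_SET_2   83
#define ID_VALID_MASK       0xC000u    /* bits 15:14 */
#define ID_VALID_VALUE      0x4000u    /* bit 14 set, bit 15 clear => word is valid */
#define ID_FLUSH_CACHE      (1u << 12) /* FLUSH CACHE supported */
#define ID_FLUSH_CACHE_EXT  (1u << 13) /* FLUSH CACHE EXT supported */

/* ATA opcodes. */
#define CMD_FLUSH_CACHE     0xE7
#define CMD_FLUSH_CACHE_EXT 0xEA

/*
 * Hypothetical helper: return the flush opcode that is safe to issue,
 * or 0 if the drive does not advertise any cache flush command.
 */
static uint8_t pick_flush_command(const uint16_t id[256])
{
    assert(id != NULL);

    uint16_t w = id[ID_WORD_CMD_SET_2];

    /* If the validity bits are wrong, trust nothing in this word. */
    if ((w & ID_VALID_MASK) != ID_VALID_VALUE)
        return 0;

    if (w & ID_FLUSH_CACHE_EXT)
        return CMD_FLUSH_CACHE_EXT;
    if (w & ID_FLUSH_CACHE)
        return CMD_FLUSH_CACHE;

    /* Not advertised: never send the command speculatively. */
    return 0;
}
```

A caller would skip the flush entirely when this returns 0, rather than
sending the command and hoping the drive ignores it.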

> Another problem with barrier is that it can take several minutes worst case
> for the command to complete on a large modern drive (timings c/o friendly
> ide drive engineer). That causes two problems I've pointed out to Jens that
> we need to fix before barriers are IMHO production grade

Can you pass me his results?

> 1. Anything based on fairness and latency is screwed. Throughput
> apparently is up so it makes sense for some users, and probably
> for others we should write cache off as Jens suggested.

Yes, it's a tradeoff. Users can decide for themselves what matters
most. It all depends on the workload, of course.

> 2. The timeouts on the command issue appear to be too small, and
> we will time out and reset the drive in loaded situations.

You don't seem to address that in your patch?

> Thankfully next generation ATA has both cache bypass writes and tagging.

But the tagging still isn't useful for this. Have they added
WIN_WRITE_DMA_EXT_QUEUED_FUA?

--
Jens Axboe


2004-09-03 15:02:36

by Alan

Subject: Re: Nasty IDE crasher in 2.6.9rc1

On Fri, 2004-09-03 at 15:50, Jens Axboe wrote:
> (suse.dk is not related to suse.de and it helpfully eats all messages
> sent to unknown users. not so great :(

Ah sorry.

> > Another problem with barrier is that it can take several minutes worst case
> > for the command to complete on a large modern drive (timings c/o friendly
> > ide drive engineer). That causes two problems I've pointed out to Jens that
> > we need to fix before barriers are IMHO production grade
>
> Can you pass me his results?

I can ask. It's NDA data (not Maxtor). Or Eric might have public info?
In a later mail I reported my tests trying to make it as slow as possible,
and I couldn't get worse than 7 seconds for the command.

> > 2. The timeouts on the command issue appear to be too small, and
> > we will time out and reset the drive in loaded situations.
>
> You don't seem to address that in your patch?

I'm not sure what the right answer is.


2004-09-03 15:29:16

by Jens Axboe

Subject: Re: Nasty IDE crasher in 2.6.9rc1

On Fri, Sep 03 2004, Alan Cox wrote:
> On Fri, 2004-09-03 at 15:50, Jens Axboe wrote:
> > (suse.dk is not related to suse.de and it helpfully eats all messages
> > sent to unknown users. not so great :(
>
> Ah sorry.
>
> > > Another problem with barrier is that it can take several minutes worst case
> > > for the command to complete on a large modern drive (timings c/o friendly
> > > ide drive engineer). That causes two problems I've pointed out to Jens that
> > > we need to fix before barriers are IMHO production grade
> >
> > Can you pass me his results?
>
> I can ask. It's NDA data (not Maxtor). Or Eric might have public info?
> In a later mail I reported my tests trying to make it as slow as possible,
> and I couldn't get worse than 7 seconds for the command.

IIRC, 7 seconds is the magic number that Microsoft uses for when a
command times out in the kernel... That might make the results a little
suspicious :)

>
> > > 2. The timeouts on the command issue appear to be too small, and
> > > we will time out and reset the drive in loaded situations.
> >
> > You don't seem to address that in your patch?
>
> I'm not sure what the right answer is.

I guess as a first measure just increasing the timeout two-fold will
cover most of the problem.
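
As a rough illustration of that first measure, the sketch below scales a base
command timeout by a factor and checks whether a given command duration would
still trip it. Everything here is hypothetical: the function names are made
up for the example, and the base timeout passed in is whatever the driver
currently uses, not a value taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: scale a base command timeout (in ms) by an integer factor. */
static unsigned long scaled_timeout_ms(unsigned long base_ms, unsigned int factor)
{
    assert(factor > 0);
    return base_ms * factor;
}

/* True if a command that has run for elapsed_ms has exceeded timeout_ms. */
static bool command_timed_out(unsigned long elapsed_ms, unsigned long timeout_ms)
{
    return elapsed_ms > timeout_ms;
}
```

With an assumed base of 5000 ms, a 7-second flush trips the unscaled timeout
but not the doubled one; a worst case of several minutes would still exceed
either, which is why doubling only covers most of the problem.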

--
Jens Axboe