2002-06-21 09:25:15

by Jens Axboe

[permalink] [raw]
Subject: hda: error: DMA in progress..

Martin,

I gave 2.5.24 a spin, and it quickly dies with the error in subject,
under moderate disk load. It's an IBM travel star on a PIIX4.

--
Jens Axboe


2002-06-21 10:05:05

by Martin Dalecki

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

U?ytkownik Jens Axboe napisa?:
> Martin,
>
> I gave 2.5.24 a spin, and it quickly dies with the error in subject,
> under moderate disk load. It's an IBM travel star on a PIIX4.
>


if (test_bit(IDE_DMA, ch->active)) {
printk(KERN_ERR "%s: error: DMA in progress...\n", drive->name);
break;
}

Well I did the change we where talking about .waiting_for_dma -> xxx_bit(IDE_DMA.
And I was asking about it's possible interactions with TCQ.
And now we see that it is indeed apparently really interacting with
the TCQ code in bad ways. But if I look down from the above code (Just below in
ide.c)

if (blk_queue_plugged(&drive->queue)) {
BUG_ON(!drive->using_tcq);
break;
}

It seems like the check which is catching reality right now
is bogous in itself. Becouse having DMA running would be
only problematic if the queue was still plugged. (Right?)
So please just disable the check.

This time it's no new damage - just detecting weak code
from the past...

2002-06-21 10:12:14

by Jens Axboe

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

On Fri, Jun 21 2002, Martin Dalecki wrote:
> U?ytkownik Jens Axboe napisa?:
> >Martin,
> >
> >I gave 2.5.24 a spin, and it quickly dies with the error in subject,
> >under moderate disk load. It's an IBM travel star on a PIIX4.
> >
>
>
> if (test_bit(IDE_DMA, ch->active)) {
> printk(KERN_ERR "%s: error: DMA in progress...\n",
> drive->name);
> break;
> }
>
> Well I did the change we where talking about .waiting_for_dma ->
> xxx_bit(IDE_DMA.

Yeah I noticed.

> And I was asking about it's possible interactions with TCQ.

Haven't even tried TCQ yet, the above is just plain dma (no travelstarts
can do tcq).

> And now we see that it is indeed apparently really interacting with
> the TCQ code in bad ways. But if I look down from the above code (Just
> below in ide.c)
>
> if (blk_queue_plugged(&drive->queue)) {
> BUG_ON(!drive->using_tcq);
> break;
> }
>
> It seems like the check which is catching reality right now
> is bogous in itself. Becouse having DMA running would be
> only problematic if the queue was still plugged. (Right?)
> So please just disable the check.

Not exactly, let me see if I remember the race here... The queue can
become plugged when we queue one request with the drive (the only on the
queue at that time), and then try to queue another right after (hence
only a tcq issue). In that time period, we drop the queue lock, so it's
indeed possible for the block layer to plug the queue before we reach
the above code again. The drive can be in two states here, 1) IDE_DMA is
set because the drive didn't release the bus (or it did, and it already
reconnected), or 2) drive is disconnected from the bus.

For non-tcq, hitting IDE_DMA set queue_commands() is a bug. The old
IDE_BUSY/IDE_DMA worked because IDE_DMA must not be set if IDE_BUSY is
not set.

> This time it's no new damage - just detecting weak code
> from the past...

Smells like new breakage to me :-)

--
Jens Axboe

2002-06-21 10:28:27

by Jens Axboe

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

On Fri, Jun 21 2002, Jens Axboe wrote:
> > And now we see that it is indeed apparently really interacting with
> > the TCQ code in bad ways. But if I look down from the above code (Just
> > below in ide.c)
> >
> > if (blk_queue_plugged(&drive->queue)) {
> > BUG_ON(!drive->using_tcq);
> > break;
> > }
> >
> > It seems like the check which is catching reality right now
> > is bogous in itself. Becouse having DMA running would be
> > only problematic if the queue was still plugged. (Right?)
> > So please just disable the check.
>
> Not exactly, let me see if I remember the race here... The queue can
> become plugged when we queue one request with the drive (the only on the
> queue at that time), and then try to queue another right after (hence
> only a tcq issue). In that time period, we drop the queue lock, so it's
> indeed possible for the block layer to plug the queue before we reach
> the above code again. The drive can be in two states here, 1) IDE_DMA is
> set because the drive didn't release the bus (or it did, and it already
> reconnected), or 2) drive is disconnected from the bus.

Sorry that's not true (#1), it's _never_ valid for IDE_DMA to be set
here even when using TCQ (less so, if possible, for non-tcq :-). At
least according to the old semantics. That's only the case when using
speculative starts of the dma engine for tcq with release interrupt
enabled, which we don't use on Linux.

--
Jens Axboe

2002-06-21 10:31:51

by Martin Dalecki

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

U?ytkownik Jens Axboe napisa?:
> On Fri, Jun 21 2002, Martin Dalecki wrote:

>
>>And I was asking about it's possible interactions with TCQ.
>
>
> Haven't even tried TCQ yet, the above is just plain dma (no travelstarts
> can do tcq).

Argh...


>>
>> if (blk_queue_plugged(&drive->queue)) {
>> BUG_ON(!drive->using_tcq);
>> break;

>
> Not exactly, let me see if I remember the race here... The queue can
> become plugged when we queue one request with the drive (the only on the
> queue at that time), and then try to queue another right after (hence
> only a tcq issue). In that time period, we drop the queue lock, so it's
> indeed possible for the block layer to plug the queue before we reach
> the above code again. The drive can be in two states here, 1) IDE_DMA is
> set because the drive didn't release the bus (or it did, and it already
> reconnected), or 2) drive is disconnected from the bus.

OK. We have now just one single place where IDE_DMA gets unset ->
udma_stop. This to too early to reset IDE_BUSY. However it well
may be that ide_dma_intr() simply doesn't care about IDE_BUSY.
Let's have a look...

>
> For non-tcq, hitting IDE_DMA set queue_commands() is a bug. The old
> IDE_BUSY/IDE_DMA worked because IDE_DMA must not be set if IDE_BUSY is
> not set.
>
>
>>This time it's no new damage - just detecting weak code
>>from the past...
>
>
> Smells like new breakage to me :-)

Well lets look at ata_irq_intr, the end of it:

* Note that handler() may have set things up for another
* interrupt to occur soon, but it cannot happen until
* we exit from this routine, because it will be the
* same irq as is currently being serviced here, and Linux
* won't allow another of the same (on any CPU) until we return.
*/
if (startstop == ide_stopped) {
if (!ch->handler) { /* paranoia */
clear_bit(IDE_BUSY, ch->active);
do_request(ch);
} else {
printk("%s: %s: huh? expected NULL handler on exit\n", drive->name, __FUNCTION__);
}
} else if (startstop == ide_released)
queue_commands(drive);

I think the above needs more tough now...

2002-06-21 10:36:01

by Jens Axboe

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

On Fri, Jun 21 2002, Martin Dalecki wrote:
> U?ytkownik Jens Axboe napisa?:
> >On Fri, Jun 21 2002, Martin Dalecki wrote:
>
> >
> >>And I was asking about it's possible interactions with TCQ.
> >
> >
> >Haven't even tried TCQ yet, the above is just plain dma (no travelstarts
> >can do tcq).
>
> Argh...

Indeed

> >>
> >> if (blk_queue_plugged(&drive->queue)) {
> >> BUG_ON(!drive->using_tcq);
> >> break;
>
> >
> >Not exactly, let me see if I remember the race here... The queue can
> >become plugged when we queue one request with the drive (the only on the
> >queue at that time), and then try to queue another right after (hence
> >only a tcq issue). In that time period, we drop the queue lock, so it's
> >indeed possible for the block layer to plug the queue before we reach
> >the above code again. The drive can be in two states here, 1) IDE_DMA is
> >set because the drive didn't release the bus (or it did, and it already
> >reconnected), or 2) drive is disconnected from the bus.
>
> OK. We have now just one single place where IDE_DMA gets unset ->
> udma_stop. This to too early to reset IDE_BUSY. However it well
> may be that ide_dma_intr() simply doesn't care about IDE_BUSY.
> Let's have a look...

You can leave IDE_BUSY there, that's ok. It's not invalid for IDE_BUSY
to be set while IDE_DMA gets cleared. That's expected.

> >For non-tcq, hitting IDE_DMA set queue_commands() is a bug. The old
> >IDE_BUSY/IDE_DMA worked because IDE_DMA must not be set if IDE_BUSY is
> >not set.
> >
> >
> >>This time it's no new damage - just detecting weak code
> >>from the past...
> >
> >
> >Smells like new breakage to me :-)
>
> Well lets look at ata_irq_intr, the end of it:
>
> * Note that handler() may have set things up for another
> * interrupt to occur soon, but it cannot happen until
> * we exit from this routine, because it will be the
> * same irq as is currently being serviced here, and Linux
> * won't allow another of the same (on any CPU) until we return.
> */
> if (startstop == ide_stopped) {
> if (!ch->handler) { /* paranoia */
> clear_bit(IDE_BUSY, ch->active);
> do_request(ch);
> } else {
> printk("%s: %s: huh? expected NULL handler on
> exit\n", drive->name, __FUNCTION__);
> }
> } else if (startstop == ide_released)
> queue_commands(drive);
>
> I think the above needs more tough now...

Same case as the one I described in the email following this, will only
happen for TCQ with release interrupt enabled. Otherwise it's illegal to
release the bus from the tcq interrupt handler. Since I removed all
traces of that long ago, you can safely kill the

} else if (startstop == ide_released)
queue_commands(drive);

part of it.

The rest looks sane. If handler returns it's no longer busy
(ide_stopped), we clear IDE_BUSY (IDE_DMA damn well better be cleared at
this point as well!!) and let do_request() start a new request (heck or
the same, we don't know and don't care).

--
Jens Axboe

2002-06-21 11:10:09

by Martin Dalecki

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

U?ytkownik Jens Axboe napisa?:

>>OK. We have now just one single place where IDE_DMA gets unset ->
>>udma_stop. This to too early to reset IDE_BUSY. However it well
>>may be that ide_dma_intr() simply doesn't care about IDE_BUSY.
>>Let's have a look...
>
>
> You can leave IDE_BUSY there, that's ok. It's not invalid for IDE_BUSY
> to be set while IDE_DMA gets cleared. That's expected.
>

>>Well lets look at ata_irq_intr, the end of it:
>>
>> * Note that handler() may have set things up for another
>> * interrupt to occur soon, but it cannot happen until
>> * we exit from this routine, because it will be the
>> * same irq as is currently being serviced here, and Linux
>> * won't allow another of the same (on any CPU) until we return.
>> */
>> if (startstop == ide_stopped) {
>> if (!ch->handler) { /* paranoia */
>> clear_bit(IDE_BUSY, ch->active);
>> do_request(ch);
>> } else {
>> printk("%s: %s: huh? expected NULL handler on
>> exit\n", drive->name, __FUNCTION__);
>> }
>> } else if (startstop == ide_released)
>> queue_commands(drive);
>>
>>I think the above needs more tough now...
>
>
> Same case as the one I described in the email following this, will only
> happen for TCQ with release interrupt enabled. Otherwise it's illegal to
> release the bus from the tcq interrupt handler. Since I removed all
> traces of that long ago, you can safely kill the
>
> } else if (startstop == ide_released)
> queue_commands(drive);
>
> part of it.

I'm glad to get confirmation on this. This leaves only one place, vide:
do_request, where we can queue up new commands. Much easier to trace and
makes queue_commands never run from IRQ context, which is simplyfiying
things too.

> The rest looks sane. If handler returns it's no longer busy
> (ide_stopped), we clear IDE_BUSY (IDE_DMA damn well better be cleared at
> this point as well!!) and let do_request() start a new request (heck or
> the same, we don't know and don't care).

Right now the handlers are expected to clear IDE_BUSY and ->handler
themself. I have now an idea: Could you add a reporting about
the handler function there:

if (test_bit(IDE_DMA, ch->active)) {
printk(KERN_ERR "%s: error: DMA in progress... %p\n", drive->name, ch->handler);
break;
}

And please take a short look at System.map.

This will show which IRQ handler is the culprit...

If it's indeed ide_dma_intr, let's have a look on it:

We see that it's calling udma_stop() immediately. This should
reset IDE_DMA unconditionally.. immediately on enty:

static inline int udma_stop(struct ata_device *drive)
{
clear_bit(IDE_DMA, drive->channel->active);

return drive->channel->udma_stop(drive);
}

Argh... There is a race in the above it should be:

static inline int udma_stop(struct ata_device *drive)
{
int ret = drive->channel->udma_stop(drive);
clear_bit(IDE_DMA, drive->channel->active);
return ret;
}

Or we should move the clar_bit down do ide_dma_intr and
silbings behind __ata_end_request().
And finally we don't clear the IDE_BUSY on this code path.

2002-06-21 14:57:40

by Stelian Pop

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

> Could you add a reporting about
> the handler function there:
>
> if (test_bit(IDE_DMA, ch->active)) {
> printk(KERN_ERR "%s: error: DMA in progress... %p\n",
> drive->name, ch->handler);
> break;
> }
>
> And please take a short look at System.map.
>
> This will show which IRQ handler is the culprit...

Martin, I have the same problem on my Sony Vaio C1VE,
Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01), HITACHI_DK23AA-12 disk.

It doesn't even boot, the "DMA in progress error..." appears just
after having mounted the root partition. 2.5.23 worked on this laptop.

I've added, as you suggested the following patch:

===== drivers/ide/ide.c 1.110 vs edited =====
--- 1.110/drivers/ide/ide.c Thu Jun 20 13:28:43 2002
+++ edited/drivers/ide/ide.c Fri Jun 21 16:46:29 2002
@@ -863,7 +863,7 @@
drive->sleep = 0;

if (test_bit(IDE_DMA, ch->active)) {
- printk(KERN_ERR "%s: error: DMA in progress...\n", drive->name);
+ printk(KERN_ERR "%s: error: DMA in progress...%p\n", drive->name, ch->handler);
break;
}

And the bad news is that ch->handler is NULL...

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-06-24 08:55:06

by Stelian Pop

[permalink] [raw]
Subject: Re: hda: error: DMA in progress..

On Fri, Jun 21, 2002 at 04:57:24PM +0200, Stelian Pop wrote:

> Martin, I have the same problem on my Sony Vaio C1VE,
> Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01), HITACHI_DK23AA-12 disk.
>
> It doesn't even boot, the "DMA in progress error..." appears just
> after having mounted the root partition. 2.5.23 worked on this laptop.

Ok, after poking around the "waiting_for_dma" changes, I found that
the attached patch is solving all my problems, my laptop works again.

BIG FAT WARNING: I have a very limited knowledge in the ide driver
internals, the attached patch could destroy all your data!

Please advice.

Stelian.

===== include/linux/ide.h 1.90 vs edited =====
--- 1.90/include/linux/ide.h Thu Jun 20 13:35:15 2002
+++ edited/include/linux/ide.h Mon Jun 24 10:17:00 2002
@@ -766,10 +766,12 @@
*/
static inline ide_startstop_t udma_init(struct ata_device *drive, struct request *rq)
{
- int ret = drive->channel->udma_init(drive, rq);
- if (ret == ide_started)
- set_bit(IDE_DMA, drive->channel->active);
-
+ int ret;
+
+ set_bit(IDE_DMA, drive->channel->active);
+ ret = drive->channel->udma_init(drive, rq);
+ if (ret != ide_started)
+ clear_bit(IDE_DMA, drive->channel->active);
return ret;
}

--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com