2003-02-05 16:47:50

by Ross Biro

[permalink] [raw]
Subject: Re: 2.4.21-pre4: PDC ide driver problems with shared interrupts

Benjamin Herrenschmidt wrote:

>>Okay, I had to watch for it a bit longer and it turns out that the kernel PDC driver has a problem in this shared interrupt setup. When loads get high it seems to run into some timing problem which causes things like:
>>
>>Feb 4 01:02:22 admin kernel: hde: dma_intr: status=0xd0 { Busy }
>>
>>
>>
Since the busy bit is set, we know the drive must have received a
command. Since dma_intr thought the drive was not busy, an interrupt
must have snuck through between the command being issued and the dma
being started. I think in my original patch, I had the dma start
outside of the spinlock, that is a bug. The command to the controller
to start the dma must be inside of the spinlock.

I have not looked at 2.4.21-pre4 at all, so I could be entirely off base
here.

Ross


2003-02-05 17:02:03

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: 2.4.21-pre4: PDC ide driver problems with shared interrupts

On Wed, 2003-02-05 at 17:56, Ross Biro wrote:
> Benjamin Herrenschmidt wrote:
>
> >>Okay, I had to watch for it a bit longer and it turns out that the kernel PDC driver has a problem in this shared interrupt setup. When loads get high it seems to run into some timing problem which causes things like:
> >>
> >>Feb 4 01:02:22 admin kernel: hde: dma_intr: status=0xd0 { Busy }
> >>
> >>
> >>
> Since the busy bit is set, we know the drive must have received a
> command. Since dma_intr thought the drive was not busy, an interrupt
> must have snuck through between the command being issued and the dma
> being started. I think in my original patch, I had the dma start
> outside of the spinlock, that is a bug. The command to the controller
> to start the dma must be inside of the spinlock.

While I agree with you here, I don't think it's what's happening.

In ide-disk, do_rw_disk sets up the taskfile, then basically calls
hwif->ide_dma_read/write to start the command.

In ide-dma.c in 2.4.21-pre4, what happens is:

/* PRD table */
hwif->OUTL(hwif->dmatable_dma, hwif->dma_prdtable);
/* specify r/w */
hwif->OUTB(reading, hwif->dma_command);
/* read dma_status for INTR & ERROR flags */
dma_stat = hwif->INB(hwif->dma_status);
/* clear INTR & ERROR flags */
hwif->OUTB(dma_stat|6, hwif->dma_status);
drive->waiting_for_dma = 1;
if (drive->media != ide_disk)
return 0;
.../...
Then issue command byte.

Below we clear the DMA status _and_ set waiting_for_dma to 1.
That means that if an IRQ sneaks in, we will call
drive_is_ready(), which shouldn't return INTR 1 since we
just cleared it. I don't see how a race could happen here,
but I might have missed something.

Even if, on SMP, the code below executes _simultaneously_
with ide_intr, the later will check for handler beeing
non-NULL before checking waiting_for_dma (drive_is_ready),
and thus will not race since we set the handler after.

The only thing I see is a possible wraparound of
waiting_for_dma. It's an u8, so it wraps at 255. However,
it's incremented in each __ide_dma_test_irq call. So if you
get more than 255 shared (network in your case) interrupts
before the end of the command, you die.

Alan: you can remove safely the waiting_for_dma++, I beleive,
in drive_is_ready(). I don't know how that code sneaked in
ide-dma. I indeed do that in ppc/pmac.c for other reasons
(sort of timeout condition on the DMA controller that happens
when I get an initial error), but this is totally unrelated
HW on which I know I have no shared IRQ.

Stephan: Can you try editing ide-dma.c, function
__ide_dma_test_irq(), and remove that line:

- drive->waiting_for_dma++;

And tell us if it helps in any way.

Ben.

2003-02-06 12:11:27

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.21-pre4: PDC ide driver problems with shared interrupts

On 05 Feb 2003 18:12:31 +0100
Benjamin Herrenschmidt <[email protected]> wrote:


> Stephan: Can you try editing ide-dma.c, function
> __ide_dma_test_irq(), and remove that line:
>
> - drive->waiting_for_dma++;
>
> And tell us if it helps in any way.
>
> Ben.

Hello Ben,

as requested I tried the above "patch" and had no problem so far. Current
situation is:
(ide2, ide3 are PDC, eth2 is tg3)

CPU0
0: 6332048 XT-PIC timer
1: 14112 XT-PIC keyboard
2: 0 XT-PIC cascade
5: 0 XT-PIC EMU10K1
7: 14950 XT-PIC HiSax
9: 30600647 XT-PIC ide2, ide3, aic7xxx, aic7xxx, eth0, eth1, eth2
12: 234451 XT-PIC PS/2 Mouse
15: 2 XT-PIC ide1
NMI: 0
LOC: 0
ERR: 0
MIS: 0

I would not say this is a rock-solid test case. I will continue to stress the
setup and keep you informed. Anyway it looks stable up to now.

Regards,
Stephan

2003-02-06 22:53:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: 2.4.21-pre4: PDC ide driver problems with shared interrupts

On Thu, 2003-02-06 at 13:20, Stephan von Krawczynski wrote:
> On 05 Feb 2003 18:12:31 +0100
> Benjamin Herrenschmidt <[email protected]> wrote:
>
>
> > Stephan: Can you try editing ide-dma.c, function
> > __ide_dma_test_irq(), and remove that line:
> >
> > - drive->waiting_for_dma++;
> >
> > And tell us if it helps in any way.
> >
> > Ben.
>
> Hello Ben,
>
> as requested I tried the above "patch" and had no problem so far. Current
> situation is:
> (ide2, ide3 are PDC, eth2 is tg3)

Ok, well, if it' still stable by now, I beleive we can safely remove
that line from ide_dma_test_irq(). AFAIK, it really have nothing to do
here.

(I suspect it got copied from ide-pmac somewhat... I use it as a counter
in there to implement some timeout when the DMA engine didn't start at
all because the disk issued an error, and on these, I know for sure
the IRQ isn't shared...)

Alan, can you include that ?

===== drivers/ide/ide-dma.c 1.10 vs edited =====
--- 1.10/drivers/ide/ide-dma.c Sat Feb 1 20:37:36 2003
+++ edited/drivers/ide/ide-dma.c Fri Feb 7 00:03:43 2003
@@ -826,7 +826,6 @@
if (!drive->waiting_for_dma)
printk(KERN_WARNING "%s: (%s) called while not
waiting\n",
drive->name, __FUNCTION__);
- drive->waiting_for_dma++;
return 0;
}

(Patch against Marcelo's 2.4.21-pre4)


2003-02-07 09:01:33

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.21-pre4: PDC ide driver problems with shared interrupts

On 07 Feb 2003 00:04:18 +0100
Benjamin Herrenschmidt <[email protected]> wrote:

> On Thu, 2003-02-06 at 13:20, Stephan von Krawczynski wrote:
> > On 05 Feb 2003 18:12:31 +0100
> > Benjamin Herrenschmidt <[email protected]> wrote:
> >
> >
> > > Stephan: Can you try editing ide-dma.c, function
> > > __ide_dma_test_irq(), and remove that line:
> > >
> > > - drive->waiting_for_dma++;
> > >
> > > And tell us if it helps in any way.
> > >
> > > Ben.
> >
> > Hello Ben,
> >
> > as requested I tried the above "patch" and had no problem so far. Current
> > situation is:
> > (ide2, ide3 are PDC, eth2 is tg3)
>
> Ok, well, if it' still stable by now, I beleive we can safely remove
> that line from ide_dma_test_irq(). AFAIK, it really have nothing to do
> here.

Hello all,

it is still working ok, currently we are at:

CPU0
0: 13848205 XT-PIC timer
1: 54117 XT-PIC keyboard
2: 0 XT-PIC cascade
5: 0 XT-PIC EMU10K1
7: 27260 XT-PIC HiSax
9: 67048861 XT-PIC ide2, ide3, aic7xxx, aic7xxx, eth0, eth1, eth2
12: 765541 XT-PIC PS/2 Mouse
15: 229 XT-PIC ide1
NMI: 0
LOC: 0
ERR: 0
MIS: 0

Regards,
Stephan