2008-07-31 17:10:51

by Haavard Skinnemoen

Subject: [PATCH] atmel_spi: fix hang due to missed interrupt

From: Gerard Kam <[email protected]>

For some time my at91sam9260 board with JFFS2 on serial flash (m25p80) would
hang when accessing the serial flash and SPI bus. Slowing the SPI clock
down to 9 MHz reduced the occurrence of the hang from "always" during boot
to a nuisance level that allowed other SW development to continue. I finally
had to address the issue when an application stressed the I/O enough to
cause a hang every time.

The hang seems to be caused by a missed SPI interrupt, so that the task ends up
waiting forever after calling spi_sync(). The fix has two parts. The first is to
halt the DMA engine before the "current" PDC registers are loaded. This
ensures that the "next" registers are loaded before the DMA operation takes
off. The second part of the fix is a kludge that adds a "completion"
interrupt in case the ENDRX interrupt for the last segment of the DMA
chaining operation was missed.

The patch allows the SPI clock for the serial flash to be increased from 9
MHz to 15 MHz (or more?). No hangs or SPI overruns were encountered.

Signed-off-by: Gerard Kam <[email protected]>

While this patch does indeed improve things, I still see overruns and
CRC errors on my NGW100 board when running the DataFlash at 10 MHz.
However, I think some improvement is better than nothing, so I'm
passing this on for inclusion in 2.6.27.

Signed-off-by: Haavard Skinnemoen <[email protected]>
---
drivers/spi/atmel_spi.c | 17 ++++++++++++-----
1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/spi/atmel_spi.c b/drivers/spi/atmel_spi.c
index 0c71656..95190c6 100644
--- a/drivers/spi/atmel_spi.c
+++ b/drivers/spi/atmel_spi.c
@@ -184,7 +184,8 @@ static void atmel_spi_next_xfer(struct spi_master *master,
{
struct atmel_spi *as = spi_master_get_devdata(master);
struct spi_transfer *xfer;
- u32 len, remaining, total;
+ u32 len, remaining;
+ u32 ieval;
dma_addr_t tx_dma, rx_dma;

if (!as->current_transfer)
@@ -197,6 +198,8 @@ static void atmel_spi_next_xfer(struct spi_master *master,
xfer = NULL;

if (xfer) {
+ spi_writel(as, PTCR, SPI_BIT(RXTDIS) | SPI_BIT(TXTDIS));
+
len = xfer->len;
atmel_spi_next_xfer_data(master, xfer, &tx_dma, &rx_dma, &len);
remaining = xfer->len - len;
@@ -234,6 +237,8 @@ static void atmel_spi_next_xfer(struct spi_master *master,
as->next_transfer = xfer;

if (xfer) {
+ u32 total;
+
total = len;
atmel_spi_next_xfer_data(master, xfer, &tx_dma, &rx_dma, &len);
as->next_remaining_bytes = total - len;
@@ -250,9 +255,11 @@ static void atmel_spi_next_xfer(struct spi_master *master,
" next xfer %p: len %u tx %p/%08x rx %p/%08x\n",
xfer, xfer->len, xfer->tx_buf, xfer->tx_dma,
xfer->rx_buf, xfer->rx_dma);
+ ieval = SPI_BIT(ENDRX) | SPI_BIT(OVRES);
} else {
spi_writel(as, RNCR, 0);
spi_writel(as, TNCR, 0);
+ ieval = SPI_BIT(RXBUFF) | SPI_BIT(ENDRX) | SPI_BIT(OVRES);
}

/* REVISIT: We're waiting for ENDRX before we start the next
@@ -265,7 +272,7 @@ static void atmel_spi_next_xfer(struct spi_master *master,
*
* It should be doable, though. Just not now...
*/
- spi_writel(as, IER, SPI_BIT(ENDRX) | SPI_BIT(OVRES));
+ spi_writel(as, IER, ieval);
spi_writel(as, PTCR, SPI_BIT(TXTEN) | SPI_BIT(RXTEN));
}

@@ -396,7 +403,7 @@ atmel_spi_interrupt(int irq, void *dev_id)

ret = IRQ_HANDLED;

- spi_writel(as, IDR, (SPI_BIT(ENDTX) | SPI_BIT(ENDRX)
+ spi_writel(as, IDR, (SPI_BIT(RXBUFF) | SPI_BIT(ENDRX)
| SPI_BIT(OVRES)));

/*
@@ -418,7 +425,7 @@ atmel_spi_interrupt(int irq, void *dev_id)
if (xfer->delay_usecs)
udelay(xfer->delay_usecs);

- dev_warn(master->dev.parent, "fifo overrun (%u/%u remaining)\n",
+ dev_warn(master->dev.parent, "overrun (%u/%u remaining)\n",
spi_readl(as, TCR), spi_readl(as, RCR));

/*
@@ -442,7 +449,7 @@ atmel_spi_interrupt(int irq, void *dev_id)
spi_readl(as, SR);

atmel_spi_msg_done(master, as, msg, -EIO, 0);
- } else if (pending & SPI_BIT(ENDRX)) {
+ } else if (pending & (SPI_BIT(RXBUFF) | SPI_BIT(ENDRX))) {
ret = IRQ_HANDLED;

spi_writel(as, IDR, pending);
--
1.5.6.3


2008-08-01 13:50:08

by Haavard Skinnemoen

Subject: Re: [PATCH] atmel_spi: fix hang due to missed interrupt

Haavard Skinnemoen <[email protected]> wrote:
> spi_writel(as, RNCR, 0);
> spi_writel(as, TNCR, 0);
> + ieval = SPI_BIT(RXBUFF) | SPI_BIT(ENDRX) | SPI_BIT(OVRES);

Actually, I think the real bug happens right here: Writing 0 to RNCR
will clear any pending ENDRX interrupt, so if the transfer is completed
before this, we won't see any interrupt. These writes are also
completely pointless -- RNCR is zeroed automatically after it gets
shifted into RCR. TNCR works the same way.

The RXBUFF interrupt is only cleared by writing a nonzero RCR or RNCR,
so your patch should fix it. But I'm wondering if there may be another
race left to fix: If we queue two transfers, and both of them complete
before we handle the interrupt, I think we only consider one of them to
be complete. If RXBUFF is set, we should complete any "next" transfer
we have queued up as well.

It could be your patch fixes this last case too -- when this happens,
RXBUFF stays set when we return from the interrupt handler, so the
interrupt gets retriggered immediately. We could handle this more
efficiently, but I think it's handled correctly with your patch applied.
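
To make the suspected race concrete, here is a toy user-space model of the
receive side of the PDC. It is only a sketch of the flag behaviour as
described in this message (ENDRX is dropped by any write to RCR or RNCR,
RXBUFF stays asserted while both counters are zero); it is not
register-accurate and is not driver code. The struct pdc_model, the pdc_*
helpers and irq_seen_after_early_completion() are all hypothetical names
made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the PDC receive counters; an assumption-laden sketch,
 * not a description of the real hardware. */
struct pdc_model {
	uint32_t rcr;	/* current receive counter */
	uint32_t rncr;	/* next receive counter */
	bool endrx;	/* RCR reached zero since the last counter write */
};

/* RXBUFF is modelled as a level: set while both counters are zero. */
static bool pdc_rxbuff(const struct pdc_model *p)
{
	return p->rcr == 0 && p->rncr == 0;
}

/* Any write to RCR or RNCR drops a pending ENDRX, even a write of 0. */
static void pdc_write_rcr(struct pdc_model *p, uint32_t v)
{
	p->rcr = v;
	p->endrx = false;
}

static void pdc_write_rncr(struct pdc_model *p, uint32_t v)
{
	p->rncr = v;
	p->endrx = false;
}

/* Receive up to n bytes; when RCR hits zero, RNCR is shifted into RCR
 * and RNCR is zeroed automatically. */
static void pdc_receive(struct pdc_model *p, uint32_t n)
{
	while (n-- && p->rcr) {
		if (--p->rcr == 0) {
			p->endrx = true;
			p->rcr = p->rncr;
			p->rncr = 0;
		}
	}
}

/* The transfer finishes before the driver gets around to zeroing RNCR.
 * Returns true if an enabled interrupt source is still asserted. */
static bool irq_seen_after_early_completion(bool rxbuff_enabled)
{
	struct pdc_model p = { 0 };

	pdc_write_rcr(&p, 4);
	pdc_receive(&p, 4);	/* completes early, ENDRX is now pending */
	assert(p.rcr == 0);
	pdc_write_rncr(&p, 0);	/* the redundant write drops ENDRX */

	return p.endrx || (rxbuff_enabled && pdc_rxbuff(&p));
}
```

In this model, irq_seen_after_early_completion(false) returns false (only
ENDRX is enabled and it has been cleared, so the waiter hangs), while
irq_seen_after_early_completion(true) returns true because RXBUFF is still
asserted.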

I'll see if I can find a way to clean up the somewhat headache-inducing
control flow in this driver, but until then, your patch should
definitely improve things.

As for the overruns, I'm beginning to suspect that the only way to get
rid of those and still maintain a reasonable transfer rate is to use
bounce buffers in faster RAM (e.g. on-chip SRAM).

Haavard

2008-08-01 20:14:18

by Gerard Kam

Subject: RE: [PATCH] atmel_spi: fix hang due to missed interrupt

Hi there

> -----Original Message-----
> From: Haavard Skinnemoen [mailto:[email protected]]
> Sent: Friday, August 01, 2008 6:50 AM
>
> Haavard Skinnemoen <[email protected]> wrote:
> > spi_writel(as, RNCR, 0);
> > spi_writel(as, TNCR, 0);

> These writes are also completely pointless -- RNCR is zeroed
> automatically after it gets shifted into RCR.

While looking at the patch yesterday I was thinking the same thing. Now it
bugs me that this observation didn't occur to me when I was working on this
problem. Maybe the code symmetry makes it look "correct".

> Actually, I think the real bug happens right here

You're probably correct. A race condition that intermittently clears a
pending interrupt fits the observed symptom.


> As for the overruns, I'm beginning to suspect that the only way to get
> rid of those and still maintain a reasonable transfer rate is to use
> bounce buffers in faster RAM (e.g. on-chip SRAM).

For my at91sam9260 board, I eliminated one cause of SPI overruns by lowering
the interrupt priorities of the six USARTs (default was 5, changed to 4)
relative to the two SPI controllers (default is 5). The test I used for
this issue is 'ls -lR' on the flash filesystem.

Regards -- Gerard