At the end of a transfer, check that the DMA engine in the
SDHCI controller actually did what it was meant to and didn't
overrun the end of the buffer.
This seems to be triggered by a timeout during an CMD25 (multiple block
write) to a card. The mmc_block module then issues a command to find out
how much data was moved and this seems to end up triggering this DMA
check. The result is the card's queue generates an OOPS as the stack has
been trampled on due to the extra data transfered.
Signed-off-by: Ben Dooks <[email protected]>
Index: linux.git/drivers/mmc/host/sdhci.c
===================================================================
--- linux.git.orig/drivers/mmc/host/sdhci.c 2008-10-29 16:59:38.000000000 +0000
+++ linux.git/drivers/mmc/host/sdhci.c 2008-10-29 16:59:41.000000000 +0000
@@ -736,6 +736,23 @@ static void sdhci_set_transfer_mode(stru
writew(mode, host->ioaddr + SDHCI_TRANSFER_MODE);
}
+static void shdci_check_dma_overrun(struct sdhci_host *host, struct mmc_data *data)
+{
+ u32 dma_pos = readl(host->ioaddr + SDHCI_DMA_ADDRESS);
+ u32 dma_start = sg_dma_address(data->sg);
+ u32 dma_end = dma_start + data->sg->length;
+
+ /* Test whether we ended up moving more data than
+ * was originally requested. */
+
+ if (dma_pos <= dma_end)
+ return;
+
+ printk(KERN_ERR "%s: dma overrun, dma %08x, req %08x..%08x\n",
+ mmc_hostname(host->mmc), dma_pos,
+ dma_start, dma_end);
+}
+
static void sdhci_finish_data(struct sdhci_host *host)
{
struct mmc_data *data;
@@ -749,6 +766,8 @@ static void sdhci_finish_data(struct sdh
if (host->flags & SDHCI_USE_ADMA)
sdhci_adma_table_post(host, data);
else {
+ shdci_check_dma_overrun(host, data);
+
dma_unmap_sg(mmc_dev(host->mmc), data->sg,
data->sg_len, (data->flags & MMC_DATA_READ) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
--
Ben ([email protected], http://www.fluff.org/)
'a smiley only costs 4 bytes'
Maybe I didn't understand it right, but if the DMA controller could overrun
a buffer, don't you ALSO need to add defensive padding (i.e. increase the
buffer) to make sure nothing important gets overrun?
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
On Mon, Nov 03, 2008 at 07:12:00PM -0200, Henrique de Moraes Holschuh wrote:
> Maybe I didn't understand it right, but if the DMA controller could overrun
> a buffer, don't you ALSO need to add defensive padding (i.e. increase the
> buffer) to make sure nothing important gets overrun?
This is only generated by problems elsewhere in the driver, such as
getting the timeout clock wrong. It is here just as a precaution and
as an aide to debugging, it should not trigger in normal circumstances.
There is a seperate problem where the DMA buffer is passed from the stack
which is, IIRC, a complete no-no under Linux.
--
Ben
Q: What's a light-year?
A: One-third less calories than a regular year.
On Mon, 03 Nov 2008, Ben Dooks wrote:
> On Mon, Nov 03, 2008 at 07:12:00PM -0200, Henrique de Moraes Holschuh wrote:
> > Maybe I didn't understand it right, but if the DMA controller could overrun
> > a buffer, don't you ALSO need to add defensive padding (i.e. increase the
> > buffer) to make sure nothing important gets overrun?
>
> This is only generated by problems elsewhere in the driver, such as
> getting the timeout clock wrong. It is here just as a precaution and
> as an aide to debugging, it should not trigger in normal circumstances.
Then why is it just a WARN_ON, since you had a rogue DMA operation
overwriting unknown kernel memory? Seems like an outright BUG_ON to me.
> There is a seperate problem where the DMA buffer is passed from the stack
> which is, IIRC, a complete no-no under Linux.
Can't say much on that. I just found it strange that something as damaging
as an overrun was only getting a WARN_ON and no defensive measure. If it is
not going to happen normally, it might not require a defensive buffer, but
once it happens, it looks like one must reboot ASAP from what you said...
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
On Mon, Nov 03, 2008 at 11:28:40PM -0200, Henrique de Moraes Holschuh wrote:
> On Mon, 03 Nov 2008, Ben Dooks wrote:
> > On Mon, Nov 03, 2008 at 07:12:00PM -0200, Henrique de Moraes Holschuh wrote:
> > > Maybe I didn't understand it right, but if the DMA controller could overrun
> > > a buffer, don't you ALSO need to add defensive padding (i.e. increase the
> > > buffer) to make sure nothing important gets overrun?
> >
> > This is only generated by problems elsewhere in the driver, such as
> > getting the timeout clock wrong. It is here just as a precaution and
> > as an aide to debugging, it should not trigger in normal circumstances.
>
> Then why is it just a WARN_ON, since you had a rogue DMA operation
> overwriting unknown kernel memory? Seems like an outright BUG_ON to me.
It is a problem, but it doesn't kill the entire system. We could print it
at a higher level. The WARN_ON()/BUG_ON() where not appropriate, as we do
not need a whole stack backtrace, and I belive the mmc block thread somehow
seems to manage to keep running even with an OOPS.
> > There is a seperate problem where the DMA buffer is passed from the stack
> > which is, IIRC, a complete no-no under Linux.
>
> Can't say much on that. I just found it strange that something as damaging
> as an overrun was only getting a WARN_ON and no defensive measure. If it is
> not going to happen normally, it might not require a defensive buffer, but
> once it happens, it looks like one must reboot ASAP from what you said...
--
Ben ([email protected], http://www.fluff.org/)
'a smiley only costs 4 bytes'
On Mon, 03 Nov 2008 20:09:50 +0000
Ben Dooks <[email protected]> wrote:
> At the end of a transfer, check that the DMA engine in the
> SDHCI controller actually did what it was meant to and didn't
> overrun the end of the buffer.
>
> This seems to be triggered by a timeout during an CMD25 (multiple block
> write) to a card. The mmc_block module then issues a command to find out
> how much data was moved and this seems to end up triggering this DMA
> check. The result is the card's queue generates an OOPS as the stack has
> been trampled on due to the extra data transfered.
>
> Signed-off-by: Ben Dooks <[email protected]>
I'm sorry, but I don't see how this is anywhere near acceptable. This
should be a panic at the very least, and until this can be sorted out
and avoided the driver should avoid using DMA on these chips.
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
rdesktop, core developer http://www.rdesktop.org
WARNING: This correspondence is being monitored by the
Swedish government. Make sure your server uses encryption
for SMTP traffic and consider using PGP for end-to-end
encryption.
On Fri, Nov 14, 2008 at 11:00:26PM +0100, Pierre Ossman wrote:
> On Mon, 03 Nov 2008 20:09:50 +0000
> Ben Dooks <[email protected]> wrote:
>
> > At the end of a transfer, check that the DMA engine in the
> > SDHCI controller actually did what it was meant to and didn't
> > overrun the end of the buffer.
> >
> > This seems to be triggered by a timeout during an CMD25 (multiple block
> > write) to a card. The mmc_block module then issues a command to find out
> > how much data was moved and this seems to end up triggering this DMA
> > check. The result is the card's queue generates an OOPS as the stack has
> > been trampled on due to the extra data transfered.
> >
> > Signed-off-by: Ben Dooks <[email protected]>
>
> I'm sorry, but I don't see how this is anywhere near acceptable. This
> should be a panic at the very least, and until this can be sorted out
> and avoided the driver should avoid using DMA on these chips.
A panic won't help get the debug logs out of the kernel. This only
turned up whilst debugging the controller, I got the timeout clock
calculation wrong and thus ended up timing out pretty much all the
CMD25s and seeing this problem.
--
Ben ([email protected], http://www.fluff.org/)
'a smiley only costs 4 bytes'
On Sun, 16 Nov 2008 00:05:08 +0000
Ben Dooks <[email protected]> wrote:
> On Fri, Nov 14, 2008 at 11:00:26PM +0100, Pierre Ossman wrote:
> >
> > I'm sorry, but I don't see how this is anywhere near acceptable. This
> > should be a panic at the very least, and until this can be sorted out
> > and avoided the driver should avoid using DMA on these chips.
>
> A panic won't help get the debug logs out of the kernel.
Indeed, but it will avoid corrupting the system.
> This only
> turned up whilst debugging the controller, I got the timeout clock
> calculation wrong and thus ended up timing out pretty much all the
> CMD25s and seeing this problem.
>
Just because you had to provoke it doesn't mean it won't appear under
"normal" circumstances for others.
Until this problem is fully understood I think DMA should be turned off
(or possibly needs to be explicitly forced on using Kconfig or a module
parameter).
Do you have any contacts at Samsung that can help out here?
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
rdesktop, core developer http://www.rdesktop.org
WARNING: This correspondence is being monitored by the
Swedish government. Make sure your server uses encryption
for SMTP traffic and consider using PGP for end-to-end
encryption.