Hi all,
I've been looking at how ath9k does DMA and comparing with the
recommendations in the Linux kernel documentation for the DMA API[1]. To
me it looks like we risk setting up incorrect DMA descriptors on
platforms that can reorder writes because all data in the descriptor may
not be written to memory before the descriptor is linked into the DMA
chain.
The patch below attempts to remove this risk by inserting a write memory
barrier between where we set up a descriptor and where we add it to the
DMA chain. My hope is that this may solve some of the harder chip lockups
on MIPS but more testing is required to determine if it has this effect.
Any thoughts?
/Bj?rn
1. http://www.mjmwired.net/kernel/Documentation/DMA-API.txt
---
diff --git a/drivers/net/wireless/ath/ath9k/beacon.c b/drivers/net/wireless/ath/ath9k/beacon.c
index 081192e..8e43443 100644
--- a/drivers/net/wireless/ath/ath9k/beacon.c
+++ b/drivers/net/wireless/ath/ath9k/beacon.c
@@ -450,6 +450,8 @@ void ath_beacon_tasklet(unsigned long data)
"beacon queue %u did not stop?\n", sc->beacon.beaconq);
}
+ wmb();
+
/* NB: cabq traffic should already be queued and primed */
ath9k_hw_puttxbuf(ah, sc->beacon.beaconq, bfaddr);
ath9k_hw_txstart(ah, sc->beacon.beaconq);
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 9c166f3..2cc0271 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -90,6 +90,8 @@ static void ath_rx_buf_link(struct ath_softc *sc, struct ath_buf *bf)
common->rx_bufsize,
0);
+ wmb();
+
if (sc->rx.rxlink == NULL)
ath9k_hw_putrxbuf(ah, bf->bf_daddr);
else
@@ -500,6 +502,9 @@ int ath_startrecv(struct ath_softc *sc)
goto start_recv;
bf = list_first_entry(&sc->rx.rxbuf, struct ath_buf, list);
+
+ wmb();
+
ath9k_hw_putrxbuf(ah, bf->bf_daddr);
ath9k_hw_rxena(ah);
diff --git a/drivers/net/wireless/ath/ath9k/xmit.c b/drivers/net/wireless/ath/ath9k/xmit.c
index f7da6b2..4b8c428 100644
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -1250,6 +1250,8 @@ static void ath_tx_txqaddbuf(struct ath_softc *sc, struct ath_txq *txq,
ath_print(common, ATH_DBG_QUEUE,
"qnum: %d, txq depth: %d\n", txq->axq_qnum, txq->axq_depth);
+ wmb();
+
if (sc->sc_ah->caps.hw_caps & ATH9K_HW_CAP_EDMA) {
if (txq->axq_depth >= ATH_TXFIFO_DEPTH) {
list_splice_tail_init(head, &txq->txq_fifo_pending);
On 2010-10-09 5:19 PM, Bj?rn Smedman wrote:
> On Tue, Oct 5, 2010 at 9:50 PM, Luis R. Rodriguez
> <[email protected]> wrote:
>> Felix is more familiar with this area so I'll let him chime with
>> his ACK/NACK.
>>
>> Luis
>
> So Felix, what do you think? I realize it may not be a common problem
> on any currently popular platform, but don't you agree it is in
> principle unsafe to write to the ds_link field of dma descriptors
> without a guarantee that those writes will be performed in the
> intended order?
Haven't had time to review this in detail yet. I'll take another look at
it later...
- Felix
On Mon, Oct 04, 2010 at 10:55:08PM +0200, Bj?rn Smedman wrote:
> Hi all,
>
> I've been looking at how ath9k does DMA and comparing with the
> recommendations in the Linux kernel documentation for the DMA API[1]. To
> me it looks like we risk setting up incorrect DMA descriptors on
> platforms that can reorder writes because all data in the descriptor may
> not be written to memory before the descriptor is linked into the DMA
> chain.
>
> The patch below attempts to remove this risk by inserting a write memory
> barrier between where we set up a descriptor and where we add it to the
> DMA chain. My hope is that this may solve some of the harder chip lockups
> on MIPS but more testing is required to determine if it has this effect.
>
> Any thoughts?
>
> /Bj?rn
>
> 1. http://www.mjmwired.net/kernel/Documentation/DMA-API.txt
I think this seems OK...?
John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.
On Tue, Oct 05, 2010 at 06:17:44AM -0700, John W. Linville wrote:
> On Mon, Oct 04, 2010 at 10:55:08PM +0200, Bj?rn Smedman wrote:
> > Hi all,
> >
> > I've been looking at how ath9k does DMA and comparing with the
> > recommendations in the Linux kernel documentation for the DMA API[1]. To
> > me it looks like we risk setting up incorrect DMA descriptors on
> > platforms that can reorder writes because all data in the descriptor may
> > not be written to memory before the descriptor is linked into the DMA
> > chain.
> >
> > The patch below attempts to remove this risk by inserting a write memory
> > barrier between where we set up a descriptor and where we add it to the
> > DMA chain. My hope is that this may solve some of the harder chip lockups
> > on MIPS but more testing is required to determine if it has this effect.
> >
> > Any thoughts?
> >
> > /Bj?rn
> >
> > 1. http://www.mjmwired.net/kernel/Documentation/DMA-API.txt
>
> I think this seems OK...?
Felix is more familiar with this area so I'll let him chime with
his ACK/NACK.
Luis
On 10/09/2010 08:23 AM, Felix Fietkau wrote:
> On 2010-10-09 5:19 PM, Bj?rn Smedman wrote:
>> On Tue, Oct 5, 2010 at 9:50 PM, Luis R. Rodriguez
>> <[email protected]> wrote:
>>> Felix is more familiar with this area so I'll let him chime with
>>> his ACK/NACK.
>>>
>>> Luis
>>
>> So Felix, what do you think? I realize it may not be a common problem
>> on any currently popular platform, but don't you agree it is in
>> principle unsafe to write to the ds_link field of dma descriptors
>> without a guarantee that those writes will be performed in the
>> intended order?
> Haven't had time to review this in detail yet. I'll take another look at
> it later...
Had a chance to look at it? I've been carrying this in my
tree and it hasn't caused any obvious problems, for what that's
worth...
Ben
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
On Tue, Oct 5, 2010 at 9:50 PM, Luis R. Rodriguez
<[email protected]> wrote:
> Felix is more familiar with this area so I'll let him chime with
> his ACK/NACK.
>
> ?Luis
So Felix, what do you think? I realize it may not be a common problem
on any currently popular platform, but don't you agree it is in
principle unsafe to write to the ds_link field of dma descriptors
without a guarantee that those writes will be performed in the
intended order?
/Bj?rn
On 2010-10-29 1:36 AM, Ben Greear wrote:
> On 10/09/2010 08:23 AM, Felix Fietkau wrote:
>> On 2010-10-09 5:19 PM, Bj?rn Smedman wrote:
>>> On Tue, Oct 5, 2010 at 9:50 PM, Luis R. Rodriguez
>>> <[email protected]> wrote:
>>>> Felix is more familiar with this area so I'll let him chime with
>>>> his ACK/NACK.
>>>>
>>>> Luis
>>>
>>> So Felix, what do you think? I realize it may not be a common problem
>>> on any currently popular platform, but don't you agree it is in
>>> principle unsafe to write to the ds_link field of dma descriptors
>>> without a guarantee that those writes will be performed in the
>>> intended order?
>> Haven't had time to review this in detail yet. I'll take another look at
>> it later...
>
> Had a chance to look at it? I've been carrying this in my
> tree and it hasn't caused any obvious problems, for what that's
> worth...
Not sure if the patch helps with anything, but I think it can go in...
- Felix
On Thu, Oct 28, 2010 at 4:45 PM, Felix Fietkau <[email protected]> wrote:
> On 2010-10-29 1:36 AM, Ben Greear wrote:
>> On 10/09/2010 08:23 AM, Felix Fietkau wrote:
>>> On 2010-10-09 5:19 PM, Björn Smedman wrote:
>>>> On Tue, Oct 5, 2010 at 9:50 PM, Luis R. Rodriguez
>>>> <[email protected]> wrote:
>>>>> Felix is more familiar with this area so I'll let him chime with
>>>>> his ACK/NACK.
>>>>>
>>>>> Luis
>>>>
>>>> So Felix, what do you think? I realize it may not be a common problem
>>>> on any currently popular platform, but don't you agree it is in
>>>> principle unsafe to write to the ds_link field of dma descriptors
>>>> without a guarantee that those writes will be performed in the
>>>> intended order?
>>> Haven't had time to review this in detail yet. I'll take another look at
>>> it later...
>>
>> Had a chance to look at it? I've been carrying this in my
>> tree and it hasn't caused any obvious problems, for what that's
>> worth...
> Not sure if the patch helps with anything, but I think it can go in...
I'd rather not add it if its not fixing anything. I hate code that is
there just because we have a fuzzy feeling it helps.
So NACK unless it fixes something.
Luis
On 11/01/2010 03:45 PM, Peter Stuge wrote:
> Björn Smedman wrote:
>> long and unwanted weekend of debugging ath9k DMA errors
>
> Yeah, I've seen those as well, especially with AR5008 but also AR9002.
>
> I have severe intermittent issues with PCI functionality of an AR9280
> card. It locks up in a state where it either returns absolute garbage
> so severe that the kernel plain rejects it, or just mildly garbage
> data so that it reports corrupt PCI ids. It can fix itself in a
> running system, so that a later lspci shows the correct device ids
> and the driver actually loads and runs.
>
> I typically get PCI fatal interrupts after resurrecting the card
> (that takes a couple of power cycles including removing the laptop
> battery). I don't suppose anyone can explain *in detail* what causee
> these issues and what could be done to fix them?
>
> Finally I've also experienced complete system lockup after a
> previously working system was left on but idle overnight.
>
> All this fun with commit 2d10d8737ccdba752d60106abbc6ed4f37404923 but
> I believe these problems are nothing new.
I think at least some of that DEADBEAF stuff was related to the
DMA bugs. Since we applied all of those patches, we haven't seen
those problems..but we also haven't been testing a whole lot lately..waiting
for wireless-testing to finish merging in those tx-dma patches before
we continue...
Thanks,
Ben
>
>
> //Peter
> _______________________________________________
> ath9k-devel mailing list
> [email protected]
> https://lists.ath9k.org/mailman/listinfo/ath9k-devel
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
2010/11/1 Bob Copeland <[email protected]>:
> 2010/11/1 Bj?rn Smedman <[email protected]>:
[snip]
>> IMHO there is a right and wrong, at the very least inside a computer.
>> If we can't read the documentation (Documentation/memory-barriers.txt)
>> and decide if the patch is correct or incorrect (I'm not saying I can
>> with 100% certainty) then perhaps we should ask somebody who can?
>
[snip]
> One thing such a patch _does_ need, though, is a comment that describes
> why there is a barrier. ?Otherwise when people reorganize the code,
> they may forget to take the barrier with it, and it also lets
> late-comers know which data is serialized by the barrier. ?In the best
> case, said late-comers are more knowledgeable than me and fix the crap
> code that I write.
First let me apologize for the tone in my previous mail. It came out
too hard this Monday (due to a long and unwanted weekend of debugging
ath9k DMA errors). I did not mean to suggest that any code is crap
regardless of origin (except maybe if I wrote it).
Thanx Bob for your input. If we agree the memory barriers are needed
for correctness I will add comments and post as PATCH.
/Bj?rn
On Fri, Oct 29, 2010 at 2:57 AM, Luis R. Rodriguez <[email protected]> wrote:
[snip]
> I'd rather not add it if its not fixing anything. I hate code that is
> there just because we have a fuzzy feeling it helps.
>
> So NACK unless it fixes something.
>
> ?Luis
IMHO there is a right and wrong, at the very least inside a computer.
If we can't read the documentation (Documentation/memory-barriers.txt)
and decide if the patch is correct or incorrect (I'm not saying I can
with 100% certainty) then perhaps we should ask somebody who can?
BTW, I also hate code that is there just because somebody has a fuzzy
feeling it helps. We definitely don't need more of that. But I do like
code that is correct in principle, code that I can change and get a
predictable result. Code that works most of the time on most platforms
(as long as I don't make any changes) is no friend of mine.
/Bj?rn
2010/11/1 Bj?rn Smedman <[email protected]>:
> 2010/11/1 Bob Copeland <[email protected]>:
>> late-comers know which data is serialized by the barrier. ?In the best
>> case, said late-comers are more knowledgeable than me and fix the crap
>> code that I write.
>
> First let me apologize for the tone in my previous mail. It came out
> too hard this Monday (due to a long and unwanted weekend of debugging
> ath9k DMA errors). I did not mean to suggest that any code is crap
> regardless of origin (except maybe if I wrote it).
Err, no worries, I don't think anyone took it that way (I don't do much
ath9k stuff anyway). I literally meant my own code is often crap so I
try to leave breadcrumbs for people, esp for nitpicky things like
barriers :)
> Thanx Bob for your input. If we agree the memory barriers are needed
> for correctness I will add comments and post as PATCH.
Sure, my opinion may not matter for ath9k, but I think it is a good
idea.
--
Bob Copeland %% http://www.bobcopeland.com
Björn Smedman wrote:
> long and unwanted weekend of debugging ath9k DMA errors
Yeah, I've seen those as well, especially with AR5008 but also AR9002.
I have severe intermittent issues with PCI functionality of an AR9280
card. It locks up in a state where it either returns absolute garbage
so severe that the kernel plain rejects it, or just mildly garbage
data so that it reports corrupt PCI ids. It can fix itself in a
running system, so that a later lspci shows the correct device ids
and the driver actually loads and runs.
I typically get PCI fatal interrupts after resurrecting the card
(that takes a couple of power cycles including removing the laptop
battery). I don't suppose anyone can explain *in detail* what causee
these issues and what could be done to fix them?
Finally I've also experienced complete system lockup after a
previously working system was left on but idle overnight.
All this fun with commit 2d10d8737ccdba752d60106abbc6ed4f37404923 but
I believe these problems are nothing new.
//Peter
2010/11/1 Bj?rn Smedman <[email protected]>:
> On Fri, Oct 29, 2010 at 2:57 AM, Luis R. Rodriguez <[email protected]> wrote:
> [snip]
>> I'd rather not add it if its not fixing anything. I hate code that is
>> there just because we have a fuzzy feeling it helps.
>>
>> So NACK unless it fixes something.
>>
>> ?Luis
>
> IMHO there is a right and wrong, at the very least inside a computer.
> If we can't read the documentation (Documentation/memory-barriers.txt)
> and decide if the patch is correct or incorrect (I'm not saying I can
> with 100% certainty) then perhaps we should ask somebody who can?
For what it's worth, I agree that there should be a memory barrier
between updating the contents of descriptors and linking them into
the linked list seen by the hardware. If I may quote dma-mapping.txt:
"Consistent DMA memory does not preclude the usage of proper memory
barriers. The CPU may reorder stores to consistent memory just as
it may normal memory." .. and then it gives an example which is
exactly this case.
I had proposed such a patch for ath5k a while ago, but I never pushed
it upstream since it didn't seem to help any of the synchronization
problems we were seeing on the platforms for which we have hardware,
while better locking did. If anyone in PPC land wants to see if
an eieio helps them let me know :)
One thing such a patch _does_ need, though, is a comment that describes
why there is a barrier. Otherwise when people reorganize the code,
they may forget to take the barrier with it, and it also lets
late-comers know which data is serialized by the barrier. In the best
case, said late-comers are more knowledgeable than me and fix the crap
code that I write.
--
Bob Copeland %% http://www.bobcopeland.com