This updates stop_queue(s) kdoc as currently there's
a undocumented dependency.
Stopping the queue from anywhere else than the ops->tx()
callback will result in a hard to debug deadlock and
system freeze (on UP).
Here's an example stacktrace that has been captured only
by special debugging patches to the PPC-decrementer to
detect this machine freeze.
http://bu3sch.de/misc/freeze1.jpg
http://bu3sch.de/misc/freeze2.jpg
Signed-off-by: Michael Buesch <[email protected]>
Index: mac80211/include/net/mac80211.h
===================================================================
--- mac80211.orig/include/net/mac80211.h 2007-06-01 11:20:32.000000000 +0200
+++ mac80211/include/net/mac80211.h 2007-06-01 11:23:02.000000000 +0200
@@ -967,6 +967,10 @@ void ieee80211_wake_queue(struct ieee802
* @queue: queue number (counted from zero).
*
* Drivers should use this function instead of netif_stop_queue.
+ *
+ * It's currently only safe to call this function from the
+ * ops->tx() callback. Calls from elsewhere will result in
+ * hard to debug deadlocks (freezes the system on UP).
*/
void ieee80211_stop_queue(struct ieee80211_hw *hw, int queue);
@@ -983,6 +987,10 @@ void ieee80211_start_queues(struct ieee8
* @hw: pointer as obtained from ieee80211_alloc_hw().
*
* Drivers should use this function instead of netif_stop_queue.
+ *
+ * It's currently only safe to call this function from the
+ * ops->tx() callback. Calls from elsewhere will result in
+ * hard to debug deadlocks (freezes the system on UP).
*/
void ieee80211_stop_queues(struct ieee80211_hw *hw);
--
I'd still prefer someone fixes the issue. I don't understand
the code well enough to fix it.
--
Greetings Michael.
On Friday 01 June 2007 10:32, James Ketrenos wrote:
> We definitely need to wake the queue outside of open/stop/tx. If you stop
> the queue due to the HW ring being full, you won't be able to wake the
> queue until the HW has asynchronously freed a Tx slot.
>
Yeah, there are other safe (and correct) places like interrupt handlers. If
you're not seeing hard freezes with a lot of data transfer, you're probably
fine. I still need to investigate why bcm43xx can't do it.. I'm not sure
what's going on yet.
-Michael Wu
2007/6/1, Michael Buesch <[email protected]>:
> On Friday 01 June 2007 23:00:16 Olivier Cornu wrote:
> > The only call to ieee80211_stop_queue() grep can find is in
> > bcm43xx_dma.c:1171, i.e. bcm43xx_dma_tx(). And bcm43xx_dma_tx() is
> > only called from bcm43xx_tx().
> > What did i miss?
>
> Periodic work. I removed that in my current tree.
> And I think we also call it when resetting the device. But I'm not sure.
I guess i see why we did not understand each other: these are calls to
ieee80211_stop_queues, not directly to ieee80211_stop_queue (even
though stop_queues ultimately calls stop_queue).
The difference is relevant because, if some kind of locking (for
example) was needed in the special contexts stop_queues only might be
called from, these locks would be held inside stop_queues, around its
stop_queue calls. :)
Is there any way to tell mac80211 the device is in a [temporary]
"non-functional" state (reset, association loss...) ?
On Friday 01 June 2007 11:48:11 Johannes Berg wrote:
> On Fri, 2007-06-01 at 11:29 +0200, Michael Buesch wrote:
> > This updates stop_queue(s) kdoc as currently there's
> > a undocumented dependency.
> >
> > Stopping the queue from anywhere else than the ops->tx()
> > callback will result in a hard to debug deadlock and
> > system freeze (on UP).
>
> Ugh. Any way to actually detect a wrong spot to call it at runtime?
Probably by verifying if netif_tx_lock is locked somehow.
But I'm not sure.
--
Greetings Michael.
Michael Wu wrote:
> On Friday 01 June 2007 09:58, James Ketrenos wrote:
>> In this way we are putting in a work around that doesn't result in an API
>> change, and that can eventually (hopefully) be fixed the "right way"
>> (whatever that may end up being)
>>
> A workaround could be implemented, but I currently don't see any cases where
> stopping/waking the queue outside of open/stop/tx is necessary or correct.
Is calling the ieee80211_stop_queue[s] from open/stop OK then? The doc update said only in ops->tx. Should the driver be calling it in the stop callback, or will the stack stop the queues for us?
iwlwifi currently calls ieee80211_stop_queue from the Tx handler and calls ieee80211_stop_queues in the event the adapter is reset (due to HW error, RF kill transition, hw tear down, etc.) as it asynchronously brings re-initializes the hardware.
We definitely need to wake the queue outside of open/stop/tx. If you stop the queue due to the HW ring being full, you won't be able to wake the queue until the HW has asynchronously freed a Tx slot.
iwlwifi calls ieee80211_wake_queue during buffer reclaiming after the HW indicates the Tx has completed.
James
> (in bcm43xx, stopping the tx rings is more correct and effective for what is
> being done there - stopping TX so the radio can be recalibrated)
>
> -Michael Wu
Michael Buesch wrote:
> This updates stop_queue(s) kdoc as currently there's
> a undocumented dependency.
>
> Stopping the queue from anywhere else than the ops->tx()
> callback will result in a hard to debug deadlock and
> system freeze (on UP).
...
> --
>
> I'd still prefer someone fixes the issue. I don't understand
> the code well enough to fix it.
Perhaps having ieee80211_stop_queue keep a mac80211 internal state flag that is checked within the Tx operation before the driver's Tx callback is made. Then, if that bit is set, mac80211 can invoke the internal "deadlock if outside of Tx" queue stop action to halt Tx?
In this way we are putting in a work around that doesn't result in an API change, and that can eventually (hopefully) be fixed the "right way" (whatever that may end up being)
James
On Saturday 02 June 2007 15:17:55 Olivier Cornu wrote:
> 2007/6/1, Michael Buesch <[email protected]>:
> > On Friday 01 June 2007 23:00:16 Olivier Cornu wrote:
> > > The only call to ieee80211_stop_queue() grep can find is in
> > > bcm43xx_dma.c:1171, i.e. bcm43xx_dma_tx(). And bcm43xx_dma_tx() is
> > > only called from bcm43xx_tx().
> > > What did i miss?
> >
> > Periodic work. I removed that in my current tree.
> > And I think we also call it when resetting the device. But I'm not sure.
We also call it for a device reset. Which is required for
a band switch (in certain cases), for example. Also for other things.
> I guess i see why we did not understand each other: these are calls to
> ieee80211_stop_queues, not directly to ieee80211_stop_queue (even
> though stop_queues ultimately calls stop_queue).
> The difference is relevant because, if some kind of locking (for
> example) was needed in the special contexts stop_queues only might be
> called from, these locks would be held inside stop_queues, around its
> stop_queue calls. :)
Ok, I see. stop_queues should take the locks etc.. so that there's
no deadlock.
> Is there any way to tell mac80211 the device is in a [temporary]
> "non-functional" state (reset, association loss...) ?
As far as I can see ieee80211_stop_queues() is exactly for that, but
it's broken. In the tx handler, where it should be safe to call, it's
useless to call stop_queues().
--
Greetings Michael.
On Friday 01 June 2007 22:52:38 Olivier Cornu wrote:
> - stop_queues on the other hand means stop all queues, and this
> should only be needed on stop, reset, etc (why would you stop all
> queues on tx?). In this regard, a single ring device should probably
> still call stop_queues in these contexts.
That's _exactly_ the issue. ;)
--
Greetings Michael.
On Fri, 2007-06-01 at 11:29 +0200, Michael Buesch wrote:
> This updates stop_queue(s) kdoc as currently there's
> a undocumented dependency.
>
> Stopping the queue from anywhere else than the ops->tx()
> callback will result in a hard to debug deadlock and
> system freeze (on UP).
Ugh. Any way to actually detect a wrong spot to call it at runtime?
johannes
On Friday 01 June 2007 19:32:04 James Ketrenos wrote:
> iwlwifi currently calls ieee80211_stop_queue from the Tx handler and calls ieee80211_stop_queues in the event the adapter is reset (due to HW error, RF kill transition, hw tear down, etc.) as it asynchronously brings re-initializes the hardware.
I think we do that in bcm43xx, too, and I think it is
indeed racy, too. So it may trigger the freeze, too.
A workaround might be to aquire the netdev_tx_lock around
the stop_queues call. But I'm not sure.
--
Greetings Michael.
On Friday 01 June 2007 22:36:54 Michael Wu wrote:
> On Friday 01 June 2007 10:32, James Ketrenos wrote:
> > We definitely need to wake the queue outside of open/stop/tx. If you stop
> > the queue due to the HW ring being full, you won't be able to wake the
> > queue until the HW has asynchronously freed a Tx slot.
> >
> Yeah, there are other safe (and correct) places like interrupt handlers. If
> you're not seeing hard freezes with a lot of data transfer,
It is actually pretty hard to trigger.
You need a _lot_ of traffic and you need to trigger the queue_stop
now and then. (I think we did it about every 30 seconds in bcm43xx)
With that setup it takes several minutes to trigger. (Up to about 5)
And _if_ it triggers it completely freezes the system (UP machine)
without any message. So it's pretty nasty to debug.
--
Greetings Michael.
2007/6/2, Michael Buesch <[email protected]>:
> On Saturday 02 June 2007 15:17:55 Olivier Cornu wrote:
> > Is there any way to tell mac80211 the device is in a [temporary]
> > "non-functional" state (reset, association loss...) ?
>
> As far as I can see ieee80211_stop_queues() is exactly for that, but
> it's broken.
_If_ it is so, then stop_queues is misnamed.
In its current form, stop_queues suggests to driver writers it is just
a shortcut to save them looping through each queue to stop them all.
--
Olivier Cornu
On Friday 01 June 2007 09:58, James Ketrenos wrote:
> In this way we are putting in a work around that doesn't result in an API
> change, and that can eventually (hopefully) be fixed the "right way"
> (whatever that may end up being)
>
A workaround could be implemented, but I currently don't see any cases where
stopping/waking the queue outside of open/stop/tx is necessary or correct.
(in bcm43xx, stopping the tx rings is more correct and effective for what is
being done there - stopping TX so the radio can be recalibrated)
-Michael Wu
2007/6/1, Michael Buesch <[email protected]>:
> On Friday 01 June 2007 13:55:31 Olivier Cornu wrote:
> > Unless i missed something, stop_queue() isn't called outside of the tx
> > handler in bcm43xx code...
>
> You missed something.
The only call to ieee80211_stop_queue() grep can find is in
bcm43xx_dma.c:1171, i.e. bcm43xx_dma_tx(). And bcm43xx_dma_tx() is
only called from bcm43xx_tx().
What did i miss?
--
Olivier Cornu
2007/6/1, Michael Buesch <[email protected]>:
> On Friday 01 June 2007 23:00:16 Olivier Cornu wrote:
> > 2007/6/1, Michael Buesch <[email protected]>:
> > > On Friday 01 June 2007 13:55:31 Olivier Cornu wrote:
> > > > Unless i missed something, stop_queue() isn't called outside of the tx
> > > > handler in bcm43xx code...
> > >
> > > You missed something.
> >
> > The only call to ieee80211_stop_queue() grep can find is in
> > bcm43xx_dma.c:1171, i.e. bcm43xx_dma_tx(). And bcm43xx_dma_tx() is
> > only called from bcm43xx_tx().
> > What did i miss?
>
> Periodic work. I removed that in my current tree.
I only have the wireless-dev tree, that must be why... ;)
--
Olivier Cornu
On Friday 01 June 2007 23:00:16 Olivier Cornu wrote:
> 2007/6/1, Michael Buesch <[email protected]>:
> > On Friday 01 June 2007 13:55:31 Olivier Cornu wrote:
> > > Unless i missed something, stop_queue() isn't called outside of the tx
> > > handler in bcm43xx code...
> >
> > You missed something.
>
> The only call to ieee80211_stop_queue() grep can find is in
> bcm43xx_dma.c:1171, i.e. bcm43xx_dma_tx(). And bcm43xx_dma_tx() is
> only called from bcm43xx_tx().
> What did i miss?
Periodic work. I removed that in my current tree.
And I think we also call it when resetting the device. But I'm not sure.
--
Greetings Michael.
2007/6/1, James Ketrenos <[email protected]>:
> Michael Wu wrote:
> > A workaround could be implemented, but I currently don't see any cases where
> > stopping/waking the queue outside of open/stop/tx is necessary or correct.
>
> Is calling the ieee80211_stop_queue[s] from open/stop OK then? The doc update said only in ops->tx. Should the driver be calling it in the stop callback, or will the stack stop the queues for us?
>
> iwlwifi currently calls ieee80211_stop_queue from the Tx handler and calls ieee80211_stop_queues in the event the adapter is reset (due to HW error, RF kill transition, hw tear down, etc.) as it asynchronously brings re-initializes the hardware.
Out of pure "design intuition", there seems to be a real difference
between stop_queue and stop_queues. In other words, stop_queues should
not be seen as just a shortcut for multi-ring devices, as each should
actually be called from different contexts:
- stop_queue will indeed stop one queue only, but stopping a single
queue seems to make sense only on tx (when a tx_ring gets full).
- stop_queues on the other hand means stop all queues, and this
should only be needed on stop, reset, etc (why would you stop all
queues on tx?). In this regard, a single ring device should probably
still call stop_queues in these contexts.
(The same difference makes sense for the wake_... counterparts)
--
Olivier Cornu
On Friday 01 June 2007 13:55:31 Olivier Cornu wrote:
> 2007/6/1, Michael Buesch <[email protected]>:
> > This updates stop_queue(s) kdoc as currently there's
> > a undocumented dependency.
> >
> > Stopping the queue from anywhere else than the ops->tx()
> > callback will result in a hard to debug deadlock and
> > system freeze (on UP).
>
> Sorry but: why would one call ieee80211_stop_queue() from outside of
> the tx handler in the first place? Wouldn't that simply be a driver
> design problem?
No it wouldn't. A driver may want to stop (and probably flush) TX queues
when it resets itself due to an error. We don't want the stack to
try to call ops->tx() while we are restarting the hardware.
Other IMO valid cases might exist.
> Unless i missed something, stop_queue() isn't called outside of the tx
> handler in bcm43xx code...
You missed something.
--
Greetings Michael.
2007/6/1, Michael Buesch <[email protected]>:
> This updates stop_queue(s) kdoc as currently there's
> a undocumented dependency.
>
> Stopping the queue from anywhere else than the ops->tx()
> callback will result in a hard to debug deadlock and
> system freeze (on UP).
Sorry but: why would one call ieee80211_stop_queue() from outside of
the tx handler in the first place? Wouldn't that simply be a driver
design problem?
Unless i missed something, stop_queue() isn't called outside of the tx
handler in bcm43xx code...
--
Olivier Cornu