LinuxLists.cc - [RFC] ath9k: Detect and work-around tx-queue hang.

2013-02-22 02:06:53

Subject: [RFC] ath9k: Detect and work-around tx-queue hang.

From: Ben Greear <[email protected]>

We see TX lockups on ar9380 NICs when running 32 stations,
each with a 56kbps stream of MTU-sized UDP packets.
Lockups occur on both the AP and the stations; which one
hits first seems random.

The test case further involves a programmable attenuator,
and the attenuation is taken from -30 to -85 signal level
in steps of 10 dB. Each step runs for 1 minute before
increasing the attenuation. The problem normally
shows up around a signal level of -70 (noise is reported
as around -95).

When the lockup hits, it is typically on a single queue
(BE). The symptom is that there is no obvious transmit
activity on that queue, the axq-depth and axq-ampdu-depth
are zero, the queue is stopped, and pending-frames is
at or above the maximum allowed. The VO queue continues
to function, and RX logic functions fine.

Just resetting the chip does not fix the problem:
pending-frames usually stays at max. So, this patch also
adds hacks to force pending-frames to zero. It also
quietens some warnings about pending-frame underruns,
because the tx status sometimes does appear many seconds
later.

Finally, the reset fixup code is logged at ath_err because
I think everyone should be aware of events like this.

We see the same problem with ath9k rate control and
minstrel-ht. We have not tested other ath9k chipsets
in this manner.

Small numbers of high-speed stations do not hit this
problem, or at least not in our test cases.

Signed-off-by: Ben Greear <[email protected]>
---
drivers/net/wireless/ath/ath9k/ath9k.h | 2 ++
drivers/net/wireless/ath/ath9k/link.c | 30 ++++++++++++++++++++++++++++--
drivers/net/wireless/ath/ath9k/main.c | 5 +++--
drivers/net/wireless/ath/ath9k/xmit.c | 15 ++++++++++++++-
4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/ath9k.h b/drivers/net/wireless/ath/ath9k/ath9k.h
index d7897dcf..cc8d560 100644
--- a/drivers/net/wireless/ath/ath9k/ath9k.h
+++ b/drivers/net/wireless/ath/ath9k/ath9k.h
@@ -194,6 +194,7 @@ struct ath_txq {
 	u32 axq_ampdu_depth;
 	bool stopped;
 	bool axq_tx_inprogress;
+	bool clear_pending_frames_on_flush;
 	struct list_head axq_acq;
 	struct list_head txq_fifo[ATH_TXFIFO_DEPTH];
 	u8 txq_headidx;
@@ -684,6 +685,7 @@ struct ath_softc {
 	u16 curtxpow;
 	bool ps_enabled;
 	bool ps_idle;
+	bool reset_force_noretry;
 	short nbcnvifs;
 	short nvifs;
 	unsigned long ps_usecount;
diff --git a/drivers/net/wireless/ath/ath9k/link.c b/drivers/net/wireless/ath/ath9k/link.c
index 7b88b9c..b59565c 100644
--- a/drivers/net/wireless/ath/ath9k/link.c
+++ b/drivers/net/wireless/ath/ath9k/link.c
@@ -38,18 +38,44 @@ void ath_tx_complete_poll_work(struct work_struct *work)
 			if (txq->axq_depth) {
 				if (txq->axq_tx_inprogress) {
 					needreset = true;
+					ath_err(ath9k_hw_common(sc->sc_ah),
+						"tx hung, queue: %i axq-depth: %i, ampdu-depth: %i resetting the chip\n",
+						i, txq->axq_depth,
+						txq->axq_ampdu_depth);
 					ath_txq_unlock(sc, txq);
 					break;
 				} else {
 					txq->axq_tx_inprogress = true;
 				}
+			} else {
+				/* Check for software TX hang. It seems
+				 * sometimes pending-frames is not properly
+				 * decremented, and the tx queue hangs.
+				 * Considered hung if: axq-depth is zero,
+				 * ampdu-depth is zero, queue-is-stopped,
+				 * and we have pending frames.
+				 */
+				if (txq->stopped &&
+				    (txq->axq_ampdu_depth == 0) &&
+				    (txq->pending_frames > 0)) {
+					if (txq->axq_tx_inprogress) {
+						ath_err(ath9k_hw_common(sc->sc_ah),
+							"soft tx hang: queue: %i pending-frames: %i, resetting chip\n",
+							i, txq->pending_frames);
+						needreset = true;
+						txq->clear_pending_frames_on_flush = true;
+						sc->reset_force_noretry = true;
+						ath_txq_unlock(sc, txq);
+						break;
+					} else {
+						txq->axq_tx_inprogress = true;
+					}
+				}
 			}
 			ath_txq_unlock_complete(sc, txq);
 		}

 	if (needreset) {
-		ath_dbg(ath9k_hw_common(sc->sc_ah), RESET,
-			"tx hung, resetting the chip\n");
 		ath9k_queue_reset(sc, RESET_TYPE_TX_HANG);
 		return;
 	}
diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index 5c8758d..0de0e50 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -587,8 +587,9 @@ void ath9k_queue_reset(struct ath_softc *sc, enum ath_reset_type type)
 void ath_reset_work(struct work_struct *work)
 {
 	struct ath_softc *sc = container_of(work, struct ath_softc, hw_reset_work);
-
-	ath_reset(sc, true);
+	bool retry_tx = !sc->reset_force_noretry;
+	sc->reset_force_noretry = false;
+	ath_reset(sc, retry_tx);
 }

 /**********************/
diff --git a/drivers/net/wireless/ath/ath9k/xmit.c b/drivers/net/wireless/ath/ath9k/xmit.c
index 741918a..093c77e 100644
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -1543,6 +1543,15 @@ void ath_draintxq(struct ath_softc *sc, struct ath_txq *txq, bool retry_tx)
 	if ((sc->sc_ah->caps.hw_caps & ATH9K_HW_CAP_HT) && !retry_tx)
 		ath_txq_drain_pending_buffers(sc, txq);

+	if (txq->clear_pending_frames_on_flush && (txq->pending_frames != 0)) {
+		ath_err(ath9k_hw_common(sc->sc_ah),
+			"Pending frames still exist on txq: %i after drain: %i axq-depth: %i ampdu-depth: %i\n",
+			txq->mac80211_qnum, txq->pending_frames, txq->axq_depth,
+			txq->axq_ampdu_depth);
+		txq->pending_frames = 0;
+	}
+	txq->clear_pending_frames_on_flush = false;
+
 	ath_txq_unlock_complete(sc, txq);
 }

@@ -2066,8 +2075,12 @@ static void ath_tx_complete(struct ath_softc *sc, struct sk_buff *skb,

 	q = skb_get_queue_mapping(skb);
 	if (txq == sc->tx.txq_map[q]) {
-		if (WARN_ON(--txq->pending_frames < 0))
+		if (--txq->pending_frames < 0) {
+			if (net_ratelimit())
+				ath_err(common, "txq: %p had negative pending_frames, q: %i\n",
+					txq, q);
 			txq->pending_frames = 0;
+		}

 		if (txq->stopped &&
 		    txq->pending_frames < sc->tx.txq_max_pending[q]) {
--
1.7.3.4
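
The link.c hunk above relies on a two-pass check: a queue that looks stuck on one poll is only flagged, and a reset is requested if it still looks stuck on the next poll. The block below is a minimal user-space sketch of that scheme, not driver code. `struct txq_model`, `enum hang_kind`, `txq_poll()` and `txq_complete_one()` are hypothetical names, and clearing the flag on TX completion is an assumption about how the rest of the driver disarms the check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct ath_txq fields the poll
 * routine inspects. Not the driver's real structure. */
struct txq_model {
	unsigned int axq_depth;		/* frames handed to hardware */
	unsigned int axq_ampdu_depth;	/* aggregates handed to hardware */
	int pending_frames;		/* frames accounted to this queue */
	bool stopped;			/* queue stopped towards the stack */
	bool tx_inprogress;		/* set on the first suspicious poll */
};

enum hang_kind { HANG_NONE, HANG_HW, HANG_SOFT };

/* One poll pass over one queue. A hang is reported only if the queue
 * already looked suspicious on the previous pass. */
static enum hang_kind txq_poll(struct txq_model *q)
{
	bool hw_busy = q->axq_depth != 0;
	bool soft_stuck = !hw_busy && q->stopped &&
			  q->axq_ampdu_depth == 0 && q->pending_frames > 0;

	if (!hw_busy && !soft_stuck)
		return HANG_NONE;

	if (!q->tx_inprogress) {
		q->tx_inprogress = true;	/* arm; decide on the next pass */
		return HANG_NONE;
	}
	return hw_busy ? HANG_HW : HANG_SOFT;
}

/* Assumed completion path: any TX progress disarms the detector. */
static void txq_complete_one(struct txq_model *q)
{
	if (q->pending_frames > 0)
		q->pending_frames--;
	q->tx_inprogress = false;
}
```

For a queue that is stopped, has 123 pending frames and nothing in hardware, the first `txq_poll()` call returns `HANG_NONE` and the second returns `HANG_SOFT`, which is the case the "soft tx hang" message in the patch reports.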

2013-02-22 12:27:14

by Sujith Manoharan

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

Felix Fietkau wrote:
> Please also check if the station(s) that the frames are queued for are
> in powersave state for some reason. That would prevent the tx path from
> throwing them in the hw queue, yet they'd still take up pending-frame
> slots. I was planning on fixing this eventually by expiring frames that
> stay in the queue for too long, but haven't decided on the exact
> approach yet.

PS is disabled for multi-VIF.

Sujith

2013-02-22 12:55:16

by Ben Greear

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

On 02/22/2013 04:38 AM, Felix Fietkau wrote:
> On 2013-02-22 1:25 PM, Sujith Manoharan wrote:
>> Felix Fietkau wrote:
>>> Please also check if the station(s) that the frames are queued for are
>>> in powersave state for some reason. That would prevent the tx path from
>>> throwing them in the hw queue, yet they'd still take up pending-frame
>>> slots. I was planning on fixing this eventually by expiring frames that
>>> stay in the queue for too long, but haven't decided on the exact
>>> approach yet.
>>
>> PS is disabled for multi-VIF.
> What about off-channel PS due to scans, etc.

Scan is always locked to the same channel in this setup (once
a single station is associated).

The stations stay associated while this problem happens (the
high-priority queue seems to work just fine, which may be
the reason they stay associated).

In some cases, I see packets delivered around 30 seconds late.
Aside from PS and off-channel, any idea what could make a packet
stick around that long in the tx queues?

Thanks,
Ben

>
> - Felix
>

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2013-02-22 12:38:21

by Felix Fietkau

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

On 2013-02-22 1:25 PM, Sujith Manoharan wrote:
> Felix Fietkau wrote:
>> Please also check if the station(s) that the frames are queued for are
>> in powersave state for some reason. That would prevent the tx path from
>> throwing them in the hw queue, yet they'd still take up pending-frame
>> slots. I was planning on fixing this eventually by expiring frames that
>> stay in the queue for too long, but haven't decided on the exact
>> approach yet.
>
> PS is disabled for multi-VIF.

What about off-channel PS due to scans, etc.?

- Felix

2013-02-22 04:50:49

by Sujith Manoharan

Subject: Re: [RFC] ath9k: Detect and work-around tx-queue hang.

Ben Greear wrote:
> I'll be happy to test patches, but I'm not sure how to go about
> debugging the real problem on my own. Maybe some stats could
> be added to the xmit debugfs file to help diagnose the problem,
> or maybe some other debugfs info will help?
>
> I can't reproduce the problem with ath9k debugging set at the
> previous suggested level, so it would have to be something
> less invasive.
>
> As for just stations going out of range, it remains locked up
> even with signal level goes back to -20, so it's not just a simple
> station-out-of range issues..

Sure, but I think that filtered frames are not handled properly,
especially with aggregation, since the debugfs stats from your earlier email
showed a large counter (from a private patch?):

TXERR Filtered: 224 0 0 0

Sujith

2013-02-22 11:36:33

by Felix Fietkau

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

On 2013-02-22 6:26 AM, Ben Greear wrote:
> On 02/21/2013 08:49 PM, Sujith Manoharan wrote:
>> Ben Greear wrote:
>>> I'll be happy to test patches, but I'm not sure how to go about
>>> debugging the real problem on my own. Maybe some stats could
>>> be added to the xmit debugfs file to help diagnose the problem,
>>> or maybe some other debugfs info will help?
>>>
>>> I can't reproduce the problem with ath9k debugging set at the
>>> previous suggested level, so it would have to be something
>>> less invasive.
>>>
>>> As for just stations going out of range, it remains locked up
>>> even with signal level goes back to -20, so it's not just a simple
>>> station-out-of range issues..
>>
>> Sure, but I think that filtered frames are not handled properly,
>> especially with aggregation, since the debugfs stats from your earlier email
>> showed a large counter (from a private patch ?):
>>
>> TXERR Filtered: 224 0 0 0
>
> Yeah, guess that patch never made it upstream. The pertinent bit is:

Please also check if the station(s) that the frames are queued for are
in powersave state for some reason. That would prevent the tx path from
throwing them in the hw queue, yet they'd still take up pending-frame
slots. I was planning on fixing this eventually by expiring frames that
stay in the queue for too long, but haven't decided on the exact
approach yet.

- Felix
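
The idea of expiring frames that stay in the queue for too long could look roughly like the sketch below. No approach had been decided in this thread, so everything here is an assumption made for illustration: the names (`struct sw_queue`, `expire_old_frames()`), the fixed-size array, the millisecond timestamps and the choice to drop rather than requeue. It only shows how expiry would hand pending-frame slots back.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_QUEUED 8

/* Hypothetical queued frame: only the enqueue time matters here. */
struct queued_frame {
	uint64_t enqueue_ms;
	int id;
};

/* Hypothetical software queue, oldest frame first. */
struct sw_queue {
	struct queued_frame frames[MAX_QUEUED];
	size_t count;
	int pending_frames;	/* slots currently accounted to the queue */
};

/* Drop every frame older than max_age_ms and release its pending
 * slot. Assumes now_ms is not earlier than any enqueue time.
 * Returns the number of frames dropped. */
static size_t expire_old_frames(struct sw_queue *q, uint64_t now_ms,
				uint64_t max_age_ms)
{
	size_t kept = 0, dropped = 0;

	for (size_t i = 0; i < q->count; i++) {
		if (now_ms - q->frames[i].enqueue_ms > max_age_ms) {
			dropped++;
			continue;
		}
		q->frames[kept++] = q->frames[i];	/* compact in place */
	}
	q->count = kept;
	q->pending_frames -= (int)dropped;
	return dropped;
}
```

With frames enqueued at 0, 100 and 900 ms, a call at 1000 ms with a 500 ms limit drops the first two and leaves one pending frame, so a stopped queue would fall back below its limit.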

2013-02-22 05:26:36

by Ben Greear

Subject: Re: [RFC] ath9k: Detect and work-around tx-queue hang.

On 02/21/2013 08:49 PM, Sujith Manoharan wrote:
> Ben Greear wrote:
>> I'll be happy to test patches, but I'm not sure how to go about
>> debugging the real problem on my own. Maybe some stats could
>> be added to the xmit debugfs file to help diagnose the problem,
>> or maybe some other debugfs info will help?
>>
>> I can't reproduce the problem with ath9k debugging set at the
>> previous suggested level, so it would have to be something
>> less invasive.
>>
>> As for just stations going out of range, it remains locked up
>> even with signal level goes back to -20, so it's not just a simple
>> station-out-of range issues..
>
> Sure, but I think that filtered frames are not handled properly,
> especially with aggregation, since the debugfs stats from your earlier email
> showed a large counter (from a private patch ?):
>
> TXERR Filtered: 224 0 0 0

Yeah, guess that patch never made it upstream. The pertinent bit is:

+++ b/drivers/net/wireless/ath/ath9k/debug.c
@@ -579,6 +579,7 @@ static ssize_t read_file_xmit(struct file *file, char __user *user_buf,
 	PR("AMPDUs Completed:", a_completed);
 	PR("AMPDUs Retried: ", a_retries);
 	PR("AMPDUs XRetried: ", a_xretries);
+	PR("TXERR Filtered: ", txerr_filtered);
 	PR("FIFO Underrun: ", fifo_underrun);
 	PR("TXOP Exceeded: ", xtxop);
 	PR("TXTIMER Expiry: ", timer_exp);
@@ -867,6 +868,8 @@ void ath_debug_stat_tx(struct ath_softc *sc, struct ath_buf *bf,
 		TX_STAT_INC(qnum, completed);
 	}

+	if (ts->ts_status & ATH9K_TXERR_FILT)
+		TX_STAT_INC(qnum, txerr_filtered);
 	if (ts->ts_status & ATH9K_TXERR_FIFO)
 		TX_STAT_INC(qnum, fifo_underrun);
 	if (ts->ts_status & ATH9K_TXERR_XTXOP)

I'll post this and a few other small patches when I get a chance.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
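
The debug patch above counts TX status errors per queue and prints one row per counter. A small user-space model of that pattern follows. It is a sketch under assumptions: `struct tx_stats_model`, `stat_tx()` and `format_filtered_row()` are hypothetical names, the `*_MODEL` bit values are illustrative and not taken from the driver headers, and the row format is simplified rather than the debugfs file's real layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES_MODEL 4

/* Illustrative status bits; assumed values, not the driver's. */
#define TXERR_FILT_MODEL 0x02
#define TXERR_FIFO_MODEL 0x04

/* One counter per queue for each tracked error type. */
struct tx_stats_model {
	uint32_t txerr_filtered[NUM_QUEUES_MODEL];
	uint32_t fifo_underrun[NUM_QUEUES_MODEL];
};

/* Account one TX status word against the given queue. */
static void stat_tx(struct tx_stats_model *s, int qnum, uint8_t ts_status)
{
	if (qnum < 0 || qnum >= NUM_QUEUES_MODEL)
		return;
	if (ts_status & TXERR_FILT_MODEL)
		s->txerr_filtered[qnum]++;
	if (ts_status & TXERR_FIFO_MODEL)
		s->fifo_underrun[qnum]++;
}

/* Render the filtered-frame counters as one row, one column per queue. */
static int format_filtered_row(const struct tx_stats_model *s,
			       char *buf, size_t len)
{
	return snprintf(buf, len, "TXERR Filtered: %u %u %u %u",
			(unsigned int)s->txerr_filtered[0],
			(unsigned int)s->txerr_filtered[1],
			(unsigned int)s->txerr_filtered[2],
			(unsigned int)s->txerr_filtered[3]);
}
```

A row where only the first column is non-zero, like the "TXERR Filtered: 224 0 0 0" line quoted earlier in the thread, means all filtered frames were seen on a single queue.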

2013-02-22 04:30:22

by Sujith Manoharan

Subject: Re: [RFC] ath9k: Detect and work-around tx-queue hang.

Hi,

This is definitely a work-around. :)
I think we should debug a bit more to find out the actual bug rather than
add more hacks to the already hackish TX poll routine.

Sujith

2013-02-22 04:42:34

by Ben Greear

Subject: Re: [RFC] ath9k: Detect and work-around tx-queue hang.

On 02/21/2013 08:28 PM, Sujith Manoharan wrote:
> Hi,
>
> This is definitely a work-around. :)
> I think we should debug a bit more to find out the actual bug rather than
> add more hacks to the already hackish TX poll routine.

I'll be happy to test patches, but I'm not sure how to go about
debugging the real problem on my own. Maybe some stats could
be added to the xmit debugfs file to help diagnose the problem,
or maybe some other debugfs info will help?

I can't reproduce the problem with ath9k debugging set at the
previously suggested level, so it would have to be something
less invasive.

As for stations simply going out of range: the queue remains locked up
even when the signal level goes back to -20, so it's not just a simple
station-out-of-range issue.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2013-03-13 14:16:16

by Sujith Manoharan

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

Ben Greear wrote:
> For instance, is there any good way to know for certain
> if packets in the queue are in power-save or not? I know
> we at least attempt to disable power-save, but possibly
> it gets re-enabled somehow?

I am not sure if that could happen.

This issue got lost somehow, but reviewing xmit.c would probably
give some clues. :)

Sujith

2013-03-12 18:16:17

by Ben Greear

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

On 02/22/2013 04:55 AM, Ben Greear wrote:
> On 02/22/2013 04:38 AM, Felix Fietkau wrote:
>> On 2013-02-22 1:25 PM, Sujith Manoharan wrote:
>>> Felix Fietkau wrote:
>>>> Please also check if the station(s) that the frames are queued for are
>>>> in powersave state for some reason. That would prevent the tx path from
>>>> throwing them in the hw queue, yet they'd still take up pending-frame
>>>> slots. I was planning on fixing this eventually by expiring frames that
>>>> stay in the queue for too long, but haven't decided on the exact
>>>> approach yet.
>>>
>>> PS is disabled for multi-VIF.
>> What about off-channel PS due to scans, etc.
>
> Scan is always locked to the same channel in this setup (once
> a single station is associated).
>
> The stations stay associated while this problem happens (the
> high-priority queue seems to work just fine, which may be
> the reason they stay associated just fine.)
>
> In some cases, I see packets delivered around 30 seconds late...
> aside from PS and off-channel..any idea what could make a packet
> stick around that long in the tx queues?

One of our customers on a 3.5.7+ kernel hit the problem without
using any RF attenuator, just over-the-air communication to
their AP. It happened on both the 2.4 and 5 GHz bands. The rx
signal seems to be around 40 in their environment. It took them
around 24 hours to hit the problem on average.

Last we checked, we could fairly easily reproduce this in
our lab using an attenuator and a certain setup, so if there
is any debugging we could add to help narrow down what
might be causing this, we can give that a try.

For instance, is there any good way to know for certain
whether the stations that queued packets are destined for
are in power-save? I know we at least attempt to disable
power-save, but possibly it gets re-enabled somehow?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2013-03-13 14:18:28

by Felix Fietkau

Subject: Re: [ath9k-devel] [RFC] ath9k: Detect and work-around tx-queue hang.

On 2013-03-12 7:16 PM, Ben Greear wrote:
> On 02/22/2013 04:55 AM, Ben Greear wrote:
>> On 02/22/2013 04:38 AM, Felix Fietkau wrote:
>>> On 2013-02-22 1:25 PM, Sujith Manoharan wrote:
>>>> Felix Fietkau wrote:
>>>>> Please also check if the station(s) that the frames are queued for are
>>>>> in powersave state for some reason. That would prevent the tx path from
>>>>> throwing them in the hw queue, yet they'd still take up pending-frame
>>>>> slots. I was planning on fixing this eventually by expiring frames that
>>>>> stay in the queue for too long, but haven't decided on the exact
>>>>> approach yet.
>>>>
>>>> PS is disabled for multi-VIF.
>>> What about off-channel PS due to scans, etc.
>>
>> Scan is always locked to the same channel in this setup (once
>> a single station is associated).
>>
>> The stations stay associated while this problem happens (the
>> high-priority queue seems to work just fine, which may be
>> the reason they stay associated just fine.)
>>
>> In some cases, I see packets delivered around 30 seconds late...
>> aside from PS and off-channel..any idea what could make a packet
>> stick around that long in the tx queues?
>
> One of our customers on a 3.5.7+ kernel hit the problem without
> using any RF attenuator...just over-the-air communication to
> their AP. It happened on both 2.4 and 5Ghz bands. Seems rx
> signal is around 40 in their environment. It took them around
> 24 hours to hit the problem on average.
>
> Last we checked, we could fairly easily reproduce this in
> our lab using an attenuator and a certain setup, so if there
> is any debugging we could add to help narrow down what
> might be causing this, we can give that a try.
>
> For instance, is there any good way to know for certain
> if packets in the queue are in power-save or not? I know
> we at least attempt to disable power-save, but possibly
> it gets re-enabled somehow?

The ath_node struct tracks whether a node is sleeping. If a node is
sleeping, its tid queues can still hold frames, but they will not be
serviced by ath_txq_schedule.

- Felix
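
The interaction described here, where sleeping nodes keep frames queued and those frames keep occupying pending-frame slots, can be modeled in a few lines. This is a simplified sketch, not the driver's scheduler: `struct node_model`, `schedule_pass()` and `queue_can_wake()` are hypothetical names, and the model releases a pending slot at scheduling time, whereas the driver does its accounting on TX completion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-station state: sleeping flag plus queued frames. */
struct node_model {
	bool sleeping;
	int queued;
};

/* One scheduling pass. Sleeping nodes are skipped, so their frames
 * stay queued and keep their pending-frame slots. Returns the number
 * of frames sent in this pass. */
static int schedule_pass(struct node_model *nodes, size_t n, int budget,
			 int *pending_frames)
{
	int sent = 0;

	for (size_t i = 0; i < n && sent < budget; i++) {
		if (nodes[i].sleeping)
			continue;
		while (nodes[i].queued > 0 && sent < budget) {
			nodes[i].queued--;
			(*pending_frames)--;
			sent++;
		}
	}
	return sent;
}

/* The queue may be restarted only once pending drops below the limit. */
static bool queue_can_wake(int pending_frames, int max_pending)
{
	return pending_frames < max_pending;
}
```

If every node with queued frames is marked sleeping, `schedule_pass()` sends nothing, the pending count stays at its limit and `queue_can_wake()` stays false, which matches the symptom of a stopped queue with no hardware activity.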