2019-07-26 12:11:53

by Stanislaw Gruszka

[permalink] [raw]
Subject: [RFC] mt76: fix tx hung regression on MT7630E

Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
I can observe firmware hangs on MT7630E on station mode: tx stop
functioning after minor activity (rx keep working) and on module
unload device fail to stop with messages:

[ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
[ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop

Loading module again results in failure to associate with AP.
Only machine power off / power on cycle can make device work again.

I have no idea why the commit caused F/W hangs. Most likely some proper
fix is needed of changing registers programming (or assuring proper order
of actions), but so far I can not came up with any better fix than
a partial revert of 41634aa8d6db.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
drivers/net/wireless/mediatek/mt76/tx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..fefe0ee52584 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
if (!test_bit(MT76_STATE_RUNNING, &dev->state))
return;

- tasklet_schedule(&dev->tx_tasklet);
+ mt76_txq_schedule(dev, txq->ac);
}
EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);

--
1.9.3



2019-07-29 13:55:54

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> I can observe firmware hangs on MT7630E on station mode: tx stop
> functioning after minor activity (rx keep working) and on module
> unload device fail to stop with messages:
>
> [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
>
> Loading module again results in failure to associate with AP.
> Only machine power off / power on cycle can make device work again.
>
> I have no idea why the commit caused F/W hangs. Most likely some proper
> fix is needed of changing registers programming (or assuring proper order
> of actions), but so far I can not came up with any better fix than
> a partial revert of 41634aa8d6db.

The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
it does not happen.

I'm not quite sure why, but MT7630E does not like when we poll tx status
on 2 cpus at once. Change like below:

diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
index 467b28379870..622251faa415 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
+++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
@@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
mt76.tx_napi);
int i;

- mt76x02_mac_poll_tx_status(dev, false);
+ mt76x02_mac_poll_tx_status(dev, true);

for (i = MT_TXQ_MCU; i >= 0; i--)
mt76_queue_tx_cleanup(dev, i, false);

is sufficient to avoid the hangs.

> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..fefe0ee52584 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> return;
>
> - tasklet_schedule(&dev->tx_tasklet);
> + mt76_txq_schedule(dev, txq->ac);

However I'm not sure if change to tasklet_schedule() is indeed correct,
since on tasklet we schedule all queues, hence queues that could
possibly be still blocked? And on mt76_wake_tx_queue() we schedule only
one queue.

Stanislaw

2019-07-29 14:26:16

by Lorenzo Bianconi

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

> On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > I can observe firmware hangs on MT7630E on station mode: tx stop
> > functioning after minor activity (rx keep working) and on module
> > unload device fail to stop with messages:
> >
> > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> >
> > Loading module again results in failure to associate with AP.
> > Only machine power off / power on cycle can make device work again.
> >
> > I have no idea why the commit caused F/W hangs. Most likely some proper
> > fix is needed of changing registers programming (or assuring proper order
> > of actions), but so far I can not came up with any better fix than
> > a partial revert of 41634aa8d6db.
>
> The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> it does not happen.
>
> I'm not quite sure why, but MT7630E does not like when we poll tx status
> on 2 cpus at once. Change like below:

Hi Stanislaw,

have you tried to look at the commit: 6fe533378795f87
("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
Could it be a race between a registermap update and a stats load? (we are using a
different lock now)

>
> diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> index 467b28379870..622251faa415 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> mt76.tx_napi);
> int i;
>
> - mt76x02_mac_poll_tx_status(dev, false);
> + mt76x02_mac_poll_tx_status(dev, true);

I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
difference doing so is we do not run mt76x02_send_tx_status().

Regards,
Lorenzo

>
> for (i = MT_TXQ_MCU; i >= 0; i--)
> mt76_queue_tx_cleanup(dev, i, false);
>
> is sufficient to avoid the hangs.
>
> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..fefe0ee52584 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> > if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> > return;
> >
> > - tasklet_schedule(&dev->tx_tasklet);
> > + mt76_txq_schedule(dev, txq->ac);
>
> However I'm not sure if change to tasklet_schedule() is indeed correct,
> since on tasklet we schedule all queues, hence queues that could
> possibly be still blocked? And on mt76_wake_tx_queue() we schedule only
> one queue.
>
> Stanislaw


Attachments:
(No filename) (3.07 kB)
signature.asc (235.00 B)
Download all attachments

2019-07-30 16:27:59

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote:
> > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > > I can observe firmware hangs on MT7630E on station mode: tx stop
> > > functioning after minor activity (rx keep working) and on module
> > > unload device fail to stop with messages:
> > >
> > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> > >
> > > Loading module again results in failure to associate with AP.
> > > Only machine power off / power on cycle can make device work again.
> > >
> > > I have no idea why the commit caused F/W hangs. Most likely some proper
> > > fix is needed of changing registers programming (or assuring proper order
> > > of actions), but so far I can not came up with any better fix than
> > > a partial revert of 41634aa8d6db.
> >
> > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> > it does not happen.
> >
> > I'm not quite sure why, but MT7630E does not like when we poll tx status
> > on 2 cpus at once. Change like below:
>
> Hi Stanislaw,

Hi

> have you tried to look at the commit: 6fe533378795f87
> ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
> Could it be a race between a registermap update and a stats load? (we are using a
> different lock now)

This commit seems to be fine, reverting it did not solve the issue.

> > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > index 467b28379870..622251faa415 100644
> > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > mt76.tx_napi);
> > int i;
> >
> > - mt76x02_mac_poll_tx_status(dev, false);
> > + mt76x02_mac_poll_tx_status(dev, true);
>
> I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> difference doing so is we do not run mt76x02_send_tx_status().

I thought this is the problem, but it was my mistake during testing.
I tested the above change together with mt76_txq_schedule(dev, txq->ac)
change and get wrong impression it fixes the issue. But above change
alone does not help.

I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
originally posted here make the problem gone.

Stanislaw

2019-07-30 16:36:19

by Lorenzo Bianconi

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

> On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote:
> > > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > > > I can observe firmware hangs on MT7630E on station mode: tx stop
> > > > functioning after minor activity (rx keep working) and on module
> > > > unload device fail to stop with messages:
> > > >
> > > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > >
> > > > Loading module again results in failure to associate with AP.
> > > > Only machine power off / power on cycle can make device work again.
> > > >
> > > > I have no idea why the commit caused F/W hangs. Most likely some proper
> > > > fix is needed of changing registers programming (or assuring proper order
> > > > of actions), but so far I can not came up with any better fix than
> > > > a partial revert of 41634aa8d6db.
> > >
> > > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> > > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> > > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> > > it does not happen.
> > >
> > > I'm not quite sure why, but MT7630E does not like when we poll tx status
> > > on 2 cpus at once. Change like below:
> >
> > Hi Stanislaw,
>
> Hi
>
> > have you tried to look at the commit: 6fe533378795f87
> > ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
> > Could it be a race between a registermap update and a stats load? (we are using a
> > different lock now)
>
> This commit seems to be fine, reverting it did not solve the issue.

ack

>
> > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > index 467b28379870..622251faa415 100644
> > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > mt76.tx_napi);
> > > int i;
> > >
> > > - mt76x02_mac_poll_tx_status(dev, false);
> > > + mt76x02_mac_poll_tx_status(dev, true);
> >
> > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > difference doing so is we do not run mt76x02_send_tx_status().
>
> I thought this is the problem, but it was my mistake during testing.
> I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> change and get wrong impression it fixes the issue. But above change
> alone does not help.
>
> I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> originally posted here make the problem gone.

so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
mt76_txq_schedule_all() in mt76x02_poll_tx(), right?

Lorenzo

>
> Stanislaw
>


Attachments:
(No filename) (3.20 kB)
signature.asc (235.00 B)
Download all attachments

2019-07-31 08:36:55

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > index 467b28379870..622251faa415 100644
> > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > mt76.tx_napi);
> > > > int i;
> > > >
> > > > - mt76x02_mac_poll_tx_status(dev, false);
> > > > + mt76x02_mac_poll_tx_status(dev, true);
> > >
> > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > difference doing so is we do not run mt76x02_send_tx_status().
> >
> > I thought this is the problem, but it was my mistake during testing.
> > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > change and get wrong impression it fixes the issue. But above change
> > alone does not help.
> >
> > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > originally posted here make the problem gone.
>
> so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> mt76_txq_schedule_all() in mt76x02_poll_tx(), right?

Yes.

Stanislaw

2019-07-31 08:54:17

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > index 467b28379870..622251faa415 100644
> > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > mt76.tx_napi);
> > > > > int i;
> > > > >
> > > > > - mt76x02_mac_poll_tx_status(dev, false);
> > > > > + mt76x02_mac_poll_tx_status(dev, true);
> > > >
> > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > difference doing so is we do not run mt76x02_send_tx_status().
> > >
> > > I thought this is the problem, but it was my mistake during testing.
> > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > change and get wrong impression it fixes the issue. But above change
> > > alone does not help.
> > >
> > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > originally posted here make the problem gone.
> >
> > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
>
> Yes.

Err, no, I should read more cerfully. It is partiall revert of
41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") :

diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..fefe0ee52584 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
if (!test_bit(MT76_STATE_RUNNING, &dev->state))
return;

- tasklet_schedule(&dev->tx_tasklet);
+ mt76_txq_schedule(dev, txq->ac);
}
EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);

2019-07-31 09:11:55

by Lorenzo Bianconi

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

> On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > index 467b28379870..622251faa415 100644
> > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > > mt76.tx_napi);
> > > > > > int i;
> > > > > >
> > > > > > - mt76x02_mac_poll_tx_status(dev, false);
> > > > > > + mt76x02_mac_poll_tx_status(dev, true);
> > > > >
> > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > >
> > > > I thought this is the problem, but it was my mistake during testing.
> > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > change and get wrong impression it fixes the issue. But above change
> > > > alone does not help.
> > > >
> > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > originally posted here make the problem gone.
> > >
> > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> >
> > Yes.
>
> Err, no, I should read more cerfully. It is partiall revert of
> 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") :
>
> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..fefe0ee52584 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> return;
>
> - tasklet_schedule(&dev->tx_tasklet);
> + mt76_txq_schedule(dev, txq->ac);
> }
> EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);

reviewing the code I think:

- we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
not have tx_napi for usb and it will unnecessary go through tx queue checks.
We should probably do in mt76_wake_tx_queue() something like:

if (is_mmio())
tasklet_schedule(&dev->tx_tasklet);
else
mt76_txq_schedule(dev, txq->ac);

Another solution would be add a status_tasklet that just goes through the tx
queues receiving the usb tx completion and it schedules the tx_tasklet
What do you think?

- I guess it does not fix the 76x0e issue but we should just schedule tx queues in
mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
processing in mt76x02_poll_tx()

Regards,
Lorenzo


Attachments:
(No filename) (3.18 kB)
signature.asc (235.00 B)
Download all attachments

2019-08-05 10:01:44

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote:
> > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > index 467b28379870..622251faa415 100644
> > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > > > mt76.tx_napi);
> > > > > > > int i;
> > > > > > >
> > > > > > > - mt76x02_mac_poll_tx_status(dev, false);
> > > > > > > + mt76x02_mac_poll_tx_status(dev, true);
> > > > > >
> > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > > >
> > > > > I thought this is the problem, but it was my mistake during testing.
> > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > > change and get wrong impression it fixes the issue. But above change
> > > > > alone does not help.
> > > > >
> > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > > originally posted here make the problem gone.
> > > >
> > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> > >
> > > Yes.
> >
> > Err, no, I should read more cerfully. It is partiall revert of
> > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") :
> >
> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..fefe0ee52584 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> > if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> > return;
> >
> > - tasklet_schedule(&dev->tx_tasklet);
> > + mt76_txq_schedule(dev, txq->ac);
> > }
> > EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
>
> reviewing the code I think:
>
> - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
> not have tx_napi for usb and it will unnecessary go through tx queue checks.
> We should probably do in mt76_wake_tx_queue() something like:
>
> if (is_mmio())

Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ...

> tasklet_schedule(&dev->tx_tasklet);
> else
> mt76_txq_schedule(dev, txq->ac);
>
> Another solution would be add a status_tasklet that just goes through the tx
> queues receiving the usb tx completion and it schedules the tx_tasklet
> What do you think?
>
> - I guess it does not fix the 76x0e issue but we should just schedule tx queues in
> mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
> processing in mt76x02_poll_tx()

... but I think we have bug when do mt76_txq_schedule_all() in
tx_tasklet, because we can schedule on queues that are stoped.
So reverting 41634aa8d6db and then optimize by removing tx_tasklet
for mmio and remove not needed mt76_txq_schedule_all() calls looks
more reasoneble to me.

Stanislaw

2019-08-05 11:27:52

by Lorenzo Bianconi

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

> On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote:
> > > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > index 467b28379870..622251faa415 100644
> > > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > > > > mt76.tx_napi);
> > > > > > > > int i;
> > > > > > > >
> > > > > > > > - mt76x02_mac_poll_tx_status(dev, false);
> > > > > > > > + mt76x02_mac_poll_tx_status(dev, true);
> > > > > > >
> > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > > > >
> > > > > > I thought this is the problem, but it was my mistake during testing.
> > > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > > > change and get wrong impression it fixes the issue. But above change
> > > > > > alone does not help.
> > > > > >
> > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > > > originally posted here make the problem gone.
> > > > >
> > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> > > >
> > > > Yes.
> > >
> > > Err, no, I should read more cerfully. It is partiall revert of
> > > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") :
> > >
> > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > > index 5397827668b9..fefe0ee52584 100644
> > > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> > > if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> > > return;
> > >
> > > - tasklet_schedule(&dev->tx_tasklet);
> > > + mt76_txq_schedule(dev, txq->ac);
> > > }
> > > EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
> >
> > reviewing the code I think:
> >
> > - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
> > not have tx_napi for usb and it will unnecessary go through tx queue checks.
> > We should probably do in mt76_wake_tx_queue() something like:
> >
> > if (is_mmio())
>
> Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ...
>
> > tasklet_schedule(&dev->tx_tasklet);
> > else
> > mt76_txq_schedule(dev, txq->ac);
> >
> > Another solution would be add a status_tasklet that just goes through the tx
> > queues receiving the usb tx completion and it schedules the tx_tasklet
> > What do you think?
> >
> > - I guess it does not fix the 76x0e issue but we should just schedule tx queues in
> > mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
> > processing in mt76x02_poll_tx()
>
> ... but I think we have bug when do mt76_txq_schedule_all() in
> tx_tasklet, because we can schedule on queues that are stoped.
> So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> for mmio and remove not needed mt76_txq_schedule_all() calls looks
> more reasoneble to me.

schedule a stopped queue seems not harmful at a first glance since we do not
copy pending skbs if we have not enough room in the dma ring. Maybe we can be
more conservative doing something like:

diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c
index d8f61e540bfd..c6482155e5e4 100644
--- a/drivers/net/wireless/mediatek/mt76/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/dma.c
@@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid,
goto unmap;

if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) {
+ if (!q->stopped) {
+ ieee80211_stop_queue(dev->hw,
+ skb_get_queue_mapping(skb));
+ q->stopped = true;
+ }
ret = -ENOMEM;
goto unmap;
}
diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..bd2d34c4f326 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
while (1) {
bool empty = false;

+ if (hwq->stopped)
+ break;
+
if (sq->swq_queued >= 4)
break;

Does it fix the issue you are facing?

Regards,
Lorenzo

>
> Stanislaw


Attachments:
(No filename) (5.12 kB)
signature.asc (235.00 B)
Download all attachments

2019-08-05 12:41:44

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote:
> > ... but I think we have bug when do mt76_txq_schedule_all() in
> > tx_tasklet, because we can schedule on queues that are stoped.
> > So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> > for mmio and remove not needed mt76_txq_schedule_all() calls looks
> > more reasoneble to me.
>
> schedule a stopped queue seems not harmful at a first glance since we do not
> copy pending skbs if we have not enough room in the dma ring.

mac80211 stop queues for various other reasons than
IEEE80211_QUEUE_STOP_REASON_DRIVER .

> Maybe we can be
> more conservative doing something like:
>
> diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c
> index d8f61e540bfd..c6482155e5e4 100644
> --- a/drivers/net/wireless/mediatek/mt76/dma.c
> +++ b/drivers/net/wireless/mediatek/mt76/dma.c
> @@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid,
> goto unmap;
>
> if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) {
> + if (!q->stopped) {
> + ieee80211_stop_queue(dev->hw,
> + skb_get_queue_mapping(skb));
> + q->stopped = true;
> + }
> ret = -ENOMEM;
> goto unmap;
> }
> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..bd2d34c4f326 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
> while (1) {
> bool empty = false;
>
> + if (hwq->stopped)
> + break;
> +
> if (sq->swq_queued >= 4)
> break;
>
> Does it fix the issue you are facing?

I'll not be able to test this patch this week. Will have access to
the hardware next week.

I checeked before, if
'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered
when MT7630E hangs and it is not. But maybe second part of the patch
will help.

Stanislaw

2019-08-12 11:51:51

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC] mt76: fix tx hung regression on MT7630E

On Mon, Aug 05, 2019 at 02:39:16PM +0200, Stanislaw Gruszka wrote:
> On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote:
> > > ... but I think we have bug when do mt76_txq_schedule_all() in
> > > tx_tasklet, because we can schedule on queues that are stoped.
> > > So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> > > for mmio and remove not needed mt76_txq_schedule_all() calls looks
> > > more reasoneble to me.
> >
> > schedule a stopped queue seems not harmful at a first glance since we do not
> > copy pending skbs if we have not enough room in the dma ring.
>
> mac80211 stop queues for various other reasons than
> IEEE80211_QUEUE_STOP_REASON_DRIVER .

After looking in more details, we check if queue is stopped in
ieee80211_tx_dequeue(). But we do not check that for skb's queued
in mtxq->retry_q .

> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..bd2d34c4f326 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
> > while (1) {
> > bool empty = false;
> >
> > + if (hwq->stopped)
> > + break;
> > +
> > if (sq->swq_queued >= 4)
> > break;
> >
> > Does it fix the issue you are facing?
>
> I'll not be able to test this patch this week. Will have access to
> the hardware next week.
>
> I checeked before, if
> 'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered
> when MT7630E hangs and it is not. But maybe second part of the patch
> will help.

Patch did not help.

I debugged problem a bit more and issue is related with HW encryption.
Using full mac80211 SW encyption by retuning -EOPNOTSUPP
mt76x02_set_key() make the problem gone as well. Looks there is
some race between setting HW keys and transmitting.

Stanislaw