2018-12-20 15:16:16

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

As reported by Randy we can overwhelm logs on some USB error conditions.
To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().

Reported-and-tested-by: Randy Oostdyk <[email protected]>
Signed-off-by: Stanislaw Gruszka <[email protected]>
---
drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
index 4b1744e9fb78..94459d5ee01b 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00.h
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
@@ -65,14 +65,19 @@
/* Utility printing macros
* rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
*/
+#define wiphy_err_ratelimited(wiphy, format, args...) \
+ dev_err_ratelimited(&(wiphy)->dev, format, ##args)
+#define wiphy_warn_ratelimited(wiphy, format, args...) \
+ dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
+
#define rt2x00_probe_err(fmt, ...) \
printk(KERN_ERR KBUILD_MODNAME ": %s: Error - " fmt, \
__func__, ##__VA_ARGS__)
#define rt2x00_err(dev, fmt, ...) \
- wiphy_err((dev)->hw->wiphy, "%s: Error - " fmt, \
+ wiphy_err_ratelimited((dev)->hw->wiphy, "%s: Error - " fmt, \
__func__, ##__VA_ARGS__)
#define rt2x00_warn(dev, fmt, ...) \
- wiphy_warn((dev)->hw->wiphy, "%s: Warning - " fmt, \
+ wiphy_warn_ratelimited((dev)->hw->wiphy, "%s: Warning - " fmt, \
__func__, ##__VA_ARGS__)
#define rt2x00_info(dev, fmt, ...) \
wiphy_info((dev)->hw->wiphy, "%s: Info - " fmt, \
--
2.7.5



2018-12-20 15:16:17

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

Some USB host devices/drivers on some conditions can always return
EPROTO error on submitted URBs. That can cause infinity loop in the
rt2x00 driver.

Since we can have single EPROTO errors we can not mark as device as
removed to avoid infinity loop. However we can count consecutive
EPROTO errors and mark device as removed if get lot of it.
I choose number 10 as threshold.

Reported-and-tested-by: Randy Oostdyk <[email protected]>
Signed-off-by: Stanislaw Gruszka <[email protected]>
---
drivers/net/wireless/ralink/rt2x00/rt2x00.h | 1 +
drivers/net/wireless/ralink/rt2x00/rt2x00usb.c | 22 +++++++++++++++++++---
2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
index 94459d5ee01b..88e0dc803227 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00.h
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
@@ -1022,6 +1022,7 @@ struct rt2x00_dev {
unsigned int extra_tx_headroom;

struct usb_anchor *anchor;
+ unsigned int num_proto_errs;

/* Clock for System On Chip devices. */
struct clk *clk;
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
index 086aad22743d..60b8bccab83d 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
@@ -31,6 +31,22 @@
#include "rt2x00.h"
#include "rt2x00usb.h"

+static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
+{
+ if (status == -ENODEV || status == -ENOENT)
+ return true;
+
+ if (status == -EPROTO)
+ rt2x00dev->num_proto_errs++;
+ else
+ rt2x00dev->num_proto_errs = 0;
+
+ if (rt2x00dev->num_proto_errs > 10)
+ return true;
+
+ return false;
+}
+
/*
* Interfacing with the HW.
*/
@@ -57,7 +73,7 @@ int rt2x00usb_vendor_request(struct rt2x00_dev *rt2x00dev,
if (status >= 0)
return 0;

- if (status == -ENODEV || status == -ENOENT) {
+ if (rt2x00usb_check_usb_error(rt2x00dev, status)) {
/* Device has disappeared. */
clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
break;
@@ -321,7 +337,7 @@ static bool rt2x00usb_kick_tx_entry(struct queue_entry *entry, void *data)

status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC);
if (status) {
- if (status == -ENODEV || status == -ENOENT)
+ if (rt2x00usb_check_usb_error(rt2x00dev, status))
clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
set_bit(ENTRY_DATA_IO_FAILED, &entry->flags);
rt2x00lib_dmadone(entry);
@@ -410,7 +426,7 @@ static bool rt2x00usb_kick_rx_entry(struct queue_entry *entry, void *data)

status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC);
if (status) {
- if (status == -ENODEV || status == -ENOENT)
+ if (rt2x00usb_check_usb_error(rt2x00dev, status))
clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
set_bit(ENTRY_DATA_IO_FAILED, &entry->flags);
rt2x00lib_dmadone(entry);
--
2.7.5


2018-12-20 15:16:18

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH v2 3/3] rt2x00: do not print error when queue is full

For unknown reasons printk() on some context can cause CPU hung on
embedded MT7620 AP/router MIPS platforms. What can result on wifi
disconnects.

This patch move queue full messages to debug level what is consistent
with other mac80211 drivers which drop packet silently if tx queue is
full. This make MT7620 OpenWRT routers more stable, what was reported
by various users.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
drivers/net/wireless/ralink/rt2x00/rt2x00queue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
index 92ddc19e7bf7..947fc8964e9a 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
@@ -671,7 +671,7 @@ int rt2x00queue_write_tx_frame(struct data_queue *queue, struct sk_buff *skb,
spin_lock(&queue->tx_lock);

if (unlikely(rt2x00queue_full(queue))) {
- rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
+ rt2x00_dbg(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
queue->qid);
ret = -ENOBUFS;
goto out;
--
2.7.5


2018-12-20 17:52:07

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

Shouldn't you now also revert the commits from
https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
since they caused throughput regression?

On 20/12/2018, Stanislaw Gruszka <[email protected]> wrote:
> For unknown reasons printk() on some context can cause CPU hung on
> embedded MT7620 AP/router MIPS platforms. What can result on wifi
> disconnects.
>
> This patch move queue full messages to debug level what is consistent
> with other mac80211 drivers which drop packet silently if tx queue is
> full. This make MT7620 OpenWRT routers more stable, what was reported
> by various users.
>
> Signed-off-by: Stanislaw Gruszka <[email protected]>
> ---
> drivers/net/wireless/ralink/rt2x00/rt2x00queue.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> index 92ddc19e7bf7..947fc8964e9a 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> @@ -671,7 +671,7 @@ int rt2x00queue_write_tx_frame(struct data_queue *queue,
> struct sk_buff *skb,
> spin_lock(&queue->tx_lock);
>
> if (unlikely(rt2x00queue_full(queue))) {
> - rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
> + rt2x00_dbg(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
> queue->qid);
> ret = -ENOBUFS;
> goto out;
> --
> 2.7.5
>
>

2018-12-21 09:59:42

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
> Shouldn't you now also revert the commits from
> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
> since they caused throughput regression?

For the record: this is not about currently posted patch and you
reference this:
https://lore.kernel.org/linux-wireless/[email protected]/

I'm not convinced for reverting. I would rater fix the problem
on top of the patches.

Looking at those images, the performance vary also on "good" case and
there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.

Could you specify after applying which one of those 5 patches
throughput regression starts ?

Thanks
Stanislaw

2018-12-21 11:19:37

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

On Thu, 2018-12-20 at 16:16 +0100, Stanislaw Gruszka wrote:
> As reported by Randy we can overwhelm logs on some USB error conditions.
> To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().
>
> Reported-and-tested-by: Randy Oostdyk <[email protected]>
> Signed-off-by: Stanislaw Gruszka <[email protected]>
> ---
> drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
[]
> @@ -65,14 +65,19 @@
> /* Utility printing macros
> * rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
> */
> +#define wiphy_err_ratelimited(wiphy, format, args...) \
> + dev_err_ratelimited(&(wiphy)->dev, format, ##args)
> +#define wiphy_warn_ratelimited(wiphy, format, args...) \
> + dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
> +

These should probably be generic in include/net/cfg80211.h



2018-12-21 11:57:46

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

On Fri, Dec 21, 2018 at 03:19:31AM -0800, Joe Perches wrote:
> On Thu, 2018-12-20 at 16:16 +0100, Stanislaw Gruszka wrote:
> > As reported by Randy we can overwhelm logs on some USB error conditions.
> > To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().
> >
> > Reported-and-tested-by: Randy Oostdyk <[email protected]>
> > Signed-off-by: Stanislaw Gruszka <[email protected]>
> > ---
> > drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
> []
> > @@ -65,14 +65,19 @@
> > /* Utility printing macros
> > * rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
> > */
> > +#define wiphy_err_ratelimited(wiphy, format, args...) \
> > + dev_err_ratelimited(&(wiphy)->dev, format, ##args)
> > +#define wiphy_warn_ratelimited(wiphy, format, args...) \
> > + dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
> > +
>
> These should probably be generic in include/net/cfg80211.h

Since there are no other users I do not see reason for that.
Those macros can be easily move, if other users would show up.

Regards
Stanislaw

2018-12-21 12:48:22

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

On Fri, 2018-12-21 at 12:57 +0100, Stanislaw Gruszka wrote:
> On Fri, Dec 21, 2018 at 03:19:31AM -0800, Joe Perches wrote:
> > On Thu, 2018-12-20 at 16:16 +0100, Stanislaw Gruszka wrote:
> > > As reported by Randy we can overwhelm logs on some USB error conditions.
> > > To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().
> > >
> > > Reported-and-tested-by: Randy Oostdyk <[email protected]>
> > > Signed-off-by: Stanislaw Gruszka <[email protected]>
> > > ---
> > > drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
> > > 1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
> > []
> > > @@ -65,14 +65,19 @@
> > > /* Utility printing macros
> > > * rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
> > > */
> > > +#define wiphy_err_ratelimited(wiphy, format, args...) \
> > > + dev_err_ratelimited(&(wiphy)->dev, format, ##args)
> > > +#define wiphy_warn_ratelimited(wiphy, format, args...) \
> > > + dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
> > > +
> >
> > These should probably be generic in include/net/cfg80211.h
>
> Since there are no other users I do not see reason for that.
> Those macros can be easily move, if other users would show up.

Other users won't show up if it's hidden away.



2018-12-21 12:51:51

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

On Fri, Dec 21, 2018 at 04:48:17AM -0800, Joe Perches wrote:
> On Fri, 2018-12-21 at 12:57 +0100, Stanislaw Gruszka wrote:
> > On Fri, Dec 21, 2018 at 03:19:31AM -0800, Joe Perches wrote:
> > > On Thu, 2018-12-20 at 16:16 +0100, Stanislaw Gruszka wrote:
> > > > As reported by Randy we can overwhelm logs on some USB error conditions.
> > > > To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().
> > > >
> > > > Reported-and-tested-by: Randy Oostdyk <[email protected]>
> > > > Signed-off-by: Stanislaw Gruszka <[email protected]>
> > > > ---
> > > > drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
> > > > 1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
> > > []
> > > > @@ -65,14 +65,19 @@
> > > > /* Utility printing macros
> > > > * rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
> > > > */
> > > > +#define wiphy_err_ratelimited(wiphy, format, args...) \
> > > > + dev_err_ratelimited(&(wiphy)->dev, format, ##args)
> > > > +#define wiphy_warn_ratelimited(wiphy, format, args...) \
> > > > + dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
> > > > +
> > >
> > > These should probably be generic in include/net/cfg80211.h
> >
> > Since there are no other users I do not see reason for that.
> > Those macros can be easily move, if other users would show up.
>
> Other users won't show up if it's hidden away.

That would mean that others do not need this.

Thanks
Stanislaw

2018-12-22 17:22:14

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On 21/12/2018, Stanislaw Gruszka <[email protected]> wrote:
> On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
>> Shouldn't you now also revert the commits from
>> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
>> since they caused throughput regression?
>
> For the record: this is not about currently posted patch and you
> reference this:
> https://lore.kernel.org/linux-wireless/[email protected]/
>
> I'm not convinced for reverting. I would rater fix the problem
> on top of the patches.
>

Let me get this straight: You made these patches to fix wlan stalls
when "arrived at non-free entry" error message appears, but in turn
they result in lower throughput. Now you are moving err printks to dbg
level since the printks themselfs are causing wlan problems. Why would
you waste life fixing patches that don't do any good? Just revert all
of them!

> Looking at those images, the performance vary also on "good" case and
> there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
> is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
> reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.
>
> Could you specify after applying which one of those 5 patches
> throughput regression starts ?
>
> Thanks
> Stanislaw
>

I cannot test that ATM but the best would be to revert all of them.
And, for the future reference, please try sending them to the
openwrt-devel list first for testing before deciding to go upstream.

Regards, Tom

2018-12-25 22:43:12

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

Even with your patches that are currently being tested i had interface
frozen yesterday. 2 android pads, 2 android phones and laptop with
intel card on win10. Wlan frozen about half an hour after bootup, had
to restart interface that revelaed more queue problems:

[ 238.715836] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[ 238.722681] br-lan: port 2(wlan1) entered blocking state
[ 238.728107] br-lan: port 2(wlan1) entered forwarding state
[ 2403.051342] device wlan1 left promiscuous mode
[ 2403.056021] br-lan: port 2(wlan1) entered disabled state
[ 2403.213039] ieee80211 phy1: rt2800_config_channel: Warning - Using
incomplete support for external PA
[ 2403.522622] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 0 failed to flush
[ 2403.817143] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 1 failed to flush
[ 2404.058047] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 2 failed to flush
[ 2404.299509] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 3 failed to flush
[ 2404.566195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 0 failed to flush
[ 2404.806880] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 1 failed to flush
[ 2405.047624] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 2 failed to flush
[ 2405.288195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 3 failed to flush

2018-12-27 10:25:41

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Sat, Dec 22, 2018 at 02:12:13PM +0100, Tom Psyborg wrote:
> On 21/12/2018, Stanislaw Gruszka <[email protected]> wrote:
> > On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
> >> Shouldn't you now also revert the commits from
> >> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
> >> since they caused throughput regression?
> >
> > For the record: this is not about currently posted patch and you
> > reference this:
> > https://lore.kernel.org/linux-wireless/[email protected]/
> >
> > I'm not convinced for reverting. I would rater fix the problem
> > on top of the patches.
> >
>
> Let me get this straight: You made these patches to fix wlan stalls
> when "arrived at non-free entry" error message appears, but in turn
> they result in lower throughput. Now you are moving err printks to dbg
> level since the printks themselfs are causing wlan problems. Why would
> you waste life fixing patches that don't do any good? Just revert all
> of them!

I move prints to debug level for for "arrived at non-free entry" from
"Dropping frame due to full tx queue" . For some users those 5 patches
improve things, they fix problem of router connection hung completely.
Printk problem is connection stall due to CPU being busy, but
is possible to reconnect to AP after that. It's different issue.

Also 'result in lower throughput' is over-generalized statement. More
precise would be 'it lower throughput on some test cases', but even
in those specific cases, throughput vary randomly with and without
patches.

> > Looking at those images, the performance vary also on "good" case and
> > there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
> > is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
> > reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.
> >
> > Could you specify after applying which one of those 5 patches
> > throughput regression starts ?
> >
> > Thanks
> > Stanislaw
> >
>
> I cannot test that ATM but the best would be to revert all of them.

Ok, could anyone else confirm this throughput regression and point
offending commit ?

> And, for the future reference, please try sending them to the
> openwrt-devel list first for testing before deciding to go upstream.

Sure, I can send patches to openwrt-devel. However, for the record,
I informed about those patches interested people including you
in August:
https://lore.kernel.org/linux-wireless/[email protected]/
month before I submitted them upstream.

Thanks
Stanislaw

2018-12-27 10:32:43

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Tue, Dec 25, 2018 at 11:43:09PM +0100, Tom Psyborg wrote:
> Even with your patches that are currently being tested i had interface
> frozen yesterday. 2 android pads, 2 android phones and laptop with
> intel card on win10. Wlan frozen about half an hour after bootup, had
> to restart interface that revelaed more queue problems:
>
> [ 238.715836] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
> [ 238.722681] br-lan: port 2(wlan1) entered blocking state
> [ 238.728107] br-lan: port 2(wlan1) entered forwarding state
> [ 2403.051342] device wlan1 left promiscuous mode
> [ 2403.056021] br-lan: port 2(wlan1) entered disabled state
> [ 2403.213039] ieee80211 phy1: rt2800_config_channel: Warning - Using
> incomplete support for external PA

Yes, patches will not help if device is not programed properly,
i.e. when external PA is not well configured. I already pointed
this here:
https://lore.kernel.org/linux-wireless/[email protected]/

Register programming should be fixed there and also whould be good
to implement watchdog to recover from wifi hung it it happens.

> [ 2403.522622] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 0 failed to flush
> [ 2403.817143] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 1 failed to flush
> [ 2404.058047] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 2 failed to flush
> [ 2404.299509] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 3 failed to flush
> [ 2404.566195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 0 failed to flush
> [ 2404.806880] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 1 failed to flush
> [ 2405.047624] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 2 failed to flush
> [ 2405.288195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 3 failed to flush

Hmm, this one should be gone. Could you check attached patch?

Thanks
Stanislaw


Attachments:
(No filename) (1.89 kB)
rt2800_flush_soc.patch (668.00 B)
Download all attachments

2018-12-28 00:45:31

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

>
> I move prints to debug level for for "arrived at non-free entry" from
> "Dropping frame due to full tx queue" . For some users those 5 patches
> improve things, they fix problem of router connection hung completely.
> Printk problem is connection stall due to CPU being busy, but
> is possible to reconnect to AP after that. It's different issue.
>
yes they fix hung. same hung that does this single patch fix by moving
to dbg so they're not printed at all. FYI "Dropping frame due to full
tx queue" causes "arrived at non-free entry" eventually that freezes
interface so only reboot helps, removing "Dropping frame due to full
tx queue" from printing to logs prevents this kind of hung and is
better solution than 5 patches. oh and don't forget to add note that
rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
acidentally enable them again in a year or two!

> Also 'result in lower throughput' is over-generalized statement. More
> precise would be 'it lower throughput on some test cases', but even
> in those specific cases, throughput vary randomly with and without
> patches.
>

100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1

2019-01-02 08:19:39

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Fri, Dec 28, 2018 at 01:45:27AM +0100, Tom Psyborg wrote:
> >
> > I move prints to debug level for for "arrived at non-free entry" from
> > "Dropping frame due to full tx queue" . For some users those 5 patches
> > improve things, they fix problem of router connection hung completely.
> > Printk problem is connection stall due to CPU being busy, but
> > is possible to reconnect to AP after that. It's different issue.
> >
> yes they fix hung. same hung that does this single patch fix by moving
> to dbg so they're not printed at all. FYI "Dropping frame due to full
> tx queue" causes "arrived at non-free entry" eventually that freezes
> interface so only reboot helps, removing "Dropping frame due to full
> tx queue" from printing to logs prevents this kind of hung and is
> better solution than 5 patches.

I wonder why are you confident about this ? And also why 5 patches
fixed the hung if there were still "Dropping frame ..." prints ?

Anyway I'll ask people on the bz for confirmation, if stop printing
is sufficient to fix the wifi hung.

> oh and don't forget to add note that
> rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
> acidentally enable them again in a year or two!

I think I will remember that, for others git changelog should
be sufficient.

> > Also 'result in lower throughput' is over-generalized statement. More
> > precise would be 'it lower throughput on some test cases', but even
> > in those specific cases, throughput vary randomly with and without
> > patches.
> >
>
> 100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
> patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1

Ok, this is different case that you presented before on USB devices.
And I'm more concerned about it, because changes with MMIO ware
much bigger (for USB only some code was moved and some flush
changed were done). Please check attached patch, if it fixes
throughput regression for you ?

Thanks
Stanislaw


Attachments:
(No filename) (1.90 kB)
rt2800_only_check_wcid_for_match.patch (1.58 kB)
Download all attachments

2019-01-07 13:20:39

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <[email protected]> wrote:
>
> Some USB host devices/drivers on some conditions can always return
> EPROTO error on submitted URBs. That can cause infinity loop in the
> rt2x00 driver.
>
> Since we can have single EPROTO errors we can not mark as device as
> removed to avoid infinity loop. However we can count consecutive
> EPROTO errors and mark device as removed if get lot of it.
> I choose number 10 as threshold.
>
> Reported-and-tested-by: Randy Oostdyk <[email protected]>
> Signed-off-by: Stanislaw Gruszka <[email protected]>

> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> index 086aad22743d..60b8bccab83d 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> @@ -31,6 +31,22 @@
> #include "rt2x00.h"
> #include "rt2x00usb.h"
>
> +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
> +{
> + if (status == -ENODEV || status == -ENOENT)

I am not sure about this, but it looks to me like you would never see
ENOENT, but ETIMEDOUT:

https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64

In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and
passed up the chain.

retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status);


Maybe I am wrong about this, but then again I have neither ever seen
the driver respond to an ENOENT like this when an RT5592
"disappeared".


Kind regards,
jer

2019-01-07 15:09:16

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
> On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <[email protected]> wrote:
> >
> > Some USB host devices/drivers on some conditions can always return
> > EPROTO error on submitted URBs. That can cause infinity loop in the
> > rt2x00 driver.
> >
> > Since we can have single EPROTO errors we can not mark as device as
> > removed to avoid infinity loop. However we can count consecutive
> > EPROTO errors and mark device as removed if get lot of it.
> > I choose number 10 as threshold.
> >
> > Reported-and-tested-by: Randy Oostdyk <[email protected]>
> > Signed-off-by: Stanislaw Gruszka <[email protected]>
>
> > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > index 086aad22743d..60b8bccab83d 100644
> > --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > @@ -31,6 +31,22 @@
> > #include "rt2x00.h"
> > #include "rt2x00usb.h"
> >
> > +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
> > +{
> > + if (status == -ENODEV || status == -ENOENT)
>
> I am not sure about this, but it looks to me like you would never see
> ENOENT, but ETIMEDOUT:
>
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64
>
> In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and
> passed up the chain.
>
> retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status);
>
>
> Maybe I am wrong about this, but then again I have neither ever seen
> the driver respond to an ENOENT like this when an RT5592
> "disappeared".

According to Documentation/driver-api/usb/error-codes.rst
ENOENT is valid error code when interface does not exist or
when URB was unlinked.

Regards
Stanislaw

2019-01-08 09:30:34

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

Hi Stanislaw,


On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <[email protected]> wrote:
> On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
> > Maybe I am wrong about this, but then again I have neither ever seen
> > the driver respond to an ENOENT like this when an RT5592
> > "disappeared".

By "neither ever seen" I mean I have never seen rt2x00usb respond
properly like that because no -ENOENT was ever seen. I have many
system logs collected over the years showing the exact error that the
code was trying to catch and still does try to catch: with USB debug
messages enabled, I tend to see

usb: kworker.* timed out on ep.*

followed by rt2x00 specific warnings about the queues filling up:
rt2800usb_txdone: Warning - Data pending for entry N in queue N
[may repeat a few times]

followed by the fatal:

rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
with error -110
[many of these, system is slowly locking up]

So the only clue that I had was that perhaps rt2x00usb_vendor_request
wasn't catching the correct return value.

> According to Documentation/driver-api/usb/error-codes.rst
> ENOENT is valid error code when interface does not exist or
> when URB was unlinked.

rt2x00usb_vendor_request calls usb_control_msg.
usb_control_msg according to that document can return -ETIMEDOUT.

1. rt2x00usb_vendor_request calls usb_control_msg.
2. usb_control_msg calls usb_internal_control_msg and passes on its
return value.
3. usb_internal_control_msg calls usb_start_wait_urb and passes on its
return value.
4. usb_start_wait_urb very specifically substitutes -ETIMEDOUT for
-ENOENT as the return value of usb_kill_urb and
5. that is passed back all the way up to rt2x00usb_vendor_request.

Kind regards,
jer

2019-01-08 10:09:23

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Tue, 8 Jan 2019 at 10:30, Jeroen Roovers <[email protected]> wrote:
> I tend to see
>
> usb: kworker.* timed out on ep.*
>
> followed by rt2x00 specific warnings about the queues filling up:

Sorry that should read "preceded by".

> rt2800usb_txdone: Warning - Data pending for entry N in queue N
> [may repeat a few times]


Kind regards,
jer

2019-01-08 11:04:42

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On 08/01/2019, Jeroen Roovers <[email protected]> wrote:
> Hi Stanislaw,
>
>
> On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <[email protected]> wrote:
>> On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
>> > Maybe I am wrong about this, but then again I have neither ever seen
>> > the driver respond to an ENOENT like this when an RT5592
>> > "disappeared".
>
> By "neither ever seen" I mean I have never seen rt2x00usb respond
> properly like that because no -ENOENT was ever seen. I have many
> system logs collected over the years showing the exact error that the
> code was trying to catch and still does try to catch: with USB debug
> messages enabled, I tend to see
>
> usb: kworker.* timed out on ep.*
>
> followed by rt2x00 specific warnings about the queues filling up:
> rt2800usb_txdone: Warning - Data pending for entry N in queue N
> [may repeat a few times]
>
> followed by the fatal:
>
> rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> with error -110
> [many of these, system is slowly locking up]
>
> So the only clue that I had was that perhaps rt2x00usb_vendor_request
> wasn't catching the correct return value.

Hi

error message vendor request failed - do you get it on a real hardware
or in virtualized environment?

noticed it only happens when running attached USB device in
virtualbox, even in such case I was able once to hit the right
combination of ubuntu, vbox and vbox extensions and could run the USB
wifi without lockups. so you might consider fixing this bug at
different level. oh, and I'd also pay attention to hardware connection
at the USB port, at least one of the devices I tested seems flaky when
plugged in (edup mini adapter/RT53xx)

regards, Tom

2019-01-09 06:18:13

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

Hi Tom,

On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <[email protected]> wrote:
> > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> > with error -110
> > [many of these, system is slowly locking up]
> >
> > So the only clue that I had was that perhaps rt2x00usb_vendor_request
> > wasn't catching the correct return value.
>
> Hi
>
> error message vendor request failed - do you get it on a real hardware
> or in virtualized environment?

I only run these on bare metal. What I assume so far is that when
rt2x00usb_vendor_request starts failing like this, the MCU has failed.
Power cycling the system helps but is undesirable, and sometimes so
does a forced removal of rt2800usb, a short recovery period (cooling
down, reloading the firmware?) and then loading the module again.

But the problem I am looking to solve is not a hardware problem, it is
recovering gracefully from a failure in the RT5592, so I have been
looking intently at rt2x00usb_vendor_request because that's the
function that complains loudly and kills the entire kernel when the
RT5592 sees this failure.


Kind regards,
jer

2019-01-09 11:33:24

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Wed, Jan 09, 2019 at 07:17:59AM +0100, Jeroen Roovers wrote:
> Hi Tom,
>
> On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <[email protected]> wrote:
> > > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> > > with error -110
> > > [many of these, system is slowly locking up]
> > >
> > > So the only clue that I had was that perhaps rt2x00usb_vendor_request
> > > wasn't catching the correct return value.
> >
> > Hi
> >
> > error message vendor request failed - do you get it on a real hardware
> > or in virtualized environment?
>
> I only run these on bare metal. What I assume so far is that when
> rt2x00usb_vendor_request starts failing like this, the MCU has failed.
> Power cycling the system helps but is undesirable, and sometimes so
> does a forced removal of rt2800usb, a short recovery period (cooling
> down, reloading the firmware?) and then loading the module again.
>
> But the problem I am looking to solve is not a hardware problem, it is
> recovering gracefully from a failure in the RT5592, so I have been
> looking intently at rt2x00usb_vendor_request because that's the
> function that complains loudly and kills the entire kernel when the
> RT5592 sees this failure.

So would be below patch (on top of this set) be a solution for at
least to not kill the kernel. Or we looking for something better
i.e. watchdog ?

Regards
Stanislaw

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
index 60b8bccab83d..ee5bd0efd2c7 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
@@ -36,7 +36,7 @@ static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
if (status == -ENODEV || status == -ENOENT)
return true;

- if (status == -EPROTO)
+ if (status == -EPROTO || status == -ETIMEDOUT)
rt2x00dev->num_proto_errs++;
else
rt2x00dev->num_proto_errs = 0;


2019-01-10 07:50:08

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

Hi Stanislaw,

On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <[email protected]> wrote:

> So would be below patch (on top of this set) be a solution for at
> least to not kill the kernel. Or we looking for something better
> i.e. watchdog ?

I'll give it a spin. Thanks!


Kind regards,
jer

2019-01-10 14:29:57

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

Aaaaaaand the results are in.

On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <[email protected]> wrote:
>
> Hi Stanislaw,
>
> On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <[email protected]> wrote:
>
> > So would be below patch (on top of this set) be a solution for at
> > least to not kill the kernel. Or we looking for something better
> > i.e. watchdog ?
>
> I'll give it a spin. Thanks!

hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x07 failed for offset 0x6888 with error -110
kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x6888 with error -110
kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x6110 with error -110
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x07 failed for offset 0x1018 with error -110
kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x1018 with error -110
kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x1910 with error -110
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)

rt2x00usb_check_usb_error in your patch is set to
clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
errors were seen. Maybe I should set it to 1 as I have never seen an
RT5592 recover from this.

The system remained relatively stable until after I tried forcibly
removing and then loading the rt2800usb module. A simple `ifconfig`
then triggered a kernel panic. Sadly I couldn't capture it in time but
I did spot that more phyN (up to phy4) devices had been added.


Kind regards,
jer

2019-01-16 11:11:36

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Thu, Jan 10, 2019 at 03:29:44PM +0100, Jeroen Roovers wrote:
> Aaaaaaand the results are in.
>
> On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <[email protected]> wrote:
> >
> > Hi Stanislaw,
> >
> > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <[email protected]> wrote:
> >
> > > So would be below patch (on top of this set) be a solution for at
> > > least to not kill the kernel. Or we looking for something better
> > > i.e. watchdog ?
> >
> > I'll give it a spin. Thanks!
>
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x6888 with error -110
> kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6888 with error -110
> kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6110 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x1018 with error -110
> kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1018 with error -110
> kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1910 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
>
> rt2x00usb_check_usb_error in your patch is set to
> clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
> errors were seen. Maybe I should set it to 1 as I have never seen an
> RT5592 recover from this.

I know cases where we have only single USB error and device work after
that. But if I remember correctly it was only for EPROTO not ETIMEDOUT.

> The system remained relatively stable until after I tried forcibly
> removing and then loading the rt2800usb module. A simple `ifconfig`
> then triggered a kernel panic. Sadly I couldn't capture it in time but
> I did spot that more phyN (up to phy4) devices had been added.

Apparently we should not crash and that should be fixed, but without
logs there is nothing we can do.

Let try to attack problem from other side. RT5592 mises TSSI
calibration, so perhaps device overheat for you because of that.
Please apply attached patch which fixes txpower setting in mac80211
then set some low txpower value i.e:

iw dev wlan0 set txpower fixed 1500

and check if device still stop working with USB errors.

Regards
Stanislaw


Attachments:
(No filename) (3.32 kB)
mac80211_fix_set_txpower_for_no_chanctx.patch (2.21 kB)
Download all attachments

2019-01-22 17:32:20

by Jeroen Roovers

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Thu, 10 Jan 2019 at 15:29, Jeroen Roovers <[email protected]> wrote:
>
> Aaaaaaand the results are in.
>
> On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <[email protected]> wrote:
> >
> > Hi Stanislaw,
> >
> > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <[email protected]> wrote:
> >
> > > So would be below patch (on top of this set) be a solution for at
> > > least to not kill the kernel. Or we looking for something better
> > > i.e. watchdog ?
> >
> > I'll give it a spin. Thanks!
>
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x6888 with error -110
> kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6888 with error -110
> kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6110 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x1018 with error -110
> kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1018 with error -110
> kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1910 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
>
> rt2x00usb_check_usb_error in your patch is set to
> clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
> errors were seen. Maybe I should set it to 1 as I have never seen an
> RT5592 recover from this.
>
> The system remained relatively stable until after I tried forcibly
> removing and then loading the rt2800usb module. A simple `ifconfig`
> then triggered a kernel panic. Sadly I couldn't capture it in time but
> I did spot that more phyN (up to phy4) devices had been added.

Yes, even when some rt2x00usb_vendor_request starts to fail but keeps
on trying, the WLAN modules remain unrecoverable except by removing
power, so the 10 retries are usually not reached and the device is
never removed. Could it be that some operations do succeed, perhaps
because the MCU was reset and is now capable of responding to some
"vendor requests" but not others?

Checking for num_proto_errs > 1 would then make as much sense as
num_proto_errs > 10 when running an AP. Maybe it's different for an
STA?


Kind regards,
jer

2019-01-29 12:30:46

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] rt2x00: use ratelimited variants dev_warn/dev_err

Stanislaw Gruszka <[email protected]> writes:

> On Fri, Dec 21, 2018 at 03:19:31AM -0800, Joe Perches wrote:
>> On Thu, 2018-12-20 at 16:16 +0100, Stanislaw Gruszka wrote:
>> > As reported by Randy we can overwhelm logs on some USB error conditions.
>> > To avoid that use dev_warn_ratelimited() and dev_err_ratelimitd().
>> >
>> > Reported-and-tested-by: Randy Oostdyk <[email protected]>
>> > Signed-off-by: Stanislaw Gruszka <[email protected]>
>> > ---
>> > drivers/net/wireless/ralink/rt2x00/rt2x00.h | 9 +++++++--
>> > 1 file changed, 7 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
>> []
>> > @@ -65,14 +65,19 @@
>> > /* Utility printing macros
>> > * rt2x00_probe_err is for messages when rt2x00_dev is uninitialized
>> > */
>> > +#define wiphy_err_ratelimited(wiphy, format, args...) \
>> > + dev_err_ratelimited(&(wiphy)->dev, format, ##args)
>> > +#define wiphy_warn_ratelimited(wiphy, format, args...) \
>> > + dev_warn_ratelimited(&(wiphy)->dev, format, ##args)
>> > +
>>
>> These should probably be generic in include/net/cfg80211.h
>
> Since there are no other users I do not see reason for that.
> Those macros can be easily move, if other users would show up.

I agree with Joe, these should be in cfg80211.h.

--
Kalle Valo

2019-02-09 11:03:05

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Wed, Jan 02, 2019 at 09:19:34AM +0100, Stanislaw Gruszka wrote:
> On Fri, Dec 28, 2018 at 01:45:27AM +0100, Tom Psyborg wrote:
> > >
> > > I move prints to debug level for for "arrived at non-free entry" from
> > > "Dropping frame due to full tx queue" . For some users those 5 patches
> > > improve things, they fix problem of router connection hung completely.
> > > Printk problem is connection stall due to CPU being busy, but
> > > is possible to reconnect to AP after that. It's different issue.
> > >
> > yes they fix hung. same hung that does this single patch fix by moving
> > to dbg so they're not printed at all. FYI "Dropping frame due to full
> > tx queue" causes "arrived at non-free entry" eventually that freezes
> > interface so only reboot helps, removing "Dropping frame due to full
> > tx queue" from printing to logs prevents this kind of hung and is
> > better solution than 5 patches.
>
> I wonder why are you confident about this ? And also why 5 patches
> fixed the hung if there were still "Dropping frame ..." prints ?
>
> Anyway I'll ask people on the bz for confirmation, if stop printing
> is sufficient to fix the wifi hung.
>
> > oh and don't forget to add note that
> > rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
> > acidentally enable them again in a year or two!
>
> I think I will remember that, for others git changelog should
> be sufficient.
>
> > > Also 'result in lower throughput' is over-generalized statement. More
> > > precise would be 'it lower throughput on some test cases', but even
> > > in those specific cases, throughput vary randomly with and without
> > > patches.
> > >
> >
> > 100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
> > patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1
>
> Ok, this is different case that you presented before on USB devices.
> And I'm more concerned about it, because changes with MMIO ware
> much bigger (for USB only some code was moved and some flush
> changed were done). Please check attached patch, if it fixes
> throughput regression for you ?

So what is the point complaining that I do not provide patches for
testing and when I do, ignore that ?

Anyway I have now MT7620A device and was capable to reproduce the
performance regression. Will provide patches to fix it and still
prevent tx queues hungs.

Stanislaw

2019-02-09 11:11:06

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

Can you reproduce interface freeze with this patch if you enable debug
options to print all dbg messages?

2019-02-09 11:56:33

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> Can you reproduce interface freeze with this patch if you enable debug
> options to print all dbg messages?

Haven't tried that, but I expect enabling debug messages will
cause troubles at hardware where printk make CPU hog.

Stanislaw

2019-02-09 12:28:34

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On 09/02/2019, Stanislaw Gruszka <[email protected]> wrote:
> On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
>> Can you reproduce interface freeze with this patch if you enable debug
>> options to print all dbg messages?
>
> Haven't tried that, but I expect enabling debug messages will
> cause troubles at hardware where printk make CPU hog.
>
> Stanislaw
>

Take a look at my comment from 23.01.2019. on this link:
https://bugs.openwrt.org/index.php?do=details&task_id=2018
There might be a chance dbg printks do not cause problems.

2019-02-09 15:55:45

by Daniel Golle

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

Hi Tom,

On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Stanislaw Gruszka <[email protected]> wrote:
> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> Can you reproduce interface freeze with this patch if you enable debug
> >> options to print all dbg messages?
> >
> > Haven't tried that, but I expect enabling debug messages will
> > cause troubles at hardware where printk make CPU hog.
> >
> > Stanislaw
> >
>
> Take a look at my comment from 23.01.2019. on this link:
> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> There might be a chance dbg printks do not cause problems.

This is very likely just overheat because of missing TSSI. It happends
on boards where MT7620A is used without heatsink (and easy to
reproduce, I've tried. We just need TSSI and for bad board design it
will mean limiting to one TX-queue soon after serious traffic starts).
We should tell those users to glue heatsink on the chip for $0.05 they
can fix this in hardware -- and implement TSSI to mitigate at least the
very worst effects (such as hardware stuck until reset).
So dkg printks can make this worse, of course, because logging overhead
also burns CPU cycles and increases system load.

I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
also had problems like that and it was possible to improve it by all
sorts of things like removing power from unused ports of the Ethernet
switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
the trick, because they also only use 3 out of 5 ports, and maybe we
need to try harder to disable the power of unused ports (and also those
without cable plugged in only periodicly power them up and check for
link)....

Cheers

Daniel

2019-02-09 16:29:36

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On 09/02/2019, Daniel Golle <[email protected]> wrote:
> Hi Tom,
>
> On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
>> On 09/02/2019, Stanislaw Gruszka <[email protected]> wrote:
>> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
>> >> Can you reproduce interface freeze with this patch if you enable debug
>> >> options to print all dbg messages?
>> >
>> > Haven't tried that, but I expect enabling debug messages will
>> > cause troubles at hardware where printk make CPU hog.
>> >
>> > Stanislaw
>> >
>>
>> Take a look at my comment from 23.01.2019. on this link:
>> https://bugs.openwrt.org/index.php?do=details&task_id=2018
>> There might be a chance dbg printks do not cause problems.
>
> This is very likely just overheat because of missing TSSI. It happends
> on boards where MT7620A is used without heatsink (and easy to
> reproduce, I've tried. We just need TSSI and for bad board design it
> will mean limiting to one TX-queue soon after serious traffic starts).
> We should tell those users to glue heatsink on the chip for $0.05 they
> can fix this in hardware -- and implement TSSI to mitigate at least the
> very worst effects (such as hardware stuck until reset).
> So dkg printks can make this worse, of course, because logging overhead
> also burns CPU cycles and increases system load.
>
> I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
> also had problems like that and it was possible to improve it by all
> sorts of things like removing power from unused ports of the Ethernet
> switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
> the trick, because they also only use 3 out of 5 ports, and maybe we
> need to try harder to disable the power of unused ports (and also those
> without cable plugged in only periodicly power them up and check for
> link)....
>
> Cheers
>
> Daniel
>

Disagree. MT7620 in Xiaomi MiWiFi mini has thermal pad sticked to
enough large metal shielding that is sticked with another thermal pad
to even bigger metal part of case. So thermal dissipation in that case
is very good.
Also with pandorabox firmware 200Mbps of stable throughput is possible
without any throttling which I believe would not be the case if the
cooling system wasn't reliable.

Back on subject, this patch is tested and proven to work, so ACK from me.

2019-02-09 17:28:42

by Daniel Golle

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

Hi Tom,

On Sat, Feb 09, 2019 at 05:29:33PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Daniel Golle <[email protected]> wrote:
> > Hi Tom,
> >
> > On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> >> On 09/02/2019, Stanislaw Gruszka <[email protected]> wrote:
> >> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> >> Can you reproduce interface freeze with this patch if you enable debug
> >> >> options to print all dbg messages?
> >> >
> >> > Haven't tried that, but I expect enabling debug messages will
> >> > cause troubles at hardware where printk make CPU hog.
> >> >
> >> > Stanislaw
> >> >
> >>
> >> Take a look at my comment from 23.01.2019. on this link:
> >> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> >> There might be a chance dbg printks do not cause problems.
> >
> > This is very likely just overheat because of missing TSSI. It happends
> > on boards where MT7620A is used without heatsink (and easy to
> > reproduce, I've tried. We just need TSSI and for bad board design it
> > will mean limiting to one TX-queue soon after serious traffic starts).
> > We should tell those users to glue heatsink on the chip for $0.05 they
> > can fix this in hardware -- and implement TSSI to mitigate at least the
> > very worst effects (such as hardware stuck until reset).
> > So dkg printks can make this worse, of course, because logging overhead
> > also burns CPU cycles and increases system load.
> >
> > I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
> > also had problems like that and it was possible to improve it by all
> > sorts of things like removing power from unused ports of the Ethernet
> > switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
> > the trick, because they also only use 3 out of 5 ports, and maybe we
> > need to try harder to disable the power of unused ports (and also those
> > without cable plugged in only periodicly power them up and check for
> > link)....
> >
> > Cheers
> >
> > Daniel
> >
>
> Disagree. MT7620 in Xiaomi MiWiFi mini has thermal pad sticked to
> enough large metal shielding that is sticked with another thermal pad
> to even bigger metal part of case. So thermal dissipation in that case
> is very good.

Good to hear that, and yes it may still be, see below.

> Also with pandorabox firmware 200Mbps of stable throughput is possible

>20MiB/s download rate is very impressive for 2.4GHz WiFi, I never saw
anything even near that (using BitTorrent I get up to 8MiB/s when next
to the router in low-noise environment with proprietry ubiquiti gear).

> without any throttling which I believe would not be the case if the
> cooling system wasn't reliable.

pandorabox uses MediaTek proprietary driver and that limits to single
TX chain operation once temperature limit is hit (read the code, you
know where), same as in Xiaomi proprietary firmware.

>
> Back on subject, this patch is tested and proven to work, so ACK from me.

Thanks for testing. I will push Stanislavs complete series into OpenWrt
tomorrow.


Cheers


Daniel

2019-02-10 09:59:28

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Stanislaw Gruszka <[email protected]> wrote:
> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> Can you reproduce interface freeze with this patch if you enable debug
> >> options to print all dbg messages?
> >
> > Haven't tried that, but I expect enabling debug messages will
> > cause troubles at hardware where printk make CPU hog.
> >
> > Stanislaw
> >
>
> Take a look at my comment from 23.01.2019. on this link:
> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> There might be a chance dbg printks do not cause problems.

I'm sure printk cause problems on some hardware i.e. Wt3020 .
There is hypothesis that is because kernel try to emit
messages by misconfigured 56k serial line, but who knows.
This should be debugged and fixed but I have not idea how to do this.

Another issue not related with printk is tx queue hung, when
queue is full but we do not get any TX status interrupt. This was
mitigated in my previous set that caused throughput regression and
should be now fixed by latest RFC/RTF set.

But there are other issues, that could cause troubles.

Stanislaw


2019-02-12 15:02:36

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] rt2x00: check number of EPROTO errors

On Tue, Jan 22, 2019 at 06:32:05PM +0100, Jeroen Roovers wrote:
> > then triggered a kernel panic. Sadly I couldn't capture it in time but
> > I did spot that more phyN (up to phy4) devices had been added.
>
> Yes, even when some rt2x00usb_vendor_request starts to fail but keeps
> on trying, the WLAN modules remain unrecoverable except by removing
> power, so the 10 retries are usually not reached and the device is
> never removed. Could it be that some operations do succeed, perhaps
> because the MCU was reset and is now capable of responding to some
> "vendor requests" but not others?
>
> Checking for num_proto_errs > 1 would then make as much sense as
> num_proto_errs > 10 when running an AP. Maybe it's different for an
> STA?

Does it make sense to do num_proto_err > 3 check. Would that
be helpful on your problem with rt2800usb device? I don't want
make num_proto_err > 1 since this will remove device just after
2 errors.

Also have you tested tx_power patch with lowering txpower ?
Did it improve things ?

Stanislaw

2019-03-19 02:37:29

by Tom Psyborg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] rt2x00: do not print error when queue is full

On 09/02/2019, Daniel Golle <[email protected]> wrote:

>>20MiB/s download rate is very impressive for 2.4GHz WiFi, I never saw
> anything even near that (using BitTorrent I get up to 8MiB/s when next
> to the router in low-noise environment with proprietry ubiquiti gear).

various apps and data transfer stacks may add significant overhead.
for real throughput testing i rely on iperf2 (tried iperf3 it was
mess). if interested take a look at bandwidth achieved on OCed QCA9531
device: https://forum.openwrt.org/t/different-wifi-speed-tx-rx-archer-c7-v2/18860/10?u=psyborg