2017-09-12 22:19:57

by James Cameron

[permalink] [raw]
Subject: rtl8821ae keep alive not set, connection lost

Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
why?

Wireless connection is lost after a few seconds or minutes, on every
OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
and any kernel with 40b368af4b75.

dmesg contains

wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost

iw event shows

wlp2s0: del station 2c:b0:5d:a6:86:eb
wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to inactivity
wlp2s0 (phy #0): disconnected (local request)

Workaround is to bounce the link, then reconnect;

ip link set wlp2s0 down
ip link set wlp2s0 up
iw dev wlp2s0 connect qz

A nearby monitor host captures a deauthentication packet sent by the
device.

Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
issues") which changes the width of DBI register read.

On the face of it, 40b368af4b75 looks correct, especially compared
against same function in rtl8723be.

I've no idea why reverting fixes the problem. I'm hoping someone here
might speculate and suggest ways to test.

As keep alive is set through this path, my guess is that keep alive is
not being set in the device. Or perhaps reading 16-bits perturbs
another register. Is there a way to test?

http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13

http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
of 40b368af4b75

--
James Cameron
http://quozl.netrek.org/


2017-09-20 21:48:25

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 09/20/2017 04:36 AM, James Cameron wrote:
> When the problem occurs, register 0x350 bit 25 is set, for which a
> comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> hang.
>
> So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> and _rtl8821ae_reset_pcie_interface_dma.
>
> Any ideas where to do this?

Thanks for the extended debugging.

I was able to repeat your findings. With the 8-bit read of REG_DBI_RDATA, I got
poor connection stability. Reverting that part made it stable again. For that
reason, I pushed the partial reversion of commit 40b368af4b75 ("rtlwifi: Fix
alignment issues").

Where did you detect that bit 25 of register 0x350 was set?

Larry

2017-09-21 14:40:17

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 09/21/2017 03:07 AM, James Cameron wrote:
> My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
> read of REG_DBI_RDATA, and has been stable for four hours. I'll focus
> on some more testing of this one. It is a surprise.
>
> http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg)
>
> Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
> with write_readback = true.

Again, thanks for your efforts.

At this point, my system has been up over 17 hours without a single drop. As a
result, I will leave the reversion of commit 40b368af4b75 in place. It seems
safer than turning off write_readback. After we get more testing, that could
still be an option.

Larry

2017-09-13 15:01:39

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 09/12/2017 05:09 PM, James Cameron wrote:
> Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
> why?
>
> Wireless connection is lost after a few seconds or minutes, on every
> OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
> and any kernel with 40b368af4b75.
>
> dmesg contains
>
> wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
>
> iw event shows
>
> wlp2s0: del station 2c:b0:5d:a6:86:eb
> wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to inactivity
> wlp2s0 (phy #0): disconnected (local request)
>
> Workaround is to bounce the link, then reconnect;
>
> ip link set wlp2s0 down
> ip link set wlp2s0 up
> iw dev wlp2s0 connect qz
>
> A nearby monitor host captures a deauthentication packet sent by the
> device.
>
> Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> issues") which changes the width of DBI register read.
>
> On the face of it, 40b368af4b75 looks correct, especially compared
> against same function in rtl8723be.
>
> I've no idea why reverting fixes the problem. I'm hoping someone here
> might speculate and suggest ways to test.
>
> As keep alive is set through this path, my guess is that keep alive is
> not being set in the device. Or perhaps reading 16-bits perturbs
> another register. Is there a way to test?
>
> http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
>
> http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
> of 40b368af4b75

James,

Thank you very much for making the effort to bisect this problem. I know that
several people have reported the problem, which we cannot duplicate; however,
most of them just say it drops the connection and do nothing more. In fact, we
are lucky to have them even report which kernel version they are running!

As we do not see the problem, we will be relying on you to help diagnose the
issue. Merely changing the read from 8 to 16 bits should not cause any change.

As _rtl8821ae_dbi_read() is only called from _rtl8821ae_enable_aspm_back_door(),
we want to test turning off ASPM. The following patch will accomplish this.
Unfortunately, the patch is white-space damaged, thus you will need to apply it
manually. Please try it to see if it helps your connection loss. Note that ASPM
settings are preserved through a module unload/reload sequence. Thus you will
need to reboot after rebuilding the driver.

diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
index 305b3abbf..755d3704b 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1982,8 +1982,8 @@ int rtl8821ae_hw_init(struct ieee80211_hw *hw)
ppsc->rfpwr_state = ERFON;

rtlpriv->cfg->ops->set_hw_reg(hw, HW_VAR_ETHER_ADDR, mac->mac_addr);
- _rtl8821ae_enable_aspm_back_door(hw);
- rtlpriv->intf_ops->enable_aspm(hw);
+ //_rtl8821ae_enable_aspm_back_door(hw);
+ //rtlpriv->intf_ops->enable_aspm(hw);

if (rtlhal->hw_type == HARDWARE_TYPE_RTL8812AE &&
(rtlhal->rfe_type == 1 || rtlhal->rfe_type == 5))


Thanks,

Larry

2017-09-21 08:07:54

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Thu, Sep 21, 2017 at 09:22:28AM +1000, James Cameron wrote:
> On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> > On 09/20/2017 04:36 AM, James Cameron wrote:
> > >When the problem occurs, register 0x350 bit 25 is set, for which a
> > >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> > >hang.
> > >
> > >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> > >and _rtl8821ae_reset_pcie_interface_dma.
> > >
> > >Any ideas where to do this?
> >
> > Thanks for the extended debugging.
> >
> > I was able to repeat your findings. With the 8-bit read of
> > REG_DBI_RDATA, I got poor connection stability. Reverting that part
> > made it stable again. For that reason, I pushed the partial
> > reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").
>
> That's great you were able to reproduce, thanks!
> [...]
> I'm still pondering a few more theories;
>
> - change write_readback, it is true now, and the while()/udelay in
> _rtl8821ae_dbi_read seems a waste, it never executes,

My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
read of REG_DBI_RDATA, and has been stable for four hours. I'll focus
on some more testing of this one. It is a surprise.

http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg)

Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
with write_readback = true.

> - clearing REG_DBI_CTRL write enable bits at the end of
> _rtl8821ae_dbi_write,

My test kernel "-qc" had reset of REG_DBI_ADDR as last step in both
_rtl8821ae_dbi_read and _rtl8821ae_dbi_write, and was very unstable,
not able to connect.

http://dev.laptop.org/~quozl/y/1dutbX.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dutuM.txt (dmesg)

My test kernel "-qd" had reset of REG_DBI_ADDR as last step in only
_rtl8821ae_dbi_write, and had poor connection stability.

http://dev.laptop.org/~quozl/y/1dutr3.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duuDc.txt (dmesg connection lost)

Based on the above two kernels, clearing REG_DBI_ADDR after a read is
a bad idea, and suggests there is some underlying asynchronicity about
the DBI access. Almost as if some other condition should signal
completion rather than zero in REG_DBI_FLAG+0.

> - switching to 32-bit access as used by rtl8192de.

My test kernel "-qe" changed RED_DBI_RDATA read to 32-bit, then used a
union hack to pull out the desired byte, and had poor connection
stability.

http://dev.laptop.org/~quozl/y/1duvIC.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duwI1.txt (dmesg connection lost)

--
James Cameron
http://quozl.netrek.org/

2017-09-14 00:39:37

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 09/13/2017 04:46 PM, James Cameron wrote:
>
> I'll give it some more testing and let you know, but it seems as
> capable of keeping a connection as 4.13 plus my earlier revert.
>

The change I sent earlier should be as good as reverting the change to
write_byte in your reversion.

There has been a report (in Russian unfortunately) at
https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP handling.
According to Google translate is as follows:

============================================================
Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP, which causes
the Internet to end. Wireshark looks quite interesting: ARP replays can be sent
by one large packet a few seconds after receiving the requests, ie. they seem to
be buffered somewhere.

arping, launched under strace, also hints at certain problems:
sendto(3,
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1",
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1,
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
alarm(1) = 0
rt_sigreturn() = 45
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ?
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
sendto(3,
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1",
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1,
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
alarm(1) = 0
rt_sigreturn() = 45
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ?
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ?
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
sendto(3,
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1",
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1,
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
============================================================

I need to explore that ENOBUFS return code.

Your case where the device is unresponsive to pings from another NIC until the
device transmits may also be an ARP problem.

For completeness, are you using the 2.4 of 5 GHz band? What is the make/model
your AP? If possible for you to determine, what firmware is it running?

Thanks,

Larry

2017-09-22 05:36:05

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Thu, Sep 21, 2017 at 09:40:14AM -0500, Larry Finger wrote:
> On 09/21/2017 03:07 AM, James Cameron wrote:
> >My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
> >read of REG_DBI_RDATA, and has been stable for four hours. I'll
> >focus on some more testing of this one. It is a surprise.
> >
> >http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg)
> >
> >Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't
> >happen with write_readback = true.
>
> Again, thanks for your efforts.
>
> At this point, my system has been up over 17 hours without a single
> drop. As a result, I will leave the reversion of commit 40b368af4b75
> in place. It seems safer than turning off write_readback. After we
> get more testing, that could still be an option.

Thanks for the reversion commit, I'll point others to it.

My apologies for sloppy work, the test kernel features got swapped!
"-qb" above was with write_readback off, and 16-bit read of
REG_DBI_RDATA, not 8-bit. Verified with objdump. It has run for 24
hours without a drop.

So at conclusion;

- the 16-bit read is good with or without write_readback.

- the 8-bit read is bad with or without write_readback, and tends to
lose connection much quicker without write_readback.

Been a pleasure working with you. Back to lurk mode.

--
James Cameron
http://quozl.netrek.org/

2017-09-20 09:36:42

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Tue, Sep 19, 2017 at 07:42:04PM +1000, James Cameron wrote:
> On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> > On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > > On 09/13/2017 04:46 PM, James Cameron wrote:
> > > >
> > > >I'll give it some more testing and let you know, but it seems as
> > > >capable of keeping a connection as 4.13 plus my earlier revert.
> > > >
> >
> > Testing went well; removing the call to enable ASPM was as good as
> > changing the DBI read back to 16-bit width.
> >
> > > The change I sent earlier should be as good as reverting the change
> > > to write_byte in your reversion.
> >
> > Yes, that would be the hope.
> >
> > But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> > read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> > so perhaps reading that register has an unexpected side-effect.
> >
>
> I've ruled that out after testing for several days different kernels
> based on v4.13;
>
> - add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
> after the call to enable_aspm; does not solve problem,
>
> - add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
> _rtl8821ae_check_pcie_dma_hang; does not solve problem,

When the problem occurs, register 0x350 bit 25 is set, for which a
comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
hang.

So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
and _rtl8821ae_reset_pcie_interface_dma.

Any ideas where to do this?

> [...]

--
James Cameron
http://quozl.netrek.org/

2017-09-13 21:46:56

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Wed, Sep 13, 2017 at 10:01:37AM -0500, Larry Finger wrote:
> Thank you very much for making the effort to bisect this problem. I
> know that several people have reported the problem, which we cannot
> duplicate; however, most of them just say it drops the connection
> and do nothing more. In fact, we are lucky to have them even report
> which kernel version they are running!

Yes, in the reported bugs that style is common; almost animistic, very
mystical, and based on heuristics rather than analysis. ;-)

> As we do not see the problem, we will be relying on you to help
> diagnose the issue. Merely changing the read from 8 to 16 bits
> should not cause any change.

Agreed.

> As _rtl8821ae_dbi_read() is only called from
> _rtl8821ae_enable_aspm_back_door(), we want to test turning off
> ASPM. The following patch will accomplish this. Unfortunately, the
> patch is white-space damaged, thus you will need to apply it
> manually. Please try it to see if it helps your connection
> loss. Note that ASPM settings are preserved through a module
> unload/reload sequence. Thus you will need to reboot after
> rebuilding the driver.

Went back to 4.13, added your test patch, and built kernel.

http://dev.laptop.org/~quozl/z/1dsFOW.txt is dmesg.

New symptom occurs; after 23 seconds since last transmission, the
device becomes unresponsive to ping from another host, but begins to
respond if the device transmits. Flurry of responses then it settles
down to regular ping.

64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=39 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=40 ttl=64 time=1.93 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=41 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=42 ttl=64 time=1.66 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=43 ttl=64 time=1.70 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=44 ttl=64 time=1.69 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=45 ttl=64 time=37.7 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=46 ttl=64 time=383 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=47 ttl=64 time=11464 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=48 ttl=64 time=10465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=49 ttl=64 time=9465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=50 ttl=64 time=8466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=51 ttl=64 time=7466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=52 ttl=64 time=6466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=53 ttl=64 time=5466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=54 ttl=64 time=4467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=55 ttl=64 time=3467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=56 ttl=64 time=2468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=57 ttl=64 time=1468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=58 ttl=64 time=469 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=59 ttl=64 time=1.79 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=60 ttl=64 time=1.75 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=61 ttl=64 time=1.72 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=62 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=63 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=64 ttl=64 time=1.95 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=65 ttl=64 time=1.68 ms

I'll give it some more testing and let you know, but it seems as
capable of keeping a connection as 4.13 plus my earlier revert.

--
James Cameron
http://quozl.netrek.org/

2017-09-14 09:27:45

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> On 09/13/2017 04:46 PM, James Cameron wrote:
> >
> >I'll give it some more testing and let you know, but it seems as
> >capable of keeping a connection as 4.13 plus my earlier revert.
> >

Testing went well; removing the call to enable ASPM was as good as
changing the DBI read back to 16-bit width.

> The change I sent earlier should be as good as reverting the change
> to write_byte in your reversion.

Yes, that would be the hope.

But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
so perhaps reading that register has an unexpected side-effect.

Is there any documentation for that register? I see other code writes
to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

Evidence of read from REG_DBI_CTRL was captured with an instrumented
kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
these dmesg lines;

[ 6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
[ 6.010338] rtl_pci: rtl_pci_enable_aspm
[ 6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
[ 6.034806] rtlwifi: rtlwifi: wireless switch is on
[ 6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0
[ 7.979186] rtl_pci: rtl_pci_disable_aspm
[ 7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[ 8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[ 8.295437] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
[ 8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[ 8.295462] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0200 (@034d)
[ 8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[ 8.295477] rtl_pci: rtl_pci_enable_aspm
[ 8.469734] rtl_pci: rtl_pci_disable_aspm
[ 8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[ 8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[ 8.687013] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
[ 8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[ 8.687038] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0218 (@034d)
[ 8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[ 8.687053] rtl_pci: rtl_pci_enable_aspm

Observe how the windowed read of DBI register 0x70f causes a read of
16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.

By the way, the cold boot value of DBI register 0x719 is 0x00, and
the warm boot value is 0x18, so I'm confident there isn't a
comprehensive register reset. It means that BIOS has relevance; and
this BIOS is outside my control. BIOS variation may explain
difficulty reproducing.

> There has been a report (in Russian unfortunately) at
> https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> handling.

Thanks. I've considered and excluded ARP handling delay. Though ARP
renewal is typical reason for device sleep to end.

With the call to enable ASPM disabled, instead of changing the DBI
read to 16-bit width, what happens is that the device stops accepting
data from the access point, packets are buffered there, and are
transmitted as soon as the device makes the next transmission.

http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
to confirm this.

I've a monitor mode tcpdump I can send by private mail if required.
In that the burst of packets shows ICMP echo requests were buffered by
the access point.

> According to Google translate is as follows:
>
> ============================================================
> Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> which causes the Internet to end. Wireshark looks quite interesting:
> ARP replays can be sent by one large packet a few seconds after
> receiving the requests, ie. they seem to be buffered somewhere.

Yes, buffering at access point.

> I need to explore that ENOBUFS return code.

I've seen ENOBUFS up at the application level with ping too, when the
original problem happens with v4.10 plus stable.

> Your case where the device is unresponsive to pings from another NIC
> until the device transmits may also be an ARP problem.
>
> For completeness, are you using the 2.4 of 5 GHz band? What is the
> make/model your AP? If possible for you to determine, what firmware
> is it running?

2.4 GHz and 5 GHz reproduces the problem.

Open or WPA reproduces the problem.

Netgear WNDR3800 OpenWrt 12.09-beta, r33312.

Several other access points reproduce the problem, including a
customer's TP-Link TL-WR1042ND with unknown firmware version.

No access point as yet does not reproduce the problem.

Hope that helps, thanks for your ideas.

--
James Cameron
http://quozl.netrek.org/

2017-09-19 09:42:15

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > On 09/13/2017 04:46 PM, James Cameron wrote:
> > >
> > >I'll give it some more testing and let you know, but it seems as
> > >capable of keeping a connection as 4.13 plus my earlier revert.
> > >
>
> Testing went well; removing the call to enable ASPM was as good as
> changing the DBI read back to 16-bit width.
>
> > The change I sent earlier should be as good as reverting the change
> > to write_byte in your reversion.
>
> Yes, that would be the hope.
>
> But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> so perhaps reading that register has an unexpected side-effect.
>

I've ruled that out after testing for several days different kernels
based on v4.13;

- add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
after the call to enable_aspm; does not solve problem,

- add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
_rtl8821ae_check_pcie_dma_hang; does not solve problem,

Only way to solve the problem at the moment is either;

- reverting 40b368af4b75 ("rtlwifi: Fix alignment issues"), which
means using rtl_read_word in _rtl8821ae_dbi_read,

or

- removing the two lines that enable ASPM, as you asked me to try.

> Is there any documentation for that register? I see other code writes
> to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

I'll repeat and expand on this. Is there any documentation for this
register, or the other REG_DBI_* registers?

I see that DBI windowed access in rtl8192de is different and yet very
similar.

In rtl8821ae, rtl8723be, and rtl8192de the method seems straightforward;
there are bits for address, bits for write enable by byte, and flag
bits for starting the transfer and completing.

> Evidence of read from REG_DBI_CTRL was captured with an instrumented
> kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
> these dmesg lines;
>
> [ 6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
> [ 6.010338] rtl_pci: rtl_pci_enable_aspm
> [ 6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
> [ 6.034806] rtlwifi: rtlwifi: wireless switch is on
> [ 6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0
> [ 7.979186] rtl_pci: rtl_pci_disable_aspm
> [ 7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [ 8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [ 8.295437] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
> [ 8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [ 8.295462] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0200 (@034d)
> [ 8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [ 8.295477] rtl_pci: rtl_pci_enable_aspm
> [ 8.469734] rtl_pci: rtl_pci_disable_aspm
> [ 8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [ 8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [ 8.687013] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
> [ 8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [ 8.687038] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0218 (@034d)
> [ 8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [ 8.687053] rtl_pci: rtl_pci_enable_aspm
>
> Observe how the windowed read of DBI register 0x70f causes a read of
> 16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.
>
> By the way, the cold boot value of DBI register 0x719 is 0x00, and
> the warm boot value is 0x18, so I'm confident there isn't a
> comprehensive register reset. It means that BIOS has relevance; and
> this BIOS is outside my control. BIOS variation may explain
> difficulty reproducing.

Is there a register for device reset that I can try? It would help
to exclude BIOS.

>
> > There has been a report (in Russian unfortunately) at
> > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> > handling.
>
> Thanks. I've considered and excluded ARP handling delay. Though ARP
> renewal is typical reason for device sleep to end.
>
> With the call to enable ASPM disabled, instead of changing the DBI
> read to 16-bit width, what happens is that the device stops accepting
> data from the access point, packets are buffered there, and are
> transmitted as soon as the device makes the next transmission.
>
> http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
> to confirm this.
>
> I've a monitor mode tcpdump I can send by private mail if required.
> In that the burst of packets shows ICMP echo requests were buffered by
> the access point.
>
> > According to Google translate is as follows:
> >
> > ============================================================
> > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> > which causes the Internet to end. Wireshark looks quite interesting:
> > ARP replays can be sent by one large packet a few seconds after
> > receiving the requests, ie. they seem to be buffered somewhere.
>
> Yes, buffering at access point.
>
> > I need to explore that ENOBUFS return code.
>
> I've seen ENOBUFS up at the application level with ping too, when the
> original problem happens with v4.10 plus stable.
>
> > Your case where the device is unresponsive to pings from another NIC
> > until the device transmits may also be an ARP problem.
> >
> > For completeness, are you using the 2.4 of 5 GHz band? What is the
> > make/model your AP? If possible for you to determine, what firmware
> > is it running?
>
> 2.4 GHz and 5 GHz reproduces the problem.
>
> Open or WPA reproduces the problem.
>
> Netgear WNDR3800 OpenWrt 12.09-beta, r33312.
>
> Several other access points reproduce the problem, including a
> customer's TP-Link TL-WR1042ND with unknown firmware version.
>
> No access point as yet does not reproduce the problem.
>
> Hope that helps, thanks for your ideas.
>
> --
> James Cameron
> http://quozl.netrek.org/

--
James Cameron
http://quozl.netrek.org/

2017-09-20 23:22:35

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> On 09/20/2017 04:36 AM, James Cameron wrote:
> >When the problem occurs, register 0x350 bit 25 is set, for which a
> >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> >hang.
> >
> >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> >and _rtl8821ae_reset_pcie_interface_dma.
> >
> >Any ideas where to do this?
>
> Thanks for the extended debugging.
>
> I was able to repeat your findings. With the 8-bit read of
> REG_DBI_RDATA, I got poor connection stability. Reverting that part
> made it stable again. For that reason, I pushed the partial
> reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").

That's great you were able to reproduce, thanks!

> Where did you detect that bit 25 of register 0x350 was set?

In _rtl8821ae_check_pcie_dma_hang on link up.

REG_DBI_FLAG (0x350 bits 16-31) is observed as;

- 0x0000 on entry to function after warm boot,

- 0x0400 on exit from function; debug bit 23 is set by the function,

- 0x0400 on entry to function on link up when the problem has not
happened,

- 0x0600 on entry to function on link up when the problem has
happened.

But I don't know if 0x0600 is useful to detect earlier, or if it is
only a symptom of link down while device active. Either way, if it
truly does signal an RX hang or firmware RX queue full, it's useful.

My "-q9" and "-qa" test kernels dump REG_DBI_CTRL and REG_DBI_FLAG.

"-q9" is with 8-bit read of REG_DBI_RDATA.

"-qa" is with 16-bit read of REG_DBI_DATA.

My "-qa" test kernel;
http://dev.laptop.org/~quozl/y/1dunwN.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dubX7.txt (dmesg)

REG_DBI_CTRL+3 used by _rtl8821ae_check_pcie_dma_hang is effectively
REG_DBI_FLAG+1 (0x353).

REG_DBI_CTRL is REG_DBI_ADDR; a duplicate register definition.

I'm still pondering a few more theories;

- change write_readback, it is true now, and the while()/udelay in
_rtl8821ae_dbi_read seems a waste, it never executes,

- clearing REG_DBI_CTRL write enable bits at the end of
_rtl8821ae_dbi_write,

- switching to 32-bit access as used by rtl8192de.

And a giggle from reviewing the code,
_rtl8821ae_wowlan_initialize_adapter says "Patch Pcie Rx DMA hang
after S3/S4 several times. The root cause has not be found."
... I've learned that root causes that aren't found tend to cause
further problems later. ;-)

Given this, my gut feel is firmware or silicon problem; RX DMA ceases,
the driver does not detect it, the connection is lost.

--
James Cameron
http://quozl.netrek.org/

2018-01-31 17:06:15

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 09/12/2017 05:09 PM, James Cameron wrote:
> Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
> why?
>
> Wireless connection is lost after a few seconds or minutes, on every
> OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
> and any kernel with 40b368af4b75.
>
> dmesg contains
>
> wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
>
> iw event shows
>
> wlp2s0: del station 2c:b0:5d:a6:86:eb
> wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to inactivity
> wlp2s0 (phy #0): disconnected (local request)
>
> Workaround is to bounce the link, then reconnect;
>
> ip link set wlp2s0 down
> ip link set wlp2s0 up
> iw dev wlp2s0 connect qz
>
> A nearby monitor host captures a deauthentication packet sent by the
> device.
>
> Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> issues") which changes the width of DBI register read.
>
> On the face of it, 40b368af4b75 looks correct, especially compared
> against same function in rtl8723be.
>
> I've no idea why reverting fixes the problem. I'm hoping someone here
> might speculate and suggest ways to test.
>
> As keep alive is set through this path, my guess is that keep alive is
> not being set in the device. Or perhaps reading 16-bits perturbs
> another register. Is there a way to test?
>
> http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
>
> http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
> of 40b368af4b75

James,

I'm afraid we are needing to revisit this problem again. Changing that 8-bit
read to a 16-bit version causes an unaligned memory reference in AARCH64, thus
we will need to re-revert. To prevent problems on systems such as yours, PK
plans to turn off ASPM capability and backdoor in certain platforms that will be
listed in a quirks table. Please report the output of 'dmidecode -t system' for
you affected system(s).

We hope you will be able to test any proposed patches.

Thanks,

Larry

2018-02-04 18:18:39

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 02/02/2018 10:45 PM, Pkshih wrote:
> On Fri, 2018-02-02 at 14:13 -0600, Larry Finger wrote:
>> On 02/02/2018 01:50 AM, Pkshih wrote:
>>> Hi James,
>>>
>>> In my experiment, unaligned-word-access may get wrong values that
>>> are different from the value by byte-access. Actually, it can simply
>>> verified by using 'lspci' to check PCI configuration space.
>>>
>>> DBI read 0x70f:
>>> _rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
>>> _rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
>>> _rtl8821ae_dbi_read:1136 r16 0x350 = 0xffff
>>>
>>> DBI read 0x719:
>>> _rtl8821ae_dbi_read:1127 r8 0x34d = 0x0000
>>> _rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
>>> _rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200
>>>
>>>
>>> According to the wrong and original value of 0x70f is 0xff, I think
>>> larger L1 latency 0x70f[5:3] may be helpful. Please help to try
>>> below patch. If it works, quirk table won't be necessary.
>>>
>>> PK
>>>
>>>
>>> diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
>>> index 7d43ba002..e53af06ed 100644
>>> --- a/rtl8821ae/hw.c
>>> +++ b/rtl8821ae/hw.c
>>> @@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, u16 addr)
>>>     }
>>>     if (0 == tmp) {
>>>     read_addr = REG_DBI_RDATA + addr % 4;
>>> - ret = rtl_read_word(rtlpriv, read_addr);
>>> +
>>> + ret = rtl_read_byte(rtlpriv, read_addr);
>>>     }
>>>     return ret;
>>>    }
>>> @@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct ieee80211_hw *hw)
>>>     }
>>>
>>>     tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);
>>> - _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
>>> + _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
>>>
>>>     tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);
>>>     _rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));
>>>
>>
>>
>> PK,
>>
>> This patch works perfectly on my x86_64 system. With it, the interface handled a
>> 10 million count flood ping with <1000 packets dropped. It has now been running
>> for at least 11 hours. Good work.
>>
>> Can you explain that magic 0x38? I'm quite sure that Kalle will not accept the
>> patch as is. He already thinks that rtlwifi and friends already have too many
>> magic numbers.
>>
> The magic 0x38 represents L1 latency 0x70f[5:3]=b'111, and I will add definition
> or macro to describe
> 0x70f and move all ASPM code into pci.c, if the problem was
> solved. Hence, rtl8723de and rtl8723be
> will use the same values, too.
>
> Could you also help to test L1 latency as recommended value b'100?
> (i.e. 0x20 instead)

I have run several tests with the following results:

b'010 Failed very quickly
b'100 Did not fail, but the ping test had errors
b'110 No failure and no errors

My tests seem to indicate that we should use at least 0x30 in the field, and
perhaps we should stay at b'111 or 0x38.

To fix the current alignment errors for ARM architecture, I will push a patch
for mainline and stable that fixes rtl8821ae. Then you can do the patch that
moves all ASPM code into pci.c.

Larry

2018-02-02 20:27:28

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 02/02/2018 01:50 AM, Pkshih wrote:
>
> Hi James,
>
> In my experiment, unaligned-word-access may get wrong values that
> are different from the value by byte-access. Actually, it can simply
> verified by using 'lspci' to check PCI configuration space.
>
> DBI read 0x70f:
> _rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
> _rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
> _rtl8821ae_dbi_read:1136 r16 0x350 = 0xffff
>
> DBI read 0x719:
> _rtl8821ae_dbi_read:1127 r8 0x34d = 0x0000
> _rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
> _rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200
>
>
> According to the wrong and original value of 0x70f is 0xff, I think
> larger L1 latency 0x70f[5:3] may be helpful. Please help to try
> below patch. If it works, quirk table won't be necessary.
>
> PK
>
>
> diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
> index 7d43ba002..e53af06ed 100644
> --- a/rtl8821ae/hw.c
> +++ b/rtl8821ae/hw.c
> @@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, u16 addr)
> }
> if (0 == tmp) {
> read_addr = REG_DBI_RDATA + addr % 4;
> - ret = rtl_read_word(rtlpriv, read_addr);
> +
> + ret = rtl_read_byte(rtlpriv, read_addr);
> }
> return ret;
> }
> @@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct ieee80211_hw *hw)
> }
>
> tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);
> - _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
> + _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
>
> tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);
> _rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));
PK,

One more question: Will we need to make the same change to rtl8723be and rtl8723de?

Larry

2018-02-03 04:45:46

by Ping-Ke Shih

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

T24gRnJpLCAyMDE4LTAyLTAyIGF0IDE0OjEzIC0wNjAwLCBMYXJyeSBGaW5nZXIgd3JvdGU6DQo+
IE9uIDAyLzAyLzIwMTggMDE6NTAgQU0sIFBrc2hpaCB3cm90ZToNCj4gPiBIaSBKYW1lcywNCj4g
PsKgDQo+ID4gSW4gbXkgZXhwZXJpbWVudCwgdW5hbGlnbmVkLXdvcmQtYWNjZXNzIG1heSBnZXQg
d3JvbmcgdmFsdWVzIHRoYXQNCj4gPiBhcmUgZGlmZmVyZW50IGZyb20gdGhlIHZhbHVlIGJ5IGJ5
dGUtYWNjZXNzLiBBY3R1YWxseSwgaXQgY2FuIHNpbXBseQ0KPiA+IHZlcmlmaWVkIGJ5IHVzaW5n
ICdsc3BjaScgdG8gY2hlY2sgUENJIGNvbmZpZ3VyYXRpb24gc3BhY2UuDQo+ID7CoA0KPiA+IERC
SSByZWFkIDB4NzBmOg0KPiA+IF9ydGw4ODIxYWVfZGJpX3JlYWQ6MTEyNyByOCAweDM0ZiA9IDB4
MDAxNw0KPiA+IF9ydGw4ODIxYWVfZGJpX3JlYWQ6MTEzMSByOCAweDM1MCA9IDB4MDAwYw0KPiA+
IF9ydGw4ODIxYWVfZGJpX3JlYWQ6MTEzNiByMTYgMHgzNTAgPSAweGZmZmYNCj4gPsKgDQo+ID4g
REJJIHJlYWQgMHg3MTk6DQo+ID4gX3J0bDg4MjFhZV9kYmlfcmVhZDoxMTI3IHI4IDB4MzRkID0g
MHgwMDAwDQo+ID4gX3J0bDg4MjFhZV9kYmlfcmVhZDoxMTMxIHI4IDB4MzRlID0gMHgwMDAyDQo+
ID4gX3J0bDg4MjFhZV9kYmlfcmVhZDoxMTM2IHIxNiAweDM0ZSA9IDB4MDIwMA0KPiA+wqANCj4g
PsKgDQo+ID4gQWNjb3JkaW5nIHRvIHRoZSB3cm9uZyBhbmQgb3JpZ2luYWwgdmFsdWUgb2YgMHg3
MGYgaXMgMHhmZiwgSSB0aGluaw0KPiA+IGxhcmdlciBMMSBsYXRlbmN5IDB4NzBmWzU6M10gbWF5
IGJlIGhlbHBmdWwuIFBsZWFzZSBoZWxwIHRvIHRyeQ0KPiA+IGJlbG93IHBhdGNoLiBJZiBpdCB3
b3JrcywgcXVpcmsgdGFibGUgd29uJ3QgYmUgbmVjZXNzYXJ5Lg0KPiA+wqANCj4gPiBQSw0KPiA+
wqANCj4gPsKgDQo+ID4gZGlmZiAtLWdpdCBhL3J0bDg4MjFhZS9ody5jIGIvcnRsODgyMWFlL2h3
LmMNCj4gPiBpbmRleCA3ZDQzYmEwMDIuLmU1M2FmMDZlZCAxMDA2NDQNCj4gPiAtLS0gYS9ydGw4
ODIxYWUvaHcuYw0KPiA+ICsrKyBiL3J0bDg4MjFhZS9ody5jDQo+ID4gQEAgLTExMjMsNyArMTEy
Myw4IEBAIHN0YXRpYyB1OCBfcnRsODgyMWFlX2RiaV9yZWFkKHN0cnVjdCBydGxfcHJpdiAqcnRs
cHJpdiwgdTE2IGFkZHIpDQo+ID7CoMKgwqAJfQ0KPiA+wqDCoMKgCWlmICgwID09IHRtcCkgew0K
PiA+wqDCoMKgCQlyZWFkX2FkZHIgPSBSRUdfREJJX1JEQVRBICsgYWRkciAlIDQ7DQo+ID4gLQkJ
cmV0ID0gcnRsX3JlYWRfd29yZChydGxwcml2LCByZWFkX2FkZHIpOw0KPiA+ICsNCj4gPiArCQly
ZXQgPSBydGxfcmVhZF9ieXRlKHJ0bHByaXYsIHJlYWRfYWRkcik7DQo+ID7CoMKgwqAJfQ0KPiA+
wqDCoMKgCXJldHVybiByZXQ7DQo+ID7CoMKgwqB9DQo+ID4gQEAgLTExNjUsNyArMTE2Niw3IEBA
IHN0YXRpYyB2b2lkIF9ydGw4ODIxYWVfZW5hYmxlX2FzcG1fYmFja19kb29yKHN0cnVjdCBpZWVl
ODAyMTFfaHcgKmh3KQ0KPiA+wqDCoMKgCX0NCj4gPsKgwqDCoA0KPiA+wqDCoMKgCXRtcCA9IF9y
dGw4ODIxYWVfZGJpX3JlYWQocnRscHJpdiwgMHg3MGYpOw0KPiA+IC0JX3J0bDg4MjFhZV9kYmlf
d3JpdGUocnRscHJpdiwgMHg3MGYsIHRtcCB8IEJJVCg3KSk7DQo+ID4gKwlfcnRsODgyMWFlX2Ri
aV93cml0ZShydGxwcml2LCAweDcwZiwgdG1wIHwgQklUKDcpIHwgMHgzOCk7DQo+ID7CoMKgwqAN
Cj4gPsKgwqDCoAl0bXAgPSBfcnRsODgyMWFlX2RiaV9yZWFkKHJ0bHByaXYsIDB4NzE5KTsNCj4g
PsKgwqDCoAlfcnRsODgyMWFlX2RiaV93cml0ZShydGxwcml2LCAweDcxOSwgdG1wIHwgQklUKDMp
IHwgQklUKDQpKTsNCj4gPsKgDQo+IA0KPiANCj4gUEssDQo+IA0KPiBUaGlzIHBhdGNoIHdvcmtz
IHBlcmZlY3RseSBvbiBteSB4ODZfNjQgc3lzdGVtLiBXaXRoIGl0LCB0aGUgaW50ZXJmYWNlIGhh
bmRsZWQgYcKgDQo+IDEwIG1pbGxpb24gY291bnQgZmxvb2QgcGluZyB3aXRoIDwxMDAwIHBhY2tl
dHMgZHJvcHBlZC4gSXQgaGFzIG5vdyBiZWVuIHJ1bm5pbmfCoA0KPiBmb3IgYXQgbGVhc3QgMTEg
aG91cnMuIEdvb2Qgd29yay4NCj4gDQo+IENhbiB5b3UgZXhwbGFpbiB0aGF0IG1hZ2ljIDB4Mzg/
IEknbSBxdWl0ZSBzdXJlIHRoYXQgS2FsbGUgd2lsbCBub3QgYWNjZXB0IHRoZcKgDQo+IHBhdGNo
IGFzIGlzLiBIZSBhbHJlYWR5IHRoaW5rcyB0aGF0IHJ0bHdpZmkgYW5kIGZyaWVuZHMgYWxyZWFk
eSBoYXZlIHRvbyBtYW55wqANCj4gbWFnaWMgbnVtYmVycy4NCj4gDQpUaGUgbWFnaWMgMHgzOCBy
ZXByZXNlbnRzIEwxIGxhdGVuY3kgMHg3MGZbNTozXT1iJzExMSwgYW5kIEkgd2lsbCBhZGQgZGVm
aW5pdGlvbg0Kb3IgbWFjcm8gdG8gZGVzY3JpYmUNCjB4NzBmIGFuZCBtb3ZlIGFsbCBBU1BNIGNv
ZGUgaW50byBwY2kuYywgaWYgdGhlIHByb2JsZW0gd2FzDQpzb2x2ZWQuIEhlbmNlLCBydGw4NzIz
ZGUgYW5kIHJ0bDg3MjNiZQ0Kd2lsbCB1c2UgdGhlIHNhbWUgdmFsdWVzLCB0b28uDQoNCkNvdWxk
IHlvdSBhbHNvIGhlbHAgdG8gdGVzdCBMMSBsYXRlbmN5IGFzIHJlY29tbWVuZGVkIHZhbHVlIGIn
MTAwP8KgDQooaS5lLiAweDIwIGluc3RlYWQpDQoNClRoYW5rcw0KUEsNCg0K

2018-02-02 20:13:11

by Larry Finger

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On 02/02/2018 01:50 AM, Pkshih wrote:
> Hi James,
>
> In my experiment, unaligned-word-access may get wrong values that
> are different from the value by byte-access. Actually, it can simply
> verified by using 'lspci' to check PCI configuration space.
>
> DBI read 0x70f:
> _rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
> _rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
> _rtl8821ae_dbi_read:1136 r16 0x350 = 0xffff
>
> DBI read 0x719:
> _rtl8821ae_dbi_read:1127 r8 0x34d = 0x0000
> _rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
> _rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200
>
>
> According to the wrong and original value of 0x70f is 0xff, I think
> larger L1 latency 0x70f[5:3] may be helpful. Please help to try
> below patch. If it works, quirk table won't be necessary.
>
> PK
>
>
> diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
> index 7d43ba002..e53af06ed 100644
> --- a/rtl8821ae/hw.c
> +++ b/rtl8821ae/hw.c
> @@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, u16 addr)
> }
> if (0 == tmp) {
> read_addr = REG_DBI_RDATA + addr % 4;
> - ret = rtl_read_word(rtlpriv, read_addr);
> +
> + ret = rtl_read_byte(rtlpriv, read_addr);
> }
> return ret;
> }
> @@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct ieee80211_hw *hw)
> }
>
> tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);
> - _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
> + _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
>
> tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);
> _rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));
>


PK,

This patch works perfectly on my x86_64 system. With it, the interface handled a
10 million count flood ping with <1000 packets dropped. It has now been running
for at least 11 hours. Good work.

Can you explain that magic 0x38? I'm quite sure that Kalle will not accept the
patch as is. He already thinks that rtlwifi and friends already have too many
magic numbers.

Larry

2018-02-02 07:51:02

by Ping-Ke Shih

[permalink] [raw]
Subject: RE: rtl8821ae keep alive not set, connection lost


> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf
> Of James Cameron
> Sent: Thursday, February 01, 2018 2:22 PM
> To: Larry Finger
> Cc: [email protected]; Pkshih
> Subject: Re: rtl8821ae keep alive not set, connection lost
>
> On Wed, Jan 31, 2018 at 11:06:12AM -0600, Larry Finger wrote:
> > On 09/12/2017 05:09 PM, James Cameron wrote:
> > >Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> > >rtl8821ae keep alive, causing "Connection to AP lost" and deauth,
> > >but why?
> > >
> > >Wireless connection is lost after a few seconds or minutes, on
> > >every OLPC NL3 laptop with rtl8821ae, with any stable kernel after
> > >4.10.1, and any kernel with 40b368af4b75.
> > >
> > >dmesg contains
> > >
> > > wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
> > >
> > >iw event shows
> > >
> > > wlp2s0: del station 2c:b0:5d:a6:86:eb
> > > wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to
> inactivity
> > > wlp2s0 (phy #0): disconnected (local request)
> > >
> > >Workaround is to bounce the link, then reconnect;
> > >
> > > ip link set wlp2s0 down
> > > ip link set wlp2s0 up
> > > iw dev wlp2s0 connect qz
> > >
> > >A nearby monitor host captures a deauthentication packet sent by
> > >the device.
> > >
> > >Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> > >issues") which changes the width of DBI register read.
> > >
> > >On the face of it, 40b368af4b75 looks correct, especially compared
> > >against same function in rtl8723be.
> > >
> > >I've no idea why reverting fixes the problem. I'm hoping someone
> > >here might speculate and suggest ways to test.
> > >
> > >As keep alive is set through this path, my guess is that keep alive
> > >is not being set in the device. Or perhaps reading 16-bits
> > >perturbs another register. Is there a way to test?
> > >
> > >http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
> > >
> > >http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and
> > >revert of 40b368af4b75
> >
> > James,
> >
> > I'm afraid we are needing to revisit this problem again. Changing
> > that 8-bit read to a 16-bit version causes an unaligned memory
> > reference in AARCH64, thus we will need to re-revert. To prevent
> > problems on systems such as yours, PK plans to turn off ASPM
> > capability and backdoor in certain platforms that will be listed in
> > a quirks table. Please report the output of 'dmidecode -t system'
> > for you affected system(s).
>
> Thanks for letting me know.
>
> We made three production runs, and I'm waiting to get a hold of the
> dmidecode for two of them. This may take some weeks; we have to find
> stock and ship it, or we have to ask our contract manufacturer (CM) if
> they have kept data or units.
>
> I've dmidecode for one production run.
>
> http://dev.laptop.org/~quozl/z/1eh7JF.txt (my unit nl3-e)
>
> I've dmidecode for prototypes, but they have clearly been programmed
> badly. We did not ask our CM for Windows compatibility, so they may
> have had no step to verify the data. We also went through several
> iterations to get serial numbers assigned, so the data I have does not
> have good provenance.
>
> http://dev.laptop.org/~quozl/z/1eh7EE.txt (my unit nl3-c)
> http://dev.laptop.org/~quozl/z/1eh7EV.txt (my unit nl3-d)
> http://dev.laptop.org/~quozl/z/1eh7He.txt (my unit nl3-a)
> http://dev.laptop.org/~quozl/z/1eh8DR.txt (my unit nl3-b)
>
> > We hope you will be able to test any proposed patches.
>
> Yes, can do.
>
> I've just tested v4.15.
>
> However, I'm concerned about your plan to use quirks;
>
> 1. turning off ASPM may decrease run time on battery, which if it is
> significant, across several thousand laptops will yield generator fuel
> or solar budget failure; can the power impact be quantified?
>
> 2. why not keep ASPM enabled, and use 8-bit when quirked, or on
> x86_64, or when not AARCH64?
>
> 3. why not find the underlying problem; PK is in the same company as
> the device firmware engineers, so it should be possible for them to
> find out why 16-bit access causes the device firmware to hang? We
> drew a blank trying to reach firmware engineers through our CM and
> module maker; perhaps we were not large or noisy enough.
>
> 4. it's not just me; there are others who have reported similar
> problems, so won't re-reverting affect them? They haven't engaged in
> the process as thoroughly, and may not be in the quirks table. You
> also reproduced the problem with different hardware.
>

Hi James,

In my experiment, unaligned-word-access may get wrong values that
are different from the value by byte-access. Actually, it can simply
verified by using 'lspci' to check PCI configuration space.

DBI read 0x70f:
_rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
_rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
_rtl8821ae_dbi_read:1136 r16 0x350 = 0xffff

DBI read 0x719:
_rtl8821ae_dbi_read:1127 r8 0x34d = 0x0000
_rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
_rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200


According to the wrong and original value of 0x70f is 0xff, I think
larger L1 latency 0x70f[5:3] may be helpful. Please help to try
below patch. If it works, quirk table won't be necessary.

PK


diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
index 7d43ba002..e53af06ed 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, u16 addr)
}
if (0 == tmp) {
read_addr = REG_DBI_RDATA + addr % 4;
- ret = rtl_read_word(rtlpriv, read_addr);
+
+ ret = rtl_read_byte(rtlpriv, read_addr);
}
return ret;
}
@@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct ieee80211_hw *hw)
}

tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);
- _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
+ _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);

tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);
_rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));

2018-02-01 06:31:28

by James Cameron

[permalink] [raw]
Subject: Re: rtl8821ae keep alive not set, connection lost

On Wed, Jan 31, 2018 at 11:06:12AM -0600, Larry Finger wrote:
> On 09/12/2017 05:09 PM, James Cameron wrote:
> >Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> >rtl8821ae keep alive, causing "Connection to AP lost" and deauth,
> >but why?
> >
> >Wireless connection is lost after a few seconds or minutes, on
> >every OLPC NL3 laptop with rtl8821ae, with any stable kernel after
> >4.10.1, and any kernel with 40b368af4b75.
> >
> >dmesg contains
> >
> > wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
> >
> >iw event shows
> >
> > wlp2s0: del station 2c:b0:5d:a6:86:eb
> > wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to inactivity
> > wlp2s0 (phy #0): disconnected (local request)
> >
> >Workaround is to bounce the link, then reconnect;
> >
> > ip link set wlp2s0 down
> > ip link set wlp2s0 up
> > iw dev wlp2s0 connect qz
> >
> >A nearby monitor host captures a deauthentication packet sent by
> >the device.
> >
> >Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> >issues") which changes the width of DBI register read.
> >
> >On the face of it, 40b368af4b75 looks correct, especially compared
> >against same function in rtl8723be.
> >
> >I've no idea why reverting fixes the problem. I'm hoping someone
> >here might speculate and suggest ways to test.
> >
> >As keep alive is set through this path, my guess is that keep alive
> >is not being set in the device. Or perhaps reading 16-bits
> >perturbs another register. Is there a way to test?
> >
> >http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
> >
> >http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and
> >revert of 40b368af4b75
>
> James,
>
> I'm afraid we are needing to revisit this problem again. Changing
> that 8-bit read to a 16-bit version causes an unaligned memory
> reference in AARCH64, thus we will need to re-revert. To prevent
> problems on systems such as yours, PK plans to turn off ASPM
> capability and backdoor in certain platforms that will be listed in
> a quirks table. Please report the output of 'dmidecode -t system'
> for you affected system(s).

Thanks for letting me know.

We made three production runs, and I'm waiting to get a hold of the
dmidecode for two of them. This may take some weeks; we have to find
stock and ship it, or we have to ask our contract manufacturer (CM) if
they have kept data or units.

I've dmidecode for one production run.

http://dev.laptop.org/~quozl/z/1eh7JF.txt (my unit nl3-e)

I've dmidecode for prototypes, but they have clearly been programmed
badly. We did not ask our CM for Windows compatibility, so they may
have had no step to verify the data. We also went through several
iterations to get serial numbers assigned, so the data I have does not
have good provenance.

http://dev.laptop.org/~quozl/z/1eh7EE.txt (my unit nl3-c)
http://dev.laptop.org/~quozl/z/1eh7EV.txt (my unit nl3-d)
http://dev.laptop.org/~quozl/z/1eh7He.txt (my unit nl3-a)
http://dev.laptop.org/~quozl/z/1eh8DR.txt (my unit nl3-b)

> We hope you will be able to test any proposed patches.

Yes, can do.

I've just tested v4.15.

However, I'm concerned about your plan to use quirks;

1. turning off ASPM may decrease run time on battery, which if it is
significant, across several thousand laptops will yield generator fuel
or solar budget failure; can the power impact be quantified?

2. why not keep ASPM enabled, and use 8-bit when quirked, or on
x86_64, or when not AARCH64?

3. why not find the underlying problem; PK is in the same company as
the device firmware engineers, so it should be possible for them to
find out why 16-bit access causes the device firmware to hang? We
drew a blank trying to reach firmware engineers through our CM and
module maker; perhaps we were not large or noisy enough.

4. it's not just me; there are others who have reported similar
problems, so won't re-reverting affect them? They haven't engaged in
the process as thoroughly, and may not be in the quirks table. You
also reproduced the problem with different hardware.

> Thanks,
>
> Larry

--
James Cameron
http://quozl.netrek.org/