2023-11-30 15:15:20

by James Prestwood

Subject: Ping/IP network loss post-roam

Hi,

We have noticed a sporadic problem that seems to come and go. It was
first seen only with one specific AP manufacturer, but more recently it
showed up with another, which is why I'm revisiting the problem.

The client is using an ath10k QCA6174 (hw 3.2). The network is WPA2,
configured with FT. This has been seen with both FT-over-Air and
FT-over-DS.

The client always roams using FT without a problem. There is no
indication of a failed FT (ft-auth/ft-action and assoc are both
successful). Sometimes, though, after the roam the client seems to lose
all IP networking capabilities (pings and TCP/UDP all fail).

IWD is getting zero indication that there is a problem after the roam:
no packet/beacon loss CQM events, no deauths. On the AP side (if we even
get any indication of a problem) we see "client not responding". It
appears there is a mismatch between the client's state and the state
the AP thinks the client is in. The client thinks it's connected; the AP
thinks the client has disappeared.

After noticing this problem we implemented a watchdog which starts
pinging post-roam, and if enough pings fail it triggers a deauth and
authenticates again. This at least gets the client back on the network,
but obviously isn't great because the client loses networking for an
extended period while waiting for pings to fail, then deauthing/reauthing,
doing DHCP, etc. We haven't gotten any traction trying to explain the
issue to the AP vendor. It's always a client issue...
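
To make the watchdog logic concrete, here is a rough sketch of the idea
in Python. This is not our actual implementation: the function names,
the thresholds (10 pings, 5 consecutive failures) and the 1 second
interval are made up for illustration, and reconnect() is a hypothetical
callback standing in for whatever triggers the deauth/reauth.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo with iputils ping; True if a reply arrived."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def post_roam_watchdog(
    ping: Callable[[], bool],
    reconnect: Callable[[], None],
    max_pings: int = 10,       # assumed value, not from the real watchdog
    fail_threshold: int = 5,   # assumed value, not from the real watchdog
    interval_s: float = 1.0,
) -> bool:
    """Ping after a roam; force a reconnect after too many failures in a row.

    Returns True if the link looked healthy, False if reconnect() was called.
    """
    consecutive_failures = 0
    for _ in range(max_pings):
        if ping():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= fail_threshold:
                reconnect()  # hypothetical hook: deauth, then auth again
                return False
        time.sleep(interval_s)
    return True
```

It would be called right after a successful roam, e.g.
post_roam_watchdog(lambda: ping_once(gateway), reconnect), where gateway
and reconnect are placeholders. The detection time is roughly
fail_threshold times the ping timeout plus interval, which is where the
extended outage comes from.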

These are production devices and ath10k debugging is not built into the
module. All I have is kernel/IWD logs, which just show that the roam was
successful and that we deauthed later. Not particularly useful.

I'm trying to determine where the problem is (client side or
infrastructure), and whether there is anything that can be done from
either an ath10k driver or supplicant (IWD) perspective. Getting ath10k
logs is something I'd like to do eventually, but it's easier said than
done. These devices are always running and customers generally don't
want them messed with. I have ssh access, so if there is any additional
info I can get without kernel changes I'm happy to try.

Nov 30 13:33:11 kernel: wlan0: disconnect from AP xx:xx:xx:xx:xx:xx for new assoc to yy:yy:yy:yy:yy:yy
Nov 30 13:33:11 kernel: wlan0: associate with yy:yy:yy:yy:yy:yy (try 1/3)
Nov 30 13:33:11 kernel: wlan0: RX ReassocResp from yy:yy:yy:yy:yy:yy (capab=0x411 status=0 aid=6)
Nov 30 13:33:11 kernel: wlan0: associated
Nov 30 13:33:11 kernel: ath: EEPROM regdomain: 0x809c
Nov 30 13:33:11 kernel: ath: EEPROM indicates we should expect a country code
Nov 30 13:33:11 kernel: ath: doing EEPROM country->regdmn map search
Nov 30 13:33:11 kernel: ath: country maps to regdmn code: 0x52
Nov 30 13:33:11 kernel: ath: Country alpha2 being used: CN
Nov 30 13:33:11 kernel: ath: Regpair used: 0x52
Nov 30 13:33:11 kernel: ath: regdomain 0x809c dynamically updated by country element

# This condition is detected by the watchdog, and we deauth

Nov 30 13:33:34 kernel: wlan0: deauthenticating from yy:yy:yy:yy:yy:yy by local choice (Reason: 3=DEAUTH_LEAVING)

# We then auth to the very same BSS successfully, and have no problems
# (until it happens again sometime later)

Nov 30 13:33:36 kernel: wlan0: authenticate with yy:yy:yy:yy:yy:yy
Nov 30 13:33:36 kernel: wlan0: send auth to yy:yy:yy:yy:yy:yy (try 1/3)
Nov 30 13:33:36 kernel: wlan0: authenticated
Nov 30 13:33:36 kernel: wlan0: associate with yy:yy:yy:yy:yy:yy (try 1/3)
Nov 30 13:33:36 kernel: wlan0: RX AssocResp from yy:yy:yy:yy:yy:yy (capab=0x411 status=0 aid=6)
Nov 30 13:33:36 kernel: wlan0: associated
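
For reference, the length of the outage can be pulled out of logs like
the above with something along these lines. It is only a sketch: it
assumes the syslog timestamp format shown here, takes the year as a
parameter since syslog does not record it, and outage_seconds() is just
an illustrative name.

```python
from datetime import datetime
from typing import Iterable, Optional


def outage_seconds(log_lines: Iterable[str], year: int = 2023) -> Optional[float]:
    """Seconds between the last 'associated' and the next local deauth."""
    associated_at = None
    for line in log_lines:
        # The syslog timestamp is the first 15 characters: 'Nov 30 13:33:11'
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line or a '#' comment
        if line.rstrip().endswith("wlan0: associated"):
            associated_at = stamp
        elif "deauthenticating from" in line and associated_at is not None:
            return (stamp - associated_at).total_seconds()
    return None
```

For the excerpt above this gives 23.0, i.e. the client sat without IP
connectivity for about 23 seconds before the watchdog deauthed, plus a
couple more seconds to get associated again.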

Thanks,

James