2017-06-02 22:02:22

by Nathan Royce

Subject: ath9k - Division by zero in kernel (as well as firmware panic)

ODroid XU4

$ uname -a
Linux computer 4.12.0-rc3-dirty #1 SMP Wed May 31 15:02:05 CDT 2017
armv7l GNU/Linux

$ lsusb
...
Bus 001 Device 002: ID 2109:2813 VIA Labs, Inc.
Bus 001 Device 010: ID 0cf3:7015 Qualcomm Atheros Communications
TP-Link TL-WN821N v3 / TL-WN822N v2 802.11n [Atheros AR7010+AR9287]
...

*****
Jun 02 16:20:11 computer hostapd[14954]: vwlan0: interface state
COUNTRY_UPDATE->HT_SCAN
Jun 02 16:20:17 computer hostapd[14954]: 20/40 MHz operation not
permitted on channel pri=7 sec=3 based on overlapping BSSes
Jun 02 16:20:18 computer kernel: Division by zero in kernel.
Jun 02 16:20:18 computer kernel: CPU: 1 PID: 14507 Comm: kworker/u16:2
Tainted: G W 4.12.0-rc3-dirty #1
Jun 02 16:20:18 computer kernel: Hardware name: SAMSUNG EXYNOS
(Flattened Device Tree)
Jun 02 16:20:18 computer kernel: Workqueue: phy5 ieee80211_scan_work [mac80211]
Jun 02 16:20:18 computer kernel: [<c010ee0c>] (unwind_backtrace) from
[<c010b61c>] (show_stack+0x10/0x14)
Jun 02 16:20:18 computer kernel: [<c010b61c>] (show_stack) from
[<c0377708>] (dump_stack+0x88/0x9c)
Jun 02 16:20:18 computer kernel: [<c0377708>] (dump_stack) from
[<c03755d0>] (Ldiv0_64+0x8/0x18)
Jun 02 16:20:18 computer kernel: [<c03755d0>] (Ldiv0_64) from
[<bf71c9a4>] (ath9k_get_next_tbtt+0x58/0x5c [ath9k_common])
Jun 02 16:20:18 computer kernel: [<bf71c9a4>] (ath9k_get_next_tbtt
[ath9k_common]) from [<bf71cb90>] (ath9k_cmn_beacon_config
Jun 02 16:20:18 computer kernel: [<bf71cb90>]
(ath9k_cmn_beacon_config_ap [ath9k_common]) from [<bf7898c8>]
(ath9k_htc_beacon
Jun 02 16:20:18 computer kernel: [<bf7898c8>]
(ath9k_htc_beacon_config_ap [ath9k_htc]) from [<bf7885a8>]
(ath9k_htc_vif_recon
Jun 02 16:20:18 computer kernel: [<bf7885a8>] (ath9k_htc_vif_reconfig
[ath9k_htc]) from [<bf78860c>] (ath9k_htc_sw_scan_compl
Jun 02 16:20:18 computer kernel: [<bf78860c>]
(ath9k_htc_sw_scan_complete [ath9k_htc]) from [<bf506d38>]
(__ieee80211_scan_co
Jun 02 16:20:18 computer kernel: [<bf506d38>]
(__ieee80211_scan_completed [mac80211]) from [<bf507968>]
(ieee80211_scan_work+
Jun 02 16:20:18 computer kernel: [<bf507968>] (ieee80211_scan_work
[mac80211]) from [<c0133f10>] (process_one_work+0x1d8/0x40
Jun 02 16:20:18 computer kernel: [<c0133f10>] (process_one_work) from
[<c0134cb4>] (worker_thread+0x4c/0x564)
Jun 02 16:20:18 computer kernel: [<c0134cb4>] (worker_thread) from
[<c0139c20>] (kthread+0x14c/0x154)
Jun 02 16:20:18 computer kernel: [<c0139c20>] (kthread) from
[<c0107c38>] (ret_from_fork+0x14/0x3c)
Jun 02 16:20:18 computer hostapd[14954]: Using interface wlan0 with
hwaddr <sanitized> and ssid "<sanitized>"
Jun 02 16:20:18 computer kernel: IPv6: ADDRCONF(NETDEV_CHANGE):
vwlan0: link becomes ready
*****
This is a new one on me.
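
The trace above shows Ldiv0_64 being reached from ath9k_get_next_tbtt, i.e. a 64-bit division whose divisor was zero. Below is a minimal user-space sketch of how such a calculation could be guarded. It assumes (the trace does not confirm this) that the divisor is the beacon interval and that it is still zero when the scan completes. The function name next_tbtt and its parameters are hypothetical and are not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of a "next TBTT" calculation.
 * Assumption: the kernel function divides the 64-bit TSF by the beacon
 * interval; the names and the guard are illustrative only.
 */
static bool next_tbtt(uint64_t tsf_us, uint32_t interval_us, uint32_t *out)
{
	uint64_t offset;

	assert(out != NULL);

	/* A zero interval would divide by zero, so report failure instead. */
	if (interval_us == 0)
		return false;

	/* Time already elapsed inside the current beacon interval. */
	offset = tsf_us % interval_us;

	/* Low 32 bits of the start of the next interval. */
	*out = (uint32_t)(tsf_us - offset + interval_us);
	return true;
}
```

With a TSF of 250000 us and an interval of 102400 us this yields 307200, while an interval of 0 is rejected rather than divided by.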

The "normal" problem, which a search shows to be a very old issue and
which I hit consistently (daily or several times a day), is:
*****
Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
Jun 02 14:55:30 computer kernel: usb 1-1.1: USB disconnect, device number 9
Jun 02 14:55:30 computer systemd-networkd[11959]: vwlan0: Lost carrier
Jun 02 14:55:30 computer kernel: br0: port 2(vwlan0) entered disabled state
Jun 02 14:55:30 computer kernel: wlan0: deauthenticating from
<sanitized> by local choice (Reason: 3=DEAUTH_LEAVING)
Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
Jun 02 14:55:30 computer systemd-networkd[11959]: wlan0: Lost carrier
Jun 02 14:55:30 computer systemd[1]: Stopping A simple WPA encrypted
wireless connection using a static IP...
-- Subject: Unit [email protected] has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit [email protected] has begun shutting down.
Jun 02 14:55:30 computer kernel: device vwlan0 left promiscuous mode
Jun 02 14:55:30 computer kernel: br0: port 2(vwlan0) entered disabled state
Jun 02 14:55:30 computer audit: ANOM_PROMISCUOUS dev=vwlan0 prom=0
old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
Jun 02 14:55:30 computer hostapd[13218]: vwlan0: AP-STA-DISCONNECTED <sanitized>
Jun 02 14:55:30 computer hostapd[13218]: Failed to set beacon parameters
Jun 02 14:55:30 computer hostapd[13218]: vwlan0: INTERFACE-DISABLED
Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: USB layer deinitialized
Jun 02 14:55:30 computer systemd[1]: Starting Load/Save RF Kill Switch Status...
-- Subject: Unit systemd-rfkill.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit systemd-rfkill.service has begun starting up.
Jun 02 14:55:30 computer systemd[1]: Started Load/Save RF Kill Switch Status.
-- Subject: Unit systemd-rfkill.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit systemd-rfkill.service has finished starting up.
--
-- The start-up result is done.
Jun 02 14:55:30 computer network[13261]: Stopping network profile 'wlan0'...
Jun 02 14:55:30 computer kernel: usb 1-1.1: new high-speed USB device
number 10 using exynos-ehci
Jun 02 14:55:30 computer kernel: usb 1-1.1: New USB device found,
idVendor=0cf3, idProduct=7015
Jun 02 14:55:30 computer kernel: usb 1-1.1: New USB device strings:
Mfr=16, Product=32, SerialNumber=48
Jun 02 14:55:30 computer kernel: usb 1-1.1: Product: USB WLAN
Jun 02 14:55:30 computer kernel: usb 1-1.1: Manufacturer: ATHEROS
Jun 02 14:55:30 computer kernel: usb 1-1.1: SerialNumber: 12345
Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Firmware
ath9k_htc/htc_7010-1.4.0.fw requested
Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Transferred FW:
ath9k_htc/htc_7010-1.4.0.fw, size: 72812
Jun 02 14:55:30 computer kernel: ath9k_htc 1-1.1:1.0: ath9k_htc: HTC
initialized with 45 credits
Jun 02 14:55:31 computer kernel: ath9k_htc 1-1.1:1.0: ath9k_htc: FW Version: 1.4
Jun 02 14:55:31 computer kernel: ath9k_htc 1-1.1:1.0: FW RMW support: On
Jun 02 14:55:31 computer kernel: ath: EEPROM regdomain: 0x809c
Jun 02 14:55:31 computer kernel: ath: EEPROM indicates we should
expect a country code
Jun 02 14:55:31 computer kernel: ath: doing EEPROM country->regdmn map search
Jun 02 14:55:31 computer kernel: ath: country maps to regdmn code: 0x52
Jun 02 14:55:31 computer kernel: ath: Country alpha2 being used: CN
Jun 02 14:55:31 computer kernel: ath: Regpair used: 0x52
Jun 02 14:55:31 computer kernel: ieee80211 phy5: Atheros AR9287 Rev:2
Jun 02 14:55:31 computer kernel: IPv6: ADDRCONF(NETDEV_UP): vwlan0:
link is not ready
Jun 02 14:55:31 computer hostapd[13218]: vwlan0: INTERFACE-ENABLED
Jun 02 14:55:31 computer network[13261]: Stopped network profile 'wlan0'
*****
I don't know the particular reason for this one.
At first it would happen every time I compiled anything (all CPUs in
use). Then I added the ZTE Mobley to the USB hub. Even after removing
the Mobley, the panic would still happen often.
I then recompiled the kernel so that only the 4 LITTLE CPUs were used
(big.LITTLE support+switcher), but the panic still happens sometimes.
Now the common factor seems to be that the wireless adapter is used as
both AP and managed client.


2017-06-07 07:08:28

by Oleksij Rempel

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

On 07.06.2017 at 02:12, Tobias Diedrich wrote:
> Oleksij Rempel wrote:
>> Yes, this is the "normal" problem. The firmware has no error handler for
>> PCI bus related exceptions. So if we fail to read the PCI bus the first
>> time, we have the choice to Oops and stall, or Oops and reboot ASAP. So we
>> reboot and provide a kernel "firmware panic!" message.
>> Everyone who can or wants to fix this is welcome.
>>
>>> *****
>>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
>>> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
> [...]
>
>> memdmp 50ae78 50ae88
>
> 50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@
>
> [...copy to bin...]
> $ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
> [..]
> 0: 6c1004 entry a1, 32
> 3: 126aa2 l32r a2, 0xfffdaa8c
> 6: 0c0200 memw
> 9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
> b: c020 movi.n a2, 0
> d: 081940 extui a9, a8, 1, 1
>
> Judging from that it should be fairly simple to at least implement
> some sort of retry, possibly after triggering a PCIe link retrain?

I assume, yes.

> There are some related PCIe root complex registers that may point to
> what exactly failed if they were dumped.
>
> The root complex registers live at 0x00040000 and I think match the
> registers described for the root complex in the AR9344 datasheet.

Unfortunately I don't have the AR7010 docs to tell.

> PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
> "A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
> the hierarchy reports any of the following errors and the associated
> enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
> ERR_NONFATAL."
>
> AFAICS link retrain can be done by setting bit3 (INIT_RST,
> "Application request to initiate a training reset") in
> PCIE_APP (0x40000).
>
> See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
> flips some bits in the RC to enable the PCIe bus for reading the
> EEPROM).
>
> The root complex pci configuration space is at 0x20000 which could
> have further error details:
>> memdmp 20000 20200
>
> 020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
> 020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
> 020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
> 020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
> 020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
> 020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
> 020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
> 0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
> 020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
> 020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
> 020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
> 020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>
> Transformed into something suitable for feeding into lspci -F:
>
> 00:00.0 Description filled in by lspci
> 00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> 40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
> 50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
> 80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
> 90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> $ lspci -F /tmp/hexdump -vvv
> 00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
> !!! Invalid class 0000 for header type 01
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 255
> Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
> I/O behind bridge: 00000000-00000fff
> Memory behind bridge: 00000000-000fffff
> Prefetchable memory behind bridge: 00000000-000fffff
> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0
> ExtTag- RBE+
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
> ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> RootCap: CRSVisible-
> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>

Looks promising :)

>>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Transferred FW:
>>> ath9k_htc/htc_7010-1.4.0.fw, size: 72812
>
> $ ls -l /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
> -rw-r--r-- 1 root root 72812 Dec 14 04:59 /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
> $ sha1sum /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
> 959cb6550930de2882e12b9a549c3cf0c9bf51ac /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw

--
Regards,
Oleksij



2017-06-07 00:19:12

by Tobias Diedrich

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

Oleksij Rempel wrote:
> Yes, this is the "normal" problem. The firmware has no error handler for
> PCI bus related exceptions. So if we fail to read the PCI bus the first
> time, we have the choice to Oops and stall, or Oops and reboot ASAP. So we
> reboot and provide a kernel "firmware panic!" message.
> Everyone who can or wants to fix this is welcome.
>
> > *****
> > Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
> > exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
[...]

>memdmp 50ae78 50ae88

50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@

[...copy to bin...]
$ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
[..]
0: 6c1004 entry a1, 32
3: 126aa2 l32r a2, 0xfffdaa8c
6: 0c0200 memw
9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
b: c020 movi.n a2, 0
d: 081940 extui a9, a8, 1, 1
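
The link between the reported pc and the marked load can be checked with a little arithmetic. This is only a sketch: it assumes the RAM dumped at 0x50ae78 is the same memory the CPU executes from at 0x90ae78, which is inferred from the matching low address bits and not stated anywhere in the log. The helper name pc_to_dump_offset and the mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: both address aliases share their low 20 address bits. */
#define ALIAS_MASK 0x000fffffu

/*
 * Hypothetical helper: byte offset of `pc` within a dump that starts at
 * `dump_base`, comparing only the address bits the two aliases share.
 */
static uint32_t pc_to_dump_offset(uint32_t pc, uint32_t dump_base)
{
	assert((pc & ALIAS_MASK) >= (dump_base & ALIAS_MASK));
	return (pc & ALIAS_MASK) - (dump_base & ALIAS_MASK);
}
```

For pc 0x0090ae81 and dump base 0x50ae78 the offset is 9, which is the l32i.n load marked in the disassembly.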

Judging from that it should be fairly simple to at least implement
some sort of retry, possibly after triggering a PCIe link retrain?
There are some related PCIe root complex registers that may point to
what exactly failed if they were dumped.
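
The retry idea can be modelled in user space as follows. This is a sketch under stated assumptions, not firmware code: on the real chip a failed load raises an exception instead of returning an error, so the retry would have to live in the exception handler. Every name here (bus_ops, read32_with_retry, fake_bus and its helpers) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus hooks; real firmware would touch hardware registers. */
struct bus_ops {
	bool (*read32)(void *ctx, uint32_t addr, uint32_t *val);
	void (*retrain)(void *ctx);
	void *ctx;
};

/*
 * Try a read up to `max_tries` times, asking for a link retrain after
 * every failed attempt. Returns true once a read succeeds.
 */
static bool read32_with_retry(const struct bus_ops *ops, uint32_t addr,
			      uint32_t *val, unsigned int max_tries)
{
	unsigned int i;

	assert(ops != NULL && val != NULL);

	for (i = 0; i < max_tries; i++) {
		if (ops->read32(ops->ctx, addr, val))
			return true;
		ops->retrain(ops->ctx);
	}
	return false;
}

/* Fake bus for experiments: fails `fail_count` reads, then returns 0x1234. */
struct fake_bus {
	unsigned int fail_count;
	unsigned int retrains;
};

static bool fake_read32(void *ctx, uint32_t addr, uint32_t *val)
{
	struct fake_bus *bus = ctx;

	(void)addr;
	if (bus->fail_count > 0) {
		bus->fail_count--;
		return false;
	}
	*val = 0x1234;
	return true;
}

static void fake_retrain(void *ctx)
{
	struct fake_bus *bus = ctx;

	bus->retrains++;
}
```

A bounded retry count keeps a permanently dead link from hanging the firmware; what to do once the retries are exhausted (for example the existing reboot path) is left open here.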

The root complex registers live at 0x00040000 and I think match the
registers described for the root complex in the AR9344 datasheet.

PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
"A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
the hierarchy reports any of the following errors and the associated
enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
ERR_NONFATAL."

AFAICS link retrain can be done by setting bit3 (INIT_RST,
"Application request to initiate a training reset") in
PCIE_APP (0x40000).
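
Setting that bit could look roughly like the sketch below. It uses only the address and bit number given above; the helper name pcie_request_retrain is hypothetical, and whether INIT_RST clears itself or other PCIE_APP bits need care is not something I can confirm without the datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values quoted in the text: PCIE_APP at 0x40000, INIT_RST is bit 3. */
#define PCIE_APP_ADDR     0x00040000u
#define PCIE_APP_INIT_RST (1u << 3)

/*
 * Hypothetical helper: request a link retrain with a read-modify-write so
 * the other PCIE_APP bits keep their values. The register is passed in as
 * a pointer so the function can be tried against a plain variable.
 */
static void pcie_request_retrain(volatile uint32_t *pcie_app)
{
	assert(pcie_app != NULL);
	*pcie_app |= PCIE_APP_INIT_RST;
}
```

On the target the call would presumably be pcie_request_retrain((volatile uint32_t *)PCIE_APP_ADDR); on a host it can be exercised against an ordinary variable.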

See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
flips some bits in the RC to enable the PCIe bus for reading the
EEPROM).

The root complex pci configuration space is at 0x20000 which could
have further error details:
>memdmp 20000 20200

020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................

Transformed into something suitable for feeding into lspci -F:

00:00.0 Description filled in by lspci
00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
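
The transformation between the two listings above can be sketched per line as shown here. It assumes memdmp prints each 32-bit word most significant byte first, which matches the two dumps (a02a 168c becomes 8c 16 2a a0). The function name memdmp_to_lspci and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical converter for one memdmp line, e.g.
 *   "020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*.............."
 * into one line of the hex dump that lspci -F reads, e.g.
 *   "00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00"
 * Assumption: memdmp shows 32-bit words big-endian while PCI configuration
 * space is little-endian, so the four bytes of every word are reversed.
 * `base` is the address where config space starts (0x20000 in the dump).
 * Returns 0 on success, -1 if the line does not parse or `out` is too small.
 */
static int memdmp_to_lspci(const char *line, unsigned int base,
			   char *out, size_t out_len)
{
	unsigned int addr, h[8];
	int n, i;

	assert(line != NULL && out != NULL);

	if (sscanf(line, "%x: %4x %4x %4x %4x %4x %4x %4x %4x", &addr,
		   &h[0], &h[1], &h[2], &h[3], &h[4], &h[5], &h[6], &h[7]) != 9)
		return -1;
	if (addr < base)
		return -1;

	/* The dump file uses offsets relative to the start of config space. */
	n = snprintf(out, out_len, "%02x:", addr - base);
	for (i = 0; i < 8 && n > 0 && (size_t)n < out_len; i += 2) {
		/* h[i] is the high half and h[i + 1] the low half of a word. */
		n += snprintf(out + n, out_len - (size_t)n,
			      " %02x %02x %02x %02x",
			      h[i + 1] & 0xffu, h[i + 1] >> 8,
			      h[i] & 0xffu, h[i] >> 8);
	}
	return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}
```

The trailing ASCII column of the memdmp line is ignored, and the header line ("00:00.0 Description filled in by lspci") still has to be written by hand.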

$ lspci -F /tmp/hexdump -vvv
00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
!!! Invalid class 0000 for header type 01
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 255
Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
I/O behind bridge: 00000000-00000fff
Memory behind bridge: 00000000-000fffff
Prefetchable memory behind bridge: 00000000-000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-


> > Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Transferred FW:
> > ath9k_htc/htc_7010-1.4.0.fw, size: 72812

$ ls -l /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
-rw-r--r-- 1 root root 72812 Dec 14 04:59 /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
$ sha1sum /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw
959cb6550930de2882e12b9a549c3cf0c9bf51ac /lib/firmware/ath9k_htc/htc_7010-1.4.0.fw



--
Tobias PGP: http://8ef7ddba.uguu.de



2017-06-07 22:39:17

by Tobias Diedrich

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

Oleksij Rempel wrote:
> On 07.06.2017 at 02:12, Tobias Diedrich wrote:
> > Oleksij Rempel wrote:
> >> Yes, this is the "normal" problem. The firmware has no error handler for
> >> PCI bus related exceptions. So if we fail to read the PCI bus the first
> >> time, we have the choice to Oops and stall, or Oops and reboot ASAP. So we
> >> reboot and provide a kernel "firmware panic!" message.
> >> Everyone who can or wants to fix this is welcome.
> >>
> >>> *****
> >>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
> >>> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
> > [...]
> >
> >> memdmp 50ae78 50ae88
> >
> > 50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@
> >
> > [...copy to bin...]
> > $ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
> > [..]
> > 0: 6c1004 entry a1, 32
> > 3: 126aa2 l32r a2, 0xfffdaa8c
> > 6: 0c0200 memw
> > 9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
> > b: c020 movi.n a2, 0
> > d: 081940 extui a9, a8, 1, 1
> >
> > Judging from that it should be fairly simple to at least implement
> > some sort of retry, possibly after triggering a PCIe link retrain?
>
> I assume, yes.
>
> > There are some related PCIe root complex registers that may point to
> > what exactly failed if they were dumped.
> >
> > The root complex registers live at 0x00040000 and I think match the
> > registers described for the root complex in the AR9344 datasheet.
>
> Unfortunately I don't have the AR7010 docs to tell.
>
> > PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
> > "A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
> > the hierarchy reports any of the following errors and the associated
> > enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
> > ERR_NONFATAL."
> >
> > AFAICS link retrain can be done by setting bit3 (INIT_RST,
> > "Application request to initiate a training reset") in
> > PCIE_APP (0x40000).
> >
> > See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
> > flips some bits in the RC to enable the PCIe bus for reading the
> > EEPROM).
> >
> > The root complex pci configuration space is at 0x20000 which could
> > have further error details:
> >> memdmp 20000 20200
> >
> > 020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
> > 020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
> > 020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
> > 020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
> > 020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
> > 020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
> > 020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
> > 0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
> > 020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
> > 020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
> > 020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
> > 020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > 0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >
> > Transformed into something suitable for feeding into lspci -F:
> >
> > 00:00.0 Description filled in by lspci
> > 00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> > 40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
> > 50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
> > 80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
> > 90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
> > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > $ lspci -F /tmp/hexdump -vvv
> > 00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
> > !!! Invalid class 0000 for header type 01
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0
> > Interrupt: pin A routed to IRQ 255
> > Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
> > I/O behind bridge: 00000000-00000fff
> > Memory behind bridge: 00000000-000fffff
> > Prefetchable memory behind bridge: 00000000-000fffff
> > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> > PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> > Address: 0000000000000000 Data: 0000
> > Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
> > ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> > RootCap: CRSVisible-
> > RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> > DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> > LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> > Compliance De-emphasis: -6dB
> > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> > EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >
>
> Looks promising :)
>

The PoC seems to work, though it may also need to restore the wifi
state; no guarantees there.

>str 40018 3
00040018 : 00000003
>
Retry(1) failed PCIe access @0x10ff4038
Before: int_mask=0 app=ffc1 reset=0
After: int_mask=0 app=ffc1 reset=7
wlan int status=0

>str 40018 3
00040018 : 00000003
>
Retry(1) failed PCIe access @0x10ff4038
Before: int_mask=0 app=ffc1 reset=0
After: int_mask=0 app=ffc1 reset=7
wlan int status=0
>


diff --git a/target_firmware/magpie_fw_dev/target/init/app_start.c b/target_firmware/magpie_fw_dev/target/init/app_start.c
index 8fa9c8b..fea62c1 100644
--- a/target_firmware/magpie_fw_dev/target/init/app_start.c
+++ b/target_firmware/magpie_fw_dev/target/init/app_start.c
@@ -137,6 +137,13 @@ void __section(boot) __noreturn __visible app_start(void)

A_PRINTF(" A_WDT_INIT()\n\r");

+#if defined(PROJECT_MAGPIE)
+ // For some reason needs to be called again here for the
+ // exception handlers to work properly, at least on the XBOX
+ // adapter.
+ fatal_exception_func();
+#endif
+
#if defined(PROJECT_K2)
save_cmnos_printf = fw_cmnos_printf;
#endif
diff --git a/target_firmware/magpie_fw_dev/target/init/init.c b/target_firmware/magpie_fw_dev/target/init/init.c
index 7484c05..cad2519 100755
--- a/target_firmware/magpie_fw_dev/target/init/init.c
+++ b/target_firmware/magpie_fw_dev/target/init/init.c
@@ -212,6 +212,78 @@ LOCAL void zfGenWrongEpidEvent(uint32_t epid)
mUSB_EP3_XFER_DONE();
}

+static void
+AR7010_pcie_reset(void)
+{
+#define PCIE_RC_ACCESS_DELAY 20
+
+#define PCI_RC_RESET_BIT BIT6
+#define PCI_RC_PHY_RESET_BIT BIT7
+#define PCI_RC_PLL_RESET_BIT BIT8
+#define PCI_RC_PHY_SHIFT_RESET_BIT BIT10
+
+#define HAL_WORD_REG_WRITE(addr, val) do { *((uint32_t*)(addr)) = val; } while (0)
+#define HAL_WORD_REG_READ(addr) (*((uint32_t*)(addr)))
+
+#define CMD_PCI_RC_RESET_ON() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
+ (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)| \
+ (PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT)))
+
+#define CMD_PCI_RC_RESET_CLR() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
+ (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)& \
+ (~(PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT))))
+
+ int i;
+
+ CMD_PCI_RC_RESET_ON();
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+
+ /* dereset the reset */
+ CMD_PCI_RC_RESET_CLR();
+ A_DELAY_USECS(500);
+
+ /* 7. set bus master and memory space enable */
+ DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x45;
+ HAL_WORD_REG_WRITE(0x00020004, (HAL_WORD_REG_READ(0x00020004)|(BIT1|BIT2)));
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+
+ /* 7.5. assert pcie_ep reset */
+ HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018) & ~(0x1 << 2)));
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+
+ /* 7.5. de-assert pcie_ep reset */
+ HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018)|(0x1 << 2)));
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+
+ /* 8. set app_ltssm_enable */
+ DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x46;
+ HAL_WORD_REG_WRITE(0x00040000, (HAL_WORD_REG_READ(0x00040000)|0xffc1));
+
+ /*!
+ * Receive control (PCIE_RESET),
+ * 0x40018, BIT0: LINK_UP, PHY Link up -PHY Link up/down indicator
+ * in case the link up is not ready and we access the 0x14000000,
+ * vmc will hang here
+ */
+
+ /* poll 0x40018/bit0 (up to 10000 times) until it turns to 1 */
+ i = 10000;
+ while(i-->0)
+ {
+ uint32_t reg_value = HAL_WORD_REG_READ(0x00040018);
+ if( reg_value & BIT0 )
+ break;
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+ }
+
+ HAL_WORD_REG_WRITE(0x14000004, (HAL_WORD_REG_READ(0x14000004)|0x116));
+ A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
+
+ HAL_WORD_REG_WRITE(0x14000010, (HAL_WORD_REG_READ(0x14000010)|EEPROM_CTRL_BASE));
+}
+
+static int exception_retries = 0;
+
void
AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
{
@@ -226,6 +298,32 @@ AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
dump.pc = exc_frame->xt_pc;
dump.assline = 0;

+ if (dump.badvaddr >= 0x10000000 &&
+ dump.badvaddr < 0x18000000) {
+ // Exception while accessing PCIe memory space.
+ volatile uint32_t *pcie_app = (uint32_t*) 0x40000;
+ volatile uint32_t *pcie_reset = (uint32_t*) 0x40018;
+ volatile uint32_t *pcie_int_mask = (uint32_t*) 0x40050;
+
+ // Maybe retry.
+ if (++exception_retries < 2) {
+ A_PRINTF("\nRetry(%d) failed PCIe access @0x%x\n",
+ exception_retries, dump.badvaddr);
+ A_PRINTF("Before: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
+
+ AR7010_pcie_reset();
+
+ A_PRINTF("After: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
+
+ // This should recurse if we failed to recover.
+ A_PRINTF("wlan int status=%x\n", HAL_WORD_REG_READ(0x10ff4038));
+
+ // Reset retry counter.
+ exception_retries = 0;
+ return;
+ }
+ }
+
zfGenExceptionEvent(dump.exc_frame.xt_exccause, dump.pc, dump.badvaddr);

#if SYSTEM_MODULE_PRINT
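
One gap in the reset sequence above: if LINK_UP never sets, the poll loop simply runs out and the code still goes on to touch 0x14000004, which the patch's own comment says can hang. A minimal host-side sketch of the same poll that reports a timeout instead, so a caller could skip those accesses. The callback types and the fake link are hypothetical scaffolding for illustration, not firmware API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCIE_RESET_LINK_UP (1u << 0) /* bit0 of 0x40018, per the patch comment */

/* Hypothetical hooks standing in for the firmware's register read and delay. */
typedef uint32_t (*reg_read_fn)(void *ctx);
typedef void (*delay_fn)(void *ctx);

/* Poll until LINK_UP is set or max_polls reads have been made.
 * Returns true if the link came up, false on timeout. */
static bool pcie_wait_link_up(reg_read_fn read_reset, delay_fn delay,
                              void *ctx, int max_polls)
{
    assert(read_reset && delay);

    for (int i = 0; i < max_polls; i++) {
        if (read_reset(ctx) & PCIE_RESET_LINK_UP)
            return true;
        delay(ctx);
    }
    return false;
}

/* Host-side fake: LINK_UP appears after a given number of reads. */
struct fake_link {
    int reads;
    int up_after;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_link *f = ctx;
    return (++f->reads > f->up_after) ? PCIE_RESET_LINK_UP : 0;
}

static void fake_delay(void *ctx)
{
    (void)ctx; /* no real delay on the host */
}
```

In the firmware the two hooks would wrap the 0x40018 read and A_DELAY_USECS(PCIE_RC_ACCESS_DELAY); on a false return the handler could fall through to the normal panic path.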


--
Tobias PGP: http://8ef7ddba.uguu.de



2017-06-08 08:21:47

by Oleksij Rempel

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

On 08.06.2017 at 00:39, Tobias Diedrich wrote:
> Oleksij Rempel wrote:
>> On 07.06.2017 at 02:12, Tobias Diedrich wrote:
>>> Oleksij Rempel wrote:
>>>> Yes, this is the "normal" problem. The firmware has no error handler for PCI
>>>> bus related exceptions. So if we fail to read the PCI bus the first time, we
>>>> have the choice to Oops and stall or Oops and reboot ASAP. So we reboot
>>>> and provide a kernel "firmware panic!" message.
>>>> Anyone who can or wants to fix this is welcome.
>>>>
>>>>> *****
>>>>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
>>>>> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
>>> [...]
>>>
>>>> memdmp 50ae78 50ae88
>>>
>>> 50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@
>>>
>>> [...copy to bin...]
>>> $ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
>>> [..]
>>> 0: 6c1004 entry a1, 32
>>> 3: 126aa2 l32r a2, 0xfffdaa8c
>>> 6: 0c0200 memw
>>> 9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
>>> b: c020 movi.n a2, 0
>>> d: 081940 extui a9, a8, 1, 1
>>>
>>> Judging from that it should be fairly simple to at least implement
>>> some sort of retry, possibly after triggering a PCIe link retrain?
>>
>> I assume, yes.
>>
>>> There are some related PCIe root complex registers that may point to
>>> what exactly failed if they were dumped.
>>>
>>> The root complex registers live at 0x00040000 and I think match the
>>> registers described for the root complex in the AR9344 datasheet.
>>
>> Unfortunately, I don't have the AR7010 docs to tell.
>>
>>> PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
>>> "A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
>>> the hierarchy reports any of the following errors and the associated
>>> enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
>>> ERR_NONFATAL."
>>>
>>> AFAICS link retrain can be done by setting bit3 (INIT_RST,
>>> "Application request to initiate a training reset") in
>>> PCIE_APP (0x40000).
>>>
>>> See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
>>> flips some bits in the RC to enable the PCIe bus for reading the
>>> EEPROM).
>>>
>>> The root complex pci configuration space is at 0x20000 which could
>>> have further error details:
>>>> memdmp 20000 20200
>>>
>>> 020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
>>> 020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
>>> 020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
>>> 020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
>>> 020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
>>> 020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
>>> 020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
>>> 0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
>>> 020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
>>> 020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
>>> 020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
>>> 020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>> 0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>>>
>>> Transformed into something suitable for feeding into lspci -F:
>>>
>>> 00:00.0 Description filled in by lspci
>>> 00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
>>> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
>>> 40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
>>> 50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> 70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
>>> 80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
>>> 90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
>>> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>
>>> $ lspci -F /tmp/hexdump -vvv
>>> 00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
>>> !!! Invalid class 0000 for header type 01
>>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>> Latency: 0
>>> Interrupt: pin A routed to IRQ 255
>>> Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
>>> I/O behind bridge: 00000000-00000fff
>>> Memory behind bridge: 00000000-000fffff
>>> Prefetchable memory behind bridge: 00000000-000fffff
>>> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>>> BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>>> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>> Capabilities: [40] Power Management version 3
>>> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
>>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>> Address: 0000000000000000 Data: 0000
>>> Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>>> DevCap: MaxPayload 256 bytes, PhantFunc 0
>>> ExtTag- RBE+
>>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>>> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
>>> ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>>> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
>>> RootCap: CRSVisible-
>>> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>>> DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
>>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
>>> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>>> Compliance De-emphasis: -6dB
>>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>
>>
>> Looks promising :)
>>
>
> The PoC seems to work, though it may also need to restore the wifi
> state; no guarantees there.

This will probably be the next topic. Can you address the comments in
this review and create a pull request in the GitHub repo?

>
>> str 40018 3
> 00040018 : 00000003
>>
> Retry(1) failed PCIe access @0x10ff4038
> Before: int_mask=0 app=ffc1 reset=0
> After: int_mask=0 app=ffc1 reset=7
> wlan int status=0
>
>> str 40018 3
> 00040018 : 00000003
>>
> Retry(1) failed PCIe access @0x10ff4038
> Before: int_mask=0 app=ffc1 reset=0
> After: int_mask=0 app=ffc1 reset=7
> wlan int status=0
>>
>
>
> diff --git a/target_firmware/magpie_fw_dev/target/init/app_start.c b/target_firmware/magpie_fw_dev/target/init/app_start.c
> index 8fa9c8b..fea62c1 100644
> --- a/target_firmware/magpie_fw_dev/target/init/app_start.c
> +++ b/target_firmware/magpie_fw_dev/target/init/app_start.c
> @@ -137,6 +137,13 @@ void __section(boot) __noreturn __visible app_start(void)
>
> A_PRINTF(" A_WDT_INIT()\n\r");
>
> +#if defined(PROJECT_MAGPIE)

Please use /* */ style comments.

> + // For some reason needs to be called again here for the
> + // exception handlers to work properly, at least on the XBOX
> + // adapter.
> + fatal_exception_func();
> +#endif
> +
> #if defined(PROJECT_K2)
> save_cmnos_printf = fw_cmnos_printf;
> #endif
> diff --git a/target_firmware/magpie_fw_dev/target/init/init.c b/target_firmware/magpie_fw_dev/target/init/init.c
> index 7484c05..cad2519 100755
> --- a/target_firmware/magpie_fw_dev/target/init/init.c
> +++ b/target_firmware/magpie_fw_dev/target/init/init.c
> @@ -212,6 +212,78 @@ LOCAL void zfGenWrongEpidEvent(uint32_t epid)
> mUSB_EP3_XFER_DONE();
> }
>
> +static void
> +AR7010_pcie_reset(void)
> +{
> +#define PCIE_RC_ACCESS_DELAY 20
> +
> +#define PCI_RC_RESET_BIT BIT6
> +#define PCI_RC_PHY_RESET_BIT BIT7
> +#define PCI_RC_PLL_RESET_BIT BIT8
> +#define PCI_RC_PHY_SHIFT_RESET_BIT BIT10
> +
> +#define HAL_WORD_REG_WRITE(addr, val) do { *((uint32_t*)(addr)) = val; } while (0)
> +#define HAL_WORD_REG_READ(addr) (*((uint32_t*)(addr)))

We already have the iowrite32* and ioread32* functions; why do we need more?
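
For illustration, the set/clear macros could shrink to two read-modify-write helpers layered on 32-bit accessors. In this sketch the accessors are host-side stand-ins over a small array (fake_ioread32/fake_iowrite32 are made-up names, and their argument order is an assumption); in the firmware they would be the existing ioread32*/iowrite32* functions:

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Reset bits as named in the patch. */
#define PCI_RC_RESET_BIT           BIT(6)
#define PCI_RC_PHY_RESET_BIT       BIT(7)
#define PCI_RC_PLL_RESET_BIT       BIT(8)
#define PCI_RC_PHY_SHIFT_RESET_BIT BIT(10)
#define PCI_RC_RESET_MASK (PCI_RC_PHY_SHIFT_RESET_BIT | PCI_RC_PLL_RESET_BIT | \
                           PCI_RC_PHY_RESET_BIT | PCI_RC_RESET_BIT)

/* Host-side stand-in for the register file; the index is a word offset. */
#define FAKE_REG_COUNT 4u
static uint32_t fake_regs[FAKE_REG_COUNT];

static uint32_t fake_ioread32(unsigned int idx)
{
    assert(idx < FAKE_REG_COUNT);
    return fake_regs[idx];
}

static void fake_iowrite32(unsigned int idx, uint32_t val)
{
    assert(idx < FAKE_REG_COUNT);
    fake_regs[idx] = val;
}

/* Read-modify-write: set the bits in mask, leave the others untouched. */
static void reg_set_bits(unsigned int idx, uint32_t mask)
{
    fake_iowrite32(idx, fake_ioread32(idx) | mask);
}

/* Read-modify-write: clear the bits in mask, leave the others untouched. */
static void reg_clear_bits(unsigned int idx, uint32_t mask)
{
    fake_iowrite32(idx, fake_ioread32(idx) & ~mask);
}
```

With helpers like these, CMD_PCI_RC_RESET_ON() and CMD_PCI_RC_RESET_CLR() would each become a single call with PCI_RC_RESET_MASK.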

> +#define CMD_PCI_RC_RESET_ON() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)| \
> + (PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT)))
> +
> +#define CMD_PCI_RC_RESET_CLR() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)& \
> + (~(PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT))))
> +
> + int i;
> +
> + CMD_PCI_RC_RESET_ON();
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* dereset the reset */
> + CMD_PCI_RC_RESET_CLR();
> + A_DELAY_USECS(500);
> +
> + /* 7. set bus master and memory space enable */
> + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x45;
> + HAL_WORD_REG_WRITE(0x00020004, (HAL_WORD_REG_READ(0x00020004)|(BIT1|BIT2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 7.5. assert pcie_ep reset */
> + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018) & ~(0x1 << 2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 7.5. de-assert pcie_ep reset */
> + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018)|(0x1 << 2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 8. set app_ltssm_enable */
> + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x46;
> + HAL_WORD_REG_WRITE(0x00040000, (HAL_WORD_REG_READ(0x00040000)|0xffc1));
> +
> + /*!
> + * Receive control (PCIE_RESET),
> + * 0x40018, BIT0: LINK_UP, PHY Link up -PHY Link up/down indicator
> + * in case the link up is not ready and we access the 0x14000000,
> + * vmc will hang here
> + */
> +
> + /* poll 0x40018/bit0 (up to 10000 times) until it turns to 1 */
> + i = 10000;
> + while(i-->0)
> + {
> + uint32_t reg_value = HAL_WORD_REG_READ(0x00040018);
> + if( reg_value & BIT0 )
> + break;
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> + }
> +
> + HAL_WORD_REG_WRITE(0x14000004, (HAL_WORD_REG_READ(0x14000004)|0x116));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + HAL_WORD_REG_WRITE(0x14000010, (HAL_WORD_REG_READ(0x14000010)|EEPROM_CTRL_BASE));
> +}
> +
> +static int exception_retries = 0;
> +
> void
> AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> {
> @@ -226,6 +298,32 @@ AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> dump.pc = exc_frame->xt_pc;
> dump.assline = 0;

I would prefer to put this into a separate function. Maybe move the
complete PCI code into a separate file?

> + if (dump.badvaddr >= 0x10000000 &&
> + dump.badvaddr < 0x18000000) {

if (!bla)
return;

> + // Exception while accessing PCIe memory space.
> + volatile uint32_t *pcie_app = (uint32_t*) 0x40000;
> + volatile uint32_t *pcie_reset = (uint32_t*) 0x40018;
> + volatile uint32_t *pcie_int_mask = (uint32_t*) 0x40050;

Magic values should be replaced with named constants.
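
As a rough sketch of what that could look like: the values below are the ones used in the patch, while the macro and function names are illustrative only and not taken from the firmware headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Root complex registers (names are hypothetical, values from the patch). */
#define PCIE_RC_APP_REG      0x00040000u
#define PCIE_RC_RESET_REG    0x00040018u
#define PCIE_RC_INT_MASK_REG 0x00040050u

/* PCIe memory window checked by the exception handler. */
#define PCIE_MEM_START 0x10000000u
#define PCIE_MEM_END   0x18000000u /* exclusive */

static_assert(PCIE_MEM_START < PCIE_MEM_END, "PCIe window must be non-empty");

/* True if the faulting address lies inside the PCIe memory window. */
static bool addr_in_pcie_window(uint32_t badvaddr)
{
    return badvaddr >= PCIE_MEM_START && badvaddr < PCIE_MEM_END;
}
```

The badvaddr from the reported panic (0x10ff4038) falls inside this window, while the reported pc (0x0090ae81) does not.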

> + // Maybe retry.
> + if (++exception_retries < 2) {

if (!bla)
return;
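
The "if (!bla) return;" hints ask for guard clauses instead of nested ifs. A minimal sketch of the retry logic in that shape, assuming the recovery is split out into its own function; pcie_try_recover, PCIE_MAX_RETRIES and the reset callback are hypothetical names for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCIE_MEM_START   0x10000000u
#define PCIE_MEM_END     0x18000000u /* exclusive */
#define PCIE_MAX_RETRIES 1           /* the patch allows a single retry */

static int exception_retries;

/* Returns true if the fault was handled by resetting the PCIe link,
 * false if the caller should continue to the normal panic path. */
static bool pcie_try_recover(uint32_t badvaddr, void (*do_reset)(void))
{
    assert(do_reset);

    if (badvaddr < PCIE_MEM_START || badvaddr >= PCIE_MEM_END)
        return false; /* not a PCIe access */

    if (++exception_retries > PCIE_MAX_RETRIES)
        return false; /* a fault during recovery ends up here */

    do_reset();

    exception_retries = 0;
    return true;
}

/* Host-side stand-in for AR7010_pcie_reset(); only counts calls. */
static int reset_calls;

static void fake_reset(void)
{
    reset_calls++;
}
```

The exception handler would then start with "if (pcie_try_recover(dump.badvaddr, AR7010_pcie_reset)) return;" and otherwise continue unchanged.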

> + A_PRINTF("\nRetry(%d) failed PCIe access @0x%x\n",
> + exception_retries, dump.badvaddr);
> + A_PRINTF("Before: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> +
> + AR7010_pcie_reset();
> +
> + A_PRINTF("After: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> +
> + // This should recurse if we failed to recover.
> + A_PRINTF("wlan int status=%x\n", HAL_WORD_REG_READ(0x10ff4038));
> +
> + // Reset retry counter.
> + exception_retries = 0;
> + return;
> + }
> + }
> +
> zfGenExceptionEvent(dump.exc_frame.xt_exccause, dump.pc, dump.badvaddr);
>
> #if SYSTEM_MODULE_PRINT

I'm excited to see it in mainline. Thank you for your work!

--
Regards,
Oleksij



2017-06-03 08:19:17

by Nathan Royce

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

On Sat, Jun 3, 2017 at 2:57 AM, Oleksij Rempel wrote:
> Hm... this function and file:
> linux/drivers/net/wireless/ath/ath9k/common-beacon.c
> haven't changed since 2015. So it should be something different.
> Can you run
> git bisect to find the exact patch that caused this regression?
>
That was the first time I experienced the x/0 issue, and I don't know how
I'd reproduce it.

> Yes, this is the "normal" problem. The firmware has no error handler for PCI
> bus related exceptions. So if we fail to read the PCI bus the first time, we
> have the choice to Oops and stall or Oops and reboot ASAP. So we reboot
> and provide a kernel "firmware panic!" message.
> Anyone who can or wants to fix this is welcome.
>
Thanks for that explanation. I'm not sure it's something I could
tackle though. My best bet in the meantime is to coax systemd to
restart the services when the device pops up. However, every attempt
has failed so far.

> It is possible. If the adapter is used in AP mode, then lots of WiFi noise
> is dumped over this interface. I assume the reproducibility depends on
> the external environment, not the internal one.
>
I find this quite believable. I have 2.4 GHz traffic from the
TP-Link, ZTE Mobley, Bluetooth, Logitech Unifying, and USB 3.0. All
4 devices go through the USB 2.0 port, though; the TP-Link and
Mobley have long USB cables in the attic, and the hub connects through
a 2 m USB extension. So yeah, I've got a lot of variables in play.

2017-06-03 07:57:39

by Oleksij Rempel

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

Hi,

On 03.06.2017 at 00:02, Nathan Royce wrote:
> ODroid XU4
>
> $ uname -a
> Linux computer 4.12.0-rc3-dirty #1 SMP Wed May 31 15:02:05 CDT 2017
> armv7l GNU/Linux
>
> $ lsusb
> ...
> Bus 001 Device 002: ID 2109:2813 VIA Labs, Inc.
> Bus 001 Device 010: ID 0cf3:7015 Qualcomm Atheros Communications
> TP-Link TL-WN821N v3 / TL-WN822N v2 802.11n [Atheros AR7010+AR9287]
> ...
>
> *****
> Jun 02 16:20:11 computer hostapd[14954]: vwlan0: interface state
> COUNTRY_UPDATE->HT_SCAN
> Jun 02 16:20:17 computer hostapd[14954]: 20/40 MHz operation not
> permitted on channel pri=7 sec=3 based on overlapping BSSes
> Jun 02 16:20:18 computer kernel: Division by zero in kernel.
> Jun 02 16:20:18 computer kernel: CPU: 1 PID: 14507 Comm: kworker/u16:2
> Tainted: G W 4.12.0-rc3-dirty #1
> Jun 02 16:20:18 computer kernel: Hardware name: SAMSUNG EXYNOS
> (Flattened Device Tree)
> Jun 02 16:20:18 computer kernel: Workqueue: phy5 ieee80211_scan_work [mac80211]
> Jun 02 16:20:18 computer kernel: [<c010ee0c>] (unwind_backtrace) from
> [<c010b61c>] (show_stack+0x10/0x14)
> Jun 02 16:20:18 computer kernel: [<c010b61c>] (show_stack) from
> [<c0377708>] (dump_stack+0x88/0x9c)
> Jun 02 16:20:18 computer kernel: [<c0377708>] (dump_stack) from
> [<c03755d0>] (Ldiv0_64+0x8/0x18)
> Jun 02 16:20:18 computer kernel: [<c03755d0>] (Ldiv0_64) from
> [<bf71c9a4>] (ath9k_get_next_tbtt+0x58/0x5c [ath9k_common])

Hm... this function and file:
linux/drivers/net/wireless/ath/ath9k/common-beacon.c
haven't changed since 2015. So it should be something different.
Can you run
git bisect to find the exact patch that caused this regression?
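
For context, the trace ends in Ldiv0_64 called from ath9k_get_next_tbtt(), which suggests a 64-bit division with a zero divisor there, plausibly a beacon interval of 0 at the moment scan completion reconfigured the beacon. That is a guess from the trace, not something confirmed here. A simplified model of such a computation with a guard; the function name and the guard are illustrative and not the kernel's code:

```c
#include <stdint.h>

/* Simplified model: time left until the next beacon boundary.
 * Returns 0 when the interval is not configured yet instead of dividing. */
static uint32_t next_tbtt_offset(uint64_t tsf_usec, uint32_t interval_usec)
{
    if (interval_usec == 0)
        return 0; /* hypothetical guard; without it the modulo divides by zero */

    return interval_usec - (uint32_t)(tsf_usec % interval_usec);
}
```

Whether returning 0 is the right fallback, or whether the caller should skip beacon configuration entirely in that state, would need checking against the driver.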

> Jun 02 16:20:18 computer kernel: [<bf71c9a4>] (ath9k_get_next_tbtt
> [ath9k_common]) from [<bf71cb90>] (ath9k_cmn_beacon_config
> Jun 02 16:20:18 computer kernel: [<bf71cb90>]
> (ath9k_cmn_beacon_config_ap [ath9k_common]) from [<bf7898c8>]
> (ath9k_htc_beacon
> Jun 02 16:20:18 computer kernel: [<bf7898c8>]
> (ath9k_htc_beacon_config_ap [ath9k_htc]) from [<bf7885a8>]
> (ath9k_htc_vif_recon
> Jun 02 16:20:18 computer kernel: [<bf7885a8>] (ath9k_htc_vif_reconfig
> [ath9k_htc]) from [<bf78860c>] (ath9k_htc_sw_scan_compl
> Jun 02 16:20:18 computer kernel: [<bf78860c>]
> (ath9k_htc_sw_scan_complete [ath9k_htc]) from [<bf506d38>]
> (__ieee80211_scan_co
> Jun 02 16:20:18 computer kernel: [<bf506d38>]
> (__ieee80211_scan_completed [mac80211]) from [<bf507968>]
> (ieee80211_scan_work+
> Jun 02 16:20:18 computer kernel: [<bf507968>] (ieee80211_scan_work
> [mac80211]) from [<c0133f10>] (process_one_work+0x1d8/0x40
> Jun 02 16:20:18 computer kernel: [<c0133f10>] (process_one_work) from
> [<c0134cb4>] (worker_thread+0x4c/0x564)
> Jun 02 16:20:18 computer kernel: [<c0134cb4>] (worker_thread) from
> [<c0139c20>] (kthread+0x14c/0x154)
> Jun 02 16:20:18 computer kernel: [<c0139c20>] (kthread) from
> [<c0107c38>] (ret_from_fork+0x14/0x3c)
> Jun 02 16:20:18 computer hostapd[14954]: Using interface wlan0 with
> hwaddr <sanitized> and ssid "<sanitized>"
> Jun 02 16:20:18 computer kernel: IPv6: ADDRCONF(NETDEV_CHANGE):
> vwlan0: link becomes ready
> *****
> This is a new one on me.
>
> The "normal" problem (search shows to be a very old issue) I
> consistently (daily or multiple times/day) encounter is:

Yes, this is the "normal" problem. The firmware has no error handler for PCI
bus related exceptions. So if we fail to read the PCI bus the first time, we
have the choice to Oops and stall or Oops and reboot ASAP. So we reboot
and provide a kernel "firmware panic!" message.
Anyone who can or wants to fix this is welcome.

> *****
> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
> Jun 02 14:55:30 computer kernel: usb 1-1.1: USB disconnect, device number 9
> Jun 02 14:55:30 computer systemd-networkd[11959]: vwlan0: Lost carrier
> Jun 02 14:55:30 computer kernel: br0: port 2(vwlan0) entered disabled state
> Jun 02 14:55:30 computer kernel: wlan0: deauthenticating from
> <sanitized> by local choice (Reason: 3=DEAUTH_LEAVING)
> Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
> Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
> Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
> Jun 02 14:55:30 computer kernel: ath: phy4: Failed to wakeup in 500us
> Jun 02 14:55:30 computer systemd-networkd[11959]: wlan0: Lost carrier
> Jun 02 14:55:30 computer systemd[1]: Stopping A simple WPA encrypted
> wireless connection using a static IP...
> -- Subject: Unit [email protected] has begun shutting down
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit [email protected] has begun shutting down.
> Jun 02 14:55:30 computer kernel: device vwlan0 left promiscuous mode
> Jun 02 14:55:30 computer kernel: br0: port 2(vwlan0) entered disabled state
> Jun 02 14:55:30 computer audit: ANOM_PROMISCUOUS dev=vwlan0 prom=0
> old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
> Jun 02 14:55:30 computer hostapd[13218]: vwlan0: AP-STA-DISCONNECTED <sanitized>
> Jun 02 14:55:30 computer hostapd[13218]: Failed to set beacon parameters
> Jun 02 14:55:30 computer hostapd[13218]: vwlan0: INTERFACE-DISABLED
> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: USB layer deinitialized
> Jun 02 14:55:30 computer systemd[1]: Starting Load/Save RF Kill Switch Status...
> -- Subject: Unit systemd-rfkill.service has begun start-up
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit systemd-rfkill.service has begun starting up.
> Jun 02 14:55:30 computer systemd[1]: Started Load/Save RF Kill Switch Status.
> -- Subject: Unit systemd-rfkill.service has finished start-up
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit systemd-rfkill.service has finished starting up.
> --
> -- The start-up result is done.
> Jun 02 14:55:30 computer network[13261]: Stopping network profile 'wlan0'...
> Jun 02 14:55:30 computer kernel: usb 1-1.1: new high-speed USB device
> number 10 using exynos-ehci
> Jun 02 14:55:30 computer kernel: usb 1-1.1: New USB device found,
> idVendor=0cf3, idProduct=7015
> Jun 02 14:55:30 computer kernel: usb 1-1.1: New USB device strings:
> Mfr=16, Product=32, SerialNumber=48
> Jun 02 14:55:30 computer kernel: usb 1-1.1: Product: USB WLAN
> Jun 02 14:55:30 computer kernel: usb 1-1.1: Manufacturer: ATHEROS
> Jun 02 14:55:30 computer kernel: usb 1-1.1: SerialNumber: 12345
> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Firmware
> ath9k_htc/htc_7010-1.4.0.fw requested
> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath9k_htc: Transferred FW:
> ath9k_htc/htc_7010-1.4.0.fw, size: 72812
> Jun 02 14:55:30 computer kernel: ath9k_htc 1-1.1:1.0: ath9k_htc: HTC
> initialized with 45 credits
> Jun 02 14:55:31 computer kernel: ath9k_htc 1-1.1:1.0: ath9k_htc: FW Version: 1.4
> Jun 02 14:55:31 computer kernel: ath9k_htc 1-1.1:1.0: FW RMW support: On
> Jun 02 14:55:31 computer kernel: ath: EEPROM regdomain: 0x809c
> Jun 02 14:55:31 computer kernel: ath: EEPROM indicates we should
> expect a country code
> Jun 02 14:55:31 computer kernel: ath: doing EEPROM country->regdmn map search
> Jun 02 14:55:31 computer kernel: ath: country maps to regdmn code: 0x52
> Jun 02 14:55:31 computer kernel: ath: Country alpha2 being used: CN
> Jun 02 14:55:31 computer kernel: ath: Regpair used: 0x52
> Jun 02 14:55:31 computer kernel: ieee80211 phy5: Atheros AR9287 Rev:2
> Jun 02 14:55:31 computer kernel: IPv6: ADDRCONF(NETDEV_UP): vwlan0:
> link is not ready
> Jun 02 14:55:31 computer hostapd[13218]: vwlan0: INTERFACE-ENABLED
> Jun 02 14:55:31 computer network[13261]: Stopped network profile 'wlan0'
> *****
> I don't know the particular reason for this one.
> At first it would happen every time I compiled anything (all cpu
> used). Then I added the ZTE Mobley to the USB hub. Even after removing
> the Mobley, the panic would still happen often.
> I then recompiled the kernel so only the 4 LITTLE cpus were used
> (big.LITTLE support+switcher), but the panic still happens sometimes.
> Now the consistency seems to come from the wireless adapter used as
> both AP and managed client.

It is possible. If the adapter is used in AP mode, then lots of WiFi noise
is dumped over this interface. I assume the reproducibility depends on
the external environment, not the internal one.

--
Regards,
Oleksij



2017-06-15 22:38:03

by Tobias Diedrich

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

Yeah, this is mostly copy-pasted from the sboot code and
would need some cleaning up.
I've been playing a little more with other bits of the hardware,
writing some test firmware from scratch, mostly without using the built-in
ROM (except for interrupts).

Oleksij Rempel wrote:
> On 08.06.2017 at 00:39, Tobias Diedrich wrote:
> > Oleksij Rempel wrote:
> >> On 07.06.2017 at 02:12, Tobias Diedrich wrote:
> >>> Oleksij Rempel wrote:
> >>>> Yes, this is the "normal" problem. The firmware has no error handler for PCI
> >>>> bus related exceptions. So if we fail to read the PCI bus the first time, we
> >>>> have the choice to Oops and stall or Oops and reboot ASAP. So we reboot
> >>>> and provide a kernel "firmware panic!" message.
> >>>> Anyone who can or wants to fix this is welcome.
> >>>>
> >>>>> *****
> >>>>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
> >>>>> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
> >>> [...]
> >>>
> >>>> memdmp 50ae78 50ae88
> >>>
> >>> 50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@
> >>>
> >>> [...copy to bin...]
> >>> $ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
> >>> [..]
> >>> 0: 6c1004 entry a1, 32
> >>> 3: 126aa2 l32r a2, 0xfffdaa8c
> >>> 6: 0c0200 memw
> >>> 9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
> >>> b: c020 movi.n a2, 0
> >>> d: 081940 extui a9, a8, 1, 1
> >>>
> >>> Judging from that it should be fairly simple to at least implement
> >>> some sort of retry, possibly after triggering a PCIe link retrain?
> >>
> >> I assume, yes.
> >>
> >>> There are some related PCIe root complex registers that may point to
> >>> what exactly failed if they were dumped.
> >>>
> >>> The root complex registers live at 0x00040000 and I think match the
> >>> registers described for the root complex in the AR9344 datasheet.
> >>
> >> Unfortunately, I don't have the AR7010 docs to tell.
> >>
> >>> PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
> >>> "A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
> >>> the hierarchy reports any of the following errors and the associated
> >>> enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
> >>> ERR_NONFATAL."
> >>>
> >>> AFAICS link retrain can be done by setting bit3 (INIT_RST,
> >>> "Application request to initiate a training reset") in
> >>> PCIE_APP (0x40000).
> >>>
> >>> See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
> >>> flips some bits in the RC to enable the PCIe bus for reading the
> >>> EEPROM).
> >>>
> >>> The root complex pci configuration space is at 0x20000 which could
> >>> have further error details:
> >>>> memdmp 20000 20200
> >>>
> >>> 020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
> >>> 020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
> >>> 020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
> >>> 020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
> >>> 020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
> >>> 020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
> >>> 020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
> >>> 0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
> >>> 020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
> >>> 020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
> >>> 020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
> >>> 020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>> 0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> >>>
> >>> Transformed into something suitable for feeding into lspci -F:
> >>>
> >>> 00:00.0 Description filled in by lspci
> >>> 00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
> >>> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> >>> 40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
> >>> 80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
> >>> 90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
> >>> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
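
The transformation from the memdmp output to the lspci -F input can be sketched as follows, assuming (as the two dumps above suggest) that memdmp prints each 32-bit word most significant byte first while lspci expects the bytes in little-endian config space order. The function name is hypothetical, and parsing the memdmp text into bytes is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h> /* strcmp, handy when checking the output */

/* Format one 16-byte row for "lspci -F".
 * dump[] holds the bytes in the order memdmp printed them; every group of
 * four is emitted in reverse to undo the big-endian word display.
 * Returns the number of characters written, or -1 if out is too small. */
static int format_lspci_row(unsigned offset, const uint8_t dump[16],
                            char *out, size_t outsz)
{
    int n;
    size_t used;

    assert(dump != NULL && out != NULL);

    n = snprintf(out, outsz, "%02x:", offset & 0xffu);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    used = (size_t)n;

    for (unsigned word = 0; word < 16; word += 4) {
        for (int b = 3; b >= 0; b--) {
            n = snprintf(out + used, outsz - used, " %02x", dump[word + b]);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;
            used += (size_t)n;
        }
    }
    return (int)used;
}
```

Fed with the first row of the dump (a0 2a 16 8c 00 10 00 06 ...), this produces the "00: 8c 16 2a a0 06 00 10 00 ..." line shown above.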
> >>>
> >>> $ lspci -F /tmp/hexdump -vvv
> >>> 00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
> >>> !!! Invalid class 0000 for header type 01
> >>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>> Latency: 0
> >>> Interrupt: pin A routed to IRQ 255
> >>> Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
> >>> I/O behind bridge: 00000000-00000fff
> >>> Memory behind bridge: 00000000-000fffff
> >>> Prefetchable memory behind bridge: 00000000-000fffff
> >>> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> >>> BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> >>> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> >>> Capabilities: [40] Power Management version 3
> >>> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> >>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> >>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>> Address: 0000000000000000 Data: 0000
> >>> Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
> >>> DevCap: MaxPayload 256 bytes, PhantFunc 0
> >>> ExtTag- RBE+
> >>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >>> MaxPayload 128 bytes, MaxReadReq 512 bytes
> >>> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> >>> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
> >>> ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
> >>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> >>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> >>> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> >>> RootCap: CRSVisible-
> >>> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> >>> DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> >>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> >>> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >>> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> >>> Compliance De-emphasis: -6dB
> >>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> >>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >>>
> >>
> >> Looks promising :)
> >>
> >
> > POC seems to work, though this may additionally need to restore wifi
> > state as well, no guarantees there.
>
> This will probably be the next topic. Can you address the comments in the
> review and create a pull request in the GitHub repo?
>
> >
> >> str 40018 3
> > 00040018 : 00000003
> >>
> > Retry(1) failed PCIe access @0x10ff4038
> > Before: int_mask=0 app=ffc1 reset=0
> > After: int_mask=0 app=ffc1 reset=7
> > wlan int status=0
> >
> >> str 40018 3
> > 00040018 : 00000003
> >>
> > Retry(1) failed PCIe access @0x10ff4038
> > Before: int_mask=0 app=ffc1 reset=0
> > After: int_mask=0 app=ffc1 reset=7
> > wlan int status=0
> >>
> >
> >
> > diff --git a/target_firmware/magpie_fw_dev/target/init/app_start.c b/target_firmware/magpie_fw_dev/target/init/app_start.c
> > index 8fa9c8b..fea62c1 100644
> > --- a/target_firmware/magpie_fw_dev/target/init/app_start.c
> > +++ b/target_firmware/magpie_fw_dev/target/init/app_start.c
> > @@ -137,6 +137,13 @@ void __section(boot) __noreturn __visible app_start(void)
> >
> > A_PRINTF(" A_WDT_INIT()\n\r");
> >
> > +#if defined(PROJECT_MAGPIE)
>
> please, use /**/ style comments.
>
> > + // For some reason needs to be called again here for the
> > + // exception handlers to work properly, at least on the XBOX
> > + // adapter.
> > + fatal_exception_func();
> > +#endif
> > +
> > #if defined(PROJECT_K2)
> > save_cmnos_printf = fw_cmnos_printf;
> > #endif
> > diff --git a/target_firmware/magpie_fw_dev/target/init/init.c b/target_firmware/magpie_fw_dev/target/init/init.c
> > index 7484c05..cad2519 100755
> > --- a/target_firmware/magpie_fw_dev/target/init/init.c
> > +++ b/target_firmware/magpie_fw_dev/target/init/init.c
> > @@ -212,6 +212,78 @@ LOCAL void zfGenWrongEpidEvent(uint32_t epid)
> > mUSB_EP3_XFER_DONE();
> > }
> >
> > +static void
> > +AR7010_pcie_reset(void)
> > +{
> > +#define PCIE_RC_ACCESS_DELAY 20
> > +
> > +#define PCI_RC_RESET_BIT BIT6
> > +#define PCI_RC_PHY_RESET_BIT BIT7
> > +#define PCI_RC_PLL_RESET_BIT BIT8
> > +#define PCI_RC_PHY_SHIFT_RESET_BIT BIT10
> > +
> > +#define HAL_WORD_REG_WRITE(addr, val) do { *((volatile uint32_t*)(addr)) = (val); } while (0)
> > +#define HAL_WORD_REG_READ(addr) (*((volatile uint32_t*)(addr)))
>
> we already have iowrite32* ioread32* functions, why do we need more?
>
> > +#define CMD_PCI_RC_RESET_ON() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> > + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)| \
> > + (PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT)))
> > +
> > +#define CMD_PCI_RC_RESET_CLR() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> > + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)& \
> > + (~(PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT))))
> > +
> > + int i;
> > +
> > + CMD_PCI_RC_RESET_ON();
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > +
> > + /* release the reset */
> > + CMD_PCI_RC_RESET_CLR();
> > + A_DELAY_USECS(500);
> > +
> > + /* 7. set bus master and memory space enable */
> > + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x45;
> > + HAL_WORD_REG_WRITE(0x00020004, (HAL_WORD_REG_READ(0x00020004)|(BIT1|BIT2)));
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > +
> > + /* 7.5. assert pcie_ep reset */
> > + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018) & ~(0x1 << 2)));
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > +
> > + /* 7.5. de-assert pcie_ep reset */
> > + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018)|(0x1 << 2)));
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > +
> > + /* 8. set app_ltssm_enable */
> > + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x46;
> > + HAL_WORD_REG_WRITE(0x00040000, (HAL_WORD_REG_READ(0x00040000)|0xffc1));
> > +
> > + /*!
> > + * Receive control (PCIE_RESET),
> > + * 0x40018, BIT0: LINK_UP, PHY Link up -PHY Link up/down indicator
> > + * in case the link up is not ready and we access the 0x14000000,
> > + * vmc will hang here
> > + */
> > +
> > + /* poll 0x40018/bit0 (up to 10000 times) until it turns to 1 */
> > + i = 10000;
> > + while(i-->0)
> > + {
> > + uint32_t reg_value = HAL_WORD_REG_READ(0x00040018);
> > + if( reg_value & BIT0 )
> > + break;
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > + }
> > +
> > + HAL_WORD_REG_WRITE(0x14000004, (HAL_WORD_REG_READ(0x14000004)|0x116));
> > + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> > +
> > + HAL_WORD_REG_WRITE(0x14000010, (HAL_WORD_REG_READ(0x14000010)|EEPROM_CTRL_BASE));
> > +}
> > +
> > +static int exception_retries = 0;
> > +
> > void
> > AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> > {
> > @@ -226,6 +298,32 @@ AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> > dump.pc = exc_frame->xt_pc;
> > dump.assline = 0;
>
> I would prefer to put it into a separate function. Maybe put the complete PCI
> code in a separate file?
>
> > + if (dump.badvaddr >= 0x10000000 &&
> > + dump.badvaddr < 0x18000000) {
>
> if (!bla)
> return;
>
> > + // Exception while accessing PCIe memory space.
> > + volatile uint32_t *pcie_app = (uint32_t*) 0x40000;
> > + volatile uint32_t *pcie_reset = (uint32_t*) 0x40018;
> > + volatile uint32_t *pcie_int_mask = (uint32_t*) 0x40050;
>
> magic values should be replaced.
>
> > + // Maybe retry.
> > + if (++exception_retries < 2) {
>
> if (!bla)
> return;
>
> > + A_PRINTF("\nRetry(%d) failed PCIe access @0x%x\n",
> > + exception_retries, dump.badvaddr);
> > + A_PRINTF("Before: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> > +
> > + AR7010_pcie_reset();
> > +
> > + A_PRINTF("After: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> > +
> > + // This should recurse if we failed to recover.
> > + A_PRINTF("wlan int status=%x\n", HAL_WORD_REG_READ(0x10ff4038));
> > +
> > + // Reset retry counter.
> > + exception_retries = 0;
> > + return;
> > + }
> > + }
> > +
> > zfGenExceptionEvent(dump.exc_frame.xt_exccause, dump.pc, dump.badvaddr);
> >
> > #if SYSTEM_MODULE_PRINT
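
One possible shape for the review suggestions above (early returns, a separate function, named constants instead of magic values) is sketched below. It covers only the retry decision, not the PCIe reset itself. All names are hypothetical, and the address window is the one used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PCIe memory window checked in the patch; constant names are illustrative. */
#define PCIE_MEM_BASE    0x10000000u
#define PCIE_MEM_END     0x18000000u
#define PCIE_MAX_RETRIES 1

static int exception_retries;

static bool is_pcie_address(uint32_t vaddr)
{
    return vaddr >= PCIE_MEM_BASE && vaddr < PCIE_MEM_END;
}

/* Returns true if the handler should reset the link and retry the access,
 * false if it should fall through and report the firmware panic. */
static bool pcie_should_retry(uint32_t badvaddr)
{
    assert(exception_retries <= PCIE_MAX_RETRIES);

    if (!is_pcie_address(badvaddr))
        return false;
    if (exception_retries >= PCIE_MAX_RETRIES)
        return false; /* already retrying: a nested fault means recovery failed */

    exception_retries++;
    return true;
}

/* Call after a retried access succeeded, so a later fault gets a fresh attempt. */
static void pcie_retry_done(void)
{
    exception_retries = 0;
}
```

The exception handler would then call pcie_should_retry(dump.badvaddr) first, run the reset and the test read only when it returns true, and call pcie_retry_done() before returning.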
>
> I'm excited to see it in mainline. Thank you for your work!
>
> --
> Regards,
> Oleksij
>




> _______________________________________________
> ath9k_htc_fw mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath9k_htc_fw


--
Tobias PGP: http://8ef7ddba.uguu.de



2018-04-24 17:28:11

by Nathan Royce

Subject: Re: ath9k_htc - Division by zero in kernel (as well as firmware panic)

I finally got around to applying your patch and building the toolchain
(based on master source (gcc8)). Alas, while there is no firmware
panic in the log any more, wifi drops off the face of the planet: the
SSID disappears and hostapd doesn't know wifi failed (nothing in the
log either).

On Wed, Jun 7, 2017 at 5:39 PM, Tobias Diedrich
<[email protected]> wrote:
> Oleksij Rempel wrote:
>> Am 07.06.2017 um 02:12 schrieb Tobias Diedrich:
>> > Oleksij Rempel wrote:
>> >> Yes, this is a "normal" problem. The firmware has no error handler for PCI
>> >> bus related exceptions. So if we fail to read the PCI bus the first time, we
>> >> have the choice to Oops and stall or Oops and reboot ASAP. So we reboot
>> >> and provide a kernel "firmware panic!" message.
>> >> Anyone who can or wants to fix this is welcome.
>> >>
>> >>> *****
>> >>> Jun 02 14:55:30 computer kernel: usb 1-1.1: ath: firmware panic!
>> >>> exccause: 0x0000000d; pc: 0x0090ae81; badvaddr: 0x10ff4038.
>> > [...]
>> >
>> >> memdmp 50ae78 50ae88
>> >
>> > 50ae78: 6c10 0412 6aa2 0c02 0088 20c0 2008 1940 l...j..........@
>> >
>> > [...copy to bin...]
>> > $ bin/objdump -b binary -m xtensa -D /tmp/memdump.bin
>> > [..]
>> > 0: 6c1004 entry a1, 32
>> > 3: 126aa2 l32r a2, 0xfffdaa8c
>> > 6: 0c0200 memw
>> > 9: 8820 l32i.n a8, a2, 0 <----------Exception cause PC still points at load
>> > b: c020 movi.n a2, 0
>> > d: 081940 extui a9, a8, 1, 1
>> >
>> > Judging from that it should be fairly simple to at least implement
>> > some sort of retry, possibly after triggering a PCIe link retrain?
>>
>> I assume, yes.
>>
>> > There are some related PCIe root complex registers that may point to
>> > what exactly failed if they were dumped.
>> >
>> > The root complex registers live at 0x00040000 and I think match the
>> > registers described for the root complex in the AR9344 datasheet.
>>
>> Unfortunately I don't have the AR7010 docs to tell.
>>
>> > PCIE_INT_MASK would map to 0x40050 and has a bit for SYS_ERR:
>> > "A system error. The RC Core asserts CFG_SYS_ERR_RC if any device in
>> > the hierarchy reports any of the following errors and the associated
>> > enable bit is set in the Root Control register: ERR_COR, ERR_FATAL,
>> > ERR_NONFATAL."
>> >
>> > AFAICS link retrain can be done by setting bit3 (INIT_RST,
>> > "Application request to initiate a training reset") in
>> > PCIE_APP (0x40000).
>> >
>> > See sboot/magpie_1_1/sboot/cmnos/eeprom/src/cmnos_eeprom.c (which
>> > flips some bits in the RC to enable the PCIe bus for reading the
>> > EEPROM).
>> >
>> > The root complex pci configuration space is at 0x20000 which could
>> > have further error details:
>> >> memdmp 20000 20200
>> >
>> > 020000: a02a 168c 0010 0006 0000 0001 0001 0000 .*..............
>> > 020010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020030: 0000 0000 0000 0040 0000 0000 0000 01ff .......@........
>> > 020040: 5bc3 5001 0000 0000 0000 0000 0000 0000 [.P.............
>> > 020050: 0080 7005 0000 0000 0000 0000 0000 0000 ..p.............
>> > 020060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020070: 0042 0010 0000 8701 0000 2010 0013 4411 .B............D.
>> > 020080: 3011 0000 0000 0000 00c0 03c0 0000 0000 0...............
>> > 020090: 0000 0000 0000 0010 0000 0000 0000 0000 ................
>> > 0200a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0200b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0200c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0200d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0200e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0200f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020100: 1401 0001 0000 0000 0000 0000 0006 2030 ...............0
>> > 020110: 0000 0000 0000 2000 0000 00a0 0000 0000 ................
>> > 020120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020140: 0001 0002 0000 0000 0000 0000 0000 0000 ................
>> > 020150: 0000 0000 8000 00ff 0000 0000 0000 0000 ................
>> > 020160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 020190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> > 0201f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>> >
>> > Transformed into something suitable for feeding into lspci -F:
>> >
>> > 00:00.0 Description filled in by lspci
>> > 00: 8c 16 2a a0 06 00 10 00 01 00 00 00 00 00 01 00
>> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
>> > 40: 01 50 c3 5b 00 00 00 00 00 00 00 00 00 00 00 00
>> > 50: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > 70: 10 00 42 00 01 87 00 00 10 20 00 00 11 44 13 00
>> > 80: 00 00 11 30 00 00 00 00 c0 03 c0 00 00 00 00 00
>> > 90: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
>> > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >
>> > $ lspci -F /tmp/hexdump -vvv
>> > 00:00.0 Non-VGA unclassified device: Qualcomm Atheros Device a02a (rev 01)
>> > !!! Invalid class 0000 for header type 01
>> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> > Latency: 0
>> > Interrupt: pin A routed to IRQ 255
>> > Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
>> > I/O behind bridge: 00000000-00000fff
>> > Memory behind bridge: 00000000-000fffff
>> > Prefetchable memory behind bridge: 00000000-000fffff
>> > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>> > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>> > PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>> > Capabilities: [40] Power Management version 3
>> > Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
>> > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>> > Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>> > Address: 0000000000000000 Data: 0000
>> > Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>> > DevCap: MaxPayload 256 bytes, PhantFunc 0
>> > ExtTag- RBE+
>> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
>> > MaxPayload 128 bytes, MaxReadReq 512 bytes
>> > DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>> > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <1us, L1 <64us
>> > ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp-
>> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>> > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
>> > RootCap: CRSVisible-
>> > RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>> > DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
>> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
>> > LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>> > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>> > Compliance De-emphasis: -6dB
>> > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>> > EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>> >
>>
>> Looks promising :)
>>
>
> POC seems to work, though this may additionally need to restore wifi
> state as well, no guarantees there.
>
>>str 40018 3
> 00040018 : 00000003
>>
> Retry(1) failed PCIe access @0x10ff4038
> Before: int_mask=0 app=ffc1 reset=0
> After: int_mask=0 app=ffc1 reset=7
> wlan int status=0
>
>>str 40018 3
> 00040018 : 00000003
>>
> Retry(1) failed PCIe access @0x10ff4038
> Before: int_mask=0 app=ffc1 reset=0
> After: int_mask=0 app=ffc1 reset=7
> wlan int status=0
>>
>
>
> diff --git a/target_firmware/magpie_fw_dev/target/init/app_start.c b/target_firmware/magpie_fw_dev/target/init/app_start.c
> index 8fa9c8b..fea62c1 100644
> --- a/target_firmware/magpie_fw_dev/target/init/app_start.c
> +++ b/target_firmware/magpie_fw_dev/target/init/app_start.c
> @@ -137,6 +137,13 @@ void __section(boot) __noreturn __visible app_start(void)
>
> A_PRINTF(" A_WDT_INIT()\n\r");
>
> +#if defined(PROJECT_MAGPIE)
> + // For some reason needs to be called again here for the
> + // exception handlers to work properly, at least on the XBOX
> + // adapter.
> + fatal_exception_func();
> +#endif
> +
> #if defined(PROJECT_K2)
> save_cmnos_printf = fw_cmnos_printf;
> #endif
> diff --git a/target_firmware/magpie_fw_dev/target/init/init.c b/target_firmware/magpie_fw_dev/target/init/init.c
> index 7484c05..cad2519 100755
> --- a/target_firmware/magpie_fw_dev/target/init/init.c
> +++ b/target_firmware/magpie_fw_dev/target/init/init.c
> @@ -212,6 +212,78 @@ LOCAL void zfGenWrongEpidEvent(uint32_t epid)
> mUSB_EP3_XFER_DONE();
> }
>
> +static void
> +AR7010_pcie_reset(void)
> +{
> +#define PCIE_RC_ACCESS_DELAY 20
> +
> +#define PCI_RC_RESET_BIT BIT6
> +#define PCI_RC_PHY_RESET_BIT BIT7
> +#define PCI_RC_PLL_RESET_BIT BIT8
> +#define PCI_RC_PHY_SHIFT_RESET_BIT BIT10
> +
> +#define HAL_WORD_REG_WRITE(addr, val) do { *((volatile uint32_t*)(addr)) = (val); } while (0)
> +#define HAL_WORD_REG_READ(addr) (*((volatile uint32_t*)(addr)))
> +
> +#define CMD_PCI_RC_RESET_ON() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)| \
> + (PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT)))
> +
> +#define CMD_PCI_RC_RESET_CLR() HAL_WORD_REG_WRITE(MAGPIE_REG_RST_RESET_ADDR, \
> + (HAL_WORD_REG_READ(MAGPIE_REG_RST_RESET_ADDR)& \
> + (~(PCI_RC_PHY_SHIFT_RESET_BIT|PCI_RC_PLL_RESET_BIT|PCI_RC_PHY_RESET_BIT|PCI_RC_RESET_BIT))))
> +
> + int i;
> +
> + CMD_PCI_RC_RESET_ON();
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* release the reset */
> + CMD_PCI_RC_RESET_CLR();
> + A_DELAY_USECS(500);
> +
> + /* 7. set bus master and memory space enable */
> + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x45;
> + HAL_WORD_REG_WRITE(0x00020004, (HAL_WORD_REG_READ(0x00020004)|(BIT1|BIT2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 7.5. assert pcie_ep reset */
> + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018) & ~(0x1 << 2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 7.5. de-assert pcie_ep reset */
> + HAL_WORD_REG_WRITE(0x00040018, (HAL_WORD_REG_READ(0x00040018)|(0x1 << 2)));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + /* 8. set app_ltssm_enable */
> + DEBUG_SYSTEM_STATE = (DEBUG_SYSTEM_STATE&(~0xff)) | 0x46;
> + HAL_WORD_REG_WRITE(0x00040000, (HAL_WORD_REG_READ(0x00040000)|0xffc1));
> +
> + /*!
> + * Receive control (PCIE_RESET),
> + * 0x40018, BIT0: LINK_UP, PHY Link up -PHY Link up/down indicator
> + * in case the link up is not ready and we access the 0x14000000,
> + * vmc will hang here
> + */
> +
> + /* poll 0x40018/bit0 (up to 10000 times) until it turns to 1 */
> + i = 10000;
> + while(i-->0)
> + {
> + uint32_t reg_value = HAL_WORD_REG_READ(0x00040018);
> + if( reg_value & BIT0 )
> + break;
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> + }
> +
> + HAL_WORD_REG_WRITE(0x14000004, (HAL_WORD_REG_READ(0x14000004)|0x116));
> + A_DELAY_USECS(PCIE_RC_ACCESS_DELAY);
> +
> + HAL_WORD_REG_WRITE(0x14000010, (HAL_WORD_REG_READ(0x14000010)|EEPROM_CTRL_BASE));
> +}
> +
> +static int exception_retries = 0;
> +
> void
> AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> {
> @@ -226,6 +298,32 @@ AR6002_fatal_exception_handler_patch(CPU_exception_frame_t *exc_frame)
> dump.pc = exc_frame->xt_pc;
> dump.assline = 0;
>
> + if (dump.badvaddr >= 0x10000000 &&
> + dump.badvaddr < 0x18000000) {
> + // Exception while accessing PCIe memory space.
> + volatile uint32_t *pcie_app = (uint32_t*) 0x40000;
> + volatile uint32_t *pcie_reset = (uint32_t*) 0x40018;
> + volatile uint32_t *pcie_int_mask = (uint32_t*) 0x40050;
> +
> + // Maybe retry.
> + if (++exception_retries < 2) {
> + A_PRINTF("\nRetry(%d) failed PCIe access @0x%x\n",
> + exception_retries, dump.badvaddr);
> + A_PRINTF("Before: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> +
> + AR7010_pcie_reset();
> +
> + A_PRINTF("After: int_mask=%x app=%x reset=%x\n", *pcie_int_mask, *pcie_app, *pcie_reset);
> +
> + // This should recurse if we failed to recover.
> + A_PRINTF("wlan int status=%x\n", HAL_WORD_REG_READ(0x10ff4038));
> +
> + // Reset retry counter.
> + exception_retries = 0;
> + return;
> + }
> + }
> +
> zfGenExceptionEvent(dump.exc_frame.xt_exccause, dump.pc, dump.badvaddr);
>
> #if SYSTEM_MODULE_PRINT
>
>
> --
> Tobias PGP: http://8ef7ddba.uguu.de