2021-02-09 03:10:59

by Wen Gong

[permalink] [raw]
Subject: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

Kernel panic every time in kernel when running below case:
steps:
1. connect to an AP with good signal strength
2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
5. sudo trace-cmd record -e ath10k
6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
7. kernel panic immeditely

It always crash at trace_event_raw_event_ath10k_xxx, below is 2 sample:

dmesg 1:
[ 58.038310] Unable to handle kernel paging request at virtual address 003130393837363d
[ 58.038325] Mem abort info:
[ 58.038330] ESR = 0x96000004
[ 58.038336] Exception class = DABT (current EL), IL = 32 bits
[ 58.038340] SET = 0, FnV = 0
[ 58.038345] EA = 0, S1PTW = 0
[ 58.038349] Data abort info:
[ 58.038353] ISV = 0, ISS = 0x00000004
[ 58.038358] CM = 0, WnR = 0
[ 58.038363] [003130393837363d] address between user and kernel address ranges
[ 58.038370] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 58.038377] Modules linked in: rfcomm algif_hash algif_skcipher
[ 58.038385] Unable to handle kernel paging request at virtual address 003938373635343b
[ 58.038386] af_alg uinput cros_ec_rpmsg mtk_dip mtk_mdp3
[ 58.038395] Mem abort info:
[ 58.038403] ESR = 0x96000004
[ 58.038412] Exception class = DABT (current EL), IL = 32 bits
[ 58.038419] SET = 0, FnV = 0
[ 58.038427] EA = 0, S1PTW = 0
[ 58.038434] Data abort info:
[ 58.038442] ISV = 0, ISS = 0x00000004
[ 58.038449] CM = 0, WnR = 0
[ 58.038457] [003938373635343b] address between user and kernel address ranges
[ 58.038486] Process kworker/u16:1 (pid: 141, stack limit = 0x000000009bd6ba2e)
[ 58.038496] CPU: 1 PID: 141 Comm: kworker/u16:1 Not tainted 4.19.139 #162
[ 58.038501] Hardware name: MediaTek krane sku176 board (DT)
[ 58.038515] Workqueue: ath10k_sdio_wq ath10k_sdio_write_async_work [ath10k_sdio]
[ 58.038523] pstate: 80000005 (Nzcv daif -PAN -UAO)
[ 58.038536] pc : filter_match_preds+0x20/0x94
[ 58.038546] lr : trace_event_buffer_commit+0x150/0x22c
[ 58.038550] sp : ffffff80088a39f0
[ 58.038555] x29: ffffff80088a3a20 x28: 000000000000000c
[ 58.038561] x27: 000000000000000c x26: 0000000000000001
[ 58.038567] x25: ffffffdfd96ef008 x24: 0000000000000000
[ 58.038573] x23: 0000000000000000 x22: ffffffe03b006c00
[ 58.038580] x21: ffffffdfd96ef000 x20: 3231303938373635
[ 58.038586] x19: ffffffe02de9b160 x18: 0000000000000020
[ 58.038592] x17: 0000000000100000 x16: ffffffa4e67937f0
[ 58.038598] x15: 0000000000100000 x14: 0000393837363534
[ 58.038604] x13: 3332313039383736 x12: 3534333231303938
[ 58.038611] x11: 3736353433323130 x10: 3938373635343332
[ 58.038617] x9 : 3130393837363534 x8 : 000000000000040b
[ 58.038623] x7 : 3534333231303938 x6 : ffffffdfd96fb2d1
[ 58.038629] x5 : 0000000000000000 x4 : 0000000000000000
[ 58.038635] x3 : 000000000000c2cd x2 : ffffffffffffffc0
[ 58.038641] x1 : ffffffdfd96ef008 x0 : ffffffdfd96f7080
[ 58.038648] Call trace:
[ 58.038655] filter_match_preds+0x20/0x94
[ 58.038661] trace_event_buffer_commit+0x150/0x22c
[ 58.038698] trace_event_raw_event_ath10k_log_dbg_dump+0x208/0x240 [ath10k_core]
[ 58.038716] ath10k_dbg_dump+0x1bc/0x21c [ath10k_core]
[ 58.038724] ath10k_sdio_write+0xc8/0xe8 [ath10k_sdio]
[ 58.038730] ath10k_sdio_write_async_work+0x11c/0x1f4 [ath10k_sdio]
[ 58.038738] process_one_work+0x210/0x418
[ 58.038744] worker_thread+0x234/0x3dc
[ 58.038751] kthread+0x120/0x140
[ 58.038758] ret_from_fork+0x10/0x18
[ 58.038766] Code: 9100c3fd b40002e0 f9400014 b40002b4 (f9400680)
[ 58.038773] ---[ end trace 8b9d2463f441eb7f ]---
[ 58.051846] Kernel panic - not syncing: Fatal exception
[ 58.051864] SMP: stopping secondary CPUs

dmesg 2:
[ 228.692905] Unable to handle kernel paging request at virtual address 0033323130393837
[ 228.692913] Mem abort info:
[ 228.692920] ESR = 0x96000004
[ 228.692927] Exception class = DABT (current EL), IL = 32 bits
[ 228.692933] SET = 0, FnV = 0
[ 228.692938] EA = 0, S1PTW = 0
[ 228.692943] Data abort info:
[ 228.692953] ISV = 0, ISS = 0x00000004
[ 228.692965] CM = 0, WnR = 0
[ 228.692975] [0033323130393837] address between user and kernel address ranges
[ 228.692983] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 228.693078] Process kworker/u16:2 (pid: 144, stack limit = 0x000000009f882493)
[ 228.693088] CPU: 6 PID: 144 Comm: kworker/u16:2 Tainted: G W 4.19.139 #162
[ 228.693095] Hardware name: MediaTek krane sku176 board (DT)
[ 228.693125] Workqueue: ath10k_tx_complete_wq ath10k_htc_tx_complete_work [ath10k_core]
[ 228.693137] pstate: 80000005 (Nzcv daif -PAN -UAO)
[ 228.693152] pc : filter_match_preds+0x38/0x94
[ 228.693164] lr : trace_event_buffer_commit+0x150/0x22c
[ 228.693170] sp : ffffff80088b3b20
[ 228.693175] x29: ffffff80088b3b50 x28: 0000000000000402
[ 228.693183] x27: fffffff040001f60 x26: ffffff9d4ba34000
[ 228.693191] x25: fffffff040af5004 x24: 0000000000000000
[ 228.693198] x23: 0000000000000000 x22: fffffff040af7100
[ 228.693205] x21: 0000000000000000 x20: fffffff040af7100
[ 228.693212] x19: fffffff040af5004 x18: 0000000000000000
[ 228.693219] x17: 000000000000003c x16: ffffff9d4da649d8
[ 228.693226] x15: 0000000000000006 x14: 3837363534333231
[ 228.693233] x13: fffffff01514fc00 x12: 0000000000000000
[ 228.693240] x11: 0000000000000000 x10: fffffff040af5004
[ 228.693247] x9 : 0000000000000000 x8 : 0000000000000000
[ 228.693254] x7 : 646b686064ff6e68 x6 : 0000800000000000
[ 228.693261] x5 : 0000000000000000 x4 : 0000000000000000
[ 228.693268] x3 : 000000000000002c x2 : 0000000000000000
[ 228.693275] x1 : fffffff040af5004 x0 : 3433323130393837
[ 228.693282] Call trace:
[ 228.693291] filter_match_preds+0x38/0x94
[ 228.693299] trace_event_buffer_commit+0x150/0x22c
[ 228.693321] trace_event_raw_event_ath10k_txrx_tx_unref+0x164/0x198 [ath10k_core]
[ 228.693344] ath10k_txrx_tx_unref+0x1c0/0x36c [ath10k_core]
[ 228.693365] ath10k_htt_htc_tx_complete+0xe4/0x114 [ath10k_core]
[ 228.693386] ath10k_htc_notify_tx_completion+0xe4/0x124 [ath10k_core]
[ 228.693407] ath10k_htc_tx_complete_work+0xb8/0xf0 [ath10k_core]
[ 228.693418] process_one_work+0x210/0x418
[ 228.693425] worker_thread+0x234/0x3dc
[ 228.693433] kthread+0x120/0x140
[ 228.693441] ret_from_fork+0x10/0x18
[ 228.693451] Code: aa0103f3 aa1f03e8 2a1f03f5 aa1403f6 (f9400009)
[ 228.693459] ---[ end trace 187b0ae372e7515c ]---
[ 228.711074] Kernel panic - not syncing: Fatal exception
[ 228.711086] SMP: stopping secondary CPUs
[ 228.711254] Kernel Offset: 0x1d45600000 from 0xffffff8008000000
[ 228.711259] CPU features: 0x0,2188200c
[ 228.711262] Memory Limit: none

The value of prog in filter_match_preds of kernel/trace/trace_events_filter.c
is overwrite to the content of the UDP packets's content like this
0x0039383736353433, it is a invalid address, so crash.

Content of udp tx packets with iperf:
0000 00 03 7f 46 c4 49 40 49 0f 2f f9 09 08 00 45 00 ...F.I@I./....E.
0010 05 da 18 00 40 00 40 11 99 4a c0 a8 01 77 c0 a8 ....@[email protected]..
0020 01 01 9c 7d 13 89 05 c6 89 a0 00 00 01 8b 5d 69 ...}..........]i
0030 98 9a 00 0a 55 b8 00 00 00 00 00 00 00 01 00 00 ....U...........
0040 13 89 00 00 00 00 00 00 00 00 ff ff f4 48 05 f5 .............H..
0050 e1 00 30 31 32 33 34 35 36 37 38 39 30 31 32 33 ..01234567890123
0060 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
0070 30 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 0123456789012345
0080 36 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 6789012345678901
0090 32 33 34 35 36 37 38 39 30 31 32 33 34 35 36 37 2345678901234567
00a0 38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 33 8901234567890123
00b0 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
...
05b0 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
05c0 30 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 0123456789012345
05d0 36 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 6789012345678901
05e0 32 33 34 35 36 37 38 39 23456789

ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
is bigger than PAGE_SIZE which is normally 4096, then ath10k_sdio_write
call ath10k_dbg_dump to trace the large size skb and corrupt the trace
data of tracing and lead crash. When disable the TX bundle of SDIO, it
does not crash, but TX bundle is for improve throughput, so it is enabled
by default. It is useless to call ath10k_dbg_dump to trace the large
bundled skb, so this patch trace the top part of large bundled skb.

I also did another try with below change, it is move the ath10k_dbg_dump
from ath10k_sdio_write to ath10k_sdio_mbox_rx_fetch_bundle, it also
crash immediately when run iperf UDP RX traffic, RX bundle is enabled
by default for SDIO, so the virt_pkt_len of ath10k_sdio_mbox_rx_fetch_bundle
is large, max value is 57344(1792*32), so it crashed same with upper test.

Change of try:
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -336,7 +336,6 @@ static int ath10k_sdio_write(struct ath10k *ar, u32 addr, const void *buf, size_

ath10k_dbg(ar, ATH10K_DBG_SDIO, "sdio write addr 0x%x buf 0x%p len %zu\n",
addr, buf, len);
- ath10k_dbg_dump(ar, ATH10K_DBG_SDIO_DUMP, NULL, "sdio write ", buf, len);

out:
sdio_release_host(func);
@@ -692,6 +691,8 @@ static int ath10k_sdio_mbox_rx_fetch_bundle(struct ath10k *ar)
goto err;
}

+ ath10k_dbg_dump(ar, ATH10K_DBG_SDIO_DUMP, NULL, "sdio read vsg", ar_sdio->vsg_buffer, virt_pkt_len);
+
pkt_offset = 0;
for (i = 0; i < ar_sdio->n_rx_pkts; i++) {
pkt = &ar_sdio->rx_pkts[i];

dmesg of try:
[ 170.621318] Unable to handle kernel paging request at virtual address 0036353433323138
[ 170.621336] Mem abort info:
[ 170.621341] ESR = 0x96000004
[ 170.621347] Exception class = DABT (current EL), IL = 32 bits
[ 170.621352] SET = 0, FnV = 0
[ 170.621357] EA = 0, S1PTW = 0
[ 170.621362] Data abort info:
[ 170.621367] ISV = 0, ISS = 0x00000004
[ 170.621371] CM = 0, WnR = 0
[ 170.621377] [0036353433323138] address between user and kernel address ranges
[ 170.621385] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 170.621479] Process kworker/0:1 (pid: 14, stack limit = 0x00000000c6fcfb2a)
[ 170.621489] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.19.139 #162
[ 170.621494] Hardware name: MediaTek krane sku176 board (DT)
[ 170.621511] Workqueue: events sdio_irq_work
[ 170.621519] pstate: 80000005 (Nzcv daif -PAN -UAO)
[ 170.621529] pc : filter_match_preds+0x20/0x94
[ 170.621540] lr : trace_event_buffer_commit+0x150/0x22c
[ 170.621545] sp : ffffff80080f37d0
[ 170.621550] x29: ffffff80080f3800 x28: 000000000000000c
[ 170.621556] x27: 000000000000000c x26: 0000000000000001
[ 170.621563] x25: fffffff118eb8008 x24: 0000000000000000
[ 170.621569] x23: 0000000000000000 x22: fffffff17b007900
[ 170.621576] x21: fffffff118eb8000 x20: 3736353433323130
[ 170.621583] x19: fffffff178645790 x18: 0000000000000020
[ 170.621589] x17: 000000000000003c x16: ffffff91fe3937f0
[ 170.621596] x15: 0000000000000006 x14: 00000636000ad8b0
[ 170.621602] x13: 0000000000000000 x12: 0000000000000000
[ 170.621608] x11: 0000000000000000 x10: 0000000000000000
[ 170.621615] x9 : 0000000000000000 x8 : 000000000000040b
[ 170.621621] x7 : 0000000000000000 x6 : fffffff118ec6052
[ 170.621628] x5 : 0000000000000000 x4 : 0000000000000000
[ 170.621634] x3 : 000000000000e04e x2 : ffffffffffffffc0
[ 170.621640] x1 : fffffff118eb8008 x0 : fffffff118ec1f00
[ 170.621648] Call trace:
[ 170.621655] filter_match_preds+0x20/0x94
[ 170.621662] trace_event_buffer_commit+0x150/0x22c
[ 170.621704] trace_event_raw_event_ath10k_log_dbg_dump+0x208/0x240 [ath10k_core]
[ 170.621724] ath10k_dbg_dump+0x1bc/0x21c [ath10k_core]
[ 170.621734] ath10k_sdio_readsb+0xdc/0xfc [ath10k_sdio]
[ 170.621742] ath10k_sdio_irq_handler+0x6f8/0xca8 [ath10k_sdio]
[ 170.621749] process_sdio_pending_irqs+0x60/0x1b4
[ 170.621756] sdio_run_irqs+0x34/0x60
[ 170.621762] sdio_irq_work+0x1c/0x28
[ 170.621769] process_one_work+0x210/0x418
[ 170.621775] worker_thread+0x234/0x3dc
[ 170.621783] kthread+0x120/0x140
[ 170.621791] ret_from_fork+0x10/0x18
[ 170.621799] Code: 9100c3fd b40002e0 f9400014 b40002b4 (f9400680)
[ 170.621807] ---[ end trace b52c34b9c8deb8e3 ]---
[ 170.624123] Unable to handle kernel paging request at virtual address dffd74685cec003c
[ 170.624134] Mem abort info:
[ 170.624138] ESR = 0x96000004
[ 170.624143] Exception class = DABT (current EL), IL = 32 bits
[ 170.624146] SET = 0, FnV = 0
[ 170.624150] EA = 0, S1PTW = 0
[ 170.624153] Data abort info:
[ 170.624157] ISV = 0, ISS = 0x00000004
[ 170.624160] CM = 0, WnR = 0
[ 170.624165] [dffd74685cec003c] address between user and kernel address ranges
[ 170.634352] Kernel panic - not syncing: Fatal exception
[ 170.634370] SMP: stopping secondary CPUs
[ 171.711302] SMP: failed to stop secondary CPUs 0,4
[ 171.711310] Kernel Offset: 0x11f6200000 from 0xffffff8008000000
[ 171.711316] CPU features: 0x0,2188200c
[ 171.711321] Memory Limit: none
[ 171.723885] SMP: stopping secondary CPUs
[ 172.800814] SMP: failed to stop secondary CPUs 0,4

trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
call trace_event_buffer_reserve got get a struct pointer *entry, its
type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
call trace_event_buffer_lock_reserve to get ring_buffer_event.

In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
temp buffer this_cpu_read(trace_buffered_event), and the buffer size
is 1 page size which allocatee in trace_buffered_event_enable by
alloc_pages_node, and then ath10k pass the buffer size > 1 page trigger
overflow and crash.

Based on upper test, try and debugging, pass large buff size to function
trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
ath10k_sdio_write to print the length of skb/buffer, it is not necessary
to trace all content of the large skb.

Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049

Signed-off-by: Wen Gong <[email protected]>
---
drivers/net/wireless/ath/ath10k/debug.c | 2 +-
drivers/net/wireless/ath/ath10k/debug.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath10k/debug.c b/drivers/net/wireless/ath/ath10k/debug.c
index e8250a665433..5433dbdde0e5 100644
--- a/drivers/net/wireless/ath/ath10k/debug.c
+++ b/drivers/net/wireless/ath/ath10k/debug.c
@@ -2718,7 +2718,7 @@ void ath10k_dbg_dump(struct ath10k *ar,

/* tracing code doesn't like null strings :/ */
trace_ath10k_log_dbg_dump(ar, msg ? msg : "", prefix ? prefix : "",
- buf, len);
+ buf, min_t(size_t, len, ATH10K_LOG_DUMP_MAX));
}
EXPORT_SYMBOL(ath10k_dbg_dump);

diff --git a/drivers/net/wireless/ath/ath10k/debug.h b/drivers/net/wireless/ath/ath10k/debug.h
index 997c1c80aba7..cec9ba92f28f 100644
--- a/drivers/net/wireless/ath/ath10k/debug.h
+++ b/drivers/net/wireless/ath/ath10k/debug.h
@@ -11,6 +11,8 @@
#include <linux/types.h>
#include "trace.h"

+#define ATH10K_LOG_DUMP_MAX 1024
+
enum ath10k_debug_mask {
ATH10K_DBG_PCI = 0x00000001,
ATH10K_DBG_WMI = 0x00000002,
--
2.23.0


2021-02-09 07:37:29

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

Wen Gong <[email protected]> wrote:

> Kernel panic every time in kernel when running below case:
> steps:
> 1. connect to an AP with good signal strength
> 2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
> 3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
> 4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
> 5. sudo trace-cmd record -e ath10k
> 6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
> 7. kernel panic immeditely
>
> It always crash at trace_event_raw_event_ath10k_xxx, below is 2 sample:
>
> dmesg 1:
> [ 58.038310] Unable to handle kernel paging request at virtual address 003130393837363d
> [ 58.038325] Mem abort info:
> [ 58.038330] ESR = 0x96000004
> [ 58.038336] Exception class = DABT (current EL), IL = 32 bits
> [ 58.038340] SET = 0, FnV = 0
> [ 58.038345] EA = 0, S1PTW = 0
> [ 58.038349] Data abort info:
> [ 58.038353] ISV = 0, ISS = 0x00000004
> [ 58.038358] CM = 0, WnR = 0
> [ 58.038363] [003130393837363d] address between user and kernel address ranges
> [ 58.038370] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 58.038377] Modules linked in: rfcomm algif_hash algif_skcipher
> [ 58.038385] Unable to handle kernel paging request at virtual address 003938373635343b
> [ 58.038386] af_alg uinput cros_ec_rpmsg mtk_dip mtk_mdp3
> [ 58.038395] Mem abort info:
> [ 58.038403] ESR = 0x96000004
> [ 58.038412] Exception class = DABT (current EL), IL = 32 bits
> [ 58.038419] SET = 0, FnV = 0
> [ 58.038427] EA = 0, S1PTW = 0
> [ 58.038434] Data abort info:
> [ 58.038442] ISV = 0, ISS = 0x00000004
> [ 58.038449] CM = 0, WnR = 0
> [ 58.038457] [003938373635343b] address between user and kernel address ranges
> [ 58.038486] Process kworker/u16:1 (pid: 141, stack limit = 0x000000009bd6ba2e)
> [ 58.038496] CPU: 1 PID: 141 Comm: kworker/u16:1 Not tainted 4.19.139 #162
> [ 58.038501] Hardware name: MediaTek krane sku176 board (DT)
> [ 58.038515] Workqueue: ath10k_sdio_wq ath10k_sdio_write_async_work [ath10k_sdio]
> [ 58.038523] pstate: 80000005 (Nzcv daif -PAN -UAO)
> [ 58.038536] pc : filter_match_preds+0x20/0x94
> [ 58.038546] lr : trace_event_buffer_commit+0x150/0x22c
> [ 58.038550] sp : ffffff80088a39f0
> [ 58.038555] x29: ffffff80088a3a20 x28: 000000000000000c
> [ 58.038561] x27: 000000000000000c x26: 0000000000000001
> [ 58.038567] x25: ffffffdfd96ef008 x24: 0000000000000000
> [ 58.038573] x23: 0000000000000000 x22: ffffffe03b006c00
> [ 58.038580] x21: ffffffdfd96ef000 x20: 3231303938373635
> [ 58.038586] x19: ffffffe02de9b160 x18: 0000000000000020
> [ 58.038592] x17: 0000000000100000 x16: ffffffa4e67937f0
> [ 58.038598] x15: 0000000000100000 x14: 0000393837363534
> [ 58.038604] x13: 3332313039383736 x12: 3534333231303938
> [ 58.038611] x11: 3736353433323130 x10: 3938373635343332
> [ 58.038617] x9 : 3130393837363534 x8 : 000000000000040b
> [ 58.038623] x7 : 3534333231303938 x6 : ffffffdfd96fb2d1
> [ 58.038629] x5 : 0000000000000000 x4 : 0000000000000000
> [ 58.038635] x3 : 000000000000c2cd x2 : ffffffffffffffc0
> [ 58.038641] x1 : ffffffdfd96ef008 x0 : ffffffdfd96f7080
> [ 58.038648] Call trace:
> [ 58.038655] filter_match_preds+0x20/0x94
> [ 58.038661] trace_event_buffer_commit+0x150/0x22c
> [ 58.038698] trace_event_raw_event_ath10k_log_dbg_dump+0x208/0x240 [ath10k_core]
> [ 58.038716] ath10k_dbg_dump+0x1bc/0x21c [ath10k_core]
> [ 58.038724] ath10k_sdio_write+0xc8/0xe8 [ath10k_sdio]
> [ 58.038730] ath10k_sdio_write_async_work+0x11c/0x1f4 [ath10k_sdio]
> [ 58.038738] process_one_work+0x210/0x418
> [ 58.038744] worker_thread+0x234/0x3dc
> [ 58.038751] kthread+0x120/0x140
> [ 58.038758] ret_from_fork+0x10/0x18
> [ 58.038766] Code: 9100c3fd b40002e0 f9400014 b40002b4 (f9400680)
> [ 58.038773] ---[ end trace 8b9d2463f441eb7f ]---
> [ 58.051846] Kernel panic - not syncing: Fatal exception
> [ 58.051864] SMP: stopping secondary CPUs
>
> dmesg 2:
> [ 228.692905] Unable to handle kernel paging request at virtual address 0033323130393837
> [ 228.692913] Mem abort info:
> [ 228.692920] ESR = 0x96000004
> [ 228.692927] Exception class = DABT (current EL), IL = 32 bits
> [ 228.692933] SET = 0, FnV = 0
> [ 228.692938] EA = 0, S1PTW = 0
> [ 228.692943] Data abort info:
> [ 228.692953] ISV = 0, ISS = 0x00000004
> [ 228.692965] CM = 0, WnR = 0
> [ 228.692975] [0033323130393837] address between user and kernel address ranges
> [ 228.692983] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 228.693078] Process kworker/u16:2 (pid: 144, stack limit = 0x000000009f882493)
> [ 228.693088] CPU: 6 PID: 144 Comm: kworker/u16:2 Tainted: G W 4.19.139 #162
> [ 228.693095] Hardware name: MediaTek krane sku176 board (DT)
> [ 228.693125] Workqueue: ath10k_tx_complete_wq ath10k_htc_tx_complete_work [ath10k_core]
> [ 228.693137] pstate: 80000005 (Nzcv daif -PAN -UAO)
> [ 228.693152] pc : filter_match_preds+0x38/0x94
> [ 228.693164] lr : trace_event_buffer_commit+0x150/0x22c
> [ 228.693170] sp : ffffff80088b3b20
> [ 228.693175] x29: ffffff80088b3b50 x28: 0000000000000402
> [ 228.693183] x27: fffffff040001f60 x26: ffffff9d4ba34000
> [ 228.693191] x25: fffffff040af5004 x24: 0000000000000000
> [ 228.693198] x23: 0000000000000000 x22: fffffff040af7100
> [ 228.693205] x21: 0000000000000000 x20: fffffff040af7100
> [ 228.693212] x19: fffffff040af5004 x18: 0000000000000000
> [ 228.693219] x17: 000000000000003c x16: ffffff9d4da649d8
> [ 228.693226] x15: 0000000000000006 x14: 3837363534333231
> [ 228.693233] x13: fffffff01514fc00 x12: 0000000000000000
> [ 228.693240] x11: 0000000000000000 x10: fffffff040af5004
> [ 228.693247] x9 : 0000000000000000 x8 : 0000000000000000
> [ 228.693254] x7 : 646b686064ff6e68 x6 : 0000800000000000
> [ 228.693261] x5 : 0000000000000000 x4 : 0000000000000000
> [ 228.693268] x3 : 000000000000002c x2 : 0000000000000000
> [ 228.693275] x1 : fffffff040af5004 x0 : 3433323130393837
> [ 228.693282] Call trace:
> [ 228.693291] filter_match_preds+0x38/0x94
> [ 228.693299] trace_event_buffer_commit+0x150/0x22c
> [ 228.693321] trace_event_raw_event_ath10k_txrx_tx_unref+0x164/0x198 [ath10k_core]
> [ 228.693344] ath10k_txrx_tx_unref+0x1c0/0x36c [ath10k_core]
> [ 228.693365] ath10k_htt_htc_tx_complete+0xe4/0x114 [ath10k_core]
> [ 228.693386] ath10k_htc_notify_tx_completion+0xe4/0x124 [ath10k_core]
> [ 228.693407] ath10k_htc_tx_complete_work+0xb8/0xf0 [ath10k_core]
> [ 228.693418] process_one_work+0x210/0x418
> [ 228.693425] worker_thread+0x234/0x3dc
> [ 228.693433] kthread+0x120/0x140
> [ 228.693441] ret_from_fork+0x10/0x18
> [ 228.693451] Code: aa0103f3 aa1f03e8 2a1f03f5 aa1403f6 (f9400009)
> [ 228.693459] ---[ end trace 187b0ae372e7515c ]---
> [ 228.711074] Kernel panic - not syncing: Fatal exception
> [ 228.711086] SMP: stopping secondary CPUs
> [ 228.711254] Kernel Offset: 0x1d45600000 from 0xffffff8008000000
> [ 228.711259] CPU features: 0x0,2188200c
> [ 228.711262] Memory Limit: none
>
> The value of prog in filter_match_preds of kernel/trace/trace_events_filter.c
> is overwrite to the content of the UDP packets's content like this
> 0x0039383736353433, it is a invalid address, so crash.
>
> Content of udp tx packets with iperf:
> 0000 00 03 7f 46 c4 49 40 49 0f 2f f9 09 08 00 45 00 ...F.I@I./....E.
> 0010 05 da 18 00 40 00 40 11 99 4a c0 a8 01 77 c0 a8 ....@[email protected]..
> 0020 01 01 9c 7d 13 89 05 c6 89 a0 00 00 01 8b 5d 69 ...}..........]i
> 0030 98 9a 00 0a 55 b8 00 00 00 00 00 00 00 01 00 00 ....U...........
> 0040 13 89 00 00 00 00 00 00 00 00 ff ff f4 48 05 f5 .............H..
> 0050 e1 00 30 31 32 33 34 35 36 37 38 39 30 31 32 33 ..01234567890123
> 0060 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
> 0070 30 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 0123456789012345
> 0080 36 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 6789012345678901
> 0090 32 33 34 35 36 37 38 39 30 31 32 33 34 35 36 37 2345678901234567
> 00a0 38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 33 8901234567890123
> 00b0 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
> ...
> 05b0 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39 4567890123456789
> 05c0 30 31 32 33 34 35 36 37 38 39 30 31 32 33 34 35 0123456789012345
> 05d0 36 37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 6789012345678901
> 05e0 32 33 34 35 36 37 38 39 23456789
>
> ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
> is bigger than PAGE_SIZE which is normally 4096, then ath10k_sdio_write
> call ath10k_dbg_dump to trace the large size skb and corrupt the trace
> data of tracing and lead crash. When disable the TX bundle of SDIO, it
> does not crash, but TX bundle is for improve throughput, so it is enabled
> by default. It is useless to call ath10k_dbg_dump to trace the large
> bundled skb, so this patch trace the top part of large bundled skb.
>
> I also did another try with below change, it is move the ath10k_dbg_dump
> from ath10k_sdio_write to ath10k_sdio_mbox_rx_fetch_bundle, it also
> crash immediately when run iperf UDP RX traffic, RX bundle is enabled
> by default for SDIO, so the virt_pkt_len of ath10k_sdio_mbox_rx_fetch_bundle
> is large, max value is 57344(1792*32), so it crashed same with upper test.
>
> Change of try:
>
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -336,7 +336,6 @@ static int ath10k_sdio_write(struct ath10k *ar, u32 addr, const void *buf, size_
>
> ath10k_dbg(ar, ATH10K_DBG_SDIO, "sdio write addr 0x%x buf 0x%p len %zu\n",
> addr, buf, len);
> - ath10k_dbg_dump(ar, ATH10K_DBG_SDIO_DUMP, NULL, "sdio write ", buf, len);
>
> out:
> sdio_release_host(func);
> @@ -692,6 +691,8 @@ static int ath10k_sdio_mbox_rx_fetch_bundle(struct ath10k *ar)
> goto err;
> }
>
> + ath10k_dbg_dump(ar, ATH10K_DBG_SDIO_DUMP, NULL, "sdio read vsg", ar_sdio->vsg_buffer, virt_pkt_len);
> +
> pkt_offset = 0;
> for (i = 0; i < ar_sdio->n_rx_pkts; i++) {
> pkt = &ar_sdio->rx_pkts[i];
>
> dmesg of try:
> [ 170.621318] Unable to handle kernel paging request at virtual address 0036353433323138
> [ 170.621336] Mem abort info:
> [ 170.621341] ESR = 0x96000004
> [ 170.621347] Exception class = DABT (current EL), IL = 32 bits
> [ 170.621352] SET = 0, FnV = 0
> [ 170.621357] EA = 0, S1PTW = 0
> [ 170.621362] Data abort info:
> [ 170.621367] ISV = 0, ISS = 0x00000004
> [ 170.621371] CM = 0, WnR = 0
> [ 170.621377] [0036353433323138] address between user and kernel address ranges
> [ 170.621385] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 170.621479] Process kworker/0:1 (pid: 14, stack limit = 0x00000000c6fcfb2a)
> [ 170.621489] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.19.139 #162
> [ 170.621494] Hardware name: MediaTek krane sku176 board (DT)
> [ 170.621511] Workqueue: events sdio_irq_work
> [ 170.621519] pstate: 80000005 (Nzcv daif -PAN -UAO)
> [ 170.621529] pc : filter_match_preds+0x20/0x94
> [ 170.621540] lr : trace_event_buffer_commit+0x150/0x22c
> [ 170.621545] sp : ffffff80080f37d0
> [ 170.621550] x29: ffffff80080f3800 x28: 000000000000000c
> [ 170.621556] x27: 000000000000000c x26: 0000000000000001
> [ 170.621563] x25: fffffff118eb8008 x24: 0000000000000000
> [ 170.621569] x23: 0000000000000000 x22: fffffff17b007900
> [ 170.621576] x21: fffffff118eb8000 x20: 3736353433323130
> [ 170.621583] x19: fffffff178645790 x18: 0000000000000020
> [ 170.621589] x17: 000000000000003c x16: ffffff91fe3937f0
> [ 170.621596] x15: 0000000000000006 x14: 00000636000ad8b0
> [ 170.621602] x13: 0000000000000000 x12: 0000000000000000
> [ 170.621608] x11: 0000000000000000 x10: 0000000000000000
> [ 170.621615] x9 : 0000000000000000 x8 : 000000000000040b
> [ 170.621621] x7 : 0000000000000000 x6 : fffffff118ec6052
> [ 170.621628] x5 : 0000000000000000 x4 : 0000000000000000
> [ 170.621634] x3 : 000000000000e04e x2 : ffffffffffffffc0
> [ 170.621640] x1 : fffffff118eb8008 x0 : fffffff118ec1f00
> [ 170.621648] Call trace:
> [ 170.621655] filter_match_preds+0x20/0x94
> [ 170.621662] trace_event_buffer_commit+0x150/0x22c
> [ 170.621704] trace_event_raw_event_ath10k_log_dbg_dump+0x208/0x240 [ath10k_core]
> [ 170.621724] ath10k_dbg_dump+0x1bc/0x21c [ath10k_core]
> [ 170.621734] ath10k_sdio_readsb+0xdc/0xfc [ath10k_sdio]
> [ 170.621742] ath10k_sdio_irq_handler+0x6f8/0xca8 [ath10k_sdio]
> [ 170.621749] process_sdio_pending_irqs+0x60/0x1b4
> [ 170.621756] sdio_run_irqs+0x34/0x60
> [ 170.621762] sdio_irq_work+0x1c/0x28
> [ 170.621769] process_one_work+0x210/0x418
> [ 170.621775] worker_thread+0x234/0x3dc
> [ 170.621783] kthread+0x120/0x140
> [ 170.621791] ret_from_fork+0x10/0x18
> [ 170.621799] Code: 9100c3fd b40002e0 f9400014 b40002b4 (f9400680)
> [ 170.621807] ---[ end trace b52c34b9c8deb8e3 ]---
> [ 170.624123] Unable to handle kernel paging request at virtual address dffd74685cec003c
> [ 170.624134] Mem abort info:
> [ 170.624138] ESR = 0x96000004
> [ 170.624143] Exception class = DABT (current EL), IL = 32 bits
> [ 170.624146] SET = 0, FnV = 0
> [ 170.624150] EA = 0, S1PTW = 0
> [ 170.624153] Data abort info:
> [ 170.624157] ISV = 0, ISS = 0x00000004
> [ 170.624160] CM = 0, WnR = 0
> [ 170.624165] [dffd74685cec003c] address between user and kernel address ranges
> [ 170.634352] Kernel panic - not syncing: Fatal exception
> [ 170.634370] SMP: stopping secondary CPUs
> [ 171.711302] SMP: failed to stop secondary CPUs 0,4
> [ 171.711310] Kernel Offset: 0x11f6200000 from 0xffffff8008000000
> [ 171.711316] CPU features: 0x0,2188200c
> [ 171.711321] Memory Limit: none
> [ 171.723885] SMP: stopping secondary CPUs
> [ 172.800814] SMP: failed to stop secondary CPUs 0,4
>
> trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
> call trace_event_buffer_reserve got get a struct pointer *entry, its
> type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
> compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
> call trace_event_buffer_lock_reserve to get ring_buffer_event.
>
> In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
> the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
> which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
> temp buffer this_cpu_read(trace_buffered_event), and the buffer size
> is 1 page size which allocatee in trace_buffered_event_enable by
> alloc_pages_node, and then ath10k pass the buffer size > 1 page trigger
> overflow and crash.
>
> Based on upper test, try and debugging, pass large buff size to function
> trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
> ath10k_sdio_write to print the length of skb/buffer, it is not necessary
> to trace all content of the large skb.
>
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
>
> Signed-off-by: Wen Gong <[email protected]>

Failed to apply, please rebase.

error: patch failed: drivers/net/wireless/ath/ath10k/sdio.c:336
error: drivers/net/wireless/ath/ath10k/sdio.c: patch does not apply
stg import: Diff does not apply cleanly

Patch set to Changes Requested.

--
https://patchwork.kernel.org/project/linux-wireless/patch/[email protected]/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2021-02-09 19:59:33

by Brian Norris

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

+ Steven Rostedt

Hi Wen,

(Trimming down the description a bit:)

On Mon, Feb 8, 2021 at 6:59 PM Wen Gong <[email protected]> wrote:
>
> Kernel panic every time in kernel when running below case:
> steps:
> 1. connect to an AP with good signal strength
> 2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
> 3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
> 4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
> 5. sudo trace-cmd record -e ath10k
> 6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
> 7. kernel panic immeditely
>
> It always crash at trace_event_raw_event_ath10k_xxx, below is 2 sample:
...
> The value of prog in filter_match_preds of kernel/trace/trace_events_filter.c
> is overwrite to the content of the UDP packets's content like this
> 0x0039383736353433, it is a invalid address, so crash.
...
> ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
> is bigger than PAGE_SIZE which is normally 4096, then ath10k_sdio_write
> call ath10k_dbg_dump to trace the large size skb and corrupt the trace
> data of tracing and lead crash. When disable the TX bundle of SDIO, it
> does not crash, but TX bundle is for improve throughput, so it is enabled
> by default. It is useless to call ath10k_dbg_dump to trace the large
> bundled skb, so this patch trace the top part of large bundled skb.
...
> trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
> call trace_event_buffer_reserve got get a struct pointer *entry, its
> type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
> compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
> call trace_event_buffer_lock_reserve to get ring_buffer_event.
>
> In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
> the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
> which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
> temp buffer this_cpu_read(trace_buffered_event), and the buffer size
> is 1 page size which allocatee in trace_buffered_event_enable by
> alloc_pages_node, and then ath10k pass the buffer size > 1 page trigger
> overflow and crash.
>
> Based on upper test, try and debugging, pass large buff size to function
> trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
> ath10k_sdio_write to print the length of skb/buffer, it is not necessary
> to trace all content of the large skb.

Is this the same issue noted in this thread?

[for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
https://lore.kernel.org/lkml/[email protected]/

It seems like we should still try to get that fixed somehow, even if
the below change is fine on its own (it probably doesn't make sense to
such a large amount of data via tracepoints). It would be unfortunate
for next poor soul to hit the same issues, just because they wanted to
dump a few KB.

Brian

> drivers/net/wireless/ath/ath10k/debug.c | 2 +-
> drivers/net/wireless/ath/ath10k/debug.h | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/debug.c b/drivers/net/wireless/ath/ath10k/debug.c
> index e8250a665433..5433dbdde0e5 100644
> --- a/drivers/net/wireless/ath/ath10k/debug.c
> +++ b/drivers/net/wireless/ath/ath10k/debug.c
> @@ -2718,7 +2718,7 @@ void ath10k_dbg_dump(struct ath10k *ar,
>
> /* tracing code doesn't like null strings :/ */
> trace_ath10k_log_dbg_dump(ar, msg ? msg : "", prefix ? prefix : "",
> - buf, len);
> + buf, min_t(size_t, len, ATH10K_LOG_DUMP_MAX));
> }
> EXPORT_SYMBOL(ath10k_dbg_dump);
>
> diff --git a/drivers/net/wireless/ath/ath10k/debug.h b/drivers/net/wireless/ath/ath10k/debug.h
> index 997c1c80aba7..cec9ba92f28f 100644
> --- a/drivers/net/wireless/ath/ath10k/debug.h
> +++ b/drivers/net/wireless/ath/ath10k/debug.h
> @@ -11,6 +11,8 @@
> #include <linux/types.h>
> #include "trace.h"
>
> +#define ATH10K_LOG_DUMP_MAX 1024
> +
> enum ath10k_debug_mask {
> ATH10K_DBG_PCI = 0x00000001,
> ATH10K_DBG_WMI = 0x00000002,
> --
> 2.23.0
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k

2021-02-09 22:30:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Tue, 9 Feb 2021 11:35:07 -0800
Brian Norris <[email protected]> wrote:

> + Steven Rostedt

Thanks.

>
> Hi Wen,
>
> (Trimming down the description a bit:)
>
> On Mon, Feb 8, 2021 at 6:59 PM Wen Gong <[email protected]> wrote:
> >
> > Kernel panic every time in kernel when running below case:
> > steps:
> > 1. connect to an AP with good signal strength
> > 2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
> > 3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
> > 4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
> > 5. sudo trace-cmd record -e ath10k
> > 6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
> > 7. kernel panic immeditely
> >
> > It always crash at trace_event_raw_event_ath10k_xxx, below is 2 sample:
> ...
> > The value of prog in filter_match_preds of kernel/trace/trace_events_filter.c
> > is overwrite to the content of the UDP packets's content like this
> > 0x0039383736353433, it is a invalid address, so crash.
> ...
> > ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
> > is bigger than PAGE_SIZE which is normally 4096, then ath10k_sdio_write
> > call ath10k_dbg_dump to trace the large size skb and corrupt the trace
> > data of tracing and lead crash. When disable the TX bundle of SDIO, it
> > does not crash, but TX bundle is for improve throughput, so it is enabled
> > by default. It is useless to call ath10k_dbg_dump to trace the large
> > bundled skb, so this patch trace the top part of large bundled skb.
> ...
> > trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
> > call trace_event_buffer_reserve got get a struct pointer *entry, its
> > type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
> > compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
> > call trace_event_buffer_lock_reserve to get ring_buffer_event.
> >
> > In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
> > the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
> > which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
> > temp buffer this_cpu_read(trace_buffered_event), and the buffer size
> > is 1 page size which allocatee in trace_buffered_event_enable by
> > alloc_pages_node, and then ath10k pass the buffer size > 1 page trigger
> > overflow and crash.
> >
> > Based on upper test, try and debugging, pass large buff size to function
> > trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
> > ath10k_sdio_write to print the length of skb/buffer, it is not necessary
> > to trace all content of the large skb.
>
> Is this the same issue noted in this thread?
>
> [for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
> https://lore.kernel.org/lkml/[email protected]/

Note, that is only used when filtering happens, which doesn't appear to be
the case here.

>
> It seems like we should still try to get that fixed somehow, even if
> the below change is fine on its own (it probably doesn't make sense to
> such a large amount of data via tracepoints). It would be unfortunate
> for next poor soul to hit the same issues, just because they wanted to
> dump a few KB.

Yeah, it was a design decision to cap the max size of events to just under
PAGE_SIZE. The ring buffer is broken up into pages (for zero copy
transfers to file systems and the network). Thus, no event is allowed to be
bigger than a page (and actually a bit smaller)

That said, it shouldn't have crashed, it should have just failed to record.

I'll test it out and see why it crashed.

-- Steve

2021-02-09 23:52:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Tue, 9 Feb 2021 14:55:31 -0500
Steven Rostedt <[email protected]> wrote:

> > [for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
> > https://lore.kernel.org/lkml/[email protected]/
>
> Note, that is only used when filtering happens, which doesn't appear to be
> the case here.

I was basing this off of the original commands, but the stack dump says
otherwise. But it should still work.

>
> >
> > It seems like we should still try to get that fixed somehow, even if
> > the below change is fine on its own (it probably doesn't make sense to
> > such a large amount of data via tracepoints). It would be unfortunate
> > for next poor soul to hit the same issues, just because they wanted to
> > dump a few KB.
>
> Yeah, it was a design decision to cap the max size of events to just under
> PAGE_SIZE. The ring buffer is broken up into pages (for zero copy
> transfers to file systems and the network). Thus, no event is allowed to be
> bigger than a page (and actually a bit smaller)
>
> That said, it shouldn't have crashed, it should have just failed to record.
>
> I'll test it out and see why it crashed.

Looking at the original report, I see:

CPU: 1 PID: 141 Comm: kworker/u16:1 Not tainted 4.19.139 #162

Does this still crash on the latest kernel?

-- Steve

2021-02-10 02:06:22

by Wen Gong

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On 2021-02-10 05:34, Steven Rostedt wrote:
> On Tue, 9 Feb 2021 14:55:31 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> > [for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
>> > https://lore.kernel.org/lkml/[email protected]/
>>
>> Note, that is only used when filtering happens, which doesn't appear
>> to be
>> the case here.
>
> I was basing this off of the original commands, but the stack dump says
> otherwise. But it should still work.
>
>>
>> >
>> > It seems like we should still try to get that fixed somehow, even if
>> > the below change is fine on its own (it probably doesn't make sense to
>> > such a large amount of data via tracepoints). It would be unfortunate
>> > for next poor soul to hit the same issues, just because they wanted to
>> > dump a few KB.
>>
>> Yeah, it was a design decision to cap the max size of events to just
>> under
>> PAGE_SIZE. The ring buffer is broken up into pages (for zero copy
>> transfers to file systems and the network). Thus, no event is allowed
>> to be
>> bigger than a page (and actually a bit smaller)
>>
>> That said, it shouldn't have crashed, it should have just failed to
>> record.
>>
>> I'll test it out and see why it crashed.
>
> Looking at the original report, I see:
>
> CPU: 1 PID: 141 Comm: kworker/u16:1 Not tainted 4.19.139 #162
>
> Does this still crash on the latest kernel?
>
> -- Steve
Not tested with latest kernel.
The reason is below which I said in
https://lore.kernel.org/lkml/[email protected]/

the per cpu buffer seems it is initilized in
trace_buffered_event_enable,
it is only 1 page size as below:
void trace_buffered_event_enable(void)
{
...
for_each_tracing_cpu(cpu) {
page = alloc_pages_node(cpu_to_node(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
If the size of buffer to trace is more than 1 page, such as 46680, then
it trigger kernel crash/panic in my case while run trace-cmd.
After debugging, the trace_file->flags in
trace_event_buffer_lock_reserve is 0x40b while run trace-cmd, and it is
0x403 while collecting ftrace log.

2021-02-10 02:15:28

by Wen Gong

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On 2021-02-10 03:35, Brian Norris wrote:
> + Steven Rostedt
>
> Hi Wen,
>
> (Trimming down the description a bit:)
>
> On Mon, Feb 8, 2021 at 6:59 PM Wen Gong <[email protected]> wrote:
>>
>> Kernel panic every time in kernel when running below case:
>> steps:
>> 1. connect to an AP with good signal strength
>> 2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
>> 3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
>> 4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
>> 5. sudo trace-cmd record -e ath10k
>> 6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
>> 7. kernel panic immeditely
>>
>> It always crash at trace_event_raw_event_ath10k_xxx, below is 2
>> sample:
> ...
>> The value of prog in filter_match_preds of
>> kernel/trace/trace_events_filter.c
>> is overwrite to the content of the UDP packets's content like this
>> 0x0039383736353433, it is a invalid address, so crash.
> ...
>> ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
>> is bigger than PAGE_SIZE which is normally 4096, then
>> ath10k_sdio_write
>> call ath10k_dbg_dump to trace the large size skb and corrupt the trace
>> data of tracing and lead crash. When disable the TX bundle of SDIO, it
>> does not crash, but TX bundle is for improve throughput, so it is
>> enabled
>> by default. It is useless to call ath10k_dbg_dump to trace the large
>> bundled skb, so this patch trace the top part of large bundled skb.
> ...
>> trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
>> call trace_event_buffer_reserve got get a struct pointer *entry, its
>> type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
>> compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
>> call trace_event_buffer_lock_reserve to get ring_buffer_event.
>>
>> In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
>> the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
>> which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
>> temp buffer this_cpu_read(trace_buffered_event), and the buffer size
>> is 1 page size which allocatee in trace_buffered_event_enable by
>> alloc_pages_node, and then ath10k pass the buffer size > 1 page
>> trigger
>> overflow and crash.
>>
>> Based on upper test, try and debugging, pass large buff size to
>> function
>> trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
>> ath10k_sdio_write to print the length of skb/buffer, it is not
>> necessary
>> to trace all content of the large skb.
>
> Is this the same issue noted in this thread?
Hi Norris,
Yes, but it is not the total same issue.
>
> [for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
> https://lore.kernel.org/lkml/[email protected]/
>
> It seems like we should still try to get that fixed somehow, even if
> the below change is fine on its own (it probably doesn't make sense to
> such a large amount of data via tracepoints). It would be unfortunate
> for next poor soul to hit the same issues, just because they wanted to
> dump a few KB.
For ath10k, we do not want to dump content which size > 1024.
otherwise it will generate a much big file while collecting log, it make
us much
hard to analyze the log.
so this patch is to dump the top 1024 bytes only,
its 1st goal is make log smaller.
its 2nd effect is fix the crash issue,
>
> Brian
>
>> drivers/net/wireless/ath/ath10k/debug.c | 2 +-
>> drivers/net/wireless/ath/ath10k/debug.h | 2 ++
>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/debug.c
>> b/drivers/net/wireless/ath/ath10k/debug.c
>> index e8250a665433..5433dbdde0e5 100644
>> --- a/drivers/net/wireless/ath/ath10k/debug.c
>> +++ b/drivers/net/wireless/ath/ath10k/debug.c
>> @@ -2718,7 +2718,7 @@ void ath10k_dbg_dump(struct ath10k *ar,
>>
>> /* tracing code doesn't like null strings :/ */
>> trace_ath10k_log_dbg_dump(ar, msg ? msg : "", prefix ? prefix
>> : "",
>> - buf, len);
>> + buf, min_t(size_t, len,
>> ATH10K_LOG_DUMP_MAX));
>> }
>> EXPORT_SYMBOL(ath10k_dbg_dump);
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/debug.h
>> b/drivers/net/wireless/ath/ath10k/debug.h
>> index 997c1c80aba7..cec9ba92f28f 100644
>> --- a/drivers/net/wireless/ath/ath10k/debug.h
>> +++ b/drivers/net/wireless/ath/ath10k/debug.h
>> @@ -11,6 +11,8 @@
>> #include <linux/types.h>
>> #include "trace.h"
>>
>> +#define ATH10K_LOG_DUMP_MAX 1024
>> +
>> enum ath10k_debug_mask {
>> ATH10K_DBG_PCI = 0x00000001,
>> ATH10K_DBG_WMI = 0x00000002,
>> --
>> 2.23.0
>>
>>
>> _______________________________________________
>> ath10k mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/ath10k

2021-02-10 02:18:57

by Brian Norris

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Tue, Feb 9, 2021 at 6:12 PM Wen Gong <[email protected]> wrote:
> On 2021-02-10 03:35, Brian Norris wrote:
> so this patch is to dump the top 1024 bytes only,
> its 1st goal is make log smaller.

Agreed. I wasn't objecting to this patch. I just wanted to highlight
the second part should probably also be considered as not exclusively
an ath10k bug:

> its 2nd effect is fix the crash issue,

but I guess if you're not testing the latest upstream, it's a little
hard to tell for sure whether the crash still exist at all anyway...

Brian

2021-02-10 02:28:52

by Wen Gong

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On 2021-02-10 10:14, Brian Norris wrote:
> On Tue, Feb 9, 2021 at 6:12 PM Wen Gong <[email protected]> wrote:
>> On 2021-02-10 03:35, Brian Norris wrote:
>> so this patch is to dump the top 1024 bytes only,
>> its 1st goal is make log smaller.
>
> Agreed. I wasn't objecting to this patch. I just wanted to highlight
> the second part should probably also be considered as not exclusively
> an ath10k bug:
>
>> its 2nd effect is fix the crash issue,
>
> but I guess if you're not testing the latest upstream, it's a little
> hard to tell for sure whether the crash still exist at all anyway...
>
Yes.
It is hard for me to test with the latest upstream, because the kernel
is v4.19 for my SDIO chip.
I do not have the environment with latest kernel.:).
If someone use the other environment(not V4.19+SDIO) which is the latest
kernel,
and if it not crash, it can not conclude it also will not crash on my
v4.19 for SDIO chip without this patch.:)
It will not crash with this patch on my v4.19 for SDIO, this have been
verified my me.
If someone want to check the same crash with latest kernel, he/she can
generate a simulate test and analyze with latest kernel.:)
> Brian

2021-02-10 16:35:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Wed, 10 Feb 2021 10:01:57 +0800
Wen Gong <[email protected]> wrote:

> Not tested with latest kernel.
> The reason is below which I said in
> https://lore.kernel.org/lkml/[email protected]/
>
> the per cpu buffer seems it is initilized in
> trace_buffered_event_enable,
> it is only 1 page size as below:
> void trace_buffered_event_enable(void)
> {
> ...
> for_each_tracing_cpu(cpu) {
> page = alloc_pages_node(cpu_to_node(cpu),
> GFP_KERNEL | __GFP_NORETRY, 0);
> If the size of buffer to trace is more than 1 page, such as 46680, then
> it trigger kernel crash/panic in my case while run trace-cmd.
> After debugging, the trace_file->flags in
> trace_event_buffer_lock_reserve is 0x40b while run trace-cmd, and it is
> 0x403 while collecting ftrace log.

Yeah, looking at this, I do see that this is a bug upstream.

Can you test this patch?

-- Steve


diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b79bcacdd6f9..a0b352aa23fe 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2739,7 +2739,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
(entry = this_cpu_read(trace_buffered_event))) {
/* Try to use the per cpu buffer first */
val = this_cpu_inc_return(trace_buffered_event_cnt);
- if (val == 1) {
+ if (len < PAGE_SIZE && val == 1) {
trace_event_setup(entry, type, trace_ctx);
entry->array[0] = len;
return entry;

2021-02-10 16:53:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Wed, 10 Feb 2021 11:30:26 -0500
Steven Rostedt <[email protected]> wrote:

> Can you test this patch?

v2:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b79bcacdd6f9..c5c3d826d420 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2739,7 +2739,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
(entry = this_cpu_read(trace_buffered_event))) {
/* Try to use the per cpu buffer first */
val = this_cpu_inc_return(trace_buffered_event_cnt);
- if (val == 1) {
+ if ((len < (PAGE_SIZE - sizeof(entry))) && val == 1) {
trace_event_setup(entry, type, trace_ctx);
entry->array[0] = len;
return entry;


-- Steve

2021-02-10 16:56:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On Wed, 10 Feb 2021 11:48:23 -0500
Steven Rostedt <[email protected]> wrote:

> On Wed, 10 Feb 2021 11:30:26 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > Can you test this patch?
>
> v2:

Bah!

v3:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b79bcacdd6f9..90d7ade03499 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2739,7 +2739,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
(entry = this_cpu_read(trace_buffered_event))) {
/* Try to use the per cpu buffer first */
val = this_cpu_inc_return(trace_buffered_event_cnt);
- if (val == 1) {
+ if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) {
trace_event_setup(entry, type, trace_ctx);
entry->array[0] = len;
return entry;

2021-03-08 08:46:11

by Wen Gong

[permalink] [raw]
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for large buffer size

On 2021-02-11 00:53, Steven Rostedt wrote:
> On Wed, 10 Feb 2021 11:48:23 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> On Wed, 10 Feb 2021 11:30:26 -0500
>> Steven Rostedt <[email protected]> wrote:
>>
>> > Can you test this patch?
>>
>> v2:
>
> Bah!
>
> v3:
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b79bcacdd6f9..90d7ade03499 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2739,7 +2739,7 @@ trace_event_buffer_lock_reserve(struct
> trace_buffer **current_rb,
> (entry = this_cpu_read(trace_buffered_event))) {
> /* Try to use the per cpu buffer first */
> val = this_cpu_inc_return(trace_buffered_event_cnt);
> - if (val == 1) {
> + if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) {
> trace_event_setup(entry, type, trace_ctx);
> entry->array[0] = len;
> return entry;
After apply this
patch(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/trace/trace.c?id=b220c049d5196dd94d992dd2dc8cba1a5e6123bf).
It not crash again on my v4.19 kernel.