2024-03-28 10:06:36

by Zenghui Yu

[permalink] [raw]
Subject: [report] WARN_ON_ONCE triggered in for_each_sibling_event()

Hi folks,

The following splat is triggered when I execute `perf stat -e
smmuv3_pmcg_100020/config_cache_miss/` on mainline kernel (built with
arm64-defconfig + PROVE_LOCKING).

| ------------[ cut here ]------------
| WARNING: CPU: 36 PID: 72452 at drivers/perf/arm_smmuv3_pmu.c:434
smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
| Modules linked in: xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c overlay
ip6table_filter ip6_tables xt_tcpudp iptable_filter ip_tables x_tables
md_mod arm_smmuv3_pmu hibmc_drm drm_vram_helper drm_ttm_helper ttm
drm_kms_helper spi_dw_mmio spi_dw fuse drm backlight crct10dif_ce
onboard_usb_hub xhci_pci xhci_pci_renesas hisi_sec2 hisi_qm uacce
authenc ipmi_si ipmi_devintf ipmi_msghandler dm_mod br_netfilter bridge
stp llc nvme nvme_core nbd ipv6
| CPU: 36 PID: 72452 Comm: perf Kdump: loaded Not tainted
6.9.0-rc1-00061-g8d025e2092e2-dirty #1700
| Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
| pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
| lr : smmu_pmu_event_init+0x290/0x2b0 [arm_smmuv3_pmu]
| sp : ffff8000c8ce3be0
| x29: ffff8000c8ce3be0 x28: 0000000000000000 x27: ffff8000802a2c1c
| x26: ffff8000c8ce3d70 x25: ffff8000802a2bc8 x24: 0000000000000000
| x23: 0000000000000001 x22: ffff0028045d52b0 x21: ffff002807228168
| x20: 0000000000000002 x19: ffff002807228000 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000040
| x11: ffff0020804016d0 x10: ffff0020804016d2 x9 : ffff800083b29b18
| x8 : ffff0020804016f8 x7 : 0000000000000000 x6 : ffff0020804018c0
| x5 : ffff0020804016d0 x4 : ffff80007bf07a58 x3 : 0000000000000002
| x2 : ffff802f5db37000 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
| perf_try_init_event+0x54/0x140| perf_event_alloc+0x3e4/0x1080
| __do_sys_perf_event_open+0x178/0xfa8
| __arm64_sys_perf_event_open+0x28/0x34
| invoke_syscall+0x48/0x114
| el0_svc_common.constprop.0+0x40/0xe0
| do_el0_svc+0x1c/0x28
| el0_svc+0x4c/0x11c
| el0t_64_sync_handler+0xc0/0xc4
| el0t_64_sync+0x190/0x194
| irq event stamp: 174338
| hardirqs last enabled at (174337): [<ffff800080357774>]
___slab_alloc+0x3bc/0xf38
| hardirqs last disabled at (174338): [<ffff8000812a7ee0>] el1_dbg+0x24/0x8c
| softirqs last enabled at (174292): [<ffff8000800185bc>]
fpsimd_restore_current_state+0x34/0xc4
| softirqs last disabled at (174290): [<ffff80008001858c>]
fpsimd_restore_current_state+0x4/0xc4
| ---[ end trace 0000000000000000 ]---

Note that this is not specific to the arm_smmuv3_pmu driver as I can
also reproduce it with some HiSilicon uncore PMU events (e.g., executing
`perf stat -e hisi_sccl1_ddrc0/flux_rd/`).

For your convenience, the assertion was added by commit f3c0eba28704
("perf: Add a few assertions").

Post it out for visibility, not sure if there are already similar
reports on the list though. Please have a look.

Thanks,
Zenghui


2024-04-03 09:59:04

by Mark Rutland

[permalink] [raw]
Subject: Re: [report] WARN_ON_ONCE triggered in for_each_sibling_event()

On Thu, Mar 28, 2024 at 06:06:21PM +0800, Zenghui Yu wrote:
> Hi folks,

Hi Zenghui,

> The following splat is triggered when I execute `perf stat -e
> smmuv3_pmcg_100020/config_cache_miss/` on mainline kernel (built with
> arm64-defconfig + PROVE_LOCKING).
>
> | ------------[ cut here ]------------
> | WARNING: CPU: 36 PID: 72452 at drivers/perf/arm_smmuv3_pmu.c:434
> smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
> | Modules linked in: xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack
> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c overlay
> ip6table_filter ip6_tables xt_tcpudp iptable_filter ip_tables x_tables
> md_mod arm_smmuv3_pmu hibmc_drm drm_vram_helper drm_ttm_helper ttm
> drm_kms_helper spi_dw_mmio spi_dw fuse drm backlight crct10dif_ce
> onboard_usb_hub xhci_pci xhci_pci_renesas hisi_sec2 hisi_qm uacce authenc
> ipmi_si ipmi_devintf ipmi_msghandler dm_mod br_netfilter bridge stp llc nvme
> nvme_core nbd ipv6
> | CPU: 36 PID: 72452 Comm: perf Kdump: loaded Not tainted
> 6.9.0-rc1-00061-g8d025e2092e2-dirty #1700
> | Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
> | pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> | pc : smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
> | lr : smmu_pmu_event_init+0x290/0x2b0 [arm_smmuv3_pmu]
> | sp : ffff8000c8ce3be0
> | x29: ffff8000c8ce3be0 x28: 0000000000000000 x27: ffff8000802a2c1c
> | x26: ffff8000c8ce3d70 x25: ffff8000802a2bc8 x24: 0000000000000000
> | x23: 0000000000000001 x22: ffff0028045d52b0 x21: ffff002807228168
> | x20: 0000000000000002 x19: ffff002807228000 x18: 0000000000000000
> | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000040
> | x11: ffff0020804016d0 x10: ffff0020804016d2 x9 : ffff800083b29b18
> | x8 : ffff0020804016f8 x7 : 0000000000000000 x6 : ffff0020804018c0
> | x5 : ffff0020804016d0 x4 : ffff80007bf07a58 x3 : 0000000000000002
> | x2 : ffff802f5db37000 x1 : 0000000000000000 x0 : 0000000000000000
> | Call trace:
> | smmu_pmu_event_init+0x298/0x2b0 [arm_smmuv3_pmu]
> | perf_try_init_event+0x54/0x140| perf_event_alloc+0x3e4/0x1080
> | __do_sys_perf_event_open+0x178/0xfa8
> | __arm64_sys_perf_event_open+0x28/0x34
> | invoke_syscall+0x48/0x114
> | el0_svc_common.constprop.0+0x40/0xe0
> | do_el0_svc+0x1c/0x28
> | el0_svc+0x4c/0x11c
> | el0t_64_sync_handler+0xc0/0xc4
> | el0t_64_sync+0x190/0x194
> | irq event stamp: 174338
> | hardirqs last enabled at (174337): [<ffff800080357774>]
> ___slab_alloc+0x3bc/0xf38
> | hardirqs last disabled at (174338): [<ffff8000812a7ee0>] el1_dbg+0x24/0x8c
> | softirqs last enabled at (174292): [<ffff8000800185bc>]
> fpsimd_restore_current_state+0x34/0xc4
> | softirqs last disabled at (174290): [<ffff80008001858c>]
> fpsimd_restore_current_state+0x4/0xc4
> | ---[ end trace 0000000000000000 ]---
>
> Note that this is not specific to the arm_smmuv3_pmu driver as I can
> also reproduce it with some HiSilicon uncore PMU events (e.g., executing
> `perf stat -e hisi_sccl1_ddrc0/flux_rd/`).
>
> For your convenience, the assertion was added by commit f3c0eba28704
> ("perf: Add a few assertions").
>
> Post it out for visibility, not sure if there are already similar
> reports on the list though. Please have a look.

Thanks for the report!

This is indeed a regression caused by commit:

f3c0eba28704 ("perf: Add a few assertions")

.. and it's concerning that we haven't had a report until now; we're clearly
not testing system/uncore PMUs all that much. :/

I don't want to remove the assertion, since that will catch missed locking that
has been the source of a number of nasty bugs.

I think we have to update all PMU drivers to avoid for_each_sibling_event()
when event == group_leader. That's the established practice in CPU PMU drivers,
and with that done I believe that we can tighten the requirements such that
for_each_sibling_event() can *only* be called for a group leader, which would
catch unintentional misuse in the core code.

Looking around, I see some other (related) problems in group validation in some
PMU drivers, so I'll spin a series to address those in one go. I'll try to get
that out in the next few days.

Mark.