2023-08-09 12:01:16

by Greg Kroah-Hartman

Subject: [PATCH 5.15 00/92] 5.15.126-rc1 review

This is the start of the stable review cycle for the 5.15.126 release.
There are 92 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.

Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.

thanks,

greg k-h
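
For anyone wanting to test the -rc, the fetch step described above can be sketched roughly as follows. The URL, git remote, and branch name are copied from the announcement; the helper names and the use of a detached checkout of `FETCH_HEAD` are illustrative assumptions, not part of the stable process. The sketch only builds and prints the commands and does not touch the network.

```python
# Sketch: build the git commands for fetching the 5.15.126-rc1 review tree.
# URL, remote and branch are taken from the announcement; the function names
# and the detached-HEAD checkout are assumptions made for illustration.

PATCH_URL = "https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz"
RC_REMOTE = "git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git"
RC_BRANCH = "linux-5.15.y"


def fetch_branch_cmd(remote=RC_REMOTE, branch=RC_BRANCH):
    """Return the argv for fetching the -rc branch into FETCH_HEAD."""
    return ["git", "fetch", remote, branch]


def checkout_cmd():
    """Return the argv for checking out what was just fetched."""
    return ["git", "checkout", "--detach", "FETCH_HEAD"]


# Print the commands instead of running them, so the sketch has no side effects.
for cmd in (fetch_branch_cmd(), checkout_cmd()):
    print(" ".join(cmd))
```

To actually run them, each list could be passed to `subprocess.run(cmd, check=True)` from inside an existing kernel checkout.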

-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <[email protected]>
Linux 5.15.126-rc1

Johan Hovold <[email protected]>
PM: sleep: wakeirq: fix wake irq arming

Chunfeng Yun <[email protected]>
PM / wakeirq: support enabling wake-up irq after runtime_suspend called

Johan Hovold <[email protected]>
soundwire: fix enumeration completion

Pierre-Louis Bossart <[email protected]>
soundwire: bus: pm_runtime_request_resume on peripheral attachment

Sean Christopherson <[email protected]>
selftests/rseq: Play nice with binaries statically linked against glibc 2.35+

Michael Jeanson <[email protected]>
selftests/rseq: check if libc rseq support is registered

Alexander Stein <[email protected]>
drm/imx/ipuv3: Fix front porch adjustment upon hactive aligning

Thomas Zimmermann <[email protected]>
drm/fsl-dcu: Use drm_plane_helper_destroy()

Aneesh Kumar K.V <[email protected]>
powerpc/mm/altmap: Fix altmap boundary check

Christophe JAILLET <[email protected]>
mtd: rawnand: fsl_upm: Fix an off-by one test in fun_exec_op()

Johan Jonker <[email protected]>
mtd: rawnand: rockchip: Align hwecc vs. raw page helper layouts

Johan Jonker <[email protected]>
mtd: rawnand: rockchip: fix oobfree offset and description

Roger Quadros <[email protected]>
mtd: rawnand: omap_elm: Fix incorrect type in assignment

Jan Kara <[email protected]>
ext2: Drop fragment support

Jan Kara <[email protected]>
fs: Protect reconfiguration of sb read-write from racing writes

Alan Stern <[email protected]>
net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb

Sungwoo Kim <[email protected]>
Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb

Prince Kumar Maurya <[email protected]>
fs/sysv: Null check to prevent null-ptr-deref bug

Tetsuo Handa <[email protected]>
fs/ntfs3: Use __GFP_NOWARN allocation at ntfs_load_attr_list()

Linus Torvalds <[email protected]>
file: reinstate f_pos locking optimization for regular files

Hou Tao <[email protected]>
bpf, cpumap: Make sure kthread is running before map update returns

Guchun Chen <[email protected]>
drm/ttm: check null pointer before accessing when swapping

Aleksa Sarai <[email protected]>
open: make RESOLVE_CACHED correctly test for O_TMPFILE

Jiri Olsa <[email protected]>
bpf: Disable preemption in bpf_event_output

Ilya Dryomov <[email protected]>
rbd: prevent busy loop when requesting exclusive lock

Paul Fertser <[email protected]>
wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)

Laszlo Ersek <[email protected]>
net: tap_open(): set sk_uid from current_fsuid()

Laszlo Ersek <[email protected]>
net: tun_chr_open(): set sk_uid from current_fsuid()

Dinh Nguyen <[email protected]>
arm64: dts: stratix10: fix incorrect I2C property for SCL signal

Arseniy Krasnov <[email protected]>
mtd: rawnand: meson: fix OOB available bytes for ECC

Olivier Maignial <[email protected]>
mtd: spinand: toshiba: Fix ecc_get_status

Sungjong Seo <[email protected]>
exfat: release s_lock before calling dir_emit()

gaoming <[email protected]>
exfat: use kvmalloc_array/kvfree instead of kmalloc_array/kfree

Krzysztof Kozlowski <[email protected]>
firmware: arm_scmi: Drop OF node reference in the transport channel setup

Xiubo Li <[email protected]>
ceph: defer stopping mdsc delayed_work

Ross Maynard <[email protected]>
USB: zaurus: Add ID for A-300/B-500/C-700

Ilya Dryomov <[email protected]>
libceph: fix potential hang in ceph_osdc_notify()

Michael Kelley <[email protected]>
scsi: storvsc: Limit max_sectors for virtual Fibre Channel devices

Steffen Maier <[email protected]>
scsi: zfcp: Defer fc_rport blocking until after ADISC response

Eric Dumazet <[email protected]>
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen

Eric Dumazet <[email protected]>
tcp_metrics: annotate data-races around tm->tcpm_net

Eric Dumazet <[email protected]>
tcp_metrics: annotate data-races around tm->tcpm_vals[]

Eric Dumazet <[email protected]>
tcp_metrics: annotate data-races around tm->tcpm_lock

Eric Dumazet <[email protected]>
tcp_metrics: annotate data-races around tm->tcpm_stamp

Eric Dumazet <[email protected]>
tcp_metrics: fix addr_same() helper

Jonas Gorski <[email protected]>
prestera: fix fallback to previous version on same major version

Jianbo Liu <[email protected]>
net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio

Jianbo Liu <[email protected]>
net/mlx5: fs_core: Make find_closest_ft more generic

Benjamin Poirier <[email protected]>
vxlan: Fix nexthop hash size

Yue Haibing <[email protected]>
ip6mr: Fix skb_under_panic in ip6mr_cache_report()

Alexandra Winter <[email protected]>
s390/qeth: Don't call dev_close/dev_open (DOWN/UP)

Lin Ma <[email protected]>
net: dcb: choose correct policy to parse DCB_ATTR_BCN

Mark Brown <[email protected]>
net: netsec: Ignore 'phy-mode' on SynQuacer in DT mode

Yuanjun Gong <[email protected]>
net: korina: handle clk prepare error in korina_probe()

Dan Carpenter <[email protected]>
net: ll_temac: fix error checking of irq_of_parse_and_map()

Yang Yingliang <[email protected]>
net: ll_temac: Switch to use dev_err_probe() helper

Tomas Glozar <[email protected]>
bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire

valis <[email protected]>
net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free

valis <[email protected]>
net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free

valis <[email protected]>
net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free

Hou Tao <[email protected]>
bpf, cpumap: Handle skb as well when clean up ptr_ring

Kuniyuki Iwashima <[email protected]>
net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.

Eric Dumazet <[email protected]>
net: add missing data-race annotation for sk_ll_usec

Eric Dumazet <[email protected]>
net: add missing data-race annotations around sk->sk_peek_off

Eric Dumazet <[email protected]>
net: add missing READ_ONCE(sk->sk_rcvbuf) annotation

Eric Dumazet <[email protected]>
net: add missing READ_ONCE(sk->sk_sndbuf) annotation

Eric Dumazet <[email protected]>
net: add missing READ_ONCE(sk->sk_rcvlowat) annotation

Eric Dumazet <[email protected]>
net: annotate data-races around sk->sk_max_pacing_rate

Konstantin Khorenko <[email protected]>
qed: Fix scheduling in a tasklet while getting stats

Prabhakar Kushwaha <[email protected]>
qed: Fix kernel-doc warnings

Chengfeng Ye <[email protected]>
mISDN: hfcpci: Fix potential deadlock on &hc->lock

Jamal Hadi Salim <[email protected]>
net: sched: cls_u32: Fix match key mis-addressing

Georg Müller <[email protected]>
perf test uprobe_from_different_cu: Skip if there is no gcc

Yuanjun Gong <[email protected]>
net: dsa: fix value check in bcm_sf2_sw_probe()

Lin Ma <[email protected]>
rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length

Lin Ma <[email protected]>
bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing

Yuanjun Gong <[email protected]>
net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()

Zhengchao Shao <[email protected]>
net/mlx5: DR, fix memory leak in mlx5dr_cmd_create_reformat_ctx

Ilan Peer <[email protected]>
wifi: cfg80211: Fix return value in scan logic

Heiko Carstens <[email protected]>
KVM: s390: fix sthyi error handling

[email protected] <[email protected]>
word-at-a-time: use the same return type for has_zero regardless of endianness

Cristian Marussi <[email protected]>
firmware: arm_scmi: Fix chan_free cleanup on SMC

Hugo Villeneuve <[email protected]>
arm64: dts: imx8mn-var-som: add missing pull-up for onboard PHY reset pinmux

Robin Murphy <[email protected]>
iommu/arm-smmu-v3: Document nesting-related errata

Robin Murphy <[email protected]>
iommu/arm-smmu-v3: Add explicit feature for nesting

Robin Murphy <[email protected]>
iommu/arm-smmu-v3: Document MMU-700 erratum 2812531

Robin Murphy <[email protected]>
iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982

Suzuki K Poulose <[email protected]>
arm64: errata: Add detection for TRBE write to out-of-range

Suzuki K Poulose <[email protected]>
arm64: errata: Add workaround for TSB flush failures

Shay Drory <[email protected]>
net/mlx5: Free irqs only on shutdown callback

Peter Zijlstra <[email protected]>
perf: Fix function pointer case

Jens Axboe <[email protected]>
io_uring: gate iowait schedule on having pending requests


-------------

Diffstat:

Documentation/arm64/silicon-errata.rst | 12 +
Makefile | 4 +-
arch/arm64/Kconfig | 74 ++
.../boot/dts/altera/socfpga_stratix10_socdk.dts | 2 +-
.../dts/altera/socfpga_stratix10_socdk_nand.dts | 2 +-
arch/arm64/boot/dts/freescale/imx8mn-var-som.dtsi | 2 +-
arch/arm64/include/asm/barrier.h | 16 +-
arch/arm64/kernel/cpu_errata.c | 39 +
arch/arm64/tools/cpucaps | 2 +
arch/powerpc/include/asm/word-at-a-time.h | 2 +-
arch/powerpc/mm/init_64.c | 3 +-
arch/s390/kernel/sthyi.c | 6 +-
arch/s390/kvm/intercept.c | 9 +-
drivers/base/power/power.h | 8 +-
drivers/base/power/runtime.c | 6 +-
drivers/base/power/wakeirq.c | 111 ++-
drivers/block/rbd.c | 28 +-
drivers/firmware/arm_scmi/mailbox.c | 4 +-
drivers/firmware/arm_scmi/smc.c | 21 +-
drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c | 8 +-
drivers/gpu/drm/imx/ipuv3-crtc.c | 2 +-
drivers/gpu/drm/ttm/ttm_bo.c | 3 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 +
drivers/isdn/hardware/mISDN/hfcpci.c | 10 +-
drivers/mtd/nand/raw/fsl_upm.c | 2 +-
drivers/mtd/nand/raw/meson_nand.c | 3 +-
drivers/mtd/nand/raw/omap_elm.c | 24 +-
drivers/mtd/nand/raw/rockchip-nand-controller.c | 45 +-
drivers/mtd/nand/spi/toshiba.c | 4 +-
drivers/net/dsa/bcm_sf2.c | 8 +-
drivers/net/ethernet/korina.c | 3 +-
.../net/ethernet/marvell/prestera/prestera_pci.c | 3 +-
.../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 4 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 105 ++-
drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 1 +
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 29 +
.../ethernet/mellanox/mlx5/core/steering/dr_cmd.c | 5 +-
drivers/net/ethernet/qlogic/qed/qed.h | 9 +-
drivers/net/ethernet/qlogic/qed/qed_cxt.h | 138 +--
drivers/net/ethernet/qlogic/qed/qed_dev_api.h | 361 ++++----
drivers/net/ethernet/qlogic/qed/qed_fcoe.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_fcoe.h | 17 +-
drivers/net/ethernet/qlogic/qed/qed_hsi.h | 922 +++++++++++----------
drivers/net/ethernet/qlogic/qed/qed_hw.c | 26 +-
drivers/net/ethernet/qlogic/qed/qed_hw.h | 214 ++---
drivers/net/ethernet/qlogic/qed/qed_init_ops.h | 58 +-
drivers/net/ethernet/qlogic/qed/qed_int.h | 274 +++---
drivers/net/ethernet/qlogic/qed/qed_iscsi.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_iscsi.h | 17 +-
drivers/net/ethernet/qlogic/qed/qed_l2.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_l2.h | 158 ++--
drivers/net/ethernet/qlogic/qed/qed_ll2.h | 130 +--
drivers/net/ethernet/qlogic/qed/qed_main.c | 6 +-
drivers/net/ethernet/qlogic/qed/qed_mcp.h | 757 +++++++++--------
drivers/net/ethernet/qlogic/qed/qed_selftest.h | 30 +-
drivers/net/ethernet/qlogic/qed/qed_sp.h | 215 +++--
drivers/net/ethernet/qlogic/qed/qed_sriov.h | 99 ++-
drivers/net/ethernet/qlogic/qed/qed_vf.h | 301 ++++---
drivers/net/ethernet/qlogic/qede/qede_main.c | 5 +-
drivers/net/ethernet/socionext/netsec.c | 11 +
drivers/net/ethernet/xilinx/ll_temac_main.c | 16 +-
drivers/net/tap.c | 2 +-
drivers/net/tun.c | 2 +-
drivers/net/usb/cdc_ether.c | 21 +
drivers/net/usb/usbnet.c | 6 +
drivers/net/usb/zaurus.c | 21 +
drivers/net/wireless/mediatek/mt76/mt7615/eeprom.c | 6 +-
drivers/s390/net/qeth_core.h | 1 -
drivers/s390/net/qeth_core_main.c | 2 -
drivers/s390/net/qeth_l2_main.c | 9 +-
drivers/s390/net/qeth_l3_main.c | 8 +-
drivers/s390/scsi/zfcp_fc.c | 6 +-
drivers/scsi/storvsc_drv.c | 4 +
drivers/soundwire/bus.c | 20 +-
fs/ceph/mds_client.c | 4 +-
fs/ceph/mds_client.h | 5 +
fs/ceph/super.c | 10 +
fs/exfat/balloc.c | 6 +-
fs/exfat/dir.c | 27 +-
fs/ext2/ext2.h | 12 -
fs/ext2/super.c | 23 +-
fs/file.c | 18 +-
fs/ntfs3/attrlist.c | 4 +-
fs/open.c | 2 +-
fs/super.c | 11 +-
fs/sysv/itree.c | 4 +
include/asm-generic/word-at-a-time.h | 2 +-
include/linux/pm_wakeirq.h | 9 +-
include/linux/qed/qed_chain.h | 97 ++-
include/linux/qed/qed_if.h | 255 +++---
include/linux/qed/qed_iscsi_if.h | 2 +-
include/linux/qed/qed_ll2_if.h | 42 +-
include/linux/qed/qed_nvmetcp_if.h | 17 +
include/net/vxlan.h | 4 +-
io_uring/io_uring.c | 23 +-
kernel/bpf/cpumap.c | 35 +-
kernel/events/core.c | 8 +-
kernel/trace/bpf_trace.c | 6 +-
net/bluetooth/l2cap_sock.c | 2 +
net/ceph/osd_client.c | 20 +-
net/core/bpf_sk_storage.c | 5 +-
net/core/rtnetlink.c | 8 +-
net/core/sock.c | 21 +-
net/core/sock_map.c | 2 -
net/dcb/dcbnl.c | 2 +-
net/ipv4/tcp_metrics.c | 70 +-
net/ipv6/ip6mr.c | 2 +-
net/sched/cls_fw.c | 1 -
net/sched/cls_route.c | 1 -
net/sched/cls_u32.c | 57 +-
net/sched/sch_taprio.c | 15 +-
net/unix/af_unix.c | 2 +-
net/wireless/scan.c | 2 +-
.../tests/shell/test_uprobe_from_different_cu.sh | 8 +-
tools/testing/selftests/rseq/rseq.c | 31 +-
117 files changed, 3227 insertions(+), 2247 deletions(-)




2023-08-09 14:15:50

by Joel Fernandes

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.

Not necessarily new with 5.15 stable, but 3 of the 19 rcutorture scenarios
hang with this -rc: TREE04, TREE07, TASKS03.

5.15 has a known stop machine issue where it hangs after 1.5 hours of CPU
hotplug rcutorture testing. tglx and I are continuing to debug this. The
issue shows up only on 5.15 stable kernels, not on any other kernel,
mainline included.

I will do some more runs to see whether the TASKS03 hang is new, but it could
be related to the existing issues.

thanks,

- Joel
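
For readers who want to rerun only the three scenarios named above, a rough sketch of the invocation follows. It assumes the in-tree `kvm.sh` rcutorture wrapper with its `--configs` and `--duration` options. The 120-minute duration is an assumed value, chosen only because it exceeds the 1.5 hours mentioned above, and is not a setting taken from this thread.

```python
# Sketch: assemble a kvm.sh command line for the three hanging scenarios.
# Scenario names come from the report above; the duration is an assumption.

KVM_SH = "tools/testing/selftests/rcutorture/bin/kvm.sh"
HANGING_SCENARIOS = ["TREE04", "TREE07", "TASKS03"]


def rcutorture_cmd(scenarios, duration_minutes=120):
    """Return the argv for running the given rcutorture scenarios."""
    # kvm.sh takes the scenario list as a single space-separated argument.
    return [
        KVM_SH,
        "--configs", " ".join(scenarios),
        "--duration", str(duration_minutes),
    ]


cmd = rcutorture_cmd(HANGING_SCENARIOS)
print(cmd)
```

The list is meant to be run from the top of a kernel source tree, for example with `subprocess.run(cmd, check=True)`.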



>
> thanks,
>
> greg k-h
>
> -------------
> Pseudo-Shortlog of commits:
>
> Greg Kroah-Hartman <[email protected]>
> Linux 5.15.126-rc1
>
> Johan Hovold <[email protected]>
> PM: sleep: wakeirq: fix wake irq arming
>
> Chunfeng Yun <[email protected]>
> PM / wakeirq: support enabling wake-up irq after runtime_suspend called
>
> Johan Hovold <[email protected]>
> soundwire: fix enumeration completion
>
> Pierre-Louis Bossart <[email protected]>
> soundwire: bus: pm_runtime_request_resume on peripheral attachment
>
> Sean Christopherson <[email protected]>
> selftests/rseq: Play nice with binaries statically linked against glibc 2.35+
>
> Michael Jeanson <[email protected]>
> selftests/rseq: check if libc rseq support is registered
>
> Alexander Stein <[email protected]>
> drm/imx/ipuv3: Fix front porch adjustment upon hactive aligning
>
> Thomas Zimmermann <[email protected]>
> drm/fsl-dcu: Use drm_plane_helper_destroy()
>
> Aneesh Kumar K.V <[email protected]>
> powerpc/mm/altmap: Fix altmap boundary check
>
> Christophe JAILLET <[email protected]>
> mtd: rawnand: fsl_upm: Fix an off-by one test in fun_exec_op()
>
> Johan Jonker <[email protected]>
> mtd: rawnand: rockchip: Align hwecc vs. raw page helper layouts
>
> Johan Jonker <[email protected]>
> mtd: rawnand: rockchip: fix oobfree offset and description
>
> Roger Quadros <[email protected]>
> mtd: rawnand: omap_elm: Fix incorrect type in assignment
>
> Jan Kara <[email protected]>
> ext2: Drop fragment support
>
> Jan Kara <[email protected]>
> fs: Protect reconfiguration of sb read-write from racing writes
>
> Alan Stern <[email protected]>
> net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb
>
> Sungwoo Kim <[email protected]>
> Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
>
> Prince Kumar Maurya <[email protected]>
> fs/sysv: Null check to prevent null-ptr-deref bug
>
> Tetsuo Handa <[email protected]>
> fs/ntfs3: Use __GFP_NOWARN allocation at ntfs_load_attr_list()
>
> Linus Torvalds <[email protected]>
> file: reinstate f_pos locking optimization for regular files
>
> Hou Tao <[email protected]>
> bpf, cpumap: Make sure kthread is running before map update returns
>
> Guchun Chen <[email protected]>
> drm/ttm: check null pointer before accessing when swapping
>
> Aleksa Sarai <[email protected]>
> open: make RESOLVE_CACHED correctly test for O_TMPFILE
>
> Jiri Olsa <[email protected]>
> bpf: Disable preemption in bpf_event_output
>
> Ilya Dryomov <[email protected]>
> rbd: prevent busy loop when requesting exclusive lock
>
> Paul Fertser <[email protected]>
> wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)
>
> Laszlo Ersek <[email protected]>
> net: tap_open(): set sk_uid from current_fsuid()
>
> Laszlo Ersek <[email protected]>
> net: tun_chr_open(): set sk_uid from current_fsuid()
>
> Dinh Nguyen <[email protected]>
> arm64: dts: stratix10: fix incorrect I2C property for SCL signal
>
> Arseniy Krasnov <[email protected]>
> mtd: rawnand: meson: fix OOB available bytes for ECC
>
> Olivier Maignial <[email protected]>
> mtd: spinand: toshiba: Fix ecc_get_status
>
> Sungjong Seo <[email protected]>
> exfat: release s_lock before calling dir_emit()
>
> gaoming <[email protected]>
> exfat: use kvmalloc_array/kvfree instead of kmalloc_array/kfree
>
> Krzysztof Kozlowski <[email protected]>
> firmware: arm_scmi: Drop OF node reference in the transport channel setup
>
> Xiubo Li <[email protected]>
> ceph: defer stopping mdsc delayed_work
>
> Ross Maynard <[email protected]>
> USB: zaurus: Add ID for A-300/B-500/C-700
>
> Ilya Dryomov <[email protected]>
> libceph: fix potential hang in ceph_osdc_notify()
>
> Michael Kelley <[email protected]>
> scsi: storvsc: Limit max_sectors for virtual Fibre Channel devices
>
> Steffen Maier <[email protected]>
> scsi: zfcp: Defer fc_rport blocking until after ADISC response
>
> Eric Dumazet <[email protected]>
> tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
>
> Eric Dumazet <[email protected]>
> tcp_metrics: annotate data-races around tm->tcpm_net
>
> Eric Dumazet <[email protected]>
> tcp_metrics: annotate data-races around tm->tcpm_vals[]
>
> Eric Dumazet <[email protected]>
> tcp_metrics: annotate data-races around tm->tcpm_lock
>
> Eric Dumazet <[email protected]>
> tcp_metrics: annotate data-races around tm->tcpm_stamp
>
> Eric Dumazet <[email protected]>
> tcp_metrics: fix addr_same() helper
>
> Jonas Gorski <[email protected]>
> prestera: fix fallback to previous version on same major version
>
> Jianbo Liu <[email protected]>
> net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
>
> Jianbo Liu <[email protected]>
> net/mlx5: fs_core: Make find_closest_ft more generic
>
> Benjamin Poirier <[email protected]>
> vxlan: Fix nexthop hash size
>
> Yue Haibing <[email protected]>
> ip6mr: Fix skb_under_panic in ip6mr_cache_report()
>
> Alexandra Winter <[email protected]>
> s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
>
> Lin Ma <[email protected]>
> net: dcb: choose correct policy to parse DCB_ATTR_BCN
>
> Mark Brown <[email protected]>
> net: netsec: Ignore 'phy-mode' on SynQuacer in DT mode
>
> Yuanjun Gong <[email protected]>
> net: korina: handle clk prepare error in korina_probe()
>
> Dan Carpenter <[email protected]>
> net: ll_temac: fix error checking of irq_of_parse_and_map()
>
> Yang Yingliang <[email protected]>
> net: ll_temac: Switch to use dev_err_probe() helper
>
> Tomas Glozar <[email protected]>
> bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
>
> valis <[email protected]>
> net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
>
> valis <[email protected]>
> net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
>
> valis <[email protected]>
> net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
>
> Hou Tao <[email protected]>
> bpf, cpumap: Handle skb as well when clean up ptr_ring
>
> Kuniyuki Iwashima <[email protected]>
> net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
>
> Eric Dumazet <[email protected]>
> net: add missing data-race annotation for sk_ll_usec
>
> Eric Dumazet <[email protected]>
> net: add missing data-race annotations around sk->sk_peek_off
>
> Eric Dumazet <[email protected]>
> net: add missing READ_ONCE(sk->sk_rcvbuf) annotation
>
> Eric Dumazet <[email protected]>
> net: add missing READ_ONCE(sk->sk_sndbuf) annotation
>
> Eric Dumazet <[email protected]>
> net: add missing READ_ONCE(sk->sk_rcvlowat) annotation
>
> Eric Dumazet <[email protected]>
> net: annotate data-races around sk->sk_max_pacing_rate
>
> Konstantin Khorenko <[email protected]>
> qed: Fix scheduling in a tasklet while getting stats
>
> Prabhakar Kushwaha <[email protected]>
> qed: Fix kernel-doc warnings
>
> Chengfeng Ye <[email protected]>
> mISDN: hfcpci: Fix potential deadlock on &hc->lock
>
> Jamal Hadi Salim <[email protected]>
> net: sched: cls_u32: Fix match key mis-addressing
>
> Georg Müller <[email protected]>
> perf test uprobe_from_different_cu: Skip if there is no gcc
>
> Yuanjun Gong <[email protected]>
> net: dsa: fix value check in bcm_sf2_sw_probe()
>
> Lin Ma <[email protected]>
> rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length
>
> Lin Ma <[email protected]>
> bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
>
> Yuanjun Gong <[email protected]>
> net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
>
> Zhengchao Shao <[email protected]>
> net/mlx5: DR, fix memory leak in mlx5dr_cmd_create_reformat_ctx
>
> Ilan Peer <[email protected]>
> wifi: cfg80211: Fix return value in scan logic
>
> Heiko Carstens <[email protected]>
> KVM: s390: fix sthyi error handling
>
> [email protected] <[email protected]>
> word-at-a-time: use the same return type for has_zero regardless of endianness
>
> Cristian Marussi <[email protected]>
> firmware: arm_scmi: Fix chan_free cleanup on SMC
>
> Hugo Villeneuve <[email protected]>
> arm64: dts: imx8mn-var-som: add missing pull-up for onboard PHY reset pinmux
>
> Robin Murphy <[email protected]>
> iommu/arm-smmu-v3: Document nesting-related errata
>
> Robin Murphy <[email protected]>
> iommu/arm-smmu-v3: Add explicit feature for nesting
>
> Robin Murphy <[email protected]>
> iommu/arm-smmu-v3: Document MMU-700 erratum 2812531
>
> Robin Murphy <[email protected]>
> iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982
>
> Suzuki K Poulose <[email protected]>
> arm64: errata: Add detection for TRBE write to out-of-range
>
> Suzuki K Poulose <[email protected]>
> arm64: errata: Add workaround for TSB flush failures
>
> Shay Drory <[email protected]>
> net/mlx5: Free irqs only on shutdown callback
>
> Peter Zijlstra <[email protected]>
> perf: Fix function pointer case
>
> Jens Axboe <[email protected]>
> io_uring: gate iowait schedule on having pending requests
>
>
> -------------
>
> Diffstat:
>
> Documentation/arm64/silicon-errata.rst | 12 +
> Makefile | 4 +-
> arch/arm64/Kconfig | 74 ++
> .../boot/dts/altera/socfpga_stratix10_socdk.dts | 2 +-
> .../dts/altera/socfpga_stratix10_socdk_nand.dts | 2 +-
> arch/arm64/boot/dts/freescale/imx8mn-var-som.dtsi | 2 +-
> arch/arm64/include/asm/barrier.h | 16 +-
> arch/arm64/kernel/cpu_errata.c | 39 +
> arch/arm64/tools/cpucaps | 2 +
> arch/powerpc/include/asm/word-at-a-time.h | 2 +-
> arch/powerpc/mm/init_64.c | 3 +-
> arch/s390/kernel/sthyi.c | 6 +-
> arch/s390/kvm/intercept.c | 9 +-
> drivers/base/power/power.h | 8 +-
> drivers/base/power/runtime.c | 6 +-
> drivers/base/power/wakeirq.c | 111 ++-
> drivers/block/rbd.c | 28 +-
> drivers/firmware/arm_scmi/mailbox.c | 4 +-
> drivers/firmware/arm_scmi/smc.c | 21 +-
> drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c | 8 +-
> drivers/gpu/drm/imx/ipuv3-crtc.c | 2 +-
> drivers/gpu/drm/ttm/ttm_bo.c | 3 +-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 +
> drivers/isdn/hardware/mISDN/hfcpci.c | 10 +-
> drivers/mtd/nand/raw/fsl_upm.c | 2 +-
> drivers/mtd/nand/raw/meson_nand.c | 3 +-
> drivers/mtd/nand/raw/omap_elm.c | 24 +-
> drivers/mtd/nand/raw/rockchip-nand-controller.c | 45 +-
> drivers/mtd/nand/spi/toshiba.c | 4 +-
> drivers/net/dsa/bcm_sf2.c | 8 +-
> drivers/net/ethernet/korina.c | 3 +-
> .../net/ethernet/marvell/prestera/prestera_pci.c | 3 +-
> .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 4 +-
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 105 ++-
> drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 1 +
> drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 29 +
> .../ethernet/mellanox/mlx5/core/steering/dr_cmd.c | 5 +-
> drivers/net/ethernet/qlogic/qed/qed.h | 9 +-
> drivers/net/ethernet/qlogic/qed/qed_cxt.h | 138 +--
> drivers/net/ethernet/qlogic/qed/qed_dev_api.h | 361 ++++----
> drivers/net/ethernet/qlogic/qed/qed_fcoe.c | 19 +-
> drivers/net/ethernet/qlogic/qed/qed_fcoe.h | 17 +-
> drivers/net/ethernet/qlogic/qed/qed_hsi.h | 922 +++++++++++----------
> drivers/net/ethernet/qlogic/qed/qed_hw.c | 26 +-
> drivers/net/ethernet/qlogic/qed/qed_hw.h | 214 ++---
> drivers/net/ethernet/qlogic/qed/qed_init_ops.h | 58 +-
> drivers/net/ethernet/qlogic/qed/qed_int.h | 274 +++---
> drivers/net/ethernet/qlogic/qed/qed_iscsi.c | 19 +-
> drivers/net/ethernet/qlogic/qed/qed_iscsi.h | 17 +-
> drivers/net/ethernet/qlogic/qed/qed_l2.c | 19 +-
> drivers/net/ethernet/qlogic/qed/qed_l2.h | 158 ++--
> drivers/net/ethernet/qlogic/qed/qed_ll2.h | 130 +--
> drivers/net/ethernet/qlogic/qed/qed_main.c | 6 +-
> drivers/net/ethernet/qlogic/qed/qed_mcp.h | 757 +++++++++--------
> drivers/net/ethernet/qlogic/qed/qed_selftest.h | 30 +-
> drivers/net/ethernet/qlogic/qed/qed_sp.h | 215 +++--
> drivers/net/ethernet/qlogic/qed/qed_sriov.h | 99 ++-
> drivers/net/ethernet/qlogic/qed/qed_vf.h | 301 ++++---
> drivers/net/ethernet/qlogic/qede/qede_main.c | 5 +-
> drivers/net/ethernet/socionext/netsec.c | 11 +
> drivers/net/ethernet/xilinx/ll_temac_main.c | 16 +-
> drivers/net/tap.c | 2 +-
> drivers/net/tun.c | 2 +-
> drivers/net/usb/cdc_ether.c | 21 +
> drivers/net/usb/usbnet.c | 6 +
> drivers/net/usb/zaurus.c | 21 +
> drivers/net/wireless/mediatek/mt76/mt7615/eeprom.c | 6 +-
> drivers/s390/net/qeth_core.h | 1 -
> drivers/s390/net/qeth_core_main.c | 2 -
> drivers/s390/net/qeth_l2_main.c | 9 +-
> drivers/s390/net/qeth_l3_main.c | 8 +-
> drivers/s390/scsi/zfcp_fc.c | 6 +-
> drivers/scsi/storvsc_drv.c | 4 +
> drivers/soundwire/bus.c | 20 +-
> fs/ceph/mds_client.c | 4 +-
> fs/ceph/mds_client.h | 5 +
> fs/ceph/super.c | 10 +
> fs/exfat/balloc.c | 6 +-
> fs/exfat/dir.c | 27 +-
> fs/ext2/ext2.h | 12 -
> fs/ext2/super.c | 23 +-
> fs/file.c | 18 +-
> fs/ntfs3/attrlist.c | 4 +-
> fs/open.c | 2 +-
> fs/super.c | 11 +-
> fs/sysv/itree.c | 4 +
> include/asm-generic/word-at-a-time.h | 2 +-
> include/linux/pm_wakeirq.h | 9 +-
> include/linux/qed/qed_chain.h | 97 ++-
> include/linux/qed/qed_if.h | 255 +++---
> include/linux/qed/qed_iscsi_if.h | 2 +-
> include/linux/qed/qed_ll2_if.h | 42 +-
> include/linux/qed/qed_nvmetcp_if.h | 17 +
> include/net/vxlan.h | 4 +-
> io_uring/io_uring.c | 23 +-
> kernel/bpf/cpumap.c | 35 +-
> kernel/events/core.c | 8 +-
> kernel/trace/bpf_trace.c | 6 +-
> net/bluetooth/l2cap_sock.c | 2 +
> net/ceph/osd_client.c | 20 +-
> net/core/bpf_sk_storage.c | 5 +-
> net/core/rtnetlink.c | 8 +-
> net/core/sock.c | 21 +-
> net/core/sock_map.c | 2 -
> net/dcb/dcbnl.c | 2 +-
> net/ipv4/tcp_metrics.c | 70 +-
> net/ipv6/ip6mr.c | 2 +-
> net/sched/cls_fw.c | 1 -
> net/sched/cls_route.c | 1 -
> net/sched/cls_u32.c | 57 +-
> net/sched/sch_taprio.c | 15 +-
> net/unix/af_unix.c | 2 +-
> net/wireless/scan.c | 2 +-
> .../tests/shell/test_uprobe_from_different_cu.sh | 8 +-
> tools/testing/selftests/rseq/rseq.c | 31 +-
> 117 files changed, 3227 insertions(+), 2247 deletions(-)
>
>

2023-08-09 16:58:54

by Guenter Roeck

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 06:53, Joel Fernandes wrote:
> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>> This is the start of the stable review cycle for the 5.15.126 release.
>> There are 92 patches in this series, all will be posted as a response
>> to this one. If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>> Anything received after that time might be too late.
>>
>> The whole patch series can be found in one patch at:
>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>> or in the git tree and branch at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>> and the diffstat can be found below.
>
> Not necessarily new with 5.15 stable, but 3 of the 19 rcutorture scenarios
> hang with this -rc: TREE04, TREE07, TASKS03.
>
> 5.15 has a known stop machine issue where it hangs after 1.5 hours of CPU
> hotplug rcutorture testing. tglx and I are continuing to debug this. The
> issue shows up only on 5.15 stable kernels, not on any other kernel,
> mainline included.
>

Do you by any chance have a crash pattern that we could use to find the crash
in ChromeOS crash logs? No idea if that would help, but it could provide some
additional data points.

Thanks,
Guenter

> I will do some more runs to see whether the TASKS03 hang is new, but it could
> be related to the existing issues.
>
> thanks,
>
> - Joel
>
>
>


2023-08-09 17:29:21

by SeongJae Park

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

Hello,

On 2023-08-09T12:40:36+02:00 Greg Kroah-Hartman <[email protected]> wrote:

> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.

This rc kernel passes the DAMON functionality test[1] on my test machine.
The test results summary is attached below. Please note that I retrieved the
kernel from the linux-stable-rc tree[2].

Tested-by: SeongJae Park <[email protected]>

[1] https://github.com/awslabs/damon-tests/tree/next/corr
[2] ae7f23cbf199 ("Linux 5.15.126-rc1")


Thanks,
SJ

[...]

---

ok 1 selftests: damon: debugfs_attrs.sh
ok 1 selftests: damon-tests: kunit.sh
ok 2 selftests: damon-tests: huge_count_read_write.sh
ok 3 selftests: damon-tests: buffer_overflow.sh
ok 4 selftests: damon-tests: rm_contexts.sh
ok 5 selftests: damon-tests: record_null_deref.sh
ok 6 selftests: damon-tests: dbgfs_target_ids_read_before_terminate_race.sh
ok 7 selftests: damon-tests: dbgfs_target_ids_pid_leak.sh
ok 8 selftests: damon-tests: damo_tests.sh
ok 9 selftests: damon-tests: masim-record.sh
ok 10 selftests: damon-tests: build_i386.sh
ok 11 selftests: damon-tests: build_m68k.sh
ok 12 selftests: damon-tests: build_arm64.sh
ok 13 selftests: damon-tests: build_i386_idle_flag.sh
ok 14 selftests: damon-tests: build_i386_highpte.sh
ok 15 selftests: damon-tests: build_nomemcg.sh

PASS

2023-08-09 19:14:02

by Joel Fernandes

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 9, 2023 at 2:35 PM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> >
> > On 8/9/23 06:53, Joel Fernandes wrote:
> > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > >> This is the start of the stable review cycle for the 5.15.126 release.
> > >> There are 92 patches in this series, all will be posted as a response
> > >> to this one. If anyone has any issues with these being applied, please
> > >> let me know.
> > >>
> > >> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > >> Anything received after that time might be too late.
> > >>
> > >> The whole patch series can be found in one patch at:
> > >> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > >> or in the git tree and branch at:
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > >> and the diffstat can be found below.
> > >
> > > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > hang with this -rc: TREE04, TREE07, TASKS03.
> > >
> > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > mainline.
> > >
> >
> > Do you by any chance have a crash pattern that we could possibly use to find the crash
> > in ChromeOS crash logs? No idea if that would help, but it could provide some
> > additional data points.
>
> The pattern shows as a hard hang, the system is unresponsive and all CPUs
> are stuck in stop_machine. Sometimes it recovers on its own from the
> hang and then RCU immediately gives stall warnings. It takes 1.5 hours
> to reproduce and sometimes never happens for several hours.
>
> It appears related to CPU hotplug since gdb showed me most of the CPUs
> are spinning in multi_cpu_stop() / stop machine after the hang.
>

Adding to this, it appears one of the CPUs is constantly firing and
reprogramming hrtimer events every few hundred microseconds for some
reason (I see this in gdb). My debug angle right now is to figure out
why it does that. Collecting a trace is hard, though: it appears trace
collection may not even be happening once the system hangs, so the only
traces I am getting are from after the hang recovers, not from during
it. I am also trying to see if multi_cpu_stop() can panic the kernel if
it sits there too long.

- Joel

2023-08-09 19:39:53

by Joel Fernandes

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
>
> On 8/9/23 06:53, Joel Fernandes wrote:
> > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> >> This is the start of the stable review cycle for the 5.15.126 release.
> >> There are 92 patches in this series, all will be posted as a response
> >> to this one. If anyone has any issues with these being applied, please
> >> let me know.
> >>
> >> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> >> Anything received after that time might be too late.
> >>
> >> The whole patch series can be found in one patch at:
> >> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> >> or in the git tree and branch at:
> >> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> >> and the diffstat can be found below.
> >
> > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > hang with this -rc: TREE04, TREE07, TASKS03.
> >
> > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > issue does not show up on anything but 5.15 stable kernels and neither on
> > mainline.
> >
>
> Do you by any chance have a crash pattern that we could possibly use to find the crash
> in ChromeOS crash logs? No idea if that would help, but it could provide some
> additional data points.

The pattern shows as a hard hang: the system is unresponsive and all CPUs
are stuck in stop_machine. Sometimes it recovers on its own from the
hang and then RCU immediately gives stall warnings. It takes about 1.5 hours
to reproduce, and sometimes it does not happen for several hours.

It appears related to CPU hotplug since gdb showed me most of the CPUs
are spinning in multi_cpu_stop() / stop machine after the hang.

thanks,

- Joel

2023-08-09 20:09:50

by Florian Fainelli

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 03:40, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

SCMI with the SMC transport fails to build because of "[PATCH 5.15
11/92] firmware: arm_scmi: Fix chan_free cleanup on SMC"; the specific
details have been reported in reply to that patch. Here is the build
failure FWIW:

drivers/firmware/arm_scmi/smc.c:39:6: error: duplicate member 'irq'
int irq;
^~~
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:20: error: 'irq' undeclared (first use in this function); did you mean 'rq'?
scmi_info->irq = irq;
^~~
rq
drivers/firmware/arm_scmi/smc.c:118:20: note: each undeclared identifier is reported only once for each function it appears in
CC drivers/mmc/core/slot-gpio.o
host-make[5]: *** [scripts/Makefile.build:289: drivers/firmware/arm_scmi/smc.o] Error 1
host-make[5]: *** Waiting for unfinished jobs....
--
Florian


2023-08-09 20:43:38

by Guenter Roeck

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> >
> > On 8/9/23 06:53, Joel Fernandes wrote:
> > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > >> This is the start of the stable review cycle for the 5.15.126 release.
> > >> There are 92 patches in this series, all will be posted as a response
> > >> to this one. If anyone has any issues with these being applied, please
> > >> let me know.
> > >>
> > >> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > >> Anything received after that time might be too late.
> > >>
> > >> The whole patch series can be found in one patch at:
> > >> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > >> or in the git tree and branch at:
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > >> and the diffstat can be found below.
> > >
> > > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > hang with this -rc: TREE04, TREE07, TASKS03.
> > >
> > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > mainline.
> > >
> >
> > Do you by any chance have a crash pattern that we could possibly use to find the crash
> > in ChromeOS crash logs? No idea if that would help, but it could provide some
> > additional data points.
>
> The pattern shows as a hard hang, the system is unresponsive and all CPUs
> are stuck in stop_machine. Sometimes it recovers on its own from the
> hang and then RCU immediately gives stall warnings. It takes 1.5 hours
> to reproduce and sometimes never happens for several hours.
>
> It appears related to CPU hotplug since gdb showed me most of the CPUs
> are spinning in multi_cpu_stop() / stop machine after the hang.
>

Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
though with v5.4.y rather than v5.15.y. The actual hang is in stop_machine_yield().
Example:

<0>[63298.624328] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]
<4>[63298.624331] Modules linked in: 8021q ccm snd_seq_dummy snd_seq snd_seq_device bridge stp llc tun nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp esp6 ah6 ip6t_REJECT ip6t_ipv6header vhost_vsock vhost vmw_vsock_virtio_transport_common vsock veth rfcomm xt_cgroup cmac algif_hash algif_skcipher af_alg xt_MASQUERADE uinput iwlmvm snd_soc_skl_ssp_clk iwl7000_mac80211 btusb snd_soc_kbl_da7219_max98357a btrtl btintel snd_soc_hdac_hdmi btbcm bluetooth snd_soc_dmic snd_soc_skl ecdh_generic ecc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_hdac_hda uvcvideo snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core videobuf2_vmalloc videobuf2_v4l2 videobuf2_common snd_intel_dspcfg videobuf2_memops snd_hda_codec snd_hwdep snd_hda_core iwlwifi snd_soc_da7219 snd_soc_max98357a fuse ip6table_nat cfg80211 lzo_rle lzo_compress zram joydev
<4>[63298.624357] CPU: 0 PID: 11 Comm: migration/0 Tainted: G U W 5.4.180-17902-g44152654f29b #1
<4>[63298.624358] Hardware name: Google Nami/Nami, BIOS Google_Nami.10775.145.0 09/19/2019
<4>[63298.624363] RIP: 0010:stop_machine_yield+0xb/0xd
<4>[63298.624366] Code: ff 74 b6 f0 ff 0f 75 b1 48 83 c7 08 e8 1f cb f9 ff eb a6 e8 a0 20 e3 ff eb bc e8 50 4b f5 ff 0f 1f 44 00 00 55 48 89 e5 f3 90 <5d> c3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 81
<4>[63298.624368] RSP: 0000:ffffbaf90006fe38 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
<4>[63298.624370] RAX: 0000000000000000 RBX: ffffbaf90300bca8 RCX: 0000000000000000
<4>[63298.624371] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffffb0d46920
<4>[63298.624373] RBP: ffffbaf90006fe38 R08: 0000000000000002 R09: 0000398ecf9a0ac5
<4>[63298.624374] R10: 0000000000000171 R11: ffffffffaf9cfb11 R12: 0000000000000001
<4>[63298.624376] R13: ffff9b09baa22201 R14: ffffffffb0d46920 R15: 0000000000000001
<4>[63298.624377] FS: 0000000000000000(0000) GS:ffff9b09baa00000(0000) knlGS:0000000000000000
<4>[63298.624379] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[63298.624380] CR2: 0000153c00724820 CR3: 0000000171ab8005 CR4: 00000000003606f0
<4>[63298.624382] Call Trace:
<4>[63298.624386] multi_cpu_stop+0x89/0x119
<4>[63298.624389] ? stop_two_cpus+0x24d/0x24d
<4>[63298.624391] cpu_stopper_thread+0x8f/0x111
<4>[63298.624394] smpboot_thread_fn+0x174/0x212
<4>[63298.624397] kthread+0x147/0x156
<4>[63298.624399] ? cpu_report_death+0x43/0x43
<4>[63298.624401] ? kthread_blkcg+0x2e/0x2e
<4>[63298.624404] ret_from_fork+0x35/0x40
<0>[63298.624407] Kernel panic - not syncing: softlockup: hung tasks

I guess that is something different?
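
As a rough illustration of how a pattern like the one above could be
searched for in collected crash logs, here is a minimal Python sketch. It
assumes logs shaped like the example (a "soft lockup" watchdog line
followed by RIP and call-trace lines). The function name and the regular
expressions are hypothetical, written for this example rather than taken
from any existing tool, and "?" frames are skipped as unreliable.

```python
import re

# Watchdog header line, e.g.
# "watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]"
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]]+)\]"
)
# Faulting instruction pointer, e.g. "RIP: 0010:stop_machine_yield+0xb/0xd"
RIP_RE = re.compile(r"RIP: \w+:(\w+)\+0x")
# Reliable call-trace frame, e.g. "] multi_cpu_stop+0x89/0x119".
# Frames prefixed with "? " do not match and are therefore skipped.
FRAME_RE = re.compile(r"\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_stop_machine_lockups(lines):
    """Return soft-lockup reports whose call trace contains multi_cpu_stop."""
    reports = []
    current = None
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            current = {
                "cpu": int(m.group(1)),
                "secs": int(m.group(2)),
                "task": m.group(3),
                "rip": None,
                "frames": [],
            }
            reports.append(current)
            continue
        if current is None:
            continue  # ignore everything before the first lockup header
        m = RIP_RE.search(line)
        if m and current["rip"] is None:
            current["rip"] = m.group(1)
            continue
        m = FRAME_RE.search(line)
        if m:
            current["frames"].append(m.group(1))
    return [r for r in reports if "multi_cpu_stop" in r["frames"]]


# A few lines taken from the example log above.
SAMPLE = """\
<0>[63298.624328] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]
<4>[63298.624363] RIP: 0010:stop_machine_yield+0xb/0xd
<4>[63298.624382] Call Trace:
<4>[63298.624386] multi_cpu_stop+0x89/0x119
<4>[63298.624389] ? stop_two_cpus+0x24d/0x24d
<4>[63298.624391] cpu_stopper_thread+0x8f/0x111
"""

if __name__ == "__main__":
    for report in find_stop_machine_lockups(SAMPLE.splitlines()):
        print(report["cpu"], report["task"], report["rip"], report["frames"])
```

Keying on multi_cpu_stop in the trace rather than on the RIP symbol is a
deliberate choice here: the CPU may be caught in stop_machine_yield() or
elsewhere in the spin loop, while the caller stays the same.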

Guenter

2023-08-09 21:26:23

by Joel Fernandes

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> > On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> > >
> > > On 8/9/23 06:53, Joel Fernandes wrote:
> > > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > > >> This is the start of the stable review cycle for the 5.15.126 release.
> > > >> There are 92 patches in this series, all will be posted as a response
> > > >> to this one. If anyone has any issues with these being applied, please
> > > >> let me know.
> > > >>
> > > >> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > >> Anything received after that time might be too late.
> > > >>
> > > >> The whole patch series can be found in one patch at:
> > > >> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > > >> or in the git tree and branch at:
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > >> and the diffstat can be found below.
> > > >
> > > > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > > hang with this -rc: TREE04, TREE07, TASKS03.
> > > >
> > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > > mainline.
> > > >
> > >
> > > Do you by any chance have a crash pattern that we could possibly use to find the crash
> > > in ChromeOS crash logs? No idea if that would help, but it could provide some
> > > additional data points.
> >
> > The pattern shows as a hard hang, the system is unresponsive and all CPUs
> > are stuck in stop_machine. Sometimes it recovers on its own from the
> > hang and then RCU immediately gives stall warnings. It takes 1.5 hours
> > to reproduce and sometimes never happens for several hours.
> >
> > It appears related to CPU hotplug since gdb showed me most of the CPUs
> > are spinning in multi_cpu_stop() / stop machine after the hang.
> >
>
> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().

Interesting. It looks similar as far as the stack dump in gdb goes. Here are
the stacks I dumped during the hang I referred to:
https://paste.debian.net/1288308/

But dmesg prints nothing for about 20-30 mins before recovering, and then
I get RCU stalls. It looks like this:

[  682.721962] kvm-clock: cpu 7, msr 199981c1, secondary cpu clock
[  682.736830] kvm-guest: stealtime: cpu 7, msr 1f5db140
[  684.445875] smpboot: Booting Node 0 Processor 5 APIC 0x5
[  684.467831] kvm-clock: cpu 5, msr 19998141, secondary cpu clock
[  684.555766] kvm-guest: stealtime: cpu 5, msr 1f55b140
[  687.356637] smpboot: Booting Node 0 Processor 4 APIC 0x4
[  687.377214] kvm-clock: cpu 4, msr 19998101, secondary cpu clock
[ 2885.473742] kvm-guest: stealtime: cpu 4, msr 1f51b140
[ 2886.456408] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2886.457590] rcu_torture_fwd_prog_nr: Duration 15423 cver 170 gps 337
[ 2886.464934] rcu: 0-...!: (2 ticks this GP) idle=7eb/0/0x1 softirq=118271/118271 fqs=0 last_accelerate: e3cd/71c0 dyntick_enabled: 1
[ 2886.490837] (t=2199034 jiffies g=185489 q=4)
[ 2886.497297] rcu: rcu_sched kthread timer wakeup didn't happen for 2199031 jiffies! g185489 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 2886.514201] rcu: Possible timer handling issue on cpu=0 timer-softirq=441616
[ 2886.524593] rcu: rcu_sched kthread starved for 2199034 jiffies! g185489 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 2886.540067] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2886.551967] rcu: RCU grace-period kthread stack dump:
[ 2886.558644] task:rcu_sched       state:I stack:14896 pid:   15 ppid:     2 flags:0x00004000
[ 2886.569640] Call Trace:
[ 2886.572940]  <TASK>
[ 2886.575902]  __schedule+0x284/0x6e0
[ 2886.580969]  schedule+0x53/0xa0
[ 2886.585231]  schedule_timeout+0x8f/0x130

During that huge gap, I connected gdb and dumped the stacks in the link above.
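
A silent stretch like this can also be found mechanically in a saved
console log. Below is a minimal Python sketch, assuming plain dmesg output
with the usual "[seconds.micros]" prefix (optionally preceded by a "<N>"
log level). find_gaps and the 60-second threshold are illustrative
choices, not part of any existing tool.

```python
import re

# Matches the timestamp prefix of a dmesg line, with an optional "<N>"
# log-level marker in front of it.
TS_RE = re.compile(r"^\s*(?:<\d+>)?\[\s*(\d+\.\d+)\]")


def find_gaps(lines, threshold=60.0):
    """Yield (prev_ts, cur_ts, line) for each jump larger than threshold seconds."""
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # untimestamped line, e.g. a wrapped continuation
        ts = float(m.group(1))
        if prev is not None and ts - prev > threshold:
            yield prev, ts, line.rstrip("\n")
        prev = ts


# Four lines taken from the log above, around the silent period.
SAMPLE = """\
[  687.356637] smpboot: Booting Node 0 Processor 4 APIC 0x4
[  687.377214] kvm-clock: cpu 4, msr 19998101, secondary cpu clock
[ 2885.473742] kvm-guest: stealtime: cpu 4, msr 1f51b140
[ 2886.456408] rcu: INFO: rcu_sched self-detected stall on CPU
"""

if __name__ == "__main__":
    for prev, cur, line in find_gaps(SAMPLE.splitlines()):
        print(f"silent for {cur - prev:.1f}s before: {line}")
```

The same loop works on a live log by passing an open file object instead
of the sample string, since it only iterates over lines.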

On 5.15 stable you could repro it in about an hour and a half most of the time by running something like:
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 48 --duration 60 --configs TREE04

Let me know if you saw anything like this. I am currently trying to panic the
kernel when the hang happens so I can get better traces.

thanks,

- Joel


2023-08-09 22:07:03

by Guenter Roeck

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 13:14, Joel Fernandes wrote:
> On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
>> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
>>> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
>>>>
>>>> On 8/9/23 06:53, Joel Fernandes wrote:
>>>>> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>>>>>> This is the start of the stable review cycle for the 5.15.126 release.
>>>>>> There are 92 patches in this series, all will be posted as a response
>>>>>> to this one. If anyone has any issues with these being applied, please
>>>>>> let me know.
>>>>>>
>>>>>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>>>>>> Anything received after that time might be too late.
>>>>>>
>>>>>> The whole patch series can be found in one patch at:
>>>>>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>>>>>> or in the git tree and branch at:
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>>>>>> and the diffstat can be found below.
>>>>>
>>>>> Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
>>>>> hang with this -rc: TREE04, TREE07, TASKS03.
>>>>>
>>>>> 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
>>>>> hotplug rcutorture testing. Me and tglx are continuing to debug this. The
>>>>> issue does not show up on anything but 5.15 stable kernels and neither on
>>>>> mainline.
>>>>>
>>>>
>>>> Do you by any have a crash pattern that we could possibly use to find the crash
>>>> in ChromeOS crash logs ? No idea if that would help, but it could provide some
>>>> additional data points.
>>>
>>> The pattern shows as a hard hang, the system is unresponsive and all CPUs
>>> are stuck in stop_machine. Sometimes it recovers on its own from the
>>> hang and then RCU immediately gives stall warnings. It takes 1.5 hour
>>> to reproduce and sometimes never happens for several hours.
>>>
>>> It appears related to CPU hotplug since gdb showed me most of the CPUs
>>> are spinning in multi_cpu_stop() / stop machine after the hang.
>>>
>>
>> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
>> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
>
> Interesting. It looks similar as far as the stack dump in gdb goes, here are
> the stacks I dumped with the hang I referred to:
> https://paste.debian.net/1288308/
>

That link gives me "Entry not found".

Guenter


2023-08-09 23:08:41

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 13:39, Joel Fernandes wrote:
> On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
>>
>> On 8/9/23 13:14, Joel Fernandes wrote:
>>> On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
>>>> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
>>>>> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
>>>>>>
>>>>>> On 8/9/23 06:53, Joel Fernandes wrote:
>>>>>>> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>>>>>>>> This is the start of the stable review cycle for the 5.15.126 release.
>>>>>>>> There are 92 patches in this series, all will be posted as a response
>>>>>>>> to this one. If anyone has any issues with these being applied, please
>>>>>>>> let me know.
>>>>>>>>
>>>>>>>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>>>>>>>> Anything received after that time might be too late.
>>>>>>>>
>>>>>>>> The whole patch series can be found in one patch at:
>>>>>>>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>>>>>>>> or in the git tree and branch at:
>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>>>>>>>> and the diffstat can be found below.
>>>>>>>
>>>>>>> Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
>>>>>>> hang with this -rc: TREE04, TREE07, TASKS03.
>>>>>>>
>>>>>>> 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
>>>>>>> hotplug rcutorture testing. Me and tglx are continuing to debug this. The
>>>>>>> issue does not show up on anything but 5.15 stable kernels and neither on
>>>>>>> mainline.
>>>>>>>
>>>>>>
>>>>>> Do you by any have a crash pattern that we could possibly use to find the crash
>>>>>> in ChromeOS crash logs ? No idea if that would help, but it could provide some
>>>>>> additional data points.
>>>>>
>>>>> The pattern shows as a hard hang, the system is unresponsive and all CPUs
>>>>> are stuck in stop_machine. Sometimes it recovers on its own from the
>>>>> hang and then RCU immediately gives stall warnings. It takes 1.5 hour
>>>>> to reproduce and sometimes never happens for several hours.
>>>>>
>>>>> It appears related to CPU hotplug since gdb showed me most of the CPUs
>>>>> are spinning in multi_cpu_stop() / stop machine after the hang.
>>>>>
>>>>
>>>> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
>>>> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
>>>
>>> Interesting. It looks similar as far as the stack dump in gdb goes, here are
>>> the stacks I dumped with the hang I referred to:
>>> https://paste.debian.net/1288308/
>>>
>>
>> That link gives me "Entry not found".
>
> Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2

I found a couple of crash reports from chromeos-5.10, one of them complaining
about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.

Guenter


2023-08-09 23:14:00

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
>
> On 8/9/23 13:14, Joel Fernandes wrote:
> > On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> >> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> >>> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> >>>>
> >>>> On 8/9/23 06:53, Joel Fernandes wrote:
> >>>>> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> >>>>>> This is the start of the stable review cycle for the 5.15.126 release.
> >>>>>> There are 92 patches in this series, all will be posted as a response
> >>>>>> to this one. If anyone has any issues with these being applied, please
> >>>>>> let me know.
> >>>>>>
> >>>>>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> >>>>>> Anything received after that time might be too late.
> >>>>>>
> >>>>>> The whole patch series can be found in one patch at:
> >>>>>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> >>>>>> or in the git tree and branch at:
> >>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> >>>>>> and the diffstat can be found below.
> >>>>>
> >>>>> Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> >>>>> hang with this -rc: TREE04, TREE07, TASKS03.
> >>>>>
> >>>>> 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> >>>>> hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> >>>>> issue does not show up on anything but 5.15 stable kernels and neither on
> >>>>> mainline.
> >>>>>
> >>>>
> >>>> Do you by any have a crash pattern that we could possibly use to find the crash
> >>>> in ChromeOS crash logs ? No idea if that would help, but it could provide some
> >>>> additional data points.
> >>>
> >>> The pattern shows as a hard hang, the system is unresponsive and all CPUs
> >>> are stuck in stop_machine. Sometimes it recovers on its own from the
> >>> hang and then RCU immediately gives stall warnings. It takes 1.5 hour
> >>> to reproduce and sometimes never happens for several hours.
> >>>
> >>> It appears related to CPU hotplug since gdb showed me most of the CPUs
> >>> are spinning in multi_cpu_stop() / stop machine after the hang.
> >>>
> >>
> >> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> >> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
> >
> > Interesting. It looks similar as far as the stack dump in gdb goes, here are
> > the stacks I dumped with the hang I referred to:
> > https://paste.debian.net/1288308/
> >
>
> That link gives me "Entry not found".

Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2

2023-08-10 11:14:12

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/10/23 03:16, Harshit Mogalapalli wrote:
> Hi Greg,
>
> On 09/08/23 4:10 pm, Greg Kroah-Hartman wrote:
>> This is the start of the stable review cycle for the 5.15.126 release.
>> There are 92 patches in this series, all will be posted as a response
>> to this one.  If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>> Anything received after that time might be too late.
>>
> No problems seen on x86_64 and aarch64.
>

fwiw, aarch64:allmodconfig doesn't compile.

Guenter

> Tested-by: Harshit Mogalapalli <[email protected]>
>
> Thanks,
> Harshit
>
>> The whole patch series can be found in one patch at:
>>     https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>> or in the git tree and branch at:
>>     git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>> and the diffstat can be found below.
>>
>> thanks,
>>
>> greg k-h
>>


2023-08-10 11:15:28

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 03:40, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
Building arm:allmodconfig ... failed
--------------
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'

drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared

Building arm64:defconfig ... failed
--------------
Error log:

drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'

drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared

That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan
free cleanup on SMC") is applied without its dependent commit(s).

Guenter


2023-08-10 12:17:06

by Harshit Mogalapalli

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

Hi Greg,

On 09/08/23 4:10 pm, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
No problems seen on x86_64 and aarch64.

Tested-by: Harshit Mogalapalli <[email protected]>

Thanks,
Harshit

> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

2023-08-10 16:35:44

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>

Build results:
total: 160 pass: 157 fail: 3
Failed builds:
arm:allmodconfig
arm64:defconfig
arm64:allmodconfig
Qemu test results:
total: 501 pass: 423 fail: 78
Failed tests:
<most arm>
<all arm64/arm64be>

As already reported, plus:

Error log:
drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c:176:20: error: 'drm_plane_helper_destroy' undeclared here

for arm:multi_v7_defconfig

Side note: I am surprised about successful arm64 tests/builds
since arm64:defconfig fails to build with obvious code errors.

drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'

drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared

Guenter

2023-08-10 17:36:08

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/10/23 03:24, Guenter Roeck wrote:
> On 8/9/23 03:40, Greg Kroah-Hartman wrote:
>> This is the start of the stable review cycle for the 5.15.126 release.
>> There are 92 patches in this series, all will be posted as a response
>> to this one.  If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>> Anything received after that time might be too late.
>>
> Building arm:allmodconfig ... failed
> --------------
> Error log:
> drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
>
> drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
> drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
>
> Building arm64:defconfig ... failed
> --------------
> Error log:
>
> drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
>
> drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
> drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
>
> That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan
> free cleanup on SMC") is applied without its dependent commit(s).

Indeed, we discussed this here:
https://lore.kernel.org/all/20230810084529.53thk6dmlejbma3t@bogus/
--
Florian


2023-08-10 18:31:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
> On 8/9/23 13:39, Joel Fernandes wrote:
> > On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
> > >
> > > On 8/9/23 13:14, Joel Fernandes wrote:
> > > > On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> > > > > On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> > > > > > On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> > > > > > >
> > > > > > > On 8/9/23 06:53, Joel Fernandes wrote:
> > > > > > > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > > > > > > > > This is the start of the stable review cycle for the 5.15.126 release.
> > > > > > > > > There are 92 patches in this series, all will be posted as a response
> > > > > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > > > > let me know.
> > > > > > > > >
> > > > > > > > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > > > > > > > Anything received after that time might be too late.
> > > > > > > > >
> > > > > > > > > The whole patch series can be found in one patch at:
> > > > > > > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > > > > > > > > or in the git tree and branch at:
> > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > > > and the diffstat can be found below.
> > > > > > > >
> > > > > > > > Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > > > > > > hang with this -rc: TREE04, TREE07, TASKS03.
> > > > > > > >
> > > > > > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > > > > > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > > > > > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > > > > > > mainline.
> > > > > > > >
> > > > > > >
> > > > > > > Do you by any have a crash pattern that we could possibly use to find the crash
> > > > > > > in ChromeOS crash logs ? No idea if that would help, but it could provide some
> > > > > > > additional data points.
> > > > > >
> > > > > > The pattern shows as a hard hang, the system is unresponsive and all CPUs
> > > > > > are stuck in stop_machine. Sometimes it recovers on its own from the
> > > > > > hang and then RCU immediately gives stall warnings. It takes 1.5 hour
> > > > > > to reproduce and sometimes never happens for several hours.
> > > > > >
> > > > > > It appears related to CPU hotplug since gdb showed me most of the CPUs
> > > > > > are spinning in multi_cpu_stop() / stop machine after the hang.
> > > > > >
> > > > >
> > > > > Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> > > > > but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
> > > >
> > > > Interesting. It looks similar as far as the stack dump in gdb goes, here are
> > > > the stacks I dumped with the hang I referred to:
> > > > https://paste.debian.net/1288308/
> > > >
> > >
> > > That link gives me "Entry not found".
> >
> > Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
>
> I found a couple of crash reports from chromeos-5.10, one of them complaining
> about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.

Is the crash showing the eternally refiring timer fixed by this commit?

53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")

This commit fixed something similar for me in v5.16.

https://paulmck.livejournal.com/62071.html

Thanx, Paul

2023-08-10 22:04:01

by Ron Economos

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/9/23 3:40 AM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

Built and booted successfully on RISC-V RV64 (HiFive Unmatched).

Tested-by: Ron Economos <[email protected]>


2023-08-10 22:32:14

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
> > On 8/9/23 13:39, Joel Fernandes wrote:
> > > On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
> > > >
> > > > On 8/9/23 13:14, Joel Fernandes wrote:
> > > > > On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> > > > > > On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> > > > > > > On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 8/9/23 06:53, Joel Fernandes wrote:
> > > > > > > > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > > > > > > > > > This is the start of the stable review cycle for the 5.15.126 release.
> > > > > > > > > > There are 92 patches in this series, all will be posted as a response
> > > > > > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > > > > > let me know.
> > > > > > > > > >
> > > > > > > > > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > > > > > > > > Anything received after that time might be too late.
> > > > > > > > > >
> > > > > > > > > > The whole patch series can be found in one patch at:
> > > > > > > > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > > > > > > > > > or in the git tree and branch at:
> > > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > > > > and the diffstat can be found below.
> > > > > > > > >
> > > > > > > > > Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > > > > > > > hang with this -rc: TREE04, TREE07, TASKS03.
> > > > > > > > >
> > > > > > > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > > > > > > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > > > > > > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > > > > > > > mainline.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Do you by any have a crash pattern that we could possibly use to find the crash
> > > > > > > > in ChromeOS crash logs ? No idea if that would help, but it could provide some
> > > > > > > > additional data points.
> > > > > > >
> > > > > > > The pattern shows as a hard hang, the system is unresponsive and all CPUs
> > > > > > > are stuck in stop_machine. Sometimes it recovers on its own from the
> > > > > > > hang and then RCU immediately gives stall warnings. It takes 1.5 hour
> > > > > > > to reproduce and sometimes never happens for several hours.
> > > > > > >
> > > > > > > It appears related to CPU hotplug since gdb showed me most of the CPUs
> > > > > > > are spinning in multi_cpu_stop() / stop machine after the hang.
> > > > > > >
> > > > > >
> > > > > > Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> > > > > > but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
> > > > >
> > > > > Interesting. It looks similar as far as the stack dump in gdb goes, here are
> > > > > the stacks I dumped with the hang I referred to:
> > > > > https://paste.debian.net/1288308/
> > > > >
> > > >
> > > > That link gives me "Entry not found".
> > >
> > > Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
> >
> > I found a couple of crash reports from chromeos-5.10, one of them complaining
> > about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
>
> Is the crash showing the eternally refiring timer fixed by this commit?
>
> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")

Ah, I was just replying. Since yesterday I have been seeing really good
results after applying the following 3 commits:

53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")

5417ddc1cf1f also mentions a "tick storm", which is exactly what I was
seeing.

I did a lengthy test and everything is looking good. I'll send these out to
the stable list.

thanks,

- Joel



2023-08-10 22:49:35

by Daniel Díaz

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

Hello!

On Wed, 9 Aug 2023 at 04:57, Greg Kroah-Hartman
<[email protected]> wrote:
> This is the start of the stable review cycle for the 5.15.126 release.
> There are 92 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>
> -------------

We are also seeing build failures on Arm and Arm64, with Clang 17 and GCC 8:

* arm, build
- clang-17-defconfig
- clang-17-lkftconfig
- clang-17-lkftconfig-no-kselftest-frag
- clang-lkftconfig
- clang-nightly-lkftconfig-kselftest
- gcc-8-defconfig

* arm64, build
- clang-17-defconfig
- clang-17-defconfig-40bc7ee5
- clang-17-lkftconfig
- clang-17-lkftconfig-no-kselftest-frag
- clang-lkftconfig
- clang-nightly-lkftconfig-kselftest
- gcc-8-defconfig
- gcc-8-defconfig-40bc7ee5

Failure is:

-----8<-----
/builds/linux/drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
39 | int irq;
| ^~~
/builds/linux/drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
/builds/linux/drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared (first use in this function); did you mean 'rq'?
118 | scmi_info->irq = irq;
| ^~~
| rq
----->8-----

(Funnily enough, this was reported by Naresh [1] before this RC round,
but we chalked it up to GCC-13 on an older branch.)

Greetings!

Daniel Díaz
[email protected]

[1] https://lore.kernel.org/stable/CA+G9fYvTjm2oa6mXR=HUe6gYuVaS2nFb_otuvPfmPeKHDoC+Tw@mail.gmail.com/

2023-08-10 22:56:06

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Thu, Aug 10, 2023 at 09:54:16PM +0000, Joel Fernandes wrote:
> On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
> > > On 8/9/23 13:39, Joel Fernandes wrote:
> > > > On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
> > > > >
> > > > > On 8/9/23 13:14, Joel Fernandes wrote:
> > > > > > On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> > > > > > > On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> > > > > > > > On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On 8/9/23 06:53, Joel Fernandes wrote:
> > > > > > > > > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > > > > > > > > > > This is the start of the stable review cycle for the 5.15.126 release.
> > > > > > > > > > > There are 92 patches in this series, all will be posted as a response
> > > > > > > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > > > > > > let me know.
> > > > > > > > > > >
> > > > > > > > > > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > > > > > > > > > Anything received after that time might be too late.
> > > > > > > > > > >
> > > > > > > > > > > The whole patch series can be found in one patch at:
> > > > > > > > > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > > > > > > > > > > or in the git tree and branch at:
> > > > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > > > > > and the diffstat can be found below.
> > > > > > > > > >
> > > > > > > > > > Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > > > > > > > > hang with this -rc: TREE04, TREE07, TASKS03.
> > > > > > > > > >
> > > > > > > > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > > > > > > > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > > > > > > > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > > > > > > > > mainline.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Do you by any have a crash pattern that we could possibly use to find the crash
> > > > > > > > > in ChromeOS crash logs ? No idea if that would help, but it could provide some
> > > > > > > > > additional data points.
> > > > > > > >
> > > > > > > > The pattern shows as a hard hang, the system is unresponsive and all CPUs
> > > > > > > > are stuck in stop_machine. Sometimes it recovers on its own from the
> > > > > > > > hang and then RCU immediately gives stall warnings. It takes 1.5 hour
> > > > > > > > to reproduce and sometimes never happens for several hours.
> > > > > > > >
> > > > > > > > It appears related to CPU hotplug since gdb showed me most of the CPUs
> > > > > > > > are spinning in multi_cpu_stop() / stop machine after the hang.
> > > > > > > >
> > > > > > >
> > > > > > > Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> > > > > > > but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
> > > > > >
> > > > > > Interesting. It looks similar as far as the stack dump in gdb goes, here are
> > > > > > the stacks I dumped with the hang I referred to:
> > > > > > https://paste.debian.net/1288308/
> > > > > >
> > > > >
> > > > > That link gives me "Entry not found".
> > > >
> > > > Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
> > >
> > > I found a couple of crash reports from chromeos-5.10, one of them complaining
> > > about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
> >
> > Is the crash showing the eternally refiring timer fixed by this commit?
> >
> > 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
>
> Ah I was just replying, I have been seeing really good results after applying
> the following 3 commits since yesterday:
>
> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
> 5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
> a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
>
> 5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was
> seeing.
>
> I did a lengthy test and everything is looking good. I'll send these out to
> the stable list.

I just read your post for the first time. Just to humor you, my debugging was
very similar to yours: I got as far as this statement in your post before
looking for fixes in the timer code:
<quote>
Further checking showed that the stuck CPU was in fact suffering from an
interrupt storm, namely an interrupt storm of scheduling-clock interrupts.
This spurred another code-inspection session.
</quote>

I detected this with gdb: during that 2000-second stall, I broke into the VM
with --gdb and kept dumping the stuck CPU's stack with "thread X" and "bt".
I noticed that it was always in the timer interrupt. Here are the stacks:
https://pastebin.com/raw/L3nv1kH2

Then I narrowed my search down to timer events by enabling the
ftrace_dump_on_oops and panic-on-stall boot options, and noticed a storm of
hrtimer_start events coming out of the long stall. I was all but certain it
was a tick storm, and noticed it kept programming the hrtimer to the same event.
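
One way to spot that pattern in a large ftrace dump is to count how often hrtimer_start is programmed with the same expiry. The following is a rough sketch, assuming each event line carries an `expires=<ns>` field. The sample lines are made up for illustration and are not taken from the actual dump, and `count_repeated_expiries` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the expiry field of an hrtimer_start line; \b keeps "softexpires=" out.
EXPIRES_RE = re.compile(r"hrtimer_start:.*\bexpires=(\d+)")


def count_repeated_expiries(lines, threshold=3):
    """Return {expiry: count} for expiries programmed at least `threshold` times."""
    counts = Counter()
    for line in lines:
        match = EXPIRES_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return {expiry: n for expiry, n in counts.items() if n >= threshold}


# Made-up sample data (NOT real trace output from this thread).
sample = [
    "hrtimer_start: hrtimer=0x1 function=tick_sched_timer expires=100004000000 softexpires=100004000000",
    "hrtimer_start: hrtimer=0x1 function=tick_sched_timer expires=100004000000 softexpires=100004000000",
    "hrtimer_cancel: hrtimer=0x1",
    "hrtimer_start: hrtimer=0x1 function=tick_sched_timer expires=100004000000 softexpires=100004000000",
    "hrtimer_start: hrtimer=0x1 function=tick_sched_timer expires=100004000000 softexpires=100004000000",
    "hrtimer_start: hrtimer=0x1 function=tick_sched_timer expires=100008000000 softexpires=100008000000",
]

print(count_repeated_expiries(sample))
```

An expiry that shows up many times in a row suggests the timer is being re-armed for the same event rather than making progress.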

Ah, then I just did a "git diff" in kernel/time/ between v5.15 and v6.1 and
noticed the missing patches. ;-)

Though in my experience, I wasn't seeing a KTIME_MAX-type of value like you
mentioned in the post. What I noticed is that the tick was never stopped; it
just kept firing a bit earlier than requested, and in the interrupt exit path
(of the delivered-too-early timer interrupt) it kept re-requesting the tick.
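
The feedback loop described above can be illustrated with a toy model. This is only a sketch of the symptom as described in this mail, not the kernel's tick code. The function name, the parameters, and the assumption that every early interrupt costs a fixed handler time are all invented for the example.

```python
def interrupts_until_expiry(expiry_ns, early_ns, handler_cost_ns, limit=1_000_000):
    """Count interrupts taken before the requested expiry is actually reached.

    Toy model: the device fires early_ns before the requested expiry (or at
    once, if that moment has already passed), and a handler that runs before
    expiry_ns re-requests the very same expiry.
    """
    now = 0
    count = 0
    while count < limit:  # guard against a zero-cost handler looping forever
        now = max(now, expiry_ns - early_ns)  # device delivers the interrupt early
        count += 1
        if now >= expiry_ns:
            break  # delivered on time: the tick is handled normally
        now += handler_cost_ns  # too early: re-request the same expiry
    return count


# One tick requested 4 ms out, with a 1 us handler.
print(interrupts_until_expiry(4_000_000, early_ns=0, handler_cost_ns=1_000))
print(interrupts_until_expiry(4_000_000, early_ns=10_000, handler_cost_ns=1_000))
```

In this model an on-time device takes a single interrupt per tick, while one that fires 10 us early takes 11, because each early delivery is answered with another request for the same event.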

thanks,

- Joel


2023-08-10 23:05:08

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On 8/10/23 14:54, Joel Fernandes wrote:
> On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
>>> On 8/9/23 13:39, Joel Fernandes wrote:
>>>> On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
>>>>>
>>>>> On 8/9/23 13:14, Joel Fernandes wrote:
>>>>>> On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
>>>>>>> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
>>>>>>>> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 8/9/23 06:53, Joel Fernandes wrote:
>>>>>>>>>> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>>>>>>>>>>> This is the start of the stable review cycle for the 5.15.126 release.
>>>>>>>>>>> There are 92 patches in this series, all will be posted as a response
>>>>>>>>>>> to this one. If anyone has any issues with these being applied, please
>>>>>>>>>>> let me know.
>>>>>>>>>>>
>>>>>>>>>>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>>>>>>>>>>> Anything received after that time might be too late.
>>>>>>>>>>>
>>>>>>>>>>> The whole patch series can be found in one patch at:
>>>>>>>>>>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>>>>>>>>>>> or in the git tree and branch at:
>>>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>>>>>>>>>>> and the diffstat can be found below.
>>>>>>>>>>
>>>>>>>>>> Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
>>>>>>>>>> hang with this -rc: TREE04, TREE07, TASKS03.
>>>>>>>>>>
>>>>>>>>>> 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
>>>>>>>>>> hotplug rcutorture testing. Me and tglx are continuing to debug this. The
>>>>>>>>>> issue does not show up on anything but 5.15 stable kernels and neither on
>>>>>>>>>> mainline.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you by any chance have a crash pattern that we could possibly use to find the crash
>>>>>>>>> in ChromeOS crash logs ? No idea if that would help, but it could provide some
>>>>>>>>> additional data points.
>>>>>>>>
>>>>>>>> The pattern shows as a hard hang, the system is unresponsive and all CPUs
>>>>>>>> are stuck in stop_machine. Sometimes it recovers on its own from the
>>>>>>>> hang and then RCU immediately gives stall warnings. It takes 1.5 hour
>>>>>>>> to reproduce and sometimes never happens for several hours.
>>>>>>>>
>>>>>>>> It appears related to CPU hotplug since gdb showed me most of the CPUs
>>>>>>>> are spinning in multi_cpu_stop() / stop machine after the hang.
>>>>>>>>
>>>>>>>
>>>>>>> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
>>>>>>> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
>>>>>>
>>>>>> Interesting. It looks similar as far as the stack dump in gdb goes, here are
>>>>>> the stacks I dumped with the hang I referred to:
>>>>>> https://paste.debian.net/1288308/
>>>>>>
>>>>>
>>>>> That link gives me "Entry not found".
>>>>
>>>> Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
>>>
>>> I found a couple of crash reports from chromeos-5.10, one of them complaining
>>> about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
>>
>> Is the crash showing the eternally refiring timer fixed by this commit?
>>
>> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
>
> Ah I was just replying, I have been seeing really good results after applying
> the following 3 commits since yesterday:
>
> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
> 5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
> a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
>

Would those also apply to v5.10.y, or just 5.15.y ?

Thanks,
Guenter

> 5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was
> seeing.
>
> I did a lengthy test and everything is looking good. I'll send these out to
> the stable list.
>
> thanks,
>
> - Joel
>
>


2023-08-10 23:08:55

by Paul E. McKenney

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Thu, Aug 10, 2023 at 10:14:16PM +0000, Joel Fernandes wrote:
> On Thu, Aug 10, 2023 at 09:54:16PM +0000, Joel Fernandes wrote:
> > On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
> > > On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
> > > > On 8/9/23 13:39, Joel Fernandes wrote:
> > > > > On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
> > > > > >
> > > > > > On 8/9/23 13:14, Joel Fernandes wrote:
> > > > > > > On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
> > > > > > > > On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> > > > > > > > > On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On 8/9/23 06:53, Joel Fernandes wrote:
> > > > > > > > > > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > > > > > > > > > > > This is the start of the stable review cycle for the 5.15.126 release.
> > > > > > > > > > > > There are 92 patches in this series, all will be posted as a response
> > > > > > > > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > > > > > > > let me know.
> > > > > > > > > > > >
> > > > > > > > > > > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > > > > > > > > > > Anything received after that time might be too late.
> > > > > > > > > > > >
> > > > > > > > > > > > The whole patch series can be found in one patch at:
> > > > > > > > > > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > > > > > > > > > > > or in the git tree and branch at:
> > > > > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > > > > > > and the diffstat can be found below.
> > > > > > > > > > >
> > > > > > > > > > > > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > > > > > > > > > hang with this -rc: TREE04, TREE07, TASKS03.
> > > > > > > > > > >
> > > > > > > > > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > > > > > > > > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > > > > > > > > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > > > > > > > > > mainline.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Do you by any chance have a crash pattern that we could possibly use to find the crash
> > > > > > > > > > in ChromeOS crash logs ? No idea if that would help, but it could provide some
> > > > > > > > > > additional data points.
> > > > > > > > >
> > > > > > > > > The pattern shows as a hard hang, the system is unresponsive and all CPUs
> > > > > > > > > are stuck in stop_machine. Sometimes it recovers on its own from the
> > > > > > > > > hang and then RCU immediately gives stall warnings. It takes 1.5 hour
> > > > > > > > > to reproduce and sometimes never happens for several hours.
> > > > > > > > >
> > > > > > > > > It appears related to CPU hotplug since gdb showed me most of the CPUs
> > > > > > > > > are spinning in multi_cpu_stop() / stop machine after the hang.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
> > > > > > > > but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
> > > > > > >
> > > > > > > Interesting. It looks similar as far as the stack dump in gdb goes, here are
> > > > > > > the stacks I dumped with the hang I referred to:
> > > > > > > https://paste.debian.net/1288308/
> > > > > > >
> > > > > >
> > > > > > That link gives me "Entry not found".
> > > > >
> > > > > Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
> > > >
> > > > I found a couple of crash reports from chromeos-5.10, one of them complaining
> > > > about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
> > >
> > > Is the crash showing the eternally refiring timer fixed by this commit?
> > >
> > > 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
> >
> > Ah I was just replying, I have been seeing really good results after applying
> > the following 3 commits since yesterday:
> >
> > 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
> > 5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
> > a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
> >
> > 5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was
> > seeing.
> >
> > I did a lengthy test and everything is looking good. I'll send these out to
> > the stable list.
>
> I just read your post for the first time. And just to humor you about my
> debugging which was very similar to yours, I got as far as this statement in
> your post (before looking for fixes in timer code):
> <quote>
> Further checking showed that the stuck CPU was in fact suffering from an
> interrupt storm, namely an interrupt storm of scheduling-clock interrupts.
> This spurred another code-inspection session.
> </quote>
>
> My detection of this came from gdb, within that 2000 second stall, I broke
> into the VM with --gdb and kept dumping the stuck CPU's stack with "thread X"
> and "bt". I noticed that it was always in the timer interrupt. Here were the
> stacks: https://pastebin.com/raw/L3nv1kH2
>
> Then I narrowed my search down to timer events by enabling
> boot options ftrace_dump_on_oops and panic-on-stall ones, and noticed a storm
> of hrtimer_start coming out of the long stall. I was all but certain it was a
> tick storm and noticed it kept programming hrtimer to the same event.
>
> Ah, then I just did a "git diff" in kernel/time/ between v5.15 and v6.1 and
> noticed the missing patches. ;-)
>
> Though in my experience, I wasn't seeing a KTIME_MAX-type of value like you
> mentioned in the post. What I noticed is that the tick was never stopped, it
> just kept firing a bit earlier than was requested and in the interrupt exit
> path (of the delivered-too-early timer interrupt), it kept re-requesting the
> tick.

That "git diff" wouldn't have shown me much at the time, but I am very
glad that you found it!

Thanx, Paul

2023-08-11 00:58:27

by Joel Fernandes

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review



> On Aug 10, 2023, at 6:55 PM, Guenter Roeck <[email protected]> wrote:
>
> On 8/10/23 14:54, Joel Fernandes wrote:
>>> On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
>>> On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
>>>> On 8/9/23 13:39, Joel Fernandes wrote:
>>>>> On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck <[email protected]> wrote:
>>>>>>
>>>>>> On 8/9/23 13:14, Joel Fernandes wrote:
>>>>>>> On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
>>>>>>>> On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
>>>>>>>>> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 8/9/23 06:53, Joel Fernandes wrote:
>>>>>>>>>>> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>>>>>>>>>>>> This is the start of the stable review cycle for the 5.15.126 release.
>>>>>>>>>>>> There are 92 patches in this series, all will be posted as a response
>>>>>>>>>>>> to this one. If anyone has any issues with these being applied, please
>>>>>>>>>>>> let me know.
>>>>>>>>>>>>
>>>>>>>>>>>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
>>>>>>>>>>>> Anything received after that time might be too late.
>>>>>>>>>>>>
>>>>>>>>>>>> The whole patch series can be found in one patch at:
>>>>>>>>>>>> https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
>>>>>>>>>>>> or in the git tree and branch at:
>>>>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
>>>>>>>>>>>> and the diffstat can be found below.
>>>>>>>>>>>
>>>>>>>>>>> Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
>>>>>>>>>>> hang with this -rc: TREE04, TREE07, TASKS03.
>>>>>>>>>>>
>>>>>>>>>>> 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
>>>>>>>>>>> hotplug rcutorture testing. Me and tglx are continuing to debug this. The
>>>>>>>>>>> issue does not show up on anything but 5.15 stable kernels and neither on
>>>>>>>>>>> mainline.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Do you by any chance have a crash pattern that we could possibly use to find the crash
>>>>>>>>>> in ChromeOS crash logs ? No idea if that would help, but it could provide some
>>>>>>>>>> additional data points.
>>>>>>>>>
>>>>>>>>> The pattern shows as a hard hang, the system is unresponsive and all CPUs
>>>>>>>>> are stuck in stop_machine. Sometimes it recovers on its own from the
>>>>>>>>> hang and then RCU immediately gives stall warnings. It takes 1.5 hour
>>>>>>>>> to reproduce and sometimes never happens for several hours.
>>>>>>>>>
>>>>>>>>> It appears related to CPU hotplug since gdb showed me most of the CPUs
>>>>>>>>> are spinning in multi_cpu_stop() / stop machine after the hang.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
>>>>>>>> but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
>>>>>>>
>>>>>>> Interesting. It looks similar as far as the stack dump in gdb goes, here are
>>>>>>> the stacks I dumped with the hang I referred to:
>>>>>>> https://paste.debian.net/1288308/
>>>>>>>
>>>>>>
>>>>>> That link gives me "Entry not found".
>>>>>
>>>>> Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
>>>>
>>>> I found a couple of crash reports from chromeos-5.10, one of them complaining
>>>> about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
>>>
>>> Is the crash showing the eternally refiring timer fixed by this commit?
>>>
>>> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
>> Ah I was just replying, I have been seeing really good results after applying
>> the following 3 commits since yesterday:
>> 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
>> 5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
>> a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
>
> Would those also apply to v5.10.y, or just 5.15.y ?

All apply to 5.10 but one. I am currently testing with it more and will post to stable for 5.10 as well.

Thanks,

- Joel



>
> Thanks,
> Guenter
>
>> 5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was
>> seeing.
>> I did a lengthy test and everything is looking good. I'll send these out to
>> the stable list.
>> thanks,
>> - Joel
>

2023-08-11 11:06:39

by Greg Kroah-Hartman

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Thu, Aug 10, 2023 at 09:25:53AM -0700, Florian Fainelli wrote:
> On 8/10/23 03:24, Guenter Roeck wrote:
> > On 8/9/23 03:40, Greg Kroah-Hartman wrote:
> > > This is the start of the stable review cycle for the 5.15.126 release.
> > > There are 92 patches in this series, all will be posted as a response
> > > to this one. If anyone has any issues with these being applied, please
> > > let me know.
> > >
> > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > > Anything received after that time might be too late.
> > >
> > Building arm:allmodconfig ... failed
> > --------------
> > Error log:
> > drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
> >
> > drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
> > drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
> >
> > Building arm64:defconfig ... failed
> > --------------
> > Error log:
> >
> > drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
> >
> > drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
> > drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
> >
> > That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan
> > free cleanup on SMC") is applied without its dependent commit(s).
>
> Indeed, we discussed this here:
> https://lore.kernel.org/all/20230810084529.53thk6dmlejbma3t@bogus/

Offending commit should now be dropped, thanks.

greg k-h

2023-08-11 11:52:21

by Greg Kroah-Hartman

Subject: Re: [PATCH 5.15 00/92] 5.15.126-rc1 review

On Thu, Aug 10, 2023 at 09:06:01AM -0700, Guenter Roeck wrote:
> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 5.15.126 release.
> > There are 92 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > Anything received after that time might be too late.
> >
>
> Build results:
> total: 160 pass: 157 fail: 3
> Failed builds:
> arm:allmodconfig
> arm64:defconfig
> arm64:allmodconfig
> Qemu test results:
> total: 501 pass: 423 fail: 78
> Failed tests:
> <most arm>
> <all arm64/arm64be>
>
> As already reported, plus:
>
> Error log:
> drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c:176:20: error: 'drm_plane_helper_destroy' undeclared here

Offending commit now dropped, Sasha's dep-bot went a little crazy there,
and this wasn't needed, sorry for not catching that sooner.

> for arm:multi_v7_defconfig
>
> Side note: I am surprised about successful arm64 tests/builds
> since arm64:defconfig fails to build with obvious code errors.
>
> drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
>
> drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
> drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared

Should now be fixed, thanks.

greg k-h