2023-07-25 10:56:22

by Greg Kroah-Hartman

Subject: [PATCH 6.4 000/227] 6.4.7-rc1 review

This is the start of the stable review cycle for the 6.4.7 release.
There are 227 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.

Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
and the diffstat can be found below.

thanks,

greg k-h

-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <[email protected]>
Linux 6.4.7-rc1

Wayne Lin <[email protected]>
drm/amd/display: Add polling method to handle MST reply packet

Srinivasan Shanmugam <[email protected]>
drm/amd/display: Clean up errors & warnings in amdgpu_dm.c

Yu Kuai <[email protected]>
scsi: sg: Fix checking return value of blk_get_queue()

Yu Kuai <[email protected]>
scsi/sg: don't grab scsi host module reference

Abe Kohandel <[email protected]>
spi: dw: Remove misleading comment for Mount Evans SoC

Yunxiang Li <[email protected]>
drm/ttm: fix bulk_move corruption when adding a entry

Mohamed Khalfella <[email protected]>
tracing/histograms: Return an error if we fail to add histogram to hist_vars list

Miguel Ojeda <[email protected]>
kbuild: rust: avoid creating temporary files

Zhang Yi <[email protected]>
jbd2: recheck chechpointing non-dirty buffer

Vladimir Oltean <[email protected]>
net: phy: prevent stale pointer dereference in phy_init()

Eric Dumazet <[email protected]>
tcp: annotate data-races around fastopenq.max_qlen

Eric Dumazet <[email protected]>
tcp: annotate data-races around icsk->icsk_user_timeout

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->notsent_lowat

Eric Dumazet <[email protected]>
tcp: annotate data-races around rskq_defer_accept

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->linger2

Eric Dumazet <[email protected]>
tcp: annotate data-races around icsk->icsk_syn_retries

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->keepalive_probes

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->keepalive_intvl

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->keepalive_time

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->tsoffset

Eric Dumazet <[email protected]>
tcp: annotate data-races around tp->tcp_tx_delay

Tomasz Moń <[email protected]>
Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014

Pauli Virtanen <[email protected]>
Bluetooth: SCO: fix sco_conn related locking and validity issues

Siddh Raman Pant <[email protected]>
Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link

Douglas Anderson <[email protected]>
Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()

Pauli Virtanen <[email protected]>
Bluetooth: ISO: fix iso_conn related locking and validity issues

Pauli Virtanen <[email protected]>
Bluetooth: hci_event: call disconnect callback before deleting conn

Pauli Virtanen <[email protected]>
Bluetooth: use RCU for hci_conn_params and iterate safely in hci_sync

Pablo Neira Ayuso <[email protected]>
netfilter: nf_tables: skip bound chain on rule flush

Pablo Neira Ayuso <[email protected]>
netfilter: nf_tables: skip bound chain in netns release path

Florian Westphal <[email protected]>
netfilter: nft_set_pipapo: fix improper element removal

Florian Westphal <[email protected]>
netfilter: nf_tables: can't schedule in nft_chain_validate

Florian Westphal <[email protected]>
netfilter: nf_tables: fix spurious set element insertion failure

Vitaly Rodionov <[email protected]>
ALSA: hda/realtek: Fix generic fixup definition for cs35l41 amp

Kuniyuki Iwashima <[email protected]>
llc: Don't drop packet from non-root netns.

Zhang Shurong <[email protected]>
fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe

Daniel Golle <[email protected]>
net: ethernet: mtk_eth_soc: always mtk_get_ib1_pkt_type

Kuniyuki Iwashima <[email protected]>
Revert "tcp: avoid the lookup process failing to get sk in ehash table"

Yuanjun Gong <[email protected]>
net:ipv6: check return value of pskb_trim()

Wang Ming <[email protected]>
net: ipv4: Use kfree_sensitive instead of kfree

Eric Dumazet <[email protected]>
tcp: annotate data-races around tcp_rsk(req)->ts_recent

Eric Dumazet <[email protected]>
tcp: annotate data-races around tcp_rsk(req)->txhash

Antoine Tenart <[email protected]>
net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV

Florian Kauer <[email protected]>
igc: Prevent garbled TX queue with XDP ZEROCOPY

Kurt Kanzenbach <[email protected]>
igc: Avoid transmit queue timeout for XDP

Alexander Duyck <[email protected]>
bpf, arm64: Fix BTI type used for freplace attached functions

Kumar Kartikeya Dwivedi <[email protected]>
bpf: Repeat check_max_stack_depth for async callbacks

Kumar Kartikeya Dwivedi <[email protected]>
bpf: Fix subprog idx logic in check_max_stack_depth

Geetha sowjanya <[email protected]>
octeontx2-pf: Dont allocate BPIDs for LBK interfaces

Ido Schimmel <[email protected]>
vrf: Fix lockdep splat in output path

Jiapeng Chong <[email protected]>
security: keys: Modify mismatched function name

Ahmed Zaki <[email protected]>
iavf: fix reset task race with iavf_remove()

Ahmed Zaki <[email protected]>
iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies

Marcin Szycik <[email protected]>
iavf: Wait for reset in callbacks which trigger it

Przemek Kitszel <[email protected]>
iavf: make functions static where possible

Ahmed Zaki <[email protected]>
iavf: use internal state to free traffic IRQs

Ding Hui <[email protected]>
iavf: Fix out-of-bounds when setting channels on remove

Ding Hui <[email protected]>
iavf: Fix use-after-free in free_netdev

Andrzej Hajda <[email protected]>
drm/i915/perf: add sentinel to xehp_oa_b_counters

Heiner Kallweit <[email protected]>
r8169: fix ASPM-related problem for chip version 42 and 43

Tristram Ha <[email protected]>
net: dsa: microchip: correct KSZ8795 static MAC table access

Victor Nogueira <[email protected]>
net: sched: cls_bpf: Undo tcf_bind_filter in case of an error

Victor Nogueira <[email protected]>
net: sched: cls_u32: Undo refcount decrement in case update failed

Victor Nogueira <[email protected]>
net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode

Victor Nogueira <[email protected]>
net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms

Martin Fuzzey <[email protected]>
regulator: da9063: fix null pointer deref with partial DT config

Dan Carpenter <[email protected]>
ASoC: SOF: ipc3-dtrace: uninitialized data in dfsentry_trace_filter_write()

Michal Swiatkowski <[email protected]>
ice: prevent NULL pointer deref during reload

Petr Oros <[email protected]>
ice: Unregister netdev and devlink_port only once

Shyam Prasad N <[email protected]>
cifs: fix mid leak during reconnection after timeout threshold

Dan Carpenter <[email protected]>
iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()

Yan Zhai <[email protected]>
gso: fix dodgy bit handling for GSO_UDP_L4

Daniel Golle <[email protected]>
net: ethernet: mtk_eth_soc: handle probe deferral

Kuniyuki Iwashima <[email protected]>
bridge: Add extack warning when enabling STP in netns.

Tanmay Patil <[email protected]>
net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field()

Linus Walleij <[email protected]>
dsa: mv88e6xxx: Do a final check before timing out

Marc Zyngier <[email protected]>
arm64: Fix HFGxTR_EL2 field naming

Paulo Alcantara <[email protected]>
smb: client: fix missed ses refcounting

Yonghong Song <[email protected]>
kallsyms: strip LTO-only suffixes from promoted global functions

Jaewon Kim <[email protected]>
spi: s3c64xx: clear loopback bit after loopback test

Christoph Hellwig <[email protected]>
btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block

James Clark <[email protected]>
perf build: Fix library not found error when using CSLIBS

Yangtao Li <[email protected]>
fbdev: imxfb: Removed unneeded release_mem_region

Martin Kaiser <[email protected]>
fbdev: imxfb: warn about invalid left/right margin

Jonas Gorski <[email protected]>
spi: bcm63xx: fix max prepend length

Biju Das <[email protected]>
pinctrl: renesas: rzg2l: Handle non-unique subnode names

Geert Uytterhoeven <[email protected]>
pinctrl: renesas: rzv2m: Handle non-unique subnode names

Suren Baghdasaryan <[email protected]>
sched/psi: use kernfs polling functions for PSI trigger polling

Miaohe Lin <[email protected]>
sched/fair: Use recent_used_cpu to test p->cpus_ptr

Peter Zijlstra <[email protected]>
iov_iter: Mark copy_iovec_from_user() noclone

Srinivas Kandagatla <[email protected]>
ASoC: qcom: q6apm: do not close GPR port before closing graph

Srinivas Kandagatla <[email protected]>
ASoC: codecs: wcd938x: fix dB range for HPHL and HPHR

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix mbhc impedance loglevel

Vijendar Mukunda <[email protected]>
ASoC: amd: acp: fix for invalid dai id handling in acp_get_byte_count()

Hao Chen <[email protected]>
net: hns3: fix strncpy() not using dest-buf length as length issue

Ying Hsu <[email protected]>
igb: Fix igb_down hung on surprise removal

Yi Kuo <[email protected]>
wifi: iwlwifi: pcie: add device id 51F1 for killer 1675

Johannes Berg <[email protected]>
wifi: iwlwifi: mvm: avoid baid size integer overflow

Mukesh Sisodiya <[email protected]>
wifi: iwlwifi: Add support for new PCI Id

Gustavo A. R. Silva <[email protected]>
wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point()

Mukesh Sisodiya <[email protected]>
wifi: iwlwifi: mvm: Add NULL check before dereferencing the pointer

Petr Oros <[email protected]>
devlink: report devlink_port_type_warn source device

Jisheng Zhang <[email protected]>
net: ethernet: litex: add support for 64 bit stats

Gregory Greenman <[email protected]>
wifi: iwlwifi: mvm: fix potential array out of bounds access

P Praneesh <[email protected]>
wifi: ath11k: fix memory leak in WMI firmware stats

Balamurugan S <[email protected]>
wifi: ath12k: Avoid NULL pointer access during management transmit cleanup

Abe Kohandel <[email protected]>
spi: dw: Add compatible for Intel Mount Evans SoC

Ilan Peer <[email protected]>
wifi: mac80211_hwsim: Fix possible NULL dereference

Wen Gong <[email protected]>
wifi: ath11k: add support default regdb while searching board-2.bin for WCN6855

Jakub Kicinski <[email protected]>
devlink: make health report on unregistered instance warn just once

Yonghong Song <[email protected]>
bpf: Silence a warning in btf_type_id_size()

Martin Blumenstingl <[email protected]>
wifi: rtw88: sdio: Check the HISR RX_REQUEST bit in rtw_sdio_rx_isr()

Aditi Ghag <[email protected]>
bpf: tcp: Avoid taking fast sock lock in iterator

Andrii Nakryiko <[email protected]>
bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log

Brad Larson <[email protected]>
spi: cadence-quadspi: Add compatible for AMD Pensando Elba SoC

Martin KaFai Lau <[email protected]>
bpf: Address KCSAN report on bpf_lru_list

Kui-Feng Lee <[email protected]>
bpf: Print a warning only if writing to unprivileged_bpf_disabled.

Maxime Bizon <[email protected]>
wifi: ath11k: fix registration of 6Ghz-only phy without the full channel range

Yicong Yang <[email protected]>
sched/fair: Don't balance task to its current running CPU

Thomas Weißschuh <[email protected]>
tools/nolibc: ensure stack protector guard is never zero

Paul E. McKenney <[email protected]>
rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp

Shigeru Yoshida <[email protected]>
rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

Hans de Goede <[email protected]>
ACPI: video: Add backlight=native DMI quirk for Dell Studio 1569

Mark Rutland <[email protected]>
arm64: mm: fix VA-range sanity check

Youngmin Nam <[email protected]>
arm64: set __exception_irq_entry with __irq_entry as a default

Mario Limonciello <[email protected]>
ACPI: resource: Remove "Zen" specific match and quirks

Hans de Goede <[email protected]>
ACPI: video: Add backlight=native DMI quirk for Lenovo ThinkPad X131e (3371 AMD version)

Hans de Goede <[email protected]>
ACPI: video: Add backlight=native DMI quirk for Apple iMac11,3

Hans de Goede <[email protected]>
ACPI: x86: Add ACPI_QUIRK_UART1_SKIP for Lenovo Yoga Book yb1-x90f/l

Hans de Goede <[email protected]>
ACPI: button: Add lid disable DMI quirk for Nextbook Ares 8A

Hans de Goede <[email protected]>
ACPI: x86: Add skip i2c clients quirk for Nextbook Ares 8A

Sandeep Dhavale <[email protected]>
erofs: Fix detection of atomic context

Filipe Manana <[email protected]>
btrfs: abort transaction at update_ref_for_cow() when ref count is zero

Christoph Hellwig <[email protected]>
btrfs: don't check PageError in __extent_writepage

David Sterba <[email protected]>
btrfs: add xxhash to fast checksum implementations

Thomas Gleixner <[email protected]>
posix-timers: Ensure timer ID search-loop limit is valid

Ming Lei <[email protected]>
blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none

Yu Kuai <[email protected]>
scsi: sg: fix blktrace debugfs entries leakage

Yu Kuai <[email protected]>
md/raid10: prevent soft lockup while flush writes

Yu Kuai <[email protected]>
md: fix data corruption for raid456 when reshape restart while grow up

Immad Mir <[email protected]>
FS: JFS: Check for read-only mounted filesystem in txBegin

Immad Mir <[email protected]>
FS: JFS: Fix null-ptr-deref Read in txBegin

Gustavo A. R. Silva <[email protected]>
MIPS: dec: prom: Address -Warray-bounds warning

Yogesh <[email protected]>
fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev

Matthew Anderson <[email protected]>
ALSA: hda/realtek: Add quirks for ROG ALLY CS35l41 audio

Jan Kara <[email protected]>
udf: Fix uninitialized array access for some pathnames

Christian Brauner <[email protected]>
ovl: check type and offset of struct vfsmount in ovl_entry

Marco Morandini <[email protected]>
HID: add quirk for 03f0:464a HP Elite Presenter Mouse

Ye Bin <[email protected]>
quota: fix warning in dqgrab()

Jan Kara <[email protected]>
quota: Properly disable quotas when add_dquot_ref() fails

Oswald Buddenhagen <[email protected]>
ALSA: emu10k1: roll up loops in DSP setup code for Audigy

hackyzh002 <[email protected]>
drm/radeon: Fix integer overflow in radeon_cs_parser_init

Eric Whitney <[email protected]>
ext4: correct inline offset when handling xattrs in inode body

Marc Zyngier <[email protected]>
KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption

Marc Zyngier <[email protected]>
KVM: arm64: Disable preemption in kvm_arch_hardware_enable()

Oliver Upton <[email protected]>
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot

Marc Zyngier <[email protected]>
KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix soundwire initialisation race

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix codec initialisation race

Johan Hovold <[email protected]>
ASoC: codecs: wcd934x: fix resource leaks on component remove

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix missing mbhc init error handling

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix resource leaks on component remove

Sheetal <[email protected]>
ASoC: tegra: Fix AMX byte map

Johan Hovold <[email protected]>
ASoC: qdsp6: audioreach: fix topology probe deferral

Johan Hovold <[email protected]>
ASoC: codecs: wcd-mbhc-v2: fix resource leaks on component remove

Nathan Chancellor <[email protected]>
ASoC: cs35l45: Select REGMAP_IRQ

Johan Hovold <[email protected]>
ASoC: codecs: wcd938x: fix missing clsh ctrl error handling

Thomas Petazzoni <[email protected]>
ASoC: cs42l51: fix driver to properly autoload with automatic module loading

Sameer Pujar <[email protected]>
ASoC: rt5640: Fix sleep in atomic context

Sheetal <[email protected]>
ASoC: tegra: Fix ADX byte map

Fabio Estevam <[email protected]>
ASoC: fsl_sai: Revert "ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode"

Matus Gajdos <[email protected]>
ASoC: fsl_sai: Disable bit clock with transmitter

Nicholas Kazlauskas <[email protected]>
drm/amd/display: Keep PHY active for DP displays on DCN31

Taimur Hassan <[email protected]>
drm/amd/display: check TG is non-null before checking if enabled

Zhikai Zhai <[email protected]>
drm/amd/display: Disable MPC split by default on special asic

Simon Ser <[email protected]>
drm/amd/display: only accept async flips for fast updates

Jocelyn Falempe <[email protected]>
drm/client: Fix memory leak in drm_client_modeset_probe

Jocelyn Falempe <[email protected]>
drm/client: Fix memory leak in drm_client_target_cloned

Ben Skeggs <[email protected]>
drm/nouveau/i2c: fix number of aux event slots

Ben Skeggs <[email protected]>
drm/nouveau/kms/nv50-: init hpd_irq_lock for PIOR DP

Ben Skeggs <[email protected]>
drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts

Alex Deucher <[email protected]>
drm/amdgpu/pm: make mclk consistent for smu 13.0.7

Alex Deucher <[email protected]>
drm/amdgpu/pm: make gfxclock consistent for sienna cichlid

Guchun Chen <[email protected]>
drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

Ville Syrjälä <[email protected]>
dma-buf/dma-resv: Stop leaking on krealloc() failure

Dan Carpenter <[email protected]>
accel/qaic: Add consistent integer overflow checks

Dan Carpenter <[email protected]>
accel/qaic: tighten bounds checking in decode_message()

Dan Carpenter <[email protected]>
accel/qaic: tighten bounds checking in encode_message()

Matthieu Baerts <[email protected]>
selftests: tc: add ConnTrack procfs kconfig

Heiner Kallweit <[email protected]>
Revert "r8169: disable ASPM during NAPI poll"

Marc Kleine-Budde <[email protected]>
can: gs_usb: fix time stamp counter initialization

Marc Kleine-Budde <[email protected]>
can: gs_usb: gs_can_open(): improve error handling

YueHaibing <[email protected]>
can: bcm: Fix UAF in bcm_proc_show()

Fedor Ross <[email protected]>
can: mcp251xfd: __mcp251xfd_chip_set_mode(): increase poll timeout

Mark Brown <[email protected]>
arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes

Helge Deller <[email protected]>
ia64: mmap: Consider pgoff when searching for free mapping

Mark Brown <[email protected]>
regmap: Account for register length in SMBus I/O limits

Rob Herring <[email protected]>
of: Preserve "of-display" device name for compatibility

Harald Freudenberger <[email protected]>
s390/zcrypt: fix reply buffer calculations for CCA replies

Mark Brown <[email protected]>
regmap: Drop initial version of maximum transfer length fixes

Matthieu Baerts <[email protected]>
selftests: tc: add 'ct' action kconfig dep

Dan Carpenter <[email protected]>
accel/qaic: Fix a leak in map_user_pages()

Matthieu Baerts <[email protected]>
selftests: tc: set timeout to 15 minutes

Josef Bacik <[email protected]>
btrfs: fix race between balance and cancel/pause

Miklos Szeredi <[email protected]>
fuse: ioctl: translate ENOSYS in outarg

Filipe Manana <[email protected]>
btrfs: zoned: fix memory leak after finding block group with super blocks

Filipe Manana <[email protected]>
btrfs: fix double iput() on inode after an error during orphan cleanup

Josef Bacik <[email protected]>
btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand

Qu Wenruo <[email protected]>
btrfs: raid56: always verify the P/Q contents for scrub

Bernd Schubert <[email protected]>
fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT

Miklos Szeredi <[email protected]>
fuse: add feature flag for expire-only

Miklos Szeredi <[email protected]>
fuse: revalidate: don't invalidate if interrupted

Filipe Manana <[email protected]>
btrfs: fix warning when putting transaction with qgroups enabled after abort

Filipe Manana <[email protected]>
btrfs: fix iput() on error pointer after error during orphan cleanup

Georg Müller <[email protected]>
perf probe: Read DWARF files from the correct CU

Georg Müller <[email protected]>
perf probe: Add test for regression introduced by switch to die_get_decl_file()

Miguel Ojeda <[email protected]>
prctl: move PR_GET_AUXV out of PR_MCE_KILL

Petr Pavlu <[email protected]>
keys: Fix linking a duplicate key to a keyring's assoc_array

Colin Ian King <[email protected]>
selftests/mm: mkdirty: fix incorrect position of #endif

Liam R. Howlett <[email protected]>
maple_tree: fix node allocation testing on 32 bit

Liam R. Howlett <[email protected]>
mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()

Peng Zhang <[email protected]>
maple_tree: set the node limit when creating a new root node

Luka Guzenko <[email protected]>
ALSA: hda/realtek: Enable Mute LED on HP Laptop 15s-eq2xxx

Christoffer Sandberg <[email protected]>
ALSA: hda/realtek: Add quirk for Clevo NS70AU

Kailang Yang <[email protected]>
ALSA: hda/realtek - remove 3k pull low procedure

Helge Deller <[email protected]>
io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()

Jens Axboe <[email protected]>
io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq


-------------

Diffstat:

Makefile | 4 +-
arch/arm64/include/asm/exception.h | 5 -
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/include/asm/kvm_pgtable.h | 26 +--
arch/arm64/kernel/fpsimd.c | 33 ++-
arch/arm64/kvm/arch_timer.c | 6 +-
arch/arm64/kvm/arm.c | 19 +-
arch/arm64/kvm/hyp/pgtable.c | 47 +++-
arch/arm64/kvm/mmu.c | 18 +-
arch/arm64/kvm/vgic/vgic-v3.c | 2 +-
arch/arm64/kvm/vgic/vgic-v4.c | 7 +-
arch/arm64/mm/mmu.c | 4 +-
arch/arm64/net/bpf_jit_comp.c | 8 +-
arch/arm64/tools/sysreg | 12 +-
arch/ia64/kernel/sys_ia64.c | 2 +-
arch/mips/include/asm/dec/prom.h | 2 +-
arch/parisc/kernel/sys_parisc.c | 15 +-
block/blk-mq.c | 10 +-
drivers/accel/qaic/qaic_control.c | 39 ++--
drivers/acpi/button.c | 9 +
drivers/acpi/resource.c | 60 -----
drivers/acpi/video_detect.c | 24 ++
drivers/acpi/x86/utils.c | 26 ++-
drivers/base/regmap/regmap-i2c.c | 8 +-
drivers/base/regmap/regmap-spi-avmm.c | 2 +-
drivers/base/regmap/regmap.c | 6 +-
drivers/bluetooth/btusb.c | 1 +
drivers/dma-buf/dma-resv.c | 13 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 5 +-
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 256 +++++++++------------
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 7 +
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 12 +
.../amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 110 +++++++++
.../amd/display/amdgpu_dm/amdgpu_dm_mst_types.h | 11 +
.../amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c | 5 +
.../drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c | 3 +-
.../drm/amd/display/dc/dcn303/dcn303_resource.c | 2 +-
.../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 8 +-
.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c | 2 +-
drivers/gpu/drm/drm_client_modeset.c | 6 +
drivers/gpu/drm/i915/i915_perf.c | 1 +
drivers/gpu/drm/nouveau/dispnv50/disp.c | 4 +
drivers/gpu/drm/nouveau/include/nvkm/subdev/i2c.h | 4 +-
drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c | 27 ++-
drivers/gpu/drm/nouveau/nvkm/subdev/i2c/base.c | 11 +-
drivers/gpu/drm/radeon/radeon_cs.c | 3 +-
drivers/gpu/drm/ttm/ttm_resource.c | 5 +-
drivers/hid/hid-ids.h | 1 +
drivers/hid/hid-quirks.c | 1 +
drivers/iommu/iommu-sva.c | 3 +-
drivers/md/md.c | 14 +-
drivers/md/raid10.c | 2 +
drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c | 10 +-
drivers/net/can/spi/mcp251xfd/mcp251xfd.h | 1 +
drivers/net/can/usb/gs_usb.c | 130 ++++++-----
drivers/net/dsa/microchip/ksz8795.c | 8 +-
drivers/net/dsa/microchip/ksz_common.c | 8 +-
drivers/net/dsa/microchip/ksz_common.h | 7 +
drivers/net/dsa/mv88e6xxx/chip.c | 7 +
drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 33 ++-
.../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 29 ++-
drivers/net/ethernet/intel/iavf/iavf.h | 16 +-
drivers/net/ethernet/intel/iavf/iavf_ethtool.c | 39 ++--
drivers/net/ethernet/intel/iavf/iavf_main.c | 223 ++++++++++++------
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 43 ++--
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 4 -
drivers/net/ethernet/intel/iavf/iavf_virtchnl.c | 5 +-
drivers/net/ethernet/intel/ice/ice_base.c | 2 +
drivers/net/ethernet/intel/ice/ice_ethtool.c | 13 +-
drivers/net/ethernet/intel/ice/ice_lib.c | 27 ---
drivers/net/ethernet/intel/ice/ice_main.c | 10 +-
drivers/net/ethernet/intel/igb/igb_main.c | 5 +
drivers/net/ethernet/intel/igc/igc_main.c | 12 +-
drivers/net/ethernet/litex/litex_liteeth.c | 19 +-
.../net/ethernet/marvell/octeontx2/nic/otx2_pf.c | 5 +-
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 29 +--
drivers/net/ethernet/mediatek/mtk_ppe_debugfs.c | 2 +-
drivers/net/ethernet/realtek/r8169_main.c | 18 +-
drivers/net/ethernet/ti/cpsw_ale.c | 24 +-
drivers/net/phy/phy_device.c | 21 +-
drivers/net/vrf.c | 12 +-
drivers/net/wireless/ath/ath11k/core.c | 53 +++--
drivers/net/wireless/ath/ath11k/mac.c | 3 +-
drivers/net/wireless/ath/ath11k/wmi.c | 5 +
drivers/net/wireless/ath/ath12k/mac.c | 1 +
drivers/net/wireless/intel/iwlwifi/mvm/mld-key.c | 9 +-
drivers/net/wireless/intel/iwlwifi/mvm/power.c | 14 +-
drivers/net/wireless/intel/iwlwifi/mvm/sta.c | 2 +-
drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 4 +
drivers/net/wireless/realtek/rtw88/sdio.c | 24 +-
drivers/net/wireless/virtual/mac80211_hwsim.c | 4 +-
drivers/of/platform.c | 2 +-
drivers/pinctrl/renesas/pinctrl-rzg2l.c | 28 ++-
drivers/pinctrl/renesas/pinctrl-rzv2m.c | 28 ++-
drivers/regulator/da9063-regulator.c | 3 +
drivers/s390/crypto/zcrypt_msgtype6.c | 33 ++-
drivers/scsi/sg.c | 10 +
drivers/spi/spi-bcm63xx.c | 2 +-
drivers/spi/spi-cadence-quadspi.c | 19 ++
drivers/spi/spi-dw-mmio.c | 22 ++
drivers/spi/spi-s3c64xx.c | 2 +
drivers/video/fbdev/au1200fb.c | 3 +
drivers/video/fbdev/imxfb.c | 5 +-
fs/btrfs/block-group.c | 1 +
fs/btrfs/ctree.c | 10 +-
fs/btrfs/disk-io.c | 3 +
fs/btrfs/extent_io.c | 33 +--
fs/btrfs/inode.c | 35 +--
fs/btrfs/qgroup.c | 1 +
fs/btrfs/raid56.c | 11 +-
fs/btrfs/volumes.c | 17 +-
fs/erofs/zdata.c | 2 +-
fs/ext4/xattr.c | 14 ++
fs/fuse/dir.c | 2 +-
fs/fuse/inode.c | 8 +-
fs/fuse/ioctl.c | 21 +-
fs/jbd2/checkpoint.c | 102 +++-----
fs/jfs/jfs_dmap.c | 3 +
fs/jfs/jfs_txnmgr.c | 5 +
fs/jfs/namei.c | 5 +
fs/overlayfs/ovl_entry.h | 9 +
fs/quota/dquot.c | 5 +-
fs/smb/client/connect.c | 19 +-
fs/smb/client/dfs.c | 26 +--
fs/smb/client/smb2transport.c | 2 +-
fs/udf/unicode.c | 2 +-
include/kvm/arm_vgic.h | 2 +-
include/linux/psi.h | 5 +-
include/linux/psi_types.h | 3 +
include/linux/sched/signal.h | 2 +-
include/linux/tcp.h | 2 +-
include/net/bluetooth/hci_core.h | 5 +
include/net/ip.h | 2 +-
include/net/tcp.h | 31 ++-
include/uapi/linux/fuse.h | 3 +
io_uring/io_uring.c | 52 ++---
kernel/bpf/bpf_lru_list.c | 21 +-
kernel/bpf/bpf_lru_list.h | 7 +-
kernel/bpf/btf.c | 23 +-
kernel/bpf/log.c | 3 -
kernel/bpf/syscall.c | 3 +-
kernel/bpf/verifier.c | 32 ++-
kernel/cgroup/cgroup.c | 2 +-
kernel/kallsyms.c | 5 +-
kernel/rcu/tasks.h | 5 +-
kernel/rcu/tree_exp.h | 2 +-
kernel/rcu/tree_plugin.h | 4 +-
kernel/sched/fair.c | 4 +-
kernel/sched/psi.c | 29 ++-
kernel/sys.c | 10 +-
kernel/time/posix-timers.c | 31 +--
kernel/trace/trace_events_hist.c | 3 +-
lib/iov_iter.c | 2 +-
lib/maple_tree.c | 3 +-
mm/mlock.c | 9 +-
net/bluetooth/hci_conn.c | 14 +-
net/bluetooth/hci_core.c | 42 +++-
net/bluetooth/hci_event.c | 15 +-
net/bluetooth/hci_sync.c | 121 ++++++++--
net/bluetooth/iso.c | 55 +++--
net/bluetooth/mgmt.c | 26 +--
net/bluetooth/sco.c | 23 +-
net/bridge/br_stp_if.c | 3 +
net/can/bcm.c | 12 +-
net/devlink/health.c | 2 +-
net/devlink/leftover.c | 5 +-
net/ipv4/esp4.c | 2 +-
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/inet_hashtables.c | 17 +-
net/ipv4/inet_timewait_sock.c | 8 +-
net/ipv4/ip_output.c | 4 +-
net/ipv4/tcp.c | 57 ++---
net/ipv4/tcp_fastopen.c | 6 +-
net/ipv4/tcp_ipv4.c | 27 ++-
net/ipv4/tcp_minisocks.c | 11 +-
net/ipv4/tcp_output.c | 6 +-
net/ipv4/udp_offload.c | 16 +-
net/ipv6/ip6_gre.c | 3 +-
net/ipv6/tcp_ipv6.c | 4 +-
net/ipv6/udp_offload.c | 3 +-
net/llc/llc_input.c | 3 -
net/netfilter/nf_tables_api.c | 12 +-
net/netfilter/nft_set_pipapo.c | 6 +-
net/sched/cls_bpf.c | 99 ++++----
net/sched/cls_matchall.c | 35 +--
net/sched/cls_u32.c | 48 +++-
net/wireless/wext-core.c | 6 +
scripts/Makefile.build | 5 +-
scripts/Makefile.host | 6 +-
scripts/kallsyms.c | 6 +-
security/keys/request_key.c | 35 ++-
security/keys/trusted-keys/trusted_tpm2.c | 2 +-
sound/pci/emu10k1/emufx.c | 112 +--------
sound/pci/hda/patch_realtek.c | 100 +++++++-
sound/soc/amd/acp/amd.h | 7 +-
sound/soc/codecs/Kconfig | 1 +
sound/soc/codecs/cs42l51-i2c.c | 6 +
sound/soc/codecs/cs42l51.c | 7 -
sound/soc/codecs/cs42l51.h | 1 -
sound/soc/codecs/rt5640.c | 12 +-
sound/soc/codecs/wcd-mbhc-v2.c | 57 +++--
sound/soc/codecs/wcd934x.c | 12 +
sound/soc/codecs/wcd938x.c | 86 ++++++-
sound/soc/fsl/fsl_sai.c | 8 +-
sound/soc/fsl/fsl_sai.h | 1 +
sound/soc/qcom/qdsp6/q6apm.c | 7 +-
sound/soc/qcom/qdsp6/topology.c | 4 +-
sound/soc/sof/ipc3-dtrace.c | 9 +-
sound/soc/tegra/tegra210_adx.c | 34 ++-
sound/soc/tegra/tegra210_amx.c | 40 ++--
tools/include/nolibc/stackprotector.h | 5 +-
tools/perf/Makefile.config | 4 +-
.../tests/shell/test_uprobe_from_different_cu.sh | 77 +++++++
tools/perf/util/dwarf-aux.c | 4 +-
tools/testing/radix-tree/maple.c | 6 +-
tools/testing/selftests/mm/mkdirty.c | 2 +-
tools/testing/selftests/tc-testing/config | 2 +
tools/testing/selftests/tc-testing/settings | 1 +
218 files changed, 2462 insertions(+), 1482 deletions(-)




2023-07-25 12:25:55

by Ronald Warsow

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

Hi Greg

6.4.7-rc1

compiles, boots and runs here on x86_64
(Intel Rocket Lake, i5-11400)

Thanks

Tested-by: Ronald Warsow <[email protected]>


2023-07-25 16:44:27

by Jon Hunter

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, 25 Jul 2023 12:42:47 +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

All tests passing for Tegra ...

Test results for stable-v6.4:
11 builds: 11 pass, 0 fail
28 boots: 28 pass, 0 fail
130 tests: 130 pass, 0 fail

Linux version: 6.4.7-rc1-g3c19c5641cce
Boards tested: tegra124-jetson-tk1, tegra186-p2771-0000,
tegra194-p2972-0000, tegra194-p3509-0000+p3668-0000,
tegra20-ventana, tegra210-p2371-2180,
tegra210-p3450-0000, tegra30-cardhu-a04

Tested-by: Jon Hunter <[email protected]>

Jon

2023-07-25 18:40:35

by SeongJae Park

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

Hi,

On Tue, 25 Jul 2023 12:42:47 +0200 Greg Kroah-Hartman <[email protected]> wrote:

> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.

This rc kernel passes the DAMON functionality test[1] on my test machine.
Attaching the test results summary below. Please note that I retrieved the
kernel from the linux-stable-rc tree[2].

Tested-by: SeongJae Park <[email protected]>

[1] https://github.com/awslabs/damon-tests/tree/next/corr
[2] commit 3c19c5641cce ("Linux 6.4.7-rc1")


Thanks,
SJ

[...]

---

ok 1 selftests: damon: debugfs_attrs.sh
ok 2 selftests: damon: debugfs_schemes.sh
ok 3 selftests: damon: debugfs_target_ids.sh
ok 4 selftests: damon: debugfs_empty_targets.sh
ok 5 selftests: damon: debugfs_huge_count_read_write.sh
ok 6 selftests: damon: debugfs_duplicate_context_creation.sh
ok 7 selftests: damon: debugfs_rm_non_contexts.sh
ok 8 selftests: damon: sysfs.sh
ok 9 selftests: damon: sysfs_update_removed_scheme_dir.sh
ok 10 selftests: damon: reclaim.sh
ok 11 selftests: damon: lru_sort.sh
ok 1 selftests: damon-tests: kunit.sh
ok 2 selftests: damon-tests: huge_count_read_write.sh
ok 3 selftests: damon-tests: buffer_overflow.sh
ok 4 selftests: damon-tests: rm_contexts.sh
ok 5 selftests: damon-tests: record_null_deref.sh
ok 6 selftests: damon-tests: dbgfs_target_ids_read_before_terminate_race.sh
ok 7 selftests: damon-tests: dbgfs_target_ids_pid_leak.sh
ok 8 selftests: damon-tests: damo_tests.sh
ok 9 selftests: damon-tests: masim-record.sh
ok 10 selftests: damon-tests: build_i386.sh
ok 11 selftests: damon-tests: build_m68k.sh
ok 12 selftests: damon-tests: build_arm64.sh
ok 13 selftests: damon-tests: build_i386_idle_flag.sh
ok 14 selftests: damon-tests: build_i386_highpte.sh
ok 15 selftests: damon-tests: build_nomemcg.sh

PASS

2023-07-25 20:47:11

by Shuah Khan

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/25/23 04:42, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

Compiled and booted on my test system. No dmesg regressions.

Tested-by: Shuah Khan <[email protected]>

thanks,
-- Shuah

2023-07-26 01:21:29

by Florian Fainelli

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/25/23 03:42, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

On ARCH_BRCMSTB using 32-bit and 64-bit ARM kernels, build tested on
BMIPS_GENERIC:

Tested-by: Florian Fainelli <[email protected]>
--
Florian


2023-07-26 06:57:15

by Bagas Sanjaya

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, Jul 25, 2023 at 12:42:47PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>

Successfully compiled and installed bindeb-pkgs on my computer (Acer
Aspire E15, Intel Core i3 Haswell). No noticeable regressions.

Tested-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara



2023-07-26 09:52:26

by Ron Economos

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/25/23 3:42 AM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

Built and booted successfully on RISC-V RV64 (HiFive Unmatched).

Tested-by: Ron Economos <[email protected]>


2023-07-26 10:05:11

by Conor Dooley

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, Jul 25, 2023 at 12:42:47PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.

Tested-by: Conor Dooley <[email protected]>

Thanks,
Conor.



2023-07-26 11:01:07

by Naresh Kamboju

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, 25 Jul 2023 at 16:19, Greg Kroah-Hartman
<[email protected]> wrote:
>
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h


Results from Linaro’s test farm.
No regressions on arm64, arm, x86_64, and i386.

Tested-by: Linux Kernel Functional Testing <[email protected]>

## Build
* kernel: 6.4.7-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-6.4.y
* git commit: 3c19c5641cce21ec84a7d62be76d53f454531f48
* git describe: v6.4.6-228-g3c19c5641cce
* test details:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.4.y/build/v6.4.6-228-g3c19c5641cce
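
The `git describe` string above follows the usual `<tag>-<count>-g<abbreviated hash>` layout. The following is a minimal sketch of splitting it into its parts; the function name `parse_git_describe` is hypothetical and not part of any LKFT tooling. A count of 228 would be consistent with the 227 patches in the series plus the "Linux 6.4.7-rc1" commit on top of v6.4.6, though that reading is an assumption.

```python
import re


def parse_git_describe(text):
    """Split `git describe` output of the form <tag>-<count>-g<abbrev-hash>."""
    match = re.fullmatch(r"(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)", text)
    if match is None:
        raise ValueError(f"not in <tag>-<count>-g<hash> form: {text!r}")
    return match["tag"], int(match["count"]), match["sha"]


# Values copied from the build section of the report.
tag, commits_since_tag, short_sha = parse_git_describe("v6.4.6-228-g3c19c5641cce")
full_sha = "3c19c5641cce21ec84a7d62be76d53f454531f48"

# The abbreviated hash should be a prefix of the full commit id.
print(tag, commits_since_tag, short_sha, full_sha.startswith(short_sha))
```

The same abbreviated hash appears in the other reports in this thread as `6.4.7-rc1-g3c19c5641cce`, so it can be used to confirm that all testers built the same commit.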

## Test Regressions (compared to v6.4.5)

## Metric Regressions (compared to v6.4.5)

## Test Fixes (compared to v6.4.5)

## Metric Fixes (compared to v6.4.5)

## Test result summary
total: 166993, pass: 145128, fail: 2201, skip: 19509, xfail: 155
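
The summary line can be sanity-checked by confirming that the per-status counts add up to the reported total. This is a minimal sketch that assumes the `name: count` layout shown above; `parse_summary` is a hypothetical helper, not an LKFT API.

```python
import re


def parse_summary(line):
    """Turn 'total: N, pass: N, ...' into a dict of integer counts."""
    return {key: int(value) for key, value in re.findall(r"(\w+): (\d+)", line)}


# Summary line copied from the report.
summary = parse_summary(
    "total: 166993, pass: 145128, fail: 2201, skip: 19509, xfail: 155"
)

# Every status except "total" should be accounted for in the total.
accounted = sum(count for name, count in summary.items() if name != "total")
print(accounted == summary["total"])
```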

## Build Summary
* arc: 5 total, 5 passed, 0 failed
* arm: 141 total, 141 passed, 0 failed
* arm64: 50 total, 50 passed, 0 failed
* i386: 37 total, 37 passed, 0 failed
* mips: 26 total, 26 passed, 0 failed
* parisc: 3 total, 3 passed, 0 failed
* powerpc: 34 total, 34 passed, 0 failed
* riscv: 22 total, 22 passed, 0 failed
* s390: 12 total, 12 passed, 0 failed
* sh: 12 total, 12 passed, 0 failed
* sparc: 6 total, 6 passed, 0 failed
* x86_64: 42 total, 42 passed, 0 failed

## Test suites summary
* boot
* kselftest-android
* kselftest-arm64
* kselftest-breakpoints
* kselftest-capabilities
* kselftest-cgroup
* kselftest-clone3
* kselftest-core
* kselftest-cpu-hotplug
* kselftest-cpufreq
* kselftest-drivers-dma-buf
* kselftest-efivarfs
* kselftest-exec
* kselftest-filesystems
* kselftest-filesystems-binderfs
* kselftest-filesystems-epoll
* kselftest-firmware
* kselftest-fpu
* kselftest-ftrace
* kselftest-futex
* kselftest-gpio
* kselftest-intel_pstate
* kselftest-ipc
* kselftest-ir
* kselftest-kcmp
* kselftest-kexec
* kselftest-kvm
* kselftest-lib
* kselftest-livepatch
* kselftest-membarrier
* kselftest-memfd
* kselftest-memory-hotplug
* kselftest-mincore
* kselftest-mount
* kselftest-mqueue
* kselftest-net
* kselftest-net-forwarding
* kselftest-net-mptcp
* kselftest-netfilter
* kselftest-nsfs
* kselftest-openat2
* kselftest-pid_namespace
* kselftest-pidfd
* kselftest-proc
* kselftest-pstore
* kselftest-ptrace
* kselftest-rseq
* kselftest-rtc
* kselftest-seccomp
* kselftest-sigaltstack
* kselftest-size
* kselftest-splice
* kselftest-static_keys
* kselftest-sync
* kselftest-sysctl
* kselftest-tc-testing
* kselftest-timens
* kselftest-timers
* kselftest-tmpfs
* kselftest-tpm2
* kselftest-user
* kselftest-user_events
* kselftest-vDSO
* kselftest-vm
* kselftest-watchdog
* kselftest-x86
* kselftest-zram
* kunit
* kvm-unit-tests
* libgpiod
* libhugetlbfs
* log-parser-boot
* log-parser-test
* ltp-cap_bounds
* ltp-commands
* ltp-containers
* ltp-controllers
* ltp-cpuhotplug
* ltp-crypto
* ltp-cve
* ltp-dio
* ltp-fcntl-locktests
* ltp-filecaps
* ltp-fs
* ltp-fs_bind
* ltp-fs_perms_simple
* ltp-fsx
* ltp-hugetlb
* ltp-io
* ltp-ipc
* ltp-math
* ltp-mm
* ltp-nptl
* ltp-pty
* ltp-sched
* ltp-securebits
* ltp-smoke
* ltp-syscalls
* ltp-tracing
* network-basic-tests
* perf
* rcutorture
* v4l2-compliance

--
Linaro LKFT
https://lkft.linaro.org

2023-07-27 00:40:23

by Guenter Roeck

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, Jul 25, 2023 at 12:42:47PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>

Build results:
total: 157 pass: 157 fail: 0
Qemu test results:
total: 522 pass: 522 fail: 0

Tested-by: Guenter Roeck <[email protected]>

Guenter

2023-07-27 04:52:15

by Joel Fernandes

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Tue, Jul 25, 2023 at 12:42:47PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 6.4.7 release.
> There are 227 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> and the diffstat can be found below.

I saw the warning below when running rcutorture; it happened in the TREE04
configuration. This is likely due to the stuttering issues we are discussing
in the other thread. Anyway, I am just making a note here while I
continue to look into it.

Other than that, all tests pass:
Tested-by: Joel Fernandes (Google) <[email protected]>


[ 1676.206713] ------------[ cut here ]------------
[ 1676.213985] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
[ 1676.224945] WARNING: CPU: 7 PID: 103 at kernel/rcu/rcutorture.c:2841 rcutorture_oom_notify+0x3c/0x1d0
[ 1676.238323] Modules linked in:
[ 1676.242750] CPU: 7 PID: 103 Comm: rcu_torture_fwd Not tainted 6.4.7-rc1-g3c19c5641cce #6
[ 1676.254378] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 1676.268003] RIP: 0010:rcutorture_oom_notify+0x3c/0x1d0
[ 1676.275468] Code: d5 53 e8 e7 23 d4 00 48 8b 1d 70 34 45 02 48 85 db 0f 84 88 01 00 00 48 c7 c6 e0 f6 a0 b2 48 c7 c7 88 91 ee b2 e8 14 25 f7 ff <0f> 0b 8b 35 8c d8 a2 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83 c5 01
[ 1676.302738] RSP: 0000:ffffa7c6c0397a98 EFLAGS: 00010282
[ 1676.310984] RAX: 0000000000000000 RBX: ffff897a418cc000 RCX: 00000000ffffdfff
[ 1676.322207] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000009ffb
[ 1676.333232] RBP: ffffa7c6c0397b28 R08: 00000000ffffdfff R09: 00000000ffffdfff
[ 1676.342365] R10: ffffffffb32591e0 R11: ffffffffb32591e0 R12: 0000000000000000
[ 1676.352563] R13: ffffa7c6c0397b28 R14: 00000000ffffffff R15: 0000000000000000
[ 1676.362721] FS: 0000000000000000(0000) GS:ffff897a5f5c0000(0000) knlGS:0000000000000000
[ 1676.374816] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1676.383256] CR2: 0000000000000000 CR3: 000000001e22e000 CR4: 00000000000006e0
[ 1676.392499] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1676.401739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1676.410804] Call Trace:
[ 1676.414279] <TASK>
[ 1676.417140] ? rcutorture_oom_notify+0x3c/0x1d0
[ 1676.422944] ? __warn+0x7c/0x120
[ 1676.427146] ? rcutorture_oom_notify+0x3c/0x1d0
[ 1676.432902] ? report_bug+0x15d/0x180
[ 1676.437783] ? handle_bug+0x3c/0x70
[ 1676.442369] ? exc_invalid_op+0x17/0x70
[ 1676.447269] ? asm_exc_invalid_op+0x1a/0x20
[ 1676.452574] ? rcutorture_oom_notify+0x3c/0x1d0
[ 1676.458128] ? rcutorture_oom_notify+0x3c/0x1d0
[ 1676.463880] notifier_call_chain+0x55/0xb0
[ 1676.469255] blocking_notifier_call_chain+0x3a/0x60
[ 1676.475244] out_of_memory+0x3bc/0x710
[ 1676.480323] __alloc_pages_slowpath.constprop.0+0xbb6/0xd00
[ 1676.487347] __alloc_pages+0x2cb/0x2e0
[ 1676.492200] allocate_slab+0x348/0x3e0
[ 1676.496983] ? sysvec_reschedule_ipi+0x31/0xd0
[ 1676.502607] ___slab_alloc+0x2d8/0x7a0
[ 1676.507406] ? rcu_torture_fwd_prog+0x3d8/0xa60
[ 1676.513157] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 1676.519767] ? rcu_nocb_do_flush_bypass+0xc6/0x110
[ 1676.525875] ? rcu_torture_fwd_prog+0x3d8/0xa60
[ 1676.531607] __kmem_cache_alloc_node+0x183/0x1a0
[ 1676.537506] kmalloc_trace+0x25/0x90
[ 1676.542240] rcu_torture_fwd_prog+0x3d8/0xa60
[ 1676.547800] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 1676.554051] ? kthread+0xcb/0xf0
[ 1676.558286] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 1676.564594] kthread+0xcb/0xf0
[ 1676.568731] ? __pfx_kthread+0x10/0x10
[ 1676.573590] ret_from_fork+0x2c/0x50
[ 1676.578317] </TASK>
[ 1676.581240] ---[ end trace 0000000000000000 ]---


thanks,

- Joel

>
> thanks,
>
> greg k-h
>
> -------------
> Pseudo-Shortlog of commits:
>
> Greg Kroah-Hartman <[email protected]>
> Linux 6.4.7-rc1
>
> Wayne Lin <[email protected]>
> drm/amd/display: Add polling method to handle MST reply packet
>
> Srinivasan Shanmugam <[email protected]>
> drm/amd/display: Clean up errors & warnings in amdgpu_dm.c
>
> Yu Kuai <[email protected]>
> scsi: sg: Fix checking return value of blk_get_queue()
>
> Yu Kuai <[email protected]>
> scsi/sg: don't grab scsi host module reference
>
> Abe Kohandel <[email protected]>
> spi: dw: Remove misleading comment for Mount Evans SoC
>
> Yunxiang Li <[email protected]>
> drm/ttm: fix bulk_move corruption when adding a entry
>
> Mohamed Khalfella <[email protected]>
> tracing/histograms: Return an error if we fail to add histogram to hist_vars list
>
> Miguel Ojeda <[email protected]>
> kbuild: rust: avoid creating temporary files
>
> Zhang Yi <[email protected]>
> jbd2: recheck chechpointing non-dirty buffer
>
> Vladimir Oltean <[email protected]>
> net: phy: prevent stale pointer dereference in phy_init()
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around fastopenq.max_qlen
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around icsk->icsk_user_timeout
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->notsent_lowat
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around rskq_defer_accept
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->linger2
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around icsk->icsk_syn_retries
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->keepalive_probes
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->keepalive_intvl
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->keepalive_time
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->tsoffset
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tp->tcp_tx_delay
>
> Tomasz Moń <[email protected]>
> Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
>
> Pauli Virtanen <[email protected]>
> Bluetooth: SCO: fix sco_conn related locking and validity issues
>
> Siddh Raman Pant <[email protected]>
> Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
>
> Douglas Anderson <[email protected]>
> Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
>
> Pauli Virtanen <[email protected]>
> Bluetooth: ISO: fix iso_conn related locking and validity issues
>
> Pauli Virtanen <[email protected]>
> Bluetooth: hci_event: call disconnect callback before deleting conn
>
> Pauli Virtanen <[email protected]>
> Bluetooth: use RCU for hci_conn_params and iterate safely in hci_sync
>
> Pablo Neira Ayuso <[email protected]>
> netfilter: nf_tables: skip bound chain on rule flush
>
> Pablo Neira Ayuso <[email protected]>
> netfilter: nf_tables: skip bound chain in netns release path
>
> Florian Westphal <[email protected]>
> netfilter: nft_set_pipapo: fix improper element removal
>
> Florian Westphal <[email protected]>
> netfilter: nf_tables: can't schedule in nft_chain_validate
>
> Florian Westphal <[email protected]>
> netfilter: nf_tables: fix spurious set element insertion failure
>
> Vitaly Rodionov <[email protected]>
> ALSA: hda/realtek: Fix generic fixup definition for cs35l41 amp
>
> Kuniyuki Iwashima <[email protected]>
> llc: Don't drop packet from non-root netns.
>
> Zhang Shurong <[email protected]>
> fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe
>
> Daniel Golle <[email protected]>
> net: ethernet: mtk_eth_soc: always mtk_get_ib1_pkt_type
>
> Kuniyuki Iwashima <[email protected]>
> Revert "tcp: avoid the lookup process failing to get sk in ehash table"
>
> Yuanjun Gong <[email protected]>
> net:ipv6: check return value of pskb_trim()
>
> Wang Ming <[email protected]>
> net: ipv4: Use kfree_sensitive instead of kfree
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tcp_rsk(req)->ts_recent
>
> Eric Dumazet <[email protected]>
> tcp: annotate data-races around tcp_rsk(req)->txhash
>
> Antoine Tenart <[email protected]>
> net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
>
> Florian Kauer <[email protected]>
> igc: Prevent garbled TX queue with XDP ZEROCOPY
>
> Kurt Kanzenbach <[email protected]>
> igc: Avoid transmit queue timeout for XDP
>
> Alexander Duyck <[email protected]>
> bpf, arm64: Fix BTI type used for freplace attached functions
>
> Kumar Kartikeya Dwivedi <[email protected]>
> bpf: Repeat check_max_stack_depth for async callbacks
>
> Kumar Kartikeya Dwivedi <[email protected]>
> bpf: Fix subprog idx logic in check_max_stack_depth
>
> Geetha sowjanya <[email protected]>
> octeontx2-pf: Dont allocate BPIDs for LBK interfaces
>
> Ido Schimmel <[email protected]>
> vrf: Fix lockdep splat in output path
>
> Jiapeng Chong <[email protected]>
> security: keys: Modify mismatched function name
>
> Ahmed Zaki <[email protected]>
> iavf: fix reset task race with iavf_remove()
>
> Ahmed Zaki <[email protected]>
> iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies
>
> Marcin Szycik <[email protected]>
> iavf: Wait for reset in callbacks which trigger it
>
> Przemek Kitszel <[email protected]>
> iavf: make functions static where possible
>
> Ahmed Zaki <[email protected]>
> iavf: use internal state to free traffic IRQs
>
> Ding Hui <[email protected]>
> iavf: Fix out-of-bounds when setting channels on remove
>
> Ding Hui <[email protected]>
> iavf: Fix use-after-free in free_netdev
>
> Andrzej Hajda <[email protected]>
> drm/i915/perf: add sentinel to xehp_oa_b_counters
>
> Heiner Kallweit <[email protected]>
> r8169: fix ASPM-related problem for chip version 42 and 43
>
> Tristram Ha <[email protected]>
> net: dsa: microchip: correct KSZ8795 static MAC table access
>
> Victor Nogueira <[email protected]>
> net: sched: cls_bpf: Undo tcf_bind_filter in case of an error
>
> Victor Nogueira <[email protected]>
> net: sched: cls_u32: Undo refcount decrement in case update failed
>
> Victor Nogueira <[email protected]>
> net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode
>
> Victor Nogueira <[email protected]>
> net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms
>
> Martin Fuzzey <[email protected]>
> regulator: da9063: fix null pointer deref with partial DT config
>
> Dan Carpenter <[email protected]>
> ASoC: SOF: ipc3-dtrace: uninitialized data in dfsentry_trace_filter_write()
>
> Michal Swiatkowski <[email protected]>
> ice: prevent NULL pointer deref during reload
>
> Petr Oros <[email protected]>
> ice: Unregister netdev and devlink_port only once
>
> Shyam Prasad N <[email protected]>
> cifs: fix mid leak during reconnection after timeout threshold
>
> Dan Carpenter <[email protected]>
> iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()
>
> Yan Zhai <[email protected]>
> gso: fix dodgy bit handling for GSO_UDP_L4
>
> Daniel Golle <[email protected]>
> net: ethernet: mtk_eth_soc: handle probe deferral
>
> Kuniyuki Iwashima <[email protected]>
> bridge: Add extack warning when enabling STP in netns.
>
> Tanmay Patil <[email protected]>
> net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field()
>
> Linus Walleij <[email protected]>
> dsa: mv88e6xxx: Do a final check before timing out
>
> Marc Zyngier <[email protected]>
> arm64: Fix HFGxTR_EL2 field naming
>
> Paulo Alcantara <[email protected]>
> smb: client: fix missed ses refcounting
>
> Yonghong Song <[email protected]>
> kallsyms: strip LTO-only suffixes from promoted global functions
>
> Jaewon Kim <[email protected]>
> spi: s3c64xx: clear loopback bit after loopback test
>
> Christoph Hellwig <[email protected]>
> btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block
>
> James Clark <[email protected]>
> perf build: Fix library not found error when using CSLIBS
>
> Yangtao Li <[email protected]>
> fbdev: imxfb: Removed unneeded release_mem_region
>
> Martin Kaiser <[email protected]>
> fbdev: imxfb: warn about invalid left/right margin
>
> Jonas Gorski <[email protected]>
> spi: bcm63xx: fix max prepend length
>
> Biju Das <[email protected]>
> pinctrl: renesas: rzg2l: Handle non-unique subnode names
>
> Geert Uytterhoeven <[email protected]>
> pinctrl: renesas: rzv2m: Handle non-unique subnode names
>
> Suren Baghdasaryan <[email protected]>
> sched/psi: use kernfs polling functions for PSI trigger polling
>
> Miaohe Lin <[email protected]>
> sched/fair: Use recent_used_cpu to test p->cpus_ptr
>
> Peter Zijlstra <[email protected]>
> iov_iter: Mark copy_iovec_from_user() noclone
>
> Srinivas Kandagatla <[email protected]>
> ASoC: qcom: q6apm: do not close GPR port before closing graph
>
> Srinivas Kandagatla <[email protected]>
> ASoC: codecs: wcd938x: fix dB range for HPHL and HPHR
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix mbhc impedance loglevel
>
> Vijendar Mukunda <[email protected]>
> ASoC: amd: acp: fix for invalid dai id handling in acp_get_byte_count()
>
> Hao Chen <[email protected]>
> net: hns3: fix strncpy() not using dest-buf length as length issue
>
> Ying Hsu <[email protected]>
> igb: Fix igb_down hung on surprise removal
>
> Yi Kuo <[email protected]>
> wifi: iwlwifi: pcie: add device id 51F1 for killer 1675
>
> Johannes Berg <[email protected]>
> wifi: iwlwifi: mvm: avoid baid size integer overflow
>
> Mukesh Sisodiya <[email protected]>
> wifi: iwlwifi: Add support for new PCI Id
>
> Gustavo A. R. Silva <[email protected]>
> wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point()
>
> Mukesh Sisodiya <[email protected]>
> wifi: iwlwifi: mvm: Add NULL check before dereferencing the pointer
>
> Petr Oros <[email protected]>
> devlink: report devlink_port_type_warn source device
>
> Jisheng Zhang <[email protected]>
> net: ethernet: litex: add support for 64 bit stats
>
> Gregory Greenman <[email protected]>
> wifi: iwlwifi: mvm: fix potential array out of bounds access
>
> P Praneesh <[email protected]>
> wifi: ath11k: fix memory leak in WMI firmware stats
>
> Balamurugan S <[email protected]>
> wifi: ath12k: Avoid NULL pointer access during management transmit cleanup
>
> Abe Kohandel <[email protected]>
> spi: dw: Add compatible for Intel Mount Evans SoC
>
> Ilan Peer <[email protected]>
> wifi: mac80211_hwsim: Fix possible NULL dereference
>
> Wen Gong <[email protected]>
> wifi: ath11k: add support default regdb while searching board-2.bin for WCN6855
>
> Jakub Kicinski <[email protected]>
> devlink: make health report on unregistered instance warn just once
>
> Yonghong Song <[email protected]>
> bpf: Silence a warning in btf_type_id_size()
>
> Martin Blumenstingl <[email protected]>
> wifi: rtw88: sdio: Check the HISR RX_REQUEST bit in rtw_sdio_rx_isr()
>
> Aditi Ghag <[email protected]>
> bpf: tcp: Avoid taking fast sock lock in iterator
>
> Andrii Nakryiko <[email protected]>
> bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log
>
> Brad Larson <[email protected]>
> spi: cadence-quadspi: Add compatible for AMD Pensando Elba SoC
>
> Martin KaFai Lau <[email protected]>
> bpf: Address KCSAN report on bpf_lru_list
>
> Kui-Feng Lee <[email protected]>
> bpf: Print a warning only if writing to unprivileged_bpf_disabled.
>
> Maxime Bizon <[email protected]>
> wifi: ath11k: fix registration of 6Ghz-only phy without the full channel range
>
> Yicong Yang <[email protected]>
> sched/fair: Don't balance task to its current running CPU
>
> Thomas Weißschuh <[email protected]>
> tools/nolibc: ensure stack protector guard is never zero
>
> Paul E. McKenney <[email protected]>
> rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
>
> Shigeru Yoshida <[email protected]>
> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
>
> Hans de Goede <[email protected]>
> ACPI: video: Add backlight=native DMI quirk for Dell Studio 1569
>
> Mark Rutland <[email protected]>
> arm64: mm: fix VA-range sanity check
>
> Youngmin Nam <[email protected]>
> arm64: set __exception_irq_entry with __irq_entry as a default
>
> Mario Limonciello <[email protected]>
> ACPI: resource: Remove "Zen" specific match and quirks
>
> Hans de Goede <[email protected]>
> ACPI: video: Add backlight=native DMI quirk for Lenovo ThinkPad X131e (3371 AMD version)
>
> Hans de Goede <[email protected]>
> ACPI: video: Add backlight=native DMI quirk for Apple iMac11,3
>
> Hans de Goede <[email protected]>
> ACPI: x86: Add ACPI_QUIRK_UART1_SKIP for Lenovo Yoga Book yb1-x90f/l
>
> Hans de Goede <[email protected]>
> ACPI: button: Add lid disable DMI quirk for Nextbook Ares 8A
>
> Hans de Goede <[email protected]>
> ACPI: x86: Add skip i2c clients quirk for Nextbook Ares 8A
>
> Sandeep Dhavale <[email protected]>
> erofs: Fix detection of atomic context
>
> Filipe Manana <[email protected]>
> btrfs: abort transaction at update_ref_for_cow() when ref count is zero
>
> Christoph Hellwig <[email protected]>
> btrfs: don't check PageError in __extent_writepage
>
> David Sterba <[email protected]>
> btrfs: add xxhash to fast checksum implementations
>
> Thomas Gleixner <[email protected]>
> posix-timers: Ensure timer ID search-loop limit is valid
>
> Ming Lei <[email protected]>
> blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none
>
> Yu Kuai <[email protected]>
> scsi: sg: fix blktrace debugfs entries leakage
>
> Yu Kuai <[email protected]>
> md/raid10: prevent soft lockup while flush writes
>
> Yu Kuai <[email protected]>
> md: fix data corruption for raid456 when reshape restart while grow up
>
> Immad Mir <[email protected]>
> FS: JFS: Check for read-only mounted filesystem in txBegin
>
> Immad Mir <[email protected]>
> FS: JFS: Fix null-ptr-deref Read in txBegin
>
> Gustavo A. R. Silva <[email protected]>
> MIPS: dec: prom: Address -Warray-bounds warning
>
> Yogesh <[email protected]>
> fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev
>
> Matthew Anderson <[email protected]>
> ALSA: hda/realtek: Add quirks for ROG ALLY CS35l41 audio
>
> Jan Kara <[email protected]>
> udf: Fix uninitialized array access for some pathnames
>
> Christian Brauner <[email protected]>
> ovl: check type and offset of struct vfsmount in ovl_entry
>
> Marco Morandini <[email protected]>
> HID: add quirk for 03f0:464a HP Elite Presenter Mouse
>
> Ye Bin <[email protected]>
> quota: fix warning in dqgrab()
>
> Jan Kara <[email protected]>
> quota: Properly disable quotas when add_dquot_ref() fails
>
> Oswald Buddenhagen <[email protected]>
> ALSA: emu10k1: roll up loops in DSP setup code for Audigy
>
> hackyzh002 <[email protected]>
> drm/radeon: Fix integer overflow in radeon_cs_parser_init
>
> Eric Whitney <[email protected]>
> ext4: correct inline offset when handling xattrs in inode body
>
> Marc Zyngier <[email protected]>
> KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
>
> Marc Zyngier <[email protected]>
> KVM: arm64: Disable preemption in kvm_arch_hardware_enable()
>
> Oliver Upton <[email protected]>
> KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
>
> Marc Zyngier <[email protected]>
> KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix soundwire initialisation race
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix codec initialisation race
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd934x: fix resource leaks on component remove
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix missing mbhc init error handling
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix resource leaks on component remove
>
> Sheetal <[email protected]>
> ASoC: tegra: Fix AMX byte map
>
> Johan Hovold <[email protected]>
> ASoC: qdsp6: audioreach: fix topology probe deferral
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd-mbhc-v2: fix resource leaks on component remove
>
> Nathan Chancellor <[email protected]>
> ASoC: cs35l45: Select REGMAP_IRQ
>
> Johan Hovold <[email protected]>
> ASoC: codecs: wcd938x: fix missing clsh ctrl error handling
>
> Thomas Petazzoni <[email protected]>
> ASoC: cs42l51: fix driver to properly autoload with automatic module loading
>
> Sameer Pujar <[email protected]>
> ASoC: rt5640: Fix sleep in atomic context
>
> Sheetal <[email protected]>
> ASoC: tegra: Fix ADX byte map
>
> Fabio Estevam <[email protected]>
> ASoC: fsl_sai: Revert "ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode"
>
> Matus Gajdos <[email protected]>
> ASoC: fsl_sai: Disable bit clock with transmitter
>
> Nicholas Kazlauskas <[email protected]>
> drm/amd/display: Keep PHY active for DP displays on DCN31
>
> Taimur Hassan <[email protected]>
> drm/amd/display: check TG is non-null before checking if enabled
>
> Zhikai Zhai <[email protected]>
> drm/amd/display: Disable MPC split by default on special asic
>
> Simon Ser <[email protected]>
> drm/amd/display: only accept async flips for fast updates
>
> Jocelyn Falempe <[email protected]>
> drm/client: Fix memory leak in drm_client_modeset_probe
>
> Jocelyn Falempe <[email protected]>
> drm/client: Fix memory leak in drm_client_target_cloned
>
> Ben Skeggs <[email protected]>
> drm/nouveau/i2c: fix number of aux event slots
>
> Ben Skeggs <[email protected]>
> drm/nouveau/kms/nv50-: init hpd_irq_lock for PIOR DP
>
> Ben Skeggs <[email protected]>
> drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts
>
> Alex Deucher <[email protected]>
> drm/amdgpu/pm: make mclk consistent for smu 13.0.7
>
> Alex Deucher <[email protected]>
> drm/amdgpu/pm: make gfxclock consistent for sienna cichlid
>
> Guchun Chen <[email protected]>
> drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
>
> Ville Syrjälä <[email protected]>
> dma-buf/dma-resv: Stop leaking on krealloc() failure
>
> Dan Carpenter <[email protected]>
> accel/qaic: Add consistent integer overflow checks
>
> Dan Carpenter <[email protected]>
> accel/qaic: tighten bounds checking in decode_message()
>
> Dan Carpenter <[email protected]>
> accel/qaic: tighten bounds checking in encode_message()
>
> Matthieu Baerts <[email protected]>
> selftests: tc: add ConnTrack procfs kconfig
>
> Heiner Kallweit <[email protected]>
> Revert "r8169: disable ASPM during NAPI poll"
>
> Marc Kleine-Budde <[email protected]>
> can: gs_usb: fix time stamp counter initialization
>
> Marc Kleine-Budde <[email protected]>
> can: gs_usb: gs_can_open(): improve error handling
>
> YueHaibing <[email protected]>
> can: bcm: Fix UAF in bcm_proc_show()
>
> Fedor Ross <[email protected]>
> can: mcp251xfd: __mcp251xfd_chip_set_mode(): increase poll timeout
>
> Mark Brown <[email protected]>
> arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes
>
> Helge Deller <[email protected]>
> ia64: mmap: Consider pgoff when searching for free mapping
>
> Mark Brown <[email protected]>
> regmap: Account for register length in SMBus I/O limits
>
> Rob Herring <[email protected]>
> of: Preserve "of-display" device name for compatibility
>
> Harald Freudenberger <[email protected]>
> s390/zcrypt: fix reply buffer calculations for CCA replies
>
> Mark Brown <[email protected]>
> regmap: Drop initial version of maximum transfer length fixes
>
> Matthieu Baerts <[email protected]>
> selftests: tc: add 'ct' action kconfig dep
>
> Dan Carpenter <[email protected]>
> accel/qaic: Fix a leak in map_user_pages()
>
> Matthieu Baerts <[email protected]>
> selftests: tc: set timeout to 15 minutes
>
> Josef Bacik <[email protected]>
> btrfs: fix race between balance and cancel/pause
>
> Miklos Szeredi <[email protected]>
> fuse: ioctl: translate ENOSYS in outarg
>
> Filipe Manana <[email protected]>
> btrfs: zoned: fix memory leak after finding block group with super blocks
>
> Filipe Manana <[email protected]>
> btrfs: fix double iput() on inode after an error during orphan cleanup
>
> Josef Bacik <[email protected]>
> btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
>
> Qu Wenruo <[email protected]>
> btrfs: raid56: always verify the P/Q contents for scrub
>
> Bernd Schubert <[email protected]>
> fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
>
> Miklos Szeredi <[email protected]>
> fuse: add feature flag for expire-only
>
> Miklos Szeredi <[email protected]>
> fuse: revalidate: don't invalidate if interrupted
>
> Filipe Manana <[email protected]>
> btrfs: fix warning when putting transaction with qgroups enabled after abort
>
> Filipe Manana <[email protected]>
> btrfs: fix iput() on error pointer after error during orphan cleanup
>
> Georg Müller <[email protected]>
> perf probe: Read DWARF files from the correct CU
>
> Georg Müller <[email protected]>
> perf probe: Add test for regression introduced by switch to die_get_decl_file()
>
> Miguel Ojeda <[email protected]>
> prctl: move PR_GET_AUXV out of PR_MCE_KILL
>
> Petr Pavlu <[email protected]>
> keys: Fix linking a duplicate key to a keyring's assoc_array
>
> Colin Ian King <[email protected]>
> selftests/mm: mkdirty: fix incorrect position of #endif
>
> Liam R. Howlett <[email protected]>
> maple_tree: fix node allocation testing on 32 bit
>
> Liam R. Howlett <[email protected]>
> mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
>
> Peng Zhang <[email protected]>
> maple_tree: set the node limit when creating a new root node
>
> Luka Guzenko <[email protected]>
> ALSA: hda/realtek: Enable Mute LED on HP Laptop 15s-eq2xxx
>
> Christoffer Sandberg <[email protected]>
> ALSA: hda/realtek: Add quirk for Clevo NS70AU
>
> Kailang Yang <[email protected]>
> ALSA: hda/realtek - remove 3k pull low procedure
>
> Helge Deller <[email protected]>
> io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()
>
> Jens Axboe <[email protected]>
> io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq
>
>
> -------------
>
> Diffstat:
>
> Makefile | 4 +-
> arch/arm64/include/asm/exception.h | 5 -
> arch/arm64/include/asm/kvm_host.h | 2 +
> arch/arm64/include/asm/kvm_pgtable.h | 26 +--
> arch/arm64/kernel/fpsimd.c | 33 ++-
> arch/arm64/kvm/arch_timer.c | 6 +-
> arch/arm64/kvm/arm.c | 19 +-
> arch/arm64/kvm/hyp/pgtable.c | 47 +++-
> arch/arm64/kvm/mmu.c | 18 +-
> arch/arm64/kvm/vgic/vgic-v3.c | 2 +-
> arch/arm64/kvm/vgic/vgic-v4.c | 7 +-
> arch/arm64/mm/mmu.c | 4 +-
> arch/arm64/net/bpf_jit_comp.c | 8 +-
> arch/arm64/tools/sysreg | 12 +-
> arch/ia64/kernel/sys_ia64.c | 2 +-
> arch/mips/include/asm/dec/prom.h | 2 +-
> arch/parisc/kernel/sys_parisc.c | 15 +-
> block/blk-mq.c | 10 +-
> drivers/accel/qaic/qaic_control.c | 39 ++--
> drivers/acpi/button.c | 9 +
> drivers/acpi/resource.c | 60 -----
> drivers/acpi/video_detect.c | 24 ++
> drivers/acpi/x86/utils.c | 26 ++-
> drivers/base/regmap/regmap-i2c.c | 8 +-
> drivers/base/regmap/regmap-spi-avmm.c | 2 +-
> drivers/base/regmap/regmap.c | 6 +-
> drivers/bluetooth/btusb.c | 1 +
> drivers/dma-buf/dma-resv.c | 13 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 5 +-
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 256 +++++++++------------
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 7 +
> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 12 +
> .../amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 110 +++++++++
> .../amd/display/amdgpu_dm/amdgpu_dm_mst_types.h | 11 +
> .../amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c | 5 +
> .../drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c | 3 +-
> .../drm/amd/display/dc/dcn303/dcn303_resource.c | 2 +-
> .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 8 +-
> .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c | 2 +-
> drivers/gpu/drm/drm_client_modeset.c | 6 +
> drivers/gpu/drm/i915/i915_perf.c | 1 +
> drivers/gpu/drm/nouveau/dispnv50/disp.c | 4 +
> drivers/gpu/drm/nouveau/include/nvkm/subdev/i2c.h | 4 +-
> drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c | 27 ++-
> drivers/gpu/drm/nouveau/nvkm/subdev/i2c/base.c | 11 +-
> drivers/gpu/drm/radeon/radeon_cs.c | 3 +-
> drivers/gpu/drm/ttm/ttm_resource.c | 5 +-
> drivers/hid/hid-ids.h | 1 +
> drivers/hid/hid-quirks.c | 1 +
> drivers/iommu/iommu-sva.c | 3 +-
> drivers/md/md.c | 14 +-
> drivers/md/raid10.c | 2 +
> drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c | 10 +-
> drivers/net/can/spi/mcp251xfd/mcp251xfd.h | 1 +
> drivers/net/can/usb/gs_usb.c | 130 ++++++-----
> drivers/net/dsa/microchip/ksz8795.c | 8 +-
> drivers/net/dsa/microchip/ksz_common.c | 8 +-
> drivers/net/dsa/microchip/ksz_common.h | 7 +
> drivers/net/dsa/mv88e6xxx/chip.c | 7 +
> drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 33 ++-
> .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 29 ++-
> drivers/net/ethernet/intel/iavf/iavf.h | 16 +-
> drivers/net/ethernet/intel/iavf/iavf_ethtool.c | 39 ++--
> drivers/net/ethernet/intel/iavf/iavf_main.c | 223 ++++++++++++------
> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 43 ++--
> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 4 -
> drivers/net/ethernet/intel/iavf/iavf_virtchnl.c | 5 +-
> drivers/net/ethernet/intel/ice/ice_base.c | 2 +
> drivers/net/ethernet/intel/ice/ice_ethtool.c | 13 +-
> drivers/net/ethernet/intel/ice/ice_lib.c | 27 ---
> drivers/net/ethernet/intel/ice/ice_main.c | 10 +-
> drivers/net/ethernet/intel/igb/igb_main.c | 5 +
> drivers/net/ethernet/intel/igc/igc_main.c | 12 +-
> drivers/net/ethernet/litex/litex_liteeth.c | 19 +-
> .../net/ethernet/marvell/octeontx2/nic/otx2_pf.c | 5 +-
> drivers/net/ethernet/mediatek/mtk_eth_soc.c | 29 +--
> drivers/net/ethernet/mediatek/mtk_ppe_debugfs.c | 2 +-
> drivers/net/ethernet/realtek/r8169_main.c | 18 +-
> drivers/net/ethernet/ti/cpsw_ale.c | 24 +-
> drivers/net/phy/phy_device.c | 21 +-
> drivers/net/vrf.c | 12 +-
> drivers/net/wireless/ath/ath11k/core.c | 53 +++--
> drivers/net/wireless/ath/ath11k/mac.c | 3 +-
> drivers/net/wireless/ath/ath11k/wmi.c | 5 +
> drivers/net/wireless/ath/ath12k/mac.c | 1 +
> drivers/net/wireless/intel/iwlwifi/mvm/mld-key.c | 9 +-
> drivers/net/wireless/intel/iwlwifi/mvm/power.c | 14 +-
> drivers/net/wireless/intel/iwlwifi/mvm/sta.c | 2 +-
> drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 4 +
> drivers/net/wireless/realtek/rtw88/sdio.c | 24 +-
> drivers/net/wireless/virtual/mac80211_hwsim.c | 4 +-
> drivers/of/platform.c | 2 +-
> drivers/pinctrl/renesas/pinctrl-rzg2l.c | 28 ++-
> drivers/pinctrl/renesas/pinctrl-rzv2m.c | 28 ++-
> drivers/regulator/da9063-regulator.c | 3 +
> drivers/s390/crypto/zcrypt_msgtype6.c | 33 ++-
> drivers/scsi/sg.c | 10 +
> drivers/spi/spi-bcm63xx.c | 2 +-
> drivers/spi/spi-cadence-quadspi.c | 19 ++
> drivers/spi/spi-dw-mmio.c | 22 ++
> drivers/spi/spi-s3c64xx.c | 2 +
> drivers/video/fbdev/au1200fb.c | 3 +
> drivers/video/fbdev/imxfb.c | 5 +-
> fs/btrfs/block-group.c | 1 +
> fs/btrfs/ctree.c | 10 +-
> fs/btrfs/disk-io.c | 3 +
> fs/btrfs/extent_io.c | 33 +--
> fs/btrfs/inode.c | 35 +--
> fs/btrfs/qgroup.c | 1 +
> fs/btrfs/raid56.c | 11 +-
> fs/btrfs/volumes.c | 17 +-
> fs/erofs/zdata.c | 2 +-
> fs/ext4/xattr.c | 14 ++
> fs/fuse/dir.c | 2 +-
> fs/fuse/inode.c | 8 +-
> fs/fuse/ioctl.c | 21 +-
> fs/jbd2/checkpoint.c | 102 +++-----
> fs/jfs/jfs_dmap.c | 3 +
> fs/jfs/jfs_txnmgr.c | 5 +
> fs/jfs/namei.c | 5 +
> fs/overlayfs/ovl_entry.h | 9 +
> fs/quota/dquot.c | 5 +-
> fs/smb/client/connect.c | 19 +-
> fs/smb/client/dfs.c | 26 +--
> fs/smb/client/smb2transport.c | 2 +-
> fs/udf/unicode.c | 2 +-
> include/kvm/arm_vgic.h | 2 +-
> include/linux/psi.h | 5 +-
> include/linux/psi_types.h | 3 +
> include/linux/sched/signal.h | 2 +-
> include/linux/tcp.h | 2 +-
> include/net/bluetooth/hci_core.h | 5 +
> include/net/ip.h | 2 +-
> include/net/tcp.h | 31 ++-
> include/uapi/linux/fuse.h | 3 +
> io_uring/io_uring.c | 52 ++---
> kernel/bpf/bpf_lru_list.c | 21 +-
> kernel/bpf/bpf_lru_list.h | 7 +-
> kernel/bpf/btf.c | 23 +-
> kernel/bpf/log.c | 3 -
> kernel/bpf/syscall.c | 3 +-
> kernel/bpf/verifier.c | 32 ++-
> kernel/cgroup/cgroup.c | 2 +-
> kernel/kallsyms.c | 5 +-
> kernel/rcu/tasks.h | 5 +-
> kernel/rcu/tree_exp.h | 2 +-
> kernel/rcu/tree_plugin.h | 4 +-
> kernel/sched/fair.c | 4 +-
> kernel/sched/psi.c | 29 ++-
> kernel/sys.c | 10 +-
> kernel/time/posix-timers.c | 31 +--
> kernel/trace/trace_events_hist.c | 3 +-
> lib/iov_iter.c | 2 +-
> lib/maple_tree.c | 3 +-
> mm/mlock.c | 9 +-
> net/bluetooth/hci_conn.c | 14 +-
> net/bluetooth/hci_core.c | 42 +++-
> net/bluetooth/hci_event.c | 15 +-
> net/bluetooth/hci_sync.c | 121 ++++++++--
> net/bluetooth/iso.c | 55 +++--
> net/bluetooth/mgmt.c | 26 +--
> net/bluetooth/sco.c | 23 +-
> net/bridge/br_stp_if.c | 3 +
> net/can/bcm.c | 12 +-
> net/devlink/health.c | 2 +-
> net/devlink/leftover.c | 5 +-
> net/ipv4/esp4.c | 2 +-
> net/ipv4/inet_connection_sock.c | 2 +-
> net/ipv4/inet_hashtables.c | 17 +-
> net/ipv4/inet_timewait_sock.c | 8 +-
> net/ipv4/ip_output.c | 4 +-
> net/ipv4/tcp.c | 57 ++---
> net/ipv4/tcp_fastopen.c | 6 +-
> net/ipv4/tcp_ipv4.c | 27 ++-
> net/ipv4/tcp_minisocks.c | 11 +-
> net/ipv4/tcp_output.c | 6 +-
> net/ipv4/udp_offload.c | 16 +-
> net/ipv6/ip6_gre.c | 3 +-
> net/ipv6/tcp_ipv6.c | 4 +-
> net/ipv6/udp_offload.c | 3 +-
> net/llc/llc_input.c | 3 -
> net/netfilter/nf_tables_api.c | 12 +-
> net/netfilter/nft_set_pipapo.c | 6 +-
> net/sched/cls_bpf.c | 99 ++++----
> net/sched/cls_matchall.c | 35 +--
> net/sched/cls_u32.c | 48 +++-
> net/wireless/wext-core.c | 6 +
> scripts/Makefile.build | 5 +-
> scripts/Makefile.host | 6 +-
> scripts/kallsyms.c | 6 +-
> security/keys/request_key.c | 35 ++-
> security/keys/trusted-keys/trusted_tpm2.c | 2 +-
> sound/pci/emu10k1/emufx.c | 112 +--------
> sound/pci/hda/patch_realtek.c | 100 +++++++-
> sound/soc/amd/acp/amd.h | 7 +-
> sound/soc/codecs/Kconfig | 1 +
> sound/soc/codecs/cs42l51-i2c.c | 6 +
> sound/soc/codecs/cs42l51.c | 7 -
> sound/soc/codecs/cs42l51.h | 1 -
> sound/soc/codecs/rt5640.c | 12 +-
> sound/soc/codecs/wcd-mbhc-v2.c | 57 +++--
> sound/soc/codecs/wcd934x.c | 12 +
> sound/soc/codecs/wcd938x.c | 86 ++++++-
> sound/soc/fsl/fsl_sai.c | 8 +-
> sound/soc/fsl/fsl_sai.h | 1 +
> sound/soc/qcom/qdsp6/q6apm.c | 7 +-
> sound/soc/qcom/qdsp6/topology.c | 4 +-
> sound/soc/sof/ipc3-dtrace.c | 9 +-
> sound/soc/tegra/tegra210_adx.c | 34 ++-
> sound/soc/tegra/tegra210_amx.c | 40 ++--
> tools/include/nolibc/stackprotector.h | 5 +-
> tools/perf/Makefile.config | 4 +-
> .../tests/shell/test_uprobe_from_different_cu.sh | 77 +++++++
> tools/perf/util/dwarf-aux.c | 4 +-
> tools/testing/radix-tree/maple.c | 6 +-
> tools/testing/selftests/mm/mkdirty.c | 2 +-
> tools/testing/selftests/tc-testing/config | 2 +
> tools/testing/selftests/tc-testing/settings | 1 +
> 218 files changed, 2462 insertions(+), 1482 deletions(-)
>
>

2023-07-27 11:50:57

by Pavel Machek

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

Hi!

> > This is the start of the stable review cycle for the 6.4.7 release.
> > There are 227 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> > https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> > or in the git tree and branch at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> > and the diffstat can be found below.
>
> I saw this when running rcutorture, this one happened in the TREE04
> configuration. This is likely due to the stuttering issues we are discussing
> in the other thread. Anyway I am just making a note here while I am
> continuing to look into it.

So is the stuttering new in 6.4.7?

> Other than that, all tests pass:
> Tested-by: Joel Fernandes (Google) <[email protected]>

...or you still believe 6.4.7 is okay to release?

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany



2023-07-27 13:33:57

by Joel Fernandes

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review



> On Jul 27, 2023, at 7:35 AM, Pavel Machek <[email protected]> wrote:
>
> Hi!
>
>>> This is the start of the stable review cycle for the 6.4.7 release.
>>> There are 227 patches in this series, all will be posted as a response
>>> to this one. If anyone has any issues with these being applied, please
>>> let me know.
>>>
>>> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
>>> Anything received after that time might be too late.
>>>
>>> The whole patch series can be found in one patch at:
>>> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
>>> or in the git tree and branch at:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
>>> and the diffstat can be found below.
>>
>> I saw this when running rcutorture, this one happened in the TREE04
>> configuration. This is likely due to the stuttering issues we are discussing
>> in the other thread. Anyway I am just making a note here while I am
>> continuing to look into it.
>
> So is the stuttering new in 6.4.7?

No, it is an old feature in the RCU torture tests, but it is dependent on timing. Something
changed in recent kernels that is making the issues with it more likely. It's hard to bisect, as a failure sometimes takes hours.

>
>> Other than that, all tests pass:
>> Tested-by: Joel Fernandes (Google) <[email protected]>
>
> ...or you still believe 6.4.7 is okay to release?

As such, it should be OK. However, I am naturally not happy that the RCU testing
is intermittently failing. These issues have been seen in the last several 6.4 stable releases,
so since those were released, maybe this one can be too?
The fix for stuttering is currently being reviewed.

Thanks,

- Joel


>
> Best regards,
> Pavel
> --
> DENX Software Engineering GmbH, Managing Director: Erika Unter
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

2023-07-27 14:22:37

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Thu, Jul 27, 2023 at 09:26:52AM -0400, Joel Fernandes wrote:
>
>
> > On Jul 27, 2023, at 7:35 AM, Pavel Machek <[email protected]> wrote:
> >
> > Hi!
> >
> >>> This is the start of the stable review cycle for the 6.4.7 release.
> >>> There are 227 patches in this series, all will be posted as a response
> >>> to this one. If anyone has any issues with these being applied, please
> >>> let me know.
> >>>
> >>> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> >>> Anything received after that time might be too late.
> >>>
> >>> The whole patch series can be found in one patch at:
> >>> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> >>> or in the git tree and branch at:
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> >>> and the diffstat can be found below.
> >>
> >> I saw this when running rcutorture, this one happened in the TREE04
> >> configuration. This is likely due to the stuttering issues we are discussing
> >> in the other thread. Anyway I am just making a note here while I am
> >> continuing to look into it.
> >
> > So is the stuttering new in 6.4.7?
>
> No it is an old feature in RCU torture tests. But is dependent on timing. Something
> changed in recent kernels that is making the issues with it more likely. Its hard to bisect as failure sometimes takes hours.
>
> >
> >> Other than that, all tests pass:
> >> Tested-by: Joel Fernandes (Google) <[email protected]>
> >
> > ...or you still believe 6.4.7 is okay to release?
>
> As such, it should be Ok. However naturally I am not happy that the RCU testing
> is intermittently failing. These issues have been seen in last several 6.4 stable releases
> so since those were released, maybe this one can be too?
> The fix for stuttering is currently being reviewed.

Or, to look at it another way, the stuttering fix is specific to torture
testing. Would we really want to hold up a -stable release only because
rcutorture occasionally gives a false-positive failure on certain types
of systems?

Thanx, Paul

> Thanks,
>
> - Joel
>
>
> >
> > Best regards,
> > Pavel
> > --
> > DENX Software Engineering GmbH, Managing Director: Erika Unter
> > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

2023-07-27 14:59:22

by Guenter Roeck

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/27/23 07:06, Paul E. McKenney wrote:
> On Thu, Jul 27, 2023 at 09:26:52AM -0400, Joel Fernandes wrote:
>>
>>
>>> On Jul 27, 2023, at 7:35 AM, Pavel Machek <[email protected]> wrote:
>>>
>>> Hi!
>>>
>>>>> This is the start of the stable review cycle for the 6.4.7 release.
>>>>> There are 227 patches in this series, all will be posted as a response
>>>>> to this one. If anyone has any issues with these being applied, please
>>>>> let me know.
>>>>>
>>>>> Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
>>>>> Anything received after that time might be too late.
>>>>>
>>>>> The whole patch series can be found in one patch at:
>>>>> https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
>>>>> or in the git tree and branch at:
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
>>>>> and the diffstat can be found below.
>>>>
>>>> I saw this when running rcutorture, this one happened in the TREE04
>>>> configuration. This is likely due to the stuttering issues we are discussing
>>>> in the other thread. Anyway I am just making a note here while I am
>>>> continuing to look into it.
>>>
>>> So is the stuttering new in 6.4.7?
>>
>> No it is an old feature in RCU torture tests. But is dependent on timing. Something
>> changed in recent kernels that is making the issues with it more likely. Its hard to bisect as failure sometimes takes hours.
>>
>>>
>>>> Other than that, all tests pass:
>>>> Tested-by: Joel Fernandes (Google) <[email protected]>
>>>
>>> ...or you still believe 6.4.7 is okay to release?
>>
>> As such, it should be Ok. However naturally I am not happy that the RCU testing
>> is intermittently failing. These issues have been seen in last several 6.4 stable releases
>> so since those were released, maybe this one can be too?
>> The fix for stuttering is currently being reviewed.
>
> Or, to look at it another way, the stuttering fix is specific to torture
> testing. Would we really want to hold up a -stable release only because
> rcutorture occasionally gives a false-positive failure on certain types
> of systems?
>

No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
or long runtime.

[ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.797998] Running RCU synchronous self tests
[ 0.798209] Running RCU synchronous self tests
[ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
[ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[ 0.925419] Running RCU-tasks wait API self tests

(hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
I'd be much more concerned about that.

Guenter


2023-07-27 16:51:09

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Thu, Jul 27, 2023 at 07:39:54AM -0700, Guenter Roeck wrote:
> On 7/27/23 07:06, Paul E. McKenney wrote:
> > On Thu, Jul 27, 2023 at 09:26:52AM -0400, Joel Fernandes wrote:
> > >
> > >
> > > > On Jul 27, 2023, at 7:35 AM, Pavel Machek <[email protected]> wrote:
> > > >
> > > > Hi!
> > > >
> > > > > > This is the start of the stable review cycle for the 6.4.7 release.
> > > > > > There are 227 patches in this series, all will be posted as a response
> > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > let me know.
> > > > > >
> > > > > > Responses should be made by Thu, 27 Jul 2023 10:44:26 +0000.
> > > > > > Anything received after that time might be too late.
> > > > > >
> > > > > > The whole patch series can be found in one patch at:
> > > > > > https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.7-rc1.gz
> > > > > > or in the git tree and branch at:
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
> > > > > > and the diffstat can be found below.
> > > > >
> > > > > I saw this when running rcutorture, this one happened in the TREE04
> > > > > configuration. This is likely due to the stuttering issues we are discussing
> > > > > in the other thread. Anyway I am just making a note here while I am
> > > > > continuing to look into it.
> > > >
> > > > So is the stuttering new in 6.4.7?
> > >
> > > No it is an old feature in RCU torture tests. But is dependent on timing. Something
> > > changed in recent kernels that is making the issues with it more likely. Its hard to bisect as failure sometimes takes hours.
> > >
> > > >
> > > > > Other than that, all tests pass:
> > > > > Tested-by: Joel Fernandes (Google) <[email protected]>
> > > >
> > > > ...or you still believe 6.4.7 is okay to release?
> > >
> > > As such, it should be Ok. However naturally I am not happy that the RCU testing
> > > is intermittently failing. These issues have been seen in last several 6.4 stable releases
> > > so since those were released, maybe this one can be too?
> > > The fix for stuttering is currently being reviewed.
> >
> > Or, to look at it another way, the stuttering fix is specific to torture
> > testing. Would we really want to hold up a -stable release only because
> > rcutorture occasionally gives a false-positive failure on certain types
> > of systems?
> >
>
> No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
> or long runtime.
>
> [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> [ 0.797998] Running RCU synchronous self tests
> [ 0.798209] Running RCU synchronous self tests
> [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
> [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
> [ 0.925419] Running RCU-tasks wait API self tests
>
> (hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
> Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
> all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
> I'd be much more concerned about that.

First I have heard of this, so thank you for letting me know.

About what fraction of the time does this happen?

Thanx, Paul

2023-07-27 18:03:06

by Guenter Roeck

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/27/23 09:07, Paul E. McKenney wrote:

[ ... ]

>> No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
>> or long runtime.
>>
>> [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
>> [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
>> [ 0.797998] Running RCU synchronous self tests
>> [ 0.798209] Running RCU synchronous self tests
>> [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
>> [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
>> [ 0.925419] Running RCU-tasks wait API self tests
>>
>> (hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
>> Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
>> all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
>> I'd be much more concerned about that.
>
> First I have heard of this, so thank you for letting me know.
>
> About what fraction of the time does this happen?
>

Here is a sample test log from yesterday's -next. This is with x86_64.
Today's -next always crashes, so no data.

Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .................R........... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ............. passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ..................R.......... passed
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ...... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .................R.............. failed (silent)
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .................R................. passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ................... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ......... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running .................R... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ...... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running .....................R....... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running .................R.............. failed (silent)
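
For anyone who wants to turn a log like the one above into a failure fraction, here is a minimal sketch in Python. It assumes every result line has the shape "Building <config> ... running <progress> <result>" as shown above, and that the third colon-separated field of <config> is the CPU model. The name summarize() and the three-line sample are illustrative only and not part of any test tooling mentioned in this thread. Counting by hand, the log above has 3 silent failures in 35 runs.

```python
import re
from collections import Counter

# Matches one result line of the form shown in the log above:
#   Building <config> ... running <progress> <result>
# The progress field is treated as an opaque run of '.' and 'R' characters.
LINE_RE = re.compile(
    r"^Building (?P<config>\S+) \.\.\. running (?P<progress>[.R]+) "
    r"(?P<result>passed|failed.*)$"
)


def summarize(log_text):
    """Return (total, failed, failures_by_cpu) for the given log text."""
    total = 0
    failed = 0
    by_cpu = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a result line
        total += 1
        if m.group("result").startswith("failed"):
            failed += 1
            # Assumption: the third colon-separated field is the CPU model.
            by_cpu[m.group("config").split(":")[2]] += 1
    return total, failed, by_cpu


# Three lines copied from the log above, used only as sample input.
sample = """\
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ...... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .................R.............. failed (silent)
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
"""

total, failed, by_cpu = summarize(sample)
print(f"{failed}/{total} runs failed")
for cpu, count in by_cpu.most_common():
    print(f"  {cpu}: {count}")
```

Feeding several days of logs through the same function would give a rough per-CPU failure rate, which is probably more useful than a single run for a failure this intermittent.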

An earlier test run:

Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ........ passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running .......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ......... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ........ passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running .................R..... passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running .................R.............. failed (silent)
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ....... passed
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running .......... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ....... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ........ passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ....... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ....... passed

"R" means retry, and the dots reflect time expired. It looks like it happens most of the time,
but not always, on affected CPUs. I don't have specific data for non-Intel CPUs. I don't think
I see the problem there, but there is too much interference from other problems to be sure.
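
To put numbers on logs like the ones above, here is a minimal sketch in C. It assumes every result line has the form "Building <config> ... running <marks> <status>", where the marks consist only of '.' and 'R' characters. The names parse_run_line() and struct run_result are made up for illustration and do not come from any test script referenced in this thread.

```c
#include <assert.h>
#include <string.h>

/* Outcome of one "Building ... running ... <status>" log line. */
struct run_result {
	int dots;	/* number of '.' progress marks after "running" */
	int retries;	/* number of 'R' retry marks after "running" */
	int passed;	/* 1 if the status starts with "passed", else 0 */
};

/*
 * Parse one log line (hypothetical helper, not part of any real tool).
 * Returns 0 on success, -1 if the line has no " running " marker.
 */
int parse_run_line(const char *line, struct run_result *res)
{
	static const char marker[] = " running ";
	const char *p;

	assert(line != NULL && res != NULL);

	p = strstr(line, marker);
	if (!p)
		return -1;
	p += sizeof(marker) - 1;	/* skip past " running " */

	res->dots = 0;
	res->retries = 0;
	/* The progress marks are a run of '.' and 'R' characters. */
	for (; *p == '.' || *p == 'R'; p++) {
		if (*p == '.')
			res->dots++;
		else
			res->retries++;
	}

	while (*p == ' ')
		p++;
	res->passed = strncmp(p, "passed", 6) == 0;
	return 0;
}
```

Summing retries and failed lines over a whole log would give a rough per-run fraction of affected boots; a nonzero retries count marks a boot that needed at least one retry.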

For comparison, here is the result from the latest mainline:

Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .......... passed
Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ........... passed
Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running ........ passed
Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ....... passed
Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ............. passed
Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ...... passed
Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ......... passed
Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ......... passed
Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running ......... passed
Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ......... passed
Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ............ passed
Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running .......... passed
Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running .......... passed
Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ...... passed
Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ...... passed

Guenter


2023-07-27 20:51:12

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
> On 7/27/23 09:07, Paul E. McKenney wrote:
>
> [ ... ]
>
> > > No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
> > > or long runtime.
> > >
> > > [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> > > [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
> > > [ 0.797998] Running RCU synchronous self tests
> > > [ 0.798209] Running RCU synchronous self tests
> > > [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
> > > [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
> > > [ 0.925419] Running RCU-tasks wait API self tests
> > >
> > > (hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
> > > Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
> > > all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
> > > I'd be much more concerned about that.
> >
> > First I have heard of this, so thank you for letting me know.
> >
> > About what fraction of the time does this happen?
> >
>
> Here is a sample test log from yesterday's -next. This is with x86_64.
> Today's -next always crashes, so no data.
>
> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .................R........... passed
> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ............. passed
> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ..................R.......... passed
> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ...... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .................R.............. failed (silent)
> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .................R................. passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ................... passed
> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ......... passed
> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running .................R... passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ...... passed
> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running .....................R....... passed
> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running .................R.............. failed (silent)
>
> An earlier test run:
>
> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ........ passed
> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running .......... passed
> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ......... passed
> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ........ passed
> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running .................R..... passed
> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running .................R.............. failed (silent)
> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ....... passed
> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .......... passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running .......... passed
> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ....... passed
> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ....... passed
> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ........ passed
> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ....... passed
> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ....... passed
>
> "R" means retry, and the dots reflect time expired. It looks like it happens most of the time,
> but not always, on affected CPUs. I don't have specific data for non-Intel CPUs. I don't think
> I see the problem there, but there is too much interference from other problems to be sure.
>
> For comparison, here is the result from the latest mainline:
>
> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .......... passed
> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ........... passed
> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running ........ passed
> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ....... passed
> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ............. passed
> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ...... passed
> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ......... passed
> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ......... passed
> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ......... passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running ......... passed
> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ......... passed
> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ............ passed
> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running .......... passed
> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running .......... passed
> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ...... passed
> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ...... passed

I freely confess that I am having a hard time imagining what would
be CPU dependent in that code. Timing, maybe? Whatever the reason,
I am not seeing these failures in my testing.

So which of the following Kconfig options is defined in your .config?
CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
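
One way to check this mechanically is sketched below, assuming the .config uses the usual "CONFIG_FOO=y" form for enabled bool options. The helper name tasks_rcu_flavor_enabled() is hypothetical; running grep over .config would work just as well.

```c
#include <assert.h>
#include <string.h>

/* The three RCU Tasks flavors named in the question above. */
static const char *const tasks_rcu_opts[] = {
	"CONFIG_TASKS_RCU",
	"CONFIG_TASKS_RUDE_RCU",
	"CONFIG_TASKS_TRACE_RCU",
};

/*
 * Return the index (0..2) of the flavor that a single .config line
 * enables with "=y", or -1 if the line enables none of them.
 * Hypothetical helper for illustration only.
 */
int tasks_rcu_flavor_enabled(const char *line)
{
	size_t i;

	assert(line != NULL);

	for (i = 0; i < sizeof(tasks_rcu_opts) / sizeof(tasks_rcu_opts[0]); i++) {
		size_t n = strlen(tasks_rcu_opts[i]);

		/* Require the full symbol name followed directly by "=y". */
		if (strncmp(line, tasks_rcu_opts[i], n) == 0 &&
		    strncmp(line + n, "=y", 2) == 0)
			return (int)i;
	}
	return -1;
}
```

Calling this on each line of .config and counting the distinct indices returned tells whether more than one flavor is built in.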

If you have more than one of them, could you please apply this patch
and show me the corresponding console output from the resulting hang?

Thanx, Paul

------------------------------------------------------------------------

commit 709a917710dc01798e01750ea628ece4bfc42b7b
Author: Paul E. McKenney <[email protected]>
Date:   Thu Jul 27 13:13:46 2023 -0700

    rcu-tasks: Add printk()s to localize boot-time self-test hang

    Currently, rcu_tasks_initiate_self_tests() prints a message and then
    initiates self tests on up to three different RCU Tasks flavors. If one
    of the flavors has a grace-period hang, it is not easy to work out which
    of the three hung. This commit therefore prints a message prior to each
    individual test.

    Reported-by: Guenter Roeck <[email protected]>
    Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 56c470a489c8..427433c90935 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)

 static void rcu_tasks_initiate_self_tests(void)
 {
-	pr_info("Running RCU-tasks wait API self tests\n");
 #ifdef CONFIG_TASKS_RCU
+	pr_info("Running RCU Tasks wait API self tests\n");
 	tests[0].runstart = jiffies;
 	synchronize_rcu_tasks();
 	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
 #endif
 
 #ifdef CONFIG_TASKS_RUDE_RCU
+	pr_info("Running RCU Tasks Rude wait API self tests\n");
 	tests[1].runstart = jiffies;
 	synchronize_rcu_tasks_rude();
 	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
 #endif
 
 #ifdef CONFIG_TASKS_TRACE_RCU
+	pr_info("Running RCU Tasks Trace wait API self tests\n");
 	tests[2].runstart = jiffies;
 	synchronize_rcu_tasks_trace();
 	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);

2023-07-27 23:51:39

by Joel Fernandes

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review



> On Jul 27, 2023, at 4:33 PM, Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
>> On 7/27/23 09:07, Paul E. McKenney wrote:
>>
>> [ ... ]
>>
>>>> No. However, (unrelated) in linux-next, rcu tests sometimes result in apparent hangs
>>>> or long runtime.
>>>>
>>>> [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
>>>> [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
>>>> [ 0.797998] Running RCU synchronous self tests
>>>> [ 0.798209] Running RCU synchronous self tests
>>>> [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family: 0x15, model: 0x2, stepping: 0x0)
>>>> [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
>>>> [ 0.925419] Running RCU-tasks wait API self tests
>>>>
>>>> (hangs until aborted). This is primarily with Opteron CPUs, but also with others such as Haswell,
>>>> Icelake-Server, and pentium3. It is all but impossible to bisect because it doesn't happen
>>>> all the time. All I was able to figure out was that it has to do with rcu changes in linux-next.
>>>> I'd be much more concerned about that.
>>>
>>> First I have heard of this, so thank you for letting me know.
>>>
>>> About what fraction of the time does this happen?
>>>
>>
>> Here is a sample test log from yesterday's -next. This is with x86_64.
>> Today's -next always crashes, so no data.
>>
>> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
>> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
>> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
>> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
>> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
>> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
>> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
>> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
>> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .................R........... passed
>> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ............. passed
>> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ..................R.......... passed
>> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ...... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .................R.............. failed (silent)
>> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
>> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
>> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .................R................. passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ................... passed
>> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ......... passed
>> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running .................R... passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ...... passed
>> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
>> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
>> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
>> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
>> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
>> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running .....................R....... passed
>> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running .................R.............. failed (silent)
>>
>> An earlier test run:
>>
>> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
>> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .................R....... passed
>> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ........ passed
>> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running .......... passed
>> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ....... passed
>> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running .................R.... passed
>> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
>> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ......... passed
>> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ........ passed
>> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
>> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running .................R..... passed
>> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running .................R.............. failed (silent)
>> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ....... passed
>> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running .......... passed
>> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ........ passed
>> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ...... passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running .......... passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running .......... passed
>> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
>> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
>> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ....... passed
>> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ....... passed
>> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ........ passed
>> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running ......... passed
>> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running ....................R................. failed (silent)
>> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ....... passed
>> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ....... passed
>>
>> "R" means retry, and the dots reflect time expired. It looks like it happens most of the time,
>> but not always, on affected CPUs. I don't have specific data for non-Intel CPUs. I don't think
>> I see the problem there, but there is too much interference from other problems to be sure.
>>
>> For comparison, here is the result from the latest mainline:
>>
>> Building x86_64:q35:Broadwell-noTSX:defconfig:smp:net,e1000:mem256:ata:hd ... running ....... passed
>> Building x86_64:q35:Cascadelake-Server:defconfig:smp:net,e1000e:mem256:ata:cd ... running .......... passed
>> Building x86_64:q35:IvyBridge:defconfig:smp2:net,i82801:efi:mem512:nvme:hd ... running ...... passed
>> Building x86_64:q35:SandyBridge:defconfig:smp4:net,ne2k_pci:efi32:mem1G:usb:hd ... running ......... passed
>> Building x86_64:q35:SandyBridge:defconfig:smp8:net,ne2k_pci:mem1G:usb-hub:hd ... running ........... passed
>> Building x86_64:q35:Haswell:defconfig:smp:tpm-tis:net,pcnet:mem2G:usb-uas:hd ... running ........ passed
>> Building x86_64:q35:Skylake-Client:defconfig:smp2:tpm-tis:net,rtl8139:efi:mem4G:sdhci:mmc:hd ... running ....... passed
>> Building x86_64:q35:Conroe:defconfig:smp4:net,tulip:efi32:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:Denverton:defconfig:smp2:net,tulip:efi:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:EPYC-Milan:defconfig:smp:tpm-crb:net,tulip:mem256:scsi[DC395]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net:mem512:scsi[AM53C974]:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp:net,virtio-net-old:mem512:scsi[AM53C974]:hd ... running ....... passed
>> Building x86_64:q35:Westmere-IBRS:defconfig:smp2:tpm-crb:net,usb-ohci:efi:mem1G:scsi[53C810]:cd ... running .......... passed
>> Building x86_64:q35:Skylake-Server:defconfig:smp4:tpm-tis:net,e1000-82544gc:efi32:mem2G:scsi[53C895A]:hd ... running ....... passed
>> Building x86_64:pc:EPYC:defconfig:smp:pci-bridge:net,usb-uhci:mem4G:scsi[FUSION]:hd ... running ............. passed
>> Building x86_64:q35:EPYC-IBPB:defconfig:smp2:net,e1000-82545em:efi:mem8G:scsi[MEGASAS]:hd ... running ....... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:efi32:mem256:scsi[MEGASAS2]:hd ... running ....... passed
>> Building x86_64:q35:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ...... passed
>> Building x86_64:pc:Opteron_G5:defconfig:smp4:net,i82559c:mem256:scsi[MEGASAS2]:hd ... running ......... passed
>> Building x86_64:pc:phenom:defconfig:smp:net,i82559er:mem512:initrd ... running ......... passed
>> Building x86_64:q35:Opteron_G1:defconfig:smp2:net,i82562:efi:mem1G:initrd ... running ......... passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci]:hd ... running ......... passed
>> Building x86_64:pc:Opteron_G2:defconfig:smp:net,usb:efi32:mem2G:scsi[virtio-pci-old]:hd ... running ......... passed
>> Building x86_64:q35:core2duo:defconfig:smp2:net,i82559a:mem4G:virtio-pci:hd ... running ...... passed
>> Building x86_64:q35:Broadwell:defconfig:smp4:net,i82558b:efi:mem8G:virtio:hd ... running ....... passed
>> Building x86_64:q35:Nehalem:defconfig:smp2:net,i82558a:efi32:mem1G:virtio:hd ... running ...... passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ............ passed
>> Building x86_64:q35:Icelake-Server:defconfig:preempt:smp8:net,i82557a:mem4G:nvme:hd ... running ....... passed
>> Building x86_64:q35:Skylake-Client-IBRS:defconfig:preempt:smp2:net,i82558b:efi32:mem1G:sdhci:mmc:hd ... running ...... passed
>> Building x86_64:q35:KnightsMill:defconfig:preempt:smp6:net,i82550:mem512:initrd ... running ...... passed
>> Building x86_64:q35:Cooperlake:defconfig:smp2:net,usb-ohci:efi:mem1G:scsi[53C810]:hd ... running ....... passed
>> Building x86_64:q35:EPYC-Rome:defconfig:smp4:net,igb:mem2G:scsi[53C895A]:hd ... running .......... passed
>> Building x86_64:pc:Opteron_G3:defconfig:nosmp:net,e1000:mem1G:usb:hd ... running .......... passed
>> Building x86_64:q35:Opteron_G4:defconfig:nosmp:net,ne2k_pci:efi:mem512:ata:hd ... running ...... passed
>> Building x86_64:q35:Haswell-noTSX-IBRS:defconfig:nosmp:net,pcnet:efi32:mem2G:ata:hd ... running ...... passed
>
> I freely confess that I am having a hard time imagining what would
> be CPU dependent in that code. Timing, maybe? Whatever the reason,
> I am not seeing these failures in my testing.
>
> So which of the following Kconfig options is defined in your .config?
> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>
> If you have more than one of them, could you please apply this patch
> and show me the corresponding console output from the resulting hang?

FWIW, I am not able to repro this issue either. If the .config from the affected system can be shared, I can try to reproduce it on my side.

Cheers,

- Joel


>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 709a917710dc01798e01750ea628ece4bfc42b7b
> Author: Paul E. McKenney <[email protected]>
> Date: Thu Jul 27 13:13:46 2023 -0700
>
> rcu-tasks: Add printk()s to localize boot-time self-test hang
>
> Currently, rcu_tasks_initiate_self_tests() prints a message and then
> initiates self tests on up to three different RCU Tasks flavors. If one
> of the flavors has a grace-period hang, it is not easy to work out which
> of the three hung. This commit therefore prints a message prior to each
> individual test.
>
> Reported-by: Guenter Roeck <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 56c470a489c8..427433c90935 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
>
>  static void rcu_tasks_initiate_self_tests(void)
>  {
> -	pr_info("Running RCU-tasks wait API self tests\n");
>  #ifdef CONFIG_TASKS_RCU
> +	pr_info("Running RCU Tasks wait API self tests\n");
>  	tests[0].runstart = jiffies;
>  	synchronize_rcu_tasks();
>  	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
>  #endif
>
>  #ifdef CONFIG_TASKS_RUDE_RCU
> +	pr_info("Running RCU Tasks Rude wait API self tests\n");
>  	tests[1].runstart = jiffies;
>  	synchronize_rcu_tasks_rude();
>  	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
>  #endif
>
>  #ifdef CONFIG_TASKS_TRACE_RCU
> +	pr_info("Running RCU Tasks Trace wait API self tests\n");
>  	tests[2].runstart = jiffies;
>  	synchronize_rcu_tasks_trace();
>  	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);

2023-07-28 04:50:25

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/27/23 13:33, Paul E. McKenney wrote:
[ ... ]

> So which of the following Kconfig options is defined in your .config?
> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>

Only CONFIG_TASKS_RCU. I added another log message after call_rcu_tasks().
It never returns from that function.

[ 1.168993] Running RCU synchronous self tests
[ 1.169219] Running RCU synchronous self tests
[ 1.285795] smpboot: CPU0: Intel Xeon Processor (Cascadelake) (family: 0x6, model: 0x55, stepping: 0x6)
[ 1.302827] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[ 1.304526] Running RCU Tasks wait API self tests

... and then nothing for at least 10 minutes (then I gave up and stopped the test).
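
Giving up by hand after ten minutes can be automated. The following is only a sketch, not part of the test setup described in this thread: `count_hangs` is a hypothetical helper, and it assumes that a successful boot makes the command exit on its own (for example because the guest powers off), so that only a hang runs into the timeout.

```python
import subprocess
import sys


def count_hangs(cmd, attempts, timeout_s):
    """Run cmd `attempts` times; return how many runs exceeded timeout_s."""
    hangs = 0
    for _ in range(attempts):
        try:
            # subprocess.run() kills the child and raises TimeoutExpired
            # once the time limit is reached.
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Stand-in command that exits at once; replace it with the real
    # emulator invocation, given as an argument list.
    quick = [sys.executable, "-c", "pass"]
    print(count_hangs(quick, attempts=3, timeout_s=30))
```

Calling it with `timeout_s=600` would mirror the ten-minute cutoff described above.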

Qemu command line:

qemu-system-x86_64 -kernel \
arch/x86/boot/bzImage -M q35 -cpu Cascadelake-Server -no-reboot \
-snapshot -device e1000e,netdev=net0 -netdev user,id=net0 -m 256 \
-drive file=rootfs.iso,format=raw,if=ide,media=cdrom \
--append "earlycon=uart8250,io,0x3f8,9600n8 panic=-1 slub_debug=FZPUA root=/dev/sr0 rootwait console=ttyS0 noreboot" \
-d unimp,guest_errors -nographic -monitor none

Again, this doesn't happen all the time. With Cascadelake-Server
I see it maybe once every 5 boot attempts. I tried with qemu v8.0
and v8.1. Note that it does seem to happen with various CPU types;
it only seems to be more likely with some of them (so maybe the
CPU type was a red herring). It does seem to depend on the system
load, and happens more often if the system is under heavy load.
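
To put "once every 5 boot attempts" into numbers: if each boot hangs independently with probability p, then n boots in a row come up clean by pure chance with probability (1 - p)^n. The sketch below solves that for n. Independence is an assumption made for illustration (the dependence on system load suggests it only holds roughly), and `clean_boots_needed` is a hypothetical helper name.

```python
import math


def clean_boots_needed(hang_rate, max_false_pass):
    """Smallest n such that (1 - hang_rate)**n <= max_false_pass."""
    if not 0.0 < hang_rate < 1.0:
        raise ValueError("hang_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_pass < 1.0:
        raise ValueError("max_false_pass must be strictly between 0 and 1")
    # (1 - p)**n <= q  <=>  n >= log(q) / log(1 - p), both logs negative.
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    print(clean_boots_needed(0.2, 0.05))  # 14
    print(clean_boots_needed(0.2, 0.01))  # 21
```

With p = 0.2, a kernel has to boot cleanly 14 times in a row before a chance pass drops below 5%, and 21 times for 1%, which is part of why bisecting an intermittent hang like this is so tedious.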

Guenter


2023-07-28 23:24:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

> On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
>
> On Jul 27, 2023, at 7:18 PM, Joel Fernandes <[email protected]>
> wrote:
>
> On Jul 27, 2023, at 4:33 PM, Paul E. McKenney <[email protected]>
> wrote:
>
> On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
>
> On 7/27/23 09:07, Paul E. McKenney wrote:
>
> [ ... ]
>
> No. However, (unrelated) in linux-next, rcu tests sometimes result
> in apparent hangs
>
> or long runtime.
>
> [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096
> bytes, linear)
>
> [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0,
> 4096 bytes, linear)
>
> [ 0.797998] Running RCU synchronous self tests
>
> [ 0.798209] Running RCU synchronous self tests
>
> [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family:
> 0x15, model: 0x2, stepping: 0x0)
>
> [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1
> rcu_task_cb_adjust=1.
>
> [ 0.925419] Running RCU-tasks wait API self tests
>
> (hangs until aborted). This is primarily with Opteron CPUs, but also
> with others such as Haswell,
>
> Icelake-Server, and pentium3. It is all but impossible to bisect
> because it doesn't happen
>
> all the time. All I was able to figure out was that it has to do
> with rcu changes in linux-next.
>
> I'd be much more concerned about that.
>
> First I have heard of this, so thank you for letting me know.
>
> About what fraction of the time does this happen?
>
> Here is a sample test log from yesterday's -next. This is with
> x86_64.
>
> Today's -next always crashes, so no data.
>
> [...]
>
> An earlier test run:
>
> [...]
>
> "R" means retry, and the dots reflect time expired. It looks like it
> happens most of the time,
>
> but not always, on affected CPUs. I don't have specific data for
> non-Intel CPUs. I don't think
>
> I see the problem there, but there is too much interference from
> other problems to be sure.
>
> For comparison, here is the result from the latest mainline:
>
> [...]
>
> I freely confess that I am having a hard time imagining what would
>
> be CPU dependent in that code. Timing, maybe? Whatever the reason,
>
> I am not seeing these failures in my testing.
>
> So which of the following Kconfig options is defined in your
> .config?
>
> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>
> If you have more than one of them, could you please apply this patch
>
> and show me the corresponding console output from the resulting
> hang?
>
> FWIW, I am not able to repro this issue either. If a .config can be
> shared of the problem system, I can try it out to see if it can be
> reproduced on my side.
>
> I do see this now on 5.15 stable:
>
> TASKS03 ------- 3089 GPs (0.858056/s)
> QEMU killed
> TASKS03 no success message, 64 successful version messages
> !!! PID 3309783 hung at 3781 vs. 3600 seconds
>
> I have not looked too closely yet. The full test artifacts are here:
>
> [1]Artifacts of linux-5.15.y 5.15.123 :
> /tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44 [Jenkins]
>
> Thanks,
>
> - Joel
>
> (Apologies if the email is html, I am sending from phone).

Heh. I have a script that runs lynx. Which isn't perfect, but usually
makes things at least somewhat legible.

This looks like the prototypical hard hang with interrupts disabled,
which could be anywhere in the kernel, including RCU. I am not seeing
this, but the usual cause when I have seen it in the past was deadlock
of irq-disabled locks. In one spectacular case, it was a timekeeping
failure that messed up a CPU-hotplug operation.

If this is reproducible, one trick would be to have a script look at
the console.log file, and have it do something (NMI? sysrq? something
else?) to qemu if output ceased for too long.

One way to do this without messing with the rcutorture scripting is to
grab the qemu-cmd file from this run, and then invoke that file from your
own script, possibly with suitable modifications to qemu's parameters.
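
A minimal sketch of the watching half of such a script, under stated assumptions: the console output is written to a file whose size only grows, "output ceased" means the size stayed unchanged for `stall_s` seconds, and `wait_for_stall` is a hypothetical helper name. What to do to qemu once it returns True (NMI, sysrq, something else) is deliberately left to the caller.

```python
import os
import time


def wait_for_stall(log_path, stall_s, poll_s=1.0, deadline_s=None):
    """Return True once log_path has not grown for stall_s seconds.

    Returns False if deadline_s seconds pass first (None means no limit).
    """
    start = time.monotonic()
    last_size = -1
    last_change = start
    while True:
        try:
            size = os.path.getsize(log_path)
        except FileNotFoundError:
            size = 0  # a log that does not exist yet counts as no output
        now = time.monotonic()
        if size != last_size:
            # New output arrived (or first sample): restart the stall clock.
            last_size = size
            last_change = now
        elif now - last_change >= stall_s:
            return True
        if deadline_s is not None and now - start >= deadline_s:
            return False
        time.sleep(poll_s)
```

A wrapper script could start qemu in the background, call `wait_for_stall("console.log", stall_s=120)`, and then run whatever diagnostic action is chosen; the 120-second threshold is an arbitrary example value.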

Thoughts?

Thanx, Paul

> Cheers,
> - Joel
>
> Thanx, Paul
>
>References
>
> Visible links:
> 1. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> 2. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> 3. http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
>

2023-07-29 06:44:13

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Fri, Jul 28, 2023 at 6:58 PM Paul E. McKenney <[email protected]> wrote:
>
> > On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
> >
> > On Jul 27, 2023, at 7:18 PM, Joel Fernandes <[email protected]>
> > wrote:
> >
> > On Jul 27, 2023, at 4:33 PM, Paul E. McKenney <[email protected]>
> > wrote:
> >
> > On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
> >
> > On 7/27/23 09:07, Paul E. McKenney wrote:
> >
> > [ ... ]
> >
> > No. However, (unrelated) in linux-next, rcu tests sometimes result
> > in apparent hangs
> >
> > or long runtime.
> >
> > [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096
> > bytes, linear)
> >
> > [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0,
> > 4096 bytes, linear)
> >
> > [ 0.797998] Running RCU synchronous self tests
> >
> > [ 0.798209] Running RCU synchronous self tests
> >
> > [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family:
> > 0x15, model: 0x2, stepping: 0x0)
> >
> > [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1
> > rcu_task_cb_adjust=1.
> >
> > [ 0.925419] Running RCU-tasks wait API self tests
> >
> > (hangs until aborted). This is primarily with Opteron CPUs, but also
> > with others such as Haswell,
[...]
> > Building
> > x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:me
> > m2G:virtio:cd ... running ......... passed
[...]
> > I freely confess that I am having a hard time imagining what would
> >
> > be CPU dependent in that code. Timing, maybe? Whatever the reason,
> >
> > I am not seeing these failures in my testing.
> >
> > So which of the following Kconfig options is defined in your
> > .config?
> >
> > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> >
> > If you have more than one of them, could you please apply this patch
> >
> > and show me the corresponding console output from the resulting
> > hang?
> >
> > FWIW, I am not able to repro this issue either. If a .config can be
> > shared of the problem system, I can try it out to see if it can be
> > reproduced on my side.
> >
> > I do see this now on 5.15 stable:
> >
> > TASKS03 ------- 3089 GPs (0.858056/s)
> > QEMU killed
> > TASKS03 no success message, 64 successful version messages
> > !!! PID 3309783 hung at 3781 vs. 3600 seconds
> >
> > I have not looked too closely yet. The full test artifacts are here:
> >
> > Artifacts of linux-5.15.y 5.15.123:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> >
> > Thanks,
> >
> > - Joel
> >
> > (Apologies if the email is html, I am sending from phone).
>
> Heh. I have a script that runs lynx. Which isn't perfect, but usually
> makes things at least somewhat legible.

Sorry I was too optimistic about the iPhone's capabilities when it
came to mailing list emails.
Here's what I said:
--------------
I do see this now on 5.15 stable:

TASKS03 ------- 3089 GPs (0.858056/s)
QEMU killed
TASKS03 no success message, 64 successful version messages
!!! PID 3309783 hung at 3781 vs. 3600 seconds

Link to full logs/artifacts:
http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
----------------

> This looks like the prototypical hard hang with interrupts disabled,
> which could be anywhere in the kernel, including RCU. I am not seeing
> this, but the usual cause when I have seen it in the past was deadlock
> of irq-disabled locks. In one spectacular case, it was a timekeeping
> failure that messed up a CPU-hotplug operation.
>
> If this is reproducible, one trick would be to have a script look at
> the console.log file, and have it do something (NMI? sysrq? something
> else?) to qemu if output ceased for too long.
>
> One way to do this without messing with the rcutorture scripting is to
> grab the qemu-cmd file from this run, and then invoke that file from your
> own script, possibly with suitable modifications to qemu's parameters.

Would it be better to have such monitoring as part of rcutorture
testing itself? Alternatively there is the NMI hardlockup detector
which I believe should also detect such cases and dump stacks.

thanks,

- Joel

>
> Thoughts?
>
> Thanx, Paul
>
> > Cheers,
> > - Joel
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 709a917710dc01798e01750ea628ece4bfc42b7b
> > Author: Paul E. McKenney <[email protected]>
> > Date: Thu Jul 27 13:13:46 2023 -0700
> >
> >     rcu-tasks: Add printk()s to localize boot-time self-test hang
> >
> >     Currently, rcu_tasks_initiate_self_tests() prints a message and then
> >     initiates self tests on up to three different RCU Tasks flavors. If one
> >     of the flavors has a grace-period hang, it is not easy to work out which
> >     of the three hung. This commit therefore prints a message prior to each
> >     individual test.
> >
> >     Reported-by: Guenter Roeck <[email protected]>
> >     Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index 56c470a489c8..427433c90935 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
> >  static void rcu_tasks_initiate_self_tests(void)
> >  {
> > -	pr_info("Running RCU-tasks wait API self tests\n");
> >  #ifdef CONFIG_TASKS_RCU
> > +	pr_info("Running RCU Tasks wait API self tests\n");
> >  	tests[0].runstart = jiffies;
> >  	synchronize_rcu_tasks();
> >  	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
> >  #endif
> >  #ifdef CONFIG_TASKS_RUDE_RCU
> > +	pr_info("Running RCU Tasks Rude wait API self tests\n");
> >  	tests[1].runstart = jiffies;
> >  	synchronize_rcu_tasks_rude();
> >  	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
> >  #endif
> >  #ifdef CONFIG_TASKS_TRACE_RCU
> > +	pr_info("Running RCU Tasks Trace wait API self tests\n");
> >  	tests[2].runstart = jiffies;
> >  	synchronize_rcu_tasks_trace();
> >  	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
> >

2023-07-29 07:55:41

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Fri, Jul 28, 2023 at 09:25:35PM -0400, Joel Fernandes wrote:
> On Fri, Jul 28, 2023 at 6:58 PM Paul E. McKenney <[email protected]> wrote:
> >
> > > On Fri, Jul 28, 2023 at 05:17:59PM -0400, Joel Fernandes wrote:
> > >
> > > On Jul 27, 2023, at 7:18 PM, Joel Fernandes <[email protected]>
> > > wrote:
> > >
> > >
> > >
> > > On Jul 27, 2023, at 4:33 PM, Paul E. McKenney <[email protected]>
> > > wrote:
> > >
> > > On Thu, Jul 27, 2023 at 10:39:17AM -0700, Guenter Roeck wrote:
> > >
> > > On 7/27/23 09:07, Paul E. McKenney wrote:
> > >
> > > [ ... ]
> > >
> > > No. However, (unrelated) in linux-next, rcu tests sometimes result
> > > in apparent hangs
> > >
> > > or long runtime.
> > >
> > > [ 0.778841] Mount-cache hash table entries: 512 (order: 0, 4096
> > > bytes, linear)
> > >
> > > [ 0.779011] Mountpoint-cache hash table entries: 512 (order: 0,
> > > 4096 bytes, linear)
> > >
> > > [ 0.797998] Running RCU synchronous self tests
> > >
> > > [ 0.798209] Running RCU synchronous self tests
> > >
> > > [ 0.912368] smpboot: CPU0: AMD Opteron 63xx class CPU (family:
> > > 0x15, model: 0x2, stepping: 0x0)
> > >
> > > [ 0.923398] RCU Tasks: Setting shift to 2 and lim to 1
> > > rcu_task_cb_adjust=1.
> > >
> > > [ 0.925419] Running RCU-tasks wait API self tests
> > >
> > > (hangs until aborted). This is primarily with Opteron CPUs, but also
> > > with others such as Haswell,
> [...]
> > > Building
> > > x86_64:q35:Icelake-Server:defconfig:preempt:smp4:net,ne2k_pci:efi:mem2G:virtio:cd ... running ......... passed
> [...]
> > > I freely confess that I am having a hard time imagining what would
> > >
> > > be CPU dependent in that code. Timing, maybe? Whatever the reason,
> > >
> > > I am not seeing these failures in my testing.
> > >
> > > So which of the following Kconfig options is defined in your
> > > .config?
> > >
> > > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> > >
> > > If you have more than one of them, could you please apply this patch
> > >
> > > and show me the corresponding console output from the resulting
> > > hang?
> > >
> > > FWIW, I am not able to repro this issue either. If a .config can be
> > > shared of the problem system, I can try it out to see if it can be
> > > reproduced on my side.
> > >
> > > I do see this now on 5.15 stable:
> > >
> > >TASKS03 ------- 3089 GPs (0.858056/s)
> > >QEMU killed
> > >TASKS03 no success message, 64 successful version messages
> > >!!! PID 3309783 hung at 3781 vs. 3600 seconds
> > >
> > > I have not looked too closely yet. The full test artifacts are here:
> > >
> > > Artifacts of linux-5.15.y 5.15.123:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> > >
> > > Thanks,
> > >
> > > - Joel
> > >
> > > (Apologies if the email is html, I am sending from phone).
> >
> > Heh. I have a script that runs lynx. Which isn't perfect, but usually
> > makes things at least somewhat legible.
>
> Sorry I was too optimistic about the iPhone's capabilities when it
> came to mailing list emails.
> Here's what I said:
> --------------
> I do see this now on 5.15 stable:
>
> TASKS03 ------- 3089 GPs (0.858056/s)
> QEMU killed
> TASKS03 no success message, 64 successful version messages
> !!! PID 3309783 hung at 3781 vs. 3600 seconds
>
> Link to full logs/artifacts:
> http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-5.15.y/lastFailedBuild/artifact/tools/testing/selftests/rcutorture/res/2023.07.28-04.00.44/
> ----------------
>
> > This looks like the prototypical hard hang with interrupts disabled,
> > which could be anywhere in the kernel, including RCU. I am not seeing
> > this, but the usual cause when I have seen it in the past was deadlock
> > of irq-disabled locks. In one spectacular case, it was a timekeeping
> > failure that messed up a CPU-hotplug operation.
> >
> > If this is reproducible, one trick would be to have a script look at
> > the console.log file, and have it do something (NMI? sysrq? something
> > else?) to qemu if output ceased for too long.
> >
> > One way to do this without messing with the rcutorture scripting is to
> > grab the qemu-cmd file from this run, and then invoke that file from your
> > own script, possibly with suitable modifications to qemu's parameters.
>
> Would it be better to have such monitoring as part of rcutorture
> testing itself? Alternatively there is the NMI hardlockup detector
> which I believe should also detect such cases and dump stacks.

Quite possibly. But special-casing the prototype is probably going to be
a lot faster and easier. If it works, then it might make a lot of sense
to upgrade the scripting. If it doesn't work, then quite a bit less time
is wasted than would be by messing with the scripting from the get-go.

Also, you have the option of making qemu be interactive and manually
triggering things, for example by checking up on the run near the end.
Or having something handing commands to qemu.

Either way allows much more interaction with qemu, and better
experimentation, than could be done reasonably with the scripts.
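
A minimal sketch of the console-watching prototype discussed above, written
in Python rather than as a change to the rcutorture scripting. It assumes the
guest console is being written to a file (console.log in an rcutorture run)
and approximates "output ceased for too long" by the file's modification
time. The function names and the on_stall callback are hypothetical; what the
callback actually does to qemu (NMI, sysrq, something else) is left open, as
it is in the discussion.

```python
import os
import time


def console_stalled(log_path, stall_limit_s, now=None):
    """Return True if log_path has not been modified for stall_limit_s seconds."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path) > stall_limit_s


def watch_console(log_path, stall_limit_s, on_stall, poll_s=1.0, max_polls=None):
    """Poll log_path and call on_stall() once when its output goes quiet.

    Returns True if a stall was detected, or False if max_polls checks
    completed without one. With max_polls=None it polls until a stall.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if console_stalled(log_path, stall_limit_s):
            on_stall()  # hypothetical hook: poke qemu, dump state, etc.
            return True
        time.sleep(poll_s)
        polls += 1
    return False
```

A caller would start qemu from the saved qemu-cmd file and then run
watch_console() against the console file with a stall limit comfortably
longer than the longest expected quiet period of a healthy run.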

Thanx, Paul

> thanks,
>
> - Joel
>
> >
> > Thoughts?
> >
> > Thanx, Paul
> >
> > > Cheers,
> > > - Joel
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit 709a917710dc01798e01750ea628ece4bfc42b7b
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Thu Jul 27 13:13:46 2023 -0700
> > >
> > >     rcu-tasks: Add printk()s to localize boot-time self-test hang
> > >
> > >     Currently, rcu_tasks_initiate_self_tests() prints a message and then
> > >     initiates self tests on up to three different RCU Tasks flavors. If one
> > >     of the flavors has a grace-period hang, it is not easy to work out which
> > >     of the three hung. This commit therefore prints a message prior to each
> > >     individual test.
> > >
> > >     Reported-by: Guenter Roeck <[email protected]>
> > >     Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 56c470a489c8..427433c90935 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -1981,20 +1981,22 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
> > >  static void rcu_tasks_initiate_self_tests(void)
> > >  {
> > > -	pr_info("Running RCU-tasks wait API self tests\n");
> > >  #ifdef CONFIG_TASKS_RCU
> > > +	pr_info("Running RCU Tasks wait API self tests\n");
> > >  	tests[0].runstart = jiffies;
> > >  	synchronize_rcu_tasks();
> > >  	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
> > >  #endif
> > >  #ifdef CONFIG_TASKS_RUDE_RCU
> > > +	pr_info("Running RCU Tasks Rude wait API self tests\n");
> > >  	tests[1].runstart = jiffies;
> > >  	synchronize_rcu_tasks_rude();
> > >  	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
> > >  #endif
> > >  #ifdef CONFIG_TASKS_TRACE_RCU
> > > +	pr_info("Running RCU Tasks Trace wait API self tests\n");
> > >  	tests[2].runstart = jiffies;
> > >  	synchronize_rcu_tasks_trace();
> > >  	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
> > >

2023-07-30 08:05:56

by Guenter Roeck

Subject: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/27/23 16:18, Joel Fernandes wrote:

[ ... ]

>> I freely confess that I am having a hard time imagining what would
>> be CPU dependent in that code. Timing, maybe? Whatever the reason,
>> I am not seeing these failures in my testing.
>>
>> So which of the following Kconfig options is defined in your .config?
>> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>>
>> If you have more than one of them, could you please apply this patch
>> and show me the corresponding console output from the resulting hang?
>
> FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
>

I managed to bisect the problem. See bisect log below. Bisect repeated twice,
so it should be reliable. I don't really understand it, but the following
reverts fix the problem. This is on top of next-20230721 because next-20230728
crashes immediately in my tests.

0caafe9b94ab (HEAD) Revert "sched/fair: Remove sched_feat(START_DEBIT)"
518bdbd39fdb Revert "sched/fair: Add lag based placement"
a011162c3e32 Revert "sched/fair: Implement an EEVDF-like scheduling policy"
df579720bf98 Revert "sched/fair: Commit to lag based placement"
aac459a7e738 Revert "sched/smp: Use lag to simplify cross-runqueue placement"
8d686eb173e1 Revert "sched/fair: Commit to EEVDF"
486474c50f95 Revert "sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice"
79e94d67d08a Revert "sched/fair: Propagate enqueue flags into place_entity()"
ae867bc97b71 (tag: next-20230721) Add linux-next specific files for 20230721

For context: x86 images (32 and 64 bit) in -next tend to hang at

[ 2.309323] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[ 2.311634] Running RCU-tasks wait API self tests

The hang is not seen with every boot; it happens roughly once every
10 boot attempts. It is not CPU dependent as I initially thought.
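
As a back-of-the-envelope check on what an intermittent hang means for
testing, the sketch below estimates how many clean boots are needed before
a kernel can reasonably be called good. It assumes each boot of an affected
kernel hangs independently with a fixed probability (0.1 for the "once
every 10 boot attempts" figure above); that independence is an assumption,
not something measured in this thread, and the function names are
hypothetical.

```python
import math


def miss_probability(hang_rate, boots):
    """Chance that an affected kernel survives `boots` boots without hanging."""
    return (1.0 - hang_rate) ** boots


def boots_needed(hang_rate, max_miss_prob):
    """Smallest number of clean boots keeping the miss chance <= max_miss_prob."""
    return math.ceil(math.log(max_miss_prob) / math.log(1.0 - hang_rate))


# With a 1-in-10 hang rate, 10 clean boots still miss the bug about 35% of
# the time; 44 clean boots bring that below 1%.
print(round(miss_probability(0.1, 10), 3))  # 0.349
print(boots_needed(0.1, 0.01))              # 44
```

Under the same assumption, a 1-in-5 hang rate needs 21 clean boots for the
same 1% bound.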

Configuration file is at http://server.roeck-us.net/qemu/x86-next/config.
Example qemu command line:

qemu-system-x86_64 -kernel arch/x86/boot/bzImage -M q35 -cpu Broadwell-noTSX -no-reboot \
-snapshot -device e1000,netdev=net0 -netdev user,id=net0 -m 256 \
-drive file=rootfs.ext2,format=raw,if=ide \
--append "earlycon=uart8250,io,0x3f8,9600n8 root=/dev/sda console=ttyS0" \
-nographic -monitor none

Guenter

---
# bad: [ae867bc97b713121b2a7f5fcac68378a0774739b] Add linux-next specific files for 20230721
# good: [fdf0eaf11452d72945af31804e2a1048ee1b574c] Linux 6.5-rc2
git bisect start 'HEAD' 'v6.5-rc2'
# good: [f09bf8f6c8cbbff6f52523abcda88c86db72e31c] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
git bisect good f09bf8f6c8cbbff6f52523abcda88c86db72e31c
# good: [86374a6210aeebceb927204d80f9e65739134bc3] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
git bisect good 86374a6210aeebceb927204d80f9e65739134bc3
# bad: [d588c93cae9e3dff15d125e755edcba5d842f41a] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect bad d588c93cae9e3dff15d125e755edcba5d842f41a
# good: [acadcaf8c67062ad4c1a0ad0e05bf429b04740c5] Merge branch 'for-next' of git://git.kernel.dk/linux-block.git
git bisect good acadcaf8c67062ad4c1a0ad0e05bf429b04740c5
# good: [2c73542f4cdc59fd23514f9e963d0b3419bd5e16] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git
git bisect good 2c73542f4cdc59fd23514f9e963d0b3419bd5e16
# good: [be15b91155cd5a6c4ac8f46740ae62e610981b79] Merge remote-tracking branch 'spi/for-6.6' into spi-next
git bisect good be15b91155cd5a6c4ac8f46740ae62e610981b79
# bad: [8f4995b370a57e7ad92c0f66664d171b23234337] Merge branch into tip/master: 'sched/eevdf'
git bisect bad 8f4995b370a57e7ad92c0f66664d171b23234337
# bad: [99d4d26551b56f4e523dd04e4970b94aa796a64e] rbtree: Add rb_add_augmented_cached() helper
git bisect bad 99d4d26551b56f4e523dd04e4970b94aa796a64e
# good: [7ff1693236f5d97a939dbeb660c07671a2d57071] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
git bisect good 7ff1693236f5d97a939dbeb660c07671a2d57071
# good: [48b5583719cdfbdee238f9549a6a1a47af2b0469] sched/headers: Rename task_struct::state to task_struct::__state in the comments too
git bisect good 48b5583719cdfbdee238f9549a6a1a47af2b0469
# good: [af4cf40470c22efa3987200fd19478199e08e103] sched/fair: Add cfs_rq::avg_vruntime
git bisect good af4cf40470c22efa3987200fd19478199e08e103
# bad: [86bfbb7ce4f67a88df2639198169b685668e7349] sched/fair: Add lag based placement
git bisect bad 86bfbb7ce4f67a88df2639198169b685668e7349
# bad: [e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13] sched/fair: Remove sched_feat(START_DEBIT)
git bisect bad e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13
# first bad commit: [e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13] sched/fair: Remove sched_feat(START_DEBIT)
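
For an intermittent hang like this one, a bisection step has to boot each
candidate kernel several times before calling it good. The sketch below
shows one way such a step could be automated for `git bisect run`, which
treats exit code 0 as good and 1 as bad. It is an illustration only, not the
procedure used for the log above: the function names are hypothetical, a
"hang" is taken to mean that the boot command did not exit within a timeout,
and the boot command is assumed to exit on its own after a successful boot.

```python
import subprocess
import sys


def boot_hangs(cmd, timeout_s):
    """Run one boot attempt; True means it did not finish within timeout_s."""
    try:
        # subprocess.run() kills the child when the timeout expires.
        subprocess.run(cmd, timeout=timeout_s, stdin=subprocess.DEVNULL,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                       check=False)
    except subprocess.TimeoutExpired:
        return True
    return False


def bisect_exit_code(cmd, attempts, timeout_s):
    """Return 1 (bad) on the first hang, 0 (good) if every attempt finishes."""
    for attempt in range(1, attempts + 1):
        if boot_hangs(cmd, timeout_s):
            print(f"attempt {attempt}: hang, marking bad", file=sys.stderr)
            return 1
    return 0
```

A wrapper script would build the kernel, call bisect_exit_code() with the
qemu command line, and pass the result to sys.exit(). This sketch treats any
boot that exits, even with an error, as not hung; a real script would
probably want to exit with 125 (skip) when the kernel fails to build or
crashes for an unrelated reason.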


2023-07-31 04:12:37

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Sun, Jul 30, 2023 at 08:54:46PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 27, 2023 at 09:22:52PM -0700, Guenter Roeck wrote:
> > On 7/27/23 13:33, Paul E. McKenney wrote:
> > [ ... ]
> >
> > > So which of the following Kconfig options is defined in your .config?
> > > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> > >
> >
> > Only CONFIG_TASKS_RCU. I added another log message after call_rcu_tasks().
> > It never returns from that function.
> >
> > [ 1.168993] Running RCU synchronous self tests
> > [ 1.169219] Running RCU synchronous self tests
> > [ 1.285795] smpboot: CPU0: Intel Xeon Processor (Cascadelake) (family: 0x6, model: 0x55, stepping: 0x6)
> > [ 1.302827] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
> > [ 1.304526] Running RCU Tasks wait API self tests
> >
> > ... and then nothing for at least 10 minutes (then I gave up and stopped the test).
> >
> > Qemu command line:
> >
> > qemu-system-x86_64 -kernel \
> > arch/x86/boot/bzImage -M q35 -cpu Cascadelake-Server -no-reboot \
> > -snapshot -device e1000e,netdev=net0 -netdev user,id=net0 -m 256 \
> > -drive file=rootfs.iso,format=raw,if=ide,media=cdrom \
> > --append "earlycon=uart8250,io,0x3f8,9600n8 panic=-1 slub_debug=FZPUA root=/dev/sr0 rootwait console=ttyS0 noreboot" \
> > -d unimp,guest_errors -nographic -monitor none
> >
> > Again, this doesn't happen all the time. With Cascadelake-Server
> > I see it maybe once every 5 boot attempts. I tried with qemu v8.0
> > and v8.1. Note that it does seem to happen with various CPU types,
> > only for some it seems to me more likely to happen (so maybe the
> > CPU type was a red herring). It does seem to depend on the system
> > load, and happen more often if the system is under heavy load.
>
> Hmmm... What kernel are you using as your qemu/KVM hypervisor?

Never mind, I now see your bisection result. Good show, thank you!!!

Thanx, Paul

> And I echo Joel's requests for your .config file.
>
> Thanx, Paul

2023-07-31 05:27:31

by Paul E. McKenney

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On Thu, Jul 27, 2023 at 09:22:52PM -0700, Guenter Roeck wrote:
> On 7/27/23 13:33, Paul E. McKenney wrote:
> [ ... ]
>
> > So which of the following Kconfig options is defined in your .config?
> > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> >
>
> Only CONFIG_TASKS_RCU. I added another log message after call_rcu_tasks().
> It never returns from that function.
>
> [ 1.168993] Running RCU synchronous self tests
> [ 1.169219] Running RCU synchronous self tests
> [ 1.285795] smpboot: CPU0: Intel Xeon Processor (Cascadelake) (family: 0x6, model: 0x55, stepping: 0x6)
> [ 1.302827] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
> [ 1.304526] Running RCU Tasks wait API self tests
>
> ... and then nothing for at least 10 minutes (then I gave up and stopped the test).
>
> Qemu command line:
>
> qemu-system-x86_64 -kernel \
> arch/x86/boot/bzImage -M q35 -cpu Cascadelake-Server -no-reboot \
> -snapshot -device e1000e,netdev=net0 -netdev user,id=net0 -m 256 \
> -drive file=rootfs.iso,format=raw,if=ide,media=cdrom \
> --append "earlycon=uart8250,io,0x3f8,9600n8 panic=-1 slub_debug=FZPUA root=/dev/sr0 rootwait console=ttyS0 noreboot" \
> -d unimp,guest_errors -nographic -monitor none
>
> Again, this doesn't happen all the time. With Cascadelake-Server
> I see it maybe once every 5 boot attempts. I tried with qemu v8.0
> and v8.1. Note that it does seem to happen with various CPU types,
> only for some it seems to me more likely to happen (so maybe the
> CPU type was a red herring). It does seem to depend on the system
> load, and happen more often if the system is under heavy load.

Hmmm... What kernel are you using as your qemu/KVM hypervisor?

And I echo Joel's requests for your .config file.

Thanx, Paul

2023-07-31 05:38:03

by Guenter Roeck

Subject: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review

On 7/30/23 20:54, Paul E. McKenney wrote:
> On Thu, Jul 27, 2023 at 09:22:52PM -0700, Guenter Roeck wrote:
>> On 7/27/23 13:33, Paul E. McKenney wrote:
>> [ ... ]
>>
>>> So which of the following Kconfig options is defined in your .config?
>>> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>>>
>>
>> Only CONFIG_TASKS_RCU. I added another log message after call_rcu_tasks().
>> It never returns from that function.
>>
>> [ 1.168993] Running RCU synchronous self tests
>> [ 1.169219] Running RCU synchronous self tests
>> [ 1.285795] smpboot: CPU0: Intel Xeon Processor (Cascadelake) (family: 0x6, model: 0x55, stepping: 0x6)
>> [ 1.302827] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
>> [ 1.304526] Running RCU Tasks wait API self tests
>>
>> ... and then nothing for at least 10 minutes (then I gave up and stopped the test).
>>
>> Qemu command line:
>>
>> qemu-system-x86_64 -kernel \
>> arch/x86/boot/bzImage -M q35 -cpu Cascadelake-Server -no-reboot \
>> -snapshot -device e1000e,netdev=net0 -netdev user,id=net0 -m 256 \
>> -drive file=rootfs.iso,format=raw,if=ide,media=cdrom \
>> --append "earlycon=uart8250,io,0x3f8,9600n8 panic=-1 slub_debug=FZPUA root=/dev/sr0 rootwait console=ttyS0 noreboot" \
>> -d unimp,guest_errors -nographic -monitor none
>>
>> Again, this doesn't happen all the time. With Cascadelake-Server
>> I see it maybe once every 5 boot attempts. I tried with qemu v8.0
>> and v8.1. Note that it does seem to happen with various CPU types,
>> only for some it seems to me more likely to happen (so maybe the
>> CPU type was a red herring). It does seem to depend on the system
>> load, and happen more often if the system is under heavy load.
>
> Hmmm... What kernel are you using as your qemu/KVM hypervisor?
>

Not sure I understand the question. KVM is disabled in my systems.
The host CPUs are Ryzen 3900X and 5900X, but I don't really see why
that would matter.

> And I echo Joel's requests for your .config file.
>

Did you see the e-mail I sent about this problem earlier today?

https://lore.kernel.org/lkml/[email protected]/

I think I'll declare this to be a problem with my test environment and disable
RCU debugging.

Thanks,
Guenter


2023-07-31 15:02:42

by Guenter Roeck

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/31/23 07:39, Peter Zijlstra wrote:
> On Mon, Jul 31, 2023 at 04:19:34PM +0200, Peter Zijlstra wrote:
>> On Sat, Jul 29, 2023 at 09:00:02PM -0700, Guenter Roeck wrote:
>>> On 7/27/23 16:18, Joel Fernandes wrote:
>>>
>>> [ ... ]
>>>
>>>>> I freely confess that I am having a hard time imagining what would
>>>>> be CPU dependent in that code. Timing, maybe? Whatever the reason,
>>>>> I am not seeing these failures in my testing.
>>>>>
>>>>> So which of the following Kconfig options is defined in your .config?
>>>>> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>>>>>
>>>>> If you have more than one of them, could you please apply this patch
>>>>> and show me the corresponding console output from the resulting hang?
>>>>
>>>> FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
>>>>
>>>
>>> I managed to bisect the problem. See bisect log below. Bisect repeated twice,
>>> so it should be reliable. I don't really understand it, but the following
>>> reverts fix the problem. This is on top of next-20230721 because next-20230728
>>> crashes immediately in my tests.
>>>
>>> 0caafe9b94ab (HEAD) Revert "sched/fair: Remove sched_feat(START_DEBIT)"
>>> 518bdbd39fdb Revert "sched/fair: Add lag based placement"
>>> a011162c3e32 Revert "sched/fair: Implement an EEVDF-like scheduling policy"
>>> df579720bf98 Revert "sched/fair: Commit to lag based placement"
>>> aac459a7e738 Revert "sched/smp: Use lag to simplify cross-runqueue placement"
>>> 8d686eb173e1 Revert "sched/fair: Commit to EEVDF"
>>> 486474c50f95 Revert "sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice"
>>> 79e94d67d08a Revert "sched/fair: Propagate enqueue flags into place_entity()"
>>> ae867bc97b71 (tag: next-20230721) Add linux-next specific files for 20230721
>>>
>>> For context: x86 images (32 and 64 bit) in -next tend to hang at
>>>
>>> [ 2.309323] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
>>> [ 2.311634] Running RCU-tasks wait API self tests
>>>
>>> The hang is not seen with every boot; it happens roughly about once every
>>> 10 boot attempts. It is not CPU dependent as I initially thought.
>>>
>>> Configuration file is at http://server.roeck-us.net/qemu/x86-next/config.
>>> Example qemu command line:
>>
>> Hurmph, let me see if I can reproduce on next-20230731 (not having the
>> older next thingies around).
>
> I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
> I've then modified run-sh to use:
>
> qemu-system-x86_64 -enable-kvm -cpu host
>
> What I'm seeing is that some boots get stuck at:
>
> [ 0.608230] Running RCU-tasks wait API self tests
>
> Is this the right 'problem' ?
>


Yes, exactly.

Thanks,
Guenter


2023-07-31 15:33:38

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Sat, Jul 29, 2023 at 09:00:02PM -0700, Guenter Roeck wrote:
> On 7/27/23 16:18, Joel Fernandes wrote:
>
> [ ... ]
>
> > > I freely confess that I am having a hard time imagining what would
> > > be CPU dependent in that code. Timing, maybe? Whatever the reason,
> > > I am not seeing these failures in my testing.
> > >
> > > So which of the following Kconfig options is defined in your .config?
> > > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> > >
> > > If you have more than one of them, could you please apply this patch
> > > and show me the corresponding console output from the resulting hang?
> >
> > FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
> >
>
> I managed to bisect the problem. See bisect log below. Bisect repeated twice,
> so it should be reliable. I don't really understand it, but the following
> reverts fix the problem. This is on top of next-20230721 because next-20230728
> crashes immediately in my tests.
>
> 0caafe9b94ab (HEAD) Revert "sched/fair: Remove sched_feat(START_DEBIT)"
> 518bdbd39fdb Revert "sched/fair: Add lag based placement"
> a011162c3e32 Revert "sched/fair: Implement an EEVDF-like scheduling policy"
> df579720bf98 Revert "sched/fair: Commit to lag based placement"
> aac459a7e738 Revert "sched/smp: Use lag to simplify cross-runqueue placement"
> 8d686eb173e1 Revert "sched/fair: Commit to EEVDF"
> 486474c50f95 Revert "sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice"
> 79e94d67d08a Revert "sched/fair: Propagate enqueue flags into place_entity()"
> ae867bc97b71 (tag: next-20230721) Add linux-next specific files for 20230721
>
> For context: x86 images (32 and 64 bit) in -next tend to hang at
>
> [ 2.309323] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
> [ 2.311634] Running RCU-tasks wait API self tests
>
> The hang is not seen with every boot; it happens roughly about once every
> 10 boot attempts. It is not CPU dependent as I initially thought.
>
> Configuration file is at http://server.roeck-us.net/qemu/x86-next/config.
> Example qemu command line:

Hurmph, let me see if I can reproduce on next-20230731 (not having the
older next thingies around).

2023-07-31 15:40:40

by Guenter Roeck

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/31/23 07:47, Peter Zijlstra wrote:
> On Mon, Jul 31, 2023 at 07:35:13AM -0700, Guenter Roeck wrote:
>
>>> Hurmph, let me see if I can reproduce on next-20230731 (not having the
>>> older next thingies around).
>>
>> That crashes hard with my configuration.
>>
>> [ 6.353191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
>> [ 6.353392] BUG: unable to handle page fault for address: ffff9b10c0013cd0
>> [ 6.353531] #PF: supervisor instruction fetch in kernel mode
>> [ 6.353624] #PF: error_code(0x0011) - permissions violation
>> [ 6.353751] PGD 1000067 P4D 1000067 PUD 1205067 PMD 1206067 PTE 800000000124e063
>> [ 6.354011] Oops: 0011 [#1] PREEMPT SMP PTI
>> [ 6.354164] CPU: 0 PID: 182 Comm: kunit_try_catch Tainted: G N 6.5.0-rc4-next-20230731 #1
>> [ 6.354315] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>> [ 6.354525] RIP: 0010:0xffff9b10c0013cd0
>> [ 6.354793] Code: ff ff 60 64 ce a9 ff ff ff ff 00 00 00 00 00 00 00 00 d1 3a bc a8 ff ff ff ff 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <f0> 00 01 44 10 8a ff ff b8 01 01 44 10 8a ff ff 00 00 00 00 00 00
>> [ 6.355059] RSP: 0000:ffff9b10c027fd60 EFLAGS: 00000246
>> [ 6.355157] RAX: ffff9b10c0013cd0 RBX: ffff8a1043bdb400 RCX: 0000000000000000
>> [ 6.355259] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a1043bdb400
>> [ 6.355358] RBP: ffff9b10c027fdc8 R08: 0000000000000001 R09: 0000000000000001
>> [ 6.355456] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9b10c027fe74
>> [ 6.355556] R13: ffff8a10440100f0 R14: ffff8a10440101b8 R15: ffff9b10c027fe74
>> [ 6.355679] FS: 0000000000000000(0000) GS:ffff8a104fc00000(0000) knlGS:0000000000000000
>> [ 6.355798] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 6.355886] CR2: ffff9b10c0013cd0 CR3: 000000000e048000 CR4: 00000000003506f0
>> [ 6.356029] Call Trace:
>> [ 6.356158] <TASK>
>> [ 6.356334] ? __die+0x1f/0x70
>> [ 6.356472] ? page_fault_oops+0x14a/0x460
>> [ 6.356547] ? exc_page_fault+0xee/0x1c0
>> [ 6.356612] ? asm_exc_page_fault+0x26/0x30
>> [ 6.356703] ? kunit_filter_attr_tests+0xc4/0x2e0
>> [ 6.356796] kunit_filter_suites+0x2e2/0x460
>> [ 6.356889] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
>> [ 6.356979] filter_suites_test+0xea/0x2c0
>> [ 6.357051] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
>> [ 6.357148] kunit_generic_run_threadfn_adapter+0x15/0x20
>> [ 6.357228] kthread+0xef/0x120
>> [ 6.357282] ? __pfx_kthread+0x10/0x10
>> [ 6.357343] ret_from_fork+0x2f/0x50
>> [ 6.357399] ? __pfx_kthread+0x10/0x10
>> [ 6.357458] ret_from_fork_asm+0x1b/0x30
>> [ 6.357560] </TASK>
>> [ 6.357632] Modules linked in:
>> [ 6.357786] CR2: ffff9b10c0013cd0
>> [ 6.358010] ---[ end trace 0000000000000000 ]---
>
> I get:
>
> [ 2.423691] ------------[ cut here ]------------
> [ 2.424994] WARNING: CPU: 0 PID: 184 at mm/slab_common.c:992 free_large_kmalloc+0x4f/0x80
> [ 2.426183] Modules linked in:
> [ 2.426624] CPU: 0 PID: 184 Comm: kunit_try_catch Tainted: G N 6.5.0-rc4-next-20230731 #1
> [ 2.427964] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 2.429265] RIP: 0010:free_large_kmalloc+0x4f/0x80
> [ 2.429952] Code: f7 da 48 63 d2 48 8b 03 be 06 00 00 00 48 c1 e8 3a 48 8b 3c c5 60 ba 11 ab e8 0d 52 ff ff 89 ee 48 89 df 5b 5d e9 41 df 03 00 <0f> 0b 80 3d 49 43 e9 01 00 75
> [ 2.432511] RSP: 0000:ffffadcb0024bdb8 EFLAGS: 00010246
> [ 2.433259] RAX: 0100000000001000 RBX: ffffd16bc018aa40 RCX: ffffadcb0024bd7c
> [ 2.434262] RDX: ffffd16bc018aa48 RSI: ffffffffa96a9ec7 RDI: ffffd16bc018aa40
> [ 2.435265] RBP: ffffadcb0024be60 R08: 0000000000000001 R09: 0000000000000001
> [ 2.436269] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a7084014410
> [ 2.437267] R13: ffff8a70840c4000 R14: 0000000000000002 R15: ffff8a70840564a8
> [ 2.438271] FS: 0000000000000000(0000) GS:ffff8a708f800000(0000) knlGS:0000000000000000
> [ 2.439403] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2.440215] CR2: ffff8a7089401000 CR3: 0000000007a48001 CR4: 0000000000170ef0
> [ 2.441218] Call Trace:
> [ 2.441568] <TASK>
> [ 2.441883] ? free_large_kmalloc+0x4f/0x80
> [ 2.442491] ? __warn+0x80/0x170
> [ 2.442988] ? free_large_kmalloc+0x4f/0x80
> [ 2.443591] ? report_bug+0x171/0x1a0
> [ 2.444145] ? handle_bug+0x3c/0x70
> [ 2.444662] ? exc_invalid_op+0x17/0x70
> [ 2.445225] ? asm_exc_invalid_op+0x1a/0x20
> [ 2.445844] ? kunit_add_action+0xc7/0x140
> [ 2.446455] ? free_large_kmalloc+0x4f/0x80
> [ 2.447054] kunit_filter_suites+0x468/0x480
> [ 2.447662] ? kunit_add_action+0xc7/0x140
> [ 2.448258] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
> [ 2.449105] filter_suites_test+0xea/0x2c0
> [ 2.449702] kunit_generic_run_threadfn_adapter+0x15/0x20
> [ 2.450469] kthread+0xf0/0x120
> [ 2.450940] ? __pfx_kthread+0x10/0x10
> [ 2.451481] ret_from_fork+0x2f/0x50
> [ 2.452012] ? __pfx_kthread+0x10/0x10
> [ 2.452557] ret_from_fork_asm+0x1b/0x30
> [ 2.453146] </TASK>
> [ 2.453474] irq event stamp: 677
> [ 2.453943] hardirqs last enabled at (689): [<ffffffffa911c24a>] console_unlock+0x10a/0x160
> [ 2.455151] hardirqs last disabled at (700): [<ffffffffa911c22f>] console_unlock+0xef/0x160
> [ 2.456329] softirqs last enabled at (662): [<ffffffffa909179a>] irq_exit_rcu+0x7a/0xa0
> [ 2.457474] softirqs last disabled at (657): [<ffffffffa909179a>] irq_exit_rcu+0x7a/0xa0
> [ 2.458610] ---[ end trace 0000000000000000 ]---
>

Same problem. I see the warning on some architectures, the crash on others.
The fix for that problem is at
https://lore.kernel.org/linux-kselftest/[email protected]/
It is caused by the "kunit: Add test attributes API" patch series.
See https://lore.kernel.org/lkml/[email protected]/T/

Guenter


2023-07-31 16:12:33

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, Jul 31, 2023 at 04:19:34PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 29, 2023 at 09:00:02PM -0700, Guenter Roeck wrote:
> > On 7/27/23 16:18, Joel Fernandes wrote:
> >
> > [ ... ]
> >
> > > > I freely confess that I am having a hard time imagining what would
> > > > be CPU dependent in that code. Timing, maybe? Whatever the reason,
> > > > I am not seeing these failures in my testing.
> > > >
> > > > So which of the following Kconfig options is defined in your .config?
> > > > CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
> > > >
> > > > If you have more than one of them, could you please apply this patch
> > > > and show me the corresponding console output from the resulting hang?
> > >
> > > FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
> > >
> >
> > I managed to bisect the problem. See bisect log below. Bisect repeated twice,
> > so it should be reliable. I don't really understand it, but the following
> > reverts fix the problem. This is on top of next-20230721 because next-20230728
> > crashes immediately in my tests.
> >
> > 0caafe9b94ab (HEAD) Revert "sched/fair: Remove sched_feat(START_DEBIT)"
> > 518bdbd39fdb Revert "sched/fair: Add lag based placement"
> > a011162c3e32 Revert "sched/fair: Implement an EEVDF-like scheduling policy"
> > df579720bf98 Revert "sched/fair: Commit to lag based placement"
> > aac459a7e738 Revert "sched/smp: Use lag to simplify cross-runqueue placement"
> > 8d686eb173e1 Revert "sched/fair: Commit to EEVDF"
> > 486474c50f95 Revert "sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice"
> > 79e94d67d08a Revert "sched/fair: Propagate enqueue flags into place_entity()"
> > ae867bc97b71 (tag: next-20230721) Add linux-next specific files for 20230721
> >
> > For context: x86 images (32 and 64 bit) in -next tend to hang at
> >
> > [ 2.309323] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
> > [ 2.311634] Running RCU-tasks wait API self tests
> >
> > The hang is not seen with every boot; it happens roughly once every
> > 10 boot attempts. It is not CPU dependent as I initially thought.
> >
> > Configuration file is at http://server.roeck-us.net/qemu/x86-next/config.
> > Example qemu command line:
>
> Hurmph, let me see if I can reproduce on next-20230731 (not having the
> older next thingies around).

I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
I've then modified run-sh to use:

qemu-system-x86_64 -enable-kvm -cpu host

What I'm seeing is that some boots get stuck at:

[ 0.608230] Running RCU-tasks wait API self tests

Is this the right 'problem' ?


2023-07-31 16:19:21

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, Jul 31, 2023 at 07:48:19AM -0700, Guenter Roeck wrote:

> > I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
> > I've then modified run-sh to use:
> >
> > qemu-system-x86_64 -enable-kvm -cpu host
> >
> > What I'm seeing is that some boots get stuck at:
> >
> > [ 0.608230] Running RCU-tasks wait API self tests
> >
> > Is this the right 'problem' ?
> >
>
>
> Yes, exactly.

Excellent! Let me prod that with something sharp, see what comes
creeping out.

2023-07-31 16:34:13

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, Jul 31, 2023 at 07:35:13AM -0700, Guenter Roeck wrote:

> > Hurmph, let me see if I can reproduce on next-20230731 (not having the
> > older next thingies around).
>
> That crashes hard with my configuration.
>
> [ 6.353191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [ 6.353392] BUG: unable to handle page fault for address: ffff9b10c0013cd0
> [ 6.353531] #PF: supervisor instruction fetch in kernel mode
> [ 6.353624] #PF: error_code(0x0011) - permissions violation
> [ 6.353751] PGD 1000067 P4D 1000067 PUD 1205067 PMD 1206067 PTE 800000000124e063
> [ 6.354011] Oops: 0011 [#1] PREEMPT SMP PTI
> [ 6.354164] CPU: 0 PID: 182 Comm: kunit_try_catch Tainted: G N 6.5.0-rc4-next-20230731 #1
> [ 6.354315] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [ 6.354525] RIP: 0010:0xffff9b10c0013cd0
> [ 6.354793] Code: ff ff 60 64 ce a9 ff ff ff ff 00 00 00 00 00 00 00 00 d1 3a bc a8 ff ff ff ff 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <f0> 00 01 44 10 8a ff ff b8 01 01 44 10 8a ff ff 00 00 00 00 00 00
> [ 6.355059] RSP: 0000:ffff9b10c027fd60 EFLAGS: 00000246
> [ 6.355157] RAX: ffff9b10c0013cd0 RBX: ffff8a1043bdb400 RCX: 0000000000000000
> [ 6.355259] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a1043bdb400
> [ 6.355358] RBP: ffff9b10c027fdc8 R08: 0000000000000001 R09: 0000000000000001
> [ 6.355456] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9b10c027fe74
> [ 6.355556] R13: ffff8a10440100f0 R14: ffff8a10440101b8 R15: ffff9b10c027fe74
> [ 6.355679] FS: 0000000000000000(0000) GS:ffff8a104fc00000(0000) knlGS:0000000000000000
> [ 6.355798] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6.355886] CR2: ffff9b10c0013cd0 CR3: 000000000e048000 CR4: 00000000003506f0
> [ 6.356029] Call Trace:
> [ 6.356158] <TASK>
> [ 6.356334] ? __die+0x1f/0x70
> [ 6.356472] ? page_fault_oops+0x14a/0x460
> [ 6.356547] ? exc_page_fault+0xee/0x1c0
> [ 6.356612] ? asm_exc_page_fault+0x26/0x30
> [ 6.356703] ? kunit_filter_attr_tests+0xc4/0x2e0
> [ 6.356796] kunit_filter_suites+0x2e2/0x460
> [ 6.356889] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
> [ 6.356979] filter_suites_test+0xea/0x2c0
> [ 6.357051] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
> [ 6.357148] kunit_generic_run_threadfn_adapter+0x15/0x20
> [ 6.357228] kthread+0xef/0x120
> [ 6.357282] ? __pfx_kthread+0x10/0x10
> [ 6.357343] ret_from_fork+0x2f/0x50
> [ 6.357399] ? __pfx_kthread+0x10/0x10
> [ 6.357458] ret_from_fork_asm+0x1b/0x30
> [ 6.357560] </TASK>
> [ 6.357632] Modules linked in:
> [ 6.357786] CR2: ffff9b10c0013cd0
> [ 6.358010] ---[ end trace 0000000000000000 ]---

I get:

[ 2.423691] ------------[ cut here ]------------
[ 2.424994] WARNING: CPU: 0 PID: 184 at mm/slab_common.c:992 free_large_kmalloc+0x4f/0x80
[ 2.426183] Modules linked in:
[ 2.426624] CPU: 0 PID: 184 Comm: kunit_try_catch Tainted: G N 6.5.0-rc4-next-20230731 #1
[ 2.427964] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 2.429265] RIP: 0010:free_large_kmalloc+0x4f/0x80
[ 2.429952] Code: f7 da 48 63 d2 48 8b 03 be 06 00 00 00 48 c1 e8 3a 48 8b 3c c5 60 ba 11 ab e8 0d 52 ff ff 89 ee 48 89 df 5b 5d e9 41 df 03 00 <0f> 0b 80 3d 49 43 e9 01 00 75
[ 2.432511] RSP: 0000:ffffadcb0024bdb8 EFLAGS: 00010246
[ 2.433259] RAX: 0100000000001000 RBX: ffffd16bc018aa40 RCX: ffffadcb0024bd7c
[ 2.434262] RDX: ffffd16bc018aa48 RSI: ffffffffa96a9ec7 RDI: ffffd16bc018aa40
[ 2.435265] RBP: ffffadcb0024be60 R08: 0000000000000001 R09: 0000000000000001
[ 2.436269] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a7084014410
[ 2.437267] R13: ffff8a70840c4000 R14: 0000000000000002 R15: ffff8a70840564a8
[ 2.438271] FS: 0000000000000000(0000) GS:ffff8a708f800000(0000) knlGS:0000000000000000
[ 2.439403] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.440215] CR2: ffff8a7089401000 CR3: 0000000007a48001 CR4: 0000000000170ef0
[ 2.441218] Call Trace:
[ 2.441568] <TASK>
[ 2.441883] ? free_large_kmalloc+0x4f/0x80
[ 2.442491] ? __warn+0x80/0x170
[ 2.442988] ? free_large_kmalloc+0x4f/0x80
[ 2.443591] ? report_bug+0x171/0x1a0
[ 2.444145] ? handle_bug+0x3c/0x70
[ 2.444662] ? exc_invalid_op+0x17/0x70
[ 2.445225] ? asm_exc_invalid_op+0x1a/0x20
[ 2.445844] ? kunit_add_action+0xc7/0x140
[ 2.446455] ? free_large_kmalloc+0x4f/0x80
[ 2.447054] kunit_filter_suites+0x468/0x480
[ 2.447662] ? kunit_add_action+0xc7/0x140
[ 2.448258] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[ 2.449105] filter_suites_test+0xea/0x2c0
[ 2.449702] kunit_generic_run_threadfn_adapter+0x15/0x20
[ 2.450469] kthread+0xf0/0x120
[ 2.450940] ? __pfx_kthread+0x10/0x10
[ 2.451481] ret_from_fork+0x2f/0x50
[ 2.452012] ? __pfx_kthread+0x10/0x10
[ 2.452557] ret_from_fork_asm+0x1b/0x30
[ 2.453146] </TASK>
[ 2.453474] irq event stamp: 677
[ 2.453943] hardirqs last enabled at (689): [<ffffffffa911c24a>] console_unlock+0x10a/0x160
[ 2.455151] hardirqs last disabled at (700): [<ffffffffa911c22f>] console_unlock+0xef/0x160
[ 2.456329] softirqs last enabled at (662): [<ffffffffa909179a>] irq_exit_rcu+0x7a/0xa0
[ 2.457474] softirqs last disabled at (657): [<ffffffffa909179a>] irq_exit_rcu+0x7a/0xa0
[ 2.458610] ---[ end trace 0000000000000000 ]---

But then it continues and eventually reaches:

Linux version 6.5.0-rc4-next-20230731 (root@ivb-ep) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Mon Jul 31 15:39:05 CEST 2023
Network interface test passed
Boot successful.
/ #


Full log attached.


Attachments:
(No filename) (5.84 kB)
log.gz (24.15 kB)

2023-07-31 16:35:58

by Guenter Roeck

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/31/23 07:19, Peter Zijlstra wrote:
> On Sat, Jul 29, 2023 at 09:00:02PM -0700, Guenter Roeck wrote:
>> On 7/27/23 16:18, Joel Fernandes wrote:
>>
>> [ ... ]
>>
>>>> I freely confess that I am having a hard time imagining what would
>>>> be CPU dependent in that code. Timing, maybe? Whatever the reason,
>>>> I am not seeing these failures in my testing.
>>>>
>>>> So which of the following Kconfig options is defined in your .config?
>>>> CONFIG_TASKS_RCU, CONFIG_TASKS_RUDE_RCU, and CONFIG_TASKS_TRACE_RCU.
>>>>
>>>> If you have more than one of them, could you please apply this patch
>>>> and show me the corresponding console output from the resulting hang?
>>>
>>> FWIW, I am not able to repro this issue either. If a .config can be shared of the problem system, I can try it out to see if it can be reproduced on my side.
>>>
>>
>> I managed to bisect the problem. See bisect log below. Bisect repeated twice,
>> so it should be reliable. I don't really understand it, but the following
>> reverts fix the problem. This is on top of next-20230721 because next-20230728
>> crashes immediately in my tests.
>>
>> 0caafe9b94ab (HEAD) Revert "sched/fair: Remove sched_feat(START_DEBIT)"
>> 518bdbd39fdb Revert "sched/fair: Add lag based placement"
>> a011162c3e32 Revert "sched/fair: Implement an EEVDF-like scheduling policy"
>> df579720bf98 Revert "sched/fair: Commit to lag based placement"
>> aac459a7e738 Revert "sched/smp: Use lag to simplify cross-runqueue placement"
>> 8d686eb173e1 Revert "sched/fair: Commit to EEVDF"
>> 486474c50f95 Revert "sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice"
>> 79e94d67d08a Revert "sched/fair: Propagate enqueue flags into place_entity()"
>> ae867bc97b71 (tag: next-20230721) Add linux-next specific files for 20230721
>>
>> For context: x86 images (32 and 64 bit) in -next tend to hang at
>>
>> [ 2.309323] RCU Tasks: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
>> [ 2.311634] Running RCU-tasks wait API self tests
>>
>> The hang is not seen with every boot; it happens roughly once every
>> 10 boot attempts. It is not CPU dependent as I initially thought.
>>
>> Configuration file is at http://server.roeck-us.net/qemu/x86-next/config.
>> Example qemu command line:
>
> Hurmph, let me see if I can reproduce on next-20230731 (not having the
> older next thingies around).

That crashes hard with my configuration.

[ 6.353191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 6.353392] BUG: unable to handle page fault for address: ffff9b10c0013cd0
[ 6.353531] #PF: supervisor instruction fetch in kernel mode
[ 6.353624] #PF: error_code(0x0011) - permissions violation
[ 6.353751] PGD 1000067 P4D 1000067 PUD 1205067 PMD 1206067 PTE 800000000124e063
[ 6.354011] Oops: 0011 [#1] PREEMPT SMP PTI
[ 6.354164] CPU: 0 PID: 182 Comm: kunit_try_catch Tainted: G N 6.5.0-rc4-next-20230731 #1
[ 6.354315] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 6.354525] RIP: 0010:0xffff9b10c0013cd0
[ 6.354793] Code: ff ff 60 64 ce a9 ff ff ff ff 00 00 00 00 00 00 00 00 d1 3a bc a8 ff ff ff ff 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <f0> 00 01 44 10 8a ff ff b8 01 01 44 10 8a ff ff 00 00 00 00 00 00
[ 6.355059] RSP: 0000:ffff9b10c027fd60 EFLAGS: 00000246
[ 6.355157] RAX: ffff9b10c0013cd0 RBX: ffff8a1043bdb400 RCX: 0000000000000000
[ 6.355259] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a1043bdb400
[ 6.355358] RBP: ffff9b10c027fdc8 R08: 0000000000000001 R09: 0000000000000001
[ 6.355456] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9b10c027fe74
[ 6.355556] R13: ffff8a10440100f0 R14: ffff8a10440101b8 R15: ffff9b10c027fe74
[ 6.355679] FS: 0000000000000000(0000) GS:ffff8a104fc00000(0000) knlGS:0000000000000000
[ 6.355798] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.355886] CR2: ffff9b10c0013cd0 CR3: 000000000e048000 CR4: 00000000003506f0
[ 6.356029] Call Trace:
[ 6.356158] <TASK>
[ 6.356334] ? __die+0x1f/0x70
[ 6.356472] ? page_fault_oops+0x14a/0x460
[ 6.356547] ? exc_page_fault+0xee/0x1c0
[ 6.356612] ? asm_exc_page_fault+0x26/0x30
[ 6.356703] ? kunit_filter_attr_tests+0xc4/0x2e0
[ 6.356796] kunit_filter_suites+0x2e2/0x460
[ 6.356889] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[ 6.356979] filter_suites_test+0xea/0x2c0
[ 6.357051] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[ 6.357148] kunit_generic_run_threadfn_adapter+0x15/0x20
[ 6.357228] kthread+0xef/0x120
[ 6.357282] ? __pfx_kthread+0x10/0x10
[ 6.357343] ret_from_fork+0x2f/0x50
[ 6.357399] ? __pfx_kthread+0x10/0x10
[ 6.357458] ret_from_fork_asm+0x1b/0x30
[ 6.357560] </TASK>
[ 6.357632] Modules linked in:
[ 6.357786] CR2: ffff9b10c0013cd0
[ 6.358010] ---[ end trace 0000000000000000 ]---

Enabling CONFIG_ZERO_CALL_USED_REGS might fix (hide) this, but I have not tried.

Guenter


2023-07-31 16:55:46

by Roy Hopkins

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, 2023-07-31 at 18:14 +0200, Peter Zijlstra wrote:
> Ha!, I was poking around the same thing. My hack below seems to (so far,
> <20 boots) help things.
>
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 56c470a489c8..b083b5a30025 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -652,7 +658,11 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
>         t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
>         if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
>                 return;
> -       smp_mb(); /* Ensure others see full kthread. */
> +       for (;;) {
> +               cond_resched();
> +               if (smp_load_acquire(&rtp->kthread_ptr))
> +                       break;
> +       }
>  }
>  
>  #ifndef CONFIG_TINY_RCU

FWIW, here's my hack which seems to fix it.

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 9b9ce09f8f35..2e76fbfff9c6 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -52,6 +52,7 @@ struct rcu_tasks_percpu {
  * @cbs_gbl_lock: Lock protecting callback list.
  * @tasks_gp_mutex: Mutex protecting grace period, needed during mid-boot dead zone.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
+ * @kthread_started: Flag that indicates whether kthread has been launched.
  * @gp_func: This flavor's grace-period-wait function.
  * @gp_state: Grace period's most recent state transition (debugging).
  * @gp_sleep: Per-grace-period sleep to prevent CPU-bound looping.
@@ -92,6 +93,7 @@ struct rcu_tasks {
        unsigned long n_ipis;
        unsigned long n_ipis_fails;
        struct task_struct *kthread_ptr;
+       int kthread_started;
        rcu_tasks_gp_func_t gp_func;
        pregp_func_t pregp_func;
        pertask_func_t pertask_func;
@@ -582,7 +584,7 @@ static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp)
                return;

        // If the grace-period kthread is running, use it.
-       if (READ_ONCE(rtp->kthread_ptr)) {
+       if (READ_ONCE(rtp->kthread_started)) {
                wait_rcu_gp(rtp->call_func);
                return;
        }
@@ -595,6 +597,7 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
        struct task_struct *t;

        t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
+       rtp->kthread_started = 1;
        if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
                return;
        smp_mb(); /* Ensure others see full kthread. */

2023-07-31 17:07:28

by Guenter Roeck

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/31/23 09:14, Peter Zijlstra wrote:
> On Mon, Jul 31, 2023 at 05:08:29PM +0100, Roy Hopkins wrote:
>> On Mon, 2023-07-31 at 16:52 +0200, Peter Zijlstra wrote:
>>> On Mon, Jul 31, 2023 at 07:48:19AM -0700, Guenter Roeck wrote:
>>>
>>>>> I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
>>>>> I've then modified run-sh to use:
>>>>>
>>>>>    qemu-system-x86_64 -enable-kvm -cpu host
>>>>>
>>>>> What I'm seeing is that some boots get stuck at:
>>>>>
>>>>> [    0.608230] Running RCU-tasks wait API self tests
>>>>>
>>>>> Is this the right 'problem' ?
>>>>>
>>>>
>>>>
>>>> Yes, exactly.
>>>
>>> Excellent! Let me prod that with something sharp, see what comes
>>> creeping out.
>>
>> In an effort to get up to speed with this area of the kernel, I've been playing
>> around with this too today and managed to reproduce the problem using the same
>> configuration. I'm completely new to this code but I think I may have found the
>> root of the problem.
>>
>> What I've found is that there is a race condition between starting the RCU tasks
>> grace-period thread in rcu_spawn_tasks_kthread_generic() and a subsequent call
>> to synchronize_rcu_tasks_generic(). This results in rtp->tasks_gp_mutex being
>> locked in the initial thread which subsequently blocks the newly started grace-
>> period thread.
>>
>> The problem is that although synchronize_rcu_tasks_generic() checks to see if
>> the grace-period kthread is running, it uses rtp->kthread_ptr to achieve this.
>> This is only set in the thread entry point and not when the thread is created,
>> meaning that it is set only after the creating thread yields or is preempted. If
>> this has not happened before the next call to synchronize_rcu_tasks_generic()
>> then a deadlock occurs.
>>
>> I've created a debug patch that introduces a new flag in rcu_tasks that is set
>> when the kthread is created and used this in synchronize_rcu_tasks_generic() in
>> place of READ_ONCE(rtp->kthread_ptr). This fixes the issue in my test
>> environment.
>>
>> I'm happy to have a go at submitting a patch for this if it helps.
>
> Ha!, I was poking around the same thing. My hack below seems to (so far,
> <20 boots) help things.
>

So, dumb question:
How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?

Thanks,
Guenter

>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 56c470a489c8..b083b5a30025 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -652,7 +658,11 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
>         t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
>         if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
>                 return;
> -       smp_mb(); /* Ensure others see full kthread. */
> +       for (;;) {
> +               cond_resched();
> +               if (smp_load_acquire(&rtp->kthread_ptr))
> +                       break;
> +       }
>  }
>
> #ifndef CONFIG_TINY_RCU


2023-07-31 17:48:02

by Roy Hopkins

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, 2023-07-31 at 16:52 +0200, Peter Zijlstra wrote:
> On Mon, Jul 31, 2023 at 07:48:19AM -0700, Guenter Roeck wrote:
>
> > > I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
> > > I've then modified run-sh to use:
> > >
> > >    qemu-system-x86_64 -enable-kvm -cpu host
> > >
> > > What I'm seeing is that some boots get stuck at:
> > >
> > > [    0.608230] Running RCU-tasks wait API self tests
> > >
> > > Is this the right 'problem' ?
> > >
> >
> >
> > Yes, exactly.
>
> Excellent! Let me prod that with something sharp, see what comes
> creeping out.

In an effort to get up to speed with this area of the kernel, I've been playing
around with this too today and managed to reproduce the problem using the same
configuration. I'm completely new to this code but I think I may have found the
root of the problem.

What I've found is that there is a race condition between starting the RCU tasks
grace-period thread in rcu_spawn_tasks_kthread_generic() and a subsequent call
to synchronize_rcu_tasks_generic(). This results in rtp->tasks_gp_mutex being
locked in the initial thread which subsequently blocks the newly started grace-
period thread.

The problem is that although synchronize_rcu_tasks_generic() checks to see if
the grace-period kthread is running, it uses rtp->kthread_ptr to achieve this.
This is only set in the thread entry point and not when the thread is created,
meaning that it is set only after the creating thread yields or is preempted. If
this has not happened before the next call to synchronize_rcu_tasks_generic()
then a deadlock occurs.

I've created a debug patch that introduces a new flag in rcu_tasks that is set
when the kthread is created and used this in synchronize_rcu_tasks_generic() in
place of READ_ONCE(rtp->kthread_ptr). This fixes the issue in my test
environment.

I'm happy to have a go at submitting a patch for this if it helps.
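
The ordering problem described above can be replayed by hand without any threads. What follows is a minimal userspace sketch, not kernel code: struct model_rtp, model_spawn(), model_kthread_entry() and the two model_sync_*() helpers are hypothetical names invented for illustration. Only the field names kthread_ptr and kthread_started are taken from this thread, and the two-way decision is a simplification of what synchronize_rcu_tasks_generic() is described as doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct rcu_tasks. */
struct model_rtp {
	void *kthread_ptr;    /* written by the new thread at its entry point */
	bool kthread_started; /* written by the creator (the debug patch's flag) */
};

/* Which way the synchronize path decides to go in this model. */
enum model_path {
	MODEL_USE_KTHREAD,    /* hand the work to the grace-period kthread */
	MODEL_MIDBOOT_INLINE  /* run the grace period in the calling thread */
};

/* Creator side: the thread now exists but has not necessarily run yet. */
void model_spawn(struct model_rtp *rtp)
{
	assert(rtp != NULL);
	rtp->kthread_started = true;
}

/* New thread: the first step it takes once it is actually scheduled. */
void model_kthread_entry(struct model_rtp *rtp, void *self)
{
	assert(rtp != NULL && self != NULL);
	rtp->kthread_ptr = self;
}

/* Check keyed on kthread_ptr, as in the code being discussed. */
enum model_path model_sync_by_ptr(const struct model_rtp *rtp)
{
	assert(rtp != NULL);
	return rtp->kthread_ptr ? MODEL_USE_KTHREAD : MODEL_MIDBOOT_INLINE;
}

/* Check keyed on the creation-time flag, as in the debug patch. */
enum model_path model_sync_by_flag(const struct model_rtp *rtp)
{
	assert(rtp != NULL);
	return rtp->kthread_started ? MODEL_USE_KTHREAD : MODEL_MIDBOOT_INLINE;
}
```

In this model, calling model_spawn() and then model_sync_by_ptr() before model_kthread_entry() has run still selects MODEL_MIDBOOT_INLINE. That is the window in which the caller and the freshly created kthread can both head into the grace-period code. model_sync_by_flag() selects MODEL_USE_KTHREAD in the same state.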

2023-07-31 19:20:50

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, Jul 31, 2023 at 05:08:29PM +0100, Roy Hopkins wrote:
> On Mon, 2023-07-31 at 16:52 +0200, Peter Zijlstra wrote:
> > On Mon, Jul 31, 2023 at 07:48:19AM -0700, Guenter Roeck wrote:
> >
> > > > I've taken your config above, and the rootfs.ext2 and run-sh from x86/.
> > > > I've then modified run-sh to use:
> > > >
> > > > >    qemu-system-x86_64 -enable-kvm -cpu host
> > > > >
> > > > > What I'm seeing is that some boots get stuck at:
> > > > >
> > > > > [    0.608230] Running RCU-tasks wait API self tests
> > > >
> > > > Is this the right 'problem' ?
> > > >
> > >
> > >
> > > Yes, exactly.
> >
> > Excellent! Let me prod that with something sharp, see what comes
> > creeping out.
>
> In an effort to get up to speed with this area of the kernel, I've been playing
> around with this too today and managed to reproduce the problem using the same
> configuration. I'm completely new to this code but I think I may have found the
> root of the problem.
>
> What I've found is that there is a race condition between starting the RCU tasks
> grace-period thread in rcu_spawn_tasks_kthread_generic() and a subsequent call
> to synchronize_rcu_tasks_generic(). This results in rtp->tasks_gp_mutex being
> locked in the initial thread which subsequently blocks the newly started grace-
> period thread.
>
> The problem is that although synchronize_rcu_tasks_generic() checks to see if
> the grace-period kthread is running, it uses rtp->kthread_ptr to achieve this.
> This is only set in the thread entry point and not when the thread is created,
> meaning that it is set only after the creating thread yields or is preempted. If
> this has not happened before the next call to synchronize_rcu_tasks_generic()
> then a deadlock occurs.
>
> I've created a debug patch that introduces a new flag in rcu_tasks that is set
> when the kthread is created and used this in synchronize_rcu_tasks_generic() in
> place of READ_ONCE(rtp->kthread_ptr). This fixes the issue in my test
> environment.
>
> I'm happy to have a go at submitting a patch for this if it helps.

Ha!, I was poking around the same thing. My hack below seems to (so far,
<20 boots) help things.


diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 56c470a489c8..b083b5a30025 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -652,7 +658,11 @@ static void __init rcu_spawn_tasks_kthread_generic(struct rcu_tasks *rtp)
        t = kthread_run(rcu_tasks_kthread, rtp, "%s_kthread", rtp->kname);
        if (WARN_ONCE(IS_ERR(t), "%s: Could not start %s grace-period kthread, OOM is now expected behavior\n", __func__, rtp->name))
                return;
-       smp_mb(); /* Ensure others see full kthread. */
+       for (;;) {
+               cond_resched();
+               if (smp_load_acquire(&rtp->kthread_ptr))
+                       break;
+       }
 }

#ifndef CONFIG_TINY_RCU
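
The loop in the hack above is a wait-for-publication pattern: the creator does not return until the new thread has announced itself. A rough userspace analogue using POSIX threads and C11 atomics is sketched below. It rests on several assumptions: sched_yield() stands in for cond_resched(), an acquire load from <stdatomic.h> stands in for smp_load_acquire(), and the worker publishes with a release store, which may not match how the kernel thread writes kthread_ptr. All model_* names and the payload field are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-flavor state. */
struct model_worker {
	_Atomic(struct model_worker *) published; /* NULL until the worker runs */
	pthread_t thread;
	int payload;                              /* written before publication */
};

/* Worker entry point: initialise state, then publish it. */
static void *model_worker_main(void *arg)
{
	struct model_worker *w = arg;

	w->payload = 42;
	/* Release store: the payload write above is ordered before this. */
	atomic_store_explicit(&w->published, w, memory_order_release);
	return NULL;
}

/*
 * Spawn the worker and do not return until it has published itself.
 * Returns the payload seen after the acquire load, or -1 if the
 * thread could not be created.
 */
int model_spawn_and_wait(struct model_worker *w)
{
	int seen;

	assert(w != NULL);
	atomic_init(&w->published, NULL);
	w->payload = 0;

	if (pthread_create(&w->thread, NULL, model_worker_main, w) != 0)
		return -1;

	for (;;) {
		sched_yield(); /* give the new thread a chance to run */
		if (atomic_load_explicit(&w->published, memory_order_acquire))
			break;
	}

	/* The acquire load pairs with the release store in the worker. */
	seen = w->payload;
	pthread_join(w->thread, NULL);
	return seen;
}
```

With this shape, any code that runs after model_spawn_and_wait() returns can rely on the published pointer being non-NULL, so a later check keyed on that pointer no longer sees the not-yet-started state.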

2023-07-31 22:26:12

by Peter Zijlstra

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
> > Ha!, I was poking around the same thing. My hack below seems to (so far,
> > <20 boots) help things.
> >
>
> So, dumb question:
> How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?

That commit changes the timings of things; dumb luck otherwise.

2023-08-01 17:59:23

by Guenter Roeck

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 7/31/23 14:15, Peter Zijlstra wrote:
> On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
>>> Ha!, I was poking around the same thing. My hack below seems to (so far,
>>> <20 boots) help things.
>>>
>>
>> So, dumb question:
>> How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
>
> That commit changes the timings of things; dumb luck otherwise.

Kind of scary. So I only experienced the problem because the START_DEBIT patch
happened to be queued roughly at the same time, and it might otherwise have
found its way unnoticed into the upstream kernel. That makes me wonder if this
or other similar patches may uncover similar problems elsewhere in the kernel
(i.e., either hide new or existing race conditions or expose existing ones).

This in turn makes me wonder if it would be possible to define a test which
would uncover such problems without the START_DEBIT patch. Any idea ?

Guenter


2023-08-01 19:41:08

by Paul E. McKenney

Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Tue, Aug 01, 2023 at 12:11:04PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> > On 7/31/23 14:15, Peter Zijlstra wrote:
> > > On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
> > > > > Ha!, I was poking around the same thing. My hack below seems to (so far,
> > > > > <20 boots) help things.
> > > > >
> > > >
> > > > So, dumb question:
> > > > How come this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
> > >
> > > That commit changes the timings of things; dumb luck otherwise.
> >
> > Kind of scary. So I only experienced the problem because the START_DEBIT patch
> > happened to be queued roughly at the same time, and it might otherwise have
> > found its way unnoticed into the upstream kernel.

And just to set the record straight, this bug has been in mainline for
about a year, since v5.19.

Thanx, Paul

> > That makes me wonder if this
> > or other similar patches may uncover similar problems elsewhere in the kernel
> > (i.e., either hide new or existing race conditions or expose existing ones).
> >
> > This in turn makes me wonder if it would be possible to define a test which
> > would uncover such problems without the START_DEBIT patch. Any idea ?
>
> Thank you all for tracking this down!
>
> One way is to put a schedule_timeout_idle(100) right before the call to
> rcu_tasks_one_gp() from synchronize_rcu_tasks_generic(). That is quite
> specific to this particular issue, but it does have the virtue of making
> it actually happen in my testing.
>
> There have been a few academic projects that inject delays at points
> chosen by various heuristics plus some randomness. But this would be
> a bit of a challenge to those because each kernel only passes through
> this window once at boot time.
>
> Please see below for my preferred fix. Does this work for you guys?
>
> Back to figuring out why recent kernels occasionally blow up all
> rcutorture guest OSes...
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 7294be62727b..2d5b8385c357 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
>         if (unlikely(midboot)) {
>                 needgpcb = 0x2;
>         } else {
> +               mutex_unlock(&rtp->tasks_gp_mutex);
>                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>                 rcuwait_wait_event(&rtp->cbs_wait,
>                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
>                                    TASK_IDLE);
> +               mutex_lock(&rtp->tasks_gp_mutex);
>         }
>
>         if (needgpcb & 0x2) {

2023-08-01 19:49:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> On 7/31/23 14:15, Peter Zijlstra wrote:
> > On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
> > > > Ha!, I was poking around the same thing. My hack below seems to (so far,
> > > > <20 boots) help things.
> > > >
> > >
> > > So, dumb question:
> > > How comes this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
> >
> > That commit changes the timings of things; dumb luck otherwise.
>
> Kind of scary. So I only experienced the problem because the START_DEBIT patch
> happened to be queued roughly at the same time, and it might otherwise have
> found its way unnoticed into the upstream kernel. That makes me wonder if this
> or other similar patches may uncover similar problems elsewhere in the kernel
> (i.e., either hide new or existing race conditions or expose existing ones).
>
> This in turn makes me wonder if it would be possible to define a test which
> would uncover such problems without the START_DEBIT patch. Any idea ?

Thank you all for tracking this down!

One way is to put a schedule_timeout_idle(100) right before the call to
rcu_tasks_one_gp() from synchronize_rcu_tasks_generic(). That is quite
specific to this particular issue, but it does have the virtue of making
it actually happen in my testing.

There have been a few academic projects that inject delays at points
chosen by various heuristics plus some randomness. But this would be
a bit of a challenge to those because each kernel only passes through
this window once at boot time.

Please see below for my preferred fix. Does this work for you guys?

Back to figuring out why recent kernels occasionally blow up all
rcutorture guest OSes...

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 7294be62727b..2d5b8385c357 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
        if (unlikely(midboot)) {
                needgpcb = 0x2;
        } else {
+               mutex_unlock(&rtp->tasks_gp_mutex);
                set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
                rcuwait_wait_event(&rtp->cbs_wait,
                                   (needgpcb = rcu_tasks_need_gpcb(rtp)),
                                   TASK_IDLE);
+               mutex_lock(&rtp->tasks_gp_mutex);
        }

        if (needgpcb & 0x2) {
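
An illustrative aside, not part of the posted patch: the shape of the fix, dropping a mutex before an open-ended sleep and retaking it afterwards, can be sketched in userspace C with pthreads. Everything here is hypothetical scaffolding. The names struct gp_state, wait_for_callbacks() and post_callback() are invented, gp_mutex merely plays the role of ->tasks_gp_mutex, and a condition variable stands in for rcuwait_wait_event(). It sketches the locking pattern only, not the RCU Tasks implementation.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the grace-period state. */
struct gp_state {
	pthread_mutex_t gp_mutex;	/* plays the role of ->tasks_gp_mutex */
	pthread_mutex_t cbs_lock;	/* protects have_cbs */
	pthread_cond_t cbs_wait;	/* signalled when a callback arrives */
	bool have_cbs;			/* simplification: one pending flag */
};

#define GP_STATE_INIT { PTHREAD_MUTEX_INITIALIZER, \
			PTHREAD_MUTEX_INITIALIZER, \
			PTHREAD_COND_INITIALIZER, false }

/* Called with gp_mutex held; returns with gp_mutex held. */
static void wait_for_callbacks(struct gp_state *gp)
{
	/* Do not sleep for an unbounded time while holding gp_mutex. */
	pthread_mutex_unlock(&gp->gp_mutex);

	pthread_mutex_lock(&gp->cbs_lock);
	while (!gp->have_cbs)
		pthread_cond_wait(&gp->cbs_wait, &gp->cbs_lock);
	gp->have_cbs = false;		/* consume the pending flag */
	pthread_mutex_unlock(&gp->cbs_lock);

	/* Retake gp_mutex before doing any grace-period work. */
	pthread_mutex_lock(&gp->gp_mutex);
}

/* Mark a callback as pending and wake the waiter, if any. */
static void post_callback(struct gp_state *gp)
{
	pthread_mutex_lock(&gp->cbs_lock);
	gp->have_cbs = true;
	pthread_cond_signal(&gp->cbs_wait);
	pthread_mutex_unlock(&gp->cbs_lock);
}
```

With this shape, another thread that needs gp_mutex while the waiter is idle can take it instead of blocking behind the sleeper, which appears to be the point of wrapping the wait in the unlock/lock pair.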

2023-08-01 20:15:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> On 7/31/23 14:15, Peter Zijlstra wrote:
> > On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
> > > > Ha!, I was poking around the same thing. My hack below seems to (so far,
> > > > <20 boots) help things.
> > > >
> > >
> > > So, dumb question:
> > > How comes this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
> >
> > That commit changes the timings of things; dumb luck otherwise.
>
> Kind of scary. So I only experienced the problem because the START_DEBIT patch
> happened to be queued roughly at the same time, and it might otherwise have
> found its way unnoticed into the upstream kernel. That makes me wonder if this
> or other similar patches may uncover similar problems elsewhere in the kernel
> (i.e., either hide new or existing race conditions or expose existing ones).
>
> This in turn makes me wonder if it would be possible to define a test which
> would uncover such problems without the START_DEBIT patch. Any idea ?

IIRC some of the thread sanitizers use breakpoints to inject random
sleeps, specifically to tickle races.

2023-08-01 22:47:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Tue, Aug 01, 2023 at 09:08:52PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> > On 7/31/23 14:15, Peter Zijlstra wrote:
> > > On Mon, Jul 31, 2023 at 09:34:29AM -0700, Guenter Roeck wrote:
> > > > > Ha!, I was poking around the same thing. My hack below seems to (so far,
> > > > > <20 boots) help things.
> > > > >
> > > >
> > > > So, dumb question:
> > > > How comes this bisects to "sched/fair: Remove sched_feat(START_DEBIT)" ?
> > >
> > > That commit changes the timings of things; dumb luck otherwise.
> >
> > Kind of scary. So I only experienced the problem because the START_DEBIT patch
> > happened to be queued roughly at the same time, and it might otherwise have
> > found its way unnoticed into the upstream kernel. That makes me wonder if this
> > or other similar patches may uncover similar problems elsewhere in the kernel
> > (i.e., either hide new or existing race conditions or expose existing ones).
> >
> > This in turn makes me wonder if it would be possible to define a test which
> > would uncover such problems without the START_DEBIT patch. Any idea ?
>
> IIRC some of the thread sanitizers use breakpoints to inject random
> sleeps, specifically to tickle races.

I have heard of some of these, arguably including KCSAN, but they
would have a tough time on this one.

They would have to inject many milliseconds between the check of
->kthread_ptr in synchronize_rcu_tasks_generic() and that mutex_lock()
in rcu_tasks_one_gp(). Plus this window only occurs during boot shortly
before init is spawned.

On the other hand, randomly injecting delay just before acquiring each
lock would cover this case. But such a sanitizer would still only get
one shot per boot of the kernel for this particular bug.

Thanx, Paul
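
An illustrative aside, not part of the original message: the idea of injecting a random delay just before each lock acquisition can be sketched in userspace C. The helper names delayed_mutex_lock() and next_delay_ms(), the MAX_INJECT_MS bound, and the small linear congruential generator are all assumptions made for this sketch. It uses pthreads and nanosleep() rather than any kernel facility, and it does not describe how KCSAN or any existing sanitizer works.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Hypothetical knob: exclusive upper bound on the injected delay. */
#define MAX_INJECT_MS 5

/* State for a tiny LCG; deliberately simple and not thread-safe. */
static unsigned int inject_state = 12345;

/* Return a pseudo-random delay in [0, MAX_INJECT_MS) milliseconds. */
static long next_delay_ms(void)
{
	inject_state = inject_state * 1103515245u + 12345u;
	return (long)((inject_state >> 16) % MAX_INJECT_MS);
}

/* Sleep for a pseudo-random interval, then acquire the mutex. */
static int delayed_mutex_lock(pthread_mutex_t *m)
{
	struct timespec ts = {
		.tv_sec = 0,
		.tv_nsec = next_delay_ms() * 1000000L,
	};

	nanosleep(&ts, NULL);	/* an interrupted sleep only shortens the delay */
	return pthread_mutex_lock(m);
}
```

A wrapper like this widens every window that ends in a lock acquisition, but for a window that is crossed once per boot it still gets only one attempt per run.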

2023-08-02 15:01:02

by Roy Hopkins

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
> On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
>
>
> Please see below for my preferred fix.  Does this work for you guys?
>
> Back to figuring out why recent kernels occasionally blow up all
> rcutorture guest OSes...
>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 7294be62727b..2d5b8385c357 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
>         if (unlikely(midboot)) {
>                 needgpcb = 0x2;
>         } else {
> +               mutex_unlock(&rtp->tasks_gp_mutex);
>                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>                 rcuwait_wait_event(&rtp->cbs_wait,
>                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
>                                    TASK_IDLE);
> +               mutex_lock(&rtp->tasks_gp_mutex);
>         }
>  
>         if (needgpcb & 0x2) {

Your preferred fix looks good to me.

With the original code I can quite easily reproduce the problem on my 
system every 10 reboots or so. With your fix in place the problem no
longer occurs.


2023-08-02 15:37:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
> On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
> > On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> >
> >
> > Please see below for my preferred fix.  Does this work for you guys?
> >
> > Back to figuring out why recent kernels occasionally blow up all
> > rcutorture guest OSes...
> >
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index 7294be62727b..2d5b8385c357 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
> >         if (unlikely(midboot)) {
> >                 needgpcb = 0x2;
> >         } else {
> > +               mutex_unlock(&rtp->tasks_gp_mutex);
> >                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
> >                 rcuwait_wait_event(&rtp->cbs_wait,
> >                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
> >                                    TASK_IDLE);
> > +               mutex_lock(&rtp->tasks_gp_mutex);
> >         }
> >
> >         if (needgpcb & 0x2) {
>
> Your preferred fix looks good to me.
>
> With the original code I can quite easily reproduce the problem on my
> system every 10 reboots or so. With your fix in place the problem no
> longer occurs.

Very good, thank you! May I add your Tested-by?

Thanx, Paul

2023-08-02 15:51:40

by Roy Hopkins

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Wed, 2023-08-02 at 08:05 -0700, Paul E. McKenney wrote:
> On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
> > On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> > >
> > >
> > > Please see below for my preferred fix.  Does this work for you guys?
> > >
> > > > Back to figuring out why recent kernels occasionally blow up all
> > > > rcutorture guest OSes...
> > >
> > >                                                         Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 7294be62727b..2d5b8385c357 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp,
> > > bool midboot)
> > >         if (unlikely(midboot)) {
> > >                 needgpcb = 0x2;
> > >         } else {
> > > +               mutex_unlock(&rtp->tasks_gp_mutex);
> > >                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
> > >                 rcuwait_wait_event(&rtp->cbs_wait,
> > >                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > >                                    TASK_IDLE);
> > > +               mutex_lock(&rtp->tasks_gp_mutex);
> > >         }
> > >  
> > >         if (needgpcb & 0x2) {
> >
> > Your preferred fix looks good to me.
> >
> > With the original code I can quite easily reproduce the problem on my 
> > system every 10 reboots or so. With your fix in place the problem no
> > longer occurs.
>
> Very good, thank you!  May I add your Tested-by?
>
>                                                         Thanx, Paul
Yes, please do.

2023-08-02 17:08:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Wed, Aug 02, 2023 at 04:31:12PM +0100, Roy Hopkins wrote:
> On Wed, 2023-08-02 at 08:05 -0700, Paul E. McKenney wrote:
> > On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
> > > On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> > > >
> > > >
> > > > Please see below for my preferred fix.  Does this work for you guys?
> > > >
> > > > Back to figuring out why recent kernels occasionally blow up all
> > > > rcutorture guest OSes...
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > index 7294be62727b..2d5b8385c357 100644
> > > > --- a/kernel/rcu/tasks.h
> > > > +++ b/kernel/rcu/tasks.h
> > > > @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
> > > >         if (unlikely(midboot)) {
> > > >                 needgpcb = 0x2;
> > > >         } else {
> > > > +               mutex_unlock(&rtp->tasks_gp_mutex);
> > > >                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
> > > >                 rcuwait_wait_event(&rtp->cbs_wait,
> > > >                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > > >                                    TASK_IDLE);
> > > > +               mutex_lock(&rtp->tasks_gp_mutex);
> > > >         }
> > > >
> > > >         if (needgpcb & 0x2) {
> > >
> > > Your preferred fix looks good to me.
> > >
> > > With the original code I can quite easily reproduce the problem on my
> > > system every 10 reboots or so. With your fix in place the problem no
> > > longer occurs.
> >
> > Very good, thank you!  May I add your Tested-by?
> >
> >                                                         Thanx, Paul
> Yes, please do.

Thank you again, and I will apply this on my next rebase.

Thanx, Paul

2023-08-02 17:23:36

by Guenter Roeck

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On 8/2/23 08:05, Paul E. McKenney wrote:
> On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
>> On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
>>> On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
>>>
>>>
>>> Please see below for my preferred fix.  Does this work for you guys?
>>>
>>> Back to figuring out why recent kernels occasionally blow up all
>>> rcutorture guest OSes...
>>>
>>>                                                         Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>>> index 7294be62727b..2d5b8385c357 100644
>>> --- a/kernel/rcu/tasks.h
>>> +++ b/kernel/rcu/tasks.h
>>> @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
>>>         if (unlikely(midboot)) {
>>>                 needgpcb = 0x2;
>>>         } else {
>>> +               mutex_unlock(&rtp->tasks_gp_mutex);
>>>                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>>>                 rcuwait_wait_event(&rtp->cbs_wait,
>>>                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
>>>                                    TASK_IDLE);
>>> +               mutex_lock(&rtp->tasks_gp_mutex);
>>>         }
>>>
>>>         if (needgpcb & 0x2) {
>>
>> Your preferred fix looks good to me.
>>
>> With the original code I can quite easily reproduce the problem on my
>> system every 10 reboots or so. With your fix in place the problem no
>> longer occurs.
>
> Very good, thank you! May I add your Tested-by?
>

FWIW, I am still working on it. So far I get

[ 8.191589] KTAP version 1
[ 8.191769] # Subtest: kunit_executor_test
[ 8.191972] # module: kunit
[ 8.192012] 1..8
[ 8.197643] ok 1 parse_filter_test
[ 8.201851] ok 2 filter_suites_test
[ 8.206713] ok 3 filter_suites_test_glob_test
[ 8.211806] ok 4 filter_suites_to_empty_test
[ 8.214077] kunit executor: filter operation not found: speed>slow, module!=example
[ 8.217933] # parse_filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:126
[ 8.217933] Expected err == 0, but
[ 8.217933] err == -22 (0xffffffffffffffea)
[ 8.217933]
[ 8.217933] failed to parse filter '(efault)'
[ 8.221266] not ok 5 parse_filter_attr_test
[ 8.224224] kunit executor: filter operation not found: speed>slow
[ 8.225837] # filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:165
[ 8.225837] Expected err == 0, but
[ 8.225837] err == -22 (0xffffffffffffffea)
[ 8.228850] not ok 6 filter_attr_test
[ 8.230942] kunit executor: filter operation not found: module!=dummy
[ 8.232167] # filter_attr_empty_test: ASSERTION FAILED at lib/kunit/executor_test.c:190
[ 8.232167] Expected err == 0, but
[ 8.232167] err == -22 (0xffffffffffffffea)
[ 8.235317] not ok 7 filter_attr_empty_test
[ 8.237065] kunit executor: filter operation not found: speed>slow
[ 8.238796] # filter_attr_skip_test: ASSERTION FAILED at lib/kunit/executor_test.c:209
[ 8.238796] Expected err == 0, but
[ 8.238796] err == -22 (0xffffffffffffffea)
[ 8.241897] not ok 8 filter_attr_skip_test
[ 8.241947] # kunit_executor_test: pass:4 fail:4 skip:0 total:8
[ 8.242144] # Totals: pass:4 fail:4 skip:0 total:8

and it looks like the console no longer works. Most likely this is some other problem
that was introduced while tests were broken. It will take me some time to track that down.

Guenter


2023-08-02 17:33:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Wed, Aug 02, 2023 at 08:45:06AM -0700, Guenter Roeck wrote:
> On 8/2/23 08:05, Paul E. McKenney wrote:
> > On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
> > > On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
> > > >
> > > >
> > > > Please see below for my preferred fix.  Does this work for you guys?
> > > >
> > > > Back to figuring out why recent kernels occasionally blow up all
> > > > rcutorture guest OSes...
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > index 7294be62727b..2d5b8385c357 100644
> > > > --- a/kernel/rcu/tasks.h
> > > > +++ b/kernel/rcu/tasks.h
> > > > @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
> > > >         if (unlikely(midboot)) {
> > > >                 needgpcb = 0x2;
> > > >         } else {
> > > > +               mutex_unlock(&rtp->tasks_gp_mutex);
> > > >                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
> > > >                 rcuwait_wait_event(&rtp->cbs_wait,
> > > >                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > > >                                    TASK_IDLE);
> > > > +               mutex_lock(&rtp->tasks_gp_mutex);
> > > >         }
> > > >
> > > >         if (needgpcb & 0x2) {
> > >
> > > Your preferred fix looks good to me.
> > >
> > > With the original code I can quite easily reproduce the problem on my
> > > system every 10 reboots or so. With your fix in place the problem no
> > > longer occurs.
> >
> > Very good, thank you! May I add your Tested-by?
> >
>
> FWIW, I am still working on it. So far I get
>
> [ 8.191589] KTAP version 1
> [ 8.191769] # Subtest: kunit_executor_test
> [ 8.191972] # module: kunit
> [ 8.192012] 1..8
> [ 8.197643] ok 1 parse_filter_test
> [ 8.201851] ok 2 filter_suites_test
> [ 8.206713] ok 3 filter_suites_test_glob_test
> [ 8.211806] ok 4 filter_suites_to_empty_test
> [ 8.214077] kunit executor: filter operation not found: speed>slow, module!=example
> [ 8.217933] # parse_filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:126
> [ 8.217933] Expected err == 0, but
> [ 8.217933] err == -22 (0xffffffffffffffea)
> [ 8.217933]
> [ 8.217933] failed to parse filter '(efault)'
> [ 8.221266] not ok 5 parse_filter_attr_test
> [ 8.224224] kunit executor: filter operation not found: speed>slow
> [ 8.225837] # filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:165
> [ 8.225837] Expected err == 0, but
> [ 8.225837] err == -22 (0xffffffffffffffea)
> [ 8.228850] not ok 6 filter_attr_test
> [ 8.230942] kunit executor: filter operation not found: module!=dummy
> [ 8.232167] # filter_attr_empty_test: ASSERTION FAILED at lib/kunit/executor_test.c:190
> [ 8.232167] Expected err == 0, but
> [ 8.232167] err == -22 (0xffffffffffffffea)
> [ 8.235317] not ok 7 filter_attr_empty_test
> [ 8.237065] kunit executor: filter operation not found: speed>slow
> [ 8.238796] # filter_attr_skip_test: ASSERTION FAILED at lib/kunit/executor_test.c:209
> [ 8.238796] Expected err == 0, but
> [ 8.238796] err == -22 (0xffffffffffffffea)
> [ 8.241897] not ok 8 filter_attr_skip_test
> [ 8.241947] # kunit_executor_test: pass:4 fail:4 skip:0 total:8
> [ 8.242144] # Totals: pass:4 fail:4 skip:0 total:8
>
> and it looks like the console no longer works. Most likely this is some other problem
> that was introduced while tests were broken. It will take me some time to track that down.

No rush.

Given that this bug is a year old, that it happens only when debug
options are enabled, and that it has only been seen in current -next,
my plan is to submit it into the next merge window.

So this one stays mutable for about another 10 days.

On the strength of Roy's Tested-by, however, I will push this patch into
-next soon, so that should make things a bit easier. Or so I hope.

And again, thank you all for tracking this down!

Thanx, Paul

2023-08-02 18:08:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

Two quick comments, both of them "this code is a bit odd" rather than
anything else.

On Tue, 1 Aug 2023 at 12:11, Paul E. McKenney <[email protected]> wrote:
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h

Why is this file called "tasks.h"?

It's not a header file. It makes no sense. It's full of C code. It's
included in only one place. It's just _weird_.

However, more relevantly:

> + mutex_unlock(&rtp->tasks_gp_mutex);
> set_tasks_gp_state(rtp, RTGS_WAIT_CBS);

Isn't the tasks_gp_mutex the thing that protects the gp state here?
Shouldn't it be after setting?

> rcuwait_wait_event(&rtp->cbs_wait,
> (needgpcb = rcu_tasks_need_gpcb(rtp)),
> TASK_IDLE);

Also, looking at rcu_tasks_need_gpcb() that is now called outside the
lock, it does something quite odd.

At the very top, the function does

for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {

and 'smp_load_acquire()' is all about saying "everything *after* this
load is ordered".

But the way it is done in that loop, it is indeed done at the
beginning of the loop, but then it's done *after* the loop too, so the
last smp_load_acquire seems a bit nonsensical.

If you want to load a value and say "this value is now sensible for
everything that follows", I think you should load it *first*. No?

IOW, wouldn't the whole sequence make more sense as

dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
for (cpu = 0; cpu < dequeue_limit; cpu++) {

and say that everything in rcu_tasks_need_gpcb() is ordered wrt the
initial limit on entry?

I dunno. That use of "smp_load_acquire()" just seems odd. Memory
ordering is hard to understand to begin with, but then when you have
things like loops that do the same ordered load multiple times, it
goes from "hard to understand" to positively confusing.

Linus
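
An illustrative aside, not part of the original message: the difference between the two loop shapes can be seen in a userspace C11 sketch. The names dequeue_lim, load_limit() and the two scan functions are hypothetical stand-ins, and atomic_load_explicit() with memory_order_acquire is used here as a rough userspace analogue of smp_load_acquire(). The acquire_loads counter exists only so the number of ordered loads can be observed.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for rtp->percpu_dequeue_lim. */
static _Atomic int dequeue_lim = 4;

/* Counts acquire loads; for illustration only, single-threaded use. */
static int acquire_loads;

static int load_limit(void)
{
	acquire_loads++;
	return atomic_load_explicit(&dequeue_lim, memory_order_acquire);
}

/*
 * Acquire load in the loop condition: it runs before every iteration
 * and once more for the final, failing comparison after the last one.
 */
static int scan_reload_each_iteration(void)
{
	int visited = 0;

	for (int cpu = 0; cpu < load_limit(); cpu++)
		visited++;
	return visited;
}

/*
 * Acquire load hoisted out of the loop: one ordered load on entry,
 * and the loop body sees a single, stable limit.
 */
static int scan_snapshot_once(void)
{
	int visited = 0;
	int limit = load_limit();

	for (int cpu = 0; cpu < limit; cpu++)
		visited++;
	return visited;
}
```

With a limit of 4 and no concurrent writer, both functions visit four entries, but the first performs five acquire loads and the second performs one. The hoisted form also makes it explicit which single load everything that follows is ordered against.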

2023-08-02 18:56:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

On Wed, Aug 02, 2023 at 10:14:51AM -0700, Linus Torvalds wrote:
> Two quick comments, both of them "this code is a bit odd" rather than
> anything else.

Good to get eyes on this code, so thank you very much!!!

> On Tue, 1 Aug 2023 at 12:11, Paul E. McKenney <[email protected]> wrote:
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>
> Why is this file called "tasks.h"?
>
> It's not a header file. It makes no sense. It's full of C code. It's
> included in only one place. It's just _weird_.

You are right, it is weird.

This is a holdover from when I was much more concerned about being
criticized for having #ifdef in a .c file, and pretty much every
line in this file is under some combination or another of #ifdefs.
This concern led to kernel/rcu/tree_plugin.h being set up in this way
back when preemptible RCU was introduced, and for good or for bad I just
kept following that pattern.

We could convert this to a .c file, keep the #ifdefs, drop some instances
of "static", add a bunch of declarations, and maybe (or maybe not) push a
function or two into some .h file for performance/inlining reasons. Me, I
would prefer to leave it alone, but we can certainly change it.

> However, more relevantly:
>
> > + mutex_unlock(&rtp->tasks_gp_mutex);
> > set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>
> Isn't the tasks_gp_mutex the thing that protects the gp state here?
> Shouldn't it be after setting?

Much of the gp state is protected by being accessed only by the gp
kthread. But there is a window in time where the gp might be driven
directly out of the synchronize_rcu_tasks() call. That window in time
does not have a definite end, so this ->tasks_gp_mutex does the needed
mutual exclusion during the transition of gp processing to the newly
created gp kthread.

> > rcuwait_wait_event(&rtp->cbs_wait,
> > (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > TASK_IDLE);
>
> Also, looking at rcu_tasks_need_gpcb() that is now called outside the
> lock, it does something quite odd.

The state of each callback list is protected by the ->lock field of
the rcu_tasks_percpu structure. Yes, rcu_segcblist_n_cbs() is invoked
in rcu_tasks_need_gpcb() outside of the lock, but it is designed for
lockless use. If it is modified just after the check, then there will
be a later wakeup on the one hand or we will just uselessly acquire that
->lock this one time on the other.

Also, ncbs records the number of callbacks seen in that first loop,
then used later, where its value might be stale. This might result in
a collapse back to single-callback-queue operation and a later expansion
back up. Except that at this point we are still in single-CPU mode, so
there should not be any lock contention, which means that there should
still be but a single callback queue. The transition itself is protected
by ->cbs_gbl_lock.

> At the very top, the function does
>
> for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {
>
> and 'smp_load_acquire()' is all about saying "everything *after* this
> load is ordered".
>
> But the way it is done in that loop, it is indeed done at the
> beginning of the loop, but then it's done *after* the loop too, so the
> last smp_load_acquire seems a bit nonsensical.
>
> If you want to load a value and say "this value is now sensible for
> everything that follows", I think you should load it *first*. No?
>
> IOW, wouldn't the whole sequence make more sense as
>
> dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
> for (cpu = 0; cpu < dequeue_limit; cpu++) {
>
> and say that everything in rcu_tasks_need_gpcb() is ordered wrt the
> initial limit on entry?
>
> I dunno. That use of "smp_load_acquire()" just seems odd. Memory
> ordering is hard to understand to begin with, but then when you have
> things like loops that do the same ordered load multiple times, it
> goes from "hard to understand" to positively confusing.

Excellent point. I am queueing that change with your Suggested-by.
If testing goes well, it will be as shown below.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 83049a893de5..94bb5abdbb37 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -432,6 +432,7 @@ static void rcu_barrier_tasks_generic(struct rcu_tasks *rtp)
 static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
 {
        int cpu;
+       int dequeue_limit;
        unsigned long flags;
        bool gpdone = poll_state_synchronize_rcu(rtp->percpu_dequeue_gpseq);
        long n;
@@ -439,7 +440,8 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
        long ncbsnz = 0;
        int needgpcb = 0;

-       for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {
+       dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
+       for (cpu = 0; cpu < dequeue_limit; cpu++) {
                struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);

                /* Advance and accelerate any new callbacks. */