We're currently trying to update the kernel in our distro (Flatcar) from
v5.15.x to v6.1.x. When testing on Equinix Metal Ampere instances
(c3.large.arm64) we now get a soft lockup about a minute after boot.
Has anyone else seen this? The splat looks like this, full dmesg is at [1]
(trying without the attachement this time as LKML detects my mail as spam :/)
[ 84.297829] watchdog: BUG: soft lockup - CPU#45 stuck for 26s! [kworker/45:1:474]
[ 84.297834] Modules linked in: veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink bonding mlx5_ib ipmi_ssif ipmi_devintf ib_core ipmi_msghandler evdev sch_fq_codel fuse configfs dmi_sysfs nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 dm_verity dm_bufio nvme nvme_core xhci_pci t10_pi xhci_hcd crc64_rocksoft_generic crc64_rocksoft igb crc_t10dif i2c_algo_bit crct10dif_generic mlx5_core usbcore i2c_core crc64 crct10dif_common usb_common hwmon pci_hyperv_intf btrfs blake2b_generic xor xor_neon lzo_compress zlib_deflate raid6_pq zstd_compress libcrc32c crc32c_generic dm_mirror dm_region_hash dm_log qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath dm_mod scsi_mod scsi_common br_netfilter bridge stp llc overlay
[ 84.435381] CPU: 45 PID: 474 Comm: kworker/45:1 Not tainted 6.1.30-flatcar #1
[ 84.447678] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F17a (SCP: 1.07.20210713) 07/22/2021
[ 84.467226] Workqueue: rcu_par_gp sync_rcu_exp_select_node_cpus
[ 84.478239] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 84.490336] pc : smp_call_function_single+0xe4/0x1d0
[ 84.500287] lr : __sync_rcu_exp_select_node_cpus+0x278/0x480
[ 84.510869] sp : ffff80000b99bcc0
[ 84.518979] x29: ffff80000b99bcc0 x28: 0000000000000200 x27: ffffab5be3134080
[ 84.530897] x26: ffffab5be302d6a8 x25: ffff07ff89f01f40 x24: ffff083e5fba8540
[ 84.542751] x23: ffffab5be2e43ab0 x22: 0000000000000a00 x21: ffffab5be2e47540
[ 84.554510] x20: 000000000000f5ff x19: ffff80000b99bce0 x18: ffff80002be3bc58
[ 84.566213] x17: 0000000000000000 x16: ffffab5be06cdfc0 x15: 0000000000000001
[ 84.577841] x14: 0000000000000000 x13: 0000000000000010 x12: ffffab5be2e43ab0
[ 84.589361] x11: 0000000000000001 x10: 0000000000000040 x9 : ffffab5be312c270
[ 84.600832] x8 : ffffab5be2e34008 x7 : 0000000000028fb0 x6 : ffffab5bdfd0ef20
[ 84.612285] x5 : 0000000000000000 x4 : ffff083e5fc1c748 x3 : 0000000000000001
[ 84.623812] x2 : 0000000000000000 x1 : ffff083e5fc1c740 x0 : 0000000000000029
[ 84.635277] Call trace:
[ 84.641994] smp_call_function_single+0xe4/0x1d0
[ 84.650957] __sync_rcu_exp_select_node_cpus+0x278/0x480
[ 84.660664] sync_rcu_exp_select_node_cpus+0x14/0x20
[ 84.669988] process_one_work+0x214/0x490
[ 84.678313] worker_thread+0x6c/0x430
[ 84.686206] kthread+0x108/0x10c
[ 84.693668] ret_from_fork+0x10/0x20
[ 84.701417] Kernel panic - not syncing: softlockup: hung tasks
[ 84.711437] CPU: 45 PID: 474 Comm: kworker/45:1 Tainted: G L 6.1.30-flatcar #1
[ 84.724261] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F17a (SCP: 1.07.20210713) 07/22/2021
[ 84.741870] Workqueue: rcu_par_gp sync_rcu_exp_select_node_cpus
[ 84.751900] Call trace:
[ 84.758325] dump_backtrace+0xe0/0x140
[ 84.765958] show_stack+0x18/0x30
[ 84.773057] dump_stack_lvl+0x64/0x80
[ 84.780384] dump_stack+0x18/0x34
[ 84.787220] panic+0x180/0x358
[ 84.793698] watchdog_nmi_enable+0x0/0x10
[ 84.801040] __hrtimer_run_queues+0x17c/0x340
[ 84.808624] hrtimer_interrupt+0xe8/0x244
[ 84.815807] arch_timer_handler_phys+0x2c/0x44
[ 84.823388] handle_percpu_devid_irq+0x88/0x230
[ 84.830995] generic_handle_domain_irq+0x2c/0x44
[ 84.838682] gic_handle_irq+0x50/0x140
[ 84.845468] call_on_irq_stack+0x24/0x4c
[ 84.852402] do_interrupt_handler+0x80/0x84
[ 84.859569] el1_interrupt+0x34/0x6c
[ 84.866100] el1h_64_irq_handler+0x18/0x2c
[ 84.873134] el1h_64_irq+0x64/0x68
[ 84.879415] smp_call_function_single+0xe4/0x1d0
[ 84.886933] __sync_rcu_exp_select_node_cpus+0x278/0x480
[ 84.895205] sync_rcu_exp_select_node_cpus+0x14/0x20
[ 84.903118] process_one_work+0x214/0x490
[ 84.910040] worker_thread+0x6c/0x430
[ 84.916645] kthread+0x108/0x10c
[ 84.922733] ret_from_fork+0x10/0x20
Thanks,
Jeremi
[1]: https://gist.githubusercontent.com/dongsupark/3ab5fc464a995623a14542edd9b193ac/raw/db7c0216dcb260603a4ec0052015f86d82d4de30/kernel-6.1-softlockup-EM-arm64.txt
On Tue, Jun 06, 2023 at 07:47:52AM -0700, Jeremi Piotrowski wrote:
> We're currently trying to update the kernel in our distro (Flatcar) from
> v5.15.x to v6.1.x. When testing on Equinix Metal Ampere instances
> (c3.large.arm64) we now get a soft lockup about a minute after boot.
>
> Has anyone else seen this? The splat looks like this, full dmesg is at [1]
> (trying without the attachement this time as LKML detects my mail as spam :/)
You might have better luck asking on the ARM lists, the stable@ list is
for patches to add or other fixes, not generic failures like this one.
Also, why not take a working 6.1.x kernel for this platform (which I
know they are out there) and comparing it to yours?
Good luck!
greg k-h