Hi,
I notice a regression report on Bugzilla [1]. Quoting from it:
> after updating from 6.2.x to 6.3.x, vmalloc error messages started to appear in the dmesg
>
>
>
> # free
> total used free shared buff/cache available
> Mem: 16183724 1473068 205664 33472 14504992 14335700
> Swap: 16777212 703596 16073616
>
>
> (zswap enabled)
See bugzilla for the full thread and attached dmesg.
On the report, the reporter can't perform the required bisection,
unfortunately.
Anyway, I'm adding it to regzbot:
#regzbot introduced: v6.2..v6.3 https://bugzilla.kernel.org/show_bug.cgi?id=217466
#regzbot title: btrfs_work_helper dealloc error in v6.3.x
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217466
--
An old man doll... just what I always wanted! - Clara
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> > after updating from 6.2.x to 6.3.x, vmalloc error messages started to appear in the dmesg
> >
> >
> >
> > # free
> > total used free shared buff/cache available
> > Mem: 16183724 1473068 205664 33472 14504992 14335700
> > Swap: 16777212 703596 16073616
> >
> >
> > (zswap enabled)
>
> See bugzilla for the full thread and attached dmesg.
>
> On the report, the reporter can't perform the required bisection,
> unfortunately.
>
> Anyway, I'm adding it to regzbot:
>
> #regzbot introduced: v6.2..v6.3 https://bugzilla.kernel.org/show_bug.cgi?id=217466
> #regzbot title: btrfs_work_helper dealloc error in v6.3.x
>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217466
>
According to dmesg output from the bugzilla, the vmalloc tries to
allocate high order pages: 1 << 9. Since it fails to get a order-9 page
you get the warning:
<snip>
if (area->nr_pages != nr_small_pages) {
/* vm_area_alloc_pages() can also fail due to a fatal signal */
if (!fatal_signal_pending(current))
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, page order %u, failed to allocate pages",
area->nr_pages * PAGE_SIZE, page_order);
goto fail;
}
<snip>
and it fails.
If the __GFP_NOFAIL is passed, the vm_area_alloc_pages() function switches
to allocate 0-order pages instead. I think the fix is to call the
kvmalloc_node() with __GFP_NOFAIL flag.
--
Uladzislau Rezki
On Mon, May 22, 2023 at 06:00:42PM +0200, Uladzislau Rezki wrote:
> > Hi,
> >
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> >
> > > after updating from 6.2.x to 6.3.x, vmalloc error messages started to appear in the dmesg
> > >
> > >
> > >
> > > # free
> > > total used free shared buff/cache available
> > > Mem: 16183724 1473068 205664 33472 14504992 14335700
> > > Swap: 16777212 703596 16073616
> > >
> > >
> > > (zswap enabled)
> >
> > See bugzilla for the full thread and attached dmesg.
> >
> > On the report, the reporter can't perform the required bisection,
> > unfortunately.
> >
> > Anyway, I'm adding it to regzbot:
> >
> > #regzbot introduced: v6.2..v6.3 https://bugzilla.kernel.org/show_bug.cgi?id=217466
> > #regzbot title: btrfs_work_helper dealloc error in v6.3.x
> >
> > Thanks.
> >
> > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217466
> >
> According to dmesg output from the bugzilla, the vmalloc tries to
> allocate high order pages: 1 << 9. Since it fails to get a order-9 page
> you get the warning:
That we want a order 9 is intentional, it's for a compression workspace
(bugzilla comment 5). It's allocated as kvzalloc i.e. with the fallback
to vmalloc in case the first one fails.
> <snip>
> if (area->nr_pages != nr_small_pages) {
> /* vm_area_alloc_pages() can also fail due to a fatal signal */
> if (!fatal_signal_pending(current))
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, page order %u, failed to allocate pages",
> area->nr_pages * PAGE_SIZE, page_order);
> goto fail;
> }
> <snip>
>
> and it fails.
>
> If the __GFP_NOFAIL is passed, the vm_area_alloc_pages() function switches
> to allocate 0-order pages instead. I think the fix is to call the
> kvmalloc_node() with __GFP_NOFAIL flag.
__GFP_NOFAIL does not make sense here and we've tried hard not to used
it anywhere because of the deadlocky effects. Did you mean __GFP_NOWARN?
That's a patch I sent today but there's another comment in the bugzilla
that we got more allocation warnings for huge (2M) allocations, this
time it was for a deduplication ioctl.
This seems to be a noticeable change in 6.3, before we disable the
warning in our code I think the MM guys could have a look. So far it
seems we're about to paper of a problem.
---- From: Bagas Sanjaya <[email protected]> -- Sent: 2023-05-22 - 15:25 ----
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
>> after updating from 6.2.x to 6.3.x, vmalloc error messages started to appear in the dmesg
>>
>>
>>
>> # free
>> total used free shared buff/cache available
>> Mem: 16183724 1473068 205664 33472 14504992 14335700
>> Swap: 16777212 703596 16073616
>>
>>
>> (zswap enabled)
>
> See bugzilla for the full thread and attached dmesg.
>
> On the report, the reporter can't perform the required bisection,
> unfortunately.
>
> Anyway, I'm adding it to regzbot:
>
> #regzbot introduced: v6.2..v6.3 https://bugzilla.kernel.org/show_bug.cgi?id=217466
> #regzbot title: btrfs_work_helper dealloc error in v6.3.x
>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217466
>
I have a similar experience with kernel 6.3 where vmalloc fails in a similar way. I was able to reproduce it in a QEMU VM as well as on my system.
https://lore.kernel.org/all/[email protected]/T/
On Mon, May 22, 2023 at 09:04:05PM +0200, Forza wrote:
> I have a similar experience with kernel 6.3 where vmalloc fails in a similar way. I was able to reproduce it in a QEMU VM as well as on my system.
>
> https://lore.kernel.org/all/[email protected]/T/
Thanks for your similar report. Telling regzbot about it:
#regzbot link: https://lore.kernel.org/all/[email protected]/
--
An old man doll... just what I always wanted! - Clara
On Tue, May 23, 2023 at 08:52:21AM +0700, Bagas Sanjaya wrote:
> On Mon, May 22, 2023 at 09:04:05PM +0200, Forza wrote:
> > I have a similar experience with kernel 6.3 where vmalloc fails in a similar way. I was able to reproduce it in a QEMU VM as well as on my system.
> >
> > https://lore.kernel.org/all/[email protected]/T/
>
> Thanks for your similar report. Telling regzbot about it:
>
> #regzbot link: https://lore.kernel.org/all/[email protected]/
>
It is good that you can reproduce it. Could you please test below patch?
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 31ff782d368b..7a06452f7807 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2957,14 +2957,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
page = alloc_pages(alloc_gfp, order);
else
page = alloc_pages_node(nid, alloc_gfp, order);
+
if (unlikely(!page)) {
- if (!nofail)
- break;
+ if (nofail)
+ alloc_gfp |= __GFP_NOFAIL;
- /* fall back to the zero order allocations */
- alloc_gfp |= __GFP_NOFAIL;
- order = 0;
- continue;
+ /* Fall back to the zero order allocations. */
+ if (order || nofail) {
+ order = 0;
+ continue;
+ }
+
+ break;
}
/*
<snip>
Thanks!
--
Uladzislau Rezki
---- From: Uladzislau Rezki <[email protected]> -- Sent: 2023-05-23 - 12:28 ----
> On Tue, May 23, 2023 at 08:52:21AM +0700, Bagas Sanjaya wrote:
>> On Mon, May 22, 2023 at 09:04:05PM +0200, Forza wrote:
>> > I have a similar experience with kernel 6.3 where vmalloc fails in a similar way. I was able to reproduce it in a QEMU VM as well as on my system.
>> >
>> > https://lore.kernel.org/all/[email protected]/T/
>>
>> Thanks for your similar report. Telling regzbot about it:
>>
>> #regzbot link: https://lore.kernel.org/all/[email protected]/
>>
> It is good that you can reproduce it. Could you please test below patch?
Yes, applied it to my test VM and will let it run over night to see how it turns out. I'll post again tomorrow.
Thanks.
>
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d368b..7a06452f7807 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2957,14 +2957,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> page = alloc_pages(alloc_gfp, order);
> else
> page = alloc_pages_node(nid, alloc_gfp, order);
> +
> if (unlikely(!page)) {
> - if (!nofail)
> - break;
> + if (nofail)
> + alloc_gfp |= __GFP_NOFAIL;
>
> - /* fall back to the zero order allocations */
> - alloc_gfp |= __GFP_NOFAIL;
> - order = 0;
> - continue;
> + /* Fall back to the zero order allocations. */
> + if (order || nofail) {
> + order = 0;
> + continue;
> + }
> +
> + break;
> }
>
> /*
> <snip>
>
> Thanks!
>
> --
> Uladzislau Rezki
---- From: Uladzislau Rezki <[email protected]> -- Sent: 2023-05-23 - 12:28 ----
> On Tue, May 23, 2023 at 08:52:21AM +0700, Bagas Sanjaya wrote:
>> On Mon, May 22, 2023 at 09:04:05PM +0200, Forza wrote:
>> > I have a similar experience with kernel 6.3 where vmalloc fails in a similar way. I was able to reproduce it in a QEMU VM as well as on my system.
>> >
>> > https://lore.kernel.org/all/[email protected]/T/
>>
>> Thanks for your similar report. Telling regzbot about it:
>>
>> #regzbot link: https://lore.kernel.org/all/[email protected]/
>>
> It is good that you can reproduce it. Could you please test below patch?
>
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d368b..7a06452f7807 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2957,14 +2957,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> page = alloc_pages(alloc_gfp, order);
> else
> page = alloc_pages_node(nid, alloc_gfp, order);
> +
> if (unlikely(!page)) {
> - if (!nofail)
> - break;
> + if (nofail)
> + alloc_gfp |= __GFP_NOFAIL;
>
> - /* fall back to the zero order allocations */
> - alloc_gfp |= __GFP_NOFAIL;
> - order = 0;
> - continue;
> + /* Fall back to the zero order allocations. */
> + if (order || nofail) {
> + order = 0;
> + continue;
> + }
> +
> + break;
> }
>
> /*
> <snip>
>
> Thanks!
>
> --
> Uladzislau Rezki
There is a different result now that I have not seen before. The full dmesg is available at https://paste.tnonline.net/files/pnnW6gYASxWX_dmesg-mm-patch.txt
[ 8.641506] 8021q: adding VLAN 0 to HW filter on device enp4s0
[ 13.841691] wireguard: WireGuard 1.0.0 loaded. See http://www.wireguard.com for information.
[ 13.841705] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
[13917.280527] ------------[ cut here ]------------
[13917.280753] default_enter_idle leaked IRQ state
[13917.281004] WARNING: CPU: 3 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x3bb/0x430
[13917.281046] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
[13917.281140] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.3.1-gentoo-mm-patched #4
[13917.281150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[13917.281154] RIP: 0010:cpuidle_enter_state+0x3bb/0x430
[13917.281162] Code: 0f 1f 44 00 00 e9 a7 fd ff ff 80 3d 3a 3b d7 00 00 75 19 49 8b 75 50 48 c7 c7 ab b6 79 ac c6 05 26 3b d7 00 01 e8 a5 c4 20 ff <0f> 0b fa 0f 1f 44 00 00 e9 ca fc ff ff 83 c0 01 48 83 c1 40 39 f8
[13917.281176] RSP: 0018:ffffa153c00b7ea0 EFLAGS: 00010286
[13917.281182] RAX: ffff8c15ebfafa28 RBX: ffffc153bfd80900 RCX: 000000000000083f
[13917.281186] RDX: 000000000118feed RSI: 00000000000000f6 RDI: 000000000000083f
[13917.281189] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffa153c00b7d60
[13917.281193] R10: 0000000000000003 R11: ffffffffacb399e8 R12: ffffffffacc2e320
[13917.281196] R13: ffffffffacc2e3a0 R14: 0000000000000001 R15: 0000000000000000
[13917.281202] FS: 0000000000000000(0000) GS:ffff8c15ebf80000(0000) knlGS:0000000000000000
[13917.281206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13917.281210] CR2: 00007f71840b39c8 CR3: 0000000102998000 CR4: 00000000003506e0
[13917.281217] Call Trace:
[13917.281221] <TASK>
[13917.281228] cpuidle_enter+0x29/0x40
[13917.281244] do_idle+0x19b/0x200
[13917.281292] cpu_startup_entry+0x19/0x20
[13917.281297] start_secondary+0x101/0x120
[13917.281324] secondary_startup_64_no_verify+0xe5/0xeb
[13917.281343] </TASK>
[13917.281346] ---[ end trace 0000000000000000 ]---
[17206.750165] BTRFS info (device vdb): using xxhash64 (xxhash64-generic) checksum algorithm
[17206.750190] BTRFS info (device vdb): using free space tree
[17206.904010] BTRFS info (device vdb): auto enabling async discard
[17206.933302] BTRFS info (device vdb): checking UUID tree
[17344.541839] sched: RT throttling activated
[18284.216538] hrtimer: interrupt took 23434934 ns
[18737.100477] BUG: unable to handle page fault for address: 0000000079e0afc0
[18737.100883] #PF: supervisor read access in kernel mode
[18737.101155] #PF: error_code(0x0000) - not-present page
[18737.101462] PGD 0 P4D 0
[18737.101715] Oops: 0000 [#1] PREEMPT SMP NOPTI
[18737.101968] CPU: 1 PID: 25287 Comm: kworker/u8:7 Tainted: G W 6.3.1-gentoo-mm-patched #4
[18737.102391] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[18737.102860] Workqueue: btrfs-delalloc btrfs_work_helper
[18737.103346] RIP: 0010:find_free_extent+0x20a/0x15c0
[18737.103900] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
[18737.104851] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
[18737.105456] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
[18737.106044] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
[18737.106519] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
[18737.107036] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
[18737.107363] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
[18737.107676] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
[18737.107971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18737.108260] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
[18737.108606] Call Trace:
[18737.108964] <TASK>
[18737.109273] btrfs_reserve_extent+0x148/0x260
[18737.109601] submit_compressed_extents+0x14f/0x490
[18737.109934] async_cow_submit+0x37/0x90
[18737.110237] btrfs_work_helper+0x13d/0x360
[18737.110542] process_one_work+0x20f/0x410
[18737.110883] worker_thread+0x4a/0x3b0
[18737.111185] ? __pfx_worker_thread+0x10/0x10
[18737.111482] kthread+0xda/0x100
[18737.111800] ? __pfx_kthread+0x10/0x10
[18737.112097] ret_from_fork+0x2c/0x50
[18737.112387] </TASK>
[18737.112676] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
[18737.114021] CR2: 0000000079e0afc0
[18737.114366] ---[ end trace 0000000000000000 ]---
[18737.114712] RIP: 0010:find_free_extent+0x20a/0x15c0
[18737.115059] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
[18737.115864] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
[18737.116415] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
[18737.117090] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
[18737.117882] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
[18737.118611] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
[18737.119416] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
[18737.120221] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
[18737.120994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18737.121868] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
[18737.122624] note: kworker/u8:7[25287] exited with irqs disabled
[19006.920558] BUG: unable to handle page fault for address: 0000000079e0afc0
[19006.922015] #PF: supervisor read access in kernel mode
[19006.923354] #PF: error_code(0x0000) - not-present page
[19006.924636] PGD 0 P4D 0
[19006.925868] Oops: 0000 [#2] PREEMPT SMP NOPTI
[19006.927066] CPU: 0 PID: 24329 Comm: crawl_writeback Tainted: G D W 6.3.1-gentoo-mm-patched #4
[19006.928510] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[19006.929817] RIP: 0010:find_free_extent+0x20a/0x15c0
[19006.931050] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
[19006.933653] RSP: 0018:ffffa153c0d0f568 EFLAGS: 00010203
[19006.934972] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
[19006.936236] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
[19006.937480] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
[19006.938750] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0d0f757
[19006.939986] R13: ffffa153c0d0f628 R14: 0000000000000001 R15: 0000000079e0af10
[19006.941255] FS: 00007fb245ffb6c0(0000) GS:ffff8c15ebe00000(0000) knlGS:0000000000000000
[19006.942579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19006.943830] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506f0
[19006.945278] Call Trace:
[19006.946730] <TASK>
[19006.947792] ? release_pages+0x13e/0x490
[19006.948741] btrfs_reserve_extent+0x148/0x260
[19006.949637] cow_file_range+0x199/0x610
[19006.950396] btrfs_run_delalloc_range+0x103/0x520
[19006.951135] ? find_lock_delalloc_range+0x1ea/0x210
[19006.951802] writepage_delalloc+0xb9/0x180
[19006.952401] __extent_writepage+0xeb/0x410
[19006.952985] extent_write_cache_pages+0x152/0x3d0
[19006.953552] extent_writepages+0x4c/0x100
[19006.954116] do_writepages+0xbe/0x1d0
[19006.954672] ? memcmp_extent_buffer+0xa2/0xe0
[19006.955199] filemap_fdatawrite_wbc+0x5f/0x80
[19006.955726] __filemap_fdatawrite_range+0x4a/0x60
[19006.956219] btrfs_rename+0x529/0xb60
[19006.956711] ? psi_group_change+0x168/0x400
[19006.957280] btrfs_rename2+0x2a/0x60
[19006.957799] vfs_rename+0x5d4/0xeb0
[19006.958308] ? lookup_dcache+0x17/0x60
[19006.958784] ? do_renameat2+0x507/0x580
[19006.959239] do_renameat2+0x507/0x580
[19006.959702] __x64_sys_renameat+0x45/0x60
[19006.960293] do_syscall_64+0x5b/0xc0
[19006.960848] ? syscall_exit_to_user_mode+0x17/0x40
[19006.961331] ? do_syscall_64+0x67/0xc0
[19006.961812] ? syscall_exit_to_user_mode+0x17/0x40
[19006.962401] ? do_syscall_64+0x67/0xc0
[19006.963371] ? do_syscall_64+0x67/0xc0
[19006.964020] ? do_syscall_64+0x67/0xc0
[19006.965001] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[19006.965952] RIP: 0033:0x7fb25eba492a
[19006.966485] Code: 48 8b 15 d9 44 17 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 a1 44 17 00 f7
[19006.967545] RSP: 002b:00007fb245ff8a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
[19006.968076] RAX: ffffffffffffffda RBX: 0000559a70a039f0 RCX: 00007fb25eba492a
[19006.968623] RDX: 0000000000000004 RSI: 00007fb134000fc0 RDI: 0000000000000004
[19006.977319] RBP: 00007fb245ff8c60 R08: 0000000000000000 R09: 0000000000000000
[19006.977877] R10: 0000559a70a03a00 R11: 0000000000000246 R12: 00007fb245ff8c80
[19006.978301] R13: 0000000000000004 R14: 00007fb245ff8c60 R15: 00000000000070b5
[19006.978727] </TASK>
[19006.979118] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
[19006.981463] CR2: 0000000079e0afc0
[19006.982193] ---[ end trace 0000000000000000 ]---
[19006.982938] RIP: 0010:find_free_extent+0x20a/0x15c0
[19006.983565] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
[19006.984863] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
[19006.985500] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
[19006.986195] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
[19006.986877] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
[19006.987581] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
[19006.988252] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
[19006.988984] FS: 00007fb245ffb6c0(0000) GS:ffff8c15ebe00000(0000) knlGS:0000000000000000
[19006.989646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19006.990336] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506f0
[19006.991037] note: crawl_writeback[24329] exited with irqs disabled
This looks like a different set of problems, though all of them seem to
start on the compression write path in btrfs.
On Wed, May 24, 2023 at 07:57:19AM +0200, Forza wrote:
> [ 8.641506] 8021q: adding VLAN 0 to HW filter on device enp4s0
> [ 13.841691] wireguard: WireGuard 1.0.0 loaded. See http://www.wireguard.com for information.
> [ 13.841705] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> [13917.280527] ------------[ cut here ]------------
> [13917.280753] default_enter_idle leaked IRQ state
> [13917.281004] WARNING: CPU: 3 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x3bb/0x430
Warning in cpuilde
> [13917.281046] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
> [13917.281140] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.3.1-gentoo-mm-patched #4
> [13917.281150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> [13917.281154] RIP: 0010:cpuidle_enter_state+0x3bb/0x430
> [13917.281176] RSP: 0018:ffffa153c00b7ea0 EFLAGS: 00010286
> [13917.281182] RAX: ffff8c15ebfafa28 RBX: ffffc153bfd80900 RCX: 000000000000083f
> [13917.281186] RDX: 000000000118feed RSI: 00000000000000f6 RDI: 000000000000083f
> [13917.281189] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffa153c00b7d60
> [13917.281193] R10: 0000000000000003 R11: ffffffffacb399e8 R12: ffffffffacc2e320
> [13917.281196] R13: ffffffffacc2e3a0 R14: 0000000000000001 R15: 0000000000000000
> [13917.281202] FS: 0000000000000000(0000) GS:ffff8c15ebf80000(0000) knlGS:0000000000000000
> [13917.281206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [13917.281210] CR2: 00007f71840b39c8 CR3: 0000000102998000 CR4: 00000000003506e0
> [13917.281217] Call Trace:
> [13917.281221] <TASK>
> [13917.281228] cpuidle_enter+0x29/0x40
> [13917.281244] do_idle+0x19b/0x200
> [13917.281292] cpu_startup_entry+0x19/0x20
> [13917.281297] start_secondary+0x101/0x120
> [13917.281324] secondary_startup_64_no_verify+0xe5/0xeb
> [13917.281343] </TASK>
> [13917.281346] ---[ end trace 0000000000000000 ]---
> [17206.750165] BTRFS info (device vdb): using xxhash64 (xxhash64-generic) checksum algorithm
> [17206.750190] BTRFS info (device vdb): using free space tree
> [17206.904010] BTRFS info (device vdb): auto enabling async discard
> [17206.933302] BTRFS info (device vdb): checking UUID tree
> [17344.541839] sched: RT throttling activated
> [18284.216538] hrtimer: interrupt took 23434934 ns
> [18737.100477] BUG: unable to handle page fault for address: 0000000079e0afc0
BUG
> [18737.100883] #PF: supervisor read access in kernel mode
> [18737.101155] #PF: error_code(0x0000) - not-present page
> [18737.101462] PGD 0 P4D 0
> [18737.101715] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [18737.101968] CPU: 1 PID: 25287 Comm: kworker/u8:7 Tainted: G W 6.3.1-gentoo-mm-patched #4
> [18737.102391] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> [18737.102860] Workqueue: btrfs-delalloc btrfs_work_helper
> [18737.103346] RIP: 0010:find_free_extent+0x20a/0x15c0
> [18737.103900] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> [18737.104851] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
> [18737.105456] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> [18737.106044] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> [18737.106519] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> [18737.107036] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
> [18737.107363] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
> [18737.107676] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
> [18737.107971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [18737.108260] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
> [18737.108606] Call Trace:
> [18737.108964] <TASK>
> [18737.109273] btrfs_reserve_extent+0x148/0x260
> [18737.109601] submit_compressed_extents+0x14f/0x490
> [18737.109934] async_cow_submit+0x37/0x90
> [18737.110237] btrfs_work_helper+0x13d/0x360
> [18737.110542] process_one_work+0x20f/0x410
> [18737.110883] worker_thread+0x4a/0x3b0
> [18737.111185] ? __pfx_worker_thread+0x10/0x10
> [18737.111482] kthread+0xda/0x100
> [18737.111800] ? __pfx_kthread+0x10/0x10
> [18737.112097] ret_from_fork+0x2c/0x50
> [18737.112387] </TASK>
> [18737.112676] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
> [18737.114021] CR2: 0000000079e0afc0
> [18737.114366] ---[ end trace 0000000000000000 ]---
> [18737.114712] RIP: 0010:find_free_extent+0x20a/0x15c0
> [18737.115059] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> [18737.115864] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
> [18737.116415] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> [18737.117090] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> [18737.117882] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> [18737.118611] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
> [18737.119416] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
> [18737.120221] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
> [18737.120994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [18737.121868] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
> [18737.122624] note: kworker/u8:7[25287] exited with irqs disabled
> [19006.920558] BUG: unable to handle page fault for address: 0000000079e0afc0
And again, so something is going wrong
> [19006.922015] #PF: supervisor read access in kernel mode
> [19006.923354] #PF: error_code(0x0000) - not-present page
> [19006.924636] PGD 0 P4D 0
> [19006.925868] Oops: 0000 [#2] PREEMPT SMP NOPTI
> [19006.927066] CPU: 0 PID: 24329 Comm: crawl_writeback Tainted: G D W 6.3.1-gentoo-mm-patched #4
> [19006.928510] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> [19006.929817] RIP: 0010:find_free_extent+0x20a/0x15c0
> [19006.931050] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> [19006.933653] RSP: 0018:ffffa153c0d0f568 EFLAGS: 00010203
> [19006.934972] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> [19006.936236] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> [19006.937480] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> [19006.938750] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0d0f757
> [19006.939986] R13: ffffa153c0d0f628 R14: 0000000000000001 R15: 0000000079e0af10
> [19006.941255] FS: 00007fb245ffb6c0(0000) GS:ffff8c15ebe00000(0000) knlGS:0000000000000000
> [19006.942579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [19006.943830] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506f0
> [19006.945278] Call Trace:
> [19006.946730] <TASK>
> [19006.947792] ? release_pages+0x13e/0x490
> [19006.948741] btrfs_reserve_extent+0x148/0x260
> [19006.949637] cow_file_range+0x199/0x610
> [19006.950396] btrfs_run_delalloc_range+0x103/0x520
> [19006.951135] ? find_lock_delalloc_range+0x1ea/0x210
> [19006.951802] writepage_delalloc+0xb9/0x180
> [19006.952401] __extent_writepage+0xeb/0x410
> [19006.952985] extent_write_cache_pages+0x152/0x3d0
> [19006.953552] extent_writepages+0x4c/0x100
> [19006.954116] do_writepages+0xbe/0x1d0
> [19006.954672] ? memcmp_extent_buffer+0xa2/0xe0
> [19006.955199] filemap_fdatawrite_wbc+0x5f/0x80
> [19006.955726] __filemap_fdatawrite_range+0x4a/0x60
> [19006.956219] btrfs_rename+0x529/0xb60
> [19006.956711] ? psi_group_change+0x168/0x400
> [19006.957280] btrfs_rename2+0x2a/0x60
> [19006.957799] vfs_rename+0x5d4/0xeb0
> [19006.958308] ? lookup_dcache+0x17/0x60
> [19006.958784] ? do_renameat2+0x507/0x580
> [19006.959239] do_renameat2+0x507/0x580
> [19006.959702] __x64_sys_renameat+0x45/0x60
> [19006.960293] do_syscall_64+0x5b/0xc0
> [19006.960848] ? syscall_exit_to_user_mode+0x17/0x40
> [19006.961331] ? do_syscall_64+0x67/0xc0
> [19006.961812] ? syscall_exit_to_user_mode+0x17/0x40
> [19006.962401] ? do_syscall_64+0x67/0xc0
> [19006.963371] ? do_syscall_64+0x67/0xc0
> [19006.964020] ? do_syscall_64+0x67/0xc0
> [19006.965001] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [19006.965952] RIP: 0033:0x7fb25eba492a
> [19006.966485] Code: 48 8b 15 d9 44 17 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 a1 44 17 00 f7
> [19006.967545] RSP: 002b:00007fb245ff8a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
> [19006.968076] RAX: ffffffffffffffda RBX: 0000559a70a039f0 RCX: 00007fb25eba492a
> [19006.968623] RDX: 0000000000000004 RSI: 00007fb134000fc0 RDI: 0000000000000004
> [19006.977319] RBP: 00007fb245ff8c60 R08: 0000000000000000 R09: 0000000000000000
> [19006.977877] R10: 0000559a70a03a00 R11: 0000000000000246 R12: 00007fb245ff8c80
> [19006.978301] R13: 0000000000000004 R14: 00007fb245ff8c60 R15: 00000000000070b5
> [19006.978727] </TASK>
> [19006.979118] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
> [19006.981463] CR2: 0000000079e0afc0
> [19006.982193] ---[ end trace 0000000000000000 ]---
> [19006.982938] RIP: 0010:find_free_extent+0x20a/0x15c0
> [19006.983565] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> [19006.984863] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
> [19006.985500] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> [19006.986195] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> [19006.986877] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> [19006.987581] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
> [19006.988252] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
> [19006.988984] FS: 00007fb245ffb6c0(0000) GS:ffff8c15ebe00000(0000) knlGS:0000000000000000
> [19006.989646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [19006.990336] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506f0
> [19006.991037] note: crawl_writeback[24329] exited with irqs disabled
>
>
>
>
>
>
>
On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
> This looks like a different set of problems, though all of them seem to
> start on the compression write path in btrfs.
>
> On Wed, May 24, 2023 at 07:57:19AM +0200, Forza wrote:
> > [ 8.641506] 8021q: adding VLAN 0 to HW filter on device enp4s0
> > [ 13.841691] wireguard: WireGuard 1.0.0 loaded. See http://www.wireguard.com for information.
> > [ 13.841705] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
> > [13917.280527] ------------[ cut here ]------------
> > [13917.280753] default_enter_idle leaked IRQ state
> > [13917.281004] WARNING: CPU: 3 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x3bb/0x430
>
> Warning in cpuilde
>
> > [13917.281046] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
> > [13917.281140] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.3.1-gentoo-mm-patched #4
> > [13917.281150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> > [13917.281154] RIP: 0010:cpuidle_enter_state+0x3bb/0x430
> > [13917.281176] RSP: 0018:ffffa153c00b7ea0 EFLAGS: 00010286
> > [13917.281182] RAX: ffff8c15ebfafa28 RBX: ffffc153bfd80900 RCX: 000000000000083f
> > [13917.281186] RDX: 000000000118feed RSI: 00000000000000f6 RDI: 000000000000083f
> > [13917.281189] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffa153c00b7d60
> > [13917.281193] R10: 0000000000000003 R11: ffffffffacb399e8 R12: ffffffffacc2e320
> > [13917.281196] R13: ffffffffacc2e3a0 R14: 0000000000000001 R15: 0000000000000000
> > [13917.281202] FS: 0000000000000000(0000) GS:ffff8c15ebf80000(0000) knlGS:0000000000000000
> > [13917.281206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [13917.281210] CR2: 00007f71840b39c8 CR3: 0000000102998000 CR4: 00000000003506e0
> > [13917.281217] Call Trace:
> > [13917.281221] <TASK>
> > [13917.281228] cpuidle_enter+0x29/0x40
> > [13917.281244] do_idle+0x19b/0x200
> > [13917.281292] cpu_startup_entry+0x19/0x20
> > [13917.281297] start_secondary+0x101/0x120
> > [13917.281324] secondary_startup_64_no_verify+0xe5/0xeb
> > [13917.281343] </TASK>
> > [13917.281346] ---[ end trace 0000000000000000 ]---
> > [17206.750165] BTRFS info (device vdb): using xxhash64 (xxhash64-generic) checksum algorithm
> > [17206.750190] BTRFS info (device vdb): using free space tree
> > [17206.904010] BTRFS info (device vdb): auto enabling async discard
> > [17206.933302] BTRFS info (device vdb): checking UUID tree
> > [17344.541839] sched: RT throttling activated
> > [18284.216538] hrtimer: interrupt took 23434934 ns
> > [18737.100477] BUG: unable to handle page fault for address: 0000000079e0afc0
>
> BUG
>
> > [18737.100883] #PF: supervisor read access in kernel mode
> > [18737.101155] #PF: error_code(0x0000) - not-present page
> > [18737.101462] PGD 0 P4D 0
> > [18737.101715] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [18737.101968] CPU: 1 PID: 25287 Comm: kworker/u8:7 Tainted: G W 6.3.1-gentoo-mm-patched #4
> > [18737.102391] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> > [18737.102860] Workqueue: btrfs-delalloc btrfs_work_helper
> > [18737.103346] RIP: 0010:find_free_extent+0x20a/0x15c0
> > [18737.103900] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> > [18737.104851] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
> > [18737.105456] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> > [18737.106044] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> > [18737.106519] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> > [18737.107036] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
> > [18737.107363] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
> > [18737.107676] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
> > [18737.107971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [18737.108260] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
> > [18737.108606] Call Trace:
> > [18737.108964] <TASK>
> > [18737.109273] btrfs_reserve_extent+0x148/0x260
> > [18737.109601] submit_compressed_extents+0x14f/0x490
> > [18737.109934] async_cow_submit+0x37/0x90
> > [18737.110237] btrfs_work_helper+0x13d/0x360
> > [18737.110542] process_one_work+0x20f/0x410
> > [18737.110883] worker_thread+0x4a/0x3b0
> > [18737.111185] ? __pfx_worker_thread+0x10/0x10
> > [18737.111482] kthread+0xda/0x100
> > [18737.111800] ? __pfx_kthread+0x10/0x10
> > [18737.112097] ret_from_fork+0x2c/0x50
> > [18737.112387] </TASK>
> > [18737.112676] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 rfkill 8021q garp mrp stp llc binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd iTCO_wdt ccp intel_pmc_bxt iTCO_vendor_support kvm i2c_i801 virtio_gpu irqbypass pcspkr virtio_dma_buf joydev i2c_smbus drm_shmem_helper lpc_ich virtio_balloon drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_console virtio_net net_failover virtio_scsi failover serio_raw virtio_blk qemu_fw_cfg
> > [18737.114021] CR2: 0000000079e0afc0
> > [18737.114366] ---[ end trace 0000000000000000 ]---
> > [18737.114712] RIP: 0010:find_free_extent+0x20a/0x15c0
> > [18737.115059] Code: 4d 8d ba 10 ff ff ff 48 83 c0 0f 49 8d 97 f0 00 00 00 48 c1 e0 04 48 01 d8 48 39 c2 0f 84 c5 03 00 00 41 c6 85 84 00 00 00 00 <45> 8b 9f b0 00 00 00 45 85 db 0f 85 d8 0c 00 00 45 8b 75 28 4c 89
> > [18737.115864] RSP: 0018:ffffa153c0923bd0 EFLAGS: 00010203
> > [18737.116415] RAX: ffff8c14869240f0 RBX: ffff8c1486924000 RCX: 0000000000000001
> > [18737.117090] RDX: 0000000079e0b000 RSI: 0000000000000100 RDI: ffff8c14869bcc00
> > [18737.117882] RBP: ffff8c148b100000 R08: 0000000000000000 R09: 0000000000000000
> > [18737.118611] R10: 0000000079e0b000 R11: 000000000000151b R12: ffffa153c0923dd7
> > [18737.119416] R13: ffffa153c0923c90 R14: 0000000000000001 R15: 0000000079e0af10
> > [18737.120221] FS: 0000000000000000(0000) GS:ffff8c15ebe80000(0000) knlGS:0000000000000000
> > [18737.120994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [18737.121868] CR2: 0000000079e0afc0 CR3: 00000001055e8000 CR4: 00000000003506e0
> > [18737.122624] note: kworker/u8:7[25287] exited with irqs disabled
> > [19006.920558] BUG: unable to handle page fault for address: 0000000079e0afc0
>
> And again, so something is going wrong
>
Indeed.
I suggest you run your kernel with CONFIG_KASAN=y to see if there are
any use-after-free or out-of-bounds bugs.
--
Uladzislau Rezki
---- From: Uladzislau Rezki <[email protected]> -- Sent: 2023-05-26 - 14:24 ----
> On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
>> This looks like a different set of problems, though all of them seem to
>> start on the compression write path in btrfs.
>>
>> On Wed, May 24, 2023 at 07:57:19AM +0200, Forza wrote:
>> > [ 8.641506] 8021q: adding VLAN 0 to HW filter on device enp4s0
>> > [ 13.841691] wireguard: WireGuard 1.0.0 loaded. See http://www.wireguard.com for information.
>> > [ 13.841705] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
>> > [13917.280527] ------------[ cut here ]------------
>> > [13917.280753] default_enter_idle leaked IRQ state
>> > [13917.281004] WARNING: CPU: 3 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x3bb/0x430
>>
>>
... Snip
>>
>> And again, so something is going wrong
>>
> Indeed.
>
> I suggest you run your kernel with CONFIG_KASAN=y to see if there are
> any use-after-free or out-of-bounds bugs.
>
> --
> Uladzislau Rezki
Pardon the delay... I have enabled KASAN and UBSAN on this kernel. It produced a lot of output and plenty of warnings for misalignment.
The full dmesg is at https://paste.tnonline.net/files/aBoUMuTd5KBC_dmesg.ubsan.txt (approx 1.7MiB)
The full kernel .conf is. at https://paste.tnonline.net/files/z1mX8TWFgZQ3_kernel.conf-kasan-ubsan.txt
A small exctract around what I think is the default_enter_idle leaked IRQ event. Is this helpful?
================================================================================
Jul 03 00:33:57 git kernel: UBSAN: misaligned-access in net/ipv4/tcp_ipv4.c:1848:13
Jul 03 00:33:57 git kernel: member access within misaligned address 000000007604d82f for type 'const struct tcphdr'
Jul 03 00:33:57 git kernel: which requires 4 byte alignment
Jul 03 00:33:57 git kernel: CPU: 2 PID: 29 Comm: ksoftirqd/2 Not tainted 6.3.10-ksan-ubsan #8
Jul 03 00:33:57 git kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
Jul 03 00:33:57 git kernel: Call Trace:
Jul 03 00:33:57 git kernel: <TASK>
Jul 03 00:33:57 git kernel: dump_stack_lvl+0x86/0xd0
Jul 03 00:33:57 git kernel: ubsan_type_mismatch_common+0xdf/0x240
Jul 03 00:33:57 git kernel: __ubsan_handle_type_mismatch_v1+0x44/0x60
Jul 03 00:33:57 git kernel: tcp_add_backlog+0x1fac/0x3ab0
Jul 03 00:33:57 git kernel: ? sk_filter_trim_cap+0xcc/0xb60
Jul 03 00:33:57 git kernel: ? __pfx_tcp_add_backlog+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock+0x10/0x10
Jul 03 00:33:57 git kernel: tcp_v4_rcv+0x3583/0x4c40
Jul 03 00:33:57 git kernel: ? __pfx_tcp_v4_rcv+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock_irqsave+0x10/0x10
Jul 03 00:33:57 git kernel: ip_protocol_deliver_rcu+0x6c/0x480
Jul 03 00:33:57 git kernel: ip_local_deliver_finish+0x2ae/0x4d0
Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
Jul 03 00:33:57 git kernel: ip_local_deliver+0x1ba/0x380
Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
Jul 03 00:33:57 git kernel: ? ipv4_dst_check+0x104/0x250
Jul 03 00:33:57 git kernel: ? __ubsan_handle_type_mismatch_v1+0x44/0x60
Jul 03 00:33:57 git kernel: ip_sublist_rcv_finish+0x172/0x380
Jul 03 00:33:57 git kernel: ------------[ cut here ]------------
Jul 03 00:33:57 git kernel: ip_sublist_rcv+0x3cd/0x900
Jul 03 00:33:57 git kernel: default_enter_idle leaked IRQ state
Jul 03 00:33:57 git kernel: ? __pfx_ip_sublist_rcv+0x10/0x10
Jul 03 00:33:57 git kernel: ? __ubsan_handle_type_mismatch_v1+0x44/0x60
Jul 03 00:33:57 git kernel: ? ip_rcv_core+0x972/0x1b20
Jul 03 00:33:57 git kernel: ip_list_rcv+0x318/0x750
Jul 03 00:33:57 git kernel: ? __pfx_ip_list_rcv+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx_ip_list_rcv+0x10/0x10
Jul 03 00:33:57 git kernel: __netif_receive_skb_list_core+0x5ad/0x1170
Jul 03 00:33:57 git kernel: ? tcp_gro_receive+0x1f45/0x2990
Jul 03 00:33:57 git kernel: ? __pfx___netif_receive_skb_list_core+0x10/0x10
Jul 03 00:33:57 git kernel: ? kvm_clock_read+0x16/0x40
Jul 03 00:33:57 git kernel: ? ktime_get_with_offset+0xd0/0x1f0
Jul 03 00:33:57 git kernel: netif_receive_skb_list_internal+0x76f/0x1530
Jul 03 00:33:57 git kernel: ? __pfx_netif_receive_skb_list_internal+0x10/0x10
Jul 03 00:33:57 git kernel: ? dev_gro_receive+0x67f/0x4900
Jul 03 00:33:57 git kernel: ? free_unref_page+0x2fd/0x680
Jul 03 00:33:57 git kernel: ? put_page+0x69/0x2b0
Jul 03 00:33:57 git kernel: ? __pfx_eth_type_trans+0x10/0x10
Jul 03 00:33:57 git kernel: napi_gro_receive+0x77b/0xdc0
Jul 03 00:33:57 git kernel: receive_buf+0x1001/0xac40
Jul 03 00:33:57 git kernel: ? _raw_spin_lock_irqsave+0xaa/0x180
Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock_irqsave+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx_receive_buf+0x10/0x10
Jul 03 00:33:57 git kernel: ? _raw_spin_unlock_irqrestore+0x40/0x80
Jul 03 00:33:57 git kernel: ? trace_hardirqs_on+0x2d/0xd0
Jul 03 00:33:57 git kernel: ? detach_buf_split+0x27e/0xa70
Jul 03 00:33:57 git kernel: ? virtqueue_get_buf_ctx_split+0x3c3/0x1400
Jul 03 00:33:57 git kernel: ? virtqueue_enable_cb_delayed+0x5d0/0x1180
Jul 03 00:33:57 git kernel: virtnet_poll+0x7c7/0x2030
Jul 03 00:33:57 git kernel: ? __pfx_virtnet_poll+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock+0x10/0x10
Jul 03 00:33:57 git kernel: ? __run_timers+0x43d/0xf70
Jul 03 00:33:57 git kernel: __napi_poll.constprop.0+0xd4/0x840
Jul 03 00:33:57 git kernel: net_rx_action+0x7a0/0x26e0
Jul 03 00:33:57 git kernel: ? __pfx_net_rx_action+0x10/0x10
Jul 03 00:33:57 git kernel: __do_softirq+0x277/0x95d
Jul 03 00:33:57 git kernel: ? __pfx___do_softirq+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx_run_ksoftirqd+0x10/0x10
Jul 03 00:33:57 git kernel: ? __pfx_run_ksoftirqd+0x10/0x10
Jul 03 00:33:57 git kernel: run_ksoftirqd+0x2c/0x40
Jul 03 00:33:57 git kernel: smpboot_thread_fn+0x380/0xbc0
Jul 03 00:33:57 git kernel: ? __kthread_parkme+0xdc/0x280
Jul 03 00:33:57 git kernel: ? schedule+0x158/0x360
Jul 03 00:33:57 git kernel: ? __pfx_smpboot_thread_fn+0x10/0x10
Jul 03 00:33:57 git kernel: kthread+0x259/0x3d0
Jul 03 00:33:57 git kernel: ? __pfx_kthread+0x10/0x10
Jul 03 00:33:57 git kernel: ret_from_fork+0x2c/0x50
Jul 03 00:33:57 git kernel: </TASK>
Jul 03 00:33:57 git kernel: ================================================================================
---- From: Forza <[email protected]> -- Sent: 2023-07-03 - 01:28 ----
>
>
> ---- From: Uladzislau Rezki <[email protected]> -- Sent: 2023-05-26 - 14:24 ----
>
>> On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
>>> This looks like a different set of problems, though all of them seem to
>>> start on the compression write path in btrfs.
>>>
>>> On Wed, May 24, 2023 at 07:57:19AM +0200, Forza wrote:
>>> > [ 8.641506] 8021q: adding VLAN 0 to HW filter on device enp4s0
>>> > [ 13.841691] wireguard: WireGuard 1.0.0 loaded. See http://www.wireguard.com for information.
>>> > [ 13.841705] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
>>> > [13917.280527] ------------[ cut here ]------------
>>> > [13917.280753] default_enter_idle leaked IRQ state
>>> > [13917.281004] WARNING: CPU: 3 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x3bb/0x430
>>>
>>>
> ... Snip
>>>
>>> And again, so something is going wrong
>>>
>> Indeed.
>>
>> I suggest you run your kernel with CONFIG_KASAN=y to see if there are
>> any use-after-free or out-of-bounds bugs.
>>
>> --
>> Uladzislau Rezki
>
>
> Pardon the delay... I have enabled KASAN and UBSAN on this kernel. It produced a lot of output and plenty of warnings for misalignment.
>
> The full dmesg is at https://paste.tnonline.net/files/aBoUMuTd5KBC_dmesg.ubsan.txt (approx 1.7MiB)
>
> The full kernel .conf is. at https://paste.tnonline.net/files/z1mX8TWFgZQ3_kernel.conf-kasan-ubsan.txt
>
> A small exctract around what I think is the default_enter_idle leaked IRQ event. Is this helpful?
>
> ================================================================================
> Jul 03 00:33:57 git kernel: UBSAN: misaligned-access in net/ipv4/tcp_ipv4.c:1848:13
> Jul 03 00:33:57 git kernel: member access within misaligned address 000000007604d82f for type 'const struct tcphdr'
> Jul 03 00:33:57 git kernel: which requires 4 byte alignment
> Jul 03 00:33:57 git kernel: CPU: 2 PID: 29 Comm: ksoftirqd/2 Not tainted 6.3.10-ksan-ubsan #8
> Jul 03 00:33:57 git kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
> Jul 03 00:33:57 git kernel: Call Trace:
> Jul 03 00:33:57 git kernel: <TASK>
> Jul 03 00:33:57 git kernel: dump_stack_lvl+0x86/0xd0
> Jul 03 00:33:57 git kernel: ubsan_type_mismatch_common+0xdf/0x240
> Jul 03 00:33:57 git kernel: __ubsan_handle_type_mismatch_v1+0x44/0x60
> Jul 03 00:33:57 git kernel: tcp_add_backlog+0x1fac/0x3ab0
> Jul 03 00:33:57 git kernel: ? sk_filter_trim_cap+0xcc/0xb60
> Jul 03 00:33:57 git kernel: ? __pfx_tcp_add_backlog+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock+0x10/0x10
> Jul 03 00:33:57 git kernel: tcp_v4_rcv+0x3583/0x4c40
> Jul 03 00:33:57 git kernel: ? __pfx_tcp_v4_rcv+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> Jul 03 00:33:57 git kernel: ip_protocol_deliver_rcu+0x6c/0x480
> Jul 03 00:33:57 git kernel: ip_local_deliver_finish+0x2ae/0x4d0
> Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
> Jul 03 00:33:57 git kernel: ip_local_deliver+0x1ba/0x380
> Jul 03 00:33:57 git kernel: ? __pfx_ip_local_deliver+0x10/0x10
> Jul 03 00:33:57 git kernel: ? ipv4_dst_check+0x104/0x250
> Jul 03 00:33:57 git kernel: ? __ubsan_handle_type_mismatch_v1+0x44/0x60
> Jul 03 00:33:57 git kernel: ip_sublist_rcv_finish+0x172/0x380
> Jul 03 00:33:57 git kernel: ------------[ cut here ]------------
> Jul 03 00:33:57 git kernel: ip_sublist_rcv+0x3cd/0x900
> Jul 03 00:33:57 git kernel: default_enter_idle leaked IRQ state
> Jul 03 00:33:57 git kernel: ? __pfx_ip_sublist_rcv+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __ubsan_handle_type_mismatch_v1+0x44/0x60
> Jul 03 00:33:57 git kernel: ? ip_rcv_core+0x972/0x1b20
> Jul 03 00:33:57 git kernel: ip_list_rcv+0x318/0x750
> Jul 03 00:33:57 git kernel: ? __pfx_ip_list_rcv+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx_ip_list_rcv+0x10/0x10
> Jul 03 00:33:57 git kernel: __netif_receive_skb_list_core+0x5ad/0x1170
> Jul 03 00:33:57 git kernel: ? tcp_gro_receive+0x1f45/0x2990
> Jul 03 00:33:57 git kernel: ? __pfx___netif_receive_skb_list_core+0x10/0x10
> Jul 03 00:33:57 git kernel: ? kvm_clock_read+0x16/0x40
> Jul 03 00:33:57 git kernel: ? ktime_get_with_offset+0xd0/0x1f0
> Jul 03 00:33:57 git kernel: netif_receive_skb_list_internal+0x76f/0x1530
> Jul 03 00:33:57 git kernel: ? __pfx_netif_receive_skb_list_internal+0x10/0x10
> Jul 03 00:33:57 git kernel: ? dev_gro_receive+0x67f/0x4900
> Jul 03 00:33:57 git kernel: ? free_unref_page+0x2fd/0x680
> Jul 03 00:33:57 git kernel: ? put_page+0x69/0x2b0
> Jul 03 00:33:57 git kernel: ? __pfx_eth_type_trans+0x10/0x10
> Jul 03 00:33:57 git kernel: napi_gro_receive+0x77b/0xdc0
> Jul 03 00:33:57 git kernel: receive_buf+0x1001/0xac40
> Jul 03 00:33:57 git kernel: ? _raw_spin_lock_irqsave+0xaa/0x180
> Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx_receive_buf+0x10/0x10
> Jul 03 00:33:57 git kernel: ? _raw_spin_unlock_irqrestore+0x40/0x80
> Jul 03 00:33:57 git kernel: ? trace_hardirqs_on+0x2d/0xd0
> Jul 03 00:33:57 git kernel: ? detach_buf_split+0x27e/0xa70
> Jul 03 00:33:57 git kernel: ? virtqueue_get_buf_ctx_split+0x3c3/0x1400
> Jul 03 00:33:57 git kernel: ? virtqueue_enable_cb_delayed+0x5d0/0x1180
> Jul 03 00:33:57 git kernel: virtnet_poll+0x7c7/0x2030
> Jul 03 00:33:57 git kernel: ? __pfx_virtnet_poll+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx__raw_spin_lock+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __run_timers+0x43d/0xf70
> Jul 03 00:33:57 git kernel: __napi_poll.constprop.0+0xd4/0x840
> Jul 03 00:33:57 git kernel: net_rx_action+0x7a0/0x26e0
> Jul 03 00:33:57 git kernel: ? __pfx_net_rx_action+0x10/0x10
> Jul 03 00:33:57 git kernel: __do_softirq+0x277/0x95d
> Jul 03 00:33:57 git kernel: ? __pfx___do_softirq+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx_run_ksoftirqd+0x10/0x10
> Jul 03 00:33:57 git kernel: ? __pfx_run_ksoftirqd+0x10/0x10
> Jul 03 00:33:57 git kernel: run_ksoftirqd+0x2c/0x40
> Jul 03 00:33:57 git kernel: smpboot_thread_fn+0x380/0xbc0
> Jul 03 00:33:57 git kernel: ? __kthread_parkme+0xdc/0x280
> Jul 03 00:33:57 git kernel: ? schedule+0x158/0x360
> Jul 03 00:33:57 git kernel: ? __pfx_smpboot_thread_fn+0x10/0x10
> Jul 03 00:33:57 git kernel: kthread+0x259/0x3d0
> Jul 03 00:33:57 git kernel: ? __pfx_kthread+0x10/0x10
> Jul 03 00:33:57 git kernel: ret_from_fork+0x2c/0x50
> Jul 03 00:33:57 git kernel: </TASK>
> Jul 03 00:33:57 git kernel: ================================================================================
>
A small update.
I have been able test 6.2.16, all 6.3.x and 6.4.1 and they all show the same issue.
I am now trying 6.1.37 since two days and have not been able to reproduce this issue on any of my virtual qemu/kvm machines. Perhaps this information is helpful in finding the root cause?
~Forza
On 06.07.23 10:08, Forza wrote:
>>> On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
> [...]
> A small update.
Thx for this.
> I have been able test 6.2.16, all 6.3.x and 6.4.1 and they all show
> the same issue.
>
> I am now trying 6.1.37 since two days and have not been able to
> reproduce this issue on any of my virtual qemu/kvm machines. Perhaps
> this information is helpful in finding the root cause?
That means it's most likely a regression between v6.1..v6.2 (or
v6.1..v6.2.16 if we are unlucky) somewhere (from earlier in the thread
it sounds like it might not be Btrfs).
Which makes we wonder: how long do you usually need to reproduce the
issue? If it's not too long it might mean that a bisection is the best
way forward, unless some developer sits down and looks closely at the
logs. With a bit of luck some dev will do that; but if we are unlucky we
likely will need a bisection.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On 2023-07-06 12:54, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 06.07.23 10:08, Forza wrote:
>>>> On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
>> [...]
>> A small update.
>
> Thx for this.
>
>> I have been able test 6.2.16, all 6.3.x and 6.4.1 and they all show
>> the same issue.
>>
>> I am now trying 6.1.37 since two days and have not been able to
>> reproduce this issue on any of my virtual qemu/kvm machines. Perhaps
>> this information is helpful in finding the root cause?
>
> That means it's most likely a regression between v6.1..v6.2 (or
> v6.1..v6.2.16 if we are unlucky) somewhere (from earlier in the thread
> it sounds like it might not be Btrfs).
Agreed, I do not think this specific bug (cpuidle / default_enter_idle
leaked IRQ state) is Btrfs related. Some of the virtual machines I test
on do not use Btrfs.
>
> Which makes we wonder: how long do you usually need to reproduce the
> issue? If it's not too long it might mean that a bisection is the best
> way forward, unless some developer sits down and looks closely at the
> logs. With a bit of luck some dev will do that; but if we are unlucky we
> likely will need a bisection.
>
It has varied. Sometimes immediately upon boot, but can take several
hours or a day before showing up.
Also, I forgot to say I was basing my kernels on gentoo-kernels, which
has some patches against vanilla. Therefore I will I will compile a set
of vanilla kernels from 6.1.37 until 6.4.2 and run them in my testing
machines to see where the problem is happening.
This is not a fast system, so it will likely take several days. But I
will keep you posted.
Meanwhile, if you think of any specific kernel debug options, tracing,
etc, that I should enable, let me know
Should we change the Subject line of this email thread?
Thanks
~Forza
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
On 2023-07-07 12:13, Forza wrote:
>
>
> On 2023-07-06 12:54, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 06.07.23 10:08, Forza wrote:
>>>>> On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
>>> [...]
>>> A small update.
>>
>> Thx for this.
>>
>>> I have been able test 6.2.16, all 6.3.x and 6.4.1 and they all show
>>> the same issue.
>>>
>>> I am now trying 6.1.37 since two days and have not been able to
>>> reproduce this issue on any of my virtual qemu/kvm machines. Perhaps
>>> this information is helpful in finding the root cause?
>>
>> That means it's most likely a regression between v6.1..v6.2 (or
>> v6.1..v6.2.16 if we are unlucky) somewhere (from earlier in the thread
>> it sounds like it might not be Btrfs).
>
> Agreed, I do not think this specific bug (cpuidle / default_enter_idle
> leaked IRQ state) is Btrfs related. Some of the virtual machines I test
> on do not use Btrfs.
>>
>> Which makes we wonder: how long do you usually need to reproduce the
>> issue? If it's not too long it might mean that a bisection is the best
>> way forward, unless some developer sits down and looks closely at the
>> logs. With a bit of luck some dev will do that; but if we are unlucky we
>> likely will need a bisection.
>>
>
> It has varied. Sometimes immediately upon boot, but can take several
> hours or a day before showing up.
>
>
> Also, I forgot to say I was basing my kernels on gentoo-kernels, which
> has some patches against vanilla. Therefore I will I will compile a set
> of vanilla kernels from 6.1.37 until 6.4.2 and run them in my testing
> machines to see where the problem is happening.
>
> This is not a fast system, so it will likely take several days. But I
> will keep you posted.
>
> Meanwhile, if you think of any specific kernel debug options, tracing,
> etc, that I should enable, let me know
>
> Should we change the Subject line of this email thread?
Have been running for a few weeks now. The earliest vanilla-kernel I
have ben able to reproduce the issue with is 6.2.16, it only happened
once over two weeks.
The issue still happens frequently on 6.4.x kernels. Here is the output
from 6.4.7, and as you see happened only a few seconds after booting.
This VM has only a minimal system installed:
# pstree
init-+-acpid
|-crond
|-dropbear---dropbear---bash---pstree
|-2*[getty]
|-qemu-ga---{qemu-ga}
|-supervise-daemo---tgtd---16*[{tgtd}]
|-syslogd
|-udhcpc
`-watchdog
# dmesg
[ 4.328170][ T1] Run /sbin/init as init process
[ 4.329279][ T1] with arguments:
[ 4.329282][ T1] /sbin/init
[ 4.329284][ T1] with environment:
[ 4.329286][ T1] HOME=/
[ 4.329288][ T1] TERM=linux
[ 11.540637][ T1992] Adding 1048572k swap on /dev/vda. Priority:-2
extents:1 across:1048572k DscFS
[ 12.917894][ T0] ------------[ cut here ]------------
[ 12.917907][ T0] default_enter_idle leaked IRQ state
[ 12.918001][ T0] WARNING: CPU: 2 PID: 0 at
drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x35f/0x510
[ 12.918052][ T0] Modules linked in:
[ 12.918064][ T0] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
6.4.7-ksan-ubsan #1
[ 12.918070][ T0] Hardware name: QEMU Standard PC (Q35 + ICH9,
2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 12.918074][ T0] RIP: 0010:cpuidle_enter_state+0x35f/0x510
[ 12.918088][ T0] Code: 89 c2 83 e2 01 3c 01 0f 87 b1 01 00 00 84
d2 75 1d c6 05 ac bf 1f 02 01 90 49 8b 76 50 48 c7 c7 60 23 b1 ba e8 82
ef ed fc 90 <0f> 0b 90 90 fa e9 36 fd ff ff 83 c2 01 48 83 c1 40 41 39
d0 0f 8e
[ 12.918093][ T0] RSP: 0018:ffffc9000015fd98 EFLAGS: 00010246
[ 12.918112][ T0] RAX: 0000000000000000 RBX: ffffe8ffffd08c00 RCX:
0000000000000000
[ 12.918116][ T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
[ 12.918119][ T0] RBP: ffffffffbc39ae40 R08: 0000000000000000 R09:
0000000000000000
[ 12.918123][ T0] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000001
[ 12.918125][ T0] R13: 0000000000000001 R14: ffffffffbc39aec0 R15:
0000000000000000
[ 12.918130][ T0] FS: 0000000000000000(0000)
GS:ffff88806d100000(0000) knlGS:0000000000000000
[ 12.918146][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.918150][ T0] CR2: 00007f8dbf5bc000 CR3: 000000000f880000 CR4:
00000000003506e0
[ 12.918154][ T0] Call Trace:
[ 12.918156][ T0] <TASK>
[ 12.918160][ T0] ? __warn+0xc9/0x240
[ 12.918198][ T0] ? cpuidle_enter_state+0x35f/0x510
[ 12.918203][ T0] ? report_bug+0x261/0x2c0
[ 12.918219][ T0] ? __pfx_default_enter_idle+0x10/0x10
[ 12.918226][ T0] ? handle_bug+0x3d/0x80
[ 12.918234][ T0] ? exc_invalid_op+0x13/0x50
[ 12.918238][ T0] ? asm_exc_invalid_op+0x16/0x20
[ 12.918251][ T0] ? cpuidle_enter_state+0x35f/0x510
[ 12.918255][ T0] ? cpuidle_enter_state+0x35e/0x510
[ 12.918269][ T0] ? tick_nohz_stop_tick+0x22c/0x660
[ 12.918288][ T0] cpuidle_enter+0x4a/0xb0
[ 12.918294][ T0] do_idle+0x31c/0x400
[ 12.918304][ T0] ? __pfx_do_idle+0x10/0x10
[ 12.918308][ T0] cpu_startup_entry+0x19/0x20
[ 12.918311][ T0] start_secondary+0x241/0x2d0
[ 12.918323][ T0] ? __pfx_start_secondary+0x10/0x10
[ 12.918328][ T0] secondary_startup_64_no_verify+0xf4/0xfb
[ 12.918338][ T0] </TASK>
[ 12.918339][ T0] ---[ end trace 0000000000000000 ]---
kernel-6.4.7-config:
https://paste.tnonline.net/files/A1IHEUjgCh8Y_config-6.4.7.txt
kernel-6.4.7-dmesg:
https://paste.tnonline.net/files/YP11KEMpKXAI_dmesg-6.4.7.txt
~Forza
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>