Hi,
Today I saw for the first time "BUG: Bad page state in process
kswapd0 pfn:d6e840"
Trace:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
Modules linked in: uvcvideo uvc videobuf2_vmalloc videobuf2_memops
videobuf2_v4l2 videobuf2_common videodev rndis_host uas cdc_ether
usbnet usb_storage mii overlay tun uinput snd_seq_dummy snd_hrtimer
rfcomm nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc
binfmt_misc snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi mc
amd_atl intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi mt76x2u
mt7921e edac_mce_amd snd_hda_intel mt76x2_common mt7921_common
snd_intel_dspcfg mt76x02_usb snd_intel_sdw_acpi mt76_usb mt792x_lib
snd_hda_codec mt76x02_lib mt76_connac_lib btusb btrtl mt76
snd_hda_core btintel kvm_amd btbcm btmtk snd_hwdep mac80211 snd_seq
kvm vfat snd_seq_device bluetooth libarc4 fat irqbypass snd_pcm rapl
cfg80211 snd_timer wmi_bmof pcspkr snd i2c_piix4 k10temp rfkill
soundcore joydev
apple_mfi_fastcharge gpio_amdpt gpio_generic loop nfnetlink zram
amdgpu hid_apple crct10dif_pclmul crc32_pclmul crc32c_intel
polyval_clmulni polyval_generic amdxcp i2c_algo_bit drm_ttm_helper ttm
ghash_clmulni_intel drm_exec gpu_sched drm_suballoc_helper
sha512_ssse3 nvme drm_buddy sha256_ssse3 sha1_ssse3 drm_display_helper
nvme_core sp5100_tco r8169 ccp cec realtek nvme_auth video wmi
ip6_tables ip_tables fuse
CPU: 17 PID: 268 Comm: kswapd0 Tainted: G W L -------
--- 6.9.0-0.rc0.20240315gite5eb28f6d1af.8.fc41.x86_64+debug #1
Hardware name: Micro-Star International Co., Ltd. MS-7D73/MPG B650I
EDGE WIFI (MS-7D73), BIOS 1.82 01/24/2024
Call Trace:
<TASK>
dump_stack_lvl+0xce/0xf0
bad_page+0xd4/0x230
? __pfx_bad_page+0x10/0x10
? page_bad_reason+0x9d/0x1f0
free_unref_page_prepare+0x80e/0xe00
? __pfx___mem_cgroup_uncharge_folios+0x10/0x10
? __pfx_lock_release+0x10/0x10
free_unref_folios+0x26e/0x9c0
? _raw_spin_unlock_irq+0x28/0x60
move_folios_to_lru+0xc0e/0xe80
? __pfx_move_folios_to_lru+0x10/0x10
evict_folios+0xe5c/0x1610
? evict_folios+0x5f3/0x1610
? __pfx_lock_acquire+0x10/0x10
? __pfx_evict_folios+0x10/0x10
? rcu_is_watching+0x15/0xb0
? rcu_is_watching+0x15/0xb0
? __pfx_lock_acquire+0x10/0x10
? __pfx___might_resched+0x10/0x10
? mem_cgroup_get_nr_swap_pages+0x25/0x120
try_to_shrink_lruvec+0x4d8/0x800
? rcu_is_watching+0x15/0xb0
? __pfx_try_to_shrink_lruvec+0x10/0x10
? lock_release+0x581/0xc60
? __pfx_lock_release+0x10/0x10
shrink_one+0x37c/0x6f0
shrink_node+0x1d60/0x3080
? shrink_node+0x1d47/0x3080
? shrink_node+0x1afa/0x3080
? __pfx_shrink_node+0x10/0x10
? pgdat_balanced+0x7b/0x1a0
balance_pgdat+0x88b/0x1480
? rcu_is_watching+0x15/0xb0
? __pfx_balance_pgdat+0x10/0x10
? __switch_to+0x409/0xdd0
? __switch_to_asm+0x37/0x70
? __schedule+0x10cd/0x61d0
? __pfx_debug_object_free+0x10/0x10
? __try_to_del_timer_sync+0xe5/0x140
? __pfx_lock_release+0x10/0x10
? __pfx___might_resched+0x10/0x10
? set_pgdat_percpu_threshold+0x1c4/0x2f0
? __pfx_calculate_pressure_threshold+0x10/0x10
kswapd+0x51d/0x910
? __pfx_kswapd+0x10/0x10
? __pfx_autoremove_wake_function+0x10/0x10
? lockdep_hardirqs_on+0x80/0x110
? __kthread_parkme+0xba/0x1f0
? __pfx_kswapd+0x10/0x10
kthread+0x2ed/0x3c0
? _raw_spin_unlock_irq+0x28/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Quick googling doesn't give a reassuring answer.
If it really is a hardware problem, it is unclear what the culprit is here.
The memory was checked a year ago with memtest86 and no errors were found.
Given how randomly this bug message appears, it may be tempting to just
ignore it, but an unpleasant aftertaste remains.
Machine spec: https://linux-hardware.org/?probe=24b7696f8a
I have attached the full kernel log and build config below.
--
Best Regards,
Mike Gavrilov.
On Mon, Mar 18, 2024 at 2:55 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> Hi,
> Today I saw for the first time "BUG: Bad page state in process
> kswapd0 pfn:d6e840"
>
> [...]
Sorry for writing again, but no one answered me.
Of course, I checked my memory again and ensured that it was fine.
I even changed three motherboards, but this error appears once a week.
Every time the common phrase is "BUG: Bad page state in process" but
the process can be different.
kcompactd, kswapd, kworker, btrfs-transacti
I don’t know if this will help, but usually this happens when I run
the “docker pull” command and a huge container is being updated.
--
Best Regards,
Mike Gavrilov.
On 08.05.24 12:16, Mikhail Gavrilov wrote:
> On Mon, Mar 18, 2024 at 2:55 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> Hi,
>> Today I saw for the first time "BUG: Bad page state in process
>> kswapd0 pfn:d6e840"
>> [...]
>> page dumped because: non-NULL mapping
>> [...]
>
> Sorry for writing again, but no one answered me.
> Of course, I checked my memory again and ensured that it was fine.
> I even changed three motherboards, but this error appears once a week.
> Every time the common phrase is "BUG: Bad page state in process" but
> the process can be different.
> kcompactd, kswapd, kworker, btrfs-transacti
> I don’t know if this will help, but usually this happens when I run
> the “docker pull” command and a huge container is being updated.
"page dumped because: non-NULL mapping"
Is the relevant bit. We are freeing a page, but page->mapping is not
NULL. IIUC, it might happen under memory pressure when reclaiming memory.
It's weird that only you are seeing this; if it were something
"obvious" I would expect multiple reports :/
--
Cheers,
David / dhildenb
On Wed, May 8, 2024 at 10:45 PM David Hildenbrand <[email protected]> wrote:
>
> "page dumped because: non-NULL mapping"
>
> Is the relevant bit. We are freeing a page, but page->mapping is not
> NULL. IIUC, it might happen under memory pressure when reclaiming memory.
>
> It's weird that only you are seeing this; if it were something
> "obvious" I would expect multiple reports :/
>
Maybe because the problem is really difficult to reproduce (a rare
combination of kernel build options and workload). I even thought
the problem had been fixed because it did not reproduce for a week.
But yesterday, when it happened again in a kworker process, I stopped
believing it was fixed.
Now my question is: how can I be as useful as possible the next time
the bug reproduces?
--
Best Regards,
Mike Gavrilov.
On 09.05.24 13:59, Mikhail Gavrilov wrote:
> On Wed, May 8, 2024 at 10:45 PM David Hildenbrand <[email protected]> wrote:
>>
>> "page dumped because: non-NULL mapping"
>>
>> Is the relevant bit. We are freeing a page, but page->mapping is not
>> NULL. IIUC, it might happen under memory pressure when reclaiming memory.
>>
>> It's weird that only you are seeing that, if it would be something
>> "obvious" I would expect multiple reports :/
>>
>
> Maybe because the problem is really difficult to reproduce (a rare
> combination of kernel build options and workload). I even thought
> the problem had been fixed because it did not reproduce for a week.
>
> But yesterday, when it happened again in a kworker process, I stopped
> believing it was fixed.
>
> Now my question is: how can I be as useful as possible the next time
> the bug reproduces?
Do you have the other stack trace as well?
Maybe triggering memory reclaim (e.g., using "stress" or "memhog") could
trigger it; that might be reasonable to try. Once we have a reproducer
we could at least bisect.
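Something as simple as the following (just a memhog-style sketch I'm typing
from memory, not the actual memhog tool) should be enough to push the machine
into reclaim; run one or two instances sized close to your RAM while the
"docker pull" is running:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate and touch N GiB (default 1) so kswapd/kcompactd get work to do. */
int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? (size_t)atol(argv[1]) : 1;

	for (size_t i = 0; i < gib; i++) {
		char *p = malloc(1UL << 30);

		if (!p)
			break;
		memset(p, 0xa5, 1UL << 30);	/* fault in every page */
	}
	pause();	/* keep the memory resident until you kill it */
	return 0;
}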
--
Cheers,
David / dhildenb
On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <[email protected]> wrote:
>
> Do you have the other stack trace as well?
>
> Maybe triggering memory reclaim (e.g., using "stress" or "memhog") could
> trigger it; that might be reasonable to try. Once we have a reproducer
> we could at least bisect.
>
The only known workload that causes this is updating a large
container. Unfortunately, not every container update reproduces the
problem.
[24119.281379] BUG: Bad page state in process kcompactd0 pfn:3ae37e
[24119.281387] page: refcount:0 mapcount:0 mapping:00000000d16c2d75
index:0x272ea3200 pfn:0x3ae37e
[24119.281390] aops:btree_aops ino:1
[24119.281395] flags:
0x17ffffc000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
[24119.281400] raw: 0017ffffc000020c dead000000000100 dead000000000122
ffff888136ecd220
[24119.281402] raw: 0000000272ea3200 0000000000000000 00000000ffffffff
0000000000000000
[24119.281403] page dumped because: non-NULL mapping
[24119.281405] Modules linked in: overlay tun crypto_user uinput
snd_seq_dummy snd_hrtimer rfcomm nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables qrtr uhid bnep sunrpc binfmt_misc amd_atl intel_rapl_msr
intel_rapl_common mt76x2u mt76x2_common mt7921e mt7921_common
mt76x02_usb mt76_usb mt792x_lib mt76x02_lib mt76_connac_lib vfat mt76
fat mac80211 snd_hda_codec_hdmi snd_hda_intel edac_mce_amd
snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio uvcvideo
snd_hda_codec kvm_amd btusb snd_usbmidi_lib uvc snd_hda_core snd_ump
btrtl videobuf2_vmalloc btintel videobuf2_memops snd_rawmidi snd_hwdep
videobuf2_v4l2 btbcm btmtk snd_seq videobuf2_common libarc4
snd_seq_device kvm bluetooth ledtrig_netdev videodev snd_pcm cfg80211
joydev asus_nb_wmi eeepc_wmi mc snd_timer asus_wmi sparse_keymap rapl
apple_mfi_fastcharge snd wmi_bmof platform_profile
[24119.281465] pcspkr igc k10temp soundcore i2c_piix4 rfkill
gpio_amdpt gpio_generic loop nfnetlink zram amdgpu crct10dif_pclmul
crc32_pclmul crc32c_intel amdxcp polyval_clmulni i2c_algo_bit
polyval_generic drm_ttm_helper ttm nvme drm_exec ghash_clmulni_intel
gpu_sched drm_suballoc_helper sha512_ssse3 drm_buddy nvme_core ccp
sha256_ssse3 drm_display_helper sha1_ssse3 sp5100_tco video nvme_auth
wmi hid_apple ip6_tables ip_tables fuse
[24119.281496] CPU: 30 PID: 221 Comm: kcompactd0 Tainted: G W
L 6.9.0-test-5f16eb0549ab-with-drop-RANDOM_ORVALUE-bits+ #34
[24119.281498] Hardware name: ASUS System Product Name/ROG STRIX
B650E-I GAMING WIFI, BIOS 2611 04/07/2024
[24119.281500] Call Trace:
[24119.281502] <TASK>
[24119.281503] dump_stack_lvl+0x84/0xd0
[24119.281508] bad_page.cold+0xbe/0xe0
[24119.281510] ? __pfx_bad_page+0x10/0x10
[24119.281514] ? page_bad_reason+0x9d/0x1f0
[24119.281517] free_unref_page+0x838/0x10e0
[24119.281520] __folio_put+0x1ba/0x2b0
[24119.281523] ? __pfx___folio_put+0x10/0x10
[24119.281525] ? __pfx___might_resched+0x10/0x10
[24119.281528] ? migrate_folio_done+0x1de/0x2b0
[24119.281531] migrate_pages_batch+0xe73/0x2880
[24119.281534] ? __pfx_compaction_alloc+0x10/0x10
[24119.281536] ? __pfx_compaction_free+0x10/0x10
[24119.281539] ? __pfx_migrate_pages_batch+0x10/0x10
[24119.281543] ? rcu_is_watching+0x12/0xc0
[24119.281546] migrate_pages+0x194f/0x22f0
[24119.281548] ? __pfx_compaction_alloc+0x10/0x10
[24119.281550] ? __pfx_compaction_free+0x10/0x10
[24119.281553] ? __pfx_migrate_pages+0x10/0x10
[24119.281555] ? rcu_is_watching+0x12/0xc0
[24119.281557] ? isolate_migratepages_block+0x2b02/0x4560
[24119.281561] ? __pfx_isolate_migratepages_block+0x10/0x10
[24119.281563] ? folio_putback_lru+0x5e/0xb0
[24119.281566] compact_zone+0x1a7c/0x3860
[24119.281569] ? rcu_is_watching+0x12/0xc0
[24119.281571] ? __pfx___free_object+0x10/0x10
[24119.281575] ? __pfx_compact_zone+0x10/0x10
[24119.281577] ? rcu_is_watching+0x12/0xc0
[24119.281579] ? lock_acquire+0x457/0x540
[24119.281581] ? kcompactd+0x2fa/0xc70
[24119.281583] ? rcu_is_watching+0x12/0xc0
[24119.281585] compact_node+0x144/0x240
[24119.281588] ? __pfx_compact_node+0x10/0x10
[24119.281593] ? rcu_is_watching+0x12/0xc0
[24119.281595] kcompactd+0x686/0xc70
[24119.281598] ? __pfx_kcompactd+0x10/0x10
[24119.281600] ? __pfx_autoremove_wake_function+0x10/0x10
[24119.281603] ? __kthread_parkme+0xb1/0x1d0
[24119.281605] ? __pfx_kcompactd+0x10/0x10
[24119.281608] ? __pfx_kcompactd+0x10/0x10
[24119.281610] kthread+0x2d2/0x3a0
[24119.281612] ? _raw_spin_unlock_irq+0x28/0x60
[24119.281614] ? __pfx_kthread+0x10/0x10
[24119.281616] ret_from_fork+0x31/0x70
[24119.281618] ? __pfx_kthread+0x10/0x10
[24119.281620] ret_from_fork_asm+0x1a/0x30
[24119.281624] </TASK>
[24171.367867] watchdog: BUG: soft lockup - CPU#25 stuck for 26s!
[kworker/u130:3:2474335]
--
Best Regards,
Mike Gavrilov.
On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <[email protected]> wrote:
> >
> > [...]
> >
>
> The only known workload that causes this is updating a large
> container. Unfortunately, not every container update reproduces the
> problem.
Is it possible to add more debugging information to make it clearer
what's going on?
BUG: Bad page state in process kcompactd0 pfn:605811
page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f
pfn:0x605811
aops:btree_aops ino:1
flags: 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
Modules linked in: overlay tun uvcvideo uvc videobuf2_vmalloc
videobuf2_memops videobuf2_v4l2 videobuf2_common videodev rndis_host
uas cdc_ether usbnet usb_storage mii uinput rfcomm snd_seq_dummy
snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep
sunrpc binfmt_misc snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi
mc mt76x2u snd_hda_codec_hdmi mt76x2_common mt7921e mt76x02_usb
mt7921_common mt76_usb mt792x_lib snd_hda_intel intel_rapl_msr amd_atl
snd_intel_dspcfg mt76_connac_lib mt76x02_lib intel_rapl_common
snd_intel_sdw_acpi btusb snd_hda_codec mt76 btrtl edac_mce_amd btintel
snd_hda_core btbcm vfat btmtk snd_hwdep mac80211 fat snd_seq bluetooth
snd_seq_device kvm_amd eeepc_wmi asus_nb_wmi libarc4 joydev
ledtrig_netdev hid_apple snd_pcm kvm asus_wmi cfg80211
apple_mfi_fastcharge snd_timer
sparse_keymap platform_profile pcspkr wmi_bmof snd rapl rfkill
soundcore igc k10temp i2c_piix4 gpio_amdpt gpio_generic loop nfnetlink
zram amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni
polyval_generic amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec
ghash_clmulni_intel gpu_sched sha512_ssse3 nvme drm_suballoc_helper
drm_buddy sha256_ssse3 nvme_core sha1_ssse3 drm_display_helper ccp
sp5100_tco video nvme_auth wmi ip6_tables ip_tables fuse
CPU: 17 PID: 221 Comm: kcompactd0 Tainted: G W L -------
--- 6.10.0-0.rc1.18.fc41.x86_64+debug #1
Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
BIOS 2611 04/07/2024
Call Trace:
<TASK>
dump_stack_lvl+0x84/0xd0
bad_page.cold+0xbe/0xe0
? __pfx_bad_page+0x10/0x10
? page_bad_reason+0x9d/0x1f0
free_unref_page+0x838/0x10e0
__folio_put+0x1ba/0x2b0
? __pfx___folio_put+0x10/0x10
? __pfx___might_resched+0x10/0x10
? migrate_folio_done+0x1de/0x2b0
migrate_pages_batch+0xe73/0x2880
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __pfx_migrate_pages_batch+0x10/0x10
? trace_irq_enable.constprop.0+0xce/0x110
? __pfx_remove_migration_pte+0x10/0x10
? rcu_is_watching+0x12/0xc0
migrate_pages+0x194f/0x22f0
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __pfx_migrate_pages+0x10/0x10
? trace_irq_enable.constprop.0+0xce/0x110
? rcu_is_watching+0x12/0xc0
? isolate_migratepages_block+0x2b02/0x4560
? __pfx_isolate_migratepages_block+0x10/0x10
? __pfx___might_resched+0x10/0x10
compact_zone+0x1a7c/0x3860
? rcu_is_watching+0x12/0xc0
? __pfx___free_object+0x10/0x10
? __pfx_compact_zone+0x10/0x10
? rcu_is_watching+0x12/0xc0
? lock_acquire+0x457/0x540
? kcompactd+0x2fa/0xc70
? rcu_is_watching+0x12/0xc0
compact_node+0x144/0x240
? __pfx_compact_node+0x10/0x10
? rcu_is_watching+0x12/0xc0
kcompactd+0x686/0xc70
? __pfx_kcompactd+0x10/0x10
? __pfx_autoremove_wake_function+0x10/0x10
? __kthread_parkme+0xb1/0x1d0
? __pfx_kcompactd+0x10/0x10
? __pfx_kcompactd+0x10/0x10
kthread+0x2d2/0x3a0
? _raw_spin_unlock_irq+0x28/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
--
Best Regards,
Mike Gavrilov.
On 28.05.24 at 08:05, Mikhail Gavrilov wrote:
> On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> [...]
>>>
>>
>> The only known workload that causes this is updating a large
>> container. Unfortunately, not every container update reproduces the
>> problem.
>
> Is it possible to add more debugging information to make it clearer
> what's going on?
If we knew who originally allocated that problematic page, that might help.
Maybe page_owner could give some hints?
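(For page_owner you'd need a kernel built with CONFIG_PAGE_OWNER=y and booted
with page_owner=on on the kernel command line; the allocation stacks can then
be dumped from /sys/kernel/debug/page_owner. I'm writing this from memory, so
double-check Documentation/mm/page_owner.rst for the exact steps.)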
>
> BUG: Bad page state in process kcompactd0 pfn:605811
> page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f
> pfn:0x605811
> aops:btree_aops ino:1
> flags: 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
> raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
> raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: non-NULL mapping
Seems to be an order-0 page, otherwise we would have another "head: ..." report.
It's not an anon/ksm/non-lru migration folio, because we clear the page->mapping
field for them manually on the page freeing path. Likely it's a pagecache folio.
So one option is that something is not properly setting folio->mapping to
NULL. But then that problem would also show up without page migration? Hmm.
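For a pagecache folio, that normally happens when the folio is removed from
the page cache; from memory (simplified from mm/filemap.c's
page_cache_delete()), the relevant bit is roughly:

	xas_store(&xas, shadow);	/* remove the folio from the xarray */
	folio->mapping = NULL;		/* ... and clear the back-pointer */

so a freed pagecache folio with folio->mapping still set suggests it was
never properly removed from the page cache in the first place.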
> Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
> BIOS 2611 04/07/2024
> Call Trace:
> <TASK>
> dump_stack_lvl+0x84/0xd0
> bad_page.cold+0xbe/0xe0
> ? __pfx_bad_page+0x10/0x10
> ? page_bad_reason+0x9d/0x1f0
> free_unref_page+0x838/0x10e0
> __folio_put+0x1ba/0x2b0
> ? __pfx___folio_put+0x10/0x10
> ? __pfx___might_resched+0x10/0x10
I suspect we come via
migrate_pages_batch()->migrate_folio_unmap()->migrate_folio_done().
Maybe this is the "Folio was freed from under us. So we are done." path
when "folio_ref_count(src) == 1".
Alternatively, we might come via
migrate_pages_batch()->migrate_folio_move()->migrate_folio_done().
For ordinary migration, move_to_new_folio() will clear src->mapping if
the folio was migrated successfully. That's the very first thing that
migrate_folio_move() does, so I doubt that is the problem.
So I suspect we are in the migrate_folio_unmap() path. But for
a !anon folio, who should be freeing the folio concurrently (and not clearing
folio->mapping)? After all, we have to hold the folio lock while migrating.
In khugepaged:collapse_file() we manually set folio->mapping = NULL, before
dropping the reference.
Something to try (to see if the problem goes away) might be:
diff --git a/mm/migrate.c b/mm/migrate.c
index dd04f578c19c..45e92e14c904 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1124,6 +1124,13 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 		/* Folio was freed from under us. So we are done. */
 		folio_clear_active(src);
 		folio_clear_unevictable(src);
+		/*
+		 * Anonymous and movable src->mapping will be cleared by
+		 * free_pages_prepare so don't reset it here for keeping
+		 * the type to work PageAnon, for example.
+		 */
+		if (!folio_mapping_flags(src))
+			src->mapping = NULL;
 		/* free_pages_prepare() will clear PG_isolated. */
 		list_del(&src->lru);
 		migrate_folio_done(src, reason);
But it does feel weird: who freed the page concurrently and didn't clear
folio->mapping ...
We don't hold the folio lock of src, though, but have the only reference. So
another possible thing might be folio refcount mis-counting: folio_ref_count()
== 1 but there are other references (e.g., from the pagecache).
> [...]
--
Thanks,
David / dhildenb
On 28.05.24 at 15:57, David Hildenbrand wrote:
> [...]
>
> We don't hold the folio lock of src, though, but have the only reference. So
> another possible thing might be folio refcount mis-counting: folio_ref_count()
> == 1 but there are other references (e.g., from the pagecache).
Hmm, your original report mentions kswapd, so I'm getting the feeling someone
does one folio_put() too much and we are freeing a pagecache folio that is still
in the pagecache and, therefore, has folio->mapping set ... bisecting would
really help.
--
Thanks,
David / dhildenb
On 28.05.24 16:24, David Hildenbrand wrote:
> On 28.05.24 at 15:57, David Hildenbrand wrote:
>> [...]
>
> [...]
>
> Hmm, your original report mentions kswapd, so I'm getting the feeling someone
> does one folio_put() too much and we are freeing a pagecache folio that is still
> in the pagecache and, therefore, has folio->mapping set ... bisecting would
> really help.
>
A little bird just told me that I missed an important piece in the dmesg
output: "aops:btree_aops ino:1" from dump_mapping():
This is btrfs, i_ino is 1, and we don't have a dentry. Is that
BTRFS_BTREE_INODE_OBJECTID?
Summarizing what we know so far:
(1) Freeing an order-0 btrfs folio where folio->mapping
is still set
(2) Triggered by kswapd and kcompactd; not triggered by other means of
page freeing so far
Possible theories:
(A) folio->mapping not cleared when freeing the folio. But shouldn't
this also happen on other freeing paths? Or are we simply lucky to
never trigger that for that folio?
(B) Messed-up refcounting: freeing a folio that is still in use (and
therefore has folio-> mapping still set)
I was briefly wondering if large folio splitting could be involved.
CCing btrfs maintainers.
--
Cheers,
David / dhildenb
On Wed, May 29, 2024 at 08:57:48AM +0200, David Hildenbrand wrote:
> On 28.05.24 16:24, David Hildenbrand wrote:
> > Hmm, your original report mentions kswapd, so I'm getting the feeling someone
> > does one folio_put() too much and we are freeing a pagecache folio that is still
> > in the pagecache and, therefore, has folio->mapping set ... bisecting would
> > really help.
>
> A little bird just told me that I missed an important piece in the dmesg
> output: "aops:btree_aops ino:1" from dump_mapping():
>
> This is btrfs, i_ino is 1, and we don't have a dentry. Is that
> BTRFS_BTREE_INODE_OBJECTID?
Yes, that's right, inode number 1 represents the metadata.
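For reference, the objectid is defined as 1 (in include/uapi/linux/btrfs_tree.h,
if I remember the location right):

#define BTRFS_BTREE_INODE_OBJECTID 1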
> Summarizing what we know so far:
> (1) Freeing an order-0 btrfs folio where folio->mapping
> is still set
> (2) Triggered by kswapd and kcompactd; not triggered by other means of
> page freeing so far
>
> Possible theories:
> (A) folio->mapping not cleared when freeing the folio. But shouldn't
> this also happen on other freeing paths? Or are we simply lucky to
> never trigger that for that folio?
> (B) Messed-up refcounting: freeing a folio that is still in use (and
> therefore has folio-> mapping still set)
>
> I was briefly wondering if large folio splitting could be involved.
We do not have large folios enabled for btrfs; the conversion from pages
to folios is still ongoing.
With the increased number of strange reports, either from syzbot or others,
it seems that something went wrong in the 6.10-rc update, or maybe
earlier.
On 2024/5/29 16:27, David Hildenbrand wrote:
> On 28.05.24 16:24, David Hildenbrand wrote:
[...]
>> Hmm, your original report mentions kswapd, so I'm getting the feeling someone
>> does one folio_put() too much and we are freeing a pagecache folio that is still
>> in the pagecache and, therefore, has folio->mapping set ... bisecting would
>> really help.
>>
>
> A little bird just told me that I missed an important piece in the dmesg
> output: "aops:btree_aops ino:1" from dump_mapping():
>
> This is btrfs, i_ino is 1, and we don't have a dentry. Is that
> BTRFS_BTREE_INODE_OBJECTID?
>
> Summarizing what we know so far:
> (1) Freeing an order-0 btrfs folio where folio->mapping
> is still set
> (2) Triggered by kswapd and kcompactd; not triggered by other means of
> page freeing so far
From the implementation of filemap_migrate_folio() (and the previous
migrate_page_move_mapping()), it looks like the migration only involves:
- Migrate the mapping
- Copy the page private value
- Copy the contents (if needed)
- Copy all the page flags
The most recent change to that migration code is from v6.0, which I do not
believe is the cause at all.
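Roughly (my own paraphrase of those steps, not the exact mm/migrate.c code),
it boils down to:

static int filemap_migrate_folio_sketch(struct address_space *mapping,
					struct folio *dst, struct folio *src)
{
	int ret;

	/* 1) Move the page cache entry from src to dst. */
	ret = folio_migrate_mapping(mapping, dst, src, 0);
	if (ret != MIGRATEPAGE_SUCCESS)
		return ret;

	/* 2) Carry over folio::private (for metadata that's where the eb lives). */
	if (folio_get_private(src))
		folio_attach_private(dst, folio_detach_private(src));

	/* 3) + 4) Copy the contents and the relevant page flags. */
	folio_migrate_copy(dst, src);

	return MIGRATEPAGE_SUCCESS;
}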
>
> Possible theories:
> (A) folio->mapping not cleared when freeing the folio. But shouldn't
> this also happen on other freeing paths? Or are we simply lucky to
> never trigger that for that folio?
Yeah, in fact we never manually clear folio->mapping inside btrfs, so
I'm not sure that is the case.
> (B) Messed-up refcounting: freeing a folio that is still in use (and
> therefore has folio-> mapping still set)
>
> I was briefly wondering if large folio splitting could be involved.
Although we have all the metadata support for larger folios, we do not
yet enable it.
My current guess is: could it be some race with this commit?
09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method")
For example, we're allocating an extent buffer (btrfs' metadata
structure), one page is already attached to the page cache, and that page
gets migrated while the remaining pages are not yet attached?
It was first introduced in v6.8, which matches the earliest report.
But that patch is not easy to revert.
Do you have any extra reproducibility, or an extra way to debug the
lifespan of that specific page?
Or is there any way to temporarily disable migration?
Thanks,
Qu
>
> CCing btrfs maintainers.
>
On 2024/5/30 08:07, Qu Wenruo wrote:
>
>
> On 2024/5/29 16:27, David Hildenbrand wrote:
>> [...]
>
> [...]
After some extra code digging and tons of trace_printk(), it indeed
looks like btrfs is underflowing the folio refcount.
During the lifespan of an extent buffer (btrfs' metadata), it should have
at least 3 refs after being attached to the address space:
1) folio_alloc() inside btrfs_alloc_folio_array()
2) folio_ref_add() inside __filemap_add_folio()
3) folio_add_lru() inside filemap_add_folio()
Even when btrfs wants to release the folio of an eb, we only do:
- Detach folio::private
- folio_put()
So even if an eb got released, as long as it is not yet detached from the
filemap, its refcount should still be >= 2.
Thus the warning is indeed correct: somehow btrfs called an extra
folio_put() on an eb page that is still attached to the btree inode.
I'll continue digging around the eb folio refs inside btrfs; meanwhile I
will also test some extra checks on the refcount of eb folios.
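Something along these lines (a purely hypothetical check, not an actual
patch) is what I have in mind for those extra checks:

/* An eb folio still attached to the btree inode should hold at least the
 * page cache ref plus the private/eb ref. */
static inline void assert_eb_folio_refs(struct folio *folio)
{
	if (folio->mapping && folio_test_private(folio))
		WARN_ON_ONCE(folio_ref_count(folio) < 2);
}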
Thanks,
Qu