2024-04-12 06:44:29

by zhaoyang.huang

Subject: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

From: Zhaoyang Huang <[email protected]>

The livelock in [1] has been reported multiple times since v5.15, where a
zero-ref folio is repeatedly found in the page cache by find_get_entry. A
possible timing sequence is proposed in [2]; briefly, the lockless xarray
operation can be harmed by an illegal (zero-ref) folio remaining in
slot[offset]. This commit protects the xa split steps (folio_ref_freeze
and __split_huge_page) under lruvec->lru_lock to close the race window.

[1]
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0 state:R running task stack: 24 pid: 155 ppid: 2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0 state:R running task stack: 24 pid: 155 ppid: 2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342] dump_backtrace.cfi_jt+0x0/0x8
[167789.800355] show_stack+0x1c/0x2c
[167789.800363] sched_show_task+0x1ac/0x27c
[167789.800370] print_other_cpu_stall+0x314/0x4dc
[167789.800377] check_cpu_stall+0x1c4/0x36c
[167789.800382] rcu_sched_clock_irq+0xe8/0x388
[167789.800389] update_process_times+0xa0/0xe0
[167789.800396] tick_sched_timer+0x7c/0xd4
[167789.800404] __run_hrtimer+0xd8/0x30c
[167789.800408] hrtimer_interrupt+0x1e4/0x2d0
[167789.800414] arch_timer_handler_phys+0x5c/0xa0
[167789.800423] handle_percpu_devid_irq+0xbc/0x318
[167789.800430] handle_domain_irq+0x7c/0xf0
[167789.800437] gic_handle_irq+0x54/0x12c
[167789.800445] call_on_irq_stack+0x40/0x70
[167789.800451] do_interrupt_handler+0x44/0xa0
[167789.800457] el1_interrupt+0x34/0x64
[167789.800464] el1h_64_irq_handler+0x1c/0x2c
[167789.800470] el1h_64_irq+0x7c/0x80
[167789.800474] xas_find+0xb4/0x28c
[167789.800481] find_get_entry+0x3c/0x178
[167789.800487] find_lock_entries+0x98/0x2f8
[167789.800492] __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500] invalidate_mapping_pages+0x18/0x28
[167789.800506] inode_lru_isolate+0x140/0x2a4
[167789.800512] __list_lru_walk_one+0xd8/0x204
[167789.800519] list_lru_walk_one+0x64/0x90
[167789.800524] prune_icache_sb+0x54/0xe0
[167789.800529] super_cache_scan+0x160/0x1ec
[167789.800535] do_shrink_slab+0x20c/0x5c0
[167789.800541] shrink_slab+0xf0/0x20c
[167789.800546] shrink_node_memcgs+0x98/0x320
[167789.800553] shrink_node+0xe8/0x45c
[167789.800557] balance_pgdat+0x464/0x814
[167789.800563] kswapd+0xfc/0x23c
[167789.800567] kthread+0x164/0x1c8
[167789.800573] ret_from_fork+0x10/0x20

[2]
Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a range of
pages into cc->migratepages via pfn
(the folio has refcount 1 + n: alloc_pages, page_cache)

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) sets folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split splits the entry so that
each slot from slot[offset] to slot[offset + sibs] holds its own folio

4. alloc_contig_range->migrate_pages->__split_huge_page->folio_lruvec_lock
stalls on a contended lru_lock, leaving the folio's refcnt stuck at 0
instead of being restored to 2

5. Thread_kswapd enters the livelock via the chain below (a sketch of this
loop follows the list):
   rcu_read_lock();
retry:
   find_get_entry
       folio = xas_find
       if (!folio_try_get_rcu)
           xas_reset;
           goto retry;
   rcu_read_unlock();

5'. Thread_holdlock, as the lruvec->lru_lock holder, could be stalled on
the same core as Thread_kswapd.
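
For reference, below is a simplified sketch of the reader side in step 5
(the shape is assumed from the find_get_entry path and is not verbatim
kernel code; the _sketch suffix marks it as illustrative). While the
frozen folio (refcount == 0) still occupies the slot, folio_try_get_rcu()
keeps failing, the walk is reset, and the loop never terminates:

static void *find_get_entry_sketch(struct xa_state *xas, pgoff_t max)
{
	struct folio *folio;

retry:
	folio = xas_find(xas, max);
	if (xas_retry(xas, folio))
		goto retry;
	if (!folio || xa_is_value(folio))
		return folio;

	/* Fails as long as the splitter keeps the refcount frozen at 0. */
	if (!folio_try_get_rcu(folio)) {
		xas_reset(xas);
		goto retry;
	}

	/* The folio may have been replaced in the slot while taking the ref. */
	if (unlikely(folio != xas_reload(xas))) {
		folio_put(folio);
		xas_reset(xas);
		goto retry;
	}
	return folio;
}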

Signed-off-by: Zhaoyang Huang <[email protected]>
---
mm/huge_memory.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..418e8d03480a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2891,7 +2891,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
{
struct folio *folio = page_folio(page);
struct page *head = &folio->page;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = folio_lruvec(folio);
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
int i, nr_dropped = 0;
@@ -2908,8 +2908,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}

- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);

ClearPageHasHWPoisoned(head);

@@ -2942,7 +2940,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,

folio_set_order(new_folio, new_order);
}
- unlock_page_lruvec(lruvec);
/* Caller disabled irqs, so they are still disabled here */

split_page_owner(head, order, new_order);
@@ -2961,7 +2958,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_ref_add(folio, 1 + new_nr);
xa_unlock(&folio->mapping->i_pages);
}
- local_irq_enable();

if (nr_dropped)
shmem_uncharge(folio->mapping->host, nr_dropped);
@@ -3048,6 +3044,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
int extra_pins, ret;
pgoff_t end;
bool is_hzp;
+ struct lruvec *lruvec;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3159,6 +3156,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,

/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
+
+ /*
+ * Take the lruvec lock before freezing the folio to prevent the folio
+ * from remaining in the page cache with refcnt == 0, which could lead
+ * to find_get_entry entering a livelock while iterating the xarray.
+ */
+ lruvec = folio_lruvec_lock(folio);
+
if (mapping) {
/*
* Check if the folio is present in page cache.
@@ -3203,12 +3208,16 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
}

__split_huge_page(page, list, end, new_order);
+ unlock_page_lruvec(lruvec);
+ local_irq_enable();
ret = 0;
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
xas_unlock(&xas);
+
+ unlock_page_lruvec(lruvec);
local_irq_enable();
remap_page(folio, folio_nr_pages(folio));
ret = -EAGAIN;
--
2.25.1



2024-04-12 12:24:56

by Matthew Wilcox

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> Livelock in [1] is reported multitimes since v515, where the zero-ref
> folio is repeatly found on the page cache by find_get_entry. A possible
> timing sequence is proposed in [2], which can be described briefly as

I have no patience for going through another one of your "analyses".

1. Can you reproduce this bug without this patch?
2. Does the reproducer stop working after this patch?

Otherwise I'm not interested. Sorry. You burnt all my good will.

> the lockless xarray operation could get harmed by an illegal folio
> remaining on the slot[offset]. This commit would like to protect
> the xa split stuff(folio_ref_freeze and __split_huge_page) under
> lruvec->lock to remove the race window.

2024-04-12 21:35:05

by Andrew Morton

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <[email protected]> wrote:

> Livelock in [1] is reported multitimes since v515,

Are you able to provide us with a means by which others can reproduce this?

Thanks.

2024-04-13 02:01:49

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

Looping in Dave, since he previously helped set up a reproducer in
https://lore.kernel.org/linux-mm/[email protected]/
@Dave Chinner, could you kindly verify this patch in your environment if
convenient? Thanks a lot.


On Sat, Apr 13, 2024 at 5:34 AM Andrew Morton <[email protected]> wrote:
>
> On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <[email protected]> wrote:
>
> > Livelock in [1] is reported multitimes since v515,
>
> Are you able to provide us with a means by which others can reproduce this?
>
> Thanks.

2024-04-13 07:10:22

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Fri, Apr 12, 2024 at 8:24 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > Livelock in [1] is reported multitimes since v515, where the zero-ref
> > folio is repeatly found on the page cache by find_get_entry. A possible
> > timing sequence is proposed in [2], which can be described briefly as
>
> I have no patience for going through another one of your "analyses".
>
> 1. Can you reproduce this bug without this patch?
> 2. Does the reproducer stop working after this patch?
>
> Otherwise I'm not interested. Sorry. You burnt all my good will.

This bug has been reported many times, by three people including me (see
below), for at least two years. Have you ever tried to solve it? Did Dave
and Brian also burn your good will, if you ever had any? Be aware that
you are the maintainer who has the responsibility for maintaining this
code, not us. "Who would wear the crown must bear its weight, or give it
up." Put me on your SPAM list, thank you.

https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/Y0%2FkZbIvMgkNhWpM@bfoster/

>
> > the lockless xarray operation could get harmed by an illegal folio
> > remaining on the slot[offset]. This commit would like to protect
> > the xa split stuff(folio_ref_freeze and __split_huge_page) under
> > lruvec->lock to remove the race window.

2024-04-15 01:50:41

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
>
> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > loop Dave, since he has ever helped set up an reproducer in
> > https://lore.kernel.org/linux-mm/[email protected]/
> > @Dave Chinner , I would like to ask for your kindly help on if you can
> > verify this patch on your environment if convenient. Thanks a lot.
>
> I don't have the test environment from 18 months ago available any
> more. Also, I haven't seen this problem since that specific test
> environment tripped over the issue. Hence I don't have any way of
> confirming that the problem is fixed, either, because first I'd have
> to reproduce it...
Thanks for the information. I noticed that you reported another soft
lockup related to xas_load back in November 2023. This patch is expected
to help with that as well. With regard to the version timing, this commit
is effectively a revert of commit b6769834aac1d467fa1c71277d15688efcbb4d76
("mm/thp: narrow lru locking"), which was merged before v5.15.

To save you time, here is a brief description. IMO, b6769834aa introduced
a potential stall between freezing the folio's refcount and restoring it
to 2, which turns the xas_load->folio_try_get_rcu loop into a livelock if
the lru_lock holder is stalled.

b6769834aa:
split_huge_page_to_list()
-	spin_lock(lru_lock)
	xas_split(&xas, folio, order)
	folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
+	spin_lock(lru_lock)
	xas_store(&xas, offset++, head + i)
	page_ref_add(head, 2)
	spin_unlock(lru_lock)

Sorry in advance if the above doesn't make sense; I am just a developer
who is also suffering from this bug and trying to fix it.
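
To make the window concrete, here is a minimal ordering sketch of the
splitter side after b6769834aa (paraphrased and heavily simplified; the
function name split_window_sketch, the omission of the tail loop and
error handling, and the exact call sites are assumptions for
illustration, not the verbatim split path):

static void split_window_sketch(struct folio *folio, struct xa_state *xas,
				int extra_pins)
{
	struct lruvec *lruvec;

	local_irq_disable();
	xas_lock(xas);				/* mapping->i_pages lock */

	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		/* Refcount is now 0, yet the folio is still visible in the tree. */
		xas_split(xas, folio, folio_order(folio));

		/* 5': may stall here while find_get_entry spins on the zero-ref folio. */
		lruvec = folio_lruvec_lock(folio);

		/* Tail slots are stored and refcounts restored only after the stall. */
		page_ref_add(&folio->page, 2);
		unlock_page_lruvec(lruvec);
	}

	xas_unlock(xas);
	local_irq_enable();
}

The patch above effectively moves folio_lruvec_lock() in front of
folio_ref_freeze(), so the zero-ref window can no longer outlive a
contended lru_lock.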
>
> -Dave.
> --
> Dave Chinner
> [email protected]

2024-05-20 19:50:50

by Marcin Wanat

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On 15.04.2024 03:50, Zhaoyang Huang wrote:
> On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
>>
>> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
>>> loop Dave, since he has ever helped set up an reproducer in
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> @Dave Chinner , I would like to ask for your kindly help on if you can
>>> verify this patch on your environment if convenient. Thanks a lot.
>>
>> I don't have the test environment from 18 months ago available any
>> more. Also, I haven't seen this problem since that specific test
>> environment tripped over the issue. Hence I don't have any way of
>> confirming that the problem is fixed, either, because first I'd have
>> to reproduce it...
> Thanks for the information. I noticed that you reported another soft
> lockup which is related to xas_load since NOV.2023. This patch is
> supposed to be helpful for this. With regard to the version timing,
> this commit is actually a revert of <mm/thp: narrow lru locking>
> b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before v5.15.
>
> For saving your time, a brief description below. IMO, b6769834aa
> introduce a potential stall between freeze the folio's refcnt and
> store it back to 2, which have the xas_load->folio_try_get_rcu loops
> as livelock if it stalls the lru_lock's holder.
>
> b6769834aa
> split_huge_page_to_list
> - spin_lock(lru_lock)
> xas_split(&xas, folio, order)
> folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> + spin_lock(lru_lock)
> xas_store(&xas, offset++, head+i)
> page_ref_add(head, 2)
> spin_unlock(lru_lock)
>
> Sorry in advance if the above doesn't make sense, I am just a
> developer who is also suffering from this bug and trying to fix it
I am experiencing a similar error on dozens of hosts, with stack traces
that are all similar:

[627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
[file_get:953301]
[627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
scsi_transport_sas i2c_algo_bit wmi
[627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
Tainted: G             L     6.6.30.el9 #2
[627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
2.21.2 02/19/2024
[627163.727847] RIP: 0010:xas_descend+0x1b/0x70
[627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
[627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
[627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
0000000000000020
[627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
ffffc90034a67990
[627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
0000000000008820
[627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
ffffc90034a67a20
[627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
ffffc90034a67a18
[627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
knlGS:0000000000000000
[627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
00000000007706e0
[627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[627163.727878] PKRU: 55555554
[627163.727879] Call Trace:
[627163.727882]  <IRQ>
[627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
[627163.727892]  ? softlockup_fn+0x70/0x70
[627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
[627163.727903]  ? hrtimer_interrupt+0x106/0x240
[627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
[627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
[627163.727917]  </IRQ>
[627163.727918]  <TASK>
[627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[627163.727927]  ? xas_descend+0x1b/0x70
[627163.727930]  xas_load+0x2c/0x40
[627163.727933]  xas_find+0x161/0x1a0
[627163.727937]  find_get_entries+0x77/0x1d0
[627163.727944]  truncate_inode_pages_range+0x244/0x3f0
[627163.727950]  truncate_pagecache+0x44/0x60
[627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
[627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
[627163.728153]  notify_change+0x34f/0x4f0
[627163.728158]  ? _raw_spin_lock+0x13/0x30
[627163.728165]  ? do_truncate+0x80/0xd0
[627163.728169]  do_truncate+0x80/0xd0
[627163.728172]  do_open+0x2ce/0x400
[627163.728177]  path_openat+0x10d/0x280
[627163.728181]  do_filp_open+0xb2/0x150
[627163.728186]  ? check_heap_object+0x34/0x190
[627163.728189]  ? __check_object_size.part.0+0x5a/0x130
[627163.728194]  do_sys_openat2+0x92/0xc0
[627163.728197]  __x64_sys_openat+0x53/0x90
[627163.728200]  do_syscall_64+0x35/0x80
[627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
[627163.728210] RIP: 0033:0x7fc5e493e7fb
[627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
00007fc5e493e7fb
[627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
00000000ffffff9c
[627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
0000000000000001
[627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
0000000000000241
[627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
0000000000000000
[627163.728227]  </TASK>

I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
However, with long-term kernels 6.1.XX and 6.6.XX,
(tested at least 10 different versions), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will appear.

I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.

Regards,
Marcin

2024-05-21 00:59:19

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
>
> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> >>> loop Dave, since he has ever helped set up an reproducer in
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >>> @Dave Chinner , I would like to ask for your kindly help on if you can
> >>> verify this patch on your environment if convenient. Thanks a lot.
> >>
> >> I don't have the test environment from 18 months ago available any
> >> more. Also, I haven't seen this problem since that specific test
> >> environment tripped over the issue. Hence I don't have any way of
> >> confirming that the problem is fixed, either, because first I'd have
> >> to reproduce it...
> > Thanks for the information. I noticed that you reported another soft
> > lockup which is related to xas_load since NOV.2023. This patch is
> > supposed to be helpful for this. With regard to the version timing,
> > this commit is actually a revert of <mm/thp: narrow lru locking>
> > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before v5.15.
> >
> > For saving your time, a brief description below. IMO, b6769834aa
> > introduce a potential stall between freeze the folio's refcnt and
> > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > as livelock if it stalls the lru_lock's holder.
> >
> > b6769834aa
> > split_huge_page_to_list
> > - spin_lock(lru_lock)
> > xas_split(&xas, folio, order)
> > folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > + spin_lock(lru_lock)
> > xas_store(&xas, offset++, head+i)
> > page_ref_add(head, 2)
> > spin_unlock(lru_lock)
> >
> > Sorry in advance if the above doesn't make sense, I am just a
> > developer who is also suffering from this bug and trying to fix it
> I am experiencing a similar error on dozens of hosts, with stack traces
> that are all similar:
>
> [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> [file_get:953301]
> [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> scsi_transport_sas i2c_algo_bit wmi
> [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> Tainted: G L 6.6.30.el9 #2
> [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> 2.21.2 02/19/2024
> [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> 0000000000000020
> [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> ffffc90034a67990
> [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> 0000000000008820
> [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> ffffc90034a67a20
> [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> ffffc90034a67a18
> [627163.727870] FS: 00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> knlGS:0000000000000000
> [627163.727871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> 00000000007706e0
> [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [627163.727878] PKRU: 55555554
> [627163.727879] Call Trace:
> [627163.727882] <IRQ>
> [627163.727886] ? watchdog_timer_fn+0x22a/0x2a0
> [627163.727892] ? softlockup_fn+0x70/0x70
> [627163.727895] ? __hrtimer_run_queues+0x10f/0x2a0
> [627163.727903] ? hrtimer_interrupt+0x106/0x240
> [627163.727906] ? __sysvec_apic_timer_interrupt+0x68/0x170
> [627163.727913] ? sysvec_apic_timer_interrupt+0x9d/0xd0
> [627163.727917] </IRQ>
> [627163.727918] <TASK>
> [627163.727920] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [627163.727927] ? xas_descend+0x1b/0x70
> [627163.727930] xas_load+0x2c/0x40
> [627163.727933] xas_find+0x161/0x1a0
> [627163.727937] find_get_entries+0x77/0x1d0
> [627163.727944] truncate_inode_pages_range+0x244/0x3f0
> [627163.727950] truncate_pagecache+0x44/0x60
> [627163.727955] xfs_setattr_size+0x168/0x490 [xfs]
> [627163.728074] xfs_vn_setattr+0x78/0x140 [xfs]
> [627163.728153] notify_change+0x34f/0x4f0
> [627163.728158] ? _raw_spin_lock+0x13/0x30
> [627163.728165] ? do_truncate+0x80/0xd0
> [627163.728169] do_truncate+0x80/0xd0
> [627163.728172] do_open+0x2ce/0x400
> [627163.728177] path_openat+0x10d/0x280
> [627163.728181] do_filp_open+0xb2/0x150
> [627163.728186] ? check_heap_object+0x34/0x190
> [627163.728189] ? __check_object_size.part.0+0x5a/0x130
> [627163.728194] do_sys_openat2+0x92/0xc0
> [627163.728197] __x64_sys_openat+0x53/0x90
> [627163.728200] do_syscall_64+0x35/0x80
> [627163.728206] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [627163.728210] RIP: 0033:0x7fc5e493e7fb
> [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000101
> [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> 00007fc5e493e7fb
> [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> 00000000ffffff9c
> [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> 0000000000000001
> [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> 0000000000000241
> [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> 0000000000000000
> [627163.728227] </TASK>
>
> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> However, with long-term kernels 6.1.XX and 6.6.XX,
> (tested at least 10 different versions), this lockup always appears
> after 2-30 days, similar to the report in the original thread.
> The more load (for example, copying a lot of local files while
> serving 20Gbps traffic), the higher the chance that the bug will appear.
>
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
Could you please try this one, which can be applied on 6.6 directly? Thank you!
>
> Regards,
> Marcin

2024-05-21 01:00:30

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <[email protected]> wrote:
>
> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
> >
> > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
> > >>
> > >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > >>> loop Dave, since he has ever helped set up an reproducer in
> > >>> https://lore.kernel.org/linux-mm/[email protected]/
> > >>> @Dave Chinner , I would like to ask for your kindly help on if you can
> > >>> verify this patch on your environment if convenient. Thanks a lot.
> > >>
> > >> I don't have the test environment from 18 months ago available any
> > >> more. Also, I haven't seen this problem since that specific test
> > >> environment tripped over the issue. Hence I don't have any way of
> > >> confirming that the problem is fixed, either, because first I'd have
> > >> to reproduce it...
> > > Thanks for the information. I noticed that you reported another soft
> > > lockup which is related to xas_load since NOV.2023. This patch is
> > > supposed to be helpful for this. With regard to the version timing,
> > > this commit is actually a revert of <mm/thp: narrow lru locking>
> > > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before v5.15.
> > >
> > > For saving your time, a brief description below. IMO, b6769834aa
> > > introduce a potential stall between freeze the folio's refcnt and
> > > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > > as livelock if it stalls the lru_lock's holder.
> > >
> > > b6769834aa
> > > split_huge_page_to_list
> > > - spin_lock(lru_lock)
> > > xas_split(&xas, folio, order)
> > > folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > > + spin_lock(lru_lock)
> > > xas_store(&xas, offset++, head+i)
> > > page_ref_add(head, 2)
> > > spin_unlock(lru_lock)
> > >
> > > Sorry in advance if the above doesn't make sense, I am just a
> > > developer who is also suffering from this bug and trying to fix it
> > I am experiencing a similar error on dozens of hosts, with stack traces
> > that are all similar:
> >
> > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > [file_get:953301]
> > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > scsi_transport_sas i2c_algo_bit wmi
> > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > Tainted: G L 6.6.30.el9 #2
> > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > 2.21.2 02/19/2024
> > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > 0000000000000020
> > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > ffffc90034a67990
> > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > 0000000000008820
> > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > ffffc90034a67a20
> > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > ffffc90034a67a18
> > [627163.727870] FS: 00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > knlGS:0000000000000000
> > [627163.727871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > 00000000007706e0
> > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [627163.727878] PKRU: 55555554
> > [627163.727879] Call Trace:
> > [627163.727882] <IRQ>
> > [627163.727886] ? watchdog_timer_fn+0x22a/0x2a0
> > [627163.727892] ? softlockup_fn+0x70/0x70
> > [627163.727895] ? __hrtimer_run_queues+0x10f/0x2a0
> > [627163.727903] ? hrtimer_interrupt+0x106/0x240
> > [627163.727906] ? __sysvec_apic_timer_interrupt+0x68/0x170
> > [627163.727913] ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > [627163.727917] </IRQ>
> > [627163.727918] <TASK>
> > [627163.727920] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [627163.727927] ? xas_descend+0x1b/0x70
> > [627163.727930] xas_load+0x2c/0x40
> > [627163.727933] xas_find+0x161/0x1a0
> > [627163.727937] find_get_entries+0x77/0x1d0
> > [627163.727944] truncate_inode_pages_range+0x244/0x3f0
> > [627163.727950] truncate_pagecache+0x44/0x60
> > [627163.727955] xfs_setattr_size+0x168/0x490 [xfs]
> > [627163.728074] xfs_vn_setattr+0x78/0x140 [xfs]
> > [627163.728153] notify_change+0x34f/0x4f0
> > [627163.728158] ? _raw_spin_lock+0x13/0x30
> > [627163.728165] ? do_truncate+0x80/0xd0
> > [627163.728169] do_truncate+0x80/0xd0
> > [627163.728172] do_open+0x2ce/0x400
> > [627163.728177] path_openat+0x10d/0x280
> > [627163.728181] do_filp_open+0xb2/0x150
> > [627163.728186] ? check_heap_object+0x34/0x190
> > [627163.728189] ? __check_object_size.part.0+0x5a/0x130
> > [627163.728194] do_sys_openat2+0x92/0xc0
> > [627163.728197] __x64_sys_openat+0x53/0x90
> > [627163.728200] do_syscall_64+0x35/0x80
> > [627163.728206] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000101
> > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > 00007fc5e493e7fb
> > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > 00000000ffffff9c
> > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > 0000000000000001
> > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > 0000000000000241
> > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > 0000000000000000
> > [627163.728227] </TASK>
> >
> > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > However, with long-term kernels 6.1.XX and 6.6.XX,
> > (tested at least 10 different versions), this lockup always appears
> > after 2-30 days, similar to the report in the original thread.
> > The more load (for example, copying a lot of local files while
> > serving 20Gbps traffic), the higher the chance that the bug will appear.
> >
> > I haven't been able to reproduce this during synthetic tests,
> > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> > If anyone can provide a patch, I can test it on multiple machines
> > over the next few days.
> Could you please try this one which could be applied on 6.6 directly. Thank you!
URL: https://lore.kernel.org/linux-mm/[email protected]/

> >
> > Regards,
> > Marcin

2024-05-21 15:47:55

by Marcin Wanat

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On 21.05.2024 03:00, Zhaoyang Huang wrote:
> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <[email protected]> wrote:
>>
>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
>>>
>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>> (tested at least 10 different versions), this lockup always appears
>>> after 2-30 days, similar to the report in the original thread.
>>> The more load (for example, copying a lot of local files while
>>> serving 20Gbps traffic), the higher the chance that the bug will appear.
>>>
>>> I haven't been able to reproduce this during synthetic tests,
>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>>> If anyone can provide a patch, I can test it on multiple machines
>>> over the next few days.
>> Could you please try this one which could be applied on 6.6 directly. Thank you!
> URL: https://lore.kernel.org/linux-mm/[email protected]/
>

Unfortunately, I am unable to cleanly apply this patch against the
latest 6.6.31

2024-05-22 05:38:10

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <[email protected]> wrote:
>
> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> > On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <[email protected]> wrote:
> >>
> >> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
> >>>
> >>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> >>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>> (tested at least 10 different versions), this lockup always appears
> >>> after 2-30 days, similar to the report in the original thread.
> >>> The more load (for example, copying a lot of local files while
> >>> serving 20Gbps traffic), the higher the chance that the bug will appear.
> >>>
> >>> I haven't been able to reproduce this during synthetic tests,
> >>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> >>> If anyone can provide a patch, I can test it on multiple machines
> >>> over the next few days.
> >> Could you please try this one which could be applied on 6.6 directly. Thank you!
> > URL: https://lore.kernel.org/linux-mm/[email protected]/
> >
>
> Unfortunately, I am unable to cleanly apply this patch against the
> latest 6.6.31
Please try the one below, which works on my v6.6-based Android build.
Thank you in advance for testing :D

mm/huge_memory.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
{
struct folio *folio = page_folio(page);
struct page *head = &folio->page;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = folio_lruvec(folio);
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}

- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);
-
ClearPageHasHWPoisoned(head);

for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}

ClearPageCompound(head);
- unlock_page_lruvec(lruvec);
- /* Caller disabled irqs, so they are still disabled here */
-
split_page_owner(head, nr);

/* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages);
}
- local_irq_enable();

if (nr_dropped)
shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
int extra_pins, ret;
pgoff_t end;
bool is_hzp;
+ struct lruvec *lruvec;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
+
+ /*
+ * Take the lruvec lock before freezing the folio to prevent the folio
+ * from remaining in the page cache with refcnt == 0, which could lead
+ * to find_get_entry entering a livelock while iterating the xarray.
+ */
+ lruvec = folio_lruvec_lock(folio);
+
if (mapping) {
/*
* Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}

__split_huge_page(page, list, end);
+ unlock_page_lruvec(lruvec);
+ local_irq_enable();
ret = 0;
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
xas_unlock(&xas);
+
+ unlock_page_lruvec(lruvec);
local_irq_enable();
remap_page(folio, folio_nr_pages(folio));
ret = -EAGAIN;
--
2.25.1

2024-05-22 10:13:37

by Marcin Wanat

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On 22.05.2024 07:37, Zhaoyang Huang wrote:
> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <[email protected]> wrote:
>>
>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <[email protected]> wrote:
>>>>
>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
>>>>>
>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>>>> (tested at least 10 different versions), this lockup always appears
>>>>> after 2-30 days, similar to the report in the original thread.
>>>>> The more load (for example, copying a lot of local files while
>>>>> serving 20Gbps traffic), the higher the chance that the bug will appear.
>>>>>
>>>>> I haven't been able to reproduce this during synthetic tests,
>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>>>>> If anyone can provide a patch, I can test it on multiple machines
>>>>> over the next few days.
>>>> Could you please try this one which could be applied on 6.6 directly. Thank you!
>>> URL: https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>
>> Unfortunately, I am unable to cleanly apply this patch against the
>> latest 6.6.31
> Please try below one which works on my v6.6 based android. Thank you
> for your test in advance :D
>
> mm/huge_memory.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c

I have compiled 6.6.31 with this patch and will test it on multiple
machines over the next 30 days. I will provide an update after 30 days
if everything is fine or sooner if any of the hosts experience the same
soft lockup again.

2024-05-27 08:22:46

by Marcin Wanat

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On 22.05.2024 12:13, Marcin Wanat wrote:
> On 22.05.2024 07:37, Zhaoyang Huang wrote:
>> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <[email protected]>
>> wrote:
>>>
>>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
>>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
>>>>>> affected.
>>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>>>>> (tested at least 10 different versions), this lockup always appears
>>>>>> after 2-30 days, similar to the report in the original thread.
>>>>>> The more load (for example, copying a lot of local files while
>>>>>> serving 20Gbps traffic), the higher the chance that the bug will
>>>>>> appear.
>>>>>>
>>>>>> I haven't been able to reproduce this during synthetic tests,
>>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
>>>>>> days.
>>>>>> If anyone can provide a patch, I can test it on multiple machines
>>>>>> over the next few days.
>>>>> Could you please try this one which could be applied on 6.6
>>>>> directly. Thank you!
>>>> URL: https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>
>>> Unfortunately, I am unable to cleanly apply this patch against the
>>> latest 6.6.31
>> Please try below one which works on my v6.6 based android. Thank you
>> for your test in advance :D
>>
>> mm/huge_memory.c | 22 ++++++++++++++--------
>>   1 file changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>
> I have compiled 6.6.31 with this patch and will test it on multiple
> machines over the next 30 days. I will provide an update after 30 days
> if everything is fine or sooner if any of the hosts experience the same
> soft lockup again.
>

The first server running 6.6.31 with this patch hung today. The soft
lockup changed to a hard lockup:

[26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
[26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
i40e libata megaraid_sas dca ghash_clmulni_intel wmi
[26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
W 6.6.31.el9 #3
[26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
[26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
0000000000000000
[26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ad6c6f67050
[26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
0000000000000067
[26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000046
[26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
ffffe1138aa98000
[26887.389707] FS: 0000000000000000(0000) GS:ffff9ade20340000(0000)
knlGS:0000000000000000
[26887.389708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
00000000007706e0
[26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[26887.389713] PKRU: 55555554
[26887.389714] Call Trace:
[26887.389717] <NMI>
[26887.389720] ? watchdog_hardlockup_check+0xac/0x150
[26887.389725] ? __perf_event_overflow+0x102/0x1d0
[26887.389729] ? handle_pmi_common+0x189/0x3e0
[26887.389735] ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389738] ? flush_tlb_one_kernel+0xa/0x20
[26887.389742] ? native_set_fixmap+0x65/0x80
[26887.389745] ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389751] ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389755] ? intel_pmu_handle_irq+0x10b/0x230
[26887.389756] ? perf_event_nmi_handler+0x28/0x50
[26887.389759] ? nmi_handle+0x58/0x150
[26887.389764] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389768] ? default_do_nmi+0x6b/0x170
[26887.389770] ? exc_nmi+0x12c/0x1a0
[26887.389772] ? end_repeat_nmi+0x16/0x1f
[26887.389777] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389780] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389784] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389787] </NMI>
[26887.389788] <TASK>
[26887.389789] __raw_spin_lock_irqsave+0x3d/0x50
[26887.389793] folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389798] __page_cache_release+0x68/0x230
[26887.389801] ? remove_migration_ptes+0x5c/0x80
[26887.389807] __folio_put+0x24/0x60
[26887.389808] __split_huge_page+0x368/0x520
[26887.389812] split_huge_page_to_list+0x4b3/0x570
[26887.389816] deferred_split_scan+0x1c8/0x290
[26887.389819] do_shrink_slab+0x12f/0x2d0
[26887.389824] shrink_slab_memcg+0x133/0x1d0
[26887.389829] shrink_node_memcgs+0x18e/0x1d0
[26887.389832] shrink_node+0xa7/0x370
[26887.389836] balance_pgdat+0x332/0x6f0
[26887.389842] kswapd+0xf0/0x190
[26887.389845] ? balance_pgdat+0x6f0/0x6f0
[26887.389848] kthread+0xee/0x120
[26887.389851] ? kthread_complete_and_exit+0x20/0x20
[26887.389853] ret_from_fork+0x2d/0x50
[26887.389857] ? kthread_complete_and_exit+0x20/0x20
[26887.389859] ret_from_fork_asm+0x11/0x20
[26887.389864] </TASK>
[26887.389865] Kernel panic - not syncing: Hard LOCKUP
[26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
W 6.6.31.el9 #3
[26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
[26887.389870] Call Trace:
[26887.389871] <NMI>
[26887.389872] dump_stack_lvl+0x44/0x60
[26887.389877] panic+0x241/0x330
[26887.389881] nmi_panic+0x2f/0x40
[26887.389883] watchdog_hardlockup_check+0x119/0x150
[26887.389886] __perf_event_overflow+0x102/0x1d0
[26887.389889] handle_pmi_common+0x189/0x3e0
[26887.389893] ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389896] ? flush_tlb_one_kernel+0xa/0x20
[26887.389899] ? native_set_fixmap+0x65/0x80
[26887.389902] ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389906] ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389909] intel_pmu_handle_irq+0x10b/0x230
[26887.389911] perf_event_nmi_handler+0x28/0x50
[26887.389913] nmi_handle+0x58/0x150
[26887.389916] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389920] default_do_nmi+0x6b/0x170
[26887.389922] exc_nmi+0x12c/0x1a0
[26887.389923] end_repeat_nmi+0x16/0x1f
[26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
0000000000000000
[26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ad6c6f67050
[26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
0000000000000067
[26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000046
[26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
ffffe1138aa98000
[26887.389940] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389943] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389946] </NMI>
[26887.389947] <TASK>
[26887.389947] __raw_spin_lock_irqsave+0x3d/0x50
[26887.389950] folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389953] __page_cache_release+0x68/0x230
[26887.389955] ? remove_migration_ptes+0x5c/0x80
[26887.389958] __folio_put+0x24/0x60
[26887.389960] __split_huge_page+0x368/0x520
[26887.389963] split_huge_page_to_list+0x4b3/0x570
[26887.389967] deferred_split_scan+0x1c8/0x290
[26887.389971] do_shrink_slab+0x12f/0x2d0
[26887.389974] shrink_slab_memcg+0x133/0x1d0
[26887.389978] shrink_node_memcgs+0x18e/0x1d0
[26887.389982] shrink_node+0xa7/0x370
[26887.389985] balance_pgdat+0x332/0x6f0
[26887.389991] kswapd+0xf0/0x190
[26887.389994] ? balance_pgdat+0x6f0/0x6f0
[26887.389997] kthread+0xee/0x120
[26887.389998] ? kthread_complete_and_exit+0x20/0x20
[26887.390000] ret_from_fork+0x2d/0x50
[26887.390003] ? kthread_complete_and_exit+0x20/0x20
[26887.390004] ret_from_fork_asm+0x11/0x20
[26887.390009] </TASK>


2024-05-27 08:53:40

by Zhaoyang Huang

Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <[email protected]> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <[email protected]>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one which could be applied on 6.6
> >>>>> directly. Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/[email protected]/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try below one which works on my v6.6 based android. Thank you
> >> for your test in advance :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >> 1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> First server with 6.6.31 and this patch hang today. Soft lockup changed
> to hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS: 0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717] <NMI>
> [26887.389720] ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725] ? __perf_event_overflow+0x102/0x1d0
> [26887.389729] ? handle_pmi_common+0x189/0x3e0
> [26887.389735] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742] ? native_set_fixmap+0x65/0x80
> [26887.389745] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755] ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756] ? perf_event_nmi_handler+0x28/0x50
> [26887.389759] ? nmi_handle+0x58/0x150
> [26887.389764] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768] ? default_do_nmi+0x6b/0x170
> [26887.389770] ? exc_nmi+0x12c/0x1a0
> [26887.389772] ? end_repeat_nmi+0x16/0x1f
> [26887.389777] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787] </NMI>
> [26887.389788] <TASK>
> [26887.389789] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798] __page_cache_release+0x68/0x230
> [26887.389801] ? remove_migration_ptes+0x5c/0x80
> [26887.389807] __folio_put+0x24/0x60
> [26887.389808] __split_huge_page+0x368/0x520
> [26887.389812] split_huge_page_to_list+0x4b3/0x570
> [26887.389816] deferred_split_scan+0x1c8/0x290
> [26887.389819] do_shrink_slab+0x12f/0x2d0
> [26887.389824] shrink_slab_memcg+0x133/0x1d0
> [26887.389829] shrink_node_memcgs+0x18e/0x1d0
> [26887.389832] shrink_node+0xa7/0x370
> [26887.389836] balance_pgdat+0x332/0x6f0
> [26887.389842] kswapd+0xf0/0x190
> [26887.389845] ? balance_pgdat+0x6f0/0x6f0
> [26887.389848] kthread+0xee/0x120
> [26887.389851] ? kthread_complete_and_exit+0x20/0x20
> [26887.389853] ret_from_fork+0x2d/0x50
> [26887.389857] ? kthread_complete_and_exit+0x20/0x20
> [26887.389859] ret_from_fork_asm+0x11/0x20
> [26887.389864] </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389870] Call Trace:
> [26887.389871] <NMI>
> [26887.389872] dump_stack_lvl+0x44/0x60
> [26887.389877] panic+0x241/0x330
> [26887.389881] nmi_panic+0x2f/0x40
> [26887.389883] watchdog_hardlockup_check+0x119/0x150
> [26887.389886] __perf_event_overflow+0x102/0x1d0
> [26887.389889] handle_pmi_common+0x189/0x3e0
> [26887.389893] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899] ? native_set_fixmap+0x65/0x80
> [26887.389902] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909] intel_pmu_handle_irq+0x10b/0x230
> [26887.389911] perf_event_nmi_handler+0x28/0x50
> [26887.389913] nmi_handle+0x58/0x150
> [26887.389916] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920] default_do_nmi+0x6b/0x170
> [26887.389922] exc_nmi+0x12c/0x1a0
> [26887.389923] end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946] </NMI>
> [26887.389947] <TASK>
> [26887.389947] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953] __page_cache_release+0x68/0x230
> [26887.389955] ? remove_migration_ptes+0x5c/0x80
> [26887.389958] __folio_put+0x24/0x60
> [26887.389960] __split_huge_page+0x368/0x520
> [26887.389963] split_huge_page_to_list+0x4b3/0x570
> [26887.389967] deferred_split_scan+0x1c8/0x290
> [26887.389971] do_shrink_slab+0x12f/0x2d0
> [26887.389974] shrink_slab_memcg+0x133/0x1d0
> [26887.389978] shrink_node_memcgs+0x18e/0x1d0
> [26887.389982] shrink_node+0xa7/0x370
> [26887.389985] balance_pgdat+0x332/0x6f0
> [26887.389991] kswapd+0xf0/0x190
> [26887.389994] ? balance_pgdat+0x6f0/0x6f0
> [26887.389997] kthread+0xee/0x120
> [26887.389998] ? kthread_complete_and_exit+0x20/0x20
> [26887.390000] ret_from_fork+0x2d/0x50
> [26887.390003] ? kthread_complete_and_exit+0x20/0x20
> [26887.390004] ret_from_fork_asm+0x11/0x20
> [26887.390009] </TASK>
>
OK, thanks for the information. That hard lockup should be caused by lock
contention. I will check the code and keep you posted.
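
For what it's worth, here is a toy userspace model of one way such contention can become a hard lockup. It is not kernel code and not a confirmed analysis of the report above; it only assumes that, with the test patch applied, the lruvec spinlock taken early in split_huge_page_to_list() is still held when a tail page's final put reaches folio_lruvec_lock_irqsave(), which is the path visible in the trace. All names are illustrative.

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static pthread_spinlock_t lruvec_lock;		/* stands in for lruvec->lru_lock */

static void watchdog(int sig)
{
	static const char msg[] =
		"watchdog: still spinning on lruvec_lock -> hard lockup\n";

	/* async-signal-safe reporting, then give up like a panicking kernel */
	write(STDOUT_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}

/* models __folio_put() -> __page_cache_release() on a tail page */
static void release_tail_page(void)
{
	pthread_spin_lock(&lruvec_lock);	/* spins: the lock is already held by us */
	pthread_spin_unlock(&lruvec_lock);
}

int main(void)
{
	pthread_spin_init(&lruvec_lock, PTHREAD_PROCESS_PRIVATE);
	signal(SIGALRM, watchdog);
	alarm(2);				/* plays the role of the NMI watchdog */

	/* the patched split_huge_page_to_list() takes the lock up front... */
	pthread_spin_lock(&lruvec_lock);
	/* ...and the split later drops a tail page's last reference */
	release_tail_page();

	pthread_spin_unlock(&lruvec_lock);
	return 0;
}

Like the kernel's spinlocks, pthread spinlocks are not recursive, so the second lock attempt never returns and the alarm fires after two seconds. Whether this matches what happened on the machine above is only a guess; it is one reading that fits "lock contention".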

2024-05-30 08:49:42

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
>
> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
> > > >
> > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > loop in Dave, since he previously helped set up a reproducer in
> > > > > https://lore.kernel.org/linux-mm/[email protected]/
> > > > > @Dave Chinner, I would like to ask for your kind help on whether
> > > > > you can verify this patch in your environment if convenient.
> > > > > Thanks a lot.
> > > >
> > > > I don't have the test environment from 18 months ago available any
> > > > more. Also, I haven't seen this problem since that specific test
> > > > environment tripped over the issue. Hence I don't have any way of
> > > > confirming that the problem is fixed, either, because first I'd have
> > > > to reproduce it...
> > >
> > > Thanks for the information. I noticed that you reported another soft
> > > lockup related to xas_load since Nov. 2023. This patch is supposed to
> > > help with that as well. With regard to the version timing, this commit
> > > is actually a revert of <mm/thp: narrow lru locking>
> > > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> > >
> > > To save your time, a brief description below. IMO, b6769834aa introduces
> > > a potential stall between freezing the folio's refcount and storing it
> > > back to 2, which lets the xas_load->folio_try_get_rcu loop livelock if
> > > the lru_lock's holder is stalled:
> > >
> > > b6769834aa  split_huge_page_to_list
> > > -               spin_lock(lru_lock)
> > >                 xas_split(&xas, folio, order)
> > >                 folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > > +               spin_lock(lru_lock)
> > >                 xas_store(&xas, offset++, head + i)
> > >                 page_ref_add(head, 2)
> > >                 spin_unlock(lru_lock)
> > >
> > > Sorry in advance if the above doesn't make sense, I am just a developer
> > > who is also suffering from this bug and trying to fix it.
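
For illustration, a toy userspace model of the window described in the quoted text above. It is not kernel code; the thread names, the initial refcount of 3, and the mutex standing in for the contended lru_lock are assumptions made purely to show the shape of the livelock.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int refcount = 3;			/* alloc ref plus a couple of pins */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* models folio_try_get_rcu(): only succeeds while the refcount is not frozen */
static int try_get(void)
{
	int ref = atomic_load(&refcount);

	while (ref > 0) {
		if (atomic_compare_exchange_weak(&refcount, &ref, ref + 1))
			return 1;
	}
	return 0;				/* frozen: caller resets and retries */
}

/* models the find_get_entry()/xas_find() retry loop under rcu_read_lock() */
static void *reader(void *arg)
{
	unsigned long long failed = 0;

	while (!try_get())
		failed++;			/* spins for as long as refcount == 0 */
	printf("reader: got the folio after %llu failed tries\n", failed);
	return NULL;
}

/* models the splitter with the b6769834aa ordering: freeze, then take lru_lock */
static void *splitter(void *arg)
{
	int expected = 3;

	atomic_compare_exchange_strong(&refcount, &expected, 0);	/* freeze */
	pthread_mutex_lock(&lru_lock);		/* stalls while lru_lock is contended */
	/* ... split tail pages and update the xarray slots here ... */
	pthread_mutex_unlock(&lru_lock);
	atomic_store(&refcount, 2);		/* republish, like page_ref_add(head, 2) */
	return NULL;
}

int main(void)
{
	pthread_t r, s;

	pthread_mutex_lock(&lru_lock);		/* somebody else holds lru_lock */
	pthread_create(&s, NULL, splitter, NULL);
	usleep(100 * 1000);			/* let the splitter freeze the refcount */
	pthread_create(&r, NULL, reader, NULL);
	sleep(2);				/* the reader burns a CPU this whole time */
	pthread_mutex_unlock(&lru_lock);	/* stall ends, the window closes */
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	return 0;
}

Built with cc -pthread, the reader occupies a full CPU for the two seconds the lock stays contended and only makes progress once the splitter republishes the folio; stretch the stall out indefinitely and you get the RCU stalls and soft lockups reported in this thread.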
> I am experiencing a similar error on dozens of hosts, with stack traces
> that are all similar:
>
> [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> [file_get:953301]
> [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> scsi_transport_sas i2c_algo_bit wmi
> [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> Tainted: G L 6.6.30.el9 #2
> [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> 2.21.2 02/19/2024
> [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> 0000000000000020
> [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> ffffc90034a67990
> [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> 0000000000008820
> [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> ffffc90034a67a20
> [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> ffffc90034a67a18
> [627163.727870] FS: 00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> knlGS:0000000000000000
> [627163.727871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> 00000000007706e0
> [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [627163.727878] PKRU: 55555554
> [627163.727879] Call Trace:
> [627163.727882] <IRQ>
> [627163.727886] ? watchdog_timer_fn+0x22a/0x2a0
> [627163.727892] ? softlockup_fn+0x70/0x70
> [627163.727895] ? __hrtimer_run_queues+0x10f/0x2a0
> [627163.727903] ? hrtimer_interrupt+0x106/0x240
> [627163.727906] ? __sysvec_apic_timer_interrupt+0x68/0x170
> [627163.727913] ? sysvec_apic_timer_interrupt+0x9d/0xd0
> [627163.727917] </IRQ>
> [627163.727918] <TASK>
> [627163.727920] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [627163.727927] ? xas_descend+0x1b/0x70
> [627163.727930] xas_load+0x2c/0x40
> [627163.727933] xas_find+0x161/0x1a0
> [627163.727937] find_get_entries+0x77/0x1d0
> [627163.727944] truncate_inode_pages_range+0x244/0x3f0
> [627163.727950] truncate_pagecache+0x44/0x60
> [627163.727955] xfs_setattr_size+0x168/0x490 [xfs]
> [627163.728074] xfs_vn_setattr+0x78/0x140 [xfs]
> [627163.728153] notify_change+0x34f/0x4f0
> [627163.728158] ? _raw_spin_lock+0x13/0x30
> [627163.728165] ? do_truncate+0x80/0xd0
> [627163.728169] do_truncate+0x80/0xd0
> [627163.728172] do_open+0x2ce/0x400
> [627163.728177] path_openat+0x10d/0x280
> [627163.728181] do_filp_open+0xb2/0x150
> [627163.728186] ? check_heap_object+0x34/0x190
> [627163.728189] ? __check_object_size.part.0+0x5a/0x130
> [627163.728194] do_sys_openat2+0x92/0xc0
> [627163.728197] __x64_sys_openat+0x53/0x90
> [627163.728200] do_syscall_64+0x35/0x80
> [627163.728206] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [627163.728210] RIP: 0033:0x7fc5e493e7fb
> [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000101
> [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> 00007fc5e493e7fb
> [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> 00000000ffffff9c
> [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> 0000000000000001
> [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> 0000000000000241
> [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> 0000000000000000
> [627163.728227] </TASK>
>
> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> However, with long-term kernels 6.1.XX and 6.6.XX,
> (tested at least 10 different versions), this lockup always appears
> after 2-30 days, similar to the report in the original thread.
> The more load (for example, copying a lot of local files while
> serving 20Gbps traffic), the higher the chance that the bug will appear.
>
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.

We encountered a similar issue several months ago. Some of our
production servers crashed within days of deploying the 6.1.y
stable kernel. The soft lockup info is as follows:

[282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
[container-execu:1572375]
[282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
overlay af_packet bonding intel_rapl_msr intel_rapl_common
64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
sd_mod t10_pi ahci libahci libata
[282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
loaded Tainted: G W O L 6.1.38-rc3 #rc3.pdd
[282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
[282879.612576] RIP: 0010:xas_descend+0x18/0x80
[282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
00 00
[282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
[282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
0000000000000006
[282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
ffffad700b247c68
[282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
fffffffffffffffe
[282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
ffffad700b247cf8
[282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
ffffdfcd2c778000
[282879.612594] FS: 00007f5f576fb740(0000) GS:ffff922df0840000(0000)
knlGS:0000000000000000
[282879.612596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
0000000000350ee0
[282879.612599] Call Trace:
[282879.612601] <IRQ>
[282879.612605] ? show_regs.cold+0x1a/0x1f
[282879.612610] ? watchdog_timer_fn+0x1c4/0x220
[282879.612614] ? softlockup_fn+0x30/0x30
[282879.612616] ? __hrtimer_run_queues+0xa2/0x2b0
[282879.612620] ? hrtimer_interrupt+0x109/0x220
[282879.612622] ? __sysvec_apic_timer_interrupt+0x5e/0x110
[282879.612625] ? sysvec_apic_timer_interrupt+0x7b/0x90
[282879.612629] </IRQ>
[282879.612630] <TASK>
[282879.612631] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[282879.612640] ? xas_descend+0x18/0x80
[282879.612641] ? xas_load+0x35/0x40
[282879.612643] xas_find+0x197/0x1d0
[282879.612645] find_get_entries+0x6e/0x170
[282879.612649] truncate_inode_pages_range+0x294/0x4c0
[282879.612655] ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
[282879.612787] ? kvfree+0x2c/0x40
[282879.612791] ? trace_hardirqs_off+0x36/0xf0
[282879.612795] truncate_inode_pages_final+0x44/0x50
[282879.612798] evict+0x177/0x190
[282879.612802] iput.part.0+0x183/0x1e0
[282879.612804] iput+0x1c/0x30
[282879.612806] do_unlinkat+0x1c7/0x2c0
[282879.612810] __x64_sys_unlinkat+0x38/0x70
[282879.612812] do_syscall_64+0x38/0x90
[282879.612815] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[282879.612818] RIP: 0033:0x7f5f56cf120d
[282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
89 02
[282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
0000000000000107
[282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007f5f56cf120d
[282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
0000000000000003
[282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
0000000001640403
[282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
0000000000000003
[282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
0000000000000000
[282879.612836] </TASK>


Unfortunately, we couldn't reproduce the issue on our test servers. We
worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
production servers have been running smoothly for several months.

> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
>


--
Regards
Yafang

2024-05-30 09:00:26

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Thu, May 30, 2024 at 4:49 PM Yafang Shao <[email protected]> wrote:
>
> [... quote of Marcin Wanat's and Yafang Shao's reports, the earlier discussion, and the stack traces trimmed; see the messages above ...]
>
> > If anyone can provide a patch, I can test it on multiple machines
> > over the next few days.
It would be highly appreciated if you could help try the patch below,
which works on my v6.6-based Android system. However, a hard lockup has
been reported in an ongoing regression test (not yet sure whether it is
caused by this patch). Thank you!

mm/huge_memory.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
-
 	ClearPageHasHWPoisoned(head);
 
 	for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
-
 	split_page_owner(head, nr);
 
 	/* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-	local_irq_enable();
 
 	if (nr_dropped)
 		shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * take lruvec's lock before freeze the folio to prevent the folio
+	 * remains in the page cache with refcnt == 0, which could lead to
+	 * find_get_entry enters livelock by iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	}
 
 	__split_huge_page(page, list, end);

> >
>
>
> --
> Regards
> Yafang

2024-05-30 09:26:22

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 4:49 PM Yafang Shao <[email protected]> wrote:
> >
> > > [... quote of Marcin Wanat's and Yafang Shao's reports, the earlier discussion, and the stack traces trimmed; see the messages above ...]
> >
> > > If anyone can provide a patch, I can test it on multiple machines
> > > over the next few days.
> It would be highly appreciated if you could help try the patch below,
> which works on my v6.6-based Android system. However, a hard lockup has
> been reported in an ongoing regression test (not yet sure whether it is
> caused by this patch). Thank you!

I'm sorry to inform you that our users are unwilling to experiment
with these changes on our production servers again, and I am unable to
reproduce the issue on our test servers. I am reporting this issue to
highlight to the community that it is indeed a serious problem, and we
should consider it carefully.

> [... the patch quoted from the previous message trimmed; see above ...]



--
Regards
Yafang

2024-05-31 06:18:19

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Thu, May 30, 2024 at 5:24 PM Yafang Shao <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <[email protected]> wrote:
> >
> > On Thu, May 30, 2024 at 4:49 PM Yafang Shao <[email protected]> wrote:
> > >
> > > On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <[email protected]> wrote:
> > > >
> > > > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > > > loop in Dave, since he previously helped set up a reproducer in
> > > > > > > https://lore.kernel.org/linux-mm/[email protected]/
> > > > > > > @Dave Chinner, I would like to ask for your kind help on whether
> > > > > > > you can verify this patch in your environment if convenient.
> > > > > > > Thanks a lot.
> > > > > >
> > > > > > I don't have the test environment from 18 months ago available any
> > > > > > more. Also, I haven't seen this problem since that specific test
> > > > > > environment tripped over the issue. Hence I don't have any way of
> > > > > > confirming that the problem is fixed, either, because first I'd
> > > > > > have to reproduce it...
> > > > >
> > > > > Thanks for the information. I noticed that you reported another soft
> > > > > lockup related to xas_load since Nov. 2023. This patch is supposed
> > > > > to help with that as well. With regard to the version timing, this
> > > > > commit is actually a revert of <mm/thp: narrow lru locking>
> > > > > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before
> > > > > v5.15.
> > > > >
> > > > > To save your time, a brief description below. IMO, b6769834aa
> > > > > introduces a potential stall between freezing the folio's refcount
> > > > > and storing it back to 2, which lets the xas_load->folio_try_get_rcu
> > > > > loop livelock if the lru_lock's holder is stalled:
> > > > >
> > > > > b6769834aa  split_huge_page_to_list
> > > > > -               spin_lock(lru_lock)
> > > > >                 xas_split(&xas, folio, order)
> > > > >                 folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > > > > +               spin_lock(lru_lock)
> > > > >                 xas_store(&xas, offset++, head + i)
> > > > >                 page_ref_add(head, 2)
> > > > >                 spin_unlock(lru_lock)
> > > > >
> > > > > Sorry in advance if the above doesn't make sense, I am just a
> > > > > developer who is also suffering from this bug and trying to fix it.
> > > > I am experiencing a similar error on dozens of hosts, with stack traces
> > > > that are all similar:
> > > >
> > > > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > > > [file_get:953301]
> > > > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > > > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > > > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > > > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > > > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > > > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > > > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > > > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > > > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > > > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > > > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > > > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > > > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > > > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > > > scsi_transport_sas i2c_algo_bit wmi
> > > > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > > > Tainted: G L 6.6.30.el9 #2
> > > > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > > > 2.21.2 02/19/2024
> > > > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > > > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > > > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > > > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > > > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > > > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > > > 0000000000000020
> > > > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > > > ffffc90034a67990
> > > > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > > > 0000000000008820
> > > > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > > > ffffc90034a67a20
> > > > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > > > ffffc90034a67a18
> > > > [627163.727870] FS: 00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > > > knlGS:0000000000000000
> > > > [627163.727871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > > > 00000000007706e0
> > > > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [627163.727878] PKRU: 55555554
> > > > [627163.727879] Call Trace:
> > > > [627163.727882] <IRQ>
> > > > [627163.727886] ? watchdog_timer_fn+0x22a/0x2a0
> > > > [627163.727892] ? softlockup_fn+0x70/0x70
> > > > [627163.727895] ? __hrtimer_run_queues+0x10f/0x2a0
> > > > [627163.727903] ? hrtimer_interrupt+0x106/0x240
> > > > [627163.727906] ? __sysvec_apic_timer_interrupt+0x68/0x170
> > > > [627163.727913] ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > > > [627163.727917] </IRQ>
> > > > [627163.727918] <TASK>
> > > > [627163.727920] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [627163.727927] ? xas_descend+0x1b/0x70
> > > > [627163.727930] xas_load+0x2c/0x40
> > > > [627163.727933] xas_find+0x161/0x1a0
> > > > [627163.727937] find_get_entries+0x77/0x1d0
> > > > [627163.727944] truncate_inode_pages_range+0x244/0x3f0
> > > > [627163.727950] truncate_pagecache+0x44/0x60
> > > > [627163.727955] xfs_setattr_size+0x168/0x490 [xfs]
> > > > [627163.728074] xfs_vn_setattr+0x78/0x140 [xfs]
> > > > [627163.728153] notify_change+0x34f/0x4f0
> > > > [627163.728158] ? _raw_spin_lock+0x13/0x30
> > > > [627163.728165] ? do_truncate+0x80/0xd0
> > > > [627163.728169] do_truncate+0x80/0xd0
> > > > [627163.728172] do_open+0x2ce/0x400
> > > > [627163.728177] path_openat+0x10d/0x280
> > > > [627163.728181] do_filp_open+0xb2/0x150
> > > > [627163.728186] ? check_heap_object+0x34/0x190
> > > > [627163.728189] ? __check_object_size.part.0+0x5a/0x130
> > > > [627163.728194] do_sys_openat2+0x92/0xc0
> > > > [627163.728197] __x64_sys_openat+0x53/0x90
> > > > [627163.728200] do_syscall_64+0x35/0x80
> > > > [627163.728206] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > > > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > > > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > > > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > > > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > > > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > > > 0000000000000101
> > > > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > > > 00007fc5e493e7fb
> > > > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > > > 00000000ffffff9c
> > > > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > > > 0000000000000001
> > > > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > > > 0000000000000241
> > > > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > > > 0000000000000000
> > > > [627163.728227] </TASK>
> > > >
> > > > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > > > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > > > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > > > However, with long-term kernels 6.1.XX and 6.6.XX,
> > > > (tested at least 10 different versions), this lockup always appears
> > > > after 2-30 days, similar to the report in the original thread.
> > > > The more load (for example, copying a lot of local files while
> > > > serving 20Gbps traffic), the higher the chance that the bug will appear.
> > > >
> > > > I haven't been able to reproduce this during synthetic tests,
> > > > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> > >
> > > We encountered a similar issue several months ago. Some of our
> > > production servers crashed within days after deploying the 6.1.y
> > > stable kernel. The soft lock info as follows,
> > >
> > > [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> > > [container-execu:1572375]
> > > [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> > > iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> > > unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> > > xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> > > overlay af_packet bonding intel_rapl_msr intel_rapl_common
> > > 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> > > polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> > > aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> > > ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> > > acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> > > ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> > > psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> > > sd_mod t10_pi ahci libahci libata
> > > [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> > > loaded Tainted: G W O L 6.1.38-rc3 #rc3.pdd
> > > [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> > > UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> > > [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> > > [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> > > cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> > > 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> > > 00 00
> > > [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> > > [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> > > 0000000000000006
> > > [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> > > ffffad700b247c68
> > > [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> > > fffffffffffffffe
> > > [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> > > ffffad700b247cf8
> > > [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> > > ffffdfcd2c778000
> > > [282879.612594] FS: 00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> > > knlGS:0000000000000000
> > > [282879.612596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> > > 0000000000350ee0
> > > [282879.612599] Call Trace:
> > > [282879.612601] <IRQ>
> > > [282879.612605] ? show_regs.cold+0x1a/0x1f
> > > [282879.612610] ? watchdog_timer_fn+0x1c4/0x220
> > > [282879.612614] ? softlockup_fn+0x30/0x30
> > > [282879.612616] ? __hrtimer_run_queues+0xa2/0x2b0
> > > [282879.612620] ? hrtimer_interrupt+0x109/0x220
> > > [282879.612622] ? __sysvec_apic_timer_interrupt+0x5e/0x110
> > > [282879.612625] ? sysvec_apic_timer_interrupt+0x7b/0x90
> > > [282879.612629] </IRQ>
> > > [282879.612630] <TASK>
> > > [282879.612631] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> > > [282879.612640] ? xas_descend+0x18/0x80
> > > [282879.612641] ? xas_load+0x35/0x40
> > > [282879.612643] xas_find+0x197/0x1d0
> > > [282879.612645] find_get_entries+0x6e/0x170
> > > [282879.612649] truncate_inode_pages_range+0x294/0x4c0
> > > [282879.612655] ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> > > [282879.612787] ? kvfree+0x2c/0x40
> > > [282879.612791] ? trace_hardirqs_off+0x36/0xf0
> > > [282879.612795] truncate_inode_pages_final+0x44/0x50
> > > [282879.612798] evict+0x177/0x190
> > > [282879.612802] iput.part.0+0x183/0x1e0
> > > [282879.612804] iput+0x1c/0x30
> > > [282879.612806] do_unlinkat+0x1c7/0x2c0
> > > [282879.612810] __x64_sys_unlinkat+0x38/0x70
> > > [282879.612812] do_syscall_64+0x38/0x90
> > > [282879.612815] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [282879.612818] RIP: 0033:0x7f5f56cf120d
> > > [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> > > 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> > > 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> > > 89 02
> > > [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> > > 0000000000000107
> > > [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > > 00007f5f56cf120d
> > > [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> > > 0000000000000003
> > > [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> > > 0000000001640403
> > > [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> > > 0000000000000003
> > > [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> > > 0000000000000000
> > > [282879.612836] </TASK>
> > >
> > >
> > > Unfortunately, we couldn't reproduce the issue on our test servers. We
> > > worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> > > production servers have been running smoothly for several months.
> > >
> > > > If anyone can provide a patch, I can test it on multiple machines
> > > > over the next few days.
> > It would be highly appreciated if you could try the patch below, which
> > works on my v6.6-based Android build. However, a hard lockup has been
> > reported in an ongoing regression test (it is not yet clear whether it
> > is caused by this patch). Thank you!
>
> I'm sorry to inform you that our users are unwilling to experiment
> with these changes on our production servers again, and I am unable to
> reproduce the issue on our test servers. I am reporting this issue to
> highlight to the community that it is indeed a serious problem, and we
> should consider it carefully.
OK. Based on what came up during the investigation, I would like to
suggest a possible way to reproduce this: have multiple processes mmap
and truncate the same file simultaneously, and reserve a certain amount
of CMA area via dts; that setup should make the race more likely to
trigger. A rough sketch of such a reproducer is below.
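Something along the lines below is what I have in mind (an untested
sketch only; the file path, file size and process count are arbitrary
assumptions, and the CMA reservation itself would still have to come
from a reserved-memory node in the dts, which a user-space program
cannot show):

/*
 * Untested sketch of the suggested reproducer: several processes
 * mmap and truncate the same file in a loop, so that splitting of
 * large page-cache folios races with lockless xarray walks such as
 * the truncate paths seen in the lockup traces above.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILE_SZ (64UL << 20)          /* arbitrary: 64MB backing file */
#define NPROC   8                     /* arbitrary: number of workers */

static void worker(const char *path)
{
        int fd = open(path, O_RDWR);

        for (;;) {
                ftruncate(fd, FILE_SZ);
                char *p = mmap(NULL, FILE_SZ, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p != MAP_FAILED) {
                        /* touch every page to populate the page cache */
                        for (size_t i = 0; i < FILE_SZ; i += 4096)
                                p[i] = 1;
                        munmap(p, FILE_SZ);
                }
                /* truncation races with concurrent splits/lookups */
                ftruncate(fd, 0);
        }
}

int main(void)
{
        const char *path = "/data/xa_split_repro";  /* arbitrary path */

        close(open(path, O_RDWR | O_CREAT, 0644));
        for (int i = 0; i < NPROC; i++)
                if (fork() == 0)
                        worker(path);
        wait(NULL);
        return 0;
}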
>
> >
> > mm/huge_memory.c | 22 ++++++++++++++--------
> > 1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 064fbd90822b..5899906c326a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > {
> > struct folio *folio = page_folio(page);
> > struct page *head = &folio->page;
> > - struct lruvec *lruvec;
> > + struct lruvec *lruvec = folio_lruvec(folio);
> > struct address_space *swap_cache = NULL;
> > unsigned long offset = 0;
> > unsigned int nr = thp_nr_pages(head);
> > @@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > xa_lock(&swap_cache->i_pages);
> > }
> >
> > - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > - lruvec = folio_lruvec_lock(folio);
> > -
> > ClearPageHasHWPoisoned(head);
> >
> > for (i = nr - 1; i >= 1; i--) {
> > @@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > }
> >
> > ClearPageCompound(head);
> > - unlock_page_lruvec(lruvec);
> > - /* Caller disabled irqs, so they are still disabled here */
> > -
> > split_page_owner(head, nr);
> >
> > /* See comment in __split_huge_page_tail() */
> > @@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > page_ref_add(head, 2);
> > xa_unlock(&head->mapping->i_pages);
> > }
> > - local_irq_enable();
> >
> > if (nr_dropped)
> > shmem_uncharge(head->mapping->host, nr_dropped);
> > @@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> > int extra_pins, ret;
> > pgoff_t end;
> > bool is_hzp;
> > + struct lruvec *lruvec;
> >
> > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> > @@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >
> > /* block interrupt reentry in xa_lock and spinlock */
> > local_irq_disable();
> > +
> > + /*
> > + * take lruvec's lock before freeze the folio to prevent the folio
> > + * remains in the page cache with refcnt == 0, which could lead to
> > + * find_get_entry enters livelock by iterating the xarray.
> > + */
> > + lruvec = folio_lruvec_lock(folio);
> > +
> > if (mapping) {
> > /*
> > * Check if the folio is present in page cache.
> > @@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> > }
> >
> > __split_huge_page(page, list, end);
> >
> > > >
> > >
> > >
> > > --
> > > Regards
> > > Yafang
>
>
>
> --
> Regards
> Yafang

2024-06-14 03:33:53

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <[email protected]> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <[email protected]>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one which could be applied on 6.6
> >>>>> directly. Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-
> >>>> [email protected]/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try below one which works on my v6.6 based android. Thank you
> >> for your test in advance :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >> 1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> First server with 6.6.31 and this patch hang today. Soft lockup changed
> to hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS: 0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717] <NMI>
> [26887.389720] ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725] ? __perf_event_overflow+0x102/0x1d0
> [26887.389729] ? handle_pmi_common+0x189/0x3e0
> [26887.389735] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742] ? native_set_fixmap+0x65/0x80
> [26887.389745] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755] ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756] ? perf_event_nmi_handler+0x28/0x50
> [26887.389759] ? nmi_handle+0x58/0x150
> [26887.389764] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768] ? default_do_nmi+0x6b/0x170
> [26887.389770] ? exc_nmi+0x12c/0x1a0
> [26887.389772] ? end_repeat_nmi+0x16/0x1f
> [26887.389777] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787] </NMI>
> [26887.389788] <TASK>
> [26887.389789] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798] __page_cache_release+0x68/0x230
> [26887.389801] ? remove_migration_ptes+0x5c/0x80
> [26887.389807] __folio_put+0x24/0x60
> [26887.389808] __split_huge_page+0x368/0x520
> [26887.389812] split_huge_page_to_list+0x4b3/0x570
> [26887.389816] deferred_split_scan+0x1c8/0x290
> [26887.389819] do_shrink_slab+0x12f/0x2d0
> [26887.389824] shrink_slab_memcg+0x133/0x1d0
> [26887.389829] shrink_node_memcgs+0x18e/0x1d0
> [26887.389832] shrink_node+0xa7/0x370
> [26887.389836] balance_pgdat+0x332/0x6f0
> [26887.389842] kswapd+0xf0/0x190
> [26887.389845] ? balance_pgdat+0x6f0/0x6f0
> [26887.389848] kthread+0xee/0x120
> [26887.389851] ? kthread_complete_and_exit+0x20/0x20
> [26887.389853] ret_from_fork+0x2d/0x50
> [26887.389857] ? kthread_complete_and_exit+0x20/0x20
> [26887.389859] ret_from_fork_asm+0x11/0x20
> [26887.389864] </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389870] Call Trace:
> [26887.389871] <NMI>
> [26887.389872] dump_stack_lvl+0x44/0x60
> [26887.389877] panic+0x241/0x330
> [26887.389881] nmi_panic+0x2f/0x40
> [26887.389883] watchdog_hardlockup_check+0x119/0x150
> [26887.389886] __perf_event_overflow+0x102/0x1d0
> [26887.389889] handle_pmi_common+0x189/0x3e0
> [26887.389893] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899] ? native_set_fixmap+0x65/0x80
> [26887.389902] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909] intel_pmu_handle_irq+0x10b/0x230
> [26887.389911] perf_event_nmi_handler+0x28/0x50
> [26887.389913] nmi_handle+0x58/0x150
> [26887.389916] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920] default_do_nmi+0x6b/0x170
> [26887.389922] exc_nmi+0x12c/0x1a0
> [26887.389923] end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946] </NMI>
> [26887.389947] <TASK>
> [26887.389947] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953] __page_cache_release+0x68/0x230
> [26887.389955] ? remove_migration_ptes+0x5c/0x80
> [26887.389958] __folio_put+0x24/0x60
> [26887.389960] __split_huge_page+0x368/0x520
> [26887.389963] split_huge_page_to_list+0x4b3/0x570
> [26887.389967] deferred_split_scan+0x1c8/0x290
> [26887.389971] do_shrink_slab+0x12f/0x2d0
> [26887.389974] shrink_slab_memcg+0x133/0x1d0
> [26887.389978] shrink_node_memcgs+0x18e/0x1d0
> [26887.389982] shrink_node+0xa7/0x370
> [26887.389985] balance_pgdat+0x332/0x6f0
> [26887.389991] kswapd+0xf0/0x190
> [26887.389994] ? balance_pgdat+0x6f0/0x6f0
> [26887.389997] kthread+0xee/0x120
> [26887.389998] ? kthread_complete_and_exit+0x20/0x20
> [26887.390000] ret_from_fork+0x2d/0x50
> [26887.390003] ? kthread_complete_and_exit+0x20/0x20
> [26887.390004] ret_from_fork_asm+0x11/0x20
> [26887.390009] </TASK>
>
Hi Marcin. Sorry for the late reply. I think the hard lockup above is
caused by the recursive deadlock shown in [1]: __split_huge_page() takes
the lruvec lock first, and then folio_put() on a tail page beyond EOF
reaches __page_cache_release(), which tries to take the same lruvec lock
again with IRQs disabled, so the CPU spins forever. This has been fixed
by [2], which is in v6.8+. I would like to know whether your regression
test is still running. Thanks very much.

[1]
static void __split_huge_page(struct page *page, struct list_head *list,
		pgoff_t end, unsigned int new_order)
{
	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
	lruvec = folio_lruvec_lock(folio);	// takes lruvec_lock the 1st time

	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
		__split_huge_page_tail(folio, i, lruvec, list, new_order);
		/* Some pages can be beyond EOF: drop them from page cache */
		if (head[i].index >= end) {
			folio_put(tail);
			  __page_cache_release
			    folio_lruvec_lock_irqsave	// hangs on the 2nd attempt

[2]
commit f1ee018baee9f4e724e08859c2559323be768be3
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Tue Feb 27 17:42:42 2024 +0000

mm: use __page_cache_release() in folios_put()

Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
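
For completeness, the core idea of [2], as I understand it, is that the
caller keeps the lruvec lock it already holds and the release path only
switches locks when the folio belongs to a different lruvec, so dropping
a beyond-EOF tail inside __split_huge_page() no longer tries to
re-acquire a lock it is already holding. A simplified sketch (the helper
name below is made up and the folio_lruvec_relock_irqsave() calling
convention is approximated; please refer to the commit itself for the
real code):

static void page_cache_release_locked(struct folio *folio,
                                      struct lruvec **lruvecp,
                                      unsigned long *flags)
{
        if (folio_test_lru(folio)) {
                /* relock only if this folio sits on a different lruvec */
                *lruvecp = folio_lruvec_relock_irqsave(folio, *lruvecp, flags);
                lruvec_del_folio(*lruvecp, folio);
                __folio_clear_lru_flags(folio);
        }
        /* ... remaining release work elided ... */
}

With that in place, the second lock acquisition in the call chain of [1]
degenerates into a no-op instead of a self-deadlock.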