2023-05-09 16:44:29

by Daniel Dao

[permalink] [raw]
Subject: rcu_preempt self-detected stall in filemap_get_read_batch

Hi Matthew,

I'm trying to track down a problem that occurs intermittently during
rocksdb compaction and manifests as an RCU self-detected stall:

[949170.657126][ C97] rcu: INFO: rcu_preempt self-detected stall on CPU
[949170.666237][ C97] rcu: 97-....: (701243 ticks this GP)
idle=948c/1/0x4000000000000000 softirq=295729919/295729919 fqs=286689
[949170.683019][ C97] (t=718763 jiffies g=500631033
q=31964872 ncpus=128)
[949170.692508][ C97] CPU: 97 PID: 184960 Comm: rocksdb:low1
Kdump: loaded Tainted: G W O L 6.1.22-cloudflare-2023.3.27
#1
[949170.710034][ C97] Hardware name: HYVE
EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
[949170.721788][ C97] RIP: 0010:xas_load+0x61/0xa0
[949170.729326][ C97] Code: ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44
c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00 00
00 76 0d 88 57 12 <80> 3e 00 75 a5 c3 cc cc cc cc 48 c1 e8 02 89 c2 89
c0 48 83 c0 04
[949170.757643][ C97] RSP: 0018:ffffabeee65f3bf8 EFLAGS: 00000293
[949170.766704][ C97] RAX: fffff17ec1cc3000 RBX: ffffabeee65f3d70
RCX: 0000000000000000
[949170.777729][ C97] RDX: 0000000000000000 RSI: ffff99e781536d80
RDI: ffffabeee65f3c00
[949170.788799][ C97] RBP: 000000000000157e R08: 0000000000000402
R09: fffff17ec1cc3000
[949170.799865][ C97] R10: 0000000000000001 R11: 0000000000000000
R12: ffffabeee65f3e90
[949170.810927][ C97] R13: 000000000000157e R14: 0000000000001540
R15: ffff99e2615f6538
[949170.821983][ C97] FS: 00007ff7be571700(0000)
GS:ffff99d51fc40000(0000) knlGS:0000000000000000
[949170.834065][ C97] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[949170.843820][ C97] CR2: 00007fa0ea422000 CR3: 0000004b17f62002
CR4: 0000000000770ee0
[949170.855008][ C97] PKRU: 55555554
[949170.861762][ C97] Call Trace:
[949170.868244][ C97] <TASK>
[949170.874358][ C97] filemap_get_read_batch+0x186/0x250
[949170.882949][ C97] filemap_get_pages+0xa2/0x630
[949170.890991][ C97] ? free_unref_page_commit+0x7c/0x170
[949170.899646][ C97] ? _raw_spin_unlock_irqrestore+0x1b/0x40
[949170.908603][ C97] ? free_unref_page+0x1a8/0x1e0
[949170.916604][ C97] filemap_read+0xc1/0x320
[949170.923990][ C97] ? find_css_set+0x200/0x680
[949170.931570][ C97] xfs_file_buffered_read+0x50/0xd0
[949170.939639][ C97] xfs_file_read_iter+0x6a/0xd0
[949170.947322][ C97] vfs_read+0x204/0x2d0
[949170.954298][ C97] __x64_sys_pread64+0x90/0xc0
[949170.961865][ C97] do_syscall_64+0x3b/0x90
[949170.969094][ C97] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
[949170.977781][ C97] RIP: 0033:0x7ff7ee27b917
[949170.984964][ C97] Code: 08 89 3c 24 48 89 4c 24 18 e8 05 f4 ff
ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8
11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 35
f4 ff ff 48 8b
[949171.013041][ C97] RSP: 002b:00007ff7be56d1b0 EFLAGS: 00000293
ORIG_RAX: 0000000000000011
[949171.024300][ C97] RAX: ffffffffffffffda RBX: 0000000001595dd0
RCX: 00007ff7ee27b917
[949171.035104][ C97] RDX: 00000000000b20c3 RSI: 0000000027460000
RDI: 0000000000000050
[949171.045842][ C97] RBP: 00007ff7be56d2f0 R08: 0000000000000000
R09: 00007ff7be56d3a0
[949171.056529][ C97] R10: 00000000014cccf2 R11: 0000000000000293
R12: 0000000003a2fc80
[949171.067185][ C97] R13: 00000000000b20c3 R14: 00000000000b20c3
R15: 00000000014cccf2
[949171.077761][ C97] </TASK>

We have not been able to reproduce this reliably.

Does this look similar to the problems seen in
https://lore.kernel.org/linux-mm/Y1lZ9Rm87GpFRM%[email protected]/ ?
I wonder whether it's reasonable to try that patch and see what we get,
since it looks sane.

Best,
Daniel.


2024-04-16 08:03:01

by zhaoyang.huang

[permalink] [raw]
Subject: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

From: Zhaoyang Huang <[email protected]>

The livelock in [1] has been reported multiple times since v5.15: a
zero-ref folio is repeatedly found in the page cache by find_get_entry.
A possible timing sequence is proposed in [2]; briefly, the lockless
xarray walk can be harmed by a stale folio left in slot[offset]. This
commit protects the xarray split work (folio_ref_freeze and
__split_huge_page) under lruvec->lru_lock to close the race window.
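
To make the ordering change concrete, here is a condensed sketch of the
split path before and after this patch (drawn from the diff below; the
split_queue_lock, mapping/xas handling and the failure path are omitted,
so this is an illustration rather than the exact code):

	/*
	 * Before (simplified): the refcount is frozen first, and
	 * lruvec->lru_lock is only taken later inside __split_huge_page().
	 * If that lock is contended, the folio sits in its xarray slot
	 * with refcount 0 for the whole wait.
	 */
	local_irq_disable();
	if (folio_ref_freeze(folio, 1 + extra_pins))
		__split_huge_page(page, list, end, new_order); /* locks lruvec inside */

	/*
	 * After (simplified): the lruvec lock is taken before the freeze,
	 * so while the split path waits for the lock the folio is still
	 * fully referenced and lockless readers can make progress.
	 */
	local_irq_disable();
	lruvec = folio_lruvec_lock(folio);
	if (folio_ref_freeze(folio, 1 + extra_pins))
		__split_huge_page(page, list, end, new_order);
	unlock_page_lruvec(lruvec);
	local_irq_enable();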

[1]
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0 state:R running task stack: 24 pid: 155 ppid: 2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0 state:R running task stack: 24 pid: 155 ppid: 2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342] dump_backtrace.cfi_jt+0x0/0x8
[167789.800355] show_stack+0x1c/0x2c
[167789.800363] sched_show_task+0x1ac/0x27c
[167789.800370] print_other_cpu_stall+0x314/0x4dc
[167789.800377] check_cpu_stall+0x1c4/0x36c
[167789.800382] rcu_sched_clock_irq+0xe8/0x388
[167789.800389] update_process_times+0xa0/0xe0
[167789.800396] tick_sched_timer+0x7c/0xd4
[167789.800404] __run_hrtimer+0xd8/0x30c
[167789.800408] hrtimer_interrupt+0x1e4/0x2d0
[167789.800414] arch_timer_handler_phys+0x5c/0xa0
[167789.800423] handle_percpu_devid_irq+0xbc/0x318
[167789.800430] handle_domain_irq+0x7c/0xf0
[167789.800437] gic_handle_irq+0x54/0x12c
[167789.800445] call_on_irq_stack+0x40/0x70
[167789.800451] do_interrupt_handler+0x44/0xa0
[167789.800457] el1_interrupt+0x34/0x64
[167789.800464] el1h_64_irq_handler+0x1c/0x2c
[167789.800470] el1h_64_irq+0x7c/0x80
[167789.800474] xas_find+0xb4/0x28c
[167789.800481] find_get_entry+0x3c/0x178
[167789.800487] find_lock_entries+0x98/0x2f8
[167789.800492] __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500] invalidate_mapping_pages+0x18/0x28
[167789.800506] inode_lru_isolate+0x140/0x2a4
[167789.800512] __list_lru_walk_one+0xd8/0x204
[167789.800519] list_lru_walk_one+0x64/0x90
[167789.800524] prune_icache_sb+0x54/0xe0
[167789.800529] super_cache_scan+0x160/0x1ec
[167789.800535] do_shrink_slab+0x20c/0x5c0
[167789.800541] shrink_slab+0xf0/0x20c
[167789.800546] shrink_node_memcgs+0x98/0x320
[167789.800553] shrink_node+0xe8/0x45c
[167789.800557] balance_pgdat+0x464/0x814
[167789.800563] kswapd+0xfc/0x23c
[167789.800567] kthread+0x164/0x1c8
[167789.800573] ret_from_fork+0x10/0x20

[2]
Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a number of
pages onto cc->migratepages via their pfn
(folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) sets folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split splits the folio so that
each slot from slot[offset] to slot[offset + sibs] holds its own folio

4. alloc_contig_range->migrate_pages->__split_huge_page->folio_lruvec_lock
stalls on the contended lru_lock, so the folio's refcnt is not yet set
back to 2

5. Thread_kswapd enters the livelock via the chain below (see the sketch
after this list):

   rcu_read_lock();
   find_get_entry:
   retry:
       folio = xas_find();
       if (!folio_try_get_rcu(folio)) {
           xas_reset();
           goto retry;
       }
   rcu_read_unlock();

5'. Thread_holdlock, as the holder of lruvec->lru_lock, could be stalled
on the same core as Thread_kswapd.
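
For reference, a minimal C rendering of the loop in step 5, modeled on
find_get_entry() in mm/filemap.c (simplified: the real function also
handles xas_retry(), value entries and re-checks the slot after taking
the reference):

	static struct folio *find_get_entry_sketch(struct xa_state *xas, pgoff_t max)
	{
		struct folio *folio;

	retry:
		folio = xas_find(xas, max);
		if (!folio)
			return NULL;

		/*
		 * A folio whose refcount was frozen to 0 by the split path
		 * fails folio_try_get_rcu(), so the walk is reset and
		 * retried. If the frozen folio never leaves the slot, this
		 * loops forever under rcu_read_lock(), which is the stall
		 * shown in [1].
		 */
		if (!folio_try_get_rcu(folio)) {
			xas_reset(xas);
			goto retry;
		}
		return folio;
	}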

Signed-off-by: Zhaoyang Huang <[email protected]>
---
mm/huge_memory.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..418e8d03480a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2891,7 +2891,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
{
struct folio *folio = page_folio(page);
struct page *head = &folio->page;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = folio_lruvec(folio);
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
int i, nr_dropped = 0;
@@ -2908,8 +2908,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}

- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);

ClearPageHasHWPoisoned(head);

@@ -2942,7 +2940,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,

folio_set_order(new_folio, new_order);
}
- unlock_page_lruvec(lruvec);
/* Caller disabled irqs, so they are still disabled here */

split_page_owner(head, order, new_order);
@@ -2961,7 +2958,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_ref_add(folio, 1 + new_nr);
xa_unlock(&folio->mapping->i_pages);
}
- local_irq_enable();

if (nr_dropped)
shmem_uncharge(folio->mapping->host, nr_dropped);
@@ -3048,6 +3044,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
int extra_pins, ret;
pgoff_t end;
bool is_hzp;
+ struct lruvec *lruvec;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3159,6 +3156,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,

/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
+
+ /*
+ * Take the lruvec lock before freezing the folio, to prevent the folio
+ * from remaining in the page cache with refcnt == 0, which could make
+ * find_get_entry enter a livelock while iterating the xarray.
+ */
+ lruvec = folio_lruvec_lock(folio);
+
if (mapping) {
/*
* Check if the folio is present in page cache.
@@ -3203,12 +3208,16 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
}

__split_huge_page(page, list, end, new_order);
+ unlock_page_lruvec(lruvec);
+ local_irq_enable();
ret = 0;
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
xas_unlock(&xas);
+
+ unlock_page_lruvec(lruvec);
local_irq_enable();
remap_page(folio, folio_nr_pages(folio));
ret = -EAGAIN;
--
2.25.1