LinuxLists.cc - [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

2024-04-25 11:58:02

Subject: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

From: Pankaj Raghav <[email protected]>

Splitting a larger folio with a base order is supported using
split_huge_page_to_list_to_order() API. However, using that API for LBS
is resulting in an NULL ptr dereference error in the writeback path [1].

Refuse to split a folio if it has minimum folio order requirement until
we can start using split_huge_page_to_list_to_order() API. Splitting the
folio can be added as a later optimization.

[1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df

Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
mm/huge_memory.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..dadf1e68dbdc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3117,6 +3117,15 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
goto out;
}

+ /*
+ * Do not split if mapping has minimum folio order
+ * requirement.
+ */
+ if (mapping_min_folio_order(mapping)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
gfp = current_gfp_context(mapping_gfp_mask(mapping) &
GFP_RECLAIM_MASK);

--
2.34.1

2024-04-26 00:48:17

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

2024-04-26 00:57:36

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <[email protected]>
>
> Splitting a larger folio with a base order is supported using
> split_huge_page_to_list_to_order() API. However, using that API for LBS
> is resulting in an NULL ptr dereference error in the writeback path [1].
>
> Refuse to split a folio if it has minimum folio order requirement until
> we can start using split_huge_page_to_list_to_order() API. Splitting the
> folio can be added as a later optimization.
>
> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df

Obviously this has to be tracked down and fixed before this patchset can
be merged ... I think I have some ideas. Let me look a bit. How
would I go about reproducing this?

2024-04-26 15:49:44

by Pankaj Raghav (Samsung)

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <[email protected]>
> >
> > Splitting a larger folio with a base order is supported using
> > split_huge_page_to_list_to_order() API. However, using that API for LBS
> > is resulting in an NULL ptr dereference error in the writeback path [1].
> >
> > Refuse to split a folio if it has minimum folio order requirement until
> > we can start using split_huge_page_to_list_to_order() API. Splitting the
> > folio can be added as a later optimization.
> >
> > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>
> Obviously this has to be tracked down and fixed before this patchset can
> be merged ... I think I have some ideas. Let me look a bit. How
> would I go about reproducing this?

I am able to reproduce it in a VM with 4G RAM and running generic/447
(sometimes you have to run it twice) on a 16K BS on a 4K PS system.

I have a suspicion on this series: https://lore.kernel.org/linux-fsdevel/[email protected]/
but I am still unsure why this is happening when we split with LBS
configurations.

If you have kdevops installed, then go with Luis's suggestion, or else
this is my local config.

This is the diff I applied instead of this patch:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..63ee7b6ed03d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
{
struct folio *folio = page_folio(page);
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+ unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);
+
+ if (!folio_test_anon(folio))
+ new_order = max_t(unsigned int, mapping_min_order, new_order);
/* reset xarray order to new order after split */
XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
struct anon_vma *anon_vma = NULL;
@@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
goto out;
}

+ // XXX: Remove it later
+ VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio);
gfp = current_gfp_context(mapping_gfp_mask(mapping) &
GFP_RECLAIM_MASK);

(END)

xfstests is based on https://github.com/kdave/xfstests/tree/v2024.04.14

xfstests config:

[default]
FSTYP=xfs
RESULT_BASE=/root/results/
DUMP_CORRUPT_FS=1
CANON_DEVS=yes
RECREATE_TEST_DEV=true
TEST_DEV=/dev/nvme0n1
TEST_DIR=/media/test
SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/media/scratch
LOGWRITES_DEV=/dev/vdc

[16k_4ks]
MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -b size=16k, -s size=4k'

[nix-shell:~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
vdb 254:16 0 32G 0 disk /media/scratch
vdc 254:32 0 32G 0 disk
nvme0n1 259:0 0 32G 0 disk /media/test

$ ./check -s 16k_4ks generic/447

BT:
[ 74.170698] BUG: KASAN: null-ptr-deref in filemap_get_folios_tag+0x14b/0x510
[ 74.170938] Write of size 4 at addr 0000000000000036 by task kworker/u16:6/284
[ 74.170938]
[ 74.170938] CPU: 0 PID: 284 Comm: kworker/u16:6 Not tainted 6.9.0-rc4-00011-g4676d00b6f6f #7
[ 74.170938] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 74.170938] Workqueue: writeback wb_workfn (flush-254:16)
[ 74.170938] Call Trace:
[ 74.170938] <TASK>
[ 74.170938] dump_stack_lvl+0x51/0x70
[ 74.170938] kasan_report+0xab/0xe0
[ 74.170938] ? filemap_get_folios_tag+0x14b/0x510
[ 74.170938] kasan_check_range+0x35/0x1b0
[ 74.170938] filemap_get_folios_tag+0x14b/0x510
[ 74.170938] ? __pfx_filemap_get_folios_tag+0x10/0x10
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] writeback_iter+0x508/0xcc0
[ 74.170938] ? __pfx_iomap_do_writepage+0x10/0x10
[ 74.170938] write_cache_pages+0x80/0x100
[ 74.170938] ? __pfx_write_cache_pages+0x10/0x10
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? _raw_spin_lock+0x87/0xe0
[ 74.170938] iomap_writepages+0x85/0xe0
[ 74.170938] xfs_vm_writepages+0xe3/0x140 [xfs]
[ 74.170938] ? __pfx_xfs_vm_writepages+0x10/0x10 [xfs]
[ 74.170938] ? kasan_save_track+0x10/0x30
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? __kasan_kmalloc+0x7b/0x90
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? virtqueue_add_split+0x605/0x1b00
[ 74.170938] do_writepages+0x176/0x740
[ 74.170938] ? __pfx_do_writepages+0x10/0x10
[ 74.170938] ? __pfx_virtqueue_add_split+0x10/0x10
[ 74.170938] ? __pfx_update_sd_lb_stats.constprop.0+0x10/0x10
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? virtqueue_add_sgs+0xfe/0x130
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? virtblk_add_req+0x15c/0x280
[ 74.170938] __writeback_single_inode+0x9f/0x840
[ 74.170938] ? wbc_attach_and_unlock_inode+0x345/0x5d0
[ 74.170938] writeback_sb_inodes+0x491/0xce0
[ 74.170938] ? __pfx_wb_calc_thresh+0x10/0x10
[ 74.170938] ? __pfx_writeback_sb_inodes+0x10/0x10
[ 74.170938] ? __wb_calc_thresh+0x1a0/0x3c0
[ 74.170938] ? __pfx_down_read_trylock+0x10/0x10
[ 74.170938] ? wb_over_bg_thresh+0x16b/0x5e0
[ 74.170938] ? __pfx_move_expired_inodes+0x10/0x10
[ 74.170938] __writeback_inodes_wb+0xb7/0x200
[ 74.170938] wb_writeback+0x2c4/0x660
[ 74.170938] ? __pfx_wb_writeback+0x10/0x10
[ 74.170938] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 74.170938] wb_workfn+0x54e/0xaf0
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? __pfx_wb_workfn+0x10/0x10
[ 74.170938] ? __pfx___schedule+0x10/0x10
[ 74.170938] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 74.170938] process_one_work+0x622/0x1020
[ 74.170938] worker_thread+0x844/0x10e0
[ 74.170938] ? srso_return_thunk+0x5/0x5f
[ 74.170938] ? __kthread_parkme+0x82/0x150
[ 74.170938] ? __pfx_worker_thread+0x10/0x10
[ 74.170938] kthread+0x2b4/0x380
[ 74.170938] ? __pfx_kthread+0x10/0x10
[ 74.170938] ret_from_fork+0x30/0x70
[ 74.170938] ? __pfx_kthread+0x10/0x10
[ 74.170938] ret_from_fork_asm+0x1a/0x30
[ 74.170938] </TASK>

2024-04-27 04:32:46

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <[email protected]>
> > >
> > > using that API for LBS is resulting in an NULL ptr dereference
> > > error in the writeback path [1].
> > >
> > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >
> > How would I go about reproducing this?
>
> Using kdevops this is easy:
>
> make defconfig-lbs-xfs-small -j $(nproc)
> make -j $(nproc)

I forgot after the above:

make bringup

> make fstests
> make linux
> make fstests-baseline TESTS=generic/447 COUNT=10
> tail -f guestfs/*-xfs-reflink-16k-4ks/console.log

The above tail command needs sudo prefix for now too.

> or
> sudo virsh list
> sudo virsh console ${foo}-xfs-reflink-16k-4ks
>
> Where $foo is the value of CONFIG_KDEVOPS_HOSTS_PREFIX in .config for
> your kdevops run.
>
> I didn't have time to verify if the above commands for kdevops worked

I did now, I forgot to git push commits to kdevops yesterday but I
confirm the above steps can be used to repdroduce this issue now.

Luis

2024-04-28 05:46:51

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > > From: Pankaj Raghav <[email protected]>
> > > >
> > > > using that API for LBS is resulting in an NULL ptr dereference
> > > > error in the writeback path [1].
> > > >
> > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > >
> > > How would I go about reproducing this?

Well so the below fixes this but I am not sure if this is correct.
folio_mark_dirty() at least says that a folio should not be truncated
while its running. I am not sure if we should try to split folios then
even though we check for writeback once. truncate_inode_partial_folio()
will folio_wait_writeback() but it will split_folio() before checking
for claiming to fail to truncate with folio_test_dirty(). But since the
folio is locked its not clear why this should be possible.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 83955362d41c..90195506211a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
if (new_order >= folio_order(folio))
return -EINVAL;

- if (folio_test_writeback(folio))
+ if (folio_test_dirty(folio) || folio_test_writeback(folio))
return -EBUSY;

if (!folio_test_anon(folio)) {

2024-04-29 05:11:52

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> > > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > > > From: Pankaj Raghav <[email protected]>
> > > > >
> > > > > using that API for LBS is resulting in an NULL ptr dereference
> > > > > error in the writeback path [1].
> > > > >
> > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > > >
> > > > How would I go about reproducing this?
>
> Well so the below fixes this but I am not sure if this is correct.
> folio_mark_dirty() at least says that a folio should not be truncated
> while its running. I am not sure if we should try to split folios then
> even though we check for writeback once. truncate_inode_partial_folio()
> will folio_wait_writeback() but it will split_folio() before checking
> for claiming to fail to truncate with folio_test_dirty(). But since the
> folio is locked its not clear why this should be possible.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 83955362d41c..90195506211a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> if (new_order >= folio_order(folio))
> return -EINVAL;
>
> - if (folio_test_writeback(folio))
> + if (folio_test_dirty(folio) || folio_test_writeback(folio))
> return -EBUSY;
>
> if (!folio_test_anon(folio)) {

I wondered what code path is causing this and triggering this null
pointer, so I just sprinkled a check here:

VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);

The answer was:

kcompactd() --> migrate_pages_batch()
--> try_split_folio --> split_folio_to_list() -->
split_huge_page_to_list_to_order()

Since it took running fstests generic/447 twice to reproduce the crash
with a minorder and 16k block size, modified generic/447 as followed and
it helps to speed up the reproducer with just running the test once:

diff --git a/tests/generic/447 b/tests/generic/447
index 16b814ee7347..43050b58e8ba 100755
--- a/tests/generic/447
+++ b/tests/generic/447
@@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1
testdir="$SCRATCH_MNT/test-$seq"
mkdir "$testdir"

+runfile="$tmp.compaction"
+touch $runfile
+while [ -e $runfile ]; do
+ echo 1 > /proc/sys/vm/compact_memory
+ sleep 10
+done &
+compaction_pid=$!
+
+
# Setup for one million blocks, but we'll accept stress testing down to
# 2^17 blocks... that should be plenty for anyone.
fnr=20
@@ -69,6 +81,8 @@ _scratch_cycle_mount
echo "Delete file1"
rm -rf $testdir/file1

+rm -f $runfile
+wait > /dev/null 2>&1
# success, all done
status=0
exit

And I verified that moving the check only to the migrate_pages_batch()
path also fixes the crash:

diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..83b528eb7100 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
int rc;

folio_lock(folio);
+ if (folio_test_dirty(folio)) {
+ rc = -EBUSY;
+ goto out;
+ }
rc = split_folio_to_list(folio, split_folios);
+out:
folio_unlock(folio);
if (!rc)
list_move_tail(&folio->lru, split_folios);

However I'd like compaction folks to review this. I see some indications
in the code that migration can race with truncation but we feel fine by
it by taking the folio lock. However here we have a case where we see
the folio clearly locked and the folio is dirty. Other migraiton code
seems to write back the code and can wait, here we just move on. Further
reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
as completely unmovable") seems to hint that migration is safe if the
mapping either does not exist or the mapping does exist but has
mapping->a_ops->migrate_folio so I'd like further feedback on this.
Another thing which requires review is if we we split a folio but not
down to order 0 but to the new min order, does the accounting on
migrate_pages_batch() require changing? And most puzzling, why do we
not see this with regular large folios, but we do see it with minorder ?

Luis

2024-04-29 14:30:02

by Zi Yan

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:

> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
>>>>>> From: Pankaj Raghav <[email protected]>
>>>>>>
>>>>>> using that API for LBS is resulting in an NULL ptr dereference
>>>>>> error in the writeback path [1].
>>>>>>
>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>>>>>
>>>>> How would I go about reproducing this?
>>
>> Well so the below fixes this but I am not sure if this is correct.
>> folio_mark_dirty() at least says that a folio should not be truncated
>> while its running. I am not sure if we should try to split folios then
>> even though we check for writeback once. truncate_inode_partial_folio()
>> will folio_wait_writeback() but it will split_folio() before checking
>> for claiming to fail to truncate with folio_test_dirty(). But since the
>> folio is locked its not clear why this should be possible.
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 83955362d41c..90195506211a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> if (new_order >= folio_order(folio))
>> return -EINVAL;
>>
>> - if (folio_test_writeback(folio))
>> + if (folio_test_dirty(folio) || folio_test_writeback(folio))
>> return -EBUSY;
>>
>> if (!folio_test_anon(folio)) {
>
> I wondered what code path is causing this and triggering this null
> pointer, so I just sprinkled a check here:
>
> VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
>
> The answer was:
>
> kcompactd() --> migrate_pages_batch()
> --> try_split_folio --> split_folio_to_list() -->
> split_huge_page_to_list_to_order()
>

There are 3 try_split_folio() in migrate_pages_batch(). First one is to
split anonymous large folios that are on deferred split list, so not related;
second one is to split THPs when thp migration is not supported, but
this is compaction, so not related; third one is to split large folios
when there is no same size free page in the system, and this should be
the one.

> Since it took running fstests generic/447 twice to reproduce the crash
> with a minorder and 16k block size, modified generic/447 as followed and
> it helps to speed up the reproducer with just running the test once:
>
> diff --git a/tests/generic/447 b/tests/generic/447
> index 16b814ee7347..43050b58e8ba 100755
> --- a/tests/generic/447
> +++ b/tests/generic/447
> @@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1
> testdir="$SCRATCH_MNT/test-$seq"
> mkdir "$testdir"
>
> +runfile="$tmp.compaction"
> +touch $runfile
> +while [ -e $runfile ]; do
> + echo 1 > /proc/sys/vm/compact_memory
> + sleep 10
> +done &
> +compaction_pid=$!
> +
> +
> # Setup for one million blocks, but we'll accept stress testing down to
> # 2^17 blocks... that should be plenty for anyone.
> fnr=20
> @@ -69,6 +81,8 @@ _scratch_cycle_mount
> echo "Delete file1"
> rm -rf $testdir/file1
>
> +rm -f $runfile
> +wait > /dev/null 2>&1
> # success, all done
> status=0
> exit
>
> And I verified that moving the check only to the migrate_pages_batch()
> path also fixes the crash:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 73a052a382f1..83b528eb7100 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> int rc;
>
> folio_lock(folio);
> + if (folio_test_dirty(folio)) {
> + rc = -EBUSY;
> + goto out;
> + }
> rc = split_folio_to_list(folio, split_folios);
> +out:
> folio_unlock(folio);
> if (!rc)
> list_move_tail(&folio->lru, split_folios);
>
> However I'd like compaction folks to review this. I see some indications
> in the code that migration can race with truncation but we feel fine by
> it by taking the folio lock. However here we have a case where we see
> the folio clearly locked and the folio is dirty. Other migraiton code
> seems to write back the code and can wait, here we just move on. Further
> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> as completely unmovable") seems to hint that migration is safe if the
> mapping either does not exist or the mapping does exist but has
> mapping->a_ops->migrate_folio so I'd like further feedback on this.

During migration, all page table entries pointing to this dirty folio
are invalid, and accesses to this folio will cause page fault and
wait on the migration entry. I am not sure we need to skip dirty folios.

> Another thing which requires review is if we we split a folio but not
> down to order 0 but to the new min order, does the accounting on
> migrate_pages_batch() require changing? And most puzzling, why do we

What accounting are you referring to? split code should take care of it.

> not see this with regular large folios, but we do see it with minorder ?

I wonder if the split code handles folio->mapping->i_pages properly.
Does the i_pages store just folio pointers or also need all tail page
pointers? I am no expert in fs, thus need help.

--
Best Regards,
Yan, Zi

Attachments:

signature.asc (871.00 B)
OpenPGP digital signature

2024-04-30 02:43:44

by Zi Yan

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On 29 Apr 2024, at 20:31, Luis Chamberlain wrote:

> On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
>> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
>>
>>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
>>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
>>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
>>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
>>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
>>>>>>>> From: Pankaj Raghav <[email protected]>
>>>>>>>>
>>>>>>>> using that API for LBS is resulting in an NULL ptr dereference
>>>>>>>> error in the writeback path [1].
>>>>>>>>
>>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>>>>>>>
>>>>>>> How would I go about reproducing this?
>>>>
>>>> Well so the below fixes this but I am not sure if this is correct.
>>>> folio_mark_dirty() at least says that a folio should not be truncated
>>>> while its running. I am not sure if we should try to split folios then
>>>> even though we check for writeback once. truncate_inode_partial_folio()
>>>> will folio_wait_writeback() but it will split_folio() before checking
>>>> for claiming to fail to truncate with folio_test_dirty(). But since the
>>>> folio is locked its not clear why this should be possible.
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 83955362d41c..90195506211a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> if (new_order >= folio_order(folio))
>>>> return -EINVAL;
>>>>
>>>> - if (folio_test_writeback(folio))
>>>> + if (folio_test_dirty(folio) || folio_test_writeback(folio))
>>>> return -EBUSY;
>>>>
>>>> if (!folio_test_anon(folio)) {
>>>
>>> I wondered what code path is causing this and triggering this null
>>> pointer, so I just sprinkled a check here:
>>>
>>> VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
>>>
>>> The answer was:
>>>
>>> kcompactd() --> migrate_pages_batch()
>>> --> try_split_folio --> split_folio_to_list() -->
>>> split_huge_page_to_list_to_order()
>>>
>>
>> There are 3 try_split_folio() in migrate_pages_batch().
>
> This is only true for linux-next, for v6.9-rc5 off of which this testing
> is based on there are only two.
>
>> First one is to split anonymous large folios that are on deferred
>> split list, so not related;
>
> This is in linux-next and not v6.9-rc5.
>
>> second one is to split THPs when thp migration is not supported, but
>> this is compaction, so not related; third one is to split large folios
>> when there is no same size free page in the system, and this should be
>> the one.
>
> Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
> also helps us enhance the reproducer further, which I'll do next.
>
>>> And I verified that moving the check only to the migrate_pages_batch()
>>> path also fixes the crash:
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 73a052a382f1..83b528eb7100 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>> int rc;
>>>
>>> folio_lock(folio);
>>> + if (folio_test_dirty(folio)) {
>>> + rc = -EBUSY;
>>> + goto out;
>>> + }
>>> rc = split_folio_to_list(folio, split_folios);
>>> +out:
>>> folio_unlock(folio);
>>> if (!rc)
>>> list_move_tail(&folio->lru, split_folios);
>>>
>>> However I'd like compaction folks to review this. I see some indications
>>> in the code that migration can race with truncation but we feel fine by
>>> it by taking the folio lock. However here we have a case where we see
>>> the folio clearly locked and the folio is dirty. Other migraiton code
>>> seems to write back the code and can wait, here we just move on. Further
>>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
>>> as completely unmovable") seems to hint that migration is safe if the
>>> mapping either does not exist or the mapping does exist but has
>>> mapping->a_ops->migrate_folio so I'd like further feedback on this.
>>
>> During migration, all page table entries pointing to this dirty folio
>> are invalid, and accesses to this folio will cause page fault and
>> wait on the migration entry. I am not sure we need to skip dirty folios.
>
> I see.. thanks!
>
>>> Another thing which requires review is if we we split a folio but not
>>> down to order 0 but to the new min order, does the accounting on
>>> migrate_pages_batch() require changing? And most puzzling, why do we
>>
>> What accounting are you referring to? split code should take care of it.
>
> The folio order can change after split, and so I was concerned about the
> nr_pages used in migrate_pages_batch(). But I see now that when
> migrate_folio_unmap() first failed we try to split the folio, and if
> successful I see now we the caller will again call migrate_pages_batch()
> with a retry attempt of 1 only to the split folios. I also see the
> nr_pages is just local to each list for each loop, first on the from
> list to unmap and afte on the unmap list so we move the folios.
>
>>> not see this with regular large folios, but we do see it with minorder ?
>>
>> I wonder if the split code handles folio->mapping->i_pages properly.
>> Does the i_pages store just folio pointers or also need all tail page
>> pointers? I am no expert in fs, thus need help.
>
> mapping->i_pages stores folio pointers in the page cache or
> swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
> special and we special-case them with shmem_mapping(mapping) checks.
> split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow
> entries, and we also bail out on shmem_mapping(mapping) already.

Hmm, I misunderstood the issue above. To clarify it, the error comes out
when a page cache folio with minorder is split to order-0, an NULL ptr
defer shows up in the writeback path. I thought the folio was split to
non-0 order. split_huge_page_to_list_to_order() should be fine, since
splitting to order-0 is not changed after my patches.

I wonder if you can isolate the issue by just splitting a dirty minorder
page cache folio instead of having folio split and migration going on together.
You probably can use the debugfs to do that. Depending on the result,
we can narrow down the cause of the issue.

--
Best Regards,
Yan, Zi

Attachments:

signature.asc (871.00 B)
OpenPGP digital signature

2024-04-30 06:28:06

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
>
> > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>> From: Pankaj Raghav <[email protected]>
> >>>>>>
> >>>>>> using that API for LBS is resulting in an NULL ptr dereference
> >>>>>> error in the writeback path [1].
> >>>>>>
> >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>
> >>>>> How would I go about reproducing this?
> >>
> >> Well so the below fixes this but I am not sure if this is correct.
> >> folio_mark_dirty() at least says that a folio should not be truncated
> >> while its running. I am not sure if we should try to split folios then
> >> even though we check for writeback once. truncate_inode_partial_folio()
> >> will folio_wait_writeback() but it will split_folio() before checking
> >> for claiming to fail to truncate with folio_test_dirty(). But since the
> >> folio is locked its not clear why this should be possible.
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 83955362d41c..90195506211a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >> if (new_order >= folio_order(folio))
> >> return -EINVAL;
> >>
> >> - if (folio_test_writeback(folio))
> >> + if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >> return -EBUSY;
> >>
> >> if (!folio_test_anon(folio)) {
> >
> > I wondered what code path is causing this and triggering this null
> > pointer, so I just sprinkled a check here:
> >
> > VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >
> > The answer was:
> >
> > kcompactd() --> migrate_pages_batch()
> > --> try_split_folio --> split_folio_to_list() -->
> > split_huge_page_to_list_to_order()
> >
>
> There are 3 try_split_folio() in migrate_pages_batch().

This is only true for linux-next, for v6.9-rc5 off of which this testing
is based on there are only two.

> First one is to split anonymous large folios that are on deferred
> split list, so not related;

This is in linux-next and not v6.9-rc5.

> second one is to split THPs when thp migration is not supported, but
> this is compaction, so not related; third one is to split large folios
> when there is no same size free page in the system, and this should be
> the one.

Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
also helps us enhance the reproducer further, which I'll do next.

> > And I verified that moving the check only to the migrate_pages_batch()
> > path also fixes the crash:
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 73a052a382f1..83b528eb7100 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> > int rc;
> >
> > folio_lock(folio);
> > + if (folio_test_dirty(folio)) {
> > + rc = -EBUSY;
> > + goto out;
> > + }
> > rc = split_folio_to_list(folio, split_folios);
> > +out:
> > folio_unlock(folio);
> > if (!rc)
> > list_move_tail(&folio->lru, split_folios);
> >
> > However I'd like compaction folks to review this. I see some indications
> > in the code that migration can race with truncation but we feel fine by
> > it by taking the folio lock. However here we have a case where we see
> > the folio clearly locked and the folio is dirty. Other migraiton code
> > seems to write back the code and can wait, here we just move on. Further
> > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > as completely unmovable") seems to hint that migration is safe if the
> > mapping either does not exist or the mapping does exist but has
> > mapping->a_ops->migrate_folio so I'd like further feedback on this.
>
> During migration, all page table entries pointing to this dirty folio
> are invalid, and accesses to this folio will cause page fault and
> wait on the migration entry. I am not sure we need to skip dirty folios.

I see.. thanks!

> > Another thing which requires review is if we we split a folio but not
> > down to order 0 but to the new min order, does the accounting on
> > migrate_pages_batch() require changing? And most puzzling, why do we
>
> What accounting are you referring to? split code should take care of it.

The folio order can change after split, and so I was concerned about the
nr_pages used in migrate_pages_batch(). But I see now that when
migrate_folio_unmap() first failed we try to split the folio, and if
successful I see now we the caller will again call migrate_pages_batch()
with a retry attempt of 1 only to the split folios. I also see the
nr_pages is just local to each list for each loop, first on the from
list to unmap and afte on the unmap list so we move the folios.

> > not see this with regular large folios, but we do see it with minorder ?
>
> I wonder if the split code handles folio->mapping->i_pages properly.
> Does the i_pages store just folio pointers or also need all tail page
> pointers? I am no expert in fs, thus need help.

mapping->i_pages stores folio pointers in the page cache or
swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
special and we special-case them with shmem_mapping(mapping) checks.
split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow
entries, and we also bail out on shmem_mapping(mapping) already.

Luis

2024-04-30 06:58:09

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Mon, Apr 29, 2024 at 05:31:04PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> > On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 73a052a382f1..83b528eb7100 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> > > int rc;
> > >
> > > folio_lock(folio);
> > > + if (folio_test_dirty(folio)) {
> > > + rc = -EBUSY;
> > > + goto out;
> > > + }
> > > rc = split_folio_to_list(folio, split_folios);
> > > +out:
> > > folio_unlock(folio);
> > > if (!rc)
> > > list_move_tail(&folio->lru, split_folios);
> > >
> > > However I'd like compaction folks to review this. I see some indications
> > > in the code that migration can race with truncation but we feel fine by
> > > it by taking the folio lock. However here we have a case where we see
> > > the folio clearly locked and the folio is dirty. Other migraiton code
> > > seems to write back the code and can wait, here we just move on. Further
> > > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > > as completely unmovable") seems to hint that migration is safe if the
> > > mapping either does not exist or the mapping does exist but has
> > > mapping->a_ops->migrate_folio so I'd like further feedback on this.
> >
> > During migration, all page table entries pointing to this dirty folio
> > are invalid, and accesses to this folio will cause page fault and
> > wait on the migration entry. I am not sure we need to skip dirty folios.

That would seem to explain why we get this, if we allow these to
continue, ie, without the hunk above, so it begs to ask, are we sure
we are waiting for migration to complete if we're doing writeback?

Apr 29 17:28:54 min-xfs-reflink-16k-4ks unknown: run fstests generic/447 at 2024-04-29 17:28:54
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled.
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount
Apr 29 17:28:56 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Unmounting Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled.
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem e29c3eee-18d1-4fec-9a17-076b3eccbd74
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: supervisor read access in kernel mode
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: error_code(0x0000) - not-present page
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PGD 0 P4D 0
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CPU: 3 PID: 2245 Comm: kworker/u36:5 Not tainted 6.9.0-rc5+ #26
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Workqueue: writeback wb_workfn (flush-7:5)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Code: ad 05 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53
All code
========
0: ad lods %ds:(%rsi),%eax
1: 05 86 00 48 89 add $0x89480086,%eax
6: c3 ret
7: 48 3d 06 04 00 00 cmp $0x406,%rax
d: 74 e8 je 0xfffffffffffffff7
f: 48 81 fb 02 04 00 00 cmp $0x402,%rbx
16: 0f 84 d0 00 00 00 je 0xec
1c: 48 85 db test %rbx,%rbx
1f: 0f 84 04 01 00 00 je 0x129
25: f6 c3 01 test $0x1,%bl
28: 75 c4 jne 0xffffffffffffffee
2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction
2d: 85 c0 test %eax,%eax
2f: 0f 84 b7 00 00 00 je 0xec
35: 8d 50 01 lea 0x1(%rax),%edx
38: 48 8d 73 34 lea 0x34(%rbx),%rsi
3c: f0 lock
3d: 0f .byte 0xf
3e: b1 53 mov $0x53,%cl

Code starting with the faulting instruction
===========================================
0: 8b 43 34 mov 0x34(%rbx),%eax
3: 85 c0 test %eax,%eax
5: 0f 84 b7 00 00 00 je 0xc2
b: 8d 50 01 lea 0x1(%rax),%edx
e: 48 8d 73 34 lea 0x34(%rbx),%rsi
12: f0 lock
13: 0f .byte 0xf
14: b1 53 mov $0x53,%cl
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RSP: 0018:ffffad0a0396b8f8 EFLAGS: 00010246
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 00000000000ba200
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9f408d20d488
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R13: ffffad0a0396bbb8 R14: ffffad0a0396bcb8 R15: ffff9f40633bfc00
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: FS: 0000000000000000(0000) GS:ffff9f40bbcc0000(0000) knlGS:0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036 CR3: 000000010bd44006 CR4: 0000000000770ef0
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PKRU: 55555554
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Call Trace:
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: <TASK>
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? page_fault_oops (arch/x86/mm/fault.c:713)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? write_cache_pages (mm/page-writeback.c:2569)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1))
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (mm/filemap.c:1972 mm/filemap.c:2222)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: write_cache_pages (mm/page-writeback.c:2568)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: iomap_writepages (fs/iomap/buffered-io.c:1984)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: do_writepages (mm/page-writeback.c:2612)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_single_inode (fs/fs-writeback.c:1659)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? _raw_spin_lock (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:134 (discriminator 4) kernel/locking/spinlock.c:154 (discriminator 4))
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_sb_inodes (fs/fs-writeback.c:1943)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_writeback (fs/fs-writeback.c:2119)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1))
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: process_one_work (kernel/workqueue.c:3254)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2))
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: kthread (kernel/kthread.c:388)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork (arch/x86/kernel/process.c:147)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: </TASK>
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Modules linked in: xfs nvme_fabrics nvme_core t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sunrpc 9p netfs kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_balloon pcspkr 9pnet_virtio virtio_console button evdev joydev serio_raw loop drm dm_mod autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover failover virtio_blk crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ---[ end trace 0000000000000000 ]---

2024-04-30 23:37:04

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Mon, Apr 29, 2024 at 10:43:16PM -0400, Zi Yan wrote:
> On 29 Apr 2024, at 20:31, Luis Chamberlain wrote:
>
> > On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> >> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> >>
> >>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>>>> From: Pankaj Raghav <[email protected]>
> >>>>>>>>
> >>>>>>>> using that API for LBS is resulting in an NULL ptr dereference
> >>>>>>>> error in the writeback path [1].
> >>>>>>>>
> >>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>>>
> >>>>>>> How would I go about reproducing this?
> >>>>
> >>>> Well so the below fixes this but I am not sure if this is correct.
> >>>> folio_mark_dirty() at least says that a folio should not be truncated
> >>>> while its running. I am not sure if we should try to split folios then
> >>>> even though we check for writeback once. truncate_inode_partial_folio()
> >>>> will folio_wait_writeback() but it will split_folio() before checking
> >>>> for claiming to fail to truncate with folio_test_dirty(). But since the
> >>>> folio is locked its not clear why this should be possible.
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 83955362d41c..90195506211a 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>>> if (new_order >= folio_order(folio))
> >>>> return -EINVAL;
> >>>>
> >>>> - if (folio_test_writeback(folio))
> >>>> + if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >>>> return -EBUSY;
> >>>>
> >>>> if (!folio_test_anon(folio)) {
> >>>
> >>> I wondered what code path is causing this and triggering this null
> >>> pointer, so I just sprinkled a check here:
> >>>
> >>> VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >>>
> >>> The answer was:
> >>>
> >>> kcompactd() --> migrate_pages_batch()
> >>> --> try_split_folio --> split_folio_to_list() -->
> >>> split_huge_page_to_list_to_order()
> >>>
> >>
> >> There are 3 try_split_folio() in migrate_pages_batch().
> >
> > This is only true for linux-next, for v6.9-rc5 off of which this testing
> > is based on there are only two.
> >
> >> First one is to split anonymous large folios that are on deferred
> >> split list, so not related;
> >
> > This is in linux-next and not v6.9-rc5.
> >
> >> second one is to split THPs when thp migration is not supported, but
> >> this is compaction, so not related; third one is to split large folios
> >> when there is no same size free page in the system, and this should be
> >> the one.
> >
> > Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
> > also helps us enhance the reproducer further, which I'll do next.
> >
> >>> And I verified that moving the check only to the migrate_pages_batch()
> >>> path also fixes the crash:
> >>>
> >>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>> index 73a052a382f1..83b528eb7100 100644
> >>> --- a/mm/migrate.c
> >>> +++ b/mm/migrate.c
> >>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> >>> int rc;
> >>>
> >>> folio_lock(folio);
> >>> + if (folio_test_dirty(folio)) {
> >>> + rc = -EBUSY;
> >>> + goto out;
> >>> + }
> >>> rc = split_folio_to_list(folio, split_folios);
> >>> +out:
> >>> folio_unlock(folio);
> >>> if (!rc)
> >>> list_move_tail(&folio->lru, split_folios);
> >>>
> >>> However I'd like compaction folks to review this. I see some indications
> >>> in the code that migration can race with truncation but we feel fine by
> >>> it by taking the folio lock. However here we have a case where we see
> >>> the folio clearly locked and the folio is dirty. Other migraiton code
> >>> seems to write back the code and can wait, here we just move on. Further
> >>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> >>> as completely unmovable") seems to hint that migration is safe if the
> >>> mapping either does not exist or the mapping does exist but has
> >>> mapping->a_ops->migrate_folio so I'd like further feedback on this.
> >>
> >> During migration, all page table entries pointing to this dirty folio
> >> are invalid, and accesses to this folio will cause page fault and
> >> wait on the migration entry. I am not sure we need to skip dirty folios.
> >
> > I see.. thanks!
> >
> >>> Another thing which requires review is if we we split a folio but not
> >>> down to order 0 but to the new min order, does the accounting on
> >>> migrate_pages_batch() require changing? And most puzzling, why do we
> >>
> >> What accounting are you referring to? split code should take care of it.
> >
> > The folio order can change after split, and so I was concerned about the
> > nr_pages used in migrate_pages_batch(). But I see now that when
> > migrate_folio_unmap() first failed we try to split the folio, and if
> > successful I see now we the caller will again call migrate_pages_batch()
> > with a retry attempt of 1 only to the split folios. I also see the
> > nr_pages is just local to each list for each loop, first on the from
> > list to unmap and afte on the unmap list so we move the folios.
> >
> >>> not see this with regular large folios, but we do see it with minorder ?
> >>
> >> I wonder if the split code handles folio->mapping->i_pages properly.
> >> Does the i_pages store just folio pointers or also need all tail page
> >> pointers? I am no expert in fs, thus need help.
> >
> > mapping->i_pages stores folio pointers in the page cache or
> > swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
> > special and we special-case them with shmem_mapping(mapping) checks.
> > split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow
> > entries, and we also bail out on shmem_mapping(mapping) already.
>
> Hmm, I misunderstood the issue above. To clarify it, the error comes out
> when a page cache folio with minorder is split to order-0,

No, min order is used.

In order to support splits with min order we require an out of tree
patch not yet posted:

https://github.com/linux-kdevops/linux/commit/e77a2a4fd6d9aa7e2641d5ea456ad0522c1e8a04

The important part is if no order is specified we use the min order:

int split_folio_to_list(struct folio *folio, struct list_head *list)
{
unsigned int min_order = 0;

if (!folio_test_anon(folio))
min_order = mapping_min_folio_order(folio->mapping);

return split_huge_page_to_list_to_order(&folio->page, list, min_order);
}

and so compaction's try_split_folio() -->
split_folio_to_list(folio, split_folios)

will use the min order implicitly due to the above.

So yes, we see a null ptr deref on the writeback path when min order is set.

> I wonder if you can isolate the issue by just splitting a dirty minorder
> page cache folio instead of having folio split and migration going on together.
> You probably can use the debugfs to do that. Depending on the result,
> we can narrow down the cause of the issue.

That's what I had tried with my new fstsest test but now I see where it
also failed -- on 4k filesystems it was trying to split to order 0 and
that had no issues as you pointed out. We can now fine tune the test
very well. I can now reproduce the crash on plain on boring vanilla
linux v6.9-rc6 on a plain xfs filesystem with 4k block size on x86_64
doing this:

You may want the following appended to your kernel command line:

dyndbg='file mm/huge_memory.c +p'

mkfs.xfs /dev/vdd
mkdir -p /media/scratch/
mount /dev/vdd /media/scratch/

while true; do dd if=/dev/zero of=$FILE bs=4M count=200 2> /dev/null; done &
while true; do sleep 2; echo $FILE,0x0,0x4000,2 > /sys/kernel/debug/split_huge_pages 0x400000 2> /dev/null; done

The crash:

Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: SGI XFS with ACLs, security attributes, realtime, scrub, repair, quota, fatal assert, debug enabled
Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Mounting V5 Filesystem d1f9e444-f61c-4439-a2bf-61a13f6d8e81
Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Ending clean mount
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: huge_memory: split file-backed THPs in file: /media/scratch/foo, page offset: [0x0 - 0x200000]
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: supervisor read access in kernel mode
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: error_code(0x0000) - not-present page
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PGD 0 P4D 0
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CPU: 4 PID: 89 Comm: kworker/u37:2 Not tainted 6.9.0-rc6 #10
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Workqueue: writeback wb_workfn (flush-254:48)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Code: bd 06 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53
All code
========
0: bd 06 86 00 48 mov $0x48008606,%ebp
5: 89 c3 mov %eax,%ebx
7: 48 3d 06 04 00 00 cmp $0x406,%rax
d: 74 e8 je 0xfffffffffffffff7
f: 48 81 fb 02 04 00 00 cmp $0x402,%rbx
16: 0f 84 d0 00 00 00 je 0xec
1c: 48 85 db test %rbx,%rbx
1f: 0f 84 04 01 00 00 je 0x129
25: f6 c3 01 test $0x1,%bl
28: 75 c4 jne 0xffffffffffffffee
2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction
2d: 85 c0 test %eax,%eax
2f: 0f 84 b7 00 00 00 je 0xec
35: 8d 50 01 lea 0x1(%rax),%edx
38: 48 8d 73 34 lea 0x34(%rbx),%rsi
3c: f0 lock
3d: 0f .byte 0xf
3e: b1 53 mov $0x53,%cl

Code starting with the faulting instruction
===========================================
0: 8b 43 34 mov 0x34(%rbx),%eax
3: 85 c0 test %eax,%eax
5: 0f 84 b7 00 00 00 je 0xc2
b: 8d 50 01 lea 0x1(%rax),%edx
e: 48 8d 73 34 lea 0x34(%rbx),%rsi
12: f0 lock
13: 0f .byte 0xf
14: b1 53 mov $0x53,%cl
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RSP: 0018:ffffa8f0c07cb8f8 EFLAGS: 00010246
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000018000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff987380564920
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R13: ffffa8f0c07cbbb8 R14: ffffa8f0c07cbcb8 R15: ffff98738c4ea800
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff9873fbd00000(0000) knlGS:0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CR2: 0000000000000036 CR3: 000000011aca8003 CR4: 0000000000770ef0
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PKRU: 55555554
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Call Trace:
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: <TASK>
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? page_fault_oops (arch/x86/mm/fault.c:713)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1))
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (mm/filemap.c:1968 mm/filemap.c:2218)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: write_cache_pages (mm/page-writeback.c:2568)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: iomap_writepages (fs/iomap/buffered-io.c:1984)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: do_writepages (mm/page-writeback.c:2612)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? update_sd_lb_stats.constprop.0 (kernel/sched/fair.c:9902 (discriminator 2) kernel/sched/fair.c:10583 (discriminator 2))
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_single_inode (fs/fs-writeback.c:1659)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_sb_inodes (fs/fs-writeback.c:1943)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_writeback (fs/fs-writeback.c:2119)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1))
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: process_one_work (kernel/workqueue.c:3254)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2))
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4))
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: kthread (kernel/kthread.c:388)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork (arch/x86/kernel/process.c:147)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: </TASK>

The full decoded crash on v6.9-rc6:

https://gist.github.com/mcgrof/c44aaed21b99ae4ecf3d7fc6a1bb00bc

Luis

2024-05-01 09:02:02

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote:
> 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction
> RBX: 0000000000000002 RCX: 0000000000018000

Thanks, got it. I'll send a patch in the morning, but I know exactly
what the problem is. You're seeing sibling entries tagged as dirty.
That shouldn't happen; we should only see folios tagged as dirty.
The bug is in node_set_marks() which calls node_mark_all(). This works
fine when splitting to order 0, but we should only mark the first entry
of each order. eg if we split to order 3, we should tag slots 0, 8,
16, 24, .., 56.

2024-05-01 18:06:56

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On Wed, May 01, 2024 at 05:13:59AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote:
> > 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction
> > RBX: 0000000000000002 RCX: 0000000000018000
>
> Thanks, got it. I'll send a patch in the morning, but I know exactly
> what the problem is. You're seeing sibling entries tagged as dirty.
> That shouldn't happen; we should only see folios tagged as dirty.
> The bug is in node_set_marks() which calls node_mark_all(). This works
> fine when splitting to order 0, but we should only mark the first entry
> of each order. eg if we split to order 3, we should tag slots 0, 8,
> 16, 24, .., 56.

Confirmed:

+++ b/lib/test_xarray.c
@@ -1789,8 +1789,10 @@ static void check_split_1(struct xarray *xa, unsigned lon
g index,
{
XA_STATE_ORDER(xas, xa, index, new_order);
unsigned int i;
+ void *entry;

xa_store_order(xa, index, order, xa, GFP_KERNEL);
+ xa_set_mark(xa, index, XA_MARK_1);

xas_split_alloc(&xas, xa, order, GFP_KERNEL);
xas_lock(&xas);
@@ -1807,6 +1809,12 @@ static void check_split_1(struct xarray *xa, unsigned long index,
xa_set_mark(xa, index, XA_MARK_0);
XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));

+ xas_set_order(&xas, index, 0);
+ rcu_read_lock();
+ xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1)
+ XA_BUG_ON(xa, xa_is_internal(entry));
+ rcu_read_unlock();
+
xa_destroy(xa);
}

spits out:

$ ./tools/testing/radix-tree/xarray
BUG at check_split_1:1815
xarray: 0x562b4043e580x head 0x50c0095cc082x flags 3000000 marks 1 1 0
0-63: node 0x50c0095cc080x max 0 parent (nil)x shift 3 count 1 values 0 array 0x562b4043e580x list 0x50c0095cc098x 0x50c0095cc098x marks 1 1 0
0-7: node 0x50c0095cc140x offset 0 parent 0x50c0095cc080x shift 0 count 8 values 4 array 0x562b4043e580x list 0x50c0095cc158x 0x50c0095cc158x marks 1 ff 0
0: value 0 (0x0) [0x1x]
1: sibling (slot 0)
2: value 2 (0x2) [0x5x]
3: sibling (slot 2)
4: value 4 (0x4) [0x9x]
5: sibling (slot 4)
6: value 6 (0x6) [0xdx]
7: sibling (slot 6)
xarray: ../../../lib/test_xarray.c:1815: check_split_1: Assertion `0' failed.
Aborted