2023-03-30 15:57:13

by Ivan Orlov

Subject: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

Syzkaller reported the following issue:

kernel BUG at mm/khugepaged.c:1823!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023
RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline]
RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233
Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89
RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093
RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80
RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1
RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3
R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000
R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000
FS: 00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693
madvise_vma_behavior mm/madvise.c:1086 [inline]
madvise_walk_vmas mm/madvise.c:1260 [inline]
do_madvise+0x9e5/0x4680 mm/madvise.c:1439
__do_sys_madvise mm/madvise.c:1452 [inline]
__se_sys_madvise mm/madvise.c:1450 [inline]
__x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The 'xas_store' call during page cache scanning can potentially put
'xas' into an error state (with the reproducer provided by syzkaller,
the error code is -ENOMEM). However, there is no error check after the
'xas_store', and the next call of 'xas_next' at the start of the
scanning cycle doesn't advance xa_index while 'xas' holds an error, so
'index' and 'xas.xa_index' diverge and the VM_BUG_ON fires.
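
For reference, the scan loop in collapse_file() looks roughly like
this (a simplified sketch of the 6.2 code; the VM_BUG_ON below is the
one at mm/khugepaged.c:1823):

	xas_set(&xas, start);
	for (index = start; index < end; index++) {
		struct page *page = xas_next(&xas);

		/*
		 * When a previous xas_store() has left 'xas' in an
		 * error state, xas_next() returns NULL without
		 * advancing xa_index, so 'index' and 'xas.xa_index'
		 * diverge here.
		 */
		VM_BUG_ON(index != xas.xa_index);
		/* ... per-page handling, including xas_store() ... */
	}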

This patch adds xarray state error checking after the 'xas_store'
calls, along with a corresponding result error code. It also adds
error checks via the WARN_ON_ONCE macro in the places where ENOMEM or
other errors shouldn't be possible, to catch them if that assumption
is ever violated.

Tested via syzbot.

Reported-by: [email protected]
Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
Signed-off-by: Ivan Orlov <[email protected]>
---
V1 -> V2: Add WARN_ON_ONCE error checking and comments

mm/khugepaged.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 92e6f56a932d..8b6580b13339 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -55,6 +55,7 @@ enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_STORE_FAILED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1840,6 +1841,15 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 					goto xa_locked;
 				}
 				xas_store(&xas, hpage);
+				if (xas_error(&xas)) {
+					/* revert shmem_charge performed
+					 * in the previous condition
+					 */
+					mapping->nrpages--;
+					shmem_uncharge(mapping->host, 1);
+					result = SCAN_STORE_FAILED;
+					goto xa_locked;
+				}
 				nr_none++;
 				continue;
 			}
@@ -1992,6 +2002,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 		/* Finally, replace with the new page. */
 		xas_store(&xas, hpage);
+		/* We can't get an ENOMEM here (because the allocation happened before)
+		 * but let's check for errors (XArray implementation can be
+		 * changed in the future)
+		 */
+		WARN_ON_ONCE(xas_error(&xas));
 		continue;
 out_unlock:
 		unlock_page(page);
@@ -2029,6 +2044,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	/* Join all the small entries into a single multi-index entry */
 	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
 	xas_store(&xas, hpage);
+	/* Here we can't get an ENOMEM (because entries were
+	 * previously allocated) But let's check for errors
+	 * (XArray implementation can be changed in the future)
+	 */
+	WARN_ON_ONCE(xas_error(&xas));
 xa_locked:
 	xas_unlock_irq(&xas);
 xa_unlocked:
--
2.34.1


2023-03-31 01:45:12

by Zach O'Keefe

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On Mar 30 19:53, Ivan Orlov wrote:
> Syzkaller reported the following issue:
>
> kernel BUG at mm/khugepaged.c:1823!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023
> RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline]
> RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233
> Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89
> RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093
> RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80
> RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1
> RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3
> R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000
> R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000
> FS: 00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693
> madvise_vma_behavior mm/madvise.c:1086 [inline]
> madvise_walk_vmas mm/madvise.c:1260 [inline]
> do_madvise+0x9e5/0x4680 mm/madvise.c:1439
> __do_sys_madvise mm/madvise.c:1452 [inline]
> __se_sys_madvise mm/madvise.c:1450 [inline]
> __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>

Thanks, Ivan.

In the process of reviewing this, I started thinking about whether the !shmem
case was also susceptible to a similar race, and I *think* it might be.
Unfortunately, my time has run out, and I haven't been able to validate; I'm
less familiar with the file side of things.

The underlying problem is a race with truncation/hole-punch under OOM
conditions. The nice do-while loop near the top of collapse_file() attempts to
avoid this scenario by making sure enough slots are available. However, when
we drop the xarray lock, we open ourselves up to concurrent removal + slot
deletion. Those slots then need to be allocated again -- which can fail under
OOM conditions.
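
For reference, that reservation loop is (roughly, from the 6.2 source; quoting
from memory):

	/*
	 * Ensure we have slots for all the pages in the range.  This is
	 * almost certainly a no-op because most of the pages must be present
	 */
	do {
		xas_lock_irq(&xas);
		xas_create_range(&xas);
		if (!xas_error(&xas))
			break;
		xas_unlock_irq(&xas);
		if (!xas_nomem(&xas, GFP_KERNEL)) {
			result = SCAN_FAIL;
			goto out;
		}
	} while (1);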

The syzbot reproducer picks on shmem, but I think this can occur for file as
well. If we find a hole, we unlock the xarray and call
page_cache_sync_readahead(), which, if it succeeds, IIUC will have allocated a
new slot in our mapping pointing to the new page. We *then* lock the page.
Only after the page is locked are we protected from concurrent removal (Note:
this is what provides us protection in many of the xas_store() cases; we've
held the slot's contained page lock since verifying the slot exists,
protecting us from removal / reallocation races).
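
Concretely, the file-side hole path I'm worried about looks roughly like this
(paraphrasing the 6.2 code):

	if (!page || xa_is_value(page)) {	/* hole in the file */
		xas_unlock_irq(&xas);
		page_cache_sync_readahead(mapping, &file->f_ra, file,
					  index, end - index);
		/* readahead put a page in the cache; only now lock it */
		page = find_lock_page(mapping, index);
		if (page == NULL) {
			result = SCAN_FAIL;
			goto xa_unlocked;
		}
	}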

Maybe I'm just low on caffeine at the end of the day and am missing something,
but if I had more time, I'd be looking into the file side some more to verify.
Apologies that this hasn't occurred to me until now; I was looking at one of
your comments and double-checked why I *thought* we were safe.

Anyways, irrespective of that looming issue, some more notes follow:

> The 'xas_store' call during page cache scanning can potentially put
> 'xas' into an error state (with the reproducer provided by syzkaller,
> the error code is -ENOMEM). However, there is no error check after the
> 'xas_store', and the next call of 'xas_next' at the start of the
> scanning cycle doesn't advance xa_index while 'xas' holds an error, so
> 'index' and 'xas.xa_index' diverge and the VM_BUG_ON fires.
>
> This patch adds xarray state error checking after the 'xas_store'
> calls, along with a corresponding result error code. It also adds
> error checks via the WARN_ON_ONCE macro in the places where ENOMEM or
> other errors shouldn't be possible, to catch them if that assumption
> is ever violated.

Thanks for the additions here. I think it's worthwhile providing even more
detail about the specifics of the race we are fixing and/or guarding against,
to help people understand how that -ENOMEM comes about if the do-while loop
has "Ensured" we have slots available (additionally, I think that comment can
be augmented).

> Tested via syzbot.
>
> Reported-by: [email protected]
> Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
> Signed-off-by: Ivan Orlov <[email protected]>
> ---
> V1 -> V2: Add WARN_ON_ONCE error checking and comments
>
> mm/khugepaged.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 92e6f56a932d..8b6580b13339 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
> SCAN_CGROUP_CHARGE_FAIL,
> SCAN_TRUNCATED,
> SCAN_PAGE_HAS_PRIVATE,
> + SCAN_STORE_FAILED,
> };

I'm still reluctant to add a new error code for this, as this seems like quite
a rare race that requires OOM to trigger. I'd be happier just reusing
SCAN_FAIL, or something we might get some mileage out of later: SCAN_OOM.

Also, a reminder to update include/trace/events/huge_memory.h if you go that
route.
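
(If I remember the layout of that header correctly, that means extending the
SCAN_STATUS table, roughly:

#define SCAN_STATUS							\
	EM( SCAN_FAIL,			"failed")			\
	/* ... existing entries ... */					\
	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
	EMe(SCAN_STORE_FAILED,		"store_failed")

where the previous last entry also switches from EMe() to EM().)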

>
> #define CREATE_TRACE_POINTS
> @@ -1840,6 +1841,15 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> goto xa_locked;
> }
> xas_store(&xas, hpage);
> + if (xas_error(&xas)) {
> + /* revert shmem_charge performed
> + * in the previous condition
> + */

Nit: here and following, I think the standard convention for a multiline
comment is to have an empty first and last line, e.g.:

+ /*
+ * revert shmem_charge performed
+ * in the previous condition
+ */

Though, checkpatch.pl --strict didn't seem to care.

> + mapping->nrpages--;
> + shmem_uncharge(mapping->host, 1);
> + result = SCAN_STORE_FAILED;
> + goto xa_locked;
> + }
> nr_none++;
> continue;
> }
> @@ -1992,6 +2002,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>
> /* Finally, replace with the new page. */
> xas_store(&xas, hpage);
> + /* We can't get an ENOMEM here (because the allocation happened before)
> + * but let's check for errors (XArray implementation can be
> + * changed in the future)
> + */
> + WARN_ON_ONCE(xas_error(&xas));

Nit: it's not just that the allocation happened before -- we need some
guarantee that we've been protected from concurrent removal. This is what
made me look at the file side.

> continue;
> out_unlock:
> unlock_page(page);
> @@ -2029,6 +2044,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> /* Join all the small entries into a single multi-index entry */
> xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> xas_store(&xas, hpage);
> + /* Here we can't get an ENOMEM (because entries were
> + * previously allocated) But let's check for errors
> + * (XArray implementation can be changed in the future)
> + */
> + WARN_ON_ONCE(xas_error(&xas));

Ditto.

Apologies that I won't be around to see this change through -- I'm just out
of time, and will be shutting my computer down tomorrow for 3 months. Sorry
for the poor timing, raising issues and then disappearing. Hopefully I'm
wrong and the file side isn't a concern.

Best,
Zach

> xa_locked:
> xas_unlock_irq(&xas);
> xa_unlocked:
> --
> 2.34.1
>

2023-03-31 07:02:41

by Ivan Orlov

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On 3/31/23 05:33, Zach O'Keefe wrote:

> Thanks, Ivan.
>
> In the process of reviewing this, I started thinking about whether the !shmem
> case was also susceptible to a similar race, and I *think* it might be.
> Unfortunately, my time has run out, and I haven't been able to validate; I'm
> less familiar with the file side of things.
>
> The underlying problem is a race with truncation/hole-punch under OOM
> conditions. The nice do-while loop near the top of collapse_file() attempts to
> avoid this scenario by making sure enough slots are available. However, when
> we drop the xarray lock, we open ourselves up to concurrent removal + slot
> deletion. Those slots then need to be allocated again -- which can fail under
> OOM conditions.
>
> The syzbot reproducer picks on shmem, but I think this can occur for file as
> well. If we find a hole, we unlock the xarray and call
> page_cache_sync_readahead(), which, if it succeeds, IIUC will have allocated a
> new slot in our mapping pointing to the new page. We *then* lock the page.
> Only after the page is locked are we protected from concurrent removal (Note:
> this is what provides us protection in many of the xas_store() cases; we've
> held the slot's contained page lock since verifying the slot exists,
> protecting us from removal / reallocation races).
>
> Maybe I'm just low on caffeine at the end of the day and am missing something,
> but if I had more time, I'd be looking into the file side some more to verify.
> Apologies that this hasn't occurred to me until now; I was looking at one of
> your comments and double-checked why I *thought* we were safe.
>
> Anyways, irrespective of that looming issue, some more notes follow:
>
>> The 'xas_store' call during page cache scanning can potentially put
>> 'xas' into an error state (with the reproducer provided by syzkaller,
>> the error code is -ENOMEM). However, there is no error check after the
>> 'xas_store', and the next call of 'xas_next' at the start of the
>> scanning cycle doesn't advance xa_index while 'xas' holds an error, so
>> 'index' and 'xas.xa_index' diverge and the VM_BUG_ON fires.
>>
>> This patch adds xarray state error checking after the 'xas_store'
>> calls, along with a corresponding result error code. It also adds
>> error checks via the WARN_ON_ONCE macro in the places where ENOMEM or
>> other errors shouldn't be possible, to catch them if that assumption
>> is ever violated.
>
> Thanks for the additions here. I think it's worthwhile providing even more
> detail about the specifics of the race we are fixing and/or guarding against,
> to help people understand how that -ENOMEM comes about if the do-while loop
> has "Ensured" we have slots available (additionally, I think that comment can
> be augmented).
>
>> Tested via syzbot.
>>
>> Reported-by: [email protected]
>> Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
>> Signed-off-by: Ivan Orlov <[email protected]>
>> ---
>> V1 -> V2: Add WARN_ON_ONCE error checking and comments
>>
>> mm/khugepaged.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 92e6f56a932d..8b6580b13339 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -55,6 +55,7 @@ enum scan_result {
>> SCAN_CGROUP_CHARGE_FAIL,
>> SCAN_TRUNCATED,
>> SCAN_PAGE_HAS_PRIVATE,
>> + SCAN_STORE_FAILED,
>> };
>
> I'm still reluctant to add a new error code for this, as this seems like quite
> a rare race that requires OOM to trigger. I'd be happier just reusing
> SCAN_FAIL, or something we might get some mileage out of later: SCAN_OOM.
>
> Also, a reminder to update include/trace/events/huge_memory.h if you go that
> route.
>
>>
>> #define CREATE_TRACE_POINTS
>> @@ -1840,6 +1841,15 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>> goto xa_locked;
>> }
>> xas_store(&xas, hpage);
>> + if (xas_error(&xas)) {
>> + /* revert shmem_charge performed
>> + * in the previous condition
>> + */
>
> Nit: here and following, I think the standard convention for a multiline
> comment is to have an empty first and last line, e.g.:
>
> + /*
> + * revert shmem_charge performed
> + * in the previous condition
> + */
>
> Though, checkpatch.pl --strict didn't seem to care.
>
>> + mapping->nrpages--;
>> + shmem_uncharge(mapping->host, 1);
>> + result = SCAN_STORE_FAILED;
>> + goto xa_locked;
>> + }
>> nr_none++;
>> continue;
>> }
>> @@ -1992,6 +2002,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>>
>> /* Finally, replace with the new page. */
>> xas_store(&xas, hpage);
>> + /* We can't get an ENOMEM here (because the allocation happened before)
>> + * but let's check for errors (XArray implementation can be
>> + * changed in the future)
>> + */
>> + WARN_ON_ONCE(xas_error(&xas));
>
> Nit: it's not just that the allocation happened before -- we need some
> guarantee that we've been protected from concurrent removal. This is what
> made me look at the file side.
>
>> continue;
>> out_unlock:
>> unlock_page(page);
>> @@ -2029,6 +2044,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>> /* Join all the small entries into a single multi-index entry */
>> xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>> xas_store(&xas, hpage);
>> + /* Here we can't get an ENOMEM (because entries were
>> + * previously allocated) But let's check for errors
>> + * (XArray implementation can be changed in the future)
>> + */
>> + WARN_ON_ONCE(xas_error(&xas));
>
> Ditto.
>
> Apologies that I won't be around to see this change through -- I'm just out
> of time, and will be shutting my computer down tomorrow for 3 months. Sorry
> for the poor timing, raising issues and then disappearing. Hopefully I'm
> wrong and the file side isn't a concern.
>
> Best,
> Zach
>
>> xa_locked:
>> xas_unlock_irq(&xas);
>> xa_unlocked:
>> --
>> 2.34.1
>>

Hello, Zach! Thank you very much for the detailed review! I thought that
locking was our guarantee against getting an -ENOMEM, but I didn't have a
deep understanding of the problem's cause, since I'm just starting my
memory management journey :) In any case, I will research this problem
further, and hopefully when you get back from vacation you will see a new
patch that fully fixes it. Have a nice vacation! Thanks again!

2023-04-04 01:04:45

by Yang Shi

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On Thu, Mar 30, 2023 at 6:33 PM Zach O'Keefe <[email protected]> wrote:
>
> On Mar 30 19:53, Ivan Orlov wrote:
> > Syzkaller reported the following issue:
> >
> > kernel BUG at mm/khugepaged.c:1823!
> > invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023
> > RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline]
> > RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233
> > Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89
> > RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093
> > RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80
> > RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1
> > RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3
> > R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000
> > R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000
> > FS: 00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693
> > madvise_vma_behavior mm/madvise.c:1086 [inline]
> > madvise_walk_vmas mm/madvise.c:1260 [inline]
> > do_madvise+0x9e5/0x4680 mm/madvise.c:1439
> > __do_sys_madvise mm/madvise.c:1452 [inline]
> > __se_sys_madvise mm/madvise.c:1450 [inline]
> > __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
>
> Thanks, Ivan.
>
> In the process of reviewing this, I started thinking about whether the !shmem
> case was also susceptible to a similar race, and I *think* it might be.
> Unfortunately, my time has run out, and I haven't been able to validate; I'm
> less familiar with the file side of things.
>
> The underlying problem is a race with truncation/hole-punch under OOM
> conditions. The nice do-while loop near the top of collapse_file() attempts to
> avoid this scenario by making sure enough slots are available. However, when
> we drop the xarray lock, we open ourselves up to concurrent removal + slot
> deletion. Those slots then need to be allocated again -- which can fail under
> OOM conditions.
>
> The syzbot reproducer picks on shmem, but I think this can occur for file as
> well. If we find a hole, we unlock the xarray and call
> page_cache_sync_readahead(), which, if it succeeds, IIUC will have allocated a
> new slot in our mapping pointing to the new page. We *then* lock the page.
> Only after the page is locked are we protected from concurrent removal (Note:
> this is what provides us protection in many of the xas_store() cases; we've
> held the slot's contained page lock since verifying the slot exists,
> protecting us from removal / reallocation races).

IIUC, find_lock_page() should be able to handle the race. The page is
checked for truncation after it has been locked.
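
collapse_file() re-checks the page after locking it, something like this
(paraphrasing the 6.2 code):

	/* the page is locked now, so truncation can be detected */
	if (page_mapping(page) != mapping) {
		result = SCAN_TRUNCATED;
		goto out_unlock;
	}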

>
> Maybe I'm just low on caffeine at the end of the day and am missing something,
> but if I had more time, I'd be looking into the file side some more to verify.
> Apologies that this hasn't occurred to me until now; I was looking at one of
> your comments and double-checked why I *thought* we were safe.
>
> Anyways, irrespective of that looming issue, some more notes follow:
>
> > The 'xas_store' call during page cache scanning can potentially put
> > 'xas' into an error state (with the reproducer provided by syzkaller,
> > the error code is -ENOMEM). However, there is no error check after the
> > 'xas_store', and the next call of 'xas_next' at the start of the
> > scanning cycle doesn't advance xa_index while 'xas' holds an error, so
> > 'index' and 'xas.xa_index' diverge and the VM_BUG_ON fires.
> >
> > This patch adds xarray state error checking after the 'xas_store'
> > calls, along with a corresponding result error code. It also adds
> > error checks via the WARN_ON_ONCE macro in the places where ENOMEM or
> > other errors shouldn't be possible, to catch them if that assumption
> > is ever violated.
>
> Thanks for the additions here. I think it's worthwhile providing even more
> detail about the specifics of the race we are fixing and/or guarding against,
> to help people understand how that -ENOMEM comes about if the do-while loop
> has "Ensured" we have slots available (additionally, I think that comment can
> be augmented).
>
> > Tested via syzbot.
> >
> > Reported-by: [email protected]
> > Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
> > Signed-off-by: Ivan Orlov <[email protected]>
> > ---
> > V1 -> V2: Add WARN_ON_ONCE error checking and comments
> >
> > mm/khugepaged.c | 20 ++++++++++++++++++++
> > 1 file changed, 20 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 92e6f56a932d..8b6580b13339 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -55,6 +55,7 @@ enum scan_result {
> > SCAN_CGROUP_CHARGE_FAIL,
> > SCAN_TRUNCATED,
> > SCAN_PAGE_HAS_PRIVATE,
> > + SCAN_STORE_FAILED,
> > };
>
> I'm still reluctant to add a new error code for this, as this seems like quite
> a rare race that requires OOM to trigger. I'd be happier just reusing
> SCAN_FAIL, or something we might get some mileage out of later: SCAN_OOM.
>
> Also, a reminder to update include/trace/events/huge_memory.h if you go that
> route.
>
> >
> > #define CREATE_TRACE_POINTS
> > @@ -1840,6 +1841,15 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > goto xa_locked;
> > }
> > xas_store(&xas, hpage);
> > + if (xas_error(&xas)) {
> > + /* revert shmem_charge performed
> > + * in the previous condition
> > + */
>
> Nit: here and following, I think the standard convention for a multiline
> comment is to have an empty first and last line, e.g.:
>
> + /*
> + * revert shmem_charge performed
> + * in the previous condition
> + */
>
> Though, checkpatch.pl --strict didn't seem to care.
>
> > + mapping->nrpages--;
> > + shmem_uncharge(mapping->host, 1);
> > + result = SCAN_STORE_FAILED;
> > + goto xa_locked;
> > + }
> > nr_none++;
> > continue;
> > }
> > @@ -1992,6 +2002,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >
> > /* Finally, replace with the new page. */
> > xas_store(&xas, hpage);
> > + /* We can't get an ENOMEM here (because the allocation happened before)
> > + * but let's check for errors (XArray implementation can be
> > + * changed in the future)
> > + */
> > + WARN_ON_ONCE(xas_error(&xas));
>
> Nit: it's not just that the allocation happened before -- we need some
> guarantee that we've been protected from concurrent removal. This is what
> made me look at the file side.
>
> > continue;
> > out_unlock:
> > unlock_page(page);
> > @@ -2029,6 +2044,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > /* Join all the small entries into a single multi-index entry */
> > xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > xas_store(&xas, hpage);
> > + /* Here we can't get an ENOMEM (because entries were
> > + * previously allocated) But let's check for errors
> > + * (XArray implementation can be changed in the future)
> > + */
> > + WARN_ON_ONCE(xas_error(&xas));
>
> Ditto.
>
> Apologies that I won't be around to see this change through -- I'm just out
> of time, and will be shutting my computer down tomorrow for 3 months. Sorry
> for the poor timing, raising issues and then disappearing. Hopefully I'm
> wrong and the file side isn't a concern.
>
> Best,
> Zach
>
> > xa_locked:
> > xas_unlock_irq(&xas);
> > xa_unlocked:
> > --
> > 2.34.1
> >

2023-04-16 18:47:06

by Andrew Morton

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file


Circling back to this fix...

The BUG() is obviously real. We're unsure that Ivan's fix is the best
one. We haven't identified a Fixes:, and as this report is against the 6.2
kernel, a cc:stable will be needed.

According to the syzbot bisection
(https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc),
this is present in linux-5.19, so it might predate Zach's
58ac9a8993a13ebc changes. But that bisection claim might be
misleading.

And Zach is offline for a few months. So can people please take a look
and see if we can get this wrapped up?

Matthew, the assertion failure is in the

VM_BUG_ON(index != xas.xa_index);

which was added in 77da9389b9d5f, so perhaps you could take a look?

Thanks.

2023-04-16 20:49:27

by Ivan Orlov

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On 4/16/23 22:33, Andrew Morton wrote:
>
> Circling back to this fix...
>
> The BUG() is obviously real. We're unsure that Ivan's fix is the best
> one. We haven't identified a Fixes:, and as this report is against the 6.2
> kernel, a cc:stable will be needed.
>
> According to the syzbot bisection
> (https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc),
> this is present in linux-5.19, so it might predate Zach's
> 58ac9a8993a13ebc changes. But that bisection claim might be
> misleading.
>
> And Zach is offline for a few months. So can people please take a look
> and see if we can get this wrapped up?
>
> Matthew, the assertion failure is in the
>
> VM_BUG_ON(index != xas.xa_index);
>
> which was added in 77da9389b9d5f, so perhaps you could take a look?
>
> Thanks.

Hello, Andrew!

I have been unable to reproduce the problem with any of the existing
reproducers on the 3d7cb6b04c3f3115719235cc6866b10326de34cd commit,
which was detected by the syzkaller bisection. I also tried to test if
the problem is reproducible on this commit via syzbot, but it did not
detect the problem. It's possible that the bisection claim is
misleading, as the issue may not be consistently reproducible.

Why did you mention the 58ac9a8993a13ebc commit? I thought 99cb0dbd47a15
was considered the "Fixes:" target. 99cb0dbd47a15 is older than
3d7cb6b04c3f3, and the problematic code appeared there, so the problem
could probably appear in 99cb0dbd47a15 as well.

Please, correct me if I'm missing something.

Kind regards,
Ivan Orlov.

2023-04-16 23:36:27

by Ivan Orlov

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On 4/16/23 22:33, Andrew Morton wrote:
>
> Circling back to this fix...
>
> The BUG() is obviously real. We're unsure that Ivan's fix is the best
> one. We haven't identified a Fixes:, and as this report is against the 6.2
> kernel, a cc:stable will be needed.
>
> According to the syzbot bisection
> (https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc),
> this is present in linux-5.19, so it might predate Zach's
> 58ac9a8993a13ebc changes. But that bisection claim might be
> misleading.
>
> And Zach is offline for a few months. So can people please take a look
> and see if we can get this wrapped up?
>
> Matthew, the assertion failure is in the
>
> VM_BUG_ON(index != xas.xa_index);
>
> which was added in 77da9389b9d5f, so perhaps you could take a look?
>
> Thanks.

I tested the reproducers on the 99cb0dbd47a15 commit, and they do not
trigger the problematic condition of shared memory truncation or
hole-punching. I will investigate further, as there have been many
changes in khugepaged since the 99cb0dbd47a15 commit that could
potentially affect its behavior.

2023-04-17 01:12:18

by Hugh Dickins

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On Sun, 16 Apr 2023, Andrew Morton wrote:

>
> Circling back to this fix...
>
> The BUG() is obviously real.

It's worth remembering that syzbot's reproducer involves artificially
injecting page allocation failures. So although the bug may be "real",
it is rather in the theoretical category, way down on my own list to
look at, and I'd say not at all urgent to fix.

Hugh

> We're unsure that Ivan's fix is the best
> one. We haven't identified a Fixes:, and as this report is against the 6.2
> kernel, a cc:stable will be needed.
>
> According to the syzbot bisection
> (https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc),
> this is present in linux-5.19, so it might predate Zach's
> 58ac9a8993a13ebc changes. But that bisection claim might be
> misleading.
>
> And Zach is offline for a few months. So can people please take a look
> and see if we can get this wrapped up?
>
> Matthew, the assertion failure is in the
>
> VM_BUG_ON(index != xas.xa_index);
>
> which was added in 77da9389b9d5f, so perhaps you could take a look?
>
> Thanks.

2023-04-17 18:30:58

by Ivan Orlov

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On 16.04.2023 22:33, Andrew Morton wrote:
>
> Circling back to this fix...
>
> The BUG() is obviously real. We're unsure that Ivan's fix is the best
> one. We haven't identified a Fixes:, and as this report is against the 6.2
> kernel, a cc:stable will be needed.
>
> According to the syzbot bisection
> (https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc),
> this is present in linux-5.19, so it might predate Zach's
> 58ac9a8993a13ebc changes. But that bisection claim might be
> misleading.
>
> And Zach is offline for a few months. So can people please take a look
> and see if we can get this wrapped up?
>
> Matthew, the assertion failure is in the
>
> VM_BUG_ON(index != xas.xa_index);
>
> which was added in 77da9389b9d5f, so perhaps you could take a look?
>
> Thanks.

Tested today on the latest rc7, the bug still exists.

Kind regards,
Ivan Orlov.

2023-04-18 21:34:03

by Andrew Morton

Subject: Re: [PATCH v2] mm: khugepaged: Fix kernel BUG in hpage_collapse_scan_file

On Mon, 17 Apr 2023 22:28:51 +0400 Ivan Orlov <[email protected]> wrote:

> Tested today on the latest rc7, the bug still exists.

OK, thanks. I'll send this patch upstream during the next merge window.
If you, Zach, or someone else decides to remove SCAN_STORE_FAILED and
perhaps those WARN_ON_ONCE()s, we can address that later.