2023-11-15 07:02:05

by José Pekkarinen

[permalink] [raw]
Subject: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

Documentation of __pte_offset_map_lock suggest there is situations where
a pmd may not have a corresponding page table, in which case it should
return NULL without changing ptlp. Syzbot found its ways to produce a
NULL dereference in the function showing this case. This patch will
provide the exit path suggested if this unlikely situation turns up. The
output of the kasan null-ptr-report follows:

Unable to handle kernel paging request at virtual address dfff800000000004
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[dfff800000000004] address between user and kernel address ranges
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 7952 Comm: syz-executor682 Not tainted 6.6.0-rc6-syzkaller-g78124b0c1d10 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __lock_acquire+0x104/0x75e8 kernel/locking/lockdep.c:5004
lr : lock_acquire+0x23c/0x71c kernel/locking/lockdep.c:5753
sp : ffff800098f26d40
x29: ffff800098f27000 x28: ffff8000808df4bc x27: ffff7000131e4e18
x26: 1ffff00011c340b9 x25: 0000000000000000 x24: 0000000000000000
x23: ffff7000131e4dd0 x22: 0000000000000000 x21: 0000000000000000
x20: 0000000000000000 x19: 0000000000000022 x18: ffff800098f27750
x17: 0000ffff833dafff x16: ffff80008a632120 x15: 0000000000000001
x14: ffff80008e1a05d0 x13: ffff800098f26e80 x12: dfff800000000000
x11: ffff800080319468 x10: ffff80008e1a05cc x9 : 00000000000000f3
x8 : 0000000000000004 x7 : ffff8000808df4bc x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000001 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000022
Call trace:
__lock_acquire+0x104/0x75e8 kernel/locking/lockdep.c:5004
lock_acquire+0x23c/0x71c kernel/locking/lockdep.c:5753
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x48/0x60 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:351 [inline]
__pte_offset_map_lock+0x154/0x360 mm/pgtable-generic.c:373
pte_offset_map_lock include/linux/mm.h:2939 [inline]
filemap_map_pages+0x698/0x11f0 mm/filemap.c:3582
do_fault_around mm/memory.c:4525 [inline]
do_read_fault mm/memory.c:4558 [inline]
do_fault mm/memory.c:4705 [inline]
do_pte_missing mm/memory.c:3669 [inline]
handle_pte_fault mm/memory.c:4978 [inline]
__handle_mm_fault mm/memory.c:5119 [inline]
handle_mm_fault+0x326c/0x49fc mm/memory.c:5284
faultin_page mm/gup.c:956 [inline]
__get_user_pages+0x3e0/0xa24 mm/gup.c:1239
populate_vma_page_range+0x254/0x328 mm/gup.c:1666
__mm_populate+0x240/0x3d8 mm/gup.c:1775
mm_populate include/linux/mm.h:3305 [inline]
vm_mmap_pgoff+0x2bc/0x3d4 mm/util.c:551
ksys_mmap_pgoff+0xd0/0x5b0 mm/mmap.c:1400
__do_sys_mmap arch/arm64/kernel/sys.c:28 [inline]
__se_sys_mmap arch/arm64/kernel/sys.c:21 [inline]
__arm64_sys_mmap+0xf8/0x110 arch/arm64/kernel/sys.c:21
__invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
Code: b006f948 b943a108 34000208 d343fe68 (386c6908)
---[ end trace 0000000000000000 ]---
----------------
Code disassembly (best guess):
0: b006f948 adrp x8, 0xdf29000
4: b943a108 ldr w8, [x8, #928]
8: 34000208 cbz w8, 0x48
c: d343fe68 lsr x8, x19, #3
* 10: 386c6908 ldrb w8, [x8, x12] <-- trapping instruction

Reported-by: [email protected]
Fixes: 0d940a9b270b9 ("mm/pgtable: allow pte_offset_map[_lock]() to fail")
Closes: https://syzkaller.appspot.com/bug?id=7494636a5865150aecc6e480e0e7e17f2980ad8d
Signed-off-by: José Pekkarinen <[email protected]>
---
include/linux/mm.h | 4 ++--
mm/pgtable-generic.c | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..38d876529e1f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2854,7 +2854,7 @@ void ptlock_free(struct ptdesc *ptdesc);

static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
{
- return ptdesc->ptl;
+ return (likely(ptdesc)) ? ptdesc->ptl : NULL;
}
#else /* ALLOC_SPLIT_PTLOCKS */
static inline void ptlock_cache_init(void)
@@ -2872,7 +2872,7 @@ static inline void ptlock_free(struct ptdesc *ptdesc)

static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
{
- return &ptdesc->ptl;
+ return (likely(ptdesc)) ? &ptdesc->ptl : NULL;
}
#endif /* ALLOC_SPLIT_PTLOCKS */

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4fcd959dcc4d..7796339d7ef2 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -370,6 +370,8 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
if (unlikely(!pte))
return pte;
ptl = pte_lockptr(mm, &pmdval);
+ if (unlikely(!ptl))
+ return NULL;
spin_lock(ptl);
if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
*ptlp = ptl;
--
2.39.2


2023-11-15 14:20:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On Wed, Nov 15, 2023 at 08:55:05AM +0200, Jos? Pekkarinen wrote:
> Documentation of __pte_offset_map_lock suggest there is situations where

You should have cc'd Hugh who changed all this code recently.

> a pmd may not have a corresponding page table, in which case it should
> return NULL without changing ptlp. Syzbot found its ways to produce a
> NULL dereference in the function showing this case. This patch will
> provide the exit path suggested if this unlikely situation turns up. The
> output of the kasan null-ptr-report follows:

There's no need to include all this nonsense in the changelog.

> spin_lock include/linux/spinlock.h:351 [inline]
> __pte_offset_map_lock+0x154/0x360 mm/pgtable-generic.c:373
> pte_offset_map_lock include/linux/mm.h:2939 [inline]
> filemap_map_pages+0x698/0x11f0 mm/filemap.c:3582

This was the only interesting part.

> +++ b/include/linux/mm.h
> @@ -2854,7 +2854,7 @@ void ptlock_free(struct ptdesc *ptdesc);
>
> static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
> {
> - return ptdesc->ptl;
> + return (likely(ptdesc)) ? ptdesc->ptl : NULL;
> }

I don't think we should be changing ptlock_ptr().

> +++ b/mm/pgtable-generic.c
> @@ -370,6 +370,8 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> if (unlikely(!pte))
> return pte;
> ptl = pte_lockptr(mm, &pmdval);
> + if (unlikely(!ptl))
> + return NULL;
> spin_lock(ptl);

I don't understand how this could possibly solve the problem. If there's
no PTE level, then __pte_offset_map() should return NULL and we'd already
return due to the check for !pte.

2023-11-15 16:15:58

by José Pekkarinen

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On 2023-11-15 16:19, Matthew Wilcox wrote:
> On Wed, Nov 15, 2023 at 08:55:05AM +0200, José Pekkarinen wrote:
>> Documentation of __pte_offset_map_lock suggest there is situations
>> where
>
> You should have cc'd Hugh who changed all this code recently.

Hi,

Sorry, he seems to be missing if I run get_maintainer.pl:

$ ./scripts/get_maintainer.pl include/linux/mm.h
Andrew Morton <[email protected]> (maintainer:MEMORY MANAGEMENT)
[email protected] (open list:MEMORY MANAGEMENT)
[email protected] (open list)

>> a pmd may not have a corresponding page table, in which case it should
>> return NULL without changing ptlp. Syzbot found its ways to produce a
>> NULL dereference in the function showing this case. This patch will
>> provide the exit path suggested if this unlikely situation turns up.
>> The
>> output of the kasan null-ptr-report follows:
>
> There's no need to include all this nonsense in the changelog.

No problem, we can clean the patch if we find there is something
worth upstreaming.

>> spin_lock include/linux/spinlock.h:351 [inline]
>> __pte_offset_map_lock+0x154/0x360 mm/pgtable-generic.c:373
>> pte_offset_map_lock include/linux/mm.h:2939 [inline]
>> filemap_map_pages+0x698/0x11f0 mm/filemap.c:3582
>
> This was the only interesting part.
>
>> +++ b/include/linux/mm.h
>> @@ -2854,7 +2854,7 @@ void ptlock_free(struct ptdesc *ptdesc);
>>
>> static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
>> {
>> - return ptdesc->ptl;
>> + return (likely(ptdesc)) ? ptdesc->ptl : NULL;
>> }
>
> I don't think we should be changing ptlock_ptr().

This is where the null ptr dereference originates, so the only
alternative I can think of is to protect the life cycle of the ptdesc
to prevent it to die between the pte check and the spin_unlock of
__pte_offset_map_lock. Would that work for you?

>> +++ b/mm/pgtable-generic.c
>> @@ -370,6 +370,8 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm,
>> pmd_t *pmd,
>> if (unlikely(!pte))
>> return pte;
>> ptl = pte_lockptr(mm, &pmdval);
>> + if (unlikely(!ptl))
>> + return NULL;
>> spin_lock(ptl);
>
> I don't understand how this could possibly solve the problem. If
> there's
> no PTE level, then __pte_offset_map() should return NULL and we'd
> already
> return due to the check for !pte.

I tested the syzbot reproducer in x86 and it doesn't produce this
kasan
report anymore.

José.

2023-11-15 19:04:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On Wed, Nov 15, 2023 at 06:05:30PM +0200, Jos? Pekkarinen wrote:
> On 2023-11-15 16:19, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 08:55:05AM +0200, Jos? Pekkarinen wrote:
> > > Documentation of __pte_offset_map_lock suggest there is situations
> > > where
> >
> > You should have cc'd Hugh who changed all this code recently.
>
> Hi,
>
> Sorry, he seems to be missing if I run get_maintainer.pl:
>
> $ ./scripts/get_maintainer.pl include/linux/mm.h
> Andrew Morton <[email protected]> (maintainer:MEMORY MANAGEMENT)
> [email protected] (open list:MEMORY MANAGEMENT)
> [email protected] (open list)

That's a good example of why get_maintainer.pl is not great. It's
just a stupid perl program. Ideally, you should research what changes
have been made to that code recently and see who else might be
implicated. Who introduced the exact code that you're fixing?

In this specific instance, you can see Hugh already responded to it:

https://lore.kernel.org/all/[email protected]/T/

Now, part of Hugh's response turns out to be incorrect; syzbot can
reproduce this on a current mainline kernel. But, for some reason,
syzbot has not done a bisect to track it down to a particular commit.
I don't understand why it hasn't; maybe someone who knows syzbot better
can explain why.

> > > +++ b/include/linux/mm.h
> > > @@ -2854,7 +2854,7 @@ void ptlock_free(struct ptdesc *ptdesc);
> > >
> > > static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
> > > {
> > > - return ptdesc->ptl;
> > > + return (likely(ptdesc)) ? ptdesc->ptl : NULL;
> > > }
> >
> > I don't think we should be changing ptlock_ptr().
>
> This is where the null ptr dereference originates, so the only
> alternative I can think of is to protect the life cycle of the ptdesc
> to prevent it to die between the pte check and the spin_unlock of
> __pte_offset_map_lock. Would that work for you?

Ah! I think I found the problem.

If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
(eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
in __pte_offset_map_lock().

So, how to solve this? We can't just check the ptl against NULL; the
memory that ptl points to may have been freed. We could grab a reference
to the pmd_page, possibly conditionally on ALLOC_SPLIT_LOCK being set,
but that will slow down everything. We could make page_ptl_cachep
SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
the lock might not be associated with this page any more).

Other ideas?

2023-11-16 07:24:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> On Wed, Nov 15, 2023 at 06:05:30PM +0200, Jos? Pekkarinen wrote:
>
> > > I don't think we should be changing ptlock_ptr().
> >
> > This is where the null ptr dereference originates, so the only
> > alternative I can think of is to protect the life cycle of the ptdesc
> > to prevent it to die between the pte check and the spin_unlock of
> > __pte_offset_map_lock. Would that work for you?

Thanks for pursuing this, Jos?, but I agree with Matthew: I don't
think your patch is right at all. The change in ptlock_ptr() did not
make sense to me, and the change in __pte_offset_map_lock() leaves us
still wondering what has gone wrong (and misses an rcu_read_unlock()).

You mentioned "I tested the syzbot reproducer in x86 and it doesn't
produce this kasan report anymore": are you saying that you were able
to reproduce the issue on x86 (without your patch)? That would be very
interesting (and I think would disprove my hypothesis below). I ought
to try on x86 if you managed to reproduce on it, but it's not worth
the effort if you did not. If you have an x86 stack and registers,
please show (though I'm uncertain how much KASAN spoils the output).

>
> Ah! I think I found the problem.
>
> If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
> is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
> (eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
> ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
> in __pte_offset_map_lock().
>
> So, how to solve this? We can't just check the ptl against NULL; the
> memory that ptl points to may have been freed. We could grab a reference
> to the pmd_page, possibly conditionally on ALLOC_SPLIT_LOCK being set,
> but that will slow down everything. We could make page_ptl_cachep
> SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
> the lock might not be associated with this page any more).

But I don't think that's quite right either: or, I hope it's not right,
because it would say that all my business of rcu_read_lock() and
pte_free_defer() etc was a failing waste of everyone's time: if the
page table (and/or its lock) can be freed while someone is in that RCU-
protected area, then I've simply got it wrong.

What I thought, when yesterday's flurry of syzbot messages came in,
was perhaps the problem is at the other end - when the new page table
is allocated, rather than when it's being freed: a barrier missing there?

But pmd_install() does have an smp_wmb(), with a (suspiciously) long
comment about why no smp_rmb()s are needed, except on alpha.

I'm not much good on barriers, but the thought now in my mind is that
that comment is not quite accurate: that at the bottom level an smp_rmb()
is (very seldom!) needed - not just to avoid a NULL (or previous garbage)
ptl pointer in the ALLOC_SPLIT_LOCK case, but to make sure that the ptl
is visibly correctly initialized (or would spin_lock on it achieve that?);
and that what pte_offset_map() points to is visibly empty. (I'm imagining
stale cache lines left behind on the oopsing CPU, which need to be
refetched after pmdval has been read.)

And that this issue has, strictly speaking, always been there, but in
practice never a problem, because of mmap_lock held for write while (or
prior to) freeing page table, and racers holding mmap_lock for read
(vma lock not changing the calculus); but collapse_pte_mapped_thp()
can now be done with mmap_lock for read, letting those racers in
(and maybe your filemap_map_pages() has helped speed these races up -
any blame I can shift on to you, the better ;)

But I may well be spouting nonsense. And even if I'm right, is adding
an smp_rmb() in there too high a price to pay on some architectures?

I did make an attempt to reproduce on an arm64, but there's too much
more I'd need to muck around with to get it working right. Let's ask for
a syz test on top of v6.7-rc1 - hmm, how do I tell syzbot that it's arm64
that it needs to try on? I hope it gets that from the Cc'ed syz number.

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b85ea95d086471afb4ad062012a4d73cd328fa86

[PATCH] mm/pgtable: smp_rmb() to match smp_wmb() in pmd_install()

Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memory.c | 2 ++
mm/pgtable-generic.c | 5 +++++
2 files changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..8939357f1509 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -425,6 +425,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
* being the notable exception) will already guarantee loads are
* seen in-order. See the alpha page table accessors for the
* smp_rmb() barriers in page table walking code.
+ *
+ * See __pte_offset_map() for the smp_rmb() at the pte level.
*/
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
pmd_populate(mm, pmd, *pte);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4fcd959dcc4d..3330b666e9c3 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -297,6 +297,11 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
pmd_clear_bad(pmd);
goto nomap;
}
+ /*
+ * Pair with the smp_wmb() in pmd_install(): make sure that the
+ * page table lock and page table contents are visibly initialized.
+ */
+ smp_rmb();
return __pte_map(&pmdval, addr);
nomap:
rcu_read_unlock();
--
2.35.3

2023-11-16 08:27:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On Wed, 15 Nov 2023, Hugh Dickins wrote:
> On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 06:05:30PM +0200, Jos? Pekkarinen wrote:
> >
> > > > I don't think we should be changing ptlock_ptr().
> > >
> > > This is where the null ptr dereference originates, so the only
> > > alternative I can think of is to protect the life cycle of the ptdesc
> > > to prevent it to die between the pte check and the spin_unlock of
> > > __pte_offset_map_lock. Would that work for you?
>
> Thanks for pursuing this, Jos?, but I agree with Matthew: I don't
> think your patch is right at all. The change in ptlock_ptr() did not
> make sense to me, and the change in __pte_offset_map_lock() leaves us
> still wondering what has gone wrong (and misses an rcu_read_unlock()).
>
> You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> produce this kasan report anymore": are you saying that you were able
> to reproduce the issue on x86 (without your patch)? That would be very
> interesting (and I think would disprove my hypothesis below). I ought
> to try on x86 if you managed to reproduce on it, but it's not worth
> the effort if you did not. If you have an x86 stack and registers,
> please show (though I'm uncertain how much KASAN spoils the output).
>
> >
> > Ah! I think I found the problem.
> >
> > If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
> > is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
> > (eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
> > ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
> > in __pte_offset_map_lock().
> >
> > So, how to solve this? We can't just check the ptl against NULL; the
> > memory that ptl points to may have been freed. We could grab a reference
> > to the pmd_page, possibly conditionally on ALLOC_SPLIT_LOCK being set,
> > but that will slow down everything. We could make page_ptl_cachep
> > SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
> > the lock might not be associated with this page any more).
>
> But I don't think that's quite right either: or, I hope it's not right,
> because it would say that all my business of rcu_read_lock() and
> pte_free_defer() etc was a failing waste of everyone's time: if the
> page table (and/or its lock) can be freed while someone is in that RCU-
> protected area, then I've simply got it wrong.
>
> What I thought, when yesterday's flurry of syzbot messages came in,
> was perhaps the problem is at the other end - when the new page table
> is allocated, rather than when it's being freed: a barrier missing there?
>
> But pmd_install() does have an smp_wmb(), with a (suspiciously) long
> comment about why no smp_rmb()s are needed, except on alpha.
>
> I'm not much good on barriers, but the thought now in my mind is that
> that comment is not quite accurate: that at the bottom level an smp_rmb()
> is (very seldom!) needed - not just to avoid a NULL (or previous garbage)
> ptl pointer in the ALLOC_SPLIT_LOCK case, but to make sure that the ptl
> is visibly correctly initialized (or would spin_lock on it achieve that?);
> and that what pte_offset_map() points to is visibly empty. (I'm imagining
> stale cache lines left behind on the oopsing CPU, which need to be
> refetched after pmdval has been read.)
>
> And that this issue has, strictly speaking, always been there, but in
> practice never a problem, because of mmap_lock held for write while (or
> prior to) freeing page table, and racers holding mmap_lock for read
> (vma lock not changing the calculus); but collapse_pte_mapped_thp()
> can now be done with mmap_lock for read, letting those racers in
> (and maybe your filemap_map_pages() has helped speed these races up -
> any blame I can shift on to you, the better ;)
>
> But I may well be spouting nonsense. And even if I'm right, is adding
> an smp_rmb() in there too high a price to pay on some architectures?
>
> I did make an attempt to reproduce on an arm64, but there's too much
> more I'd need to muck around with to get it working right. Let's ask for
> a syz test on top of v6.7-rc1 - hmm, how do I tell syzbot that it's arm64
> that it needs to try on? I hope it gets that from the Cc'ed syz number.

Syzbot couldn't do it from this mail, but did it from the original thread:
https://lore.kernel.org/linux-mm/[email protected]/
tells us that I was spouting nonsense: this patch does not fix it.
I have no further idea yet.

>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b85ea95d086471afb4ad062012a4d73cd328fa86
>
> [PATCH] mm/pgtable: smp_rmb() to match smp_wmb() in pmd_install()
>
> Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/memory.c | 2 ++
> mm/pgtable-generic.c | 5 +++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..8939357f1509 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -425,6 +425,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
> * being the notable exception) will already guarantee loads are
> * seen in-order. See the alpha page table accessors for the
> * smp_rmb() barriers in page table walking code.
> + *
> + * See __pte_offset_map() for the smp_rmb() at the pte level.
> */
> smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> pmd_populate(mm, pmd, *pte);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4fcd959dcc4d..3330b666e9c3 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -297,6 +297,11 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> pmd_clear_bad(pmd);
> goto nomap;
> }
> + /*
> + * Pair with the smp_wmb() in pmd_install(): make sure that the
> + * page table lock and page table contents are visibly initialized.
> + */
> + smp_rmb();
> return __pte_map(&pmdval, addr);
> nomap:
> rcu_read_unlock();
> --
> 2.35.3

2023-11-16 10:21:00

by José Pekkarinen

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On 2023-11-16 07:23, Hugh Dickins wrote:
> On Wed, 15 Nov 2023, Matthew Wilcox wrote:
>> On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
>>
>> > > I don't think we should be changing ptlock_ptr().
>> >
>> > This is where the null ptr dereference originates, so the only
>> > alternative I can think of is to protect the life cycle of the ptdesc
>> > to prevent it to die between the pte check and the spin_unlock of
>> > __pte_offset_map_lock. Would that work for you?
>
> Thanks for pursuing this, José, but I agree with Matthew: I don't
> think your patch is right at all. The change in ptlock_ptr() did not
> make sense to me, and the change in __pte_offset_map_lock() leaves us
> still wondering what has gone wrong (and misses an rcu_read_unlock()).
>
> You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> produce this kasan report anymore": are you saying that you were able
> to reproduce the issue on x86 (without your patch)? That would be very
> interesting (and I think would disprove my hypothesis below). I ought
> to try on x86 if you managed to reproduce on it, but it's not worth
> the effort if you did not. If you have an x86 stack and registers,
> please show (though I'm uncertain how much KASAN spoils the output).

Hi,

Yes, I have a local setup based in [1], where I can spin a small
vm, build the reproducer and run it in. The only thing I took from
the webpage is the kernel config file, and the image I made it locally
by debootstrapping and running the modifications in create-image.sh
manually, the kasan report follows:

[ 111.408746][ T8885] general protection fault, probably for
non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
NOPTI
[ 111.413181][ T8885] KASAN: null-ptr-deref in range
[0x0000000000000028-0x000000000000002f]
[ 111.413181][ T8885] CPU: 1 PID: 8885 Comm: handle_kernel_p Not
tainted 6.7.0-rc1-00007-ge612cb00e200 #6
[ 111.413181][ T8885] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 111.413181][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
[ 111.423642][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00
4c 03 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa
48 c1 ea 03 <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0
a1 3a 09
[ 111.423642][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
[ 111.423642][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067 RCX:
ffffffff81ada02e
[ 111.423642][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f RDI:
0000000000000028
[ 111.423642][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007 R09:
0000000000000000
[ 111.423642][ T8885] R10: 0000000000000000 R11: 0000000000000000 R12:
0005088000000a80
[ 111.423642][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88 R15:
0000000000000000
[ 111.423642][ T8885] FS: 00007f8d3972c6c0(0000)
GS:ffff888069700000(0000) knlGS:0000000000000000
[ 111.423642][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 111.423642][ T8885] CR2: 00007f8d3970af78 CR3: 00000000224d6000 CR4:
00000000000006f0
[ 111.423642][ T8885] Call Trace:
[ 111.423642][ T8885] <TASK>
[ 111.423642][ T8885] ? show_regs+0x8f/0xa0
[ 111.423642][ T8885] ? die_addr+0x4f/0xd0
[ 111.423642][ T8885] ? exc_general_protection+0x150/0x220
[ 111.423642][ T8885] ? asm_exc_general_protection+0x26/0x30
[ 111.423642][ T8885] ? __pte_offset_map_lock+0x1de/0x310
[ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
[ 111.423642][ T8885] ? __pte_offset_map_lock+0xfa/0x310
[ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
[ 111.423642][ T8885] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 111.423642][ T8885] filemap_map_pages+0x336/0x13b0
[ 111.423642][ T8885] ? __pfx_filemap_map_pages+0x10/0x10
[ 111.423642][ T8885] ? rcu_read_unlock+0x33/0xb0
[ 111.423642][ T8885] do_fault+0x86a/0x1350
[ 111.423642][ T8885] __handle_mm_fault+0xe53/0x23a0
[ 111.423642][ T8885] ? __pfx___handle_mm_fault+0x10/0x10
[ 111.483413][ T8885] handle_mm_fault+0x369/0x890
[ 111.483413][ T8885] __get_user_pages+0x46d/0x15d0
[ 111.483413][ T8885] ? __pfx___get_user_pages+0x10/0x10
[ 111.483413][ T8885] populate_vma_page_range+0x2de/0x420
[ 111.483413][ T8885] ? __pfx_populate_vma_page_range+0x10/0x10
[ 111.483413][ T8885] ? __pfx_find_vma_intersection+0x10/0x10
[ 111.483413][ T8885] ? vm_mmap_pgoff+0x299/0x3c0
[ 111.483413][ T8885] __mm_populate+0x1da/0x380
[ 111.483413][ T8885] ? __pfx___mm_populate+0x10/0x10
[ 111.483413][ T8885] ? up_write+0x1b3/0x520
[ 111.483413][ T8885] vm_mmap_pgoff+0x2d1/0x3c0
[ 111.483413][ T8885] ? __pfx_vm_mmap_pgoff+0x10/0x10
[ 111.483413][ T8885] ksys_mmap_pgoff+0x7d/0x5b0
[ 111.483413][ T8885] __x64_sys_mmap+0x125/0x190
[ 111.483413][ T8885] do_syscall_64+0x45/0xf0
[ 111.483413][ T8885] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 111.483413][ T8885] RIP: 0033:0x7f8d39831559
[ 111.483413][ T8885] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00
00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64
89 01 48
[ 111.483413][ T8885] RSP: 002b:00007f8d3972be78 EFLAGS: 00000216
ORIG_RAX: 0000000000000009
[ 111.483413][ T8885] RAX: ffffffffffffffda RBX: 00007f8d3972c6c0 RCX:
00007f8d39831559
[ 111.483413][ T8885] RDX: b635773f07ebbeea RSI: 0000000000b36000 RDI:
0000000020000000
[ 111.483413][ T8885] RBP: 00007f8d3972bea0 R08: 00000000ffffffff R09:
0000000000000000
[ 111.483413][ T8885] R10: 0000000000008031 R11: 0000000000000216 R12:
ffffffffffffff80
[ 111.483413][ T8885] R13: 0000000000000000 R14: 00007fffcef921d0 R15:
00007f8d3970c000
[ 111.483413][ T8885] </TASK>
[ 111.483413][ T8885] Modules linked in:
[ 111.763549][ T8885] ---[ end trace 0000000000000000 ]---
[ 111.773557][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
[ 111.776045][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00
4c 03 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa
48 c1 ea 03 <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0
a1 3a 09
[ 111.805040][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
[ 111.820041][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067 RCX:
ffffffff81ada02e
[ 111.837884][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f RDI:
0000000000000028
[ 111.855313][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007 R09:
0000000000000000
[ 111.878314][ T8885] R10: 0000000000000000 R11: 0000000000000000 R12:
0005088000000a80
[ 111.910624][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88 R15:
0000000000000000
[ 111.923627][ T8885] FS: 00007f8d3972c6c0(0000)
GS:ffff888069700000(0000) knlGS:0000000000000000
[ 111.932017][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 111.941166][ T8885] CR2: 00007fa26ac38178 CR3: 00000000224d6000 CR4:
00000000000006f0
[ 111.950619][ T8885] Kernel panic - not syncing: Fatal exception
[ 111.953981][ T8885] Kernel Offset: disabled
[ 111.953981][ T8885] Rebooting in 86400 seconds..

I can test some patches for you if it helps finding out
the issue.

[1]
https://github.com/google/syzkaller/blob/master/docs/linux/setup_ubuntu-host_qemu-vm_x86-64-kernel.md

José.

2023-11-17 06:14:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On Thu, 16 Nov 2023, José Pekkarinen wrote:
> On 2023-11-16 07:23, Hugh Dickins wrote:
> > On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> >> On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
> >>
> >> > > I don't think we should be changing ptlock_ptr().
> >> >
> >> > This is where the null ptr dereference originates, so the only
> >> > alternative I can think of is to protect the life cycle of the ptdesc
> >> > to prevent it to die between the pte check and the spin_unlock of
> >> > __pte_offset_map_lock. Would that work for you?
> >
> > Thanks for pursuing this, José, but I agree with Matthew: I don't
> > think your patch is right at all. The change in ptlock_ptr() did not
> > make sense to me, and the change in __pte_offset_map_lock() leaves us
> > still wondering what has gone wrong (and misses an rcu_read_unlock()).
> >
> > You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> > produce this kasan report anymore": are you saying that you were able
> > to reproduce the issue on x86 (without your patch)? That would be very
> > interesting (and I think would disprove my hypothesis below). I ought
> > to try on x86 if you managed to reproduce on it, but it's not worth
> > the effort if you did not. If you have an x86 stack and registers,
> > please show (though I'm uncertain how much KASAN spoils the output).
>
> Hi,
>
> Yes, I have a local setup based in [1], where I can spin a small
> vm, build the reproducer and run it in. The only thing I took from
> the webpage is the kernel config file, and the image I made it locally
> by debootstrapping and running the modifications in create-image.sh
> manually, the kasan report follows:
>
> [ 111.408746][ T8885] general protection fault, probably for non-canonical
> address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [ 111.413181][ T8885] KASAN: null-ptr-deref in range
> [0x0000000000000028-0x000000000000002f]
> [ 111.413181][ T8885] CPU: 1 PID: 8885 Comm: handle_kernel_p Not tainted
> 6.7.0-rc1-00007-ge612cb00e200 #6
> [ 111.413181][ T8885] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 111.413181][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
> [ 111.423642][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00 4c 03
> 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa 48 c1 ea 03
> <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0 a1 3a 09
> [ 111.423642][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
> [ 111.423642][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067 RCX:
> ffffffff81ada02e
> [ 111.423642][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f RDI:
> 0000000000000028
> [ 111.423642][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007 R09:
> 0000000000000000
> [ 111.423642][ T8885] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0005088000000a80
> [ 111.423642][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88 R15:
> 0000000000000000
> [ 111.423642][ T8885] FS: 00007f8d3972c6c0(0000) GS:ffff888069700000(0000)
> knlGS:0000000000000000
> [ 111.423642][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 111.423642][ T8885] CR2: 00007f8d3970af78 CR3: 00000000224d6000 CR4:
> 00000000000006f0
> [ 111.423642][ T8885] Call Trace:
> [ 111.423642][ T8885] <TASK>
> [ 111.423642][ T8885] ? show_regs+0x8f/0xa0
> [ 111.423642][ T8885] ? die_addr+0x4f/0xd0
> [ 111.423642][ T8885] ? exc_general_protection+0x150/0x220
> [ 111.423642][ T8885] ? asm_exc_general_protection+0x26/0x30
> [ 111.423642][ T8885] ? __pte_offset_map_lock+0x1de/0x310
> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xfa/0x310
> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
> [ 111.423642][ T8885] ? __pfx___pte_offset_map_lock+0x10/0x10
> [ 111.423642][ T8885] filemap_map_pages+0x336/0x13b0
> [ 111.423642][ T8885] ? __pfx_filemap_map_pages+0x10/0x10
> [ 111.423642][ T8885] ? rcu_read_unlock+0x33/0xb0
> [ 111.423642][ T8885] do_fault+0x86a/0x1350
> [ 111.423642][ T8885] __handle_mm_fault+0xe53/0x23a0
> [ 111.423642][ T8885] ? __pfx___handle_mm_fault+0x10/0x10
> [ 111.483413][ T8885] handle_mm_fault+0x369/0x890
> [ 111.483413][ T8885] __get_user_pages+0x46d/0x15d0
> [ 111.483413][ T8885] ? __pfx___get_user_pages+0x10/0x10
> [ 111.483413][ T8885] populate_vma_page_range+0x2de/0x420
> [ 111.483413][ T8885] ? __pfx_populate_vma_page_range+0x10/0x10
> [ 111.483413][ T8885] ? __pfx_find_vma_intersection+0x10/0x10
> [ 111.483413][ T8885] ? vm_mmap_pgoff+0x299/0x3c0
> [ 111.483413][ T8885] __mm_populate+0x1da/0x380
> [ 111.483413][ T8885] ? __pfx___mm_populate+0x10/0x10
> [ 111.483413][ T8885] ? up_write+0x1b3/0x520
> [ 111.483413][ T8885] vm_mmap_pgoff+0x2d1/0x3c0
> [ 111.483413][ T8885] ? __pfx_vm_mmap_pgoff+0x10/0x10
> [ 111.483413][ T8885] ksys_mmap_pgoff+0x7d/0x5b0
> [ 111.483413][ T8885] __x64_sys_mmap+0x125/0x190
> [ 111.483413][ T8885] do_syscall_64+0x45/0xf0
> [ 111.483413][ T8885] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [ 111.483413][ T8885] RIP: 0033:0x7f8d39831559
> [ 111.483413][ T8885] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00
> 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
> [ 111.483413][ T8885] RSP: 002b:00007f8d3972be78 EFLAGS: 00000216 ORIG_RAX:
> 0000000000000009
> [ 111.483413][ T8885] RAX: ffffffffffffffda RBX: 00007f8d3972c6c0 RCX:
> 00007f8d39831559
> [ 111.483413][ T8885] RDX: b635773f07ebbeea RSI: 0000000000b36000 RDI:
> 0000000020000000
> [ 111.483413][ T8885] RBP: 00007f8d3972bea0 R08: 00000000ffffffff R09:
> 0000000000000000
> [ 111.483413][ T8885] R10: 0000000000008031 R11: 0000000000000216 R12:
> ffffffffffffff80
> [ 111.483413][ T8885] R13: 0000000000000000 R14: 00007fffcef921d0 R15:
> 00007f8d3970c000
> [ 111.483413][ T8885] </TASK>
> [ 111.483413][ T8885] Modules linked in:
> [ 111.763549][ T8885] ---[ end trace 0000000000000000 ]---
> [ 111.773557][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
> [ 111.776045][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00 4c 03
> 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa 48 c1 ea 03
> <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0 a1 3a 09
> [ 111.805040][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
> [ 111.820041][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067 RCX:
> ffffffff81ada02e
> [ 111.837884][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f RDI:
> 0000000000000028
> [ 111.855313][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007 R09:
> 0000000000000000
> [ 111.878314][ T8885] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0005088000000a80
> [ 111.910624][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88 R15:
> 0000000000000000
> [ 111.923627][ T8885] FS: 00007f8d3972c6c0(0000) GS:ffff888069700000(0000)
> knlGS:0000000000000000
> [ 111.932017][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 111.941166][ T8885] CR2: 00007fa26ac38178 CR3: 00000000224d6000 CR4:
> 00000000000006f0
> [ 111.950619][ T8885] Kernel panic - not syncing: Fatal exception
> [ 111.953981][ T8885] Kernel Offset: disabled
> [ 111.953981][ T8885] Rebooting in 86400 seconds..
>
> I can test some patches for you if it helps finding out
> the issue.

Thanks a lot, and you'll see that I've just asked syzbot to try what
I now believe is the correct fix: over in the other thread, since it
didn't recognize yesterday's when I sent from this thread. Please
give that a try yourself, if you have time - thanks.

It turned out that all that I needed was your assurance that you had
the repro working on x86 - I guess I'm simply too x86-centric, and
had assumed that syzbot's arm64 report implied something special on
arm, such as the subtler barriers there.

I gave repro a try on bare metal x86, and it reproduced within a minute:
though in my case not quite the stack trace you and syzbot reported,
but a more obvious oops in pmd_install(). Depending on one's
"memory model", the macro pfn_to_page() can be more or less strict:
in my case it was strict, and pmd_install() oopsed right there in
pmd_populate(); whereas in your case pmd_populate() uncomplainingly
puts something silly into the pmd entry, leaving __pte_offset_map_lock()
to stumble on that immediately afterwards. (Neither KASAN nor lockdep
required - though lockdep's spinlock pointer probably helps to make the
badness more obvious, if pmd_install() did not crash already.)

The problem is simply that filemap_map_pmd() assumed that prealloc_pte
is supplied with a preallocated page table whenever pmd_none(); but if
it has racily become pmd_none() since the preallocation decision, then
the oops. My changes have certainly provided an easy way to get that
race, but if I'm not mistaken, there was already another such race,
with the possible bug going back to 5.12.

I'll work on the commit message while waiting to hear from syzbot.

Hugh

2023-11-17 09:25:38

by José Pekkarinen

[permalink] [raw]
Subject: Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

On 2023-11-17 08:13, Hugh Dickins wrote:
> On Thu, 16 Nov 2023, José Pekkarinen wrote:
>> On 2023-11-16 07:23, Hugh Dickins wrote:
>> > On Wed, 15 Nov 2023, Matthew Wilcox wrote:
>> >> On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
>> >>
>> >> > > I don't think we should be changing ptlock_ptr().
>> >> >
>> >> > This is where the null ptr dereference originates, so the only
>> >> > alternative I can think of is to protect the life cycle of the ptdesc
>> >> > to prevent it to die between the pte check and the spin_unlock of
>> >> > __pte_offset_map_lock. Would that work for you?
>> >
>> > Thanks for pursuing this, José, but I agree with Matthew: I don't
>> > think your patch is right at all. The change in ptlock_ptr() did not
>> > make sense to me, and the change in __pte_offset_map_lock() leaves us
>> > still wondering what has gone wrong (and misses an rcu_read_unlock()).
>> >
>> > You mentioned "I tested the syzbot reproducer in x86 and it doesn't
>> > produce this kasan report anymore": are you saying that you were able
>> > to reproduce the issue on x86 (without your patch)? That would be very
>> > interesting (and I think would disprove my hypothesis below). I ought
>> > to try on x86 if you managed to reproduce on it, but it's not worth
>> > the effort if you did not. If you have an x86 stack and registers,
>> > please show (though I'm uncertain how much KASAN spoils the output).
>>
>> Hi,
>>
>> Yes, I have a local setup based in [1], where I can spin a small
>> vm, build the reproducer and run it in. The only thing I took from
>> the webpage is the kernel config file, and the image I made it locally
>> by debootstrapping and running the modifications in create-image.sh
>> manually, the kasan report follows:
>>
>> [ 111.408746][ T8885] general protection fault, probably for
>> non-canonical
>> address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN NOPTI
>> [ 111.413181][ T8885] KASAN: null-ptr-deref in range
>> [0x0000000000000028-0x000000000000002f]
>> [ 111.413181][ T8885] CPU: 1 PID: 8885 Comm: handle_kernel_p Not
>> tainted
>> 6.7.0-rc1-00007-ge612cb00e200 #6
>> [ 111.413181][ T8885] Hardware name: QEMU Standard PC (i440FX + PIIX,
>> 1996),
>> BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>> [ 111.413181][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
>> [ 111.423642][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00
>> 4c 03
>> 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa 48
>> c1 ea 03
>> <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0 a1 3a 09
>> [ 111.423642][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
>> [ 111.423642][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067
>> RCX:
>> ffffffff81ada02e
>> [ 111.423642][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f
>> RDI:
>> 0000000000000028
>> [ 111.423642][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007
>> R09:
>> 0000000000000000
>> [ 111.423642][ T8885] R10: 0000000000000000 R11: 0000000000000000
>> R12:
>> 0005088000000a80
>> [ 111.423642][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88
>> R15:
>> 0000000000000000
>> [ 111.423642][ T8885] FS: 00007f8d3972c6c0(0000)
>> GS:ffff888069700000(0000)
>> knlGS:0000000000000000
>> [ 111.423642][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0:
>> 0000000080050033
>> [ 111.423642][ T8885] CR2: 00007f8d3970af78 CR3: 00000000224d6000
>> CR4:
>> 00000000000006f0
>> [ 111.423642][ T8885] Call Trace:
>> [ 111.423642][ T8885] <TASK>
>> [ 111.423642][ T8885] ? show_regs+0x8f/0xa0
>> [ 111.423642][ T8885] ? die_addr+0x4f/0xd0
>> [ 111.423642][ T8885] ? exc_general_protection+0x150/0x220
>> [ 111.423642][ T8885] ? asm_exc_general_protection+0x26/0x30
>> [ 111.423642][ T8885] ? __pte_offset_map_lock+0x1de/0x310
>> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
>> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xfa/0x310
>> [ 111.423642][ T8885] ? __pte_offset_map_lock+0xbf/0x310
>> [ 111.423642][ T8885] ? __pfx___pte_offset_map_lock+0x10/0x10
>> [ 111.423642][ T8885] filemap_map_pages+0x336/0x13b0
>> [ 111.423642][ T8885] ? __pfx_filemap_map_pages+0x10/0x10
>> [ 111.423642][ T8885] ? rcu_read_unlock+0x33/0xb0
>> [ 111.423642][ T8885] do_fault+0x86a/0x1350
>> [ 111.423642][ T8885] __handle_mm_fault+0xe53/0x23a0
>> [ 111.423642][ T8885] ? __pfx___handle_mm_fault+0x10/0x10
>> [ 111.483413][ T8885] handle_mm_fault+0x369/0x890
>> [ 111.483413][ T8885] __get_user_pages+0x46d/0x15d0
>> [ 111.483413][ T8885] ? __pfx___get_user_pages+0x10/0x10
>> [ 111.483413][ T8885] populate_vma_page_range+0x2de/0x420
>> [ 111.483413][ T8885] ? __pfx_populate_vma_page_range+0x10/0x10
>> [ 111.483413][ T8885] ? __pfx_find_vma_intersection+0x10/0x10
>> [ 111.483413][ T8885] ? vm_mmap_pgoff+0x299/0x3c0
>> [ 111.483413][ T8885] __mm_populate+0x1da/0x380
>> [ 111.483413][ T8885] ? __pfx___mm_populate+0x10/0x10
>> [ 111.483413][ T8885] ? up_write+0x1b3/0x520
>> [ 111.483413][ T8885] vm_mmap_pgoff+0x2d1/0x3c0
>> [ 111.483413][ T8885] ? __pfx_vm_mmap_pgoff+0x10/0x10
>> [ 111.483413][ T8885] ksys_mmap_pgoff+0x7d/0x5b0
>> [ 111.483413][ T8885] __x64_sys_mmap+0x125/0x190
>> [ 111.483413][ T8885] do_syscall_64+0x45/0xf0
>> [ 111.483413][ T8885] entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [ 111.483413][ T8885] RIP: 0033:0x7f8d39831559
>> [ 111.483413][ T8885] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00
>> 00 00
>> 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
>> 08 0f 05
>> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
>> [ 111.483413][ T8885] RSP: 002b:00007f8d3972be78 EFLAGS: 00000216
>> ORIG_RAX:
>> 0000000000000009
>> [ 111.483413][ T8885] RAX: ffffffffffffffda RBX: 00007f8d3972c6c0
>> RCX:
>> 00007f8d39831559
>> [ 111.483413][ T8885] RDX: b635773f07ebbeea RSI: 0000000000b36000
>> RDI:
>> 0000000020000000
>> [ 111.483413][ T8885] RBP: 00007f8d3972bea0 R08: 00000000ffffffff
>> R09:
>> 0000000000000000
>> [ 111.483413][ T8885] R10: 0000000000008031 R11: 0000000000000216
>> R12:
>> ffffffffffffff80
>> [ 111.483413][ T8885] R13: 0000000000000000 R14: 00007fffcef921d0
>> R15:
>> 00007f8d3970c000
>> [ 111.483413][ T8885] </TASK>
>> [ 111.483413][ T8885] Modules linked in:
>> [ 111.763549][ T8885] ---[ end trace 0000000000000000 ]---
>> [ 111.773557][ T8885] RIP: 0010:__pte_offset_map_lock+0xfa/0x310
>> [ 111.776045][ T8885] Code: 48 c1 e8 03 80 3c 10 00 0f 85 12 02 00 00
>> 4c 03
>> 3d db 92 cf 0b 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 28 48 89 fa 48
>> c1 ea 03
>> <80> 3c 02 00 0f 85 e2 01 00 00 4d 8b 7f 28 4c 89 ff e8 f0 a1 3a 09
>> [ 111.805040][ T8885] RSP: 0018:ffffc90005baf738 EFLAGS: 00010216
>> [ 111.820041][ T8885] RAX: dffffc0000000000 RBX: 0005800000000067
>> RCX:
>> ffffffff81ada02e
>> [ 111.837884][ T8885] RDX: 0000000000000005 RSI: ffffffff81ad9f0f
>> RDI:
>> 0000000000000028
>> [ 111.855313][ T8885] RBP: ffff8880224c4800 R08: 0000000000000007
>> R09:
>> 0000000000000000
>> [ 111.878314][ T8885] R10: 0000000000000000 R11: 0000000000000000
>> R12:
>> 0005088000000a80
>> [ 111.910624][ T8885] R13: 1ffff92000b75ee9 R14: ffffc90005bafa88
>> R15:
>> 0000000000000000
>> [ 111.923627][ T8885] FS: 00007f8d3972c6c0(0000)
>> GS:ffff888069700000(0000)
>> knlGS:0000000000000000
>> [ 111.932017][ T8885] CS: 0010 DS: 0000 ES: 0000 CR0:
>> 0000000080050033
>> [ 111.941166][ T8885] CR2: 00007fa26ac38178 CR3: 00000000224d6000
>> CR4:
>> 00000000000006f0
>> [ 111.950619][ T8885] Kernel panic - not syncing: Fatal exception
>> [ 111.953981][ T8885] Kernel Offset: disabled
>> [ 111.953981][ T8885] Rebooting in 86400 seconds..
>>
>> I can test some patches for you if it helps finding out
>> the issue.
>
> Thanks a lot, and you'll see that I've just asked syzbot to try what
> I now believe is the correct fix: over in the other thread, since it
> didn't recognize yesterday's when I sent from this thread. Please
> give that a try yourself, if you have time - thanks.
>
> It turned out that all that I needed was your assurance that you had
> the repro working on x86 - I guess I'm simply too x86-centric, and
> had assumed that syzbot's arm64 report implied something special on
> arm, such as the subtler barriers there.
>
> I gave repro a try on bare metal x86, and it reproduced within a
> minute:
> though in my case not quite the stack trace you and syzbot reported,
> but a more obvious oops in pmd_install(). Depending on one's
> "memory model", the macro pfn_to_page() can be more or less strict:
> in my case it was strict, and pmd_install() oopsed right there in
> pmd_populate(); whereas in your case pmd_populate() uncomplainingly
> puts something silly into the pmd entry, leaving
> __pte_offset_map_lock()
> to stumble on that immediately afterwards. (Neither KASAN nor lockdep
> required - though lockdep's spinlock pointer probably helps to make the
> badness more obvious, if pmd_install() did not crash already.)
>
> The problem is simply that filemap_map_pmd() assumed that prealloc_pte
> is supplied with a preallocated page table whenever pmd_none(); but if
> it has racily become pmd_none() since the preallocation decision, then
> the oops. My changes have certainly provided an easy way to get that
> race, but if I'm not mistaken, there was already another such race,
> with the possible bug going back to 5.12.
>
> I'll work on the commit message while waiting to hear from syzbot.

Yep, your solution works for me also. Thanks!

José.