Here is the v2 series of patches to mm, based on v6.4-rc5: preparing for
the v2 of the effective changes to follow, probably next week (when I hope
s390 will be sorted), affecting pte_offset_map() and pte_offset_map_lock().
There are very few differences from v1, noted patch by patch below.
This follows on from the v2 "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 23 posted on 2023-06-08,
and replaces the v1 "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 31 posted on 2023-05-21,
which was followed by the v1 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/[email protected]/
series of 12 posted on 2023-05-28.
The first two series are "independent":
neither depends for build or correctness on the other, but both series
must be in before the third series is added to make the effective changes
(and it may be preferred to hold that one until the following release).
What is it all about? Some mmap_lock avoidance, i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts, e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.
I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but I failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.
These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.
What is this preparatory series about?
The current mmap locking will not be enough to guard against that
tricky transition between a pmd entry pointing to a page table, an empty
pmd entry, and a pmd entry pointing to a huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there. What to do about that varies: sometimes nearby error
handling indicates just to skip it; but in many cases an ACTION_AGAIN or
"goto again" is appropriate (and if that risks an infinite loop, then
there must have been an oops, or pfn 0 mistaken for a page table, before).
Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier,
and sets a better example, if each site is given appropriate handling:
even where deeper study might prove that failure could only happen if
the pmd table were corrupted.
Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more. Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock()
taking its place.
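To give the flavour, here is a minimal sketch (invented names, not a
real call site) of the calling convention which the series converges on:

/* Sketch only: how a typical scan copes with pte_offset_map_lock() failing */
static int example_pte_range(struct mm_struct *mm, pmd_t *pmd,
                             unsigned long addr, unsigned long end)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)               /* no page table here, or it changed under us */
                return -EAGAIN; /* caller skips, retries, or sets ACTION_AGAIN */

        for (; addr != end; pte++, addr += PAGE_SIZE) {
                /* ... act on *pte under ptl ... */
        }
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
}

Each patch below chooses skip, retry or ACTION_AGAIN to suit its call site.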
This posting is based on v6.4-rc5, but good for any v6.4-rc; and good
for current mm-everything and linux-next, except for one minor clash in
mm/memory.c do_swap_page(), where Ryan Roberts and I fixed the same goto.
01/32 mm: use pmdp_get_lockless() without surplus barrier()
v2: add acks from Yu Zhao and PeterX
02/32 mm/migrate: remove cruft from migration_entry_wait()s
v2: add review from Alistair
03/32 mm/pgtable: kmap_local_page() instead of kmap_atomic()
v2: same as v1
04/32 mm/pgtable: allow pte_offset_map[_lock]() to fail
v2: same as v1
05/32 mm/filemap: allow pte_offset_map_lock() to fail
v2: same as v1
06/32 mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()
v2: same as v1
07/32 mm/page_vma_mapped: reformat map_pte() with less indentation
v2: same as v1
08/32 mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
v2: same as v1
09/32 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
v2: add review from SeongJae for mm/damon part
10/32 mm/pagewalk: walk_pte_range() allow for pte_offset_map()
v2: same as v1
11/32 mm/vmwgfx: simplify pmd & pud mapping dirty helpers
v2: same as v1
12/32 mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
v2: add review from Lorenzo
13/32 mm/hmm: retry if pte_offset_map() fails
v2: add review from Alistair
14/32 fs/userfaultfd: retry if pte_offset_map() fails
v2: add ack from PeterX
15/32 mm/userfaultfd: allow pte_offset_map_lock() to fail
v2: use -EAGAIN instead of -EFAULT from PeterX
16/32 mm/debug_vm_pgtable,page_table_check: warn pte map fails
v2: same as v1
17/32 mm/various: give up if pte_offset_map[_lock]() fails
v2: removed mm/swap_state.c mod to a separate patch 31/32
18/32 mm/mprotect: delete pmd_none_or_clear_bad_unless_transhuge()
v2: same as v1
19/32 mm/mremap: retry if either pte_offset_map_*lock() fails
v2: same as v1
20/32 mm/madvise: clean up pte_offset_map_lock() scans
v2: same as v1
21/32 mm/madvise: clean up force_shm_swapin_readahead()
v2: same as v1
22/32 mm/swapoff: allow pte_offset_map[_lock]() to fail
v2: same as v1
23/32 mm/mglru: allow pte_offset_map_nolock() to fail
v2: add ack from Yu Zhao
24/32 mm/migrate_device: allow pte_offset_map_lock() to fail
v2: add review from Alistair
25/32 mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()
v2: add comment on -EBUSY from Yang Shi
26/32 mm/huge_memory: split huge pmd under one pte_offset_map()
v2: add review from Yang Shi
27/32 mm/khugepaged: allow pte_offset_map[_lock]() to fail
v2: add review from Yang Shi
28/32 mm/memory: allow pte_offset_map[_lock]() to fail
v2: same as v1
29/32 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
v2: same as v1
30/32 mm/pgtable: delete pmd_trans_unstable() and friends
v2: same as v1
31/32 mm/swap: swap_vma_readahead() do the pte_offset_map()
v2: new patch replacing last part of v1 17/31
32/32 perf/core: Allow pte_offset_map() to fail
v2: same as v1 31/31
Documentation/mm/split_page_table_lock.rst | 17 ++-
fs/proc/task_mmu.c | 32 ++---
fs/userfaultfd.c | 21 +--
include/linux/migrate.h | 4 +-
include/linux/mm.h | 27 ++--
include/linux/pgtable.h | 142 +++---------------
include/linux/swap.h | 19 ---
include/linux/swapops.h | 17 +--
kernel/events/core.c | 4 +
mm/damon/vaddr.c | 12 +-
mm/debug_vm_pgtable.c | 9 +-
mm/filemap.c | 25 ++--
mm/gup.c | 34 ++---
mm/hmm.c | 4 +-
mm/huge_memory.c | 33 +++--
mm/khugepaged.c | 83 ++++++-----
mm/ksm.c | 10 +-
mm/madvise.c | 146 ++++++++++---------
mm/mapping_dirty_helpers.c | 34 ++---
mm/memcontrol.c | 8 +-
mm/memory-failure.c | 8 +-
mm/memory.c | 224 +++++++++++++----------------
mm/mempolicy.c | 7 +-
mm/migrate.c | 40 +++---
mm/migrate_device.c | 31 +---
mm/mincore.c | 9 +-
mm/mlock.c | 4 +
mm/mprotect.c | 79 +++-------
mm/mremap.c | 28 ++--
mm/page_table_check.c | 2 +
mm/page_vma_mapped.c | 97 +++++++------
mm/pagewalk.c | 33 +++--
mm/pgtable-generic.c | 56 ++++++++
mm/swap_state.c | 45 +++---
mm/swapfile.c | 38 ++---
mm/userfaultfd.c | 8 ++
mm/vmalloc.c | 3 +-
mm/vmscan.c | 16 +--
38 files changed, 662 insertions(+), 747 deletions(-)
Hugh
Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.
HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.
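For illustration only (a sketch, not one of the call sites below), the
pattern these callers converge on, with that caveat kept in mind:

/* Sketch only: pmdp_get_lockless() in place of READ_ONCE(*pmd) + barrier() */
static bool example_pmd_maps_pte_table(pmd_t *pmd)
{
        pmd_t pmdval = pmdp_get_lockless(pmd); /* no extra barrier() needed */

        if (pmd_none(pmdval) || !pmd_present(pmdval) || pmd_trans_huge(pmdval))
                return false;
        /*
         * Interrupts are not disabled here, so pmdval is only a snapshot:
         * whatever is decided from it must be rechecked under the page
         * table lock before any ptes are modified.
         */
        return !pmd_bad(pmdval);
}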
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Acked-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 10 +---------
include/linux/pgtable.h | 17 -----------------
mm/gup.c | 6 +-----
mm/hmm.c | 2 +-
mm/khugepaged.c | 5 -----
mm/ksm.c | 3 +--
mm/memory.c | 14 ++------------
mm/mprotect.c | 5 -----
mm/page_vma_mapped.c | 2 +-
9 files changed, 7 insertions(+), 57 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..f7a0817b1ec0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,15 +349,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
- /*
- * READ_ONCE must function as a barrier with narrower scope
- * and it must be equivalent to:
- * _pmd = *pmd; barrier();
- *
- * This is to deal with the instability (as in
- * pmd_trans_unstable) of the pmd.
- */
- _pmd = READ_ONCE(*pmd);
+ _pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..8ec27fe69dc8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1344,23 +1344,6 @@ static inline int pud_trans_unstable(pud_t *pud)
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * The barrier will stabilize the pmdval in a register or on
- * the stack so that it will stop changing under the code.
- *
- * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
- * pmdp_get_lockless is allowed to return a not atomic pmdval
- * (for example pointing to an hugepage that has never been
- * mapped in the pmd). The below checks will only care about
- * the low part of the pmd with 32bit PAE x86 anyway, with the
- * exception of pmd_none(). So the important thing is that if
- * the low part of the pmd is found null, the high part will
- * be also null or the pmd_none() check below would be
- * confused.
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
/*
* !pmd_present() checks for pmd migration entries
*
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..3bd5d3854c51 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -653,11 +653,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pmd = pmd_offset(pudp, address);
- /*
- * The READ_ONCE() will stabilize the pmdval in a register or
- * on the stack so that it will stop changing under the code.
- */
- pmdval = READ_ONCE(*pmd);
+ pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (!pmd_present(pmdval))
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..e23043345615 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
pmd_t pmd;
again:
- pmd = READ_ONCE(*pmdp);
+ pmd = pmdp_get_lockless(pmdp);
if (pmd_none(pmd))
return hmm_vma_walk_hole(start, end, -1, walk);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b9d39d65b73..732f9ac393fc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -961,11 +961,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;
pmde = pmdp_get_lockless(*pmd);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
- barrier();
-#endif
if (pmd_none(pmde))
return SCAN_PMD_NONE;
if (!pmd_present(pmde))
diff --git a/mm/ksm.c b/mm/ksm.c
index 0156bded3a66..df2aa281d49d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1194,8 +1194,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
* without holding anon_vma lock for write. So when looking for a
* genuine pmde (in which to find pte), test present and !THP together.
*/
- pmde = *pmd;
- barrier();
+ pmde = pmdp_get_lockless(pmd);
if (!pmd_present(pmde) || pmd_trans_huge(pmde))
goto out;
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..2eb54c0d5d3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4925,18 +4925,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* So now it's safe to run pte_offset_map().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- vmf->orig_pte = *vmf->pte;
+ vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
- * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
- * accesses. The code below just needs a consistent view
- * for the ifs and we later double check anyway with the
- * ptl lock held. So here a barrier will do.
- */
- barrier();
if (pte_none(vmf->orig_pte)) {
pte_unmap(vmf->pte);
vmf->pte = NULL;
@@ -5060,9 +5051,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- vmf.orig_pmd = *vmf.pmd;
+ vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
- barrier();
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(vmf.orig_pmd));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..c5a13c0f1017 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -309,11 +309,6 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
-
if (pmd_none(pmdval))
return 1;
if (pmd_trans_huge(pmdval))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..64aff6718bdb 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -210,7 +210,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* compiler and used as a stale value after we've observed a
* subsequent update.
*/
- pmde = READ_ONCE(*pvmw->pmd);
+ pmde = pmdp_get_lockless(pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
(pmd_present(pmde) && pmd_devmap(pmde))) {
--
2.35.3
pte_offset_map() was still using kmap_atomic(): update it to the
preferred kmap_local_page() before making further changes there, in case
we need this as a bisection point; but I doubt it can cause any trouble.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8ec27fe69dc8..94235ff2706e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -96,9 +96,9 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#if defined(CONFIG_HIGHPTE)
#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_atomic(pmd_page(*(dir))) + \
+ ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
pte_index((address)))
-#define pte_unmap(pte) kunmap_atomic((pte))
+#define pte_unmap(pte) kunmap_local((pte))
#else
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
#define pte_unmap(pte) ((void)(pte)) /* NOP */
--
2.35.3
Allow filemap_map_pages()'s pte_offset_map_lock() to fail; and remove the
pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
return to filemap_map_pages() and let pte_offset_map_lock() discover that.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/filemap.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 28b42ee848a4..9e129ad43e0d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
if (pmd_none(*vmf->pmd))
pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd)) {
- folio_unlock(folio);
- folio_put(folio);
- return true;
- }
-
return false;
}
@@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+ if (!vmf->pte) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto out;
+ }
do {
again:
page = folio_file_page(folio, xas.xa_index);
--
2.35.3
Make pte_offset_map() a wrapper for __pte_offset_map() (which optionally
outputs pmdval), and pte_offset_map_lock() a sparse __cond_lock wrapper
for __pte_offset_map_lock(): those __funcs are added in mm/pgtable-generic.c.
__pte_offset_map() does pmdval validation (including pmd_clear_bad()
when pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verifies pmdval unchanged after getting the
lock, trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
to cover the imminent case, but we expect to generalize it later, and
it makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
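As an illustration of intended use (a sketch with invented names, not
part of the patch): pte_offset_map_nolock() hands back the ptl which
matches the returned pte, so it is still the right lock to take later,
even if *pmd changes in between.

/* Sketch only: typical pte_offset_map_nolock() usage */
static void example_scan(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
        if (!pte)
                return;         /* no page table to scan */

        /* ... lockless peeking at ptes may go here ... */

        spin_lock(ptl);         /* the lock for this pte table */
        /* ... recheck and modify ptes under the lock ... */
        pte_unmap_unlock(pte, ptl);
}

That is the point of returning the lock pointer alongside the pte,
rather than recomputing pte_lockptr(mm, pmd) afterwards.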
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/mm/split_page_table_lock.rst | 17 ++++---
include/linux/mm.h | 27 +++++++----
include/linux/pgtable.h | 22 ++++++---
mm/pgtable-generic.c | 56 ++++++++++++++++++++++
4 files changed, 101 insertions(+), 21 deletions(-)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..a834fad9de12 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock()
- maps pte and takes PTE table lock, returns pointer to the taken
- lock;
+ maps PTE and takes PTE table lock, returns pointer to PTE with
+ pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+ maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+ unmaps PTE table;
- pte_unmap_unlock()
unlocks and unmaps PTE table;
- pte_alloc_map_lock()
- allocates PTE table if needed and take the lock, returns pointer
- to taken lock or NULL if allocation failed;
- - pte_lockptr()
- returns pointer to PTE table lock;
+ allocates PTE table if needed and takes its lock, returns pointer to
+ PTE with pointer to its lock, or returns NULL if allocation failed;
- pmd_lock()
takes PMD table lock, returns pointer to taken lock;
- pmd_lockptr()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2e56980853 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}
-#define pte_offset_map_lock(mm, pmd, address, ptlp) \
-({ \
- spinlock_t *__ptl = pte_lockptr(mm, pmd); \
- pte_t *__pte = pte_offset_map(pmd, address); \
- *(ptlp) = __ptl; \
- spin_lock(__ptl); \
- __pte; \
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+ return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pte_t *pte;
+
+ __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+ return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94235ff2706e..3fabbb018557 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#define pte_offset_kernel pte_offset_kernel
#endif
-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
- pte_index((address)))
-#define pte_unmap(pte) kunmap_local((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+ ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte) do { \
+ kunmap_local((pte)); \
+ /* rcu_read_unlock() to be added later */ \
+} while (0)
#else
-#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte)) /* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+ return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+ /* rcu_read_unlock() to be added later */
+}
#endif
/* Find an entry in the second-level page table.. */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d2fc52bffafc..c7ab18a5fb77 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,8 @@
#include <linux/pagemap.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/mm_inline.h>
#include <asm/tlb.h>
@@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
}
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+ pmd_t pmdval;
+
+ /* rcu_read_lock() to be added later */
+ pmdval = pmdp_get_lockless(pmd);
+ if (pmdvalp)
+ *pmdvalp = pmdval;
+ if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_bad(pmdval))) {
+ pmd_clear_bad(pmd);
+ goto nomap;
+ }
+ return __pte_map(&pmdval, addr);
+nomap:
+ /* rcu_read_unlock() to be added later */
+ return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ spinlock_t *ptl;
+ pmd_t pmdval;
+ pte_t *pte;
+again:
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (unlikely(!pte))
+ return pte;
+ ptl = pte_lockptr(mm, &pmdval);
+ spin_lock(ptl);
+ if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ *ptlp = ptl;
+ return pte;
+ }
+ pte_unmap_unlock(pte, ptl);
+ goto again;
+}
--
2.35.3
Simple walk_page_range() users should set ACTION_AGAIN to retry when
pte_offset_map_lock() fails.
No need to check pmd_trans_unstable(): that was precisely to avoid the
possibility of calling pte_offset_map() on a racily removed or inserted
THP entry, but such cases are now safely handled inside it. Likewise
there is no need to check pmd_none() or pmd_bad() before calling it.
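Each conversion follows the same shape; a minimal sketch of an imaginary
pmd_entry callback (not taken from this patch):

/* Sketch only: retry a pagewalk extent when there is no page table */
static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long end, struct mm_walk *walk)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        if (!pte) {
                walk->action = ACTION_AGAIN;    /* walker revisits this pmd */
                return 0;
        }
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                /* ... per-pte work ... */
        }
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
}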
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: SeongJae Park <[email protected]> for mm/damon part
---
fs/proc/task_mmu.c | 32 ++++++++++++++++----------------
mm/damon/vaddr.c | 12 ++++++++----
mm/mempolicy.c | 7 ++++---
mm/mincore.c | 9 ++++-----
mm/mlock.c | 4 ++++
5 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 420510f6a545..dba5052ce09b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,14 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}
- if (pmd_trans_unstable(pmd))
- goto out;
- /*
- * The mmap_lock held all the way back in m_start() is what
- * keeps khugepaged out of here and from collapsing things
- * in here.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE)
smaps_pte_entry(pte, addr, walk);
pte_unmap_unlock(pte - 1, ptl);
@@ -1191,10 +1188,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
-
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;
@@ -1538,9 +1536,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
spin_unlock(ptl);
return err;
}
-
- if (pmd_trans_unstable(pmdp))
- return 0;
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
@@ -1548,6 +1543,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
* goes beyond vma->vm_end.
*/
orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return err;
+ }
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;
@@ -1887,11 +1886,12 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
return 0;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
if (!page)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 1fec16d7263e..b8762ff15c3c 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -318,9 +318,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
}
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return 0;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
damon_ptep_mkold(pte, walk->mm, addr);
@@ -464,9 +466,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
regular_page:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return -EINVAL;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
folio = damon_get_folio(pte_pfn(*pte));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1756389a0609..4d0bcf6f0d52 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,10 +514,11 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (ptl)
return queue_folios_pmd(pmd, ptl, addr, end, walk);
- if (pmd_trans_unstable(pmd))
- return 0;
-
mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
diff --git a/mm/mincore.c b/mm/mincore.c
index 2d5be013a25a..f33f6a0b1ded 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -113,12 +113,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}
- if (pmd_trans_unstable(pmd)) {
- __mincore_unmapped_range(addr, end, vma, vec);
- goto out;
- }
-
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!ptep) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;
diff --git a/mm/mlock.c b/mm/mlock.c
index 40b43f8740df..9f2b1173b1b1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -329,6 +329,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
}
start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
--
2.35.3
wp_clean_pmd_entry() need not check pmd_trans_unstable() or pmd_none(),
wp_clean_pud_entry() need not check pud_trans_unstable() or pud_none():
it's just the ACTION_CONTINUE when trans_huge or devmap that's needed
to prevent splitting, and we're hoping to remove pmd_trans_unstable().
Is that PUD #ifdef necessary? Maybe some configs are missing a stub.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mapping_dirty_helpers.c | 34 +++++++++-------------------------
1 file changed, 9 insertions(+), 25 deletions(-)
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index e1eb33f49059..87b4beeda4fa 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -128,19 +128,11 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- if (!pmd_trans_unstable(&pmdval))
- return 0;
-
- if (pmd_none(pmdval)) {
- walk->action = ACTION_AGAIN;
- return 0;
- }
-
- /* Huge pmd, present or migrated */
- walk->action = ACTION_CONTINUE;
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+ /* Do not split a huge pmd, present or migrated */
+ if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
-
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
@@ -156,23 +148,15 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
pud_t pudval = READ_ONCE(*pud);
- if (!pud_trans_unstable(&pudval))
- return 0;
-
- if (pud_none(pudval)) {
- walk->action = ACTION_AGAIN;
- return 0;
- }
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
- /* Huge pud */
- walk->action = ACTION_CONTINUE;
- if (pud_trans_huge(pudval) || pud_devmap(pudval))
+ /* Do not split a huge pud */
+ if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
+ walk->action = ACTION_CONTINUE;
+ }
#endif
-
return 0;
}
--
2.35.3
migration_entry_wait_on_locked() does not need to take a mapped pte
pointer, its callers can do the unmap first. Annotate it with
__releases(ptl) to reduce sparse warnings.
Fold __migration_entry_wait_huge() into migration_entry_wait_huge().
Fold __migration_entry_wait() into migration_entry_wait(), preferring
the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
include/linux/migrate.h | 4 ++--
include/linux/swapops.h | 17 +++--------------
mm/filemap.c | 13 ++++---------
mm/migrate.c | 37 +++++++++++++------------------------
4 files changed, 22 insertions(+), 49 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6241a1596a75..affea3063473 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,8 +75,8 @@ bool isolate_movable_page(struct page *page, isolate_mode_t mode);
int migrate_huge_page_move_mapping(struct address_space *mapping,
struct folio *dst, struct folio *src);
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
- spinlock_t *ptl);
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+ __releases(ptl);
void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a451b7afcb3..4c932cb45e0b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -332,15 +332,9 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
return false;
}
-extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
-#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
-#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
@@ -362,15 +356,10 @@ static inline int is_migration_entry(swp_entry_t swp)
return 0;
}
-static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address) { }
-#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
-#endif /* CONFIG_HUGETLB_PAGE */
+ unsigned long address) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *pte) { }
static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..28b42ee848a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1359,8 +1359,6 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
/**
* migration_entry_wait_on_locked - Wait for a migration entry to be removed
* @entry: migration swap entry.
- * @ptep: mapped pte pointer. Will return with the ptep unmapped. Only required
- * for pte entries, pass NULL for pmd entries.
* @ptl: already locked ptl. This function will drop the lock.
*
* Wait for a migration entry referencing the given page to be removed. This is
@@ -1369,13 +1367,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
* should be called while holding the ptl for the migration entry referencing
* the page.
*
- * Returns after unmapping and unlocking the pte/ptl with pte_unmap_unlock().
+ * Returns after unlocking the ptl.
*
* This follows the same logic as folio_wait_bit_common() so see the comments
* there.
*/
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
- spinlock_t *ptl)
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+ __releases(ptl)
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1409,10 +1407,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
* a valid reference to the page, and it must take the ptl to remove the
* migration entry. So the page is valid until the ptl is dropped.
*/
- if (ptep)
- pte_unmap_unlock(ptep, ptl);
- else
- spin_unlock(ptl);
+ spin_unlock(ptl);
for (;;) {
unsigned int flags;
diff --git a/mm/migrate.c b/mm/migrate.c
index 01cac26a3127..3ecb7a40075f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -296,14 +296,18 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
* get to the page and wait until migration is finished.
* When we return from this function the fault will be retried.
*/
-void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl)
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
{
+ spinlock_t *ptl;
+ pte_t *ptep;
pte_t pte;
swp_entry_t entry;
- spin_lock(ptl);
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
+ pte_unmap(ptep);
+
if (!is_swap_pte(pte))
goto out;
@@ -311,18 +315,10 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
if (!is_migration_entry(entry))
goto out;
- migration_entry_wait_on_locked(entry, ptep, ptl);
+ migration_entry_wait_on_locked(entry, ptl);
return;
out:
- pte_unmap_unlock(ptep, ptl);
-}
-
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
-{
- spinlock_t *ptl = pte_lockptr(mm, pmd);
- pte_t *ptep = pte_offset_map(pmd, address);
- __migration_entry_wait(mm, ptep, ptl);
+ spin_unlock(ptl);
}
#ifdef CONFIG_HUGETLB_PAGE
@@ -332,9 +328,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
*
* This function will release the vma lock before returning.
*/
-void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl)
+void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep)
{
+ spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep);
pte_t pte;
hugetlb_vma_assert_locked(vma);
@@ -352,16 +348,9 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
* lock release in migration_entry_wait_on_locked().
*/
hugetlb_vma_unlock_read(vma);
- migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+ migration_entry_wait_on_locked(pte_to_swp_entry(pte), ptl);
}
}
-
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
-{
- spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
-
- __migration_entry_wait_huge(vma, pte, ptl);
-}
#endif
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -372,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
ptl = pmd_lock(mm, pmd);
if (!is_pmd_migration_entry(*pmd))
goto unlock;
- migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), NULL, ptl);
+ migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), ptl);
return;
unlock:
spin_unlock(ptl);
--
2.35.3
Make map_pte() use pte_offset_map_nolock(), to make sure of the ptl
belonging to the pte, even if the pmd entry is then changed racily:
page_vma_mapped_walk() then uses that instead of getting pte_lockptr()
later, or restarts if map_pte() found no page table.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 947dc7491815..2af734274073 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,16 +13,28 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
return false;
}
-static bool map_pte(struct page_vma_mapped_walk *pvmw)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
if (pvmw->flags & PVMW_SYNC) {
/* Use the stricter lookup */
pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
pvmw->address, &pvmw->ptl);
- return true;
+ *ptlp = pvmw->ptl;
+ return !!pvmw->pte;
}
- pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ /*
+ * It is important to return the ptl corresponding to pte,
+ * in case *pvmw->pmd changes underneath us; so we need to
+ * return it even when choosing not to lock, in case caller
+ * proceeds to loop over next ptes, and finds a match later.
+ * Though, in most cases, page lock already protects this.
+ */
+ pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, ptlp);
+ if (!pvmw->pte)
+ return false;
+
if (pvmw->flags & PVMW_MIGRATION) {
if (!is_swap_pte(*pvmw->pte))
return false;
@@ -51,7 +63,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
} else if (!pte_present(*pvmw->pte)) {
return false;
}
- pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+ pvmw->ptl = *ptlp;
spin_lock(pvmw->ptl);
return true;
}
@@ -156,6 +168,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long end;
+ spinlock_t *ptl;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -257,8 +270,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
- if (!map_pte(pvmw))
+ if (!map_pte(pvmw, &ptl)) {
+ if (!pvmw->pte)
+ goto restart;
goto next_pte;
+ }
this_pte:
if (check_pte(pvmw))
return true;
@@ -281,7 +297,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+ pvmw->ptl = ptl;
spin_lock(pvmw->ptl);
}
goto this_pte;
--
2.35.3
Revert commit a7a69d8ba88d ("mm/thp: another PVMW_SYNC fix in
page_vma_mapped_walk()"): I was proud of that "Aha!" commit at the time,
but in revisiting page_vma_mapped_walk() for pte_offset_map() failure,
that block raised a doubt: and it now seems utterly bogus. The prior
map_pte() has taken ptl unconditionally when PVMW_SYNC: I must have
forgotten that when making the change. It did no harm, but could not
have fixed a BUG or WARN, and is hard to reconcile with coming changes.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 64aff6718bdb..007dc7456f0e 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -275,10 +275,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
goto restart;
}
pvmw->pte++;
- if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
- }
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
--
2.35.3
vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
but it's intended for userspace page tables: prefer pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
---
mm/vmalloc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9683573f1225..741722d247d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -703,11 +703,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
- ptep = pte_offset_map(pmd, addr);
+ ptep = pte_offset_kernel(pmd, addr);
pte = *ptep;
if (pte_present(pte))
page = pte_page(pte);
- pte_unmap(ptep);
return page;
}
--
2.35.3
Make move_ptes() return -EAGAIN if pte_offset_map_lock() of old fails, or
if pte_offset_map_nolock() of new fails: move_page_tables() then retries.
But that does need a pmd_none() check inside, to stop endless loop when
huge shmem is truncated (thank you to syzbot); and move_huge_pmd() must
tolerate that a page table might have been allocated there just before
(of course it would be more satisfying to remove the empty page table,
but this is not a path worth optimizing).
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/huge_memory.c | 5 +++--
mm/mremap.c | 28 ++++++++++++++++++++--------
2 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..d4bd5fa7c823 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1760,9 +1760,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
/*
* The destination pmd shouldn't be established, free_pgtables()
- * should have release it.
+ * should have released it; but move_page_tables() might have already
+ * inserted a page table, if racing against shmem/file collapse.
*/
- if (WARN_ON(!pmd_none(*new_pmd))) {
+ if (!pmd_none(*new_pmd)) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
return false;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..1fc47b4f38d7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -133,7 +133,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
return pte;
}
-static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
@@ -143,6 +143,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
+ int err = 0;
/*
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -170,8 +171,16 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
* pte locks because exclusive mmap_lock prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
- new_pte = pte_offset_map(new_pmd, new_addr);
- new_ptl = pte_lockptr(mm, new_pmd);
+ if (!old_pte) {
+ err = -EAGAIN;
+ goto out;
+ }
+ new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ if (!new_pte) {
+ pte_unmap_unlock(old_pte, old_ptl);
+ err = -EAGAIN;
+ goto out;
+ }
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
@@ -208,8 +217,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+out:
if (need_rmap_locks)
drop_rmap_locks(vma);
+ return err;
}
#ifndef arch_supports_page_table_move
@@ -537,6 +548,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
+again:
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
pmd_devmap(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
@@ -544,8 +556,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, need_rmap_locks))
continue;
split_huge_pmd(vma, old_pmd, old_addr);
- if (pmd_trans_unstable(old_pmd))
- continue;
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
extent == PMD_SIZE) {
/*
@@ -556,11 +566,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, true))
continue;
}
-
+ if (pmd_none(*old_pmd))
+ continue;
if (pte_alloc(new_vma->vm_mm, new_pmd))
break;
- move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
- new_pmd, new_addr, need_rmap_locks);
+ if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+ goto again;
}
mmu_notifier_invalidate_range_end(&range);
--
2.35.3
Came here to make madvise's several pte_offset_map_lock() scans advance
to the next extent on failure, and to remove superfluous pmd_trans_unstable()
and pmd_none_or_trans_huge_or_clear_bad() calls. But also did some
nearby cleanup.
swapin_walk_pmd_entry(): don't name an address "index"; don't drop the
lock after every pte, only when calling out to read_swap_cache_async().
madvise_cold_or_pageout_pte_range() and madvise_free_pte_range():
prefer "start_pte" for pointer, orig_pte usually denotes a saved pte
value; leave lazy MMU mode before unlocking; merge the success and
failure paths after split_folio().
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 122 ++++++++++++++++++++++++++++-----------------------
1 file changed, 68 insertions(+), 54 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..0af64c4a8f82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -188,37 +188,43 @@ static int madvise_update_vma(struct vm_area_struct *vma,
#ifdef CONFIG_SWAP
static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
- unsigned long end, struct mm_walk *walk)
+ unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->private;
- unsigned long index;
struct swap_iocb *splug = NULL;
+ pte_t *ptep = NULL;
+ spinlock_t *ptl;
+ unsigned long addr;
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- return 0;
-
- for (index = start; index != end; index += PAGE_SIZE) {
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
pte_t pte;
swp_entry_t entry;
struct page *page;
- spinlock_t *ptl;
- pte_t *ptep;
- ptep = pte_offset_map_lock(vma->vm_mm, pmd, index, &ptl);
+ if (!ptep++) {
+ ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!ptep)
+ break;
+ }
+
pte = *ptep;
- pte_unmap_unlock(ptep, ptl);
-
if (!is_swap_pte(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
continue;
+ pte_unmap_unlock(ptep, ptl);
+ ptep = NULL;
+
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false, &splug);
+ vma, addr, false, &splug);
if (page)
put_page(page);
}
+
+ if (ptep)
+ pte_unmap_unlock(ptep, ptl);
swap_read_unplug(splug);
cond_resched();
@@ -340,7 +346,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
bool pageout = private->pageout;
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
spinlock_t *ptl;
struct folio *folio = NULL;
LIST_HEAD(folio_list);
@@ -422,11 +428,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
regular_folio:
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -447,25 +453,28 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
* are sure it's worth. Split it if we are only owner.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- break;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- break;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -510,8 +519,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio_deactivate(folio);
}
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
if (pageout)
reclaim_pages(&folio_list);
cond_resched();
@@ -612,7 +623,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
struct folio *folio;
int nr_swap = 0;
unsigned long next;
@@ -620,13 +631,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
- goto next;
-
- if (pmd_trans_unstable(pmd))
- return 0;
+ return 0;
tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -664,23 +674,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
* deactivate all pages.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
- goto out;
+ break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- goto out;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- goto out;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -725,17 +738,18 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
folio_mark_lazyfree(folio);
}
-out:
+
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
cond_resched();
-next:
+
return 0;
}
--
2.35.3
Some nearby MADV_WILLNEED cleanup unrelated to pte_offset_map_lock().
shmem_swapin_range() is a better name than force_shm_swapin_readahead().
Fix unimportant off-by-one on end_index. Call the swp_entry_t "entry"
rather than "swap": either is okay, but entry is the name used elsewhere
in mm/madvise.c. Do not assume GFP_HIGHUSER_MOVABLE: that's right for
anon swap, but shmem should take gfp from mapping. Pass the actual vma
and address to read_swap_cache_async(), in case a NUMA mempolicy applies.
lru_add_drain() at outer level, like madvise_willneed()'s other branch.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 0af64c4a8f82..9b3c9610052f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -235,30 +235,34 @@ static const struct mm_walk_ops swapin_walk_ops = {
.pmd_entry = swapin_walk_pmd_entry,
};
-static void force_shm_swapin_readahead(struct vm_area_struct *vma,
+static void shmem_swapin_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct address_space *mapping)
{
XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
- pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
+ pgoff_t end_index = linear_page_index(vma, end) - 1;
struct page *page;
struct swap_iocb *splug = NULL;
rcu_read_lock();
xas_for_each(&xas, page, end_index) {
- swp_entry_t swap;
+ unsigned long addr;
+ swp_entry_t entry;
if (!xa_is_value(page))
continue;
- swap = radix_to_swp_entry(page);
+ entry = radix_to_swp_entry(page);
/* There might be swapin error entries in shmem mapping. */
- if (non_swap_entry(swap))
+ if (non_swap_entry(entry))
continue;
+
+ addr = vma->vm_start +
+ ((xas.xa_index - vma->vm_pgoff) << PAGE_SHIFT);
xas_pause(&xas);
rcu_read_unlock();
- page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false, &splug);
+ page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
+ vma, addr, false, &splug);
if (page)
put_page(page);
@@ -266,8 +270,6 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
}
rcu_read_unlock();
swap_read_unplug(splug);
-
- lru_add_drain(); /* Push any new pages onto the LRU now */
}
#endif /* CONFIG_SWAP */
@@ -291,8 +293,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
}
if (shmem_mapping(file->f_mapping)) {
- force_shm_swapin_readahead(vma, start, end,
- file->f_mapping);
+ shmem_swapin_range(vma, start, end, file->f_mapping);
+ lru_add_drain(); /* Push any new pages onto the LRU now */
return 0;
}
#else
--
2.35.3
walk_pte_range() has a no_vma option to serve walk_page_range_novma().
I don't know of any problem, but it looks safer to check for init_mm,
and use pte_offset_kernel() rather than pte_offset_map() in that case:
pte_offset_map()'s pmdval validation is intended for userspace.
Allow for its pte_offset_map() or pte_offset_map_lock() to fail, and
retry with ACTION_AGAIN if so. Add a second check for ACTION_AGAIN
in walk_pmd_range(), to catch it after return from walk_pte_range().
Remove the pmd_trans_unstable() check after split_huge_pmd() in
walk_pmd_range(): walk_pte_range() now handles those cases safely
(and they must fail powerpc's is_hugepd() check).
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/pagewalk.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..64437105fe0d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -46,15 +46,27 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
spinlock_t *ptl;
if (walk->no_vma) {
- pte = pte_offset_map(pmd, addr);
- err = walk_pte_range_inner(pte, addr, end, walk);
- pte_unmap(pte);
+ /*
+ * pte_offset_map() might apply user-specific validation.
+ */
+ if (walk->mm == &init_mm)
+ pte = pte_offset_kernel(pmd, addr);
+ else
+ pte = pte_offset_map(pmd, addr);
+ if (pte) {
+ err = walk_pte_range_inner(pte, addr, end, walk);
+ if (walk->mm != &init_mm)
+ pte_unmap(pte);
+ }
} else {
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- err = walk_pte_range_inner(pte, addr, end, walk);
- pte_unmap_unlock(pte, ptl);
+ if (pte) {
+ err = walk_pte_range_inner(pte, addr, end, walk);
+ pte_unmap_unlock(pte, ptl);
+ }
}
-
+ if (!pte)
+ walk->action = ACTION_AGAIN;
return err;
}
@@ -141,11 +153,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
!(ops->pte_entry))
continue;
- if (walk->vma) {
+ if (walk->vma)
split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
- }
if (is_hugepd(__hugepd(pmd_val(*pmd))))
err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
@@ -153,6 +162,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
+
+ if (walk->action == ACTION_AGAIN)
+ goto again;
+
} while (pmd++, addr = next, addr != end);
return err;
--
2.35.3
mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
failed pte_offset_map_lock() as -EAGAIN, which mfill_atomic() already
returns to user for a similar race.
Signed-off-by: Hugh Dickins <[email protected]>
---
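Not part of the patch: a minimal sketch of the convention relied on above,
with an invented helper name. A pte_offset_map_lock() failure is reported
as -EAGAIN, which mfill_atomic() already returns to the caller for retry.

#include <linux/mm.h>

/* Sketch only: the -EAGAIN convention for a vanished page table */
static int sketch_install_pte(struct mm_struct *mm, pmd_t *dst_pmd,
			      unsigned long dst_addr, pte_t newpte)
{
	spinlock_t *ptl;
	pte_t *dst_pte;
	int ret = -EAGAIN;

	dst_pte = pte_offset_map_lock(mm, dst_pmd, dst_addr, &ptl);
	if (!dst_pte)
		goto out;	/* pmd raced to none or huge: caller retries */

	ret = -EEXIST;
	if (!pte_none(*dst_pte))
		goto out_unlock;

	/* the real code also updates rmap, counters and mmu cache */
	set_pte_at(mm, dst_addr, dst_pte, newpte);
	ret = 0;
out_unlock:
	pte_unmap_unlock(dst_pte, ptl);
out:
	return ret;
}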
mm/userfaultfd.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e97a0b4889fc..5fd787158c70 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -76,7 +76,10 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (flags & MFILL_ATOMIC_WP)
_dst_pte = pte_mkuffd_wp(_dst_pte);
+ ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;
if (vma_is_shmem(dst_vma)) {
/* serialize against truncate with the page table lock */
@@ -121,6 +124,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}
@@ -212,7 +216,10 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
+ ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;
if (dst_vma->vm_file) {
/* the shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
@@ -231,6 +238,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}
--
2.35.3
change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now. change_pmd_range() just wants a pmd_none() check before wasting
time on MMU notifiers, then checks on the read-once _pmd value to work
out what's needed for huge cases. Once change_pte_range() returns
-EAGAIN when pte_offset_map_lock() fails, so that change_pmd_range()
can retry the same pmd entry, nothing more special is needed.
Signed-off-by: Hugh Dickins <[email protected]>
---
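Not part of the patch, just an illustrative sketch of the retry flow
described above, using invented sketch_* names: a negative return from the
pte level sends the pmd level back to re-read and re-decide on the same
pmd entry.

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch only: the pte level reports -EAGAIN if the page table vanished */
static long sketch_change_pte_level(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;
	long pages = 0;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -EAGAIN;
	do {
		/* ... examine and update *pte, counting changed pages ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
	return pages;
}

/* Sketch only: the pmd level retries, re-checking what the pmd now holds */
static long sketch_change_pmd_level(struct mm_struct *mm, pud_t *pud,
				    unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;
	long pages = 0, ret;

	do {
again:
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd))
			goto skip;
		if (pmd_leaf(*pmd))
			goto skip;	/* real code splits or changes the huge pmd */
		ret = sketch_change_pte_level(mm, pmd, addr, next);
		if (ret < 0)		/* no page table there after all */
			goto again;
		pages += ret;
skip:
		cond_resched();
	} while (pmd++, addr = next, addr != end);

	return pages;
}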
mm/mprotect.c | 74 ++++++++++++---------------------------------------
1 file changed, 17 insertions(+), 57 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c5a13c0f1017..64e1df0af514 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -93,22 +93,9 @@ static long change_pte_range(struct mmu_gather *tlb,
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
tlb_change_page_size(tlb, PAGE_SIZE);
-
- /*
- * Can be called with only the mmap_lock for reading by
- * prot_numa so we must check the pmd isn't constantly
- * changing from under us from pmd_none to pmd_trans_huge
- * and/or the other way around.
- */
- if (pmd_trans_unstable(pmd))
- return 0;
-
- /*
- * The pmd points to a regular pte so the pmd can't change
- * from under us even if the mmap_lock is only hold for
- * reading.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EAGAIN;
/* Get target node for single threaded private VMAs */
if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -301,26 +288,6 @@ static long change_pte_range(struct mmu_gather *tlb,
return pages;
}
-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
-
- if (pmd_none(pmdval))
- return 1;
- if (pmd_trans_huge(pmdval))
- return 0;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
-
- return 0;
-}
-
/*
* Return true if we want to split THPs into PTE mappings in change
* protection procedure, false otherwise.
@@ -398,7 +365,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
long ret;
-
+ pmd_t _pmd;
+again:
next = pmd_addr_end(addr, end);
ret = change_pmd_prepare(vma, pmd, cp_flags);
@@ -406,16 +374,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pages = ret;
break;
}
- /*
- * Automatic NUMA balancing walks the tables with mmap_lock
- * held for read. It's possible a parallel update to occur
- * between pmd_trans_huge() and a pmd_none_or_clear_bad()
- * check leading to a false positive and clearing.
- * Hence, it's necessary to atomically read the PMD value
- * for all the checks.
- */
- if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
- pmd_none_or_clear_bad_unless_trans_huge(pmd))
+
+ if (pmd_none(*pmd))
goto next;
/* invoke the mmu notifier if the pmd is populated */
@@ -426,7 +386,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
mmu_notifier_invalidate_range_start(&range);
}
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ _pmd = pmdp_get_lockless(pmd);
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -441,15 +402,10 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
break;
}
} else {
- /*
- * change_huge_pmd() does not defer TLB flushes,
- * so no need to propagate the tlb argument.
- */
- int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ ret = change_huge_pmd(tlb, vma, pmd,
addr, newprot, cp_flags);
-
- if (nr_ptes) {
- if (nr_ptes == HPAGE_PMD_NR) {
+ if (ret) {
+ if (ret == HPAGE_PMD_NR) {
pages += HPAGE_PMD_NR;
nr_huge_updates++;
}
@@ -460,8 +416,12 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
}
/* fall through, the trans huge pmd just split */
}
- pages += change_pte_range(tlb, vma, pmd, addr, next,
- newprot, cp_flags);
+
+ ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
+ cp_flags);
+ if (ret < 0)
+ goto again;
+ pages += ret;
next:
cond_resched();
} while (pmd++, addr = next, addr != end);
--
2.35.3
There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
distinguish huge_zero_page from a normal THP: follow_page_pte() handles
any instability, and here it's a good idea to replace any pmd_none(*pmd)
by a page table a.s.a.p, in the huge_zero_page case as for a normal THP;
and this removes an unnecessary possibility of -EBUSY failure.
(Hmm, couldn't the normal THP case have hit an unstably refaulted THP
before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index bb67193c5460..4ad50a59897f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
if (flags & FOLL_SPLIT_PMD) {
- int ret;
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- ret = 0;
- split_huge_pmd(vma, pmd, address);
- if (pmd_trans_unstable(pmd))
- ret = -EBUSY;
- } else {
- spin_unlock(ptl);
- split_huge_pmd(vma, pmd, address);
- ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
- }
-
- return ret ? ERR_PTR(ret) :
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, address);
+ /* If pmd was left empty, stuff a page table in there quickly */
+ return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
page = follow_trans_huge_pmd(vma, address, pmd, flags);
--
2.35.3
Instead of worrying whether the pmd is stable, userfaultfd_must_wait()
call pte_offset_map() as before, but go back to try again if that fails.
Risk of endless loop? It already broke out if pmd_none(), !pmd_present()
or pmd_trans_huge(), and pte_offset_map() would have cleared pmd_bad():
which leaves pmd_devmap(). Presumably pmd_devmap() is inappropriate in
a vma subject to userfaultfd (it would have been mistreated before),
but add a check just to avoid all possibility of endless loop there.
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f7a0817b1ec0..ca83423f8d54 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,12 +349,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
+again:
_pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;
ret = false;
- if (!pmd_present(_pmd))
+ if (!pmd_present(_pmd) || pmd_devmap(_pmd))
goto out;
if (pmd_trans_huge(_pmd)) {
@@ -363,11 +364,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;
}
- /*
- * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
- * and use the standard pte_offset_map() instead of parsing _pmd.
- */
pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ ret = true;
+ goto again;
+ }
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us. PTE markers should be handled the same as none
--
2.35.3
__collapse_huge_page_swapin(): don't drop the map after every pte, it
only has to be dropped by do_swap_page(); give up if pte_offset_map()
fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
fix comment on returned result; fix vmf.pgoff, though it's not used.
collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
from clearing; allow failure, but it should be impossible there.
hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
pte_offset_map_lock() failure.
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
---
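Not part of the patch: a sketch of the map-on-demand loop shape adopted
here (and repeated later in unuse_pte_range() and swap_vma_readahead()),
with invented names. pte starts NULL, is mapped on first use, advances
while it stays mapped, and is reset to NULL whenever something
(do_swap_page() in the real code) has dropped the mapping.

#include <linux/mm.h>
#include <linux/swapops.h>

/* Sketch only: keep the pte mapped across iterations, remap on demand */
static void sketch_scan_ptes(pmd_t *pmd, unsigned long haddr)
{
	unsigned long addr, end = haddr + PTRS_PER_PTE * PAGE_SIZE;
	pte_t *pte = NULL;

	for (addr = haddr; addr < end; addr += PAGE_SIZE) {
		/*
		 * If pte is NULL, "!pte++" is true and pte is reassigned
		 * just below; if pte is already mapped, "!pte++" is false
		 * and the ++ steps to the next entry of the same table.
		 */
		if (!pte++) {
			pte = pte_offset_map(pmd, addr);
			if (!pte)
				break;	/* no page table here: give up */
		}
		if (!is_swap_pte(*pte))
			continue;
		/*
		 * Anything that drops the mapping must leave pte NULL, so
		 * that it is remapped at the right address next time round.
		 */
		pte_unmap(pte);
		pte = NULL;
		/* ... fault in or otherwise handle the swap entry ... */
	}
	if (pte)
		pte_unmap(pte);
}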
mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 23 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 732f9ac393fc..49cfa7cdfe93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -993,9 +993,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
* Only done if hpage_collapse_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
- * Note that if false is returned, mmap_lock will be released.
+ * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
-
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
@@ -1004,23 +1003,35 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
int swapped_in = 0;
vm_fault_t ret = 0;
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ int result;
+ pte_t *pte = NULL;
for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
.vma = vma,
.address = address,
- .pgoff = linear_page_index(vma, haddr),
+ .pgoff = linear_page_index(vma, address),
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
};
- vmf.pte = pte_offset_map(pmd, address);
- vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
- pte_unmap(vmf.pte);
- continue;
+ if (!pte++) {
+ pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ mmap_read_unlock(mm);
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
}
+
+ vmf.orig_pte = *pte;
+ if (!is_swap_pte(vmf.orig_pte))
+ continue;
+
+ vmf.pte = pte;
ret = do_swap_page(&vmf);
+ /* Which unmaps pte (after perhaps re-checking the entry) */
+ pte = NULL;
/*
* do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
@@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
* resulting in later failure.
*/
if (ret & VM_FAULT_RETRY) {
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
/* Likely, but not guaranteed, that page lock failed */
- return SCAN_PAGE_LOCK;
+ result = SCAN_PAGE_LOCK;
+ goto out;
}
if (ret & VM_FAULT_ERROR) {
mmap_read_unlock(mm);
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
- return SCAN_FAIL;
+ result = SCAN_FAIL;
+ goto out;
}
swapped_in++;
}
+ if (pte)
+ pte_unmap(pte);
+
/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
if (swapped_in)
lru_add_drain();
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
- return SCAN_SUCCEED;
+ result = SCAN_SUCCEED;
+out:
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ return result;
}
static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -1146,9 +1162,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
address + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
- pte = pte_offset_map(pmd, address);
- pte_ptl = pte_lockptr(mm, pmd);
-
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* This removes any huge TLB entry from the CPU so we won't allow
@@ -1163,13 +1176,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- spin_lock(pte_ptl);
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
- spin_unlock(pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ if (pte) {
+ result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ &compound_pagelist);
+ spin_unlock(pte_ptl);
+ } else {
+ result = SCAN_PMD_NULL;
+ }
if (unlikely(result != SCAN_SUCCEED)) {
- pte_unmap(pte);
+ if (pte)
+ pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1253,6 +1271,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
@@ -1622,8 +1645,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* lockless_pages_from_mm() and the hardware page walker can access page
* tables while all the high-level locks are held in write mode.
*/
- start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
+ start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+ if (!start_pte)
+ goto drop_immap;
/* step 1: check all mapped PTEs are to the right huge page */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1697,6 +1722,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+drop_immap:
i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
--
2.35.3
copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
but with a comment on some further assumptions that are being made there.
zap_pte_range() and zap_pmd_range(): adjust their interaction so that
a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
Allow pte_offset_map_lock() to fail in many functions. Update comment
on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
and copy_pmd_range(), those do simplify the next level down.
Signed-off-by: Hugh Dickins <[email protected]>
---
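Not part of the patch: the new zap_pmd_range()/zap_pte_range() interaction
in miniature, with invented sketch_* names. The pte level returns the
address it actually reached; if that falls short of next (because
pte_offset_map_lock() failed, or a forced TLB flush stopped the scan
early), the pmd level steps pmd back and covers the same range again.

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch only: return how far we actually got */
static unsigned long sketch_zap_pte_level(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *start_pte, *pte;

	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return addr;	/* no progress: caller re-examines this pmd */
	do {
		/* ... zap *pte (the real code may stop early to flush) ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(start_pte, ptl);
	return addr;
}

/* Sketch only: addr != next means "do this pmd entry over again" */
static unsigned long sketch_zap_pmd_level(struct mm_struct *mm, pud_t *pud,
					  unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd) || pmd_leaf(*pmd)) {
			/* the real code splits or zaps huge pmds here */
			addr = next;
			continue;
		}
		addr = sketch_zap_pte_level(mm, pmd, addr, next);
		if (addr != next)
			pmd--;	/* cancel the pmd++ below: same entry again */
	} while (pmd++, cond_resched(), addr != end);

	return addr;
}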
mm/memory.c | 172 +++++++++++++++++++++++++---------------------------
1 file changed, 82 insertions(+), 90 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2eb54c0d5d3c..c7b920291a72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1012,13 +1012,25 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
progress = 0;
init_rss_vec(rss);
+ /*
+ * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
+ * error handling here, assume that exclusive mmap_lock on dst and src
+ * protects anon from unexpected THP transitions; with shmem and file
+ * protected by mmap_lock-less collapse skipping areas with anon_vma
+ * (whereas vma_needs_copy() skips areas without anon_vma). A rework
+ * can remove such assumptions later, but this is good enough for now.
+ */
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte) {
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map(src_pmd, addr);
- src_ptl = pte_lockptr(src_mm, src_pmd);
+ src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+ if (!src_pte) {
+ pte_unmap_unlock(dst_pte, dst_ptl);
+ /* ret == 0 */
+ goto out;
+ }
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
@@ -1083,8 +1095,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
- spin_unlock(src_ptl);
- pte_unmap(orig_src_pte);
+ pte_unmap_unlock(orig_src_pte, src_ptl);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -1388,10 +1399,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
swp_entry_t entry;
tlb_change_page_size(tlb, PAGE_SIZE);
-again:
init_rss_vec(rss);
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return addr;
+
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
@@ -1507,17 +1519,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
* If we forced a TLB flush (either due to running out of
* batch buffers or because we needed to flush dirty TLB
* entries before releasing the ptl), free the batched
- * memory too. Restart if we didn't do everything.
+ * memory too. Come back again if we didn't do everything.
*/
- if (force_flush) {
- force_flush = 0;
+ if (force_flush)
tlb_flush_mmu(tlb);
- }
-
- if (addr != end) {
- cond_resched();
- goto again;
- }
return addr;
}
@@ -1536,8 +1541,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
- else if (zap_huge_pmd(tlb, vma, pmd, addr))
- goto next;
+ else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+ addr = next;
+ continue;
+ }
/* fall through */
} else if (details && details->single_folio &&
folio_test_pmd_mappable(details->single_folio) &&
@@ -1550,20 +1557,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
*/
spin_unlock(ptl);
}
-
- /*
- * Here there can be other concurrent MADV_DONTNEED or
- * trans huge page faults running, and if the pmd is
- * none or trans huge it can change under us. This is
- * because MADV_DONTNEED holds the mmap_lock in read
- * mode.
- */
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- goto next;
- next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-next:
- cond_resched();
- } while (pmd++, addr = next, addr != end);
+ if (pmd_none(*pmd)) {
+ addr = next;
+ continue;
+ }
+ addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
+ if (addr != next)
+ pmd--;
+ } while (pmd++, cond_resched(), addr != end);
return addr;
}
@@ -1905,6 +1906,10 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+ if (!start_pte) {
+ ret = -EFAULT;
+ goto out;
+ }
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
@@ -2572,10 +2577,10 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = (mm == &init_mm) ?
pte_offset_kernel(pmd, addr) :
pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EINVAL;
}
- BUG_ON(pmd_huge(*pmd));
-
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2804,7 +2809,6 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
int ret;
void *kaddr;
void __user *uaddr;
- bool locked = false;
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = vmf->address;
@@ -2830,12 +2834,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
+ vmf->pte = NULL;
if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/*
* Other thread has already handled the fault
* and update local tlb only
@@ -2857,13 +2861,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* zeroes.
*/
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
- if (locked)
+ if (vmf->pte)
goto warn;
/* Re-validate under PTL if the page is still mapped */
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
@@ -2888,7 +2891,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
ret = 0;
pte_unlock:
- if (locked)
+ if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
@@ -3110,7 +3113,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Re-check the pte - we dropped the lock
*/
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_folio) {
if (!folio_test_anon(old_folio)) {
dec_mm_counter(mm, mm_counter_file(&old_folio->page));
@@ -3178,19 +3181,20 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/* Free the old page.. */
new_folio = old_folio;
page_copied = 1;
- } else {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ } else if (vmf->pte) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
}
- if (new_folio)
- folio_put(new_folio);
-
- pte_unmap_unlock(vmf->pte, vmf->ptl);
/*
* No need to double call mmu_notifier->invalidate_range() callback as
* the above ptep_clear_flush_notify() did already call it.
*/
mmu_notifier_invalidate_range_only_end(&range);
+
+ if (new_folio)
+ folio_put(new_folio);
if (old_folio) {
if (page_copied)
free_swap_cache(&old_folio->page);
@@ -3230,6 +3234,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;
/*
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
@@ -3591,10 +3597,11 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
folio_put(folio);
@@ -3625,6 +3632,8 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return 0;
/*
* Be careful so that we will only recover a special uffd-wp pte into a
* none pte. Otherwise it means the pte could have changed, so retry.
@@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
- spin_unlock(vmf->ptl);
- goto out;
- }
-
+ if (unlikely(!vmf->pte ||
+ !pte_same(*vmf->pte, vmf->orig_pte)))
+ goto unlock;
/*
* Get a page reference while we know the page can't be
* freed.
@@ -3807,7 +3814,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
goto unlock;
}
@@ -3877,7 +3884,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;
if (unlikely(!folio_test_uptodate(folio))) {
@@ -4003,13 +4010,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
if (si)
put_swap_device(si);
return ret;
out_nomap:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
folio_unlock(folio);
out_release:
@@ -4041,22 +4050,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
/*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
+ * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
+ * be distinguished from a transient failure of pte_offset_map().
*/
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;
- /* See comment in handle_pte_fault() */
- if (unlikely(pmd_trans_unstable(vmf->pmd)))
- return 0;
-
/* Use the zero-page for reads */
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -4064,6 +4063,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ goto unlock;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -4104,6 +4105,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ goto release;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
@@ -4131,7 +4134,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
folio_put(folio);
@@ -4380,15 +4384,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
- /*
- * See comment in handle_pte_fault() for how this scenario happens, we
- * need to return NOPAGE so that we drop this page.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;
/* Re-check under ptl */
if (likely(!vmf_pte_changed(vmf))) {
@@ -4630,17 +4629,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
if (!vma->vm_ops->fault) {
- /*
- * If we find a migration pmd entry or a none pmd entry, which
- * should never happen, return SIGBUS
- */
- if (unlikely(!pmd_present(*vmf->pmd)))
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
ret = VM_FAULT_SIGBUS;
else {
- vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
- vmf->pmd,
- vmf->address,
- &vmf->ptl);
/*
* Make sure this is not a temporary clearing of pte
* by holding ptl and checking again. A R/M/W update
@@ -5429,10 +5422,9 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
VM_BUG_ON(pmd_trans_huge(*pmd));
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- goto out;
-
ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+ if (!ptep)
+ goto out;
if (!pte_present(*ptep))
goto unlock;
*ptepp = ptep;
--
2.35.3
hmm_vma_walk_pmd() is called through mm_walk, but already has a goto
again loop of its own, so take part in that if pte_offset_map() fails.
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
mm/hmm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/hmm.c b/mm/hmm.c
index e23043345615..b1a9159d7c92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -381,6 +381,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
}
ptep = pte_offset_map(pmdp, addr);
+ if (!ptep)
+ goto again;
for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
int r;
--
2.35.3
No functional change here, but adjust the format of map_pte() so that the
following commit will be easier to read: separate out the PVMW_SYNC case
first, and remove two levels of indentation from the ZONE_DEVICE case.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 65 +++++++++++++++++++++++---------------------
1 file changed, 34 insertions(+), 31 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 007dc7456f0e..947dc7491815 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -15,38 +15,41 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
- pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
- if (!(pvmw->flags & PVMW_SYNC)) {
- if (pvmw->flags & PVMW_MIGRATION) {
- if (!is_swap_pte(*pvmw->pte))
- return false;
- } else {
- /*
- * We get here when we are trying to unmap a private
- * device page from the process address space. Such
- * page is not CPU accessible and thus is mapped as
- * a special swap entry, nonetheless it still does
- * count as a valid regular mapping for the page (and
- * is accounted as such in page maps count).
- *
- * So handle this special case as if it was a normal
- * page mapping ie lock CPU page table and returns
- * true.
- *
- * For more details on device private memory see HMM
- * (include/linux/hmm.h or mm/hmm.c).
- */
- if (is_swap_pte(*pvmw->pte)) {
- swp_entry_t entry;
+ if (pvmw->flags & PVMW_SYNC) {
+ /* Use the stricter lookup */
+ pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, &pvmw->ptl);
+ return true;
+ }
- /* Handle un-addressable ZONE_DEVICE memory */
- entry = pte_to_swp_entry(*pvmw->pte);
- if (!is_device_private_entry(entry) &&
- !is_device_exclusive_entry(entry))
- return false;
- } else if (!pte_present(*pvmw->pte))
- return false;
- }
+ pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ if (pvmw->flags & PVMW_MIGRATION) {
+ if (!is_swap_pte(*pvmw->pte))
+ return false;
+ } else if (is_swap_pte(*pvmw->pte)) {
+ swp_entry_t entry;
+ /*
+ * Handle un-addressable ZONE_DEVICE memory.
+ *
+ * We get here when we are trying to unmap a private
+ * device page from the process address space. Such
+ * page is not CPU accessible and thus is mapped as
+ * a special swap entry, nonetheless it still does
+ * count as a valid regular mapping for the page
+ * (and is accounted as such in page maps count).
+ *
+ * So handle this special case as if it was a normal
+ * page mapping ie lock CPU page table and return true.
+ *
+ * For more details on device private memory see HMM
+ * (include/linux/hmm.h or mm/hmm.c).
+ */
+ entry = pte_to_swp_entry(*pvmw->pte);
+ if (!is_device_private_entry(entry) &&
+ !is_device_exclusive_entry(entry))
+ return false;
+ } else if (!pte_present(*pvmw->pte)) {
+ return false;
}
pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
spin_lock(pvmw->ptl);
--
2.35.3
Following the examples of nearby code, various functions can just give
up if pte_offset_map() or pte_offset_map_lock() fails. And there's no
need for a preliminary pmd_trans_unstable() or other such check, since
such cases are now safely handled inside.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 9 ++++++---
mm/ksm.c | 7 ++++---
mm/memcontrol.c | 8 ++++----
mm/memory-failure.c | 8 +++++---
mm/migrate.c | 3 +++
5 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 3bd5d3854c51..bb67193c5460 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -544,10 +544,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
- if (unlikely(pmd_bad(*pmd)))
- return no_page_table(vma, flags);
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return no_page_table(vma, flags);
pte = *ptep;
if (!pte_present(pte))
goto no_page;
@@ -851,8 +851,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return -EFAULT;
- VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return -EFAULT;
if (pte_none(*pte))
goto unmap;
*vma = get_gate_vma(mm);
@@ -2377,6 +2378,8 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
pte_t *ptep, *ptem;
ptem = ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ return 0;
do {
pte_t pte = ptep_get_lockless(ptep);
struct page *page;
diff --git a/mm/ksm.c b/mm/ksm.c
index df2aa281d49d..3dc15459dd20 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -431,10 +431,9 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
pte_t *pte;
int ret;
- if (pmd_leaf(*pmd) || !pmd_present(*pmd))
- return 0;
-
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
if (pte_present(*pte)) {
page = vm_normal_page(walk->vma, addr, *pte);
} else if (!pte_none(*pte)) {
@@ -1203,6 +1202,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
mmu_notifier_invalidate_range_start(&range);
ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out_mn;
if (!pte_same(*ptep, orig_pte)) {
pte_unmap_unlock(ptep, ptl);
goto out_mn;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..fdd953655fe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6057,9 +6057,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; pte++, addr += PAGE_SIZE)
if (get_mctgt_type(vma, addr, *pte, NULL))
mc.precharge++; /* increment precharge temporarily */
@@ -6277,10 +6277,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
bool device = false;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..b3cc8f213fe3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -414,6 +414,8 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
if (pmd_devmap(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return 0;
if (pte_present(*pte) && pte_devmap(*pte))
ret = PAGE_SHIFT;
pte_unmap(pte);
@@ -800,11 +802,11 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
goto out;
}
- if (pmd_trans_unstable(pmdp))
- goto out;
-
mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
addr, &ptl);
+ if (!ptep)
+ goto out;
+
for (; addr != end; ptep++, addr += PAGE_SIZE) {
ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
hwp->pfn, &hwp->tk);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3ecb7a40075f..308a56f0b156 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
swp_entry_t entry;
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return;
+
pte = *ptep;
pte_unmap(ptep);
--
2.35.3
migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
splitting huge zero pmd, and the pmd_none() handling after successfully
splitting huge page: those are now managed inside pte_offset_map_lock(),
and by "goto again" when it fails.
But the skip after unsuccessful split_huge_page() must stay: it avoids an
endless loop. The skip when pmd_bad()? Remove that: it will be treated
as a hole rather than a skip once cleared by pte_offset_map_lock(), but
with different timing that would be so anyway; and it's arguably best to
leave the pmd_bad() handling centralized there.
migrate_vma_insert_page(): remove comment on the old pte_offset_map()
and old locking limitations; remove the pmd_trans_unstable() check and
just proceed to pte_offset_map_lock(), aborting when it fails (page has
been charged to memcg, but as in other cases, it's uncharged when freed).
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
mm/migrate_device.c | 31 ++++---------------------------
1 file changed, 4 insertions(+), 27 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..a14af6b12b04 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmdp, addr);
- if (pmd_trans_unstable(pmdp))
- return migrate_vma_collect_skip(start, end,
- walk);
} else {
int ret;
@@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (ret)
return migrate_vma_collect_skip(start, end,
walk);
- if (pmd_none(*pmdp))
- return migrate_vma_collect_hole(start, end, -1,
- walk);
}
}
- if (unlikely(pmd_bad(*pmdp)))
- return migrate_vma_collect_skip(start, end, walk);
-
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
arch_enter_lazy_mmu_mode();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
-
if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
goto abort;
-
- /*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
- */
if (pte_alloc(mm, pmdp))
goto abort;
-
- /* See the comment in pte_alloc_one_map() */
- if (unlikely(pmd_trans_unstable(pmdp)))
- goto abort;
-
if (unlikely(anon_vma_prepare(vma)))
goto abort;
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
@@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-
+ if (!ptep)
+ goto abort;
if (check_stable_address_space(mm))
goto unlock_abort;
--
2.35.3
Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.
With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.
Retain pud_trans_unstable(), which has one use in __handle_mm_fault(),
but delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad().
While there, move the default arch_needs_pgtable_deposit() definition
up near where pgtable_trans_huge_deposit() and withdraw() are declared.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 103 +++-------------------------------------
mm/khugepaged.c | 4 --
2 files changed, 7 insertions(+), 100 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3fabbb018557..a1326e61d7ee 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -599,6 +599,10 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* This is an implementation of pmdp_establish() that is only suitable for an
@@ -1300,9 +1304,10 @@ static inline int pud_trans_huge(pud_t pud)
}
#endif
-/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
-static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+static inline int pud_trans_unstable(pud_t *pud)
{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
@@ -1311,104 +1316,10 @@ static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
pud_clear_bad(pud);
return 1;
}
- return 0;
-}
-
-/* See pmd_trans_unstable for discussion. */
-static inline int pud_trans_unstable(pud_t *pud)
-{
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
-#else
- return 0;
#endif
-}
-
-#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
-#endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_lock held in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_lock in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_lock is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * !pmd_present() checks for pmd migration entries
- *
- * The complete check uses is_pmd_migration_entry() in linux/swapops.h
- * But using that requires moving current function and pmd_trans_unstable()
- * to linux/swapops.h to resolve dependency, which is too much code move.
- *
- * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
- * because !pmd_present() pages can only be under migration not swapped
- * out.
- *
- * pmd_none() is preserved for future condition checks on pmd migration
- * entries and not confusing with this function name, although it is
- * redundant with !pmd_present().
- */
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
- (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
- return 1;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
return 0;
}
-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_pmd returns (because it may have run when the pmd become
- * null, but then a page fault can map in a THP and not a regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
- return 0;
-#endif
-}
-
-/*
- * the ordering of these checks is important for pmds with _page_devmap set.
- * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c11db2e78e95..1083f0e38a07 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -946,10 +946,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
return SCAN_SUCCEED;
}
-/*
- * See pmd_trans_unstable() for how the result may change out from
- * underneath us, even if we hold mmap_lock in read.
- */
static int find_pmd_or_thp_or_none(struct mm_struct *mm,
unsigned long address,
pmd_t **pmd)
--
2.35.3
Adjust unuse_pte() and unuse_pte_range() to allow pte_offset_map_lock()
and pte_offset_map() failure; remove pmd_none_or_trans_huge_or_clear_bad()
from unuse_pmd_range() now that pte_offset_map() does all that itself.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swapfile.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 274bbf797480..12d204e6dae2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1774,7 +1774,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
hwposioned = true;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
+ if (unlikely(!pte || !pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
ret = 0;
goto out;
}
@@ -1827,7 +1827,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, new_pte);
swap_free(entry);
out:
- pte_unmap_unlock(pte, ptl);
+ if (pte)
+ pte_unmap_unlock(pte, ptl);
if (page != swapcache) {
unlock_page(page);
put_page(page);
@@ -1839,17 +1840,22 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned int type)
{
- swp_entry_t entry;
- pte_t *pte;
+ pte_t *pte = NULL;
struct swap_info_struct *si;
- int ret = 0;
si = swap_info[type];
- pte = pte_offset_map(pmd, addr);
do {
struct folio *folio;
unsigned long offset;
unsigned char swp_count;
+ swp_entry_t entry;
+ int ret;
+
+ if (!pte++) {
+ pte = pte_offset_map(pmd, addr);
+ if (!pte)
+ break;
+ }
if (!is_swap_pte(*pte))
continue;
@@ -1860,6 +1866,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
offset = swp_offset(entry);
pte_unmap(pte);
+ pte = NULL;
+
folio = swap_cache_get_folio(entry, vma, addr);
if (!folio) {
struct page *page;
@@ -1878,8 +1886,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (!folio) {
swp_count = READ_ONCE(si->swap_map[offset]);
if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
- goto try_next;
-
+ continue;
return -ENOMEM;
}
@@ -1889,20 +1896,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (ret < 0) {
folio_unlock(folio);
folio_put(folio);
- goto out;
+ return ret;
}
folio_free_swap(folio);
folio_unlock(folio);
folio_put(folio);
-try_next:
- pte = pte_offset_map(pmd, addr);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ } while (addr += PAGE_SIZE, addr != end);
- ret = 0;
-out:
- return ret;
+ if (pte)
+ pte_unmap(pte);
+ return 0;
}
static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -1917,8 +1921,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
do {
cond_resched();
next = pmd_addr_end(addr, end);
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- continue;
ret = unuse_pte_range(vma, pmd, addr, next, type);
if (ret)
return ret;
--
2.35.3
Failures here would be surprising: pte_advanced_tests(),
pte_clear_tests() and __page_table_check_pte_clear_range() each
issue a warning if pte_offset_map() or pte_offset_map_lock() fails.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/debug_vm_pgtable.c | 9 ++++++++-
mm/page_table_check.c | 2 ++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..ee119e33fef1 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -138,6 +138,9 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
return;
pr_debug("Validating PTE advanced\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
pte = pfn_pte(args->pte_pfn, args->page_prot);
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
@@ -619,6 +622,9 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
* the unexpected overhead of cache flushing is acceptable.
*/
pr_debug("Validating PTE clear\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
#ifndef CONFIG_RISCV
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
#endif
@@ -1377,7 +1383,8 @@ static int __init debug_vm_pgtable(void)
args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
pte_clear_tests(&args);
pte_advanced_tests(&args);
- pte_unmap_unlock(args.ptep, ptl);
+ if (args.ptep)
+ pte_unmap_unlock(args.ptep, ptl);
ptl = pmd_lock(args.mm, args.pmdp);
pmd_clear_tests(&args);
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index f2baf97d5f38..b743a2f6bce0 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -246,6 +246,8 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
pte_t *ptep = pte_offset_map(&pmd, addr);
unsigned long i;
+ if (WARN_ON(!ptep))
+ return;
for (i = 0; i < PTRS_PER_PTE; i++) {
__page_table_check_pte_clear(mm, addr, *ptep);
addr += PAGE_SIZE;
--
2.35.3
swap_vma_readahead() has been proceeding in an unconventional way, its
preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
then relying on that pte pointer even after the pte_unmap() - in its
CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
copied ptes to stack while they were mapped, but had to limit how many).
Though it would be difficult to construct a failing testcase, accessing
page table after pte_unmap() will become bad practice, even on 64-bit:
an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
delete all reference to ptes from swap_ra_info(), use pte_offset_map()
repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
(Will the repeated "map" and "unmap" show up as a slowdown anywhere?
If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
only when it does not find the page already in the swapcache.)
Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
the address passed down to each call of __read_swap_cache_async().
Signed-off-by: Hugh Dickins <[email protected]>
---
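Not part of the patch: an illustrative sketch of the per-pte lookup which
swap_vma_readahead() switches to, with invented names, and assuming (as
swap_ra_info() ensures by clamping to the pmd) that the window never
crosses a page-table boundary: map on demand, read the entry with
ptep_get_lockless(), and unmap before anything that might sleep.

#include <linux/mm.h>
#include <linux/swapops.h>

/* Sketch only: walk a readahead window with on-demand pte mapping */
static void sketch_ra_window(pmd_t *pmd, unsigned long start, unsigned int nr)
{
	unsigned long addr = start;
	pte_t *pte = NULL;
	unsigned int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		pte_t pentry;
		swp_entry_t entry;

		if (!pte++) {
			pte = pte_offset_map(pmd, addr);
			if (!pte)
				break;	/* page table gone: stop early */
		}
		pentry = ptep_get_lockless(pte);   /* READ_ONCE of the entry */
		if (!is_swap_pte(pentry))
			continue;
		entry = pte_to_swp_entry(pentry);
		if (non_swap_entry(entry))
			continue;
		pte_unmap(pte);
		pte = NULL;
		/* ... start the asynchronous swapcache read for addr ... */
	}
	if (pte)
		pte_unmap(pte);
}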
include/linux/swap.h | 19 -------------------
mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
2 files changed, 24 insertions(+), 40 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3c69cb653cb9..1b9f2d92fc10 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -337,25 +337,6 @@ struct swap_info_struct {
*/
};
-#ifdef CONFIG_64BIT
-#define SWAP_RA_ORDER_CEILING 5
-#else
-/* Avoid stack overflow, because we need to save part of page table */
-#define SWAP_RA_ORDER_CEILING 3
-#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
-#endif
-
-struct vma_swap_readahead {
- unsigned short win;
- unsigned short offset;
- unsigned short nr_pte;
-#ifdef CONFIG_64BIT
- pte_t *ptes;
-#else
- pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
-#endif
-};
-
static inline swp_entry_t folio_swap_entry(struct folio *folio)
{
swp_entry_t entry = { .val = page_private(&folio->page) };
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..a43b41975da2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,6 +698,14 @@ void exit_swap_address_space(unsigned int type)
swapper_spaces[type] = NULL;
}
+#define SWAP_RA_ORDER_CEILING 5
+
+struct vma_swap_readahead {
+ unsigned short win;
+ unsigned short offset;
+ unsigned short nr_pte;
+};
+
static void swap_ra_info(struct vm_fault *vmf,
struct vma_swap_readahead *ra_info)
{
@@ -705,11 +713,7 @@ static void swap_ra_info(struct vm_fault *vmf,
unsigned long ra_val;
unsigned long faddr, pfn, fpfn, lpfn, rpfn;
unsigned long start, end;
- pte_t *pte, *orig_pte;
unsigned int max_win, hits, prev_win, win;
-#ifndef CONFIG_64BIT
- pte_t *tpte;
-#endif
max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
SWAP_RA_ORDER_CEILING);
@@ -728,12 +732,9 @@ static void swap_ra_info(struct vm_fault *vmf,
max_win, prev_win);
atomic_long_set(&vma->swap_readahead_info,
SWAP_RA_VAL(faddr, win, 0));
-
if (win == 1)
return;
- /* Copy the PTEs because the page table may be unmapped */
- orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
if (fpfn == pfn + 1) {
lpfn = fpfn;
rpfn = fpfn + win;
@@ -753,15 +754,6 @@ static void swap_ra_info(struct vm_fault *vmf,
ra_info->nr_pte = end - start;
ra_info->offset = fpfn - start;
- pte -= ra_info->offset;
-#ifdef CONFIG_64BIT
- ra_info->ptes = pte;
-#else
- tpte = ra_info->ptes;
- for (pfn = start; pfn != end; pfn++)
- *tpte++ = *pte++;
-#endif
- pte_unmap(orig_pte);
}
/**
@@ -785,7 +777,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
struct swap_iocb *splug = NULL;
struct vm_area_struct *vma = vmf->vma;
struct page *page;
- pte_t *pte, pentry;
+ pte_t *pte = NULL, pentry;
+ unsigned long addr;
swp_entry_t entry;
unsigned int i;
bool page_allocated;
@@ -797,17 +790,25 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
if (ra_info.win == 1)
goto skip;
+ addr = vmf->address - (ra_info.offset * PAGE_SIZE);
+
blk_start_plug(&plug);
- for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
- i++, pte++) {
- pentry = *pte;
+ for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) {
+ if (!pte++) {
+ pte = pte_offset_map(vmf->pmd, addr);
+ if (!pte)
+ break;
+ }
+ pentry = ptep_get_lockless(pte);
if (!is_swap_pte(pentry))
continue;
entry = pte_to_swp_entry(pentry);
if (unlikely(non_swap_entry(entry)))
continue;
+ pte_unmap(pte);
+ pte = NULL;
page = __read_swap_cache_async(entry, gfp_mask, vma,
- vmf->address, &page_allocated);
+ addr, &page_allocated);
if (!page)
continue;
if (page_allocated) {
@@ -819,6 +820,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
}
put_page(page);
}
+ if (pte)
+ pte_unmap(pte);
blk_finish_plug(&plug);
swap_read_unplug(splug);
lru_add_drain();
--
2.35.3
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Signed-off-by: Hugh Dickins <[email protected]>
---
This is a perf patch, not an mm patch, and it will want to go in through
the tip tree in due course; but keep it in this series for now, so that
it's not missed, and not submitted before mm review.
kernel/events/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db016e418931..174be710f3b3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7490,6 +7490,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pud_leaf_size(pud);
pmdp = pmd_offset_lockless(pudp, pud, addr);
+again:
pmd = pmdp_get_lockless(pmdp);
if (!pmd_present(pmd))
return 0;
@@ -7498,6 +7499,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pmd_leaf_size(pmd);
ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ goto again;
+
pte = ptep_get_lockless(ptep);
if (pte_present(pte))
size = pte_leaf_size(pte);
--
2.35.3
__split_huge_zero_page_pmd() use a single pte_offset_map() to sweep the
extent: it's already under pmd_lock(), so this is no worse for latency;
and since it's supposed to have full control of the just-withdrawn page
table, here choose to VM_BUG_ON if it were to fail. And please don't
increment haddr by PAGE_SIZE, that should remain huge aligned: declare
a separate addr (not a bugfix, but it was deceptive).
__split_huge_pmd_locked() likewise (but it had declared a separate addr);
and change its BUG_ON(!pte_none) to VM_BUG_ON, for consistency with zero
(those deposited page tables are sometimes victims of random corruption).
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
---
mm/huge_memory.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4bd5fa7c823..839c13fa0bbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2037,6 +2037,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
pmd_t _pmd, old_pmd;
+ unsigned long addr;
+ pte_t *pte;
int i;
/*
@@ -2052,17 +2054,20 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ pte_t entry;
+
+ entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
entry = pte_mkspecial(entry);
if (pmd_uffd_wp(old_pmd))
entry = pte_mkuffd_wp(entry);
- pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ set_pte_at(mm, addr, pte, entry);
+ pte++;
}
+ pte_unmap(pte - 1);
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
}
@@ -2077,6 +2082,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
bool anon_exclusive = false, dirty = false;
unsigned long addr;
+ pte_t *pte;
int i;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2205,8 +2211,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry, *pte;
+ pte_t entry;
/*
* Note that NUMA hinting access restrictions are not
* transferred to avoid any possibility of altering
@@ -2249,11 +2257,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_mkuffd_wp(entry);
page_add_anon_rmap(page + i, vma, addr, false);
}
- pte = pte_offset_map(&_pmd, addr);
- BUG_ON(!pte_none(*pte));
+ VM_BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- pte_unmap(pte);
+ pte++;
}
+ pte_unmap(pte - 1);
if (!pmd_migration)
page_remove_rmap(page, vma, true);
--
2.35.3
handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
corresponds to vmf.pte, instead of pte_lockptr() being used later, when
there's a chance that the pmd entry might have changed, perhaps to none,
or to a huge pmd, with no split ptlock in its struct page.
Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
will handle that case by failing. Update the "morph" comment above,
looking forward to when shmem or file collapse to THP may not take
mmap_lock for write (or not at all).
do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
refresh it when refreshing vmf->pte.
do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
so adjust that to set vmf->ptl by pte_offset_map_nolock().
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 6 ++++--
mm/memory.c | 38 +++++++++++++-------------------------
2 files changed, 17 insertions(+), 27 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 49cfa7cdfe93..c11db2e78e95 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
int result;
pte_t *pte = NULL;
+ spinlock_t *ptl;
for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
@@ -1016,7 +1017,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};
if (!pte++) {
- pte = pte_offset_map(pmd, address);
+ pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
@@ -1024,11 +1025,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
}
- vmf.orig_pte = *pte;
+ vmf.orig_pte = ptep_get_lockless(pte);
if (!is_swap_pte(vmf.orig_pte))
continue;
vmf.pte = pte;
+ vmf.ptl = ptl;
ret = do_swap_page(&vmf);
/* Which unmaps pte (after perhaps re-checking the entry) */
pte = NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c7b920291a72..4ec46eecefd3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,10 +2786,9 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(ptl);
+ spin_lock(vmf->ptl);
same = pte_same(*vmf->pte, vmf->orig_pte);
- spin_unlock(ptl);
+ spin_unlock(vmf->ptl);
}
#endif
pte_unmap(vmf->pte);
@@ -4696,7 +4695,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4767,8 +4765,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
flags |= TNF_MIGRATED;
} else {
flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- spin_lock(vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ goto out;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
vmf->pte = NULL;
vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
} else {
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
- * of pmd_trans_huge() to ensure the pmd didn't become
- * pmd_trans_huge under us and then back to pmd_none, as a
- * result of MADV_DONTNEED running immediately after a huge pmd
- * fault in a different thread of this mm, in turn leading to a
- * misleading pmd_trans_huge() retval. All we have to ensure is
- * that it is a regular pmd that we can walk with
- * pte_offset_map() and we can do that through an atomic read
- * in C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return 0;
/*
* A regular pmd is established and it can't morph into a huge
- * pmd from under us anymore at this point because we hold the
- * mmap_lock read mode and khugepaged takes it in write mode.
- * So now it's safe to run pte_offset_map().
+ * pmd by anon khugepaged, since that takes mmap_lock in write
+ * mode; but shmem or file collapse to THP could still morph
+ * it into a huge pmd: just retry later if so.
*/
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
@@ -4936,7 +4925,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
--
2.35.3
MGLRU's walk_pte_range() use the safer pte_offset_map_nolock(), rather
than pte_lockptr(), to get the ptl for its trylock. Just return false
and move on to next extent if it fails, like when the trylock fails.
Remove the VM_WARN_ON_ONCE(pmd_leaf) since that will happen, rarely.
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Yu Zhao <[email protected]>
---
mm/vmscan.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d0cd2840cf0..6a9bb6b30dc8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3993,15 +3993,15 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
- VM_WARN_ON_ONCE(pmd_leaf(*pmd));
-
- ptl = pte_lockptr(args->mm, pmd);
- if (!spin_trylock(ptl))
+ pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ if (!pte)
return false;
+ if (!spin_trylock(ptl)) {
+ pte_unmap(pte);
+ return false;
+ }
arch_enter_lazy_mmu_mode();
-
- pte = pte_offset_map(pmd, start & PMD_MASK);
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
@@ -4042,10 +4042,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
- pte_unmap(pte);
-
arch_leave_lazy_mmu_mode();
- spin_unlock(ptl);
+ pte_unmap_unlock(pte, ptl);
return suitable_to_scan(total, young);
}
--
2.35.3
On Thu, Jun 8, 2023 at 6:40 PM Hugh Dickins <[email protected]> wrote:
>
> There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> any instability, and here it's a good idea to replace any pmd_none(*pmd)
> by a page table a.s.a.p, in the huge_zero_page case as for a normal THP;
> and this removes an unnecessary possibility of -EBUSY failure.
>
> (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
>
> Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
> ---
> mm/gup.c | 19 ++++---------------
> 1 file changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index bb67193c5460..4ad50a59897f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> if (flags & FOLL_SPLIT_PMD) {
> - int ret;
> - page = pmd_page(*pmd);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - ret = 0;
> - split_huge_pmd(vma, pmd, address);
> - if (pmd_trans_unstable(pmd))
> - ret = -EBUSY;
> - } else {
> - spin_unlock(ptl);
> - split_huge_pmd(vma, pmd, address);
> - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> - }
> -
> - return ret ? ERR_PTR(ret) :
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, address);
> + /* If pmd was left empty, stuff a page table in there quickly */
> + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> page = follow_trans_huge_pmd(vma, address, pmd, flags);
> --
> 2.35.3
>
On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
> copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
> but with a comment on some further assumptions that are being made there.
>
> zap_pte_range() and zap_pmd_range(): adjust their interaction so that
> a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
> zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
>
> Allow pte_offset_map_lock() to fail in many functions. Update comment
> on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
> to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
> pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
> and copy_pmd_range(), those do simplify the next level down.
>
> ...
>
> @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->page = pfn_swap_entry_to_page(entry);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> - spin_unlock(vmf->ptl);
> - goto out;
> - }
> -
> + if (unlikely(!vmf->pte ||
> + !pte_same(*vmf->pte, vmf->orig_pte)))
> + goto unlock;
> /*
> * Get a page reference while we know the page can't be
> * freed.
This hunk falls afoul of
https://lkml.kernel.org/r/[email protected].
I did this:
@@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+ if (unlikely(!vmf->pte ||
+ !pte_same(*vmf->pte, vmf->orig_pte)))
goto unlock;
/*
On Fri, 9 Jun 2023, Andrew Morton wrote:
> On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> > copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
> > but with a comment on some further assumptions that are being made there.
> >
> > zap_pte_range() and zap_pmd_range(): adjust their interaction so that
> > a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
> > zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
> >
> > Allow pte_offset_map_lock() to fail in many functions. Update comment
> > on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
> > to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
> > pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
> > and copy_pmd_range(), those do simplify the next level down.
> >
> > ...
> >
> > @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->page = pfn_swap_entry_to_page(entry);
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > vmf->address, &vmf->ptl);
> > - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > - spin_unlock(vmf->ptl);
> > - goto out;
> > - }
> > -
> > + if (unlikely(!vmf->pte ||
> > + !pte_same(*vmf->pte, vmf->orig_pte)))
> > + goto unlock;
> > /*
> > * Get a page reference while we know the page can't be
> > * freed.
>
> This hunk falls afoul of
> https://lkml.kernel.org/r/[email protected].
>
> I did this:
>
> @@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
> vmf->page = pfn_swap_entry_to_page(entry);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
> + if (unlikely(!vmf->pte ||
> + !pte_same(*vmf->pte, vmf->orig_pte)))
> goto unlock;
>
> /*
Yes, that's exactly right: thanks, Andrew.
Hugh
Hi, Hugh,
Sorry for late reply.
Hugh Dickins <[email protected]> writes:
> swap_vma_readahead() has been proceeding in an unconventional way, its
> preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
> then relying on that pte pointer even after the pte_unmap() - in its
> CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
> copied ptes to stack while they were mapped, but had to limit how many).
>
> Though it would be difficult to construct a failing testcase, accessing
> page table after pte_unmap() will become bad practice, even on 64-bit:
> an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
>
> Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
> nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
> delete all reference to ptes from swap_ra_info(), use pte_offset_map()
> repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
>
> (Will the repeated "map" and "unmap" show up as a slowdown anywhere?
> If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
> only when it does not find the page already in the swapcache.)
>
> Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
> the address passed down to each call of __read_swap_cache_async().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> include/linux/swap.h | 19 -------------------
> mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
> 2 files changed, 24 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3c69cb653cb9..1b9f2d92fc10 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -337,25 +337,6 @@ struct swap_info_struct {
> */
> };
>
> -#ifdef CONFIG_64BIT
> -#define SWAP_RA_ORDER_CEILING 5
> -#else
> -/* Avoid stack overflow, because we need to save part of page table */
> -#define SWAP_RA_ORDER_CEILING 3
> -#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
> -#endif
> -
> -struct vma_swap_readahead {
> - unsigned short win;
> - unsigned short offset;
> - unsigned short nr_pte;
> -#ifdef CONFIG_64BIT
> - pte_t *ptes;
> -#else
> - pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
> -#endif
> -};
> -
> static inline swp_entry_t folio_swap_entry(struct folio *folio)
> {
> swp_entry_t entry = { .val = page_private(&folio->page) };
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b76a65ac28b3..a43b41975da2 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -698,6 +698,14 @@ void exit_swap_address_space(unsigned int type)
> swapper_spaces[type] = NULL;
> }
>
> +#define SWAP_RA_ORDER_CEILING 5
> +
> +struct vma_swap_readahead {
> + unsigned short win;
> + unsigned short offset;
> + unsigned short nr_pte;
> +};
> +
Because we don't deal with PTEs in struct vma_swap_readahead anymore, it
appears simpler to record addresses directly, for example,
struct vma_swap_readahead {
unsigned long start;
unsigned long end;
};
we can make ra_info.win the return value of swap_ra_info().
Anyway, this can be a separate cleanup patch based on this patch.
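For illustration, a rough sketch of that shape (hypothetical and untested,
not part of this series; swap_ra_info() is assumed here to fill start/end
and return win):

	win = swap_ra_info(vmf, &ra_info);
	if (win == 1)
		goto skip;

	for (addr = ra_info.start; addr != ra_info.end; addr += PAGE_SIZE) {
		/*
		 * Map, advance and unmap the pte, and handle the swap
		 * entry, exactly as in the loop in the patch above:
		 * only the loop bounds come from ra_info now.
		 */
	}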
For the patch itself, feel free to add,
Reviewed-by: "Huang, Ying" <[email protected]>
> static void swap_ra_info(struct vm_fault *vmf,
> struct vma_swap_readahead *ra_info)
> {
> @@ -705,11 +713,7 @@ static void swap_ra_info(struct vm_fault *vmf,
> unsigned long ra_val;
> unsigned long faddr, pfn, fpfn, lpfn, rpfn;
> unsigned long start, end;
> - pte_t *pte, *orig_pte;
> unsigned int max_win, hits, prev_win, win;
> -#ifndef CONFIG_64BIT
> - pte_t *tpte;
> -#endif
>
> max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
> SWAP_RA_ORDER_CEILING);
> @@ -728,12 +732,9 @@ static void swap_ra_info(struct vm_fault *vmf,
> max_win, prev_win);
> atomic_long_set(&vma->swap_readahead_info,
> SWAP_RA_VAL(faddr, win, 0));
> -
> if (win == 1)
> return;
>
> - /* Copy the PTEs because the page table may be unmapped */
> - orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> if (fpfn == pfn + 1) {
> lpfn = fpfn;
> rpfn = fpfn + win;
> @@ -753,15 +754,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>
> ra_info->nr_pte = end - start;
> ra_info->offset = fpfn - start;
> - pte -= ra_info->offset;
> -#ifdef CONFIG_64BIT
> - ra_info->ptes = pte;
> -#else
> - tpte = ra_info->ptes;
> - for (pfn = start; pfn != end; pfn++)
> - *tpte++ = *pte++;
> -#endif
> - pte_unmap(orig_pte);
> }
>
> /**
> @@ -785,7 +777,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> struct swap_iocb *splug = NULL;
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> - pte_t *pte, pentry;
> + pte_t *pte = NULL, pentry;
> + unsigned long addr;
> swp_entry_t entry;
> unsigned int i;
> bool page_allocated;
> @@ -797,17 +790,25 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> if (ra_info.win == 1)
> goto skip;
>
> + addr = vmf->address - (ra_info.offset * PAGE_SIZE);
> +
> blk_start_plug(&plug);
> - for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
> - i++, pte++) {
> - pentry = *pte;
> + for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) {
> + if (!pte++) {
> + pte = pte_offset_map(vmf->pmd, addr);
> + if (!pte)
> + break;
> + }
> + pentry = ptep_get_lockless(pte);
> if (!is_swap_pte(pentry))
> continue;
> entry = pte_to_swp_entry(pentry);
> if (unlikely(non_swap_entry(entry)))
> continue;
> + pte_unmap(pte);
> + pte = NULL;
> page = __read_swap_cache_async(entry, gfp_mask, vma,
> - vmf->address, &page_allocated);
> + addr, &page_allocated);
> if (!page)
> continue;
> if (page_allocated) {
> @@ -819,6 +820,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> }
> put_page(page);
> }
> + if (pte)
> + pte_unmap(pte);
> blk_finish_plug(&plug);
> swap_read_unplug(splug);
> lru_add_drain();
On 09/06/2023 21:11, Hugh Dickins wrote:
> On Fri, 9 Jun 2023, Andrew Morton wrote:
>> On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>>
>>> copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
>>> but with a comment on some further assumptions that are being made there.
>>>
>>> zap_pte_range() and zap_pmd_range(): adjust their interaction so that
>>> a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
>>> zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
>>>
>>> Allow pte_offset_map_lock() to fail in many functions. Update comment
>>> on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
>>> to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
>>> pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
>>> and copy_pmd_range(), those do simplify the next level down.
>>>
>>> ...
>>>
>>> @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> vmf->page = pfn_swap_entry_to_page(entry);
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> vmf->address, &vmf->ptl);
>>> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>>> - spin_unlock(vmf->ptl);
>>> - goto out;
>>> - }
>>> -
>>> + if (unlikely(!vmf->pte ||
>>> + !pte_same(*vmf->pte, vmf->orig_pte)))
>>> + goto unlock;
>>> /*
>>> * Get a page reference while we know the page can't be
>>> * freed.
>>
>> This hunk falls afoul of
>> https://lkml.kernel.org/r/[email protected].
>>
>> I did this:
>>
>> @@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
>> vmf->page = pfn_swap_entry_to_page(entry);
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> vmf->address, &vmf->ptl);
>> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
>> + if (unlikely(!vmf->pte ||
>> + !pte_same(*vmf->pte, vmf->orig_pte)))
>> goto unlock;
>>
>> /*
>
> Yes, that's exactly right: thanks, Andrew.
FWIW, I agree.
Thanks,
Ryan
>
> Hugh
On Mon, 12 Jun 2023, Huang, Ying wrote:
> Hi, Hugh,
>
> Sorry for late reply.
Never apologize to *me* for being "late" or "slow" or "unresponsive".
Thanks for looking, yes, it was indeed for this one that I particularly
added you to the Cc.
>
> Hugh Dickins <[email protected]> writes:
>
> > swap_vma_readahead() has been proceeding in an unconventional way, its
> > preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
> > then relying on that pte pointer even after the pte_unmap() - in its
> > CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
> > copied ptes to stack while they were mapped, but had to limit how many).
> >
> > Though it would be difficult to construct a failing testcase, accessing
> > page table after pte_unmap() will become bad practice, even on 64-bit:
> > an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
> >
> > Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
> > nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
> > delete all reference to ptes from swap_ra_info(), use pte_offset_map()
> > repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
> >
> > (Will the repeated "map" and "unmap" show up as a slowdown anywhere?
> > If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
> > only when it does not find the page already in the swapcache.)
> >
> > Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
> > the address passed down to each call of __read_swap_cache_async().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > include/linux/swap.h | 19 -------------------
> > mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
> > 2 files changed, 24 insertions(+), 40 deletions(-)
...
> Because we don't deal with PTEs in struct vma_swap_readahead anymore, it
> appears simpler to record addresses directly, for example,
>
> struct vma_swap_readahead {
> unsigned long start;
> unsigned long end;
> };
>
> we can make ra_info.win to be the return value of swap_ra_info().
>
> Anyway, this can be a separate cleanup patch based on this patch.
Ooh, that would have required me to think, rather than just delete
lines. Mmm, if you see a cleaner way forward, yes, please do add
some cleanup on top.
>
> For the patch itself, feel free to add,
>
> Reviewed-by: "Huang, Ying" <[email protected]>
Great, thanks a lot.
Hugh
__wp_page_copy_user() was liable to call update_mmu_tlb() with NULL
vmf->pte in two places: not a problem today, but could become a problem
later when pte_offset_map_lock() fails.
Signed-off-by: Hugh Dickins <[email protected]>
---
Andrew, please add this as a fix patch for later merge into my
"mm/memory: allow" patch in mm-unstable: it's something noticed while
researching the bug Nathan reported, but not so serious - thanks.
mm/memory.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4ec46eecefd3..cdadcff5ab26 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2843,7 +2843,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* Other thread has already handled the fault
* and update local tlb only
*/
- update_mmu_tlb(vma, addr, vmf->pte);
+ if (vmf->pte)
+ update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
goto pte_unlock;
}
@@ -2867,7 +2868,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
- update_mmu_tlb(vma, addr, vmf->pte);
+ if (vmf->pte)
+ update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
goto pte_unlock;
}
--
2.35.3
Delete a triply out-of-date comment from add_swap_count_continuation():
1. vmalloc_to_page() changed from pte_offset_map() to pte_offset_kernel()
2. pte_offset_map() changed from using kmap_atomic() to kmap_local_page()
3. kmap_atomic() changed from using fixed FIX_KMAP addresses in 2.6.37.
Signed-off-by: Hugh Dickins <[email protected]>
---
Here's a late "33/32" to the series just moved to mm-stable - thank you!
mm/swapfile.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12d204e6dae2..0a17d85b50cb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3470,11 +3470,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
goto out;
}
- /*
- * We are fortunate that although vmalloc_to_page uses pte_offset_map,
- * no architecture is using highmem pages for kernel page tables: so it
- * will not corrupt the GFP_ATOMIC caller's atomic page table kmaps.
- */
head = vmalloc_to_page(si->swap_map + offset);
offset &= ~PAGE_MASK;
--
2.35.3
On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
> vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
> but it's intended for userspace page tables: prefer pte_offset_kernel().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Reviewed-by: Lorenzo Stoakes <[email protected]>
Currently Linus' tree is reliably failing to boot on pine64plus, an
arm64 SBC. Most other boards seem fine, though I am seeing some
additional instability on Tritium which is another Allwinner platform,
I've not dug into that yet and Tritium is generally less stable.
We end up seeing NULL or otherwise bad pointer dereferences; the
specific error does vary a bit, though it mostly appears to be in the
pinctrl code. A bisect (full log below) identified this patch as
introducing the failure. Nothing is jumping out at me about the patch,
and it's not affecting everything, so I'd not be surprised if it's just
uncovering some bug in the platform support, but I'm not super familiar
with the code.
Sample backtrace:
[ 1.919725] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 1.928551] Mem abort info:
[ 1.931359] ESR = 0x0000000096000044
...
[ 1.968870] [0000000000000000] user address but active_mm is swapper
...
[ 2.093969] Call trace:
[ 2.096414] dt_remember_or_free_map+0xc8/0x120
[ 2.100949] pinctrl_dt_to_map+0x23c/0x364
[ 2.105050] create_pinctrl+0x68/0x3ec
[ 2.108803] pinctrl_get+0xb0/0x124
[ 2.112294] devm_pinctrl_get+0x48/0x90
[ 2.116133] pinctrl_bind_pins+0x58/0x158
[ 2.120148] really_probe+0x54/0x2b0
[ 2.123724] __driver_probe_device+0x78/0x12c
Another common theme is the same but with an address like 0x4c and:
[ 2.098328] __kmem_cache_alloc_node+0x1bc/0x2dc
[ 2.102947] kmalloc_trace+0x20/0x2c
[ 2.106524] pinctrl_register_mappings+0x98/0x178
Full boot log from a failure:
https://lava.sirena.org.uk/scheduler/job/712456
git bisect start
# bad: [06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5] Linux 6.5-rc1
git bisect bad 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
# good: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
git bisect good 6995e2de6891c724bfeb2db33d7b87775f913ad1
# bad: [1b722407a13b7f8658d2e26917791f32805980a2] Merge tag 'drm-next-2023-06-29' of git://anongit.freedesktop.org/drm/drm
git bisect bad 1b722407a13b7f8658d2e26917791f32805980a2
# bad: [3a8a670eeeaa40d87bd38a587438952741980c18] Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 3a8a670eeeaa40d87bd38a587438952741980c18
# bad: [6e17c6de3ddf3073741d9c91a796ee696914d8a0] Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0
# good: [2605e80d3438c77190f55b821c6575048c68268e] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect good 2605e80d3438c77190f55b821c6575048c68268e
# good: [72dc6db7e3b692f46f3386b8dd5101d3f431adef] Merge tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
git bisect good 72dc6db7e3b692f46f3386b8dd5101d3f431adef
# bad: [179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21] mm/madvise: clean up force_shm_swapin_readahead()
git bisect bad 179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21
# good: [523716770e63e229dbb6307d663f03d990dfefc5] maple_tree: rework mtree_alloc_{range,rrange}()
git bisect good 523716770e63e229dbb6307d663f03d990dfefc5
# good: [b764253c18821da31c49a260f92f5d093cf1637e] selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c
git bisect good b764253c18821da31c49a260f92f5d093cf1637e
# good: [5c7f3bf04a6cf266567fdea1ae4987875e92619f] s390: allow pte_offset_map_lock() to fail
git bisect good 5c7f3bf04a6cf266567fdea1ae4987875e92619f
# good: [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow pte_offset_map[_lock]() to fail
git bisect good 0d940a9b270b9220dcff74d8e9123c9788365751
# bad: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
git bisect bad 0d1c81edc61e553ed7a5db18fb8074c8b78e1538
# good: [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
git bisect good 2798bbe75b9c2752b46d292e5c2a49f49da36418
# good: [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk: walk_pte_range() allow for pte_offset_map()
git bisect good be872f83bf571f4f9a0ac25e2c9c36e905a36619
# good: [e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363] mm/vmwgfx: simplify pmd & pud mapping dirty helpers
git bisect good e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363
# first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
> On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
> > vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
> > but it's intended for userspace page tables: prefer pte_offset_kernel().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Reviewed-by: Lorenzo Stoakes <[email protected]>
>
> Currently Linus' tree is reliably failing to boot on pine64plus, an
> arm64 SBC. Most other boards seem fine, though I am seeing some
> additional instability on Tritium which is another Allwinner platform,
> I've not dug into that yet and Tritium is generally less stable.
>
> We end up seeing NULL or otherwise bad pointer dereferences, the
> specific error does vary a bit though it mostly appears to be in the
> pinctrl code. A bisect (full log below) identified this patch as
> introducing the failure, nothing is jumping out at me about the patch
> and it's not affecting everything so I'd not be surprised if it's just
> uncovering some bug in the platform support but I'm not super familiar
> with the code.
Yeah seems likely. Do you have a .config you can share for this board? For
a 64-bit device you'd expect that this change would probably be a nop.
>
> Sample backtrace:
>
> [ 1.919725] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [ 1.928551] Mem abort info:
> [ 1.931359] ESR = 0x0000000096000044
>
> ...
>
> [ 1.968870] [0000000000000000] user address but active_mm is swapper
>
> ...
>
> [ 2.093969] Call trace:
> [ 2.096414] dt_remember_or_free_map+0xc8/0x120
> [ 2.100949] pinctrl_dt_to_map+0x23c/0x364
> [ 2.105050] create_pinctrl+0x68/0x3ec
> [ 2.108803] pinctrl_get+0xb0/0x124
> [ 2.112294] devm_pinctrl_get+0x48/0x90
> [ 2.116133] pinctrl_bind_pins+0x58/0x158
> [ 2.120148] really_probe+0x54/0x2b0
> [ 2.123724] __driver_probe_device+0x78/0x12c
>
> Another common theme is the same but with an address like 0x4c and:
>
> [ 2.098328] __kmem_cache_alloc_node+0x1bc/0x2dc
> [ 2.102947] kmalloc_trace+0x20/0x2c
> [ 2.106524] pinctrl_register_mappings+0x98/0x178
>
> Full boot log from a failure:
>
> https://lava.sirena.org.uk/scheduler/job/712456
>
> git bisect start
> # bad: [06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5] Linux 6.5-rc1
> git bisect bad 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
> # good: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
> git bisect good 6995e2de6891c724bfeb2db33d7b87775f913ad1
> # bad: [1b722407a13b7f8658d2e26917791f32805980a2] Merge tag 'drm-next-2023-06-29' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 1b722407a13b7f8658d2e26917791f32805980a2
> # bad: [3a8a670eeeaa40d87bd38a587438952741980c18] Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> git bisect bad 3a8a670eeeaa40d87bd38a587438952741980c18
> # bad: [6e17c6de3ddf3073741d9c91a796ee696914d8a0] Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0
> # good: [2605e80d3438c77190f55b821c6575048c68268e] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 2605e80d3438c77190f55b821c6575048c68268e
> # good: [72dc6db7e3b692f46f3386b8dd5101d3f431adef] Merge tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
> git bisect good 72dc6db7e3b692f46f3386b8dd5101d3f431adef
> # bad: [179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21] mm/madvise: clean up force_shm_swapin_readahead()
> git bisect bad 179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21
> # good: [523716770e63e229dbb6307d663f03d990dfefc5] maple_tree: rework mtree_alloc_{range,rrange}()
> git bisect good 523716770e63e229dbb6307d663f03d990dfefc5
> # good: [b764253c18821da31c49a260f92f5d093cf1637e] selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c
> git bisect good b764253c18821da31c49a260f92f5d093cf1637e
> # good: [5c7f3bf04a6cf266567fdea1ae4987875e92619f] s390: allow pte_offset_map_lock() to fail
> git bisect good 5c7f3bf04a6cf266567fdea1ae4987875e92619f
> # good: [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow pte_offset_map[_lock]() to fail
> git bisect good 0d940a9b270b9220dcff74d8e9123c9788365751
> # bad: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
> git bisect bad 0d1c81edc61e553ed7a5db18fb8074c8b78e1538
> # good: [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
> git bisect good 2798bbe75b9c2752b46d292e5c2a49f49da36418
> # good: [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk: walk_pte_range() allow for pte_offset_map()
> git bisect good be872f83bf571f4f9a0ac25e2c9c36e905a36619
> # good: [e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363] mm/vmwgfx: simplify pmd & pud mapping dirty helpers
> git bisect good e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363
> # first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
On Mon, Jul 10, 2023 at 06:18:27PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
> > We end up seeing NULL or otherwise bad pointer dereferences, the
> > specific error does vary a bit though it mostly appears to be in the
> > pinctrl code. A bisect (full log below) identified this patch as
> > introducing the failure, nothing is jumping out at me about the patch
> > and it's not affecting everything so I'd not be surprised if it's just
> > uncovering some bug in the platform support but I'm not super familiar
> > with the code.
> Yeah seems likely. Do you have a .config you can share for this board? For
> a 64-bit device you'd expect that this change would probably be a nop.
It's definitely happening with arm64 defconfig, possibly with other
configs but that's the main one.
On 8 Jun 2023, at 21:11, Hugh Dickins wrote:
> filemap_map_pages() allow pte_offset_map_lock() to fail; and remove the
> pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
> return to filemap_map_pages() and let pte_offset_map_lock() discover that.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/filemap.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 28b42ee848a4..9e129ad43e0d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> if (pmd_none(*vmf->pmd))
> pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>
> - /* See comment in handle_pte_fault() */
> - if (pmd_devmap_trans_unstable(vmf->pmd)) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return true;
> - }
> -
There is a pmd_trans_huge() check at the beginning: should it be removed
as well, since pte_offset_map_lock() is also able to detect it?
> return false;
> }
>
> @@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>
> addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> + if (!vmf->pte) {
> + folio_unlock(folio);
> + folio_put(folio);
> + goto out;
> + }
> do {
> again:
> page = folio_file_page(folio, xas.xa_index);
> --
> 2.35.3
These two changes affect the ret value. Before, pmd_devmap_trans_unstable() == true
made ret = VM_FAULT_NOPAGE, but now ret is the default 0 value. So ret should be set
to VM_FAULT_NOPAGE before goto out in the second hunk?
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:12, Hugh Dickins wrote:
> Revert commit a7a69d8ba88d ("mm/thp: another PVMW_SYNC fix in
> page_vma_mapped_walk()"): I was proud of that "Aha!" commit at the time,
> but in revisiting page_vma_mapped_walk() for pte_offset_map() failure,
> that block raised a doubt: and it now seems utterly bogus. The prior
> map_pte() has taken ptl unconditionally when PVMW_SYNC: I must have
> forgotten that when making the change. It did no harm, but could not
> have fixed a BUG or WARN, and is hard to reconcile with coming changes.
>
> Signed-off-by: Hugh Dickins <[email protected]>
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:14, Hugh Dickins wrote:
> No functional change here, but adjust the format of map_pte() so that the
> following commit will be easier to read: separate out the PVMW_SYNC case
> first, and remove two levels of indentation from the ZONE_DEVICE case.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/page_vma_mapped.c | 65 +++++++++++++++++++++++---------------------
> 1 file changed, 34 insertions(+), 31 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:10, Hugh Dickins wrote:
> Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
> outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
> __pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
>
> __pte_offset_map() do pmdval validation (including pmd_clear_bad()
> when pmd_bad()), returning NULL if pmdval is not for a page table.
> __pte_offset_map_lock() verify pmdval unchanged after getting the
> lock, trying again if it changed.
>
> No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
> to cover the imminent case, but we expect to generalize it later, and
> it makes a mess of where to do the pmd_bad() clearing.
>
> Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
> without actually taking the lock. This will be preferred to open uses of
> pte_lockptr(), because (when split ptlock is in page table's struct page)
> it points to the right lock for the returned pte pointer, even if *pmd
> gets changed racily afterwards.
>
> Update corresponding Documentation.
>
> Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
> they have to wait until all architectures are balancing pte_offset_map()s
> with pte_unmap()s (as in the arch series posted earlier). But comment
> where they will go, so that it's easy to add them for experiments. And
> only when those are in place can transient racy failure cases be enabled.
> Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Documentation/mm/split_page_table_lock.rst | 17 ++++---
> include/linux/mm.h | 27 +++++++----
> include/linux/pgtable.h | 22 ++++++---
> mm/pgtable-generic.c | 56 ++++++++++++++++++++++
> 4 files changed, 101 insertions(+), 21 deletions(-)
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On Mon, 10 Jul 2023, Mark Brown wrote:
> On Mon, Jul 10, 2023 at 06:18:27PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
>
> > > We end up seeing NULL or otherwise bad pointer dereferences, the
> > > specific error does vary a bit though it mostly appears to be in the
> > > pinctrl code. A bisect (full log below) identified this patch as
> > > introducing the failure, nothing is jumping out at me about the patch
> > > and it's not affecting everything so I'd not be surprised if it's just
> > > uncovering some bug in the platform support but I'm not super familiar
> > > with the code.
>
> > Yeah seems likely. Do you have a .config you can share for this board? For
> > a 64-bit device you'd expect that this change would probably be a nop.
>
> It's definitely happening with arm64 defconfig, possibly with other
> configs but that's the main one.
I'm sorry for dropping you in it, Mark, but I'm totally baffled.
I've spent most of the day trying to come up with ideas, but failed.
I've no doubt that you're seeing what you're seeing, but how it comes
about is a mystery.
Lorenzo is right that the change should be a no-op - compared with 6.4.
But it's not quite a no-op in this series, because 04/32 0d940a9b270b
("mm/pgtable: allow pte_offset_map[_lock]() to fail") diverts the old
pte_offset_map() macro off to a new function in mm/pgtable-generic.c;
then this commit restores it back to being the pte_offset_kernel() macro.
So the asm in vmalloc_to_page() is expected to change in this commit,
but change back to what it would have been in 6.4.
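For reference, the definitions involved are roughly these (paraphrased,
not quoted exactly from the tree):

	/* v6.4, !CONFIG_HIGHPTE: pte_offset_map() was the kernel macro */
	#define pte_offset_map(pmd, address)  pte_offset_kernel((pmd), (address))

	/* after 0d940a9b270b: pte_offset_map() calls into mm/pgtable-generic.c */
	#define pte_offset_map(pmd, addr)     __pte_offset_map(pmd, addr, NULL)

	/* and 0d1c81edc61e has vmalloc_to_page() go direct again */
	ptep = pte_offset_kernel(pmd, addr);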
This feels like one of those bugs which depends on the code size in
some way (a bit like those bugs we used to have, where a function was
mistakenly marked __init, then in some configs its code landed on a
page which got freed at startup - I'm not saying this is that at all,
just saying it feels weird in that way).
Yet your bisection converges convincingly, which I wouldn't expect
in that case.
I suppose I should ask you to try reverting this 0d1c81edc61e alone
from 6.5-rc1: the consistency of your bisection implies that it will
"fix" the issues, and it is a commit which we could drop. It makes
me a little nervous, applying userspace-pagetable validation to kernel
pagetables, so I don't want to drop it; and it would really be cargo-
culting to drop it without understanding. But we could drop it.
I guess it would be interesting to know whether vmalloc_to_page() is
ever even called in your kernel, before it crashes on the pinctrl stuff.
But putting in a printk to report on that may change everything.
And I guess it would be interesting to know (from a DEBUG_INFO build
of the crashing kernel) which line of dt_remember_or_free_map() it
oopses on i.e. which pointer is NULL when it shouldn't be - or maybe
you already worked that out.
And what device (which ->dt_node_to_map) is involved. If one of the
many dt_node_to_map's fails to initialize *map to NULL when it should,
and has relied on it happening to be a NULL on the stack already...
that might explain it.
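Schematically, the kind of bug I'm imagining (purely illustrative:
hypothetical driver and property names, not pointing at any real code):

	static int foo_dt_node_to_map(struct pinctrl_dev *pctldev,
				      struct device_node *np,
				      struct pinctrl_map **map,
				      unsigned int *num_maps)
	{
		if (!of_find_property(np, "pins", NULL))
			return -EINVAL;	/* bug: returns without ever
					 * writing *map or *num_maps */
		*map = NULL;
		*num_maps = 0;
		return 0;
	}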
Another thing to try would be the kernel at 0d940a9b270b^, just before
pte_offset_map() grew a function call: there's a faint possibility that
the bug came in before this series, that 0d940a9b270b somehow masked it
(I don't see how: vmalloc_to_page() does sensible validation itself),
and then 0d1c81edc61e unmasked it again - so that the bisection skipped
over, and converged on the wrong point.
But I'm thrashing about: I have no confidence that any of this info will
help us. Sorry for wasting your time.
Thanks,
Hugh
On Mon, 10 Jul 2023, Zi Yan wrote:
> On 8 Jun 2023, at 21:11, Hugh Dickins wrote:
>
> > filemap_map_pages() allow pte_offset_map_lock() to fail; and remove the
> > pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
> > return to filemap_map_pages() and let pte_offset_map_lock() discover that.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > mm/filemap.c | 12 +++++-------
> > 1 file changed, 5 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 28b42ee848a4..9e129ad43e0d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> > if (pmd_none(*vmf->pmd))
> > pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
> >
> > - /* See comment in handle_pte_fault() */
> > - if (pmd_devmap_trans_unstable(vmf->pmd)) {
> > - folio_unlock(folio);
> > - folio_put(folio);
> > - return true;
> > - }
> > -
>
> There is a pmd_trans_huge() check at the beginning: should it be removed
> as well, since pte_offset_map_lock() is also able to detect it?
It probably could be removed: but mostly I avoided such cleanups,
in the hope that the patches could be more easily reviewed as safe.
But I was eager to delete that obscure pmd_devmap_trans_unstable().
The whole strategy of dealing with the pmd_trans_huge()-like cases first,
and only finally arriving at the pte_offset_map_lock() when other cases
have been excluded, could be reversed in *many* places. It had to be that
way before, because pte_offset_map_lock() could only cope with a page
table; but now we could reverse them to do the pte_offset_map_lock()
first, and only try the other cases when it fails.
That would in theory be more efficient; but whether measurably more
efficient I doubt. And very easy to introduce errors on the way:
my enthusiasm for such cleanups is low! But maybe there's a few
places where the rearrangement would be worthwhile.
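Schematically, with made-up helper names, the two shapes contrasted:

	/* today: exclude the huge and none cases first */
	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return handle_huge_pmd(vmf);
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return retry_or_skip(vmf);

	/* possible reversal: try for the page table first */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return handle_huge_pmd_or_retry(vmf);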
>
> > return false;
> > }
> >
> > @@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> >
> > addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> > + if (!vmf->pte) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + goto out;
> > + }
> > do {
> > again:
> > page = folio_file_page(folio, xas.xa_index);
> > --
> > 2.35.3
>
> These two changes affect the ret value. Before, pmd_devmap_trans_unstable() == true
> made ret = VM_FAULT_NOPAGE, but now ret is the default 0 value. So ret should be set
> to VM_FAULT_NOPAGE before goto out in the second hunk?
Qi Zheng raised a similar question on the original posting, I answered
https://lore.kernel.org/linux-mm/[email protected]/
It's a rare case to fault here, then find pmd_devmap(*pmd), and it really
doesn't matter whether we return VM_FAULT_NOPAGE or 0 for it - maybe I've
left it inconsistent between THP and devmap, but it doesn't really matter.
I haven't checked Matthew's v5 "new page table range API" posted today,
but I expect this all looks different here anyway.
Thanks a lot for checking these: they are now in 6.5-rc1, so if you find
something that needs fixing, all the more important that we do fix it.
Hugh
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 10.07.23 16:42, Mark Brown wrote:
> On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
>> vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
>> but it's intended for userspace page tables: prefer pte_offset_kernel().
>>
>> Signed-off-by: Hugh Dickins <[email protected]>
>> Reviewed-by: Lorenzo Stoakes <[email protected]>
>
> Currently Linus' tree is reliably failing to boot on pine64plus, an
> arm64 SBC. Most other boards seem fine, though I am seeing some
> additional instability on Tritium which is another Allwinner platform,
> I've not dug into that yet and Tritium is generally less stable.
>
> We end up seeing NULL or otherwise bad pointer dereferences, the
> [...]
> # first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced 0d1c81edc61e553ed7a5db18fb8074c8
#regzbot title mm/vmalloc: NULL or otherwise bad pointer dereferences on
ARM64
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> This feels like one of those bugs which depends on the code size in
> some way (a bit like those bugs we used to have, where a function was
> mistakenly marked __init, then in some configs its code landed on a
> page which got freed at startup - I'm not saying this is that at all,
> just saying it feels weird in that way).
> Yet your bisection converges convincingly, which I wouldn't expect
> in that case.
Yes, it smells like code size or something other than the commit
itself; I have seen this sort of behaviour before, where something nearby
in history introduced something which was then triggered by whatever the
bisect points at.
> I suppose I should ask you to try reverting this 0d1c81edc61e alone
> from 6.5-rc1: the consistency of your bisection implies that it will
> "fix" the issues, and it is a commit which we could drop. It makes
> me a little nervous, applying userspace-pagetable validation to kernel
> pagetables, so I don't want to drop it; and it would really be cargo-
> culting to drop it without understanding. But we could drop it.
I did look at that, it doesn't revert cleanly by itself. Your other
suggestions are all good - I'll poke at them. My suspicion is that
there's some longer-standing breakage elsewhere and your series (or even
just this patch) just happens to push it into happening reliably; had it
not been an mm change and a memory-related bug I'd probably have just
discounted the bisect result.
On Tue, 11 Jul 2023, Mark Brown wrote:
> On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
>
> > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > from 6.5-rc1: the consistency of your bisection implies that it will
> > "fix" the issues, and it is a commit which we could drop. It makes
> > me a little nervous, applying userspace-pagetable validation to kernel
> > pagetables, so I don't want to drop it; and it would really be cargo-
> > culting to drop it without understanding. But we could drop it.
>
> I did look at that, it doesn't revert cleanly by itself. ...
Right, that ptep_get() wrapper on the next line came in on top.
The patch to revert just 0d1c81edc61e is this:
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -703,10 +703,11 @@ struct page *vmalloc_to_page(const void
if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
- ptep = pte_offset_kernel(pmd, addr);
+ ptep = pte_offset_map(pmd, addr);
pte = ptep_get(ptep);
if (pte_present(pte))
page = pte_page(pte);
+ pte_unmap(ptep);
return page;
}
On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> On Tue, 11 Jul 2023, Mark Brown wrote:
> > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > "fix" the issues, and it is a commit which we could drop. It makes
> > > me a little nervous, applying userspace-pagetable validation to kernel
> > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > culting to drop it without understanding. But we could drop it.
> > I did look at that, it doesn't revert cleanly by itself. ...
> Right, that ptep_get() wrapper on the next line came in on top.
> The patch to revert just 0d1c81edc61e is this:
Thanks, tried that and it's still exploding in a similar way (though
this time inside a regulator call from the pinctrl code which was
happening in other cases).
On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> On Tue, 11 Jul 2023, Mark Brown wrote:
> > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> >
> > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > "fix" the issues, and it is a commit which we could drop. It makes
> > > me a little nervous, applying userspace-pagetable validation to kernel
> > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > culting to drop it without understanding. But we could drop it.
> >
> > I did look at that, it doesn't revert cleanly by itself. ...
>
> Right, that ptep_get() wrapper on the next line came in on top.
> The patch to revert just 0d1c81edc61e is this:
Still investigating but I'm pretty convinced this is nothing to do with
your commit/series and is just common or garden memory corruption that
just happens to get tickled by your changes. Sorry for the noise.
On 11.07.23 19:57, Mark Brown wrote:
> On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
>> On Tue, 11 Jul 2023, Mark Brown wrote:
>>> On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
>>>
>>>> I suppose I should ask you to try reverting this 0d1c81edc61e alone
>>>> from 6.5-rc1: the consistency of your bisection implies that it will
>>>> "fix" the issues, and it is a commit which we could drop. It makes
>>>> me a little nervous, applying userspace-pagetable validation to kernel
>>>> pagetables, so I don't want to drop it; and it would really be cargo-
>>>> culting to drop it without understanding. But we could drop it.
>>>
>>> I did look at that, it doesn't revert cleanly by itself. ...
>>
>> Right, that ptep_get() wrapper on the next line came in on top.
>> The patch to revert just 0d1c81edc61e is this:
>
> Still investigating but I'm pretty convinced this is nothing to do with
> your commit/series and is just common or garden memory corruption that
> just happens to get tickled by your changes. Sorry for the noise.
In that case:
#regzbot introduced v6.4..v6.5-rc1
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
> On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> > On Tue, 11 Jul 2023, Mark Brown wrote:
> > > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> > >
> > > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > > "fix" the issues, and it is a commit which we could drop. It makes
> > > > me a little nervous, applying userspace-pagetable validation to kernel
> > > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > > culting to drop it without understanding. But we could drop it.
> > >
> > > I did look at that, it doesn't revert cleanly by itself. ...
> >
> > Right, that ptep_get() wrapper on the next line came in on top.
> > The patch to revert just 0d1c81edc61e is this:
>
> Still investigating but I'm pretty convinced this is nothing to do with
> your commit/series and is just common or garden memory corruption that
> just happens to get tickled by your changes. Sorry for the noise.
Did you get to the bottom of this? If not, do you have a reliable way to
reproduce the problem? I don't like the sound of memory corruption :(
Will
On Thu, Jul 20, 2023 at 11:32:28AM +0100, Will Deacon wrote:
> On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
> > Still investigating but I'm pretty convinced this is nothing to do with
> > your commit/series and is just common or garden memory corruption that
> > just happens to get tickled by your changes. Sorry for the noise.
> Did you get to the bottom of this? If not, do you have a reliable way to
> reproduce the problem? I don't like the sound of memory corruption :(
Not to the bottom of it, but getting there - I isolated the issue to
something in the unregistration path for thermal zones but didn't manage
to figure out exactly what. There was some indication it might be a use
after free but I'm not convinced.
I have a reliable way to reproduce this if you have a pine64plus; it
also shows up a lot on the Libretech Tritium, though not quite so
reliably as on the pine64plus, since Hugh's changes. Equally, the
pine64plus was rock solid until those, so there's some
timing/environment thing going on which makes the issue manifest
obviously. I expect you should be able to trigger the issue by
unregistering a thermal driver, but the effects might not be visible.
There is a change on the list to make the Allwinner SoCs not trigger the
issue during boot (their thermal driver refuses to register if any one
zone fails but most of their SoCs have multiple thermal zones with only
one fully described) but it needs fixing either way.
Hi, Hugh
It seems this change makes pte_offset_map_lock impossible to call
from out-of-tree modules; when one does, modpost reports an error
like this:
ERROR: modpost: "__pte_offset_map_lock"
[../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
Not sure if you have any idea about it, and any suggestions on how to
resolve it?
Thanks,
Yongqin Liu
On Fri, 9 Jun 2023 at 09:10, Hugh Dickins <[email protected]> wrote:
>
> Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
> outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
> __pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
>
> __pte_offset_map() does pmdval validation (including pmd_clear_bad()
> when pmd_bad()), returning NULL if pmdval is not for a page table.
> __pte_offset_map_lock() verifies pmdval unchanged after getting the
> lock, trying again if it changed.
>
> No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
> to cover the imminent case, but we expect to generalize it later, and
> it makes a mess of where to do the pmd_bad() clearing.
>
> Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
> without actually taking the lock. This will be preferred to open uses of
> pte_lockptr(), because (when split ptlock is in page table's struct page)
> it points to the right lock for the returned pte pointer, even if *pmd
> gets changed racily afterwards.
>
> Update corresponding Documentation.
>
> Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
> they have to wait until all architectures are balancing pte_offset_map()s
> with pte_unmap()s (as in the arch series posted earlier). But comment
> where they will go, so that it's easy to add them for experiments. And
> only when those are in place can transient racy failure cases be enabled.
> Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Documentation/mm/split_page_table_lock.rst | 17 ++++---
> include/linux/mm.h | 27 +++++++----
> include/linux/pgtable.h | 22 ++++++---
> mm/pgtable-generic.c | 56 ++++++++++++++++++++++
> 4 files changed, 101 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index 50ee0dfc95be..a834fad9de12 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
> There are helpers to lock/unlock a table and other accessor functions:
>
> - pte_offset_map_lock()
> - maps pte and takes PTE table lock, returns pointer to the taken
> - lock;
> + maps PTE and takes PTE table lock, returns pointer to PTE with
> + pointer to its PTE table lock, or returns NULL if no PTE table;
> + - pte_offset_map_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map()
> + maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
> + - pte_unmap()
> + unmaps PTE table;
> - pte_unmap_unlock()
> unlocks and unmaps PTE table;
> - pte_alloc_map_lock()
> - allocates PTE table if needed and take the lock, returns pointer
> - to taken lock or NULL if allocation failed;
> - - pte_lockptr()
> - returns pointer to PTE table lock;
> + allocates PTE table if needed and takes its lock, returns pointer to
> + PTE with pointer to its lock, or returns NULL if allocation failed;
> - pmd_lock()
> takes PMD table lock, returns pointer to taken lock;
> - pmd_lockptr()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..3c2e56980853 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
> dec_lruvec_page_state(page, NR_PAGETABLE);
> }
>
> -#define pte_offset_map_lock(mm, pmd, address, ptlp) \
> -({ \
> - spinlock_t *__ptl = pte_lockptr(mm, pmd); \
> - pte_t *__pte = pte_offset_map(pmd, address); \
> - *(ptlp) = __ptl; \
> - spin_lock(__ptl); \
> - __pte; \
> -})
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
> +static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
> +{
> + return __pte_offset_map(pmd, addr, NULL);
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp);
> +static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pte_t *pte;
> +
> + __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
> + return pte;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp);
>
> #define pte_unmap_unlock(pte, ptl) do { \
> spin_unlock(ptl); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 94235ff2706e..3fabbb018557 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
> #define pte_offset_kernel pte_offset_kernel
> #endif
>
> -#if defined(CONFIG_HIGHPTE)
> -#define pte_offset_map(dir, address) \
> - ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
> - pte_index((address)))
> -#define pte_unmap(pte) kunmap_local((pte))
> +#ifdef CONFIG_HIGHPTE
> +#define __pte_map(pmd, address) \
> + ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
> +#define pte_unmap(pte) do { \
> + kunmap_local((pte)); \
> + /* rcu_read_unlock() to be added later */ \
> +} while (0)
> #else
> -#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
> -#define pte_unmap(pte) ((void)(pte)) /* NOP */
> +static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
> +{
> + return pte_offset_kernel(pmd, address);
> +}
> +static inline void pte_unmap(pte_t *pte)
> +{
> + /* rcu_read_unlock() to be added later */
> +}
> #endif
>
> /* Find an entry in the second-level page table.. */
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d2fc52bffafc..c7ab18a5fb77 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -10,6 +10,8 @@
> #include <linux/pagemap.h>
> #include <linux/hugetlb.h>
> #include <linux/pgtable.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> #include <linux/mm_inline.h>
> #include <asm/tlb.h>
>
> @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> }
> #endif
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> +{
> + pmd_t pmdval;
> +
> + /* rcu_read_lock() to be added later */
> + pmdval = pmdp_get_lockless(pmd);
> + if (pmdvalp)
> + *pmdvalp = pmdval;
> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> + goto nomap;
> + if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> + goto nomap;
> + if (unlikely(pmd_bad(pmdval))) {
> + pmd_clear_bad(pmd);
> + goto nomap;
> + }
> + return __pte_map(&pmdval, addr);
> +nomap:
> + /* rcu_read_unlock() to be added later */
> + return NULL;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + spinlock_t *ptl;
> + pmd_t pmdval;
> + pte_t *pte;
> +again:
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (unlikely(!pte))
> + return pte;
> + ptl = pte_lockptr(mm, &pmdval);
> + spin_lock(ptl);
> + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> + *ptlp = ptl;
> + return pte;
> + }
> + pte_unmap_unlock(pte, ptl);
> + goto again;
> +}
> --
> 2.35.3
>
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-android
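For reference, here is a minimal caller-side sketch of how the interfaces
described in the patch quoted above are intended to be used: the NULL
return (no page table at *pmd) is the new case every caller has to handle.
The function name and the -EAGAIN convention are illustrative assumptions,
not taken from the series.

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Illustrative only: map and lock one PTE with the new API.
 * pte_offset_map_lock() may now return NULL if there is no page
 * table at *pmd (or if it changed before the lock was taken);
 * the caller decides whether that means "skip" or "try again".
 */
static int example_touch_pte(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -EAGAIN;		/* no page table here */

	/* examine or modify *pte under ptl */

	pte_unmap_unlock(pte, ptl);
	return 0;
}

pte_offset_map_nolock() follows the same pattern, except that *ptlp is
returned without being taken, and the pte must still be pte_unmap()ed
when done.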
On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> Hi, Hugh
>
> It seems this change makes pte_offset_map_lock impossible to call
> from out-of-tree modules; when one does, modpost reports an error
> like this:
> ERROR: modpost: "__pte_offset_map_lock"
> [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
>
> Not sure if you have any idea about it, and any suggestions on how to
> resolve it?
Please explain why this module needs to map page tables
On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > Hi, Hugh
> >
> > It seems this change makes pte_offset_map_lock impossible to call
> > from out-of-tree modules; when one does, modpost reports an error
> > like this:
> > ERROR: modpost: "__pte_offset_map_lock"
> > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> >
> > Not sure if you have any idea about it, and any suggestions on how to
> > resolve it?
>
> Please explain why this module needs to map page tables
+1
Thank you for testing 6.5-rc, and I am sorry to have inconvenienced you.
But there is not one example of an in-tree module needing that,
which is a very strong hint that no module should be needing that.
Sounds like pvrsrvkm.ko wants to muck around with page table entries,
without the core mm knowing. Not something core mm can encourage!
If what pvrsrvkm.ko is aiming to do there would be useful for others,
maybe its owner can share that, and work with core mm developers to
expose a generally useful interface - but that is not likely to be
__pte_offset_map_lock itself.
Hugh
On Sat, 29 Jul 2023 at 00:58, Hugh Dickins <[email protected]> wrote:
>
> On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> > On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > > Hi, Hugh
> > >
> > > It seems this change makes pte_offset_map_lock impossible to call
> > > from out-of-tree modules; when one does, modpost reports an error
> > > like this:
> > > ERROR: modpost: "__pte_offset_map_lock"
> > > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> > >
> > > Not sure if you have any idea about it, and any suggestions on how to
> > > resolve it?
> >
> > Please explain why this module needs to map page tables
>
> +1
Sorry, I am not able to give any explanation here,
I am not familiar with the pvrsrvkm source, I just use it to have one
working AOSP build.
here is the source file where pte_offset_map_lock is called,
https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
in case you could know something with a quick look.
Otherwise, it has to wait for another one to report the problem again.
> Thank you for testing 6.5-rc, and I am sorry to have inconvenienced you.
>
> But there is not one example of an in-tree module needing that,
> which is a very strong hint that no module should be needing that.
>
> Sounds like pvrsrvkm.ko wants to muck around with page table entries,
> without the core mm knowing. Not something core mm can encourage!
>
> If what pvrsrvkm.ko is aiming to do there would be useful for others,
> maybe its owner can share that, and work with core mm developers to
> expose a generally useful interface - but that is not likely to be
> __pte_offset_map_lock itself.
>
Thanks for the explanation!
Let's see if any other pvrsrvkm engineer or other out of tree modules could help
give some explanations on this case or similar cases.
Thanks,
Yongqin Liu
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-android
On Sun, Aug 06, 2023 at 12:06:28AM +0800, Yongqin Liu wrote:
> On Sat, 29 Jul 2023 at 00:58, Hugh Dickins <[email protected]> wrote:
> >
> > On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> > > On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > > > Hi, Hugh
> > > >
> > > > It seems this change makes pte_offset_map_lock impossible to call
> > > > from out-of-tree modules; when one does, modpost reports an error
> > > > like this:
> > > > ERROR: modpost: "__pte_offset_map_lock"
> > > > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> > > >
> > > > Not sure if you have any idea about it, and any suggestions on how to
> > > > resolve it?
> > >
> > > Please explain why this module needs to map page tables
> >
> > +1
> Sorry, I am not able to give any explanation here,
> I am not familiar with the pvrsrvkm source, I just use it to have one
> working AOSP build.
>
> here is the source file where pte_offset_map_lock is called,
> https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
> in case you could know something with a quick look.
Isn't this just get_user_pages()?
On 8/5/23 10:07, Matthew Wilcox wrote:
...
>> Sorry, I am not able to give any explanation here,
>> I am not familiar with the pvrsrvkm source, I just use it to have one
>> working AOSP build.
>>
>> here is the source file where pte_offset_map_lock is called,
>> https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
>> in case you could know something with a quick look.
>
> Isn't this just get_user_pages()?
Or even just follow_page(), which looks like a nearly perfect drop-in
replacement, especially since that android link also says, "The page in
question must be present (i.e. no fault handling required)".
thanks,
--
John Hubbard
NVIDIA
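A rough sketch of the kind of replacement being suggested here, assuming
the driver only needs the struct pages behind user addresses that are
already populated; the function name, the use of get_user_pages_fast(),
and the FOLL_WRITE flag are assumptions that would need checking against
what pvrsrvkm actually does with the pages.

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Hypothetical alternative to walking page tables by hand: let GUP
 * look up, and take references on, the pages behind a user buffer.
 * The caller must put_page() each page when it is finished with it.
 */
static int example_lookup_user_pages(unsigned long uaddr, int nr_pages,
				     struct page **pages)
{
	int got;

	/* FOLL_WRITE only if the pages will be written to */
	got = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (got < 0)
		return got;
	if (got < nr_pages) {
		/* partial lookup: drop what we got and report failure */
		while (got > 0)
			put_page(pages[--got]);
		return -EFAULT;
	}
	return 0;
}

follow_page(), as suggested above, would avoid taking references at all,
but then the caller has to provide its own guarantee that the pages stay
present for as long as it uses them.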
On Tue, Aug 08, 2023 at 07:52:43AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi Mark, just wondering: did anything come out of this, and is this
> still happening? I'm asking as I still have this on my list of
> tracked regressions.
It's fixed.
On 20.07.23 14:06, Mark Brown wrote:
> On Thu, Jul 20, 2023 at 11:32:28AM +0100, Will Deacon wrote:
>> On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
>
>>> Still investigating but I'm pretty convinced this is nothing to do with
>>> your commit/series and is just common or garden memory corruption that
>>> just happens to get tickled by your changes. Sorry for the noise.
>
>> Did you get to the bottom of this? If not, do you have a reliable way to
>> reproduce the problem? I don't like the sound of memory corruption :(
>
> Not to the bottom of it, but getting there - I isolated the issue to
> something in the unregistration path for thermal zones but didn't manage
> to figure out exactly what.
Hi Mark, just wondering: did anything come out of this, and is this
still happening? I'm asking as I still have this on my list of
tracked regressions.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot poke
> There was some indication it might be a use
> after free but I'm not convinced.
>
> I have a reliable way to reproduce this if you have a pine64plus; it
> also shows up a lot on the Libretech Tritium, though not quite so
> reliably as on the pine64plus, since Hugh's changes. Equally, the
> pine64plus was rock solid until those, so there's some
> timing/environment thing going on which makes the issue manifest
> obviously. I expect you should be able to trigger the issue by
> unregistering a thermal driver, but the effects might not be visible.
>
> There is a change on the list to make the Allwinner SoCs not trigger the
> issue during boot (their thermal driver refuses to register if any one
> zone fails but most of their SoCs have multiple thermal zones with only
> one fully described) but it needs fixing either way.
On 08.08.23 13:09, Mark Brown wrote:
> On Tue, Aug 08, 2023 at 07:52:43AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>> Hi Mark, just wondering: did anything come out of this, and is this
>> still happening? I'm asking as I still have this on my list of
>> tracked regressions.
>
> It's fixed.
In that case:
#regzbot resolve: fixed according to reporter
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.