2023-07-15 09:46:54

by Mikhail Gavrilov

Subject: [bug/bisected] I see "mm/pgtable-generic.c:53: bad pmd (____ptrval____)(8000000100077061)" every boot time

Hi,
Is it OK that I see "mm/pgtable-generic.c:53: bad pmd
(____ptrval____)(8000000100077061)" at every boot?
Unfortunately, bisect couldn't say which of these commits
# possible first bad commit:
[be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk:
walk_pte_range() allow for pte_offset_map()
# possible first bad commit:
[7780d04046a2288ab85d88bedacc60fa4fad9971] mm/pagewalkers:
ACTION_AGAIN if pte_offset_map_lock() fails
# possible first bad commit:
[2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped:
pte_offset_map_nolock() not pte_lockptr()
# possible first bad commit:
[90f43b0a13cddb09e2686f4d976751c0a9b8b197] mm/page_vma_mapped:
reformat map_pte() with less indentation
# possible first bad commit:
[45fe85e9811ede2d65b21724cae50d6a0563e452] mm/page_vma_mapped: delete
bogosity in page_vma_mapped_walk()
# possible first bad commit:
[65747aaf42b7db6acb8e57a2b8e9959928f404dd] mm/filemap: allow
pte_offset_map_lock() to fail
# possible first bad commit:
[0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow
pte_offset_map[_lock]() to fail
is definitely the first bad one, because the machine on which I was
doing the bisection is unbootable on these commits.
I hope Hugh Dickins can figure out what's going on here. He is the
author of these commits.

All my machines are based on the AMD platform: two 7950X and one 5900HX.

It seems that this message is harmless for the system, but
I can't judge whether it is a bug or not.
From the user side it looks like a regression, because on commit
46c475bd676bb05077c8a38b37f175552f035406 this message was absent.

--
Best Regards,
Mike Gavrilov.


Attachments:
bisect-log-bad-pmd.txt (3.75 kB)
dmesg.zip (61.28 kB)

2023-07-16 03:56:22

by Bagas Sanjaya

Subject: Re: [bug/bisected] I see "mm/pgtable-generic.c:53: bad pmd (____ptrval____)(8000000100077061)" every boot time

On Sat, Jul 15, 2023 at 02:24:59PM +0500, Mikhail Gavrilov wrote:
> Hi,
> It's ok that I see "mm/pgtable-generic.c:53: bad pmd
> (____ptrval____)(8000000100077061)" every boot time?
> Unfortunately bisect couldn't say which of commits
> # possible first bad commit:
> [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk:
> walk_pte_range() allow for pte_offset_map()
> # possible first bad commit:
> [7780d04046a2288ab85d88bedacc60fa4fad9971] mm/pagewalkers:
> ACTION_AGAIN if pte_offset_map_lock() fails
> # possible first bad commit:
> [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped:
> pte_offset_map_nolock() not pte_lockptr()
> # possible first bad commit:
> [90f43b0a13cddb09e2686f4d976751c0a9b8b197] mm/page_vma_mapped:
> reformat map_pte() with less indentation
> # possible first bad commit:
> [45fe85e9811ede2d65b21724cae50d6a0563e452] mm/page_vma_mapped: delete
> bogosity in page_vma_mapped_walk()
> # possible first bad commit:
> [65747aaf42b7db6acb8e57a2b8e9959928f404dd] mm/filemap: allow
> pte_offset_map_lock() to fail
> # possible first bad commit:
> [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow
> pte_offset_map[_lock]() to fail
> definitely first bad because my machine on which I am was doing
> bisection is unbootable on these commits.
> I hope Hugh Dickins can figure out what's going on here. He is the
> author of these commits.
>
> All mine machines are based on the AMD platform two 7950X and one 5900HX.
>
> It seems that this message is harmless for the system in any way, but
> I can't judge it is a bug or not.
> From the user side it looks like regression because on commit
> 46c475bd676bb05077c8a38b37f175552f035406 this message was absent.

What are you doing on your system that leads to this regression?

Anyway, I'm adding this regression to be tracked with regzbot:

#regzbot ^introduced: 0d940a9b270b92
#regzbot title: undescribed regression due to allowing failing pte_offset_map[_lock]()

Thanks.

--
An old man doll... just what I always wanted! - Clara



2023-07-16 04:25:54

by Hugh Dickins

Subject: Re: [bug/bisected] I see "mm/pgtable-generic.c:53: bad pmd (____ptrval____)(8000000100077061)" every boot time

On Sat, 15 Jul 2023, Mikhail Gavrilov wrote:

> Hi,
> It's ok that I see "mm/pgtable-generic.c:53: bad pmd
> (____ptrval____)(8000000100077061)" every boot time?

Many thanks for reporting, Mike. No, I wouldn't call that ok at all.
Though I've more research to do before I can tell how much it matters.

> Unfortunately bisect couldn't say which of commits
> # possible first bad commit:
> [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk:
> walk_pte_range() allow for pte_offset_map()
> # possible first bad commit:
> [7780d04046a2288ab85d88bedacc60fa4fad9971] mm/pagewalkers:
> ACTION_AGAIN if pte_offset_map_lock() fails
> # possible first bad commit:
> [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped:
> pte_offset_map_nolock() not pte_lockptr()
> # possible first bad commit:
> [90f43b0a13cddb09e2686f4d976751c0a9b8b197] mm/page_vma_mapped:
> reformat map_pte() with less indentation
> # possible first bad commit:
> [45fe85e9811ede2d65b21724cae50d6a0563e452] mm/page_vma_mapped: delete
> bogosity in page_vma_mapped_walk()
> # possible first bad commit:
> [65747aaf42b7db6acb8e57a2b8e9959928f404dd] mm/filemap: allow
> pte_offset_map_lock() to fail
> # possible first bad commit:
> [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow
> pte_offset_map[_lock]() to fail
> definitely first bad because my machine on which I am was doing
> bisection is unbootable on these commits.
> I hope Hugh Dickins can figure out what's going on here. He is the
> author of these commits.

And thanks for the patient bisecting. Yes, it will be 0d940a9b270b
which introduced the unexpected problem, then be872f83bf5 which fixed
the unbootability aspect (that's right, isn't it? with be872f83bf5 in,
your machine booted ok? but in between it was unbootable).

Very useful info, since it narrowed the symptom down to users of
that pagewalker, before it was allowing for NULL from pte_offset_map()
(we were not expecting ever to hit a bad pmd in normal circumstances).
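
The point about callers having to allow for NULL from pte_offset_map() can be illustrated with a minimal user-space sketch. This is a toy model, not kernel code: `struct toy_pmd`, `toy_pte_offset_map()` and `toy_walk()` are hypothetical names invented for illustration, and the real pagewalker is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTES_PER_TABLE 4

/* Hypothetical stand-in for a pmd entry; not a kernel type. */
struct toy_pmd {
	int bad;                              /* nonzero if the entry fails its sanity check */
	uint64_t ptes[TOY_PTES_PER_TABLE];    /* toy page table the entry points to */
};

/* Models a pte_offset_map()-style helper that is allowed to fail. */
static uint64_t *toy_pte_offset_map(struct toy_pmd *pmd)
{
	if (pmd->bad)
		return NULL;    /* caller must cope with this */
	return pmd->ptes;
}

/* Walks n toy pmd entries and returns how many ptes were visited. */
static size_t toy_walk(struct toy_pmd *pmds, size_t n)
{
	size_t visited = 0;

	assert(pmds != NULL);
	for (size_t i = 0; i < n; i++) {
		uint64_t *pte = toy_pte_offset_map(&pmds[i]);

		if (!pte)       /* skip the entry instead of dereferencing NULL */
			continue;
		visited += TOY_PTES_PER_TABLE;
	}
	return visited;
}
```

In this model, a walker that omitted the `if (!pte)` check would dereference NULL as soon as it met a bad entry, which is one plausible reading of why the intermediate commits were unbootable for a configuration that walks such entries at boot.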

I have now been able to reproduce such a message, by setting
CONFIG_EFI_PGT_DUMP=y - am I guessing correctly that you have that?

For now, I recommend that you leave CONFIG_EFI_PGT_DUMP unset.
I wonder how many other people have it set, but have not yet noticed
this "bad pmd" message you are reporting.

The problem comes from a confluence of surprises: the pagewalker
now makes an exception for init_mm, but efi_mm is another odd case;
and espfix sets up pmd entries in an unconventional way, which happens
to fit the "bad pmd" criterion; then the efi_mm pgt dump discovers them.
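
To make the "bad pmd" criterion concrete, here is a minimal user-space sketch that decodes the reported value 8000000100077061 using the architectural x86-64 page-table flag bits. The `toy_pmd_bad()` check is an assumption: it is a simplified model of the idea that a normal kernel page-table pointer is present, writable and dirty with nothing else set (ignoring USER and ACCESSED), and is not a copy of the kernel's actual pmd_bad().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86-64 page-table entry flag bits. */
#define X86_PRESENT   (1ULL << 0)
#define X86_RW        (1ULL << 1)
#define X86_USER      (1ULL << 2)
#define X86_ACCESSED  (1ULL << 5)
#define X86_DIRTY     (1ULL << 6)
#define X86_NX        (1ULL << 63)

/* Bits 12..51 hold the physical address of the next-level table. */
#define X86_ADDR_MASK 0x000FFFFFFFFFF000ULL

/* Returns only the flag bits of an entry (everything outside the address). */
static uint64_t toy_pmd_flags(uint64_t entry)
{
	return entry & ~X86_ADDR_MASK;
}

/*
 * Simplified model (assumption): an entry is "bad" unless its flags,
 * ignoring USER and ACCESSED, are exactly PRESENT | RW | DIRTY.
 */
static bool toy_pmd_bad(uint64_t entry)
{
	const uint64_t ignored  = X86_USER | X86_ACCESSED;
	const uint64_t expected = X86_PRESENT | X86_RW | X86_DIRTY;

	assert((expected & ignored) == 0);
	return (toy_pmd_flags(entry) & ~ignored) != expected;
}
```

Under this model the reported entry has PRESENT, ACCESSED and DIRTY set, RW clear and NX set, so it fails the check, whereas the same address with flags 0x063 would pass. A read-only, no-execute table pointer of this kind would be consistent with the description of entries set up "in an unconventional way".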

I'm not rushing to judgment on where and what the right fix will be,
that needs some reflection. And perhaps more urgent than that, is that
I got not one but 12 such messages (with 4 processors): that's another
surprise, I would have expected the condition to be cleared after the
first message (but that clearing to ruin the running of Win16 binaries).

More will follow, at lower priority; but if I'm wrong about you having
CONFIG_EFI_PGT_DUMP=y, and unsetting it hiding the issue, please speak up.

Thanks,
Hugh

>
> All mine machines are based on the AMD platform two 7950X and one 5900HX.
>
> It seems that this message is harmless for the system in any way, but
> I can't judge it is a bug or not.
> From the user side it looks like regression because on commit
> 46c475bd676bb05077c8a38b37f175552f035406 this message was absent.
>
> --
> Best Regards,
> Mike Gavrilov.

2023-07-16 10:22:04

by Mikhail Gavrilov

Subject: Re: [bug/bisected] I see "mm/pgtable-generic.c:53: bad pmd (____ptrval____)(8000000100077061)" every boot time

On Sun, Jul 16, 2023 at 7:42 AM Hugh Dickins <[email protected]> wrote:
>
> And thanks for the patient bisecting. Yes, it will be 0d940a9b270b
> which introduced the unexpected problem, then be872f83bf5 which fixed
> the unbootability aspect (that's right, isn't it? with be872f83bf5 in,
> your machine booted ok? but in between it was unbootable).

Absolutely right.

> Very useful info, since it narrowed the symptom down to users of
> that pagewalker, before it was allowing for NULL from pte_offset_map()
> (we were not expecting ever to hit a bad pmd in normal circumstances).
>
> I have now been able to reproduce such a message, by setting
> CONFIG_EFI_PGT_DUMP=y - am I guessing correctly that you have that?

Yes.
$ cat .config | grep CONFIG_EFI_PGT_DUMP
CONFIG_EFI_PGT_DUMP=y

But the Fedora distro has had this option set to "y" since 2016:
https://src.fedoraproject.org/rpms/kernel/blob/1b7eeb80190501aaf226e90e8f58f994cfc3efe0/f/kernel-x86_64-debug.config#_1293

commit 1b7eeb80190501aaf226e90e8f58f994cfc3efe0
Author: Laura Abbott <[email protected]>
Date: Thu Nov 10 10:16:25 2016 -0800

Change method of configuration generation

The existing method of managing configuration files gets unweildy.
Changing individual lines in text files gets difficult without
manual organization. Switch to a method of configuration generation
that's inspired from the method used inside Red Hat. Each configuration
option gets its own file which are then combined to form the
configuration files. This makes confirming what's actually enabled much
easier.

> For now, I recommend that you leave CONFIG_EFI_PGT_DUMP unset.
> I wonder how many other people have it set, but have not yet noticed
> this "bad pmd" message you are reporting.
>
> The problem comes from a confluence of surprises: the pagewalker
> now makes an exception for init_mm, but efi_mm is another odd case;
> and espfix sets up pmd entries in an unconventional way, which happens
> to fit the "bad pmd" criterion; then the efi_mm pgt dump discovers them.
>
> I'm not rushing to judgment on where and what the right fix will be,
> that needs some reflection. And perhaps more urgent than that, is that
> I got not one but 12 such messages (with 4 processors): that's another
> surprise, I would have expected the condition to be cleared after the
> first message (but that clearing to ruin the running of Win16 binaries).
>
> More will follow, at lower priority; but if I'm wrong about you having
> CONFIG_EFI_PGT_DUMP=y, and unsetting it hiding the issue, please speak up.

I confirm that after unsetting CONFIG_EFI_PGT_DUMP the "bad pmd" message
no longer appears.

--
Best Regards,
Mike Gavrilov.

2023-07-16 12:38:55

by Bagas Sanjaya

Subject: Re: [bug/bisected] I see "mm/pgtable-generic.c:53: bad pmd (____ptrval____)(8000000100077061)" every boot time

On Sun, Jul 16, 2023 at 10:12:13AM +0700, Bagas Sanjaya wrote:
> #regzbot ^introduced: 0d940a9b270b92
> #regzbot title: undescribed regression due to allowing failing pte_offset_map[_lock]()
>

Updating entry title (see [1] for why):

#regzbot title: CONFIG_EFI_PGT_DUMP regression due to allowing failing pte_offset_map[_lock]()

Thanks.

[1]: https://lore.kernel.org/linux-mm/CABXGCsMaMgcPskMHPL+E=cOf9YMyaSnxg2dMa2jWO7qbjZGkjQ@mail.gmail.com/

--
An old man doll... just what I always wanted! - Clara

