2023-03-24 11:24:42

by Michal Hocko

[permalink] [raw]
Subject: WARN_ON in move_normal_pmd

Hi,
our QA is regularly hitting
[ 544.198822][T20518] WARNING: CPU: 1 PID: 20518 at ../mm/mremap.c:255 move_pgt_entry+0x4c6/0x510
triggered by thp01 LTP test. This has been brought up in the past and
resulted in f81fdd0c4ab7 ("mm: document warning in move_normal_pmd() and
make it warn only once"). While it is good that the underlying problem
is understood, it doesn't seem there is enough interest to fix it
properly. Which is fair but I am wondering whether the WARN_ON gives
us anything here.

Our QA process collects all unexpected side effects of tests and a WARN*
in the log is certainly one of those. This trigger bugs which are mostly
ignored as there is no upstream fix for them. This alone is nothing
really critical but there are workloads which do run with panic on warn
configured and this issue would put the system down which is unnecessary
IMHO. Would it be sufficient to replace the WARN_ON_ONCE by
pr_warn_once?

Thanks!
--
Michal Hocko
SUSE Labs


2023-03-24 13:14:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri, Mar 24, 2023 at 12:15:24PM +0100, Michal Hocko wrote:
> Hi,
> our QA is regularly hitting
> [ 544.198822][T20518] WARNING: CPU: 1 PID: 20518 at ../mm/mremap.c:255 move_pgt_entry+0x4c6/0x510
> triggered by thp01 LTP test. This has been brought up in the past and
> resulted in f81fdd0c4ab7 ("mm: document warning in move_normal_pmd() and
> make it warn only once"). While it is good that the underlying problem
> is understood, it doesn't seem there is enough interest to fix it
> properly. Which is fair but I am wondering whether the WARN_ON gives
> us anything here.
>
> Our QA process collects all unexpected side effects of tests and a WARN*
> in the log is certainly one of those. This trigger bugs which are mostly
> ignored as there is no upstream fix for them. This alone is nothing
> really critical but there are workloads which do run with panic on warn
> configured and this issue would put the system down which is unnecessary
> IMHO. Would it be sufficient to replace the WARN_ON_ONCE by
> pr_warn_once?

What about relaxing the check to exclude temporary stack from the WARN
condition:

diff --git a/mm/mremap.c b/mm/mremap.c
index 411a85682b58..eb0778b9d645 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -247,15 +247,12 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
* of any 4kB pages, but still there) PMD in the page table
* tree.
*
- * Warn on it once - because we really should try to figure
- * out how to do this better - but then say "I won't move
- * this pmd".
- *
- * One alternative might be to just unmap the target pmd at
- * this point, and verify that it really is empty. We'll see.
+ * Warn on it once unless it is initial (temporary) stack.
*/
- if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
+ if (!pmd_none(*new_pmd)) {
+ WARN_ON_ONCE(!vma_is_temporary_stack(vma));
return false;
+ }

/*
* We don't have to worry about the ordering of src and dst

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-03-24 13:50:49

by Joel Fernandes

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri, Mar 24, 2023 at 9:05 AM Kirill A. Shutemov <[email protected]> wrote:
>
> On Fri, Mar 24, 2023 at 12:15:24PM +0100, Michal Hocko wrote:
> > Hi,
> > our QA is regularly hitting
> > [ 544.198822][T20518] WARNING: CPU: 1 PID: 20518 at ../mm/mremap.c:255 move_pgt_entry+0x4c6/0x510
> > triggered by thp01 LTP test. This has been brought up in the past and
> > resulted in f81fdd0c4ab7 ("mm: document warning in move_normal_pmd() and
> > make it warn only once"). While it is good that the underlying problem
> > is understood, it doesn't seem there is enough interest to fix it
> > properly. Which is fair but I am wondering whether the WARN_ON gives
> > us anything here.
> >
> > Our QA process collects all unexpected side effects of tests and a WARN*
> > in the log is certainly one of those. This trigger bugs which are mostly
> > ignored as there is no upstream fix for them. This alone is nothing
> > really critical but there are workloads which do run with panic on warn
> > configured and this issue would put the system down which is unnecessary
> > IMHO. Would it be sufficient to replace the WARN_ON_ONCE by
> > pr_warn_once?
>
> What about relaxing the check to exclude temporary stack from the WARN
> condition:
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 411a85682b58..eb0778b9d645 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -247,15 +247,12 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> * of any 4kB pages, but still there) PMD in the page table
> * tree.
> *
> - * Warn on it once - because we really should try to figure
> - * out how to do this better - but then say "I won't move
> - * this pmd".
> - *
> - * One alternative might be to just unmap the target pmd at
> - * this point, and verify that it really is empty. We'll see.
> + * Warn on it once unless it is initial (temporary) stack.
> */
> - if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
> + if (!pmd_none(*new_pmd)) {
> + WARN_ON_ONCE(!vma_is_temporary_stack(vma));
> return false;
> + }

Wouldn't it be better to instead fix it from the caller side? Like
making it non-overlapping.

Reading some old threads, I had tried to fix it [1] along these lines
but Linus was rightfully concerned about that fix [2]. Maybe we can
revisit and fix it properly this time.

Personally I feel the safest thing to do is to not do a
non-overlapping mremap and get rid of the warning. Or is there a
better way like unmapping the target from the caller side first,
before the move?

Michal, would you also mind providing the full stack you are seeing
just so we know it is the same issue (i.e. triggered via exec)?

thanks,

- Joel

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/CAHk-=whxP0Gj70pJN5R7Qec4qjrGr+G9Ex7FJi7=_fPcdQ2ocQ@mail.gmail.com/

2023-03-24 13:53:06

by Joel Fernandes

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri, Mar 24, 2023 at 9:43 AM Joel Fernandes <[email protected]> wrote:
>
> On Fri, Mar 24, 2023 at 9:05 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Fri, Mar 24, 2023 at 12:15:24PM +0100, Michal Hocko wrote:
> > > Hi,
> > > our QA is regularly hitting
> > > [ 544.198822][T20518] WARNING: CPU: 1 PID: 20518 at ../mm/mremap.c:255 move_pgt_entry+0x4c6/0x510
> > > triggered by thp01 LTP test. This has been brought up in the past and
> > > resulted in f81fdd0c4ab7 ("mm: document warning in move_normal_pmd() and
> > > make it warn only once"). While it is good that the underlying problem
> > > is understood, it doesn't seem there is enough interest to fix it
> > > properly. Which is fair but I am wondering whether the WARN_ON gives
> > > us anything here.
> > >
> > > Our QA process collects all unexpected side effects of tests and a WARN*
> > > in the log is certainly one of those. This trigger bugs which are mostly
> > > ignored as there is no upstream fix for them. This alone is nothing
> > > really critical but there are workloads which do run with panic on warn
> > > configured and this issue would put the system down which is unnecessary
> > > IMHO. Would it be sufficient to replace the WARN_ON_ONCE by
> > > pr_warn_once?
> >
> > What about relaxing the check to exclude temporary stack from the WARN
> > condition:
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 411a85682b58..eb0778b9d645 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -247,15 +247,12 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > * of any 4kB pages, but still there) PMD in the page table
> > * tree.
> > *
> > - * Warn on it once - because we really should try to figure
> > - * out how to do this better - but then say "I won't move
> > - * this pmd".
> > - *
> > - * One alternative might be to just unmap the target pmd at
> > - * this point, and verify that it really is empty. We'll see.
> > + * Warn on it once unless it is initial (temporary) stack.
> > */
> > - if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
> > + if (!pmd_none(*new_pmd)) {
> > + WARN_ON_ONCE(!vma_is_temporary_stack(vma));
> > return false;
> > + }
>
> Wouldn't it be better to instead fix it from the caller side? Like
> making it non-overlapping.
>
> Reading some old threads, I had tried to fix it [1] along these lines
> but Linus was rightfully concerned about that fix [2]. Maybe we can
> revisit and fix it properly this time.
>
> Personally I feel the safest thing to do is to not do a
> non-overlapping mremap and get rid of the warning. Or is there a [..]

Aargh, I meant "not do an overlapping mremap", instead of "not do a
non-overlapping mremap". :-)

Thanks.

2023-03-24 13:57:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri, Mar 24, 2023 at 09:43:10AM -0400, Joel Fernandes wrote:
> On Fri, Mar 24, 2023 at 9:05 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Fri, Mar 24, 2023 at 12:15:24PM +0100, Michal Hocko wrote:
> > > Hi,
> > > our QA is regularly hitting
> > > [ 544.198822][T20518] WARNING: CPU: 1 PID: 20518 at ../mm/mremap.c:255 move_pgt_entry+0x4c6/0x510
> > > triggered by thp01 LTP test. This has been brought up in the past and
> > > resulted in f81fdd0c4ab7 ("mm: document warning in move_normal_pmd() and
> > > make it warn only once"). While it is good that the underlying problem
> > > is understood, it doesn't seem there is enough interest to fix it
> > > properly. Which is fair but I am wondering whether the WARN_ON gives
> > > us anything here.
> > >
> > > Our QA process collects all unexpected side effects of tests and a WARN*
> > > in the log is certainly one of those. This trigger bugs which are mostly
> > > ignored as there is no upstream fix for them. This alone is nothing
> > > really critical but there are workloads which do run with panic on warn
> > > configured and this issue would put the system down which is unnecessary
> > > IMHO. Would it be sufficient to replace the WARN_ON_ONCE by
> > > pr_warn_once?
> >
> > What about relaxing the check to exclude temporary stack from the WARN
> > condition:
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 411a85682b58..eb0778b9d645 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -247,15 +247,12 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > * of any 4kB pages, but still there) PMD in the page table
> > * tree.
> > *
> > - * Warn on it once - because we really should try to figure
> > - * out how to do this better - but then say "I won't move
> > - * this pmd".
> > - *
> > - * One alternative might be to just unmap the target pmd at
> > - * this point, and verify that it really is empty. We'll see.
> > + * Warn on it once unless it is initial (temporary) stack.
> > */
> > - if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
> > + if (!pmd_none(*new_pmd)) {
> > + WARN_ON_ONCE(!vma_is_temporary_stack(vma));
> > return false;
> > + }
>
> Wouldn't it be better to instead fix it from the caller side? Like
> making it non-overlapping.
>
> Reading some old threads, I had tried to fix it [1] along these lines
> but Linus was rightfully concerned about that fix [2]. Maybe we can
> revisit and fix it properly this time.
>
> Personally I feel the safest thing to do is to not do a
> non-overlapping mremap and get rid of the warning. Or is there a
> better way like unmapping the target from the caller side first,
> before the move?

Making it non-overlapping limits randomization effectiveness. We need to
quantify it at least.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-03-24 14:47:14

by Michal Hocko

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri 24-03-23 09:43:10, Joel Fernandes wrote:
[...]
> Michal, would you also mind providing the full stack you are seeing
> just so we know it is the same issue (i.e. triggered via exec)?

Yes, it is the exec path
[ 544.236782][T20518] Call Trace:
[ 544.237467][T20518] <TASK>
[ 544.238034][T20518] move_page_tables+0x8e4/0x930
[ 544.238039][T20518] shift_arg_pages+0xb5/0x180
[ 544.238044][T20518] ? mprotect_fixup+0x1ce/0x2e0
[ 544.238046][T20518] setup_arg_pages+0x256/0x2b0
[ 544.238049][T20518] load_elf_binary+0x375/0x15d0
[ 544.238052][T20518] ? __kernel_read+0x2e8/0x310
[ 544.238056][T20518] ? __kernel_read+0x2e8/0x310
[ 544.238059][T20518] bprm_execve+0x2ec/0x650
[ 544.238062][T20518] do_execveat_common.isra.42+0x1ed/0x230
[ 544.238064][T20518] __x64_sys_execve+0x32/0x40
[ 544.238067][T20518] do_syscall_64+0x5b/0x80
[ 544.238072][T20518] ? __SCT__preempt_schedule_notrace+0x8/0x8
[ 544.238075][T20518] ? __SCT__preempt_schedule_notrace+0x8/0x8
[ 544.238076][T20518] ? exit_to_user_mode_prepare+0xfc/0x230
[ 544.250270][T20518] ? syscall_exit_to_user_mode+0x18/0x40
[ 544.250292][T20518] ? do_syscall_64+0x67/0x80
[ 544.250294][T20518] ? exc_page_fault+0x67/0x150
[ 544.250297][T20518] entry_SYSCALL_64_after_hwframe+0x61/0xcb
--
Michal Hocko
SUSE Labs

2023-03-24 23:39:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Fri, Mar 24, 2023 at 6:43 AM Joel Fernandes <[email protected]> wrote:
>
> Wouldn't it be better to instead fix it from the caller side? Like
> making it non-overlapping.

I wonder if we could just do something like this in mremap() instead

- if old/new are mutually PMD_ALIGNED

- *and* there is no vma below new within the same PMD

- then just expand the mremap to be PMD-aligned downwards

IOW, the problem with the exec stack moving case isn't really that
it's overlapping: that part is fine. We're moving downwards, and we
start from the bottom, so the moving part works fine.

No, the problem is that we *start* by moving individual pages, and
then by the time we've a few pages down by a whole PMD, we finish the
source PMD (and we've cleared all the contents of it), but it still
exists.

And at *that* point, when we go and start copying the next page, we're
suddenly fully PMD-aligned, and now we try to copy a whole PMD, and
then that code is unhappy about the fact that the old (empty) PMD is
there in the target.

And for all of this to happen, we need to move things by an exact
multiple of PMD size, because otherwise we'd never get to that aligned
situation at all, and we'd always do all the movement in individual
pages, and everything would be just fine.

And more importantly, if we had just *started* with moving a whole
PMD, this also wouldn't have happened. But we didn't. We started
moving individual pages.

So you could see the warning not as a "this range overlaps" warning
(it's fine, and happens all the time, and we do individual pages that
way quite happily), but really as a "hey, this was very inefficient -
you shouldn't have done those individual pages as several small
independent invidual pages in the first place" warning.

So some kind of

/* Is the movement mutually PMD-aligned? */
if ((old_addr ^ new_addr) & ~PMD_MASK == 0) {
.. try to extend the move_vma() down to the *aligned*
PMD case ..
}

logic in move_page_tables() would get rid of the warning, and would
make the move more efficient since you'd skip the "move individual
pages and allocate a new PMD" case entirely.

This is all fairly com,plicated, and the "try to extend the move
range" would also have to depend on CONFIG_HAVE_MOVE_PMD etc, so I'm
not saying it's trivial.

But it would seem to be a really nice optimization, in addition to
getting rid of the warning.

It could even help real world cases outside of this odd stack
remapping case if users ever end up moving vma's by multiples of
PMD_SIZE, and there aren't other vma's around the source/target that
disable the optimization.

Hmm? Anybody want to look into that? It looks hairy enough that I
think that "you could test this with mutually aligned mremap()
source/targets in some test program" would be a good thing. Because
the pure execve() case is rare enough that using *that* as a test-case
seems like a fool's errand.

(To make things very clear: the important part is that the source and
targets aren't *actually* PMD-aligned, just mutually aligned so that
you *can* do the mremap() by just moving whole PMD's around)

Linus

2023-03-25 16:36:52

by Joel Fernandes

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

Hi Linus,

On Fri, Mar 24, 2023 at 04:38:03PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 6:43 AM Joel Fernandes <[email protected]> wrote:
> >
> > Wouldn't it be better to instead fix it from the caller side? Like
> > making it non-overlapping.
>
> I wonder if we could just do something like this in mremap() instead
>
> - if old/new are mutually PMD_ALIGNED
>
> - *and* there is no vma below new within the same PMD
>
> - then just expand the mremap to be PMD-aligned downwards
>
> IOW, the problem with the exec stack moving case isn't really that
> it's overlapping: that part is fine. We're moving downwards, and we
> start from the bottom, so the moving part works fine.
>
> No, the problem is that we *start* by moving individual pages, and
> then by the time we've a few pages down by a whole PMD, we finish the
> source PMD (and we've cleared all the contents of it), but it still
> exists.
>
> And at *that* point, when we go and start copying the next page, we're
> suddenly fully PMD-aligned, and now we try to copy a whole PMD, and
> then that code is unhappy about the fact that the old (empty) PMD is
> there in the target.
>

You are very right. I am able to also trigger the warning with a synthetic
program that just mremaps a range and moves it in the same way that triggers
this issue, however I had to hack the kernel to prevent mremap from erroring
out if ranges overlap (unlike exec, mremap does some initial checks for
that). Also had to do other hacks but I did reproduce it consistently.

The issue is that even though the PMD is empty, it is allocated. So
pmd_none() is kind of a lie in some sense, it is pointing to empty PTEs when
the range is really empty.

How about we replace the warning with something like the following?

+ if (unlikely(!pmd_none(*new_pmd))) {
+ // Check if any ptes in the pmd are non-empty. Doing this here
+ // is ok since this is not a fast path.
+ bool pmd_empty = true;
+ unsigned long tmp_addr = new_addr;
+ pte_t* check_pte = pte_offset_map(new_pmd, new_addr);
+
+ new_ptl = pte_lockptr(mm, new_pmd);
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+ for (; tmp_addr < old_addr + PMD_SIZE; check_pte++, tmp_addr += PAGE_SIZE) {
+ if (!pte_none(*check_pte)) {
+ pmd_empty = false;
+ break;
+ }
+ }

+ WARN_ON_ONCE(!pmd_empty);
+ spin_unlock(new_ptl);
+ }
+

> And for all of this to happen, we need to move things by an exact
> multiple of PMD size, because otherwise we'd never get to that aligned
> situation at all, and we'd always do all the movement in individual
> pages, and everything would be just fine.
>
> And more importantly, if we had just *started* with moving a whole
> PMD, this also wouldn't have happened. But we didn't. We started
> moving individual pages.
>
> So you could see the warning not as a "this range overlaps" warning
> (it's fine, and happens all the time, and we do individual pages that
> way quite happily), but really as a "hey, this was very inefficient -
> you shouldn't have done those individual pages as several small
> independent invidual pages in the first place" warning.
>

Exactly.

> So some kind of
>
> /* Is the movement mutually PMD-aligned? */
> if ((old_addr ^ new_addr) & ~PMD_MASK == 0) {
> .. try to extend the move_vma() down to the *aligned*
> PMD case ..
> }
>

I actually didn't follow what you meant by "mutually PMD-aligned". Could you
provide some example address numbers to explain?

AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD.
Otherwise how will you index the bytes in the 2MB page? You need bits in the
address for that.

> logic in move_page_tables() would get rid of the warning, and would
> make the move more efficient since you'd skip the "move individual
> pages and allocate a new PMD" case entirely.
>
> This is all fairly com,plicated, and the "try to extend the move
> range" would also have to depend on CONFIG_HAVE_MOVE_PMD etc, so I'm
> not saying it's trivial.
>
> But it would seem to be a really nice optimization, in addition to
> getting rid of the warning.
>
> It could even help real world cases outside of this odd stack
> remapping case if users ever end up moving vma's by multiples of
> PMD_SIZE, and there aren't other vma's around the source/target that
> disable the optimization.
>
> Hmm? Anybody want to look into that? It looks hairy enough that I
> think that "you could test this with mutually aligned mremap()
> source/targets in some test program" would be a good thing. Because
> the pure execve() case is rare enough that using *that* as a test-case
> seems like a fool's errand.
>

Just to mention, mremap errors out if you try to move overlapping ranges
because this in mremap_to():

/* Ensure the old/new locations do not overlap
if (addr + old_len > new_addr && new_addr + new_len > addr) {
pr_err("%s: (%s) (%d)", __func__, __FILE__, __LINE__);
goto out;
}

Or is there an mremap trick I might be missing which actually allows
overlapping range moves?

thanks,

- Joel


2023-03-25 16:53:04

by Joel Fernandes

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Sat, Mar 25, 2023 at 12:33 PM Joel Fernandes <[email protected]> wrote:
>
> Hi Linus,
>
> On Fri, Mar 24, 2023 at 04:38:03PM -0700, Linus Torvalds wrote:
> > On Fri, Mar 24, 2023 at 6:43 AM Joel Fernandes <[email protected]> wrote:
> > >
> > > Wouldn't it be better to instead fix it from the caller side? Like
> > > making it non-overlapping.
> >
> > I wonder if we could just do something like this in mremap() instead
> >
> > - if old/new are mutually PMD_ALIGNED
> >
> > - *and* there is no vma below new within the same PMD
> >
> > - then just expand the mremap to be PMD-aligned downwards
> >
> > IOW, the problem with the exec stack moving case isn't really that
> > it's overlapping: that part is fine. We're moving downwards, and we
> > start from the bottom, so the moving part works fine.
> >
> > No, the problem is that we *start* by moving individual pages, and
> > then by the time we've a few pages down by a whole PMD, we finish the
> > source PMD (and we've cleared all the contents of it), but it still
> > exists.
> >
> > And at *that* point, when we go and start copying the next page, we're
> > suddenly fully PMD-aligned, and now we try to copy a whole PMD, and
> > then that code is unhappy about the fact that the old (empty) PMD is
> > there in the target.
> >
>
> You are very right. I am able to also trigger the warning with a synthetic
> program that just mremaps a range and moves it in the same way that triggers
> this issue, however I had to hack the kernel to prevent mremap from erroring
> out if ranges overlap (unlike exec, mremap does some initial checks for
> that). Also had to do other hacks but I did reproduce it consistently.
>
> The issue is that even though the PMD is empty, it is allocated. So
> pmd_none() is kind of a lie in some sense, it is pointing to empty PTEs when
> the range is really empty.
>
> How about we replace the warning with something like the following?
>
> + if (unlikely(!pmd_none(*new_pmd))) {
> + // Check if any ptes in the pmd are non-empty. Doing this here
> + // is ok since this is not a fast path.
> + bool pmd_empty = true;
> + unsigned long tmp_addr = new_addr;
> + pte_t* check_pte = pte_offset_map(new_pmd, new_addr);
> +
> + new_ptl = pte_lockptr(mm, new_pmd);
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> + for (; tmp_addr < old_addr + PMD_SIZE; check_pte++, tmp_addr += PAGE_SIZE) {

Apologies, here I was going for "tmp_addr < new_addr + PMD_SIZE". I
made the change and it still works (This is just to show the basic
idea, I am still testing it).

thanks,

- Joel


> + if (!pte_none(*check_pte)) {
> + pmd_empty = false;
> + break;
> + }
> + }
>
> + WARN_ON_ONCE(!pmd_empty);
> + spin_unlock(new_ptl);
> + }
> +
>
> > And for all of this to happen, we need to move things by an exact
> > multiple of PMD size, because otherwise we'd never get to that aligned
> > situation at all, and we'd always do all the movement in individual
> > pages, and everything would be just fine.
> >
> > And more importantly, if we had just *started* with moving a whole
> > PMD, this also wouldn't have happened. But we didn't. We started
> > moving individual pages.
> >
> > So you could see the warning not as a "this range overlaps" warning
> > (it's fine, and happens all the time, and we do individual pages that
> > way quite happily), but really as a "hey, this was very inefficient -
> > you shouldn't have done those individual pages as several small
> > independent invidual pages in the first place" warning.
> >
>
> Exactly.
>
> > So some kind of
> >
> > /* Is the movement mutually PMD-aligned? */
> > if ((old_addr ^ new_addr) & ~PMD_MASK == 0) {
> > .. try to extend the move_vma() down to the *aligned*
> > PMD case ..
> > }
> >
>
> I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> provide some example address numbers to explain?
>
> AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD.
> Otherwise how will you index the bytes in the 2MB page? You need bits in the
> address for that.
>
> > logic in move_page_tables() would get rid of the warning, and would
> > make the move more efficient since you'd skip the "move individual
> > pages and allocate a new PMD" case entirely.
> >
> > This is all fairly com,plicated, and the "try to extend the move
> > range" would also have to depend on CONFIG_HAVE_MOVE_PMD etc, so I'm
> > not saying it's trivial.
> >
> > But it would seem to be a really nice optimization, in addition to
> > getting rid of the warning.
> >
> > It could even help real world cases outside of this odd stack
> > remapping case if users ever end up moving vma's by multiples of
> > PMD_SIZE, and there aren't other vma's around the source/target that
> > disable the optimization.
> >
> > Hmm? Anybody want to look into that? It looks hairy enough that I
> > think that "you could test this with mutually aligned mremap()
> > source/targets in some test program" would be a good thing. Because
> > the pure execve() case is rare enough that using *that* as a test-case
> > seems like a fool's errand.
> >
>
> Just to mention, mremap errors out if you try to move overlapping ranges
> because this in mremap_to():
>
> /* Ensure the old/new locations do not overlap
> if (addr + old_len > new_addr && new_addr + new_len > addr) {
> pr_err("%s: (%s) (%d)", __func__, __FILE__, __LINE__);
> goto out;
> }
>
> Or is there an mremap trick I might be missing which actually allows
> overlapping range moves?
>
> thanks,
>
> - Joel
>
>

2023-03-25 17:17:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Sat, Mar 25, 2023 at 9:33 AM Joel Fernandes <[email protected]> wrote:
>
> I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> provide some example address numbers to explain?

Sure, let me make this more clear with a couple of concrete examples.

Let's say that we have a range '[old, old+len]' and we want to remap
it to '[new, new+len]'.

Furthermore, we'll say that overlapping is fine, but because we're
always moving pages from "low address to high", we only allow
overlapping when we're moving things down (ie 'new < old').

And yes, I know that the overlapping case cannot actually happen with
mremap() itself. So in practice the overlapping case only happens for
the special "move the stack pages" around at execve() startup, but
let's ignore that for now.

So we'll talk about the generic "move pages around" case, not the more
limited 'mremap()' case.

I'll also simplify the thing to just assume that we have that
CONFIG_HAVE_MOVE_PMD enabled, so I'll ignore some of the full grotty
details.

Ok?

So let's take the simple case first:

(a) everything is PMD aligned, so all of old, new, and len are
multiples of the PMD size.

This is the simple case, because we can just do the whole move by
moving the PMD entries, and we'll never hit any issues. As we move the
PMD entries in move_normal_pmd(), we always remove the entry we are
moving down:

/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);

so as we move up in the destination, we will never hit a populated PMD
entry as we walk the page tables.

So (a) is fine today, everybody is happy, we have no issues. It's
efficient and simple.

The other simple case is

(b) we're *not* PMD-aligned, and everything is done one page at a time.

This is basically the exact same thing as (a), except rather than move
the PMD entry around, we move a PTE entry, and we do the exact same
"clear the entry as we move it" in move_ptes(), except (due to our
locking rules being different), it now looks like this instead:

pte = ptep_get_and_clear(mm, old_addr, old_pte);

but otherwise it's all the same as (a), just one level lower.

But then we have case (c): the "mutually aligned" case:

> AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD.
> Otherwise how will you index the bytes in the 2MB page? You need bits in the
> address for that.

The thing is, the problematic case is when 'old' and 'new' are not
initially themselves PMD-aligned (so they have lower bits set), but
they are *mutually* aligned, so they have the *same* lower bits set.

So in this case (c), we start off with case (b), but as we walk the
address one page at a time, at some point we suddenly hit case (a) in
the middle.

To make this really concrete, lets say that we have

old = 0x1fff000
new = 0x0fff000
len = 0x201000

and we have PMD_SIZE being 0x200000.

Notice how 'old' and 'new' are *not* PMD-aligned, but they are
*mutually* aligned in the PMD size (ie "old & (PMD_SIZE-1)" and "new &
(PMD_SIZE-1)" are the same).

Look at what happens: we start with the unaligned case, and we move
one single page, from address 0x1fff000 to 0x0fff000.

But what happens after we moved that page? We'll update old/new/len,
and now, we'll have

old = 0x2000000
new = 0x1000000
len = 0x200000

and now evertthing *is* PMD-aligned, and we just move a single PMD
entry from 0x2000000 to 0x1000000. And then we're done.

And that's fine, and won't cause any problems, because the area wasn't
overlapping at a PMD level, so we're all good.

But it it *was* overlapping, and we started out with

old = 0x1fff000
new = 1dff000
len = 0x201000

instead, we start out with the exact same thing, but after moving one
page, we'll have

old = 0x2000000
new = 0x1e00000
len = 0x200000

and now when we try to move the PMD entry from 0x2000000 to 0x1e00000,
we'll hit that *old* (and empty) PMD entry that used to contain that
one single page that we moved earlier.

And notice how for this all to happen, the old/new addresses have to
start out mutually aligned. So we can see the dangerous case up-front:
even if old and new aren't PMD-aligned to start with, you can see the
low bits are the same:

(old ^ new) & (PMD_SIZE-1) == 0

because as we move pages around, we'll always be updating old/new
together by the same amount.

So what I'm saying is that *if* we start out with that situation, and
we have that

old = 0x1fff000
new = 1dff000
len = 0x201000

we could easily decode "let's just move the whole PMD", and expand the
move to be

old = 0x1e00000
new = 0x1c00000
len = 0x400000

instead. And then instead of moving PTE's around at first, we'd move
PMD's around *all* the time, and turn this into that "simple case
(a)".

NOTE! For this to work, there must be no mapping right below 'old' or
'new', of course. But during the execve() startup, that should be
trivially true.

See what I'm saying?

Linus

2023-03-25 17:34:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Sat, Mar 25, 2023 at 10:06 AM Linus Torvalds
<[email protected]> wrote:
>
> So what I'm saying is that *if* we start out with that situation, and
> we have that
>
> old = 0x1fff000
> new = 1dff000
> len = 0x201000
>
> we could easily decode "let's just move the whole PMD", and expand the
> move to be
>
> old = 0x1e00000
> new = 0x1c00000
> len = 0x400000
>
> instead. And then instead of moving PTE's around at first, we'd move
> PMD's around *all* the time, and turn this into that "simple case
> (a)".
>
> NOTE! For this to work, there must be no mapping right below 'old' or
> 'new', of course. But during the execve() startup, that should be
> trivially true.
>
> See what I'm saying?

Also note that my comments about "this can be tested with mremap()"
are because the above optimization works and is valid even when old
and new are not originally overlapping, but they overlap after the
expansion.

IOW, imagine that you have a 2GB mapping, but it is not 2GB-aligned
virtually, and you want to move that mapping down by 2GB.

Now, because that 2GB mapping is *not* 2GB-aligned, it actually takes
up *two* PMD entries. But if that mapping is the only thing that
exists in those two PMD entries, and the PMD entry below it is clear
(because there is no mapping right below the new address), then we can
still do that unaligned 2GB mapping move entirely at the PMD level.

So instead of wasting time to move it one page at a time (until it is
2GB aligned), we could just move two PMD entries around.

Here's a (UNTESTED! It compiles, but that's it) user test-case for
this situation:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

/* Pick some random 2GB-aligned address that isn't near anything else */
#define GB (1ul << 20)
#define VA ((void *)(128 * GB))

#define old (VA+GB)
#define new (VA-GB)
#define len (2*GB)

int main(int argc, char **argv)
{
void *addr;

addr = mmap(old, len,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED,
-1, 0);
memset(addr, 0xff, len);
mremap(old, len, len,
MREMAP_MAYMOVE | MREMAP_FIXED, new);
return 0;
}

and I claim that that mremap() right now ends up doing the whole 2GB
page table move one page at a time, but it *should* be doable as just
two PMD entry moves.

See?

Linus

2023-03-26 02:27:41

by Joel Fernandes

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Sat, Mar 25, 2023 at 10:06:59AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2023 at 9:33 AM Joel Fernandes <[email protected]> wrote:
> >
> > I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> > provide some example address numbers to explain?
>
> Sure, let me make this more clear with a couple of concrete examples.
>
> Let's say that we have a range '[old, old+len]' and we want to remap
> it to '[new, new+len]'.
>
> Furthermore, we'll say that overlapping is fine, but because we're
> always moving pages from "low address to high", we only allow
> overlapping when we're moving things down (ie 'new < old').
>
> And yes, I know that the overlapping case cannot actually happen with
> mremap() itself. So in practice the overlapping case only happens for
> the special "move the stack pages" around at execve() startup, but
> let's ignore that for now.
>
> So we'll talk about the generic "move pages around" case, not the more
> limited 'mremap()' case.
>
> I'll also simplify the thing to just assume that we have that
> CONFIG_HAVE_MOVE_PMD enabled, so I'll ignore some of the full grotty
> details.
>
> Ok?
[...]
>
> we could easily decode "let's just move the whole PMD", and expand the
> move to be
>
> old = 0x1e00000
> new = 0x1c00000
> len = 0x400000
> instead. And then instead of moving PTE's around at first, we'd move
> PMD's around *all* the time, and turn this into that "simple case
> (a)".

Right, I totally get what you mean. You want to move more than the 4k pages
in the beginning of the mapping. In fact the whole PMD, which extends further
below the destination to capture the full PMD that the first 4k pages are
located in. With that you get to just move PMDs purely all the way.

I think that is a great idea.

> NOTE! For this to work, there must be no mapping right below 'old' or
> 'new', of course. But during the execve() startup, that should be
> trivially true.

Exactly it wont work if there is something below old or new.

So for that very reason, we still have to handle the bad case where the
source PMD was not deleted right? Because if there is something below new,
you'll need to copy 1 PTE at a time till you hit the 2MB boundary, because
you can't mess with that source PMD, it is in use to satisfy mappings below
new. Then you'll eventually hit the warning we are discussing. I guess even
if one can assure that there is no mapping below new for the execve() case,
it still cannot be guaranteed for the mremap() case I think.

But I agree, if there is no mapping below old/new, then we can just do this
as an optimization. I think all that is needed to do is to check whether
there are any VMAs at those locations, but correct me if I'm wrong as I'm not
an mm expert.

> See what I'm saying?

Yep.

And as you pointed out in the mremap example, this issue can also show up
with non-overlapping ranges if I'm not mistaken.

I get your idea. Allow me to digest all this a bit more, and since it is not
urgent and this stuff is going to take some careful work with proper test
cases etc, let me take this up and work on it. But your idea is loud and
clear. I am also working on sending you that RCU PR and working hard to not
screw that up so it is a bit busy :-P. And thank you again for the great
idea and discussion! Looking forward to working on this.

thanks,

- Joel

2023-03-26 23:01:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARN_ON in move_normal_pmd

On Sat, Mar 25, 2023 at 7:27 PM Joel Fernandes <[email protected]> wrote:
>
> So for that very reason, we still have to handle the bad case where the
> source PMD was not deleted right?

Well, so our rules are that if nothing is mapped in a particular page
table directory (any level), then it must be empty.

And that "must" is actually a hard requirement, because our exit path
won't even spend time tearing down page tables that don't have any
mappings in them.

So if you were to have non-empty pmd entries that don't have a vma
associated with them, there would be a memory leak, and we really
would want to warn about that case.

End result: it should be sufficient to do something like "if you don't
have a mapping below you within this PMD, you can expand the movement
down to a full PMD".

And same with the above case.

Of course, the more I think about this, the more I wonder "is this
even worth it". Because we have

(a) mremap() that can't trigger the problematic case currently
(because not overlapping), and *probably* almost never would trigger
the optimization of widening the move in practice.

(b) setup_arg_pages() will probably almost never triggers the
problematic case in practice, since you'd have to shift the pages by
*just* the right amount

so in the end, maybe the "real fix" is to just say "none of this
matters, let's just remove the warning".

An alternative "real fix" might even be to just say "just don't shift
the stack by exactly a PMD". It's unlikely to happen anyway, it's not
worth optimizing for, so just make sure it doesn't happen.

IOW, another alternative could be something like this:

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -783,7 +783,14 @@ int setup_arg_pages(struct linux_binprm *bprm,
unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
return -ENOMEM;

+ /*
+ * Shift the stack up, but avoid shifting by
+ * exactly a PMD size, which causes issues
+ * when mixing page-sized and pmd-sized moves.
+ */
stack_shift = vma->vm_end - stack_top;
+ if (stack_shift && !(stack_shift & ~PMD_MASK))
+ stack_shift -= PAGE_SIZE;

bprm->p -= stack_shift;
mm->arg_start = bprm->p;

which is *really* quite ugly, and only handles the stack-grows-down
case, and I'm not proud of it, and the above is also entirely
untested.

I will delete that patch from my system after sending out this email,
and disavow any knowledge of that horrendously ugly hack. But if
somebody else takes ownership of it and I won't be blamed for it, I
would probably accept it as a solution.

Shudder. That's nasty.

Linus