2023-01-25 22:54:08

by Zach O'Keefe

Subject: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

In commit 34488399fa08 ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():

-       if (!pmd_present(pmde))
-               return SCAN_PMD_NULL;
+       if (pmd_none(pmde))
+               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
might identify a pte-mapped hugepage, only to have khugepaged race in, free
the pte table, and clear the pmd. Such codepaths include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the existing
pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.

However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds. Swap entries are one example of what could leak
through. The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
their treatment by the various archs.
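
To make the hole concrete, here is a small userspace sketch (not kernel
code) of the check order with and without the !pmd_present() test. The
names toy_pmd, toy_make(), toy_classify() and the TOY_* results are
hypothetical stand-ins, and each flag simply records what the
corresponding kernel predicate is assumed to report for one pmd value.
It takes as its premise a !none, !present, !huge, !bad entry, as
described above:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for one pmd value. Each flag is what the named kernel
 * predicate is assumed to report for it. Hypothetical, not a kernel type.
 */
struct toy_pmd {
	bool none;
	bool present;
	bool trans_huge;
	bool bad;
};

enum toy_scan { TOY_SUCCEED, TOY_PMD_NULL, TOY_PMD_NONE, TOY_PMD_MAPPED };

static struct toy_pmd toy_make(bool none, bool present, bool trans_huge,
			       bool bad)
{
	struct toy_pmd p = { none, present, trans_huge, bad };

	return p;
}

/*
 * check_present == false models the order without the !pmd_present()
 * test; check_present == true puts it back, below the none check.
 */
static enum toy_scan toy_classify(struct toy_pmd p, bool check_present)
{
	if (p.none)
		return TOY_PMD_NONE;
	if (check_present && !p.present)
		return TOY_PMD_NULL;
	if (p.trans_huge)
		return TOY_PMD_MAPPED;
	if (p.bad)
		return TOY_PMD_NULL;
	return TOY_SUCCEED;
}

static void toy_self_check(void)
{
	/* Premise from the text: !none, !present, !huge, !bad. */
	struct toy_pmd stray = toy_make(false, false, false, false);
	/* A genuine pte-table-mapping pmd. */
	struct toy_pmd table = toy_make(false, true, false, false);

	/* Without the present check, the stray entry looks like a pte table. */
	assert(toy_classify(stray, false) == TOY_SUCCEED);
	/* With the check restored, it takes the failure path. */
	assert(toy_classify(stray, true) == TOY_PMD_NULL);
	/* The genuine pte-table pmd is classified the same way in both. */
	assert(toy_classify(table, false) == TOY_SUCCEED);
	assert(toy_classify(table, true) == TOY_SUCCEED);
}
```

Calling toy_self_check() from a main() exercises both orders. Whether a
given non-present entry really passes pmd_bad() is arch-specific; the
sketch only encodes the premise stated above.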

The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present()
and pmd_trans_huge() still need to return true while the pmd is in this
transitory state. For example, x86's pmd_present() also checks the
_PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64's
also checks a PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page(). The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd. So, what is the risk of this
      danger?

      The only relevant path is:

        madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed. This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap. This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
   Not possible.

4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present

I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since !pmd_present()
always takes the failure path).
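
The four x86 cases can be sanity-checked with a small userspace bit
model. This is a sketch under assumptions, not the kernel's code: the
MODEL_* bit positions are made up for illustration, model_pmd_present()
follows the description above (PRESENT, PROTNONE or PSE set), and
model_pmd_trans_huge() assumes "PSE set and DEVMAP clear", which is my
reading of the x86 helper rather than something stated in this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the real x86 layout differs. */
#define MODEL_PRESENT  (1u << 0)
#define MODEL_PROTNONE (1u << 1)
#define MODEL_PSE      (1u << 2)
#define MODEL_DEVMAP   (1u << 3)

/* Present if any of PRESENT, PROTNONE or PSE is set, as described above. */
static bool model_pmd_present(uint32_t v)
{
	return (v & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

/* Assumption: huge means PSE set and DEVMAP clear. */
static bool model_pmd_trans_huge(uint32_t v)
{
	return (v & (MODEL_PSE | MODEL_DEVMAP)) == MODEL_PSE;
}

static bool model_pmd_devmap(uint32_t v)
{
	return (v & MODEL_DEVMAP) != 0;
}

/* Returns 1..4, matching the numbered cases in the text. */
static int model_case(uint32_t v)
{
	bool present = model_pmd_present(v);
	bool huge = model_pmd_trans_huge(v);

	if (present)
		return huge ? 1 : 2;
	return huge ? 3 : 4;
}

static void model_self_check(void)
{
	uint32_t v;

	/* Case 3 never occurs for any combination of the four modelled bits. */
	for (v = 0; v < 16; v++)
		assert(model_case(v) != 3);

	/* 1a: ordinary huge-pmd. */
	assert(model_case(MODEL_PRESENT | MODEL_PSE) == 1);
	/* 1b: PRESENT cleared mid-split, PSE still set. */
	assert(model_case(MODEL_PSE) == 1);
	/* 2a: pte-table-mapping pmd. */
	assert(model_case(MODEL_PRESENT) == 2);
	/* 2b: devmap is present but not reported as trans-huge. */
	assert(model_case(MODEL_PRESENT | MODEL_PSE | MODEL_DEVMAP) == 2);
	assert(model_pmd_devmap(MODEL_PRESENT | MODEL_PSE | MODEL_DEVMAP));
	/* 4: none of PRESENT, PROTNONE, PSE set (other bits are unmodelled). */
	assert(model_case(0) == 4);
}
```

In this model, case (3) is ruled out because trans-huge requires PSE and
PSE alone already makes the entry present. Case 2b is why the patch adds
a separate pmd_devmap() check after the pmd_trans_huge() check.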

Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how
mmap_lock is held (if at all).

Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Zach O'Keefe <[email protected]>
Cc: [email protected]

---
Request that this be pulled into stable since it's theoretically
possible (though I have no reproducer) that while mmap_lock is dropped,
racing thp migration installs a pmd migration entry which then has a path to
be consumed, unchecked, by pte_offset_map().

v1 -> v2: Fix typo
---
mm/khugepaged.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9548644bdb56..1face2ae5877 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -943,6 +943,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }

+/*
+ * See pmd_trans_unstable() for how the result may change out from
+ * underneath us, even if we hold mmap_lock in read.
+ */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 				   unsigned long address,
 				   pmd_t **pmd)
@@ -961,8 +965,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 #endif
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
+	if (pmd_devmap(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
--
2.39.1.456.gfc5497dd1b-goog



2023-01-26 00:24:29

by Yang Shi

Subject: Re: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

On Wed, Jan 25, 2023 at 2:54 PM Zach O'Keefe <[email protected]> wrote:
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Reported-by: Hugh Dickins <[email protected]>
> Signed-off-by: Zach O'Keefe <[email protected]>
> Cc: [email protected]

Reviewed-by: Yang Shi <[email protected]>


2023-01-26 00:38:23

by Zach O'Keefe

Subject: Re: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

On Wed, Jan 25, 2023 at 4:24 PM Yang Shi <[email protected]> wrote:
>
> On Wed, Jan 25, 2023 at 2:54 PM Zach O'Keefe <[email protected]> wrote:
> > Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> > Reported-by: Hugh Dickins <[email protected]>
> > Signed-off-by: Zach O'Keefe <[email protected]>
> > Cc: [email protected]
>
> Reviewed-by: Yang Shi <[email protected]>

Thanks for your time as always, Yang!

Best,
Zach
