It is currently possible for a userspace application to enter a page
fault loop when using HugeTLB pages implemented with contiguous PTEs
when HAFDBS is not available. This happens because:
1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
(PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
the memory access on a system without HAFDBS, we will get a page
fault.
3. HugeTLB will check if it needs to update the dirty bits on the PTE.
For contiguous PTEs, it will check to see if the pgprot bits need
updating. In this case, HugeTLB wants to write a sequence of
sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
so it thinks no update is necessary.
Please see this[1] reproducer.
I think (though I may be wrong) that both step (1) and step (3) are
buggy.
The first patch in this series fixes step (3); instead of checking if
pte_dirty is matching in __cont_access_flags_changed, check pte_hw_dirty
and pte_sw_dirty separately.
The second patch in this series makes step (1) less likely to occur.
Without this patch, we can get the kernel to write a sw-dirty, hw-clean
PTE with the following steps (showing the relevant VMA flags and pgprot
bits):
i. Create a valid, writable contiguous PTE.
VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
VMA pgprot bits: PTE_RDONLY | PTE_WRITE
PTE pgprot bits: PTE_DIRTY | PTE_WRITE
ii. mprotect the VMA to PROT_NONE.
VMA vmflags: VM_SHARED
VMA pgprot bits: PTE_RDONLY
PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
VMA pgprot bits: PTE_RDONLY | PTE_WRITE
PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
Applying either one of the two patches in this patchset will fix the
particular issue with HugeTLB pages implemented with contiguous PTEs.
It's possible that only one of these patches should be taken, or that
the right fix is something else entirely.
[1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
James Houghton (2):
arm64: hugetlb: Distinguish between hw and sw dirtiness in
__cont_access_flags_changed
arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
arch/arm64/include/asm/pgtable.h | 6 ++++++
arch/arm64/mm/hugetlbpage.c | 5 ++++-
2 files changed, 10 insertions(+), 1 deletion(-)
base-commit: 645a9a454fdb7e698a63a275edca6a17ef97afc4
--
2.43.0.rc2.451.g8631bc7472-goog
__cont_access_flags_changed was originally introduced to avoid making
unnecessary changes to the PTEs. Consider the following case: all the
PTEs in the contiguous group have PTE_DIRTY | PTE_RDONLY | PTE_WRITE,
and we are running on a system without HAFDBS. When writing via these
PTEs, we will get a page fault, and hugetlb_fault will (rightly)
attempt to update the PTEs with PTE_DIRTY | PTE_WRITE, but, as both the
original PTEs and the new PTEs are pte_dirty(),
__cont_access_flags_changed prevents the pgprot update from occurring.
To avoid the page fault loop that we get ourselves into, distinguish
between hardware-dirty and software-dirty for this check. Non-contiguous
PTEs aren't broken in the same way, as we will always write a new PTE
unless the new PTE is exactly equal to the old one.
Fixes: 031e6e6b4e12 ("arm64: hugetlb: Avoid unnecessary clearing in huge_ptep_set_access_flags")
Signed-off-by: James Houghton <[email protected]>
Cc: <[email protected]>
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f5aae342632c..87a9564976fa 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -437,7 +437,10 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
for (i = 0; i < ncontig; i++) {
pte_t orig_pte = ptep_get(ptep + i);
- if (pte_dirty(pte) != pte_dirty(orig_pte))
+ if (pte_sw_dirty(pte) != pte_sw_dirty(orig_pte))
+ return 1;
+
+ if (pte_hw_dirty(pte) != pte_hw_dirty(orig_pte))
return 1;
if (pte_young(pte) != pte_young(orig_pte))
--
2.43.0.rc2.451.g8631bc7472-goog
Make it impossible to create a sw-dirty, hw-clean PTE with pte_modify.
Such a PTE should be impossible to create, and there may be places that
assume that pte_dirty() implies pte_hw_dirty().
Signed-off-by: James Houghton <[email protected]>
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..79ce70fbb751 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -834,6 +834,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
pte_val(pte) = (pte_val(pte) & ~mask) | (pgprot_val(newprot) & mask);
+ /*
+ * If we end up clearing hw dirtiness for a sw-dirty PTE, set hardware
+ * dirtiness again.
+ */
+ if (pte_sw_dirty(pte))
+ pte = pte_mkdirty(pte);
return pte;
}
--
2.43.0.rc2.451.g8631bc7472-goog
On 04/12/2023 17:26, James Houghton wrote:
> It is currently possible for a userspace application to enter a page
> fault loop when using HugeTLB pages implemented with contiguous PTEs
> when HAFDBS is not available. This happens because:
> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
> (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
Hi James,
Do you know how this happens?
AFAIK, this is the set of valid bit combinations, and
PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
to understand how this is happening and prevent it?
/*
* PTE bits configuration in the presence of hardware Dirty Bit Management
* (PTE_WRITE == PTE_DBM):
*
* Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw)
* 0 0 | 1 0 0
* 0 1 | 1 1 0
* 1 0 | 1 0 1
* 1 1 | 0 1 x
*
* When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
* the page fault mechanism. Checking the dirty status of a pte becomes:
*
* PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
*/
Thanks,
Ryan
> 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
> the memory access on a system without HAFDBS, we will get a page
> fault.
> 3. HugeTLB will check if it needs to update the dirty bits on the PTE.
> For contiguous PTEs, it will check to see if the pgprot bits need
> updating. In this case, HugeTLB wants to write a sequence of
> sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
> to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
> so it thinks no update is necessary.
>
> Please see this[1] reproducer.
>
> I think (though I may be wrong) that both step (1) and step (3) are
> buggy.
>
> The first patch in this series fixes step (3); instead of checking if
> pte_dirty is matching in __cont_access_flags_changed, check pte_hw_dirty
> and pte_sw_dirty separately.
>
> The second patch in this series makes step (1) less likely to occur.
> Without this patch, we can get the kernel to write a sw-dirty, hw-clean
> PTE with the following steps (showing the relevant VMA flags and pgprot
> bits):
> i. Create a valid, writable contiguous PTE.
> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> PTE pgprot bits: PTE_DIRTY | PTE_WRITE
> ii. mprotect the VMA to PROT_NONE.
> VMA vmflags: VM_SHARED
> VMA pgprot bits: PTE_RDONLY
> PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
> iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
>
> Applying either one of the two patches in this patchset will fix the
> particular issue with HugeTLB pages implemented with contiguous PTEs.
> It's possible that only one of these patches should be taken, or that
> the right fix is something else entirely.
>
> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
>
> James Houghton (2):
> arm64: hugetlb: Distinguish between hw and sw dirtiness in
> __cont_access_flags_changed
> arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
>
> arch/arm64/include/asm/pgtable.h | 6 ++++++
> arch/arm64/mm/hugetlbpage.c | 5 ++++-
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
>
> base-commit: 645a9a454fdb7e698a63a275edca6a17ef97afc4
On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <[email protected]> wrote:
>
> On 04/12/2023 17:26, James Houghton wrote:
> > It is currently possible for a userspace application to enter a page
> > fault loop when using HugeTLB pages implemented with contiguous PTEs
> > when HAFDBS is not available. This happens because:
> > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
> > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
>
> Hi James,
>
> Do you know how this happens?
Hi Ryan,
Thanks for taking a look! I do understand why this is happening. There
is an explanation in the reproducer[1] and also in this cover letter
(though I realize I could have been a little clearer). See below.
> AFAIK, this is the set of valid bit combinations, and
> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
> to understand how this is happening and prevent it?
>
> /*
> * PTE bits configuration in the presence of hardware Dirty Bit Management
> * (PTE_WRITE == PTE_DBM):
> *
> * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw)
> * 0 0 | 1 0 0
> * 0 1 | 1 1 0
> * 1 0 | 1 0 1
> * 1 1 | 0 1 x
> *
> * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
> * the page fault mechanism. Checking the dirty status of a pte becomes:
> *
> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
> */
Thanks for pointing this out. So (1) is definitely a bug. The second
patch in this series makes it impossible to create such a PTE via
pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).
> > The second patch in this series makes step (1) less likely to occur.
It makes it impossible to create this invalid set of bits via
pte_modify(). Assuming all PTE pgprot updates are done via the proper
interfaces, patch #2 might actually make this invalid bit combination
impossible to produce (that's certainly the goal). So perhaps language
stronger than "less likely" is appropriate.
Here's the sequence of events to trigger this bug, via mprotect():
> > Without this patch, we can get the kernel to write a sw-dirty, hw-clean
> > PTE with the following steps (showing the relevant VMA flags and pgprot
> > bits):
> > i. Create a valid, writable contiguous PTE.
> > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> > VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> > PTE pgprot bits: PTE_DIRTY | PTE_WRITE
> > ii. mprotect the VMA to PROT_NONE.
> > VMA vmflags: VM_SHARED
> > VMA pgprot bits: PTE_RDONLY
> > PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
> > iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
> > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> > VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> > PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY |
PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty).
Thanks!
> > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
On 05/12/2023 17:54, James Houghton wrote:
> On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 04/12/2023 17:26, James Houghton wrote:
>>> It is currently possible for a userspace application to enter a page
>>> fault loop when using HugeTLB pages implemented with contiguous PTEs
>>> when HAFDBS is not available. This happens because:
>>> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
>>> (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
>>
>> Hi James,
>>
>> Do you know how this happens?
>
> Hi Ryan,
>
> Thanks for taking a look! I do understand why this is happening. There
> is an explanation in the reproducer[1] and also in this cover letter
> (though I realize I could have been a little clearer). See below.
Sigh... sorry! I totally missed your (excellent) explanation.
>
>> AFAIK, this is the set of valid bit combinations, and
>> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
>> to understand how this is happening and prevent it?
>>
>> /*
>> * PTE bits configuration in the presence of hardware Dirty Bit Management
>> * (PTE_WRITE == PTE_DBM):
>> *
>> * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw)
>> * 0 0 | 1 0 0
>> * 0 1 | 1 1 0
>> * 1 0 | 1 0 1
>> * 1 1 | 0 1 x
>> *
>> * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
>> * the page fault mechanism. Checking the dirty status of a pte becomes:
>> *
>> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>> */
>
> Thanks for pointing this out. So (1) is definitely a bug. The second
> patch in this series makes it impossible to create such a PTE via
> pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).
Yes; I think the second patch should be sufficient; I took a quick look at the
other helpers and I don't see anything else that could get the PTE to the
invalid state.
I have a series that starts using the contpte bit for (multi-size) THP
opportunistically. This bug will affect that too I think. Your patch #2 will fix
for both hugetlb and my series. I'd rather not apply an equivalent to your patch
#1 because its not quite as straightforward in my code path. But I'm pretty
confident that patch # is all that's needed here.
Thanks,
Ryan
>
>>> The second patch in this series makes step (1) less likely to occur.
>
> It makes it impossible to create this invalid set of bits via
> pte_modify(). Assuming all PTE pgprot updates are done via the proper
> interfaces, patch #2 might actually make this invalid bit combination
> impossible to produce (that's certainly the goal). So perhaps language
> stronger than "less likely" is appropriate.
>
> Here's the sequence of events to trigger this bug, via mprotect():
>
>>> Without this patch, we can get the kernel to write a sw-dirty, hw-clean
>>> PTE with the following steps (showing the relevant VMA flags and pgprot
>>> bits):
>>> i. Create a valid, writable contiguous PTE.
>>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
>>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE
>>> ii. mprotect the VMA to PROT_NONE.
>>> VMA vmflags: VM_SHARED
>>> VMA pgprot bits: PTE_RDONLY
>>> PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
>>> iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
>>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
>>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
>
> With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY |
> PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty).
>
> Thanks!
>
>>> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
On 04/12/2023 17:26, James Houghton wrote:
> Make it impossible to create a sw-dirty, hw-clean PTE with pte_modify.
> Such a PTE should be impossible to create, and there may be places that
> assume that pte_dirty() implies pte_hw_dirty().
>
> Signed-off-by: James Houghton <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..79ce70fbb751 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -834,6 +834,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
>
> pte_val(pte) = (pte_val(pte) & ~mask) | (pgprot_val(newprot) & mask);
> + /*
> + * If we end up clearing hw dirtiness for a sw-dirty PTE, set hardware
> + * dirtiness again.
> + */
> + if (pte_sw_dirty(pte))
> + pte = pte_mkdirty(pte);
> return pte;
> }
>
On Wed, Dec 6, 2023 at 2:24 AM Ryan Roberts <[email protected]> wrote:
>
> On 05/12/2023 17:54, James Houghton wrote:
> > On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <[email protected]> wrote:
> > Thanks for pointing this out. So (1) is definitely a bug. The second
> > patch in this series makes it impossible to create such a PTE via
> > pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).
>
> Yes; I think the second patch should be sufficient; I took a quick look at the
> other helpers and I don't see anything else that could get the PTE to the
> invalid state.
>
> I have a series that starts using the contpte bit for (multi-size) THP
> opportunistically. This bug will affect that too I think. Your patch #2 will fix
> for both hugetlb and my series. I'd rather not apply an equivalent to your patch
> #1 because its not quite as straightforward in my code path. But I'm pretty
> confident that patch # is all that's needed here.
There is no need to apply a patch #1-equivalent for multi-size THPs.
:) If multi-size THP has the same problem as HugeTLB, patch #2 will
fix it too. I don't think multi-size THP has the equivalent problem --
in fact, I'm not sure how multi-size THP keeps the PTE_DIRTY,
PTE_WRITE (DBM), and the PTE_RDONLY bits in sync (they do need to be
in-sync with each other when the contiguous bit is being used,
right?).
I included patch #1 (with cc:stable) because it's a more direct fix
for HugeTLB that might be slightly easier to backport. If you think
that patch #1 should be dropped and patch #2 should be backported,
please let me know.
Thanks for the review!
On Mon, Dec 04, 2023 at 05:26:46PM +0000, James Houghton wrote:
> Make it impossible to create a sw-dirty, hw-clean PTE with pte_modify.
> Such a PTE should be impossible to create, and there may be places that
> assume that pte_dirty() implies pte_hw_dirty().
>
> Signed-off-by: James Houghton <[email protected]>
I'm not sure how, but you seem to be missing the '---' separator and the
diffstat here, so I suspect this might confuse tools such as b4 which try
to apply the patch directly.
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..79ce70fbb751 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -834,6 +834,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
>
> pte_val(pte) = (pte_val(pte) & ~mask) | (pgprot_val(newprot) & mask);
> + /*
> + * If we end up clearing hw dirtiness for a sw-dirty PTE, set hardware
> + * dirtiness again.
> + */
> + if (pte_sw_dirty(pte))
> + pte = pte_mkdirty(pte);
> return pte;
Looks like this is a fix for Catalin to pick up (patch #1 isn't necessary
afaict).
Will
On Mon, Dec 11, 2023 at 10:42 AM Will Deacon <[email protected]> wrote:
>
> On Mon, Dec 04, 2023 at 05:26:46PM +0000, James Houghton wrote:
> > Make it impossible to create a sw-dirty, hw-clean PTE with pte_modify.
> > Such a PTE should be impossible to create, and there may be places that
> > assume that pte_dirty() implies pte_hw_dirty().
> >
> > Signed-off-by: James Houghton <[email protected]>
>
> I'm not sure how, but you seem to be missing the '---' separator and the
> diffstat here, so I suspect this might confuse tools such as b4 which try
> to apply the patch directly.
Thanks for pointing that out. Looks like it came from using
`--summary` in git format-patch. :/
>
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index b19a8aee684c..79ce70fbb751 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -834,6 +834,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> > pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> >
> > pte_val(pte) = (pte_val(pte) & ~mask) | (pgprot_val(newprot) & mask);
> > + /*
> > + * If we end up clearing hw dirtiness for a sw-dirty PTE, set hardware
> > + * dirtiness again.
> > + */
> > + if (pte_sw_dirty(pte))
> > + pte = pte_mkdirty(pte);
> > return pte;
>
> Looks like this is a fix for Catalin to pick up (patch #1 isn't necessary
> afaict).
If only this patch is taken, make sure to add cc:stable and the fixes
tag from patch #1. Thank you!
On Mon, 04 Dec 2023 17:26:44 +0000, James Houghton wrote:
> It is currently possible for a userspace application to enter a page
> fault loop when using HugeTLB pages implemented with contiguous PTEs
> when HAFDBS is not available. This happens because:
> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
> (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
> 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
> the memory access on a system without HAFDBS, we will get a page
> fault.
> 3. HugeTLB will check if it needs to update the dirty bits on the PTE.
> For contiguous PTEs, it will check to see if the pgprot bits need
> updating. In this case, HugeTLB wants to write a sequence of
> sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
> to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
> so it thinks no update is necessary.
>
> [...]
Applied to arm64 (for-next/fixes), thanks!
[2/2] arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
https://git.kernel.org/arm64/c/3c0696076aad
I only picked up the second patch and added the description from the
cover letter into the commit log.
--
Catalin