Hi All,
This series adds uffd write-protect support for arm64. The patches are
extracted, without change, from a wider RFC I posted at [1].
Previous attempts to add uffd-wp (and soft-dirty) have failed because of a
perceived lack of available PTE SW bits. However, it turns out that two are
available, just hidden. PTE_PROT_NONE previously occupied a SW bit, but it only
applies when PTE_VALID is clear, so patch 1 moves it to overlay PTE_UXN,
freeing up the SW bit. Bit 63 is marked as "IGNORED" in the Arm ARM, but it
does not currently indicate "reserved for SW use" like it does for the other SW
bits. I've confirmed with the spec owner that this is an oversight; the bit is
intended to be reserved for SW use and the spec will clarify this in a future
update.
So now we have two spare bits; patch 2 enables uffd-wp on arm64, using the SW
bit freed up by moving PTE_PROT_NONE. This leaves bit 63 spare for future use
(e.g. soft-dirty - see RFC at [1] - or some other usage).
This applies on top of v6.9-rc3. All mm selftests involving uffd-wp now run
and pass, and no other selftest regressions are observed.
[1] https://lore.kernel.org/all/[email protected]/
Thanks,
Ryan
Ryan Roberts (2):
arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
arm64/mm: Add uffd write-protect support
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable-prot.h | 12 ++++-
arch/arm64/include/asm/pgtable.h | 71 ++++++++++++++++++++++++---
3 files changed, 75 insertions(+), 9 deletions(-)
--
2.25.1
Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
for SW use when the PTE is valid. This is a waste of those precious SW
bits since PTE_PROT_NONE can only ever be set when valid is clear.
Instead let's overlay it on what would be a HW bit if valid was set.
We need to be careful about which HW bit to choose since some of them
must be preserved; when pte_present() is true (as it is for a
PTE_PROT_NONE pte), it is legitimate for the core to call various
accessors, e.g. pte_dirty(), pte_write() etc. There are also some
accessors that are private to the arch which must continue to be
honoured, e.g. pte_user(), pte_user_exec() etc.
So we choose to overlay PTE_UXN; this effectively means that whenever a
pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
false, which is obviously always correct.
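For illustration, the existing arm64 helper is essentially:

  #define pte_user_exec(pte)    (!(pte_val(pte) & PTE_UXN))

so with PTE_PROT_NONE aliased to PTE_UXN, a pte marked PTE_PROT_NONE
trivially reports as not user-executable.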
As a result of this change, we must shuffle the layout of the
arch-specific swap pte so that PTE_PROT_NONE is always zero and not
overlapping with any other field. Consequently, there is no way
to keep the `type` field contiguous without conflicting with
PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
let's move PMD_PRESENT_INVALID to bit 60.
In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
soft-dirty or uffd-wp).
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable-prot.h | 4 ++--
arch/arm64/include/asm/pgtable.h | 16 +++++++++-------
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index dd9ee67d1d87..ef952d69fd04 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -18,14 +18,14 @@
#define PTE_DIRTY (_AT(pteval_t, 1) << 55)
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
-#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
/*
* This bit indicates that the entry is present i.e. pmd_page()
* still points to a valid huge page in memory even if the pmd
* has been invalidated.
*/
-#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
+#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 60) /* only when !PMD_SECT_VALID */
#define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
#define _PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..23aabff4fa6f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* Encode and decode a swap entry:
* bits 0-1: present (must be zero)
* bits 2: remember PG_anon_exclusive
- * bits 3-7: swap type
- * bits 8-57: swap offset
- * bit 58: PTE_PROT_NONE (must be zero)
+ * bits 4-53: swap offset
+ * bit 54: PTE_PROT_NONE (overlays PTE_UXN) (must be zero)
+ * bits 55-59: swap type
+ * bit 60: PMD_PRESENT_INVALID (must be zero)
*/
-#define __SWP_TYPE_SHIFT 3
+#define __SWP_TYPE_SHIFT 55
#define __SWP_TYPE_BITS 5
-#define __SWP_OFFSET_BITS 50
#define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1)
-#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
+#define __SWP_OFFSET_SHIFT 4
+#define __SWP_OFFSET_BITS 50
#define __SWP_OFFSET_MASK ((1UL << __SWP_OFFSET_BITS) - 1)
#define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
#define __swp_offset(x) (((x).val >> __SWP_OFFSET_SHIFT) & __SWP_OFFSET_MASK)
-#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
+#define __swp_entry(type, offset) ((swp_entry_t) { ((unsigned long)(type) << __SWP_TYPE_SHIFT) | \
+ ((unsigned long)(offset) << __SWP_OFFSET_SHIFT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(swp) ((pte_t) { (swp).val })
--
2.25.1
Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
The standard handlers are implemented for set/test/clear for both pte
and pmd. Additionally, we must track the uffd-wp state as a pte swp
bit, so a free swap entry pte bit (3) is used.
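For context, core-mm carries the bit across swap-out/in with a pattern
roughly like the following (a sketch of the generic code, not part of
this patch):

  /* swap-out, e.g. in mm/rmap.c */
  if (pte_uffd_wp(pteval))
          swp_pte = pte_swp_mkuffd_wp(swp_pte);

  /* swap-in, e.g. in mm/memory.c */
  if (pte_swp_uffd_wp(vmf->orig_pte))
          pte = pte_mkuffd_wp(pte);

so both the present and swap formats need a well-defined bit.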
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable-prot.h | 8 ++++
arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
3 files changed, 64 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..763e221f2169 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -255,6 +255,7 @@ config ARM64
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
+ select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
select TRACE_IRQFLAGS_SUPPORT
select TRACE_IRQFLAGS_NMI_SUPPORT
select HAVE_SOFTIRQ_ON_OWN_STACK
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index ef952d69fd04..f1e1f6306e03 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -20,6 +20,14 @@
#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
#define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
+#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
+#else
+#define PTE_UFFD_WP (_AT(pteval_t, 0))
+#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
/*
* This bit indicates that the entry is present i.e. pmd_page()
* still points to a valid huge page in memory even if the pmd
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 23aabff4fa6f..3f4748741fdb 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,6 +271,34 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pte_uffd_wp(pte_t pte)
+{
+ bool wp = !!(pte_val(pte) & PTE_UFFD_WP);
+
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * Having write bit for wr-protect-marked present ptes is fatal, because
+ * it means the uffd-wp bit will be ignored and write will just go
+ * through. See comment in x86 implementation.
+ */
+ WARN_ON_ONCE(wp && pte_write(pte));
+#endif
+
+ return wp;
+}
+
+static inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+ return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD_WP)));
+}
+
+static inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+ return clear_pte_bit(pte, __pgprot(PTE_UFFD_WP));
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
static inline void __set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);
@@ -463,6 +491,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
}
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+ return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+}
+
+static inline int pte_swp_uffd_wp(pte_t pte)
+{
+ return !!(pte_val(pte) & PTE_SWP_UFFD_WP);
+}
+
+static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+ return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
#ifdef CONFIG_NUMA_BALANCING
/*
* See the comment in include/linux/pgtable.h
@@ -508,6 +553,15 @@ static inline int pmd_trans_huge(pmd_t pmd)
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd))
+#define pmd_mkuffd_wp(pmd) pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)))
+#define pmd_clear_uffd_wp(pmd) pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)))
+#define pmd_swp_uffd_wp(pmd) pte_swp_uffd_wp(pmd_pte(pmd))
+#define pmd_swp_mkuffd_wp(pmd) pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)))
+#define pmd_swp_clear_uffd_wp(pmd) \
+ pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
static inline pmd_t pmd_mkinvalid(pmd_t pmd)
{
@@ -1248,6 +1302,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* Encode and decode a swap entry:
* bits 0-1: present (must be zero)
* bits 2: remember PG_anon_exclusive
+ * bit 3: remember uffd-wp state
* bits 4-53: swap offset
* bit 54: PTE_PROT_NONE (overlays PTE_UXN) (must be zero)
* bits 55-59: swap type
--
2.25.1
Hi, Ryan,
On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
> Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
>
> The standard handlers are implemented for set/test/clear for both pte
> and pmd. Additionally we must also track the uffd-wp state as a pte swp
> bit, so use a free swap entry pte bit (3).
>
> Signed-off-by: Ryan Roberts <[email protected]>
Looks all sane here from userfault perspective, just one comment below.
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable-prot.h | 8 ++++
> arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
> 3 files changed, 64 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b11c98b3e84..763e221f2169 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -255,6 +255,7 @@ config ARM64
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
> select TRACE_IRQFLAGS_SUPPORT
> select TRACE_IRQFLAGS_NMI_SUPPORT
> select HAVE_SOFTIRQ_ON_OWN_STACK
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index ef952d69fd04..f1e1f6306e03 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -20,6 +20,14 @@
> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
> #define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
>
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
> +#else
> +#define PTE_UFFD_WP (_AT(pteval_t, 0))
> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
> /*
> * This bit indicates that the entry is present i.e. pmd_page()
> * still points to a valid huge page in memory even if the pmd
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 23aabff4fa6f..3f4748741fdb 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -271,6 +271,34 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> }
>
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pte_uffd_wp(pte_t pte)
> +{
> + bool wp = !!(pte_val(pte) & PTE_UFFD_WP);
> +
> +#ifdef CONFIG_DEBUG_VM
> + /*
> + * Having write bit for wr-protect-marked present ptes is fatal, because
> + * it means the uffd-wp bit will be ignored and write will just go
> + * through. See comment in x86 implementation.
> + */
> + WARN_ON_ONCE(wp && pte_write(pte));
> +#endif
Feel free to drop this line, see:
https://lore.kernel.org/r/[email protected]
It's still in mm-unstable only.
AFAICT ARM64 is also supported by check_page_table; I also checked ARM's
ptep_modify_prot_commit(), which uses set_pte_at(), so it should cover
everything in a superior way already.
With that dropped, feel free to add:
Acked-by: Peter Xu <[email protected]>
Thanks,
--
Peter Xu
On 24/04/2024 12:57, Peter Xu wrote:
> Hi, Ryan,
>
> On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
>> Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
>>
>> The standard handlers are implemented for set/test/clear for both pte
>> and pmd. Additionally we must also track the uffd-wp state as a pte swp
>> bit, so use a free swap entry pte bit (3).
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> Looks all sane here from userfault perspective, just one comment below.
>
>> ---
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/include/asm/pgtable-prot.h | 8 ++++
>> arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
>> 3 files changed, 64 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7b11c98b3e84..763e221f2169 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -255,6 +255,7 @@ config ARM64
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
>> + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
>> select TRACE_IRQFLAGS_SUPPORT
>> select TRACE_IRQFLAGS_NMI_SUPPORT
>> select HAVE_SOFTIRQ_ON_OWN_STACK
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
>> index ef952d69fd04..f1e1f6306e03 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -20,6 +20,14 @@
>> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>> #define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
>>
>> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>> +#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
>> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
>> +#else
>> +#define PTE_UFFD_WP (_AT(pteval_t, 0))
>> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
>> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
>> +
>> /*
>> * This bit indicates that the entry is present i.e. pmd_page()
>> * still points to a valid huge page in memory even if the pmd
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 23aabff4fa6f..3f4748741fdb 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -271,6 +271,34 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> }
>>
>> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>> +static inline int pte_uffd_wp(pte_t pte)
>> +{
>> + bool wp = !!(pte_val(pte) & PTE_UFFD_WP);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> + /*
>> + * Having write bit for wr-protect-marked present ptes is fatal, because
>> + * it means the uffd-wp bit will be ignored and write will just go
>> + * through. See comment in x86 implementation.
>> + */
>> + WARN_ON_ONCE(wp && pte_write(pte));
>> +#endif
>
> Feel free to drop this line, see:
>
> https://lore.kernel.org/r/[email protected]
Ahh nice! In that case, I'm going to convert this to a macro, which is the arm64
style for these getters (for some reason...):
#define pte_uffd_wp(pte) (!!(pte_val(pte) & PTE_UFFD_WP))
Will send out a v2 once others have had time to comment.
>
> It's still in mm-unstable only.
>
> AFAICT ARM64 also is supported by check_page_table, I also checked ARM's
> ptep_modify_prot_commit() which uses set_pte_at(), so it should cover
> everything in a superior way already.
>
> With that dropped, feel free to add:
>
> Acked-by: Peter Xu <[email protected]>
Thanks!
>
> Thanks,
>
On Wed, Apr 24, 2024 at 12:10:16PM +0100, Ryan Roberts wrote:
> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
> for SW use when the PTE is valid. This is a waste of those precious SW
> bits since PTE_PROT_NONE can only ever be set when valid is clear.
> Instead let's overlay it on what would be a HW bit if valid was set.
>
> We need to be careful about which HW bit to choose since some of them
> must be preserved; when pte_present() is true (as it is for a
> PTE_PROT_NONE pte), it is legitimate for the core to call various
> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
> accessors that are private to the arch which must continue to be
> honoured, e.g. pte_user(), pte_user_exec() etc.
>
> So we choose to overlay PTE_UXN; This effectively means that whenever a
> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
> false, which is obviously always correct.
>
> As a result of this change, we must shuffle the layout of the
> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
> overlapping with any other field. As a result of this, there is no way
> to keep the `type` field contiguous without conflicting with
> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
> let's move PMD_PRESENT_INVALID to bit 60.
I think we discussed this but forgot the details. What was the reason for not
using, say, bit 60 for PTE_PROT_NONE to avoid all the swap bits
reshuffling? Clearing or setting of the PTE_PROT_NONE bit is done via
pte_modify() and this gets all the new permission bits anyway. With POE
support (on the list for now), PTE_PROT_NONE would overlap with
POIndex[0] but I don't think we ever plan to read this field (other than
maybe ptdump). The POIndex field is set from the vma->vm_page_prot (Joey
may need to adjust vm_get_page_prot() in his patches to avoid setting a
pkey on a PROT_NONE mapping).
--
Catalin
On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
> @@ -1248,6 +1302,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> * Encode and decode a swap entry:
> * bits 0-1: present (must be zero)
> * bits 2: remember PG_anon_exclusive
> + * bit 3: remember uffd-wp state
> * bits 4-53: swap offset
> * bit 54: PTE_PROT_NONE (overlays PTE_UXN) (must be zero)
> * bits 55-59: swap type
Ah, I did not realise we need to free up bit 3 from the swap pte as
well. Though maybe patch 1 is fine as is but for the record, it would be
good to justify the decision to go with PTE_UXN. For this patch:
Reviewed-by: Catalin Marinas <[email protected]>
On 24/04/2024 17:43, Catalin Marinas wrote:
> On Wed, Apr 24, 2024 at 12:10:16PM +0100, Ryan Roberts wrote:
>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>> for SW use when the PTE is valid. This is a waste of those precious SW
>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>> Instead let's overlay it on what would be a HW bit if valid was set.
>>
>> We need to be careful about which HW bit to choose since some of them
>> must be preserved; when pte_present() is true (as it is for a
>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>> accessors that are private to the arch which must continue to be
>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>
>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>> false, which is obviously always correct.
>>
>> As a result of this change, we must shuffle the layout of the
>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>> overlapping with any other field. As a result of this, there is no way
>> to keep the `type` field contiguous without conflicting with
>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>> let's move PMD_PRESENT_INVALID to bit 60.
>
> I think we discussed but forgot the details. What was the reason for not
> using, say, bit 60 for PTE_PROT_NONE to avoid all the swap bits
> reshuffling? Clearing or setting of the PTE_PROT_NONE bit is done via
> pte_modify() and this gets all the new permission bits anyway. With POE
> support (on the list for now), PTE_PROT_NONE would overlap with
> POIndex[0] but I don't think we ever plan to read this field (other than
> maybe ptdump). The POIndex field is set from the vma->vm_page_prot (Joey
> may need to adjust vm_get_page_prot() in his patches to avoid setting a
> pkey on a PROT_NONE mapping).
>
Copy/pasting your comment from the other patch into this one since it's easier to
discuss it all together:
Ah, I did not realise we need to free up bit 3 from the swap pte as
well. Though maybe patch 1 is fine as is but for the record, it would be
good to justify the decision to go with PTE_UXN.
While we need a new bit in the swap pte for uffd-wp, it's just a SW bit - it
could go anywhere. I chose bit 3 because it was free after all the other shuffling.
As for choosing PTE_UXN for PTE_PROT_NONE, I wanted to choose a bit that would
definitely never lead to confusion if ever interpreted as its HW meaning, since
as far as the core-mm is concerned, the pte is either present or it's not, and if
it is present, then it is completely valid to call all the pte_*() helpers. By
definition, if PTE_PROT_NONE is set, then the PTE is not executable in user
space, so any helpers that continue to interpret the bit position as UXN will
still give sensible answers.
Yes, I could have just put PTE_PROT_NONE in bit position 60 and avoided all the
shuffling. But in the past you have pushed back on using the PBHA bits due to
out-of-tree patches using them, so I thought it was better to just sidestep
having to think about it by not using them. Additionally, as you point out, I
didn't want to risk overlapping with the POIndex field and causing subtle bugs.
But then there is PMD_PRESENT_INVALID, which already turns out to violate my
above considerations. Ugh. I considered moving it to NS, but that would have
required splitting the offset field into 2 discontiguous parts in the swap pte.
In the end, I decided it's ok in any position because it's transient; it's just a
temp marker and the pte will soon get set again from scratch, so it doesn't
matter if adding the marker is destructive.
Personally I think there is less risk of a future/subtle bug by putting
PTE_PROT_NONE over PTE_UXN. But if you prefer to reduce churn by putting it at
bit 60 then I'm ok with that.
On 24.04.24 13:10, Ryan Roberts wrote:
> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
> for SW use when the PTE is valid. This is a waste of those precious SW
> bits since PTE_PROT_NONE can only ever be set when valid is clear.
> Instead let's overlay it on what would be a HW bit if valid was set.
>
> We need to be careful about which HW bit to choose since some of them
> must be preserved; when pte_present() is true (as it is for a
> PTE_PROT_NONE pte), it is legitimate for the core to call various
> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
> accessors that are private to the arch which must continue to be
> honoured, e.g. pte_user(), pte_user_exec() etc.
>
> So we choose to overlay PTE_UXN; This effectively means that whenever a
> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
> false, which is obviously always correct.
>
> As a result of this change, we must shuffle the layout of the
> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
> overlapping with any other field. As a result of this, there is no way
> to keep the `type` field contiguous without conflicting with
> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
> let's move PMD_PRESENT_INVALID to bit 60.
A note that some archs split/re-combine type and/or offset, to make use
of every bit possible :) But that's mostly relevant for 32bit.
(and as long as PFNs can still fit into the swp offset for migration
entries etc.)
>
> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
> soft-dirty or uffd-wp).
I was briefly confused about how you would use these bits as SW bits for
swap PTEs (which you can't as they overlay the type). See below
regarding bit 3.
I would have said here "proper SW bit for present PTEs".
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable-prot.h | 4 ++--
> arch/arm64/include/asm/pgtable.h | 16 +++++++++-------
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index dd9ee67d1d87..ef952d69fd04 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -18,14 +18,14 @@
> #define PTE_DIRTY (_AT(pteval_t, 1) << 55)
> #define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
> -#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
>
> /*
> * This bit indicates that the entry is present i.e. pmd_page()
> * still points to a valid huge page in memory even if the pmd
> * has been invalidated.
> */
> -#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
> +#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 60) /* only when !PMD_SECT_VALID */
>
> #define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
> #define _PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index afdd56d26ad7..23aabff4fa6f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> * Encode and decode a swap entry:
> * bits 0-1: present (must be zero)
> * bits 2: remember PG_anon_exclusive
> - * bits 3-7: swap type
> - * bits 8-57: swap offset
> - * bit 58: PTE_PROT_NONE (must be zero)
Reading this patch alone: what happened to bit 3? Please mention that
it will be used as a swap pte metadata bit (uffd-wp).
> + * bits 4-53: swap offset
So we'll still have 50 bits for the offset, good. We could even use 61-63
if ever required to store bigger PFNs.
LGTM
Reviewed-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb
On 25/04/2024 10:16, David Hildenbrand wrote:
> On 24.04.24 13:10, Ryan Roberts wrote:
>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>> for SW use when the PTE is valid. This is a waste of those precious SW
>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>> Instead let's overlay it on what would be a HW bit if valid was set.
>>
>> We need to be careful about which HW bit to choose since some of them
>> must be preserved; when pte_present() is true (as it is for a
>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>> accessors that are private to the arch which must continue to be
>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>
>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>> false, which is obviously always correct.
>>
>> As a result of this change, we must shuffle the layout of the
>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>> overlapping with any other field. As a result of this, there is no way
>> to keep the `type` field contiguous without conflicting with
>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>> let's move PMD_PRESENT_INVALID to bit 60.
>
> A note that some archs split/re-combine type and/or offset, to make use of every
> bit possible :) But that's mostly relevant for 32bit.
>
> (and as long as PFNs can still fit into the swp offset for migration entries etc.)
Yeah, I considered splitting the type or offset field to avoid moving
PMD_PRESENT_INVALID, but thought it was better to avoid the extra mask and shift.
>
>>
>> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
>> soft-dirty or uffd-wp).
>
> I was briefly confused about how you would use these bits as SW bits for swap
> PTEs (which you can't as they overlay the type). See below regarding bit 3.
>
> I would have said here "proper SW bit for present PTEs".
Yes; I'll clarify in the next version.
>
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> arch/arm64/include/asm/pgtable-prot.h | 4 ++--
>> arch/arm64/include/asm/pgtable.h | 16 +++++++++-------
>> 2 files changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h
>> b/arch/arm64/include/asm/pgtable-prot.h
>> index dd9ee67d1d87..ef952d69fd04 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -18,14 +18,14 @@
>> #define PTE_DIRTY (_AT(pteval_t, 1) << 55)
>> #define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
>> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>> -#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
>> +#define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when
>> !PTE_VALID */
>> /*
>> * This bit indicates that the entry is present i.e. pmd_page()
>> * still points to a valid huge page in memory even if the pmd
>> * has been invalidated.
>> */
>> -#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when
>> !PMD_SECT_VALID */
>> +#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 60) /* only when
>> !PMD_SECT_VALID */
>> #define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>> #define _PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index afdd56d26ad7..23aabff4fa6f 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct
>> vm_area_struct *vma,
>> * Encode and decode a swap entry:
>> * bits 0-1: present (must be zero)
>> * bits 2: remember PG_anon_exclusive
>> - * bits 3-7: swap type
>> - * bits 8-57: swap offset
>> - * bit 58: PTE_PROT_NONE (must be zero)
>
> Reading this patch alone: what happened to bit 3? Please mention that that it
> will be used as a swap pte metadata bit (uffd-wp).
Will do. It's all a bit arbitrary though. I could have put offset in 3-52, and
then 53 would have been spare for uffd-wp. I'm not sure there is any advantage
to either option.
>
>> + * bits 4-53: swap offset
>
> So we'll still have 50bit for the offset, good. We could even use 61-63 if ever
> required to store bigger PFNs.
yep, or more sw bits.
>
> LGTM
>
> Reviewed-by: David Hildenbrand <[email protected]>
Thanks!
On 25/04/2024 11:29, Ryan Roberts wrote:
> On 25/04/2024 10:16, David Hildenbrand wrote:
>> On 24.04.24 13:10, Ryan Roberts wrote:
>>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>>> for SW use when the PTE is valid. This is a waste of those precious SW
>>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>>> Instead let's overlay it on what would be a HW bit if valid was set.
>>>
>>> We need to be careful about which HW bit to choose since some of them
>>> must be preserved; when pte_present() is true (as it is for a
>>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>>> accessors that are private to the arch which must continue to be
>>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>>
>>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>>> false, which is obviously always correct.
>>>
>>> As a result of this change, we must shuffle the layout of the
>>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>>> overlapping with any other field. As a result of this, there is no way
>>> to keep the `type` field contiguous without conflicting with
>>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>>> let's move PMD_PRESENT_INVALID to bit 60.
>>
>> A note that some archs split/re-combine type and/or offset, to make use of every
>> bit possible :) But that's mostly relevant for 32bit.
>>
>> (and as long as PFNs can still fit into the swp offset for migration entries etc.)
>
> Yeah, I considered splitting the type or offset field to avoid moving
> PMD_PRESENT_INVALID, but thought it was better to avoid the extra mask and shift.
Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
ptes; it would be cleaner to have one bit that defines "present" when valid is
clear (similar to PTE_PROT_NONE today), then another bit which is only defined
when "present && !valid" and which tells us if this is PTE_PROT_NONE or
PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
But there is a problem with this: __split_huge_pmd_locked() calls
pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
but I was trying to avoid the whole thing unravelling so didn't pursue it.
>
>>
>>>
>>> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
>>> soft-dirty or uffd-wp).
>>
>> I was briefly confused about how you would use these bits as SW bits for swap
>> PTEs (which you can't as they overlay the type). See below regarding bit 3.
>>
>> I would have said here "proper SW bit for present PTEs".
>
> Yes; I'll clarify in the next version.
>
>>
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> arch/arm64/include/asm/pgtable-prot.h | 4 ++--
>>> arch/arm64/include/asm/pgtable.h | 16 +++++++++-------
>>> 2 files changed, 11 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable-prot.h
>>> b/arch/arm64/include/asm/pgtable-prot.h
>>> index dd9ee67d1d87..ef952d69fd04 100644
>>> --- a/arch/arm64/include/asm/pgtable-prot.h
>>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>>> @@ -18,14 +18,14 @@
>>> #define PTE_DIRTY (_AT(pteval_t, 1) << 55)
>>> #define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
>>> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>>> -#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
>>> +#define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when
>>> !PTE_VALID */
>>> /*
>>> * This bit indicates that the entry is present i.e. pmd_page()
>>> * still points to a valid huge page in memory even if the pmd
>>> * has been invalidated.
>>> */
>>> -#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when
>>> !PMD_SECT_VALID */
>>> +#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 60) /* only when
>>> !PMD_SECT_VALID */
>>> #define _PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>>> #define _PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index afdd56d26ad7..23aabff4fa6f 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct
>>> vm_area_struct *vma,
>>> * Encode and decode a swap entry:
>>> * bits 0-1: present (must be zero)
>>> * bits 2: remember PG_anon_exclusive
>>> - * bits 3-7: swap type
>>> - * bits 8-57: swap offset
>>> - * bit 58: PTE_PROT_NONE (must be zero)
>>
>> Reading this patch alone: what happened to bit 3? Please mention that that it
>> will be used as a swap pte metadata bit (uffd-wp).
>
> Will do. It's all a bit arbitrary though. I could have put offset in 3-52, and
> then 53 would have been spare for uffd-wp. I'm not sure there is any advantage
> to either option.
>
>>
>>> + * bits 4-53: swap offset
>>
>> So we'll still have 50bit for the offset, good. We could even use 61-63 if ever
>> required to store bigger PFNs.
>
> yep, or more sw bits.
>
>>
>> LGTM
>>
>> Reviewed-by: David Hildenbrand <[email protected]>
>
> Thanks!
>
>
+ Muhammad Usama Anjum <[email protected]>
Hi Peter, Muhammad,
On 24/04/2024 12:57, Peter Xu wrote:
> Hi, Ryan,
>
> On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
>> Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
>>
>> The standard handlers are implemented for set/test/clear for both pte
>> and pmd. Additionally we must also track the uffd-wp state as a pte swp
>> bit, so use a free swap entry pte bit (3).
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> Looks all sane here from userfault perspective, just one comment below.
>
>> ---
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/include/asm/pgtable-prot.h | 8 ++++
>> arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
>> 3 files changed, 64 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7b11c98b3e84..763e221f2169 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -255,6 +255,7 @@ config ARM64
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
>> + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
>> select TRACE_IRQFLAGS_SUPPORT
>> select TRACE_IRQFLAGS_NMI_SUPPORT
>> select HAVE_SOFTIRQ_ON_OWN_STACK
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
>> index ef952d69fd04..f1e1f6306e03 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -20,6 +20,14 @@
>> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>> #define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
>>
>> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>> +#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
>> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
I've just noticed code in task_mmu.c:
static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
{
...
if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
p->arg.category_mask == PAGE_IS_WRITTEN &&
p->arg.return_mask == PAGE_IS_WRITTEN) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
unsigned long next = addr + PAGE_SIZE;
if (pte_uffd_wp(ptep_get(pte))) <<<<<<
continue;
...
}
}
}
As far as I can see, you don't know that the pte is present when you do this. So
does this imply that the UFFD-WP bit is expected to be in the same position for
both present ptes and swap ptes? I had assumed pte_uffd_wp() was for present
ptes and pte_swp_uffd_wp() was for swap ptes.
As you can see, the way I've implemented this for arm64, the bit is in a
different position for these 2 cases. I've just done a slightly different
implementation that changes the first patch in this series quite a bit, and a
bunch of pagemap_ioctl mm kselftests are now failing. I think this is the root
cause, but I haven't proven it definitively yet.
I'm inclined towards thinking the above is a bug and should be fixed so that I
can store the bit in different places. What do you think?
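(The kind of fix I have in mind is to pick the accessor based on
presence, something like:

  pte_t ptent = ptep_get(pte);

  if (pte_present(ptent) ? pte_uffd_wp(ptent) : pte_swp_uffd_wp(ptent))
          continue;

though I haven't written it yet.)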
Thanks,
Ryan
>> +#else
>> +#define PTE_UFFD_WP (_AT(pteval_t, 0))
>> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 0))
>> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
>> +
>> /*
>> * This bit indicates that the entry is present i.e. pmd_page()
>> * still points to a valid huge page in memory even if the pmd
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 23aabff4fa6f..3f4748741fdb 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -271,6 +271,34 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> }
>>
>> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>> +static inline int pte_uffd_wp(pte_t pte)
>> +{
>> + bool wp = !!(pte_val(pte) & PTE_UFFD_WP);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> + /*
>> + * Having write bit for wr-protect-marked present ptes is fatal, because
>> + * it means the uffd-wp bit will be ignored and write will just go
>> + * through. See comment in x86 implementation.
>> + */
>> + WARN_ON_ONCE(wp && pte_write(pte));
>> +#endif
>
> Feel free to drop this line, see:
>
> https://lore.kernel.org/r/[email protected]
>
> It's still in mm-unstable only.
>
> AFAICT ARM64 also is supported by check_page_table, I also checked ARM's
> ptep_modify_prot_commit() which uses set_pte_at(), so it should cover
> everything in a superior way already.
>
> With that dropped, feel free to add:
>
> Acked-by: Peter Xu <[email protected]>
>
> Thanks,
>
On Fri, Apr 26, 2024 at 02:17:41PM +0100, Ryan Roberts wrote:
> + Muhammad Usama Anjum <[email protected]>
>
> Hi Peter, Muhammad,
>
>
> On 24/04/2024 12:57, Peter Xu wrote:
> > Hi, Ryan,
> >
> > On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
> >> Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
> >>
> >> The standard handlers are implemented for set/test/clear for both pte
> >> and pmd. Additionally we must also track the uffd-wp state as a pte swp
> >> bit, so use a free swap entry pte bit (3).
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >
> > Looks all sane here from userfault perspective, just one comment below.
> >
> >> ---
> >> arch/arm64/Kconfig | 1 +
> >> arch/arm64/include/asm/pgtable-prot.h | 8 ++++
> >> arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
> >> 3 files changed, 64 insertions(+)
> >>
> >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> index 7b11c98b3e84..763e221f2169 100644
> >> --- a/arch/arm64/Kconfig
> >> +++ b/arch/arm64/Kconfig
> >> @@ -255,6 +255,7 @@ config ARM64
> >> select SYSCTL_EXCEPTION_TRACE
> >> select THREAD_INFO_IN_TASK
> >> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> >> + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
> >> select TRACE_IRQFLAGS_SUPPORT
> >> select TRACE_IRQFLAGS_NMI_SUPPORT
> >> select HAVE_SOFTIRQ_ON_OWN_STACK
> >> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> >> index ef952d69fd04..f1e1f6306e03 100644
> >> --- a/arch/arm64/include/asm/pgtable-prot.h
> >> +++ b/arch/arm64/include/asm/pgtable-prot.h
> >> @@ -20,6 +20,14 @@
> >> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
> >> #define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> >> +#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
> >> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
>
> I've just noticed code in task_mmu.c:
>
> static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> unsigned long end, struct mm_walk *walk)
> {
> ...
>
> if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
> p->arg.category_mask == PAGE_IS_WRITTEN &&
> p->arg.return_mask == PAGE_IS_WRITTEN) {
> for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> unsigned long next = addr + PAGE_SIZE;
>
> if (pte_uffd_wp(ptep_get(pte))) <<<<<<
> continue;
>
> ...
> }
> }
> }
>
> As far as I can see, you don't know that the pte is present when you do this. So
> does this imply that the UFFD-WP bit is expected to be in the same position for
> both present ptes and swap ptes? I had assumed pte_uffd_wp() was for present
> ptes and pte_swp_uffd_wp() was for swap ptes.
>
> As you can see, the way I've implemented this for arm64 the bit is in a
> different position for these 2 cases. I've just done a slightly different
> implementation that changes the first patch in this series quite a bit and a
> bunch of pagemap_ioctl mm kselftests are now failing. I think this is the root
> cause, but haven't proven it definitively yet.
>
> I'm inclined towords thinking the above is a bug and should be fixed so that I
> can store the bit in different places. What do you think?
Yep I agree.
Even on x86_64 they should be defined differently. It looks like sheer
luck that the test constantly passes on x86 even though it checked the wrong one.
Worth checking all the relevant paths in the pagemap code to make sure it's
checked properly, e.g. I also see one fast path above this chunk of code which
looks like it has the same issue.
Thanks,
--
Peter Xu
On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> ptes; it would be cleaner to have one bit that defines "present" when valid is
> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
I think this makes sense; maybe rename the above to PTE_PRESENT_INVALID
and use it for both ptes and pmds.
> But there is a problem with this: __split_huge_pmd_locked() calls
> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
> but was trying to avoid the whole thing unravelling so didn't persue.
Maybe what's wrong is the arm64 implementation setting this bit on a
swap/migration pmd (though we could handle this in the core code as
well, it depends what the other architectures do). The only check for
the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
into the pmd_present() check. I think it is currently broken as
pmd_present() can return true for a swap pmd after pmd_mkinvalid().
So I don't think we lose anything if pmd_mkinvalid() skips any bit
setting when !PTE_VALID. Maybe it even fixes some corner case we never
hit yet (like pmd_present() on a swap/migration+invalid pmd).
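Something like this, perhaps (an untested sketch; the unguarded version
of this helper is what's in the arm64 pgtable.h today):

  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
  {
          /* Only a truly valid pmd should be marked present+invalid */
          if (!(pmd_val(pmd) & PMD_SECT_VALID))
                  return pmd;

          pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
          pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));

          return pmd;
  }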
--
Catalin
On 26/04/2024 14:54, Peter Xu wrote:
> On Fri, Apr 26, 2024 at 02:17:41PM +0100, Ryan Roberts wrote:
>> + Muhammad Usama Anjum <[email protected]>
>>
>> Hi Peter, Muhammad,
>>
>>
>> On 24/04/2024 12:57, Peter Xu wrote:
>>> Hi, Ryan,
>>>
>>> On Wed, Apr 24, 2024 at 12:10:17PM +0100, Ryan Roberts wrote:
>>>> Let's use the newly-free PTE SW bit (58) to add support for uffd-wp.
>>>>
>>>> The standard handlers are implemented for set/test/clear for both pte
>>>> and pmd. Additionally we must also track the uffd-wp state as a pte swp
>>>> bit, so use a free swap entry pte bit (3).
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>
>>> Looks all sane here from userfault perspective, just one comment below.
>>>
>>>> ---
>>>> arch/arm64/Kconfig | 1 +
>>>> arch/arm64/include/asm/pgtable-prot.h | 8 ++++
>>>> arch/arm64/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++
>>>> 3 files changed, 64 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index 7b11c98b3e84..763e221f2169 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -255,6 +255,7 @@ config ARM64
>>>> select SYSCTL_EXCEPTION_TRACE
>>>> select THREAD_INFO_IN_TASK
>>>> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
>>>> + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
>>>> select TRACE_IRQFLAGS_SUPPORT
>>>> select TRACE_IRQFLAGS_NMI_SUPPORT
>>>> select HAVE_SOFTIRQ_ON_OWN_STACK
>>>> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
>>>> index ef952d69fd04..f1e1f6306e03 100644
>>>> --- a/arch/arm64/include/asm/pgtable-prot.h
>>>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>>>> @@ -20,6 +20,14 @@
>>>> #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>>>> #define PTE_PROT_NONE (PTE_UXN) /* Reuse PTE_UXN; only when !PTE_VALID */
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>>>> +#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
>>>> +#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
>>
>> I've just noticed code in task_mmu.c:
>>
>> static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> unsigned long end, struct mm_walk *walk)
>> {
>> ...
>>
>> if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
>> p->arg.category_mask == PAGE_IS_WRITTEN &&
>> p->arg.return_mask == PAGE_IS_WRITTEN) {
>> for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> unsigned long next = addr + PAGE_SIZE;
>>
>> if (pte_uffd_wp(ptep_get(pte))) <<<<<<
>> continue;
>>
>> ...
>> }
>> }
>> }
>>
>> As far as I can see, you don't know that the pte is present when you do this. So
>> does this imply that the UFFD-WP bit is expected to be in the same position for
>> both present ptes and swap ptes? I had assumed pte_uffd_wp() was for present
>> ptes and pte_swp_uffd_wp() was for swap ptes.
>>
>> As you can see, the way I've implemented this for arm64 the bit is in a
>> different position for these 2 cases. I've just done a slightly different
>> implementation that changes the first patch in this series quite a bit and a
>> bunch of pagemap_ioctl mm kselftests are now failing. I think this is the root
>> cause, but haven't proven it definitively yet.
>>
>> I'm inclined towords thinking the above is a bug and should be fixed so that I
>> can store the bit in different places. What do you think?
>
> Yep I agree.
OK great - I'll spin a patch to fix this.
>
> Even on x86_64 they should be defined differently. It looks like some
> sheer luck the test constantly pass on x86 even if it checked the wrong one.
>
> Worth checking all the relevant paths in the pagemap code to make sure it's
> checked, e.g. I also see one fast path above this chunk of code which looks
> like to have the same issue.
Yes, spotted that one. I'll audit other sites too.
Thanks!
>
> Thanks,
>
On 26/04/2024 15:48, Catalin Marinas wrote:
> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>
> I think this make sense, maybe rename the above to PTE_PRESENT_INVALID
> and use it for both ptes and pmds.
Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
core-mm so I will now fix that before I can validate my change; see
https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
with both PTE_WRITE=0 and PTE_RDONLY=0.
While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
modification", this is not a problem as the pte is invalid, so the HW doesn't
interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
that we now repurpose for PROT_NONE.
This will subtly change behaviour in an edge case though. Imagine:
pte_t pte;
pte = pte_modify(pte, PAGE_NONE);
pte = pte_mkwrite_novma(pte);
WARN_ON(pte_protnone(pte));
Should that warning fire or not? Previously, because we had a dedicated bit for
PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
it's more intuitive if it doesn't fire. Regardless, there is no core code that
ever does this. Once you have a protnone pte, it's terminal - nothing ever
modifies it with these helpers AFAICS.
Personally I think this is a nice tidy-up that saves a SW bit in both present
and swap ptes. What do you think? (I'll just post the series if it's easier to
provide feedback in that context.)
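Concretely, the protnone check would become something like this (a sketch
only; pte_present_invalid() here stands for the new helper testing the
PTE_VALID=0/PTE_INVALID=1 state discussed above):

  /*
   * Present but invalid, and neither writable nor read-only: a
   * combination that is otherwise unused, repurposed for PROT_NONE.
   */
  #define pte_protnone(pte) \
          (pte_present_invalid(pte) && \
           !(pte_val(pte) & (PTE_WRITE | PTE_RDONLY)))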
>
>> But there is a problem with this: __split_huge_pmd_locked() calls
>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>> but was trying to avoid the whole thing unravelling so didn't persue.
>
> Maybe what's wrong is the arm64 implementation setting this bit on a
> swap/migration pmd (though we could handle this in the core code as
> well, it depends what the other architectures do). The only check for
> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
> into the pmd_present() check. I think it is currently broken as
> pmd_present() can return true for a swap pmd after pmd_mkinvalid().
I've posted a fix here:
https://lore.kernel.org/linux-mm/[email protected]/
My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
>
> So I don't think we lose anything if pmd_mkinvalid() skips any bit
> setting when !PTE_VALID. Maybe it even fixes some corner case we never
> hit yet (like pmd_present() on a swap/migration+invalid pmd).
>
On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
> On 26/04/2024 15:48, Catalin Marinas wrote:
> > On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> >> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> >> ptes; it would be cleaner to have one bit that defines "present" when valid is
> >> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> >> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> >> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
> >
> > I think this make sense, maybe rename the above to PTE_PRESENT_INVALID
> > and use it for both ptes and pmds.
>
> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
> core-mm so will now fix that before I can validate my change. see
> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>
> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
> with both PTE_WRITE=0 and PTE_RDONLY=0.
>
> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
> modification", this is not a problem as the pte is invalid, so the HW doesn't
> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
> that we now repurpose for PROT_NONE.
Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
for PAGE_NONE (bar the kernel mappings)?
For ptes, it doesn't matter, we can assume that PTE_PRESENT_INVALID
means pte_protnone(). For pmds, however, we can end up with
pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
permissions encoded into a valid pmd. That's where a dedicated
PTE_PROT_NONE bit helped.
Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
first to set PTE_PRESENT_INVALID. A different CPU gets a fault and since
the pmd is present, it goes and checks pmd_protnone() which returns
true, ending up on the do_huge_pmd_numa_page() path. Maybe some locks help
but it looks fragile to rely on them.
So I think for protnone we need to check some other bits (like USER and
UXN) in addition to PTE_PRESENT_INVALID.
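i.e. something along these lines (a sketch; pte_present_invalid() being
the renamed helper from above):

  /* USER=0 && UXN=1 is unique to PAGE_NONE (bar kernel mappings) */
  #define pte_protnone(pte) \
          (pte_present_invalid(pte) && \
           ((pte_val(pte) & (PTE_USER | PTE_UXN)) == PTE_UXN))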
> This will subtly change behaviour in an edge case though. Imagine:
>
> pte_t pte;
>
> pte = pte_modify(pte, PAGE_NONE);
> pte = pte_mkwrite_novma(pte);
> WARN_ON(pte_protnone(pte));
>
> Should that warning fire or not? Previously, because we had a dedicated bit for
> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
> it's more intuitive if it doesn't fire. Regardless there is no core code that
> ever does this. Once you have a protnone pte, its terminal - nothing ever
> modifies it with these helpers AFAICS.
I don't think any core code should try to make a PAGE_NONE pte
writeable.
> Personally I think this is a nice tidy up that saves a SW bit in both present
> and swap ptes. What do you think? (I'll just post the series if its easier to
> provide feedback in that context).
It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
it doesn't affect the pmd case I mentioned above.
> >> But there is a problem with this: __split_huge_pmd_locked() calls
> >> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
> >> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
> >> but I was trying to avoid the whole thing unravelling so didn't pursue it.
> >
> > Maybe what's wrong is the arm64 implementation setting this bit on a
> > swap/migration pmd (though we could handle this in the core code as
> > well, it depends what the other architectures do). The only check for
> > the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
> > into the pmd_present() check. I think it is currently broken as
> > pmd_present() can return true for a swap pmd after pmd_mkinvalid().
>
> I've posted a fix here:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
I agree, thanks.
--
Catalin
On 29/04/2024 13:38, Catalin Marinas wrote:
> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>
>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>> and use it for both ptes and pmds.
>>
>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>> core-mm, so I will now fix that before I can validate my change. See
>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>
>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>
>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>> that we now repurpose for PROT_NONE.
>
> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
> set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
> for PAGE_NONE (bar the kernel mappings)?
Yes, I guess that works. I personally prefer my proposal because it is more
intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
preferable, then I'll go with that.
>
> For ptes, it doesn't matter: we can assume that PTE_PRESENT_INVALID
> means pte_protnone(). For pmds, however, we can end up with
> pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
> permissions encoded into a valid pmd. That's where a dedicated
> PTE_PROT_NONE bit helped.
Yes, agreed.
>
> Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
> first to set PTE_PRESENT_INVALID. A different CPU gets a fault and, since
> the pmd is present, it goes and checks pmd_protnone(), which returns
> true, ending up on the do_huge_pmd_numa_page() path. Maybe some locks help
> but it looks fragile to rely on them.
>
> So I think for protnone we need to check some other bits (like USER and
> UXN) in addition to PTE_PRESENT_INVALID.
Yes, 100% agree. But using PTE_WRITE|PTE_RDONLY==0b00 is just as valid for that
purpose, I think?
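For comparison, the same sketch under the proposed PTE_WRITE|PTE_RDONLY == 0b00
encoding (again illustrative: PTE_WRITE is the DBM bit, and
PTE_PRESENT_INVALID's position is assumed):

#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID		(1ULL << 0)
#define PTE_RDONLY		(1ULL << 7)	/* AP[2] */
#define PTE_WRITE		(1ULL << 51)	/* DBM */
#define PTE_PRESENT_INVALID	(1ULL << 59)	/* assumed SW bit position */

/*
 * Here PROT_NONE is the otherwise-unused combination PTE_WRITE == 0 &&
 * PTE_RDONLY == 0 on a present-invalid entry. Any mapping that was
 * valid before pmdp_invalidate() carries PTE_WRITE (RW) or PTE_RDONLY
 * (RO), so this also keeps pmd_protnone() false for an invalidated pmd.
 */
static bool protnone_alt_sketch(uint64_t pte)
{
	if ((pte & PTE_VALID) || !(pte & PTE_PRESENT_INVALID))
		return false;
	return !(pte & (PTE_WRITE | PTE_RDONLY));
}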
>
>> This will subtly change behaviour in an edge case though. Imagine:
>>
>> pte_t pte;
>>
>> pte = pte_modify(pte, PAGE_NONE);
>> pte = pte_mkwrite_novma(pte);
>> WARN_ON(pte_protnone(pte));
>>
>> Should that warning fire or not? Previously, because we had a dedicated bit for
>> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
>> it's more intuitive if it doesn't fire. Regardless, there is no core code that
>> ever does this. Once you have a protnone pte, it's terminal - nothing ever
>> modifies it with these helpers AFAICS.
>
>> I don't think any core code should try to make a PAGE_NONE pte
>> writeable.
I looked at some other arches; some (at least alpha and hexagon) will not fire
this warning because they have R and W bits and 0b00 means NONE. Others (x86)
will fire it because they have an explicit NONE bit and don't remove it on
permission change. So I conclude it's UB and fine to do either.
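As a toy model of that divergence (hypothetical helpers that mimic the two
encoding styles, not any real architecture's layout):

#include <stdint.h>
#include <stdio.h>

#define W	(1u << 0)	/* write permission bit */
#define R	(1u << 1)	/* read permission bit */
#define NONE	(1u << 2)	/* dedicated PROT_NONE flag (x86-style) */

int main(void)
{
	/* Scheme A (alpha/hexagon-style): NONE is encoded as R=0, W=0. */
	uint32_t a = 0;			/* PAGE_NONE */
	a |= W;				/* pte_mkwrite() */
	printf("scheme A still protnone: %d\n", !(a & (R | W)));	/* 0 */

	/* Scheme B (x86-style): the dedicated NONE bit survives mkwrite. */
	uint32_t b = NONE;		/* PAGE_NONE */
	b |= W;				/* pte_mkwrite() */
	printf("scheme B still protnone: %d\n", !!(b & NONE));		/* 1 */

	return 0;
}

So scheme A would not fire the WARN_ON above, while scheme B would.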
>
>> Personally I think this is a nice tidy up that saves a SW bit in both present
>> and swap ptes. What do you think? (I'll just post the series if it's easier to
>> provide feedback in that context).
>
> It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
> it doesn't affect the pmd case I mentioned above.
>
>>>> But there is a problem with this: __split_huge_pmd_locked() calls
>>>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>>>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>>>> but I was trying to avoid the whole thing unravelling so didn't pursue it.
>>>
>>> Maybe what's wrong is the arm64 implementation setting this bit on a
>>> swap/migration pmd (though we could handle this in the core code as
>>> well, it depends what the other architectures do). The only check for
>>> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
>>> into the pmd_present() check. I think it is currently broken as
>>> pmd_present() can return true for a swap pmd after pmd_mkinvalid().
>>
>> I've posted a fix here:
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
>
> I agree, thanks.
>
On 29/04/2024 14:01, Ryan Roberts wrote:
> On 29/04/2024 13:38, Catalin Marinas wrote:
>> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>>
>>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>>> and use it for both ptes and pmds.
>>>
>>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>>> core-mm, so I will now fix that before I can validate my change. See
>>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>>
>>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>>
>>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>>> that we now repurpose for PROT_NONE.
>>
>> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
>> set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
>> for PAGE_NONE (bar the kernel mappings)?
>
> Yes, I guess that works. I personally prefer my proposal because it is more
> intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
> if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
> preferable, then I'll go with that.
Ignore this - I looked at your proposed approach and agree it's better. I'll use
`PTE_USER|PTE_UXN==0b01`. Posting shortly...
>
>>
>> For ptes, it doesn't matter: we can assume that PTE_PRESENT_INVALID
>> means pte_protnone(). For pmds, however, we can end up with
>> pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
>> permissions encoded into a valid pmd. That's where a dedicated
>> PTE_PROT_NONE bit helped.
>
> Yes, agreed.
>
>>
>> Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
>> first to set PTE_PRESENT_INVALID. A different CPU gets a fault and, since
>> the pmd is present, it goes and checks pmd_protnone(), which returns
>> true, ending up on the do_huge_pmd_numa_page() path. Maybe some locks help
>> but it looks fragile to rely on them.
>>
>> So I think for protnone we need to check some other bits (like USER and
>> UXN) in addition to PTE_PRESENT_INVALID.
>
> Yes, 100% agree. But using PTE_WRITE|PTE_RDONLY==0b00 is just as valid for that
> purpose, I think?
>
>>
>>> This will subtly change behaviour in an edge case though. Imagine:
>>>
>>> pte_t pte;
>>>
>>> pte = pte_modify(pte, PAGE_NONE);
>>> pte = pte_mkwrite_novma(pte);
>>> WARN_ON(pte_protnone(pte));
>>>
>>> Should that warning fire or not? Previously, because we had a dedicated bit for
>>> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
>>> it's more intuitive if it doesn't fire. Regardless, there is no core code that
>>> ever does this. Once you have a protnone pte, it's terminal - nothing ever
>>> modifies it with these helpers AFAICS.
>>
>> I don't think any core code should try to make a PAGE_NONE pte
>> writeable.
>
> I looked at some other arches; some (at least alpha and hexagon) will not fire
> this warning because they have R and W bits and 0b00 means NONE. Others (x86)
> will fire it because they have an explicit NONE bit and don't remove it on
> permission change. So I conclude it's UB and fine to do either.
>
>>
>>> Personally I think this is a nice tidy up that saves a SW bit in both present
>>> and swap ptes. What do you think? (I'll just post the series if it's easier to
>>> provide feedback in that context).
>>
>> It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
>> it doesn't affect the pmd case I mentioned above.
>>
>>>>> But there is a problem with this: __split_huge_pmd_locked() calls
>>>>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>>>>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>>>>> but I was trying to avoid the whole thing unravelling so didn't pursue it.
>>>>
>>>> Maybe what's wrong is the arm64 implementation setting this bit on a
>>>> swap/migration pmd (though we could handle this in the core code as
>>>> well, it depends what the other architectures do). The only check for
>>>> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
>>>> into the pmd_present() check. I think it is currently broken as
>>>> pmd_present() can return true for a swap pmd after pmd_mkinvalid().
>>>
>>> I've posted a fix here:
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
>>
>> I agree, thanks.
>>
>
On Mon, Apr 29, 2024 at 02:23:35PM +0100, Ryan Roberts wrote:
> On 29/04/2024 14:01, Ryan Roberts wrote:
> > On 29/04/2024 13:38, Catalin Marinas wrote:
> >> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
> >>> On 26/04/2024 15:48, Catalin Marinas wrote:
> >>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> >>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> >>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
> >>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> >>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> >>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
> >>>>
> >>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
> >>>> and use it for both ptes and pmds.
> >>>
> >>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
> >>> core-mm, so I will now fix that before I can validate my change. See
> >>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
> >>>
> >>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
> >>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
> >>> with both PTE_WRITE=0 and PTE_RDONLY=0.
> >>>
> >>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
> >>> modification", this is not a problem as the pte is invalid, so the HW doesn't
> >>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
> >>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
> >>> that we now repurpose for PROT_NONE.
> >>
> >> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
> >> set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
> >> for PAGE_NONE (bar the kernel mappings)?
> >
> > Yes, I guess that works. I personally prefer my proposal because it is more
> > intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
> > if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
> > preferable, then I'll go with that.
>
> Ignore this - I looked at your proposed approach and agree it's better. I'll use
> `PTE_USER|PTE_UXN==0b01`. Posting shortly...
You nearly convinced me until I read your second reply ;). The
PTE_WRITE|PTE_RDONLY == 0b00 scheme still has the mkwrite problem, if we care
about it (I don't think it can happen though).
--
Catalin
On 29/04/2024 15:18, Catalin Marinas wrote:
> On Mon, Apr 29, 2024 at 02:23:35PM +0100, Ryan Roberts wrote:
>> On 29/04/2024 14:01, Ryan Roberts wrote:
>>> On 29/04/2024 13:38, Catalin Marinas wrote:
>>>> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>>>>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>>>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>>>>
>>>>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>>>>> and use it for both ptes and pmds.
>>>>>
>>>>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>>>>> core-mm, so I will now fix that before I can validate my change. See
>>>>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>>>>
>>>>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>>>>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>>>>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>>>>
>>>>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>>>>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>>>>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>>>>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>>>>> that we now repurpose for PROT_NONE.
>>>>
>>>> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
>>>> set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
>>>> for PAGE_NONE (bar the kernel mappings)?
>>>
>>> Yes, I guess that works. I personally prefer my proposal because it is more
>>> intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
>>> if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
>>> preferable, then I'll go with that.
>>
>> Ignore this - I looked at your proposed approach and agree it's better. I'll use
>> `PTE_USER|PTE_UXN==0b01`. Posting shortly...
>
> You nearly convinced me until I read your second reply ;). The
> PTE_WRITE|PTE_RDONLY == 0b00 scheme still has the mkwrite problem, if we care
> about it (I don't think it can happen though).
Yes, just to clearly enumerate the reasons I prefer your approach:
- PTE_RDONLY is also used for the HW dirty bit. I had to add a conditional to
  pte_mkclean() for my scheme to prevent pte_mkclean() on a PROT_NONE pte
  erroneously making it RO (see the sketch after this list). No such problem
  with your scheme.
- With my scheme, we have the mkwrite problem, as you call it. Although, as I
  said, some arches already have this semantic, so I don't think it's a problem.
  But with your scheme we keep the existing arm64 semantics, so it reduces the
  risk of a problem in a corner I overlooked.
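To make the first point concrete, a sketch of that interaction (VALID, RDONLY,
DBM and the SW dirty bit are at their arm64 positions; the mkclean behaviour is
modelled on the arm64 helper, but this is an illustration, not the kernel
code):

#include <stdint.h>

#define PTE_VALID	(1ULL << 0)
#define PTE_RDONLY	(1ULL << 7)	/* AP[2] */
#define PTE_WRITE	(1ULL << 51)	/* DBM */
#define PTE_DIRTY	(1ULL << 55)	/* SW dirty */

/* arm64-style pte_mkclean(): drop the SW dirty bit, set PTE_RDONLY. */
static uint64_t mkclean_sketch(uint64_t pte)
{
	pte &= ~PTE_DIRTY;
	pte |= PTE_RDONLY;
	return pte;
}

/*
 * With PROT_NONE encoded as PTE_WRITE == 0 && PTE_RDONLY == 0, calling
 * mkclean_sketch() on a PROT_NONE pte sets PTE_RDONLY, so the entry now
 * decodes as RO and its protnone-ness is silently lost unless
 * pte_mkclean() grows a special case. The PTE_USER|PTE_UXN == 0b01
 * encoding is untouched by mkclean, so no conditional is needed.
 */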
Anyway, I've posted the v2. Take a look when you get time - perhaps we can get
it into v6.10?