From: Jeff Xu <[email protected]>
This patch adds an architecture-independent function,
arch_check_pkey_enforce_api(), that checks whether the calling thread
has write access to the PKRU for a given range of memory. If the
memory range is protected by PKEY_ENFORCE_API, then the thread must
have write access to the PKRU in order to make changes to the memory
mapping (such as mprotect/munmap).
This function is used by the kernel to enforce the
PKEY_ENFORCE_API flag.
Signed-off-by: Jeff Xu<[email protected]>
---
arch/powerpc/include/asm/pkeys.h | 8 +++++
arch/x86/include/asm/pkeys.h | 50 ++++++++++++++++++++++++++++++++
include/linux/pkeys.h | 9 ++++++
3 files changed, 67 insertions(+)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 943333ac0fee..24c481e5e95b 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -177,5 +177,13 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
return true;
}
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ /* Allow by default */
+ return 0;
+}
+
extern void pkey_mm_init(struct mm_struct *mm);
#endif /*_ASM_POWERPC_KEYS_H */
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index ecadf04a8251..8b94ffc4ca32 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -161,4 +161,54 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
return true;
}
+
+static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
+{
+ int pkey = vma_pkey(vma);
+
+ if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
+ if (!__pkru_allows_write(read_pkru(), pkey))
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+/*
+ * arch_check_pkey_enforce_api is used by the kernel to enforce
+ * PKEY_ENFORCE_API flag.
+ * It checks whether the calling thread has write access to the PKRU
+ * for a given range of memory. If the memory range is protected by
+ * PKEY_ENFORCE_API, then the thread must have write access to the
+ * PKRU in order to make changes to the memory mapping, such as
+ * mprotect/munmap.
+ */
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ int error;
+ struct vm_area_struct *vma;
+
+ if (!arch_pkeys_enabled())
+ return 0;
+
+ while (true) {
+ vma = find_vma_intersection(mm, start, end);
+ if (!vma)
+ break;
+
+ error = __arch_check_vma_pkey_for_write(vma);
+ if (error)
+ return error;
+
+ if (vma->vm_end >= end)
+ break;
+
+ start = vma->vm_end;
+ }
+
+ return 0;
+}
+
#endif /*_ASM_X86_PKEYS_H */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 81a482c3e051..7b00689e1c24 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -53,6 +53,15 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
return false;
return true;
}
+
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ // Allow by default.
+ return 0;
+}
+
#endif /* ! CONFIG_ARCH_HAS_PKEYS */
#endif /* _LINUX_PKEYS_H */
--
2.40.1.606.ga4b1b128d6-goog
On 5/15/23 06:05, [email protected] wrote:
> +static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
> +{
> + int pkey = vma_pkey(vma);
> +
> + if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
> + if (!__pkru_allows_write(read_pkru(), pkey))
> + return -EACCES;
> + }
> +
> + return 0;
> +}
Please think very carefully about what I'm about to say:
What connects vma->vm_mm to read_pkru() here?
Now think about what happens when we have kthread_use_mm() or a ptrace()
doing get_task_mm() and working on another process's mm or VMA.
Look at arch_vma_access_permitted() and notice how it avoids read_pkru()
for 'foreign' aka. 'remote' accesses:
> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> bool write, bool execute, bool foreign)
> {
...
> if (foreign || vma_is_foreign(vma))
> return true;
> return // check read_pkru()
> }
In other words, it lets all remote accesses right through. That's
because there is *NOTHING* that fundamentally and tightly connects the
PKRU value in this context to the VMA or the context that initiated this
operation.
If your security model depends on PKRU protection, this 'remote'
disconnection is problematic. The PKRU enforcement inside the kernel is
best-effort. That usually doesn't map into the security space very well.
Do you have a solid handle on all call paths that will reach
__arch_check_vma_pkey_for_write() and can you ensure they are all
non-remote?
On Thu, May 18, 2023 at 2:43 PM Dave Hansen <[email protected]> wrote:
>
> On 5/15/23 06:05, [email protected] wrote:
> > +static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
> > +{
> > + int pkey = vma_pkey(vma);
> > +
> > + if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
> > + if (!__pkru_allows_write(read_pkru(), pkey))
> > + return -EACCES;
> > + }
> > +
> > + return 0;
> > +}
>
> Please think very carefully about what I'm about to say:
>
> What connects vma->vm_mm to read_pkru() here?
>
> Now think about what happens when we have kthread_use_mm() or a ptrace()
> doing get_task_mm() and working on another process's mm or VMA.
>
> Look at arch_vma_access_permitted() and notice how it avoids read_pkru()
> for 'foreign' aka. 'remote' accesses:
>
> > static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> > bool write, bool execute, bool foreign)
> > {
> ...
> > if (foreign || vma_is_foreign(vma))
> > return true;
> > return // check read_pkru()
> > }
>
> In other words, it lets all remote accesses right through. That's
> because there is *NOTHING* that fundamentally and tightly connects the
> PKRU value in this context to the VMA or the context that initiated this
> operation.
>
> If your security model depends on PKRU protection, this 'remote'
> disconnection is problematic. The PKRU enforcement inside the kernel is
> best-effort. That usually doesn't map into the security space very well.
>
> Do you have a solid handle on all call paths that will reach
> __arch_check_vma_pkey_for_write() and can you ensure they are all
> non-remote?
Is this about the attack scenario where the attacker uses ptrace()
into the chrome process ? if so it is not in our threat model, and
that is more related to sandboxing on the host.
Or is this about io_uring? Yes, io_uring kernel thread breaks our
expectations of PKRU & user space threads, however I thought the break
is not just for this - any syscall involved in memory operation will
break after into io_uring ?
Other than those, yes, I try to ensure the check is only used at the
beginning of syscall entry in all cases, which should be non-remote I
hope.
Thanks
-Jeff
On 5/18/23 15:51, Jeff Xu wrote:
>> Do you have a solid handle on all call paths that will reach
>> __arch_check_vma_pkey_for_write() and can you ensure they are all
>> non-remote?
> Is this about the attack scenario where the attacker uses ptrace()
> into the chrome process ? if so it is not in our threat model, and
> that is more related to sandboxing on the host.
The attacker would use *some* remote interface. ptrace() is just one of
those remote interfaces.
> Or is this about io_uring? Yes, io_uring kernel thread breaks our
> expectations of PKRU & user space threads, however I thought the break
> is not just for this - any syscall involved in memory operation will
> break after into io_uring ?
I'm not quite following.
Please just do me a favor: have the io_uring maintainers look at your
proposal. Make sure that the defenses you are building can work in a
process where io_uring is in use by the benign threads.
Those same folks are pretty familiar with the other, more traditional
I/O syscalls that have in-memory descriptors that control syscall
behavior like readv/writev. Those also need a close look.
> Other than those, yes, I try to ensure the check is only used at the
> beginning of syscall entry in all cases, which should be non-remote I
> hope.
You're right that synchronous, shallow syscall paths are usually
non-remote. But those aren't the problem. The problem is that there
*ARE* remote accesses and those are a potential hole for this whole
mechanism.
Can they be closed? I don't know. I honestly don't have a great grasp
on how widespread these things are. You'll need a much more complete
grasp on them than I have before this thing can go forward.
On Fri, May 19, 2023 at 2:00 AM Dave Hansen <[email protected]> wrote:
>
> On 5/18/23 15:51, Jeff Xu wrote:
> >> Do you have a solid handle on all call paths that will reach
> >> __arch_check_vma_pkey_for_write() and can you ensure they are all
> >> non-remote?
> > Is this about the attack scenario where the attacker uses ptrace()
> > into the chrome process ? if so it is not in our threat model, and
> > that is more related to sandboxing on the host.
>
> The attacker would use *some* remote interface. ptrace() is just one of
> those remote interfaces.
>
> > Or is this about io_uring? Yes, io_uring kernel thread breaks our
> > expectations of PKRU & user space threads, however I thought the break
> > is not just for this - any syscall involved in memory operation will
> > break after into io_uring ?
>
> I'm not quite following.
>
> Please just do me a favor: have the io_uring maintainers look at your
> proposal. Make sure that the defenses you are building can work in a
> process where io_uring is in use by the benign threads.
>
> Those same folks are pretty familiar with the other, more traditional
> I/O syscalls that have in-memory descriptors that control syscall
> behavior like readv/writev. Those also need a close look.
>
> > Other than those, yes, I try to ensure the check is only used at the
> > beginning of syscall entry in all cases, which should be non-remote I
> > hope.
>
> You're right that synchronous, shallow syscall paths are usually
> non-remote. But those aren't the problem. The problem is that there
> *ARE* remote accesses and those are a potential hole for this whole
> mechanism.
>
> Can they be closed? I don't know. I honestly don't have a great grasp
> on how widespread these things are. You'll need a much more complete
> grasp on them than I have before this thing can go forward.
I don't think the remote writes are a problem for us if they're initiated from
the same process. It's a case of syscalls where we need to add special
validation in userspace.