So here comes the second version after the first round of comments.
As suggested, I dropped the thread_info flag and placed it in the
mm_struct instead. There's now a per_cpu variable that can be checked
in the entry code to decide whether or not to switch CR3.
It's important to note that the new flag is lost upon execve(). I think
that this provides a better guarantee against any accidental use (eg: a
program calling some external helpers once in a while), but it also
means we can't use a wrapper anymore and have to modify the executable.
I continue to think that a mixed approach consisting in having a specific
flag that is only applied upon next execve() call and dropped could be
nice, but for now I'm not really sure how to do this cleanly.
Regarding the _PAGE_NX change, for now I didn't touch it. I like Andy's
approach consisting in changing it dynamically after the first page
fault caused by the return to userspace. I just don't know how to do
that for now.
I've split the entry code changes in two. The first part only updates the
kernel entry code to avoid updating CR3 if it already points to a kernel
PGD. The second one adds the flag check when going back to userspace.
This allowed me to check if the CR3-only changes brought any benefit, but
I failed to detect any improvement with that alone for now, including on
a preempt kernel.
With this patch, when haproxy starts with "arch_prctl(0x1022, 1)", the
performance drop compared to booting with "pti=off" is only ~1% and more
or less within measurement noise.
For now I've left the prctl to retrieve the current value as it helped
during debugging, though I think it should disappear before the final
version as it provides very little value.
Here are the numbers I'm seeing in the various situations for a few
tests on a hardware machine (core i7-4790K), numbers are in connections
per second, with the performance ratio compared to pti=off between
parenthesis :
TEST(*)
reject reject+acl forward
---------------+-------------+---------------+----------------
pti=off 444k (100%) 252k (100%) 83k (100%)
pti=on 382k (86%) 195k (77%) 71k (85%)
pti=on+prctl 439k (99%) 249k (99%) 83k (100%)
*: tests:
"reject" : reject rule, accept(), setsockopt() and close()
"reject+acl" : acl-based rule, does extra syscalls (getsockname(),
getsockopt, 2 setsockopt, recv, shutdown)
"forward" : connection forwarded to remote server, much heavier
It's interesting to node that the rule employing a few more syscalls
without adding much userspace work is obviously more impacted by PTI.
We have a total of 8 syscalls per connection on the middle one and
the difference is important.
Willy
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Willy Tarreau (6):
x86/mm: add a pti_disable entry in mm_context_t
x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to
enable/disable PTI
x86/pti: add a per-cpu variable pti_disable
x86/pti: don't mark the user PGD with _PAGE_NX.
x86/entry/pti: avoid setting CR3 when it's already correct
x86/entry/pti: don't switch PGD on when pti_disable is set
arch/x86/entry/calling.h | 25 +++++++++++++++++++++++++
arch/x86/include/asm/mmu.h | 4 ++++
arch/x86/include/uapi/asm/prctl.h | 3 +++
arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
arch/x86/mm/pti.c | 2 ++
5 files changed, 58 insertions(+)
--
1.7.12.1
When entering the kernel with CR3 pointing to the kernel's PGD, there's
no need to set it again. This will avoid a TLB flush on syscalls for tasks
running with the kernel's PGD (see next patch).
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
v2:
- updated comments according to Ingo's suggestions
- split the code to keep only the CR3 changes here
---
arch/x86/entry/calling.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 45a63e0..2c0d3b5 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -214,6 +214,11 @@
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
+
+ /* if we're already on the kernel PGD, we don't switch */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+ jz .Lend_\@
+
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
.Lend_\@:
@@ -262,6 +267,14 @@
ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
movq %cr3, \scratch_reg
movq \scratch_reg, \save_reg
+
+ /*
+ * If we're already on the kernel PGD, we don't switch,
+ * we just save the current CR3.
+ */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+ jz .Ldone_\@
+
/*
* Is the "switch mask" all zero? That means that both of
* these are zero:
@@ -284,6 +297,13 @@
.macro RESTORE_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ /*
+ * If we saved a kernel context on entry, we didn't switch the CR3,
+ * so we don't need to restore it on the way out either:
+ */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
+ jz .Lend_\@
+
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
--
1.7.12.1
This will be set/cleared using arch_prctl() to allow the tasks using
this mm to disable the PTI protection.
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/mmu.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8a..c7c2ca1 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -36,6 +36,10 @@
/* True if mm supports a task running in 32 bit compatibility mode. */
unsigned short ia32_compat;
#endif
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ /* True if mm is forced to run with page table isolation disabled */
+ char pti_disable;
+#endif
struct mutex lock;
void __user *vdso; /* vdso base address */
--
1.7.12.1
This one is updated upon each context switch to reflect the
crrent mm's pti_disable field.
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/process_64.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9516310..9bb5908 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -61,6 +61,10 @@
__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+__visible DEFINE_PER_CPU(unsigned char, pti_disable);
+#endif
+
/* Prints also some state that isn't saved in the pt_regs */
void __show_regs(struct pt_regs *regs, int all)
{
@@ -473,6 +477,11 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
__switch_to_xtra(prev_p, next_p, tss);
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ this_cpu_write(pti_disable,
+ next_p->mm && next_p->mm->context.pti_disable);
+#endif
+
#ifdef CONFIG_XEN_PV
/*
* On Xen PV, IOPL bits in pt_regs->flags have no effect, and
--
1.7.12.1
This allows to report the current state of the PTI protection and to
enable or disable it for the current process. The state change is only
allowed if the mm is not shared (no threads have been created yet).
Setting the flag to disable the protection is subject to CAP_SYS_RAWIO.
However it is possible to re-enable the protection without this privilege.
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
v2:
- use {set,clear}_thread_flag() as recommended by Peter
- use task->mm->context.pti_disable instead of task flag
- check for mm_users == 1
- check for CAP_SYS_RAWIO only when setting, not clearing
- make the code depend on CONFIG_PAGE_TABLE_ISOLATION
---
arch/x86/include/uapi/asm/prctl.h | 3 +++
arch/x86/kernel/process_64.c | 15 +++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9..1f1b5bc 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
#define ARCH_GET_CPUID 0x1011
#define ARCH_SET_CPUID 0x1012
+#define ARCH_GET_NOPTI 0x1021
+#define ARCH_SET_NOPTI 0x1022
+
#define ARCH_MAP_VDSO_X32 0x2001
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c754662..9516310 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -654,7 +654,22 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ case ARCH_GET_NOPTI: {
+ unsigned long flag = !!task->mm->context.pti_disable;
+ ret = put_user(flag, (unsigned long __user *)arg2);
+ break;
+ }
+ case ARCH_SET_NOPTI:
+ if (!task->mm || atomic_read(&task->mm->mm_users) > 1 ||
+ (arg2 && !capable(CAP_SYS_RAWIO)))
+ return -EPERM;
+
+ if (doit)
+ task->mm->context.pti_disable = !!arg2;
+ break;
+#endif
#ifdef CONFIG_CHECKPOINT_RESTORE
# ifdef CONFIG_X86_X32_ABI
case ARCH_MAP_VDSO_X32:
--
1.7.12.1
Since we're going to keep running on the same PGD when returning to
userspace for certain performance-critical tasks, we'll need the user
pages to be executable. So this code disables the extra protection
that was added consisting in marking user pages _PAGE_NX so that this
pgd remains usable for userspace.
Note: it isn't necessarily the best approach, but one way or another
if we want to be able to return to userspace from the kernel,
we'll have to have this executable anyway. Another approach
might consist in using another pgd for userland+kernel but
the current core really looks like an extra careful measure
to catch early bugs if any.
Note2: Andy's suggestion to instead dynamically disable NX upon
page fault seems the most appealing.
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/mm/pti.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 43d4a4a..9e2dca0 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -135,9 +135,11 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
* - we don't have NX support
* - we're clearing the PGD (i.e. the new pgd is not present).
*/
+#if 0
if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
(__supported_pte_mask & _PAGE_NX))
pgd.pgd |= _PAGE_NX;
+#endif
/* return the copy of the PGD we want the kernel to use: */
return pgd;
--
1.7.12.1
When a syscall returns to userspace with pti_disable set, it means the
current mm is configured to disable page table isolation (PTI). In this
case, returns from kernel to user will not switch the CR3, leaving it
to the kernel one which already maps both user and kernel pages. This
avoids a TLB flush, and saves another one on next entry.
Thanks to these changes, haproxy running under KVM went back from
12700 conn/s (without PCID) or 19700 (with PCID) to 23100 once loaded
after calling prctl(), indicating that PTI has no measurable impact on
this workload.
Signed-off-by: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Kees Cook <[email protected]>
v2:
- use pti_disable instead of task flag
---
arch/x86/entry/calling.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 2c0d3b5..5361a10 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -229,6 +229,11 @@
.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
+ cmpb $0, PER_CPU_VAR(pti_disable)
+ jne .Lend_\@
+
mov %cr3, \scratch_reg
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
--
1.7.12.1
On Tue, Jan 09, 2018 at 01:56:16PM +0100, Willy Tarreau wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current process. The state change is only
> allowed if the mm is not shared (no threads have been created yet).
>
> Setting the flag to disable the protection is subject to CAP_SYS_RAWIO.
> However it is possible to re-enable the protection without this privilege.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> v2:
> - use {set,clear}_thread_flag() as recommended by Peter
> - use task->mm->context.pti_disable instead of task flag
> - check for mm_users == 1
> - check for CAP_SYS_RAWIO only when setting, not clearing
> - make the code depend on CONFIG_PAGE_TABLE_ISOLATION
> ---
> arch/x86/include/uapi/asm/prctl.h | 3 +++
> arch/x86/kernel/process_64.c | 15 +++++++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
> #define ARCH_GET_CPUID 0x1011
> #define ARCH_SET_CPUID 0x1012
>
> +#define ARCH_GET_NOPTI 0x1021
> +#define ARCH_SET_NOPTI 0x1022
> +
> #define ARCH_MAP_VDSO_X32 0x2001
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..9516310 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,7 +654,22 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> ret = put_user(base, (unsigned long __user *)arg2);
> break;
> }
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
Actually, I meant to add a new CONFIG item only for this feature which
depends on CONFIG_PAGE_TABLE_ISOLATION. So that people can disable the
per-mm thing when they don't want it.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Hi Boris!
On Tue, Jan 09, 2018 at 03:17:13PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 01:56:16PM +0100, Willy Tarreau wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current process. The state change is only
> > allowed if the mm is not shared (no threads have been created yet).
> >
> > Setting the flag to disable the protection is subject to CAP_SYS_RAWIO.
> > However it is possible to re-enable the protection without this privilege.
> >
> > Signed-off-by: Willy Tarreau <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Brian Gerst <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Josh Poimboeuf <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Kees Cook <[email protected]>
> >
> > v2:
> > - use {set,clear}_thread_flag() as recommended by Peter
> > - use task->mm->context.pti_disable instead of task flag
> > - check for mm_users == 1
> > - check for CAP_SYS_RAWIO only when setting, not clearing
> > - make the code depend on CONFIG_PAGE_TABLE_ISOLATION
> > ---
> > arch/x86/include/uapi/asm/prctl.h | 3 +++
> > arch/x86/kernel/process_64.c | 15 +++++++++++++++
> > 2 files changed, 18 insertions(+)
> >
> > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9..1f1b5bc 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -10,6 +10,9 @@
> > #define ARCH_GET_CPUID 0x1011
> > #define ARCH_SET_CPUID 0x1012
> >
> > +#define ARCH_GET_NOPTI 0x1021
> > +#define ARCH_SET_NOPTI 0x1022
> > +
> > #define ARCH_MAP_VDSO_X32 0x2001
> > #define ARCH_MAP_VDSO_32 0x2002
> > #define ARCH_MAP_VDSO_64 0x2003
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index c754662..9516310 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -654,7 +654,22 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> > ret = put_user(base, (unsigned long __user *)arg2);
> > break;
> > }
> > +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>
> Actually, I meant to add a new CONFIG item only for this feature which
> depends on CONFIG_PAGE_TABLE_ISOLATION. So that people can disable the
> per-mm thing when they don't want it.
I see and am not particularly against this, but what use case do you
have in mind precisely ? I doubt it's just saving a few tens of bytes,
so probably you're more concerned about the potential risks this opens ?
But given we only allow this for CAP_SYS_RAWIO and these ones already
have access to /dev/mem and many other things, don't you think there
are much easier ways to dump kernel memory in this case than trying to
inject some meltdown code into the victim process ? Or maybe you have
other cases in mind that I'm not seeing.
Thanks,
willy
On Tue, Jan 09, 2018 at 03:36:53PM +0100, Willy Tarreau wrote:
> I see and am not particularly against this, but what use case do you
> have in mind precisely ? I doubt it's just saving a few tens of bytes,
> so probably you're more concerned about the potential risks this opens ?
> But given we only allow this for CAP_SYS_RAWIO and these ones already
> have access to /dev/mem and many other things, don't you think there
> are much easier ways to dump kernel memory in this case than trying to
> inject some meltdown code into the victim process ? Or maybe you have
> other cases in mind that I'm not seeing.
I'd like this to be config-controllable so that distros can make the
decision whether/if they want to support the whole per-mm thing.
Also, if CAP_SYS_RAWIO is going to protect, please make the
ARCH_GET_NOPTI variant check it too.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jan 09, 2018 at 03:51:57PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 03:36:53PM +0100, Willy Tarreau wrote:
> > I see and am not particularly against this, but what use case do you
> > have in mind precisely ? I doubt it's just saving a few tens of bytes,
> > so probably you're more concerned about the potential risks this opens ?
> > But given we only allow this for CAP_SYS_RAWIO and these ones already
> > have access to /dev/mem and many other things, don't you think there
> > are much easier ways to dump kernel memory in this case than trying to
> > inject some meltdown code into the victim process ? Or maybe you have
> > other cases in mind that I'm not seeing.
>
> I'd like this to be config-controllable so that distros can make the
> decision whether/if they want to support the whole per-mm thing.
OK.
> Also, if CAP_SYS_RAWIO is going to protect, please make the
> ARCH_GET_NOPTI variant check it too.
Interestingly I removed the check consecutive to the discussions. But
I think I'll simply remove the whole ARCH_GET_NOPTI as it has no real
value beyond initial development.
Willy
On Tue, Jan 9, 2018 at 6:54 AM, Willy Tarreau <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 03:51:57PM +0100, Borislav Petkov wrote:
>> On Tue, Jan 09, 2018 at 03:36:53PM +0100, Willy Tarreau wrote:
>> > I see and am not particularly against this, but what use case do you
>> > have in mind precisely ? I doubt it's just saving a few tens of bytes,
>> > so probably you're more concerned about the potential risks this opens ?
>> > But given we only allow this for CAP_SYS_RAWIO and these ones already
>> > have access to /dev/mem and many other things, don't you think there
>> > are much easier ways to dump kernel memory in this case than trying to
>> > inject some meltdown code into the victim process ? Or maybe you have
>> > other cases in mind that I'm not seeing.
>>
>> I'd like this to be config-controllable so that distros can make the
>> decision whether/if they want to support the whole per-mm thing.
>
> OK.
>
>> Also, if CAP_SYS_RAWIO is going to protect, please make the
>> ARCH_GET_NOPTI variant check it too.
>
> Interestingly I removed the check consecutive to the discussions. But
> I think I'll simply remove the whole ARCH_GET_NOPTI as it has no real
> value beyond initial development.
>
I've thought about this a bit more. Here are my thoughts:
1. I don't like it being per-mm. I think it should be a per-thread
control so that a program can have a thread with PTI that runs
less-trusted JavaScript and other network threads with PTI off.
Obviously we lose NX protection mm-wide if any threads have PTI off.
I think the way to implement this is:
Have this in struct mm_context:
bool has_non_pti_thread;
To turn PTI off on a thread:
Take pagetable_lock.
if (!has_non_pti_thread) {
context.has_non_pti_thread = true;
clear the NX bits;
}
drop pagetable_lock;
set the TI flag;
Fork clears the per-mm flag in the new mm. Exec clears it, too. I
think that's all that's needed. Newly created threads always have PTI
on.
To turn PTI back on, just clear the TI flag.
2.Turning off PTI is, in general, a terrible idea. It totally breaks
any semblance of a security model on a Meltdown-affected CPU. So I
think we should require CAP_SYS_RAWIO *and* that the system is booted
with pti=allow_optout or something like that.
--Andy
On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
> 2.Turning off PTI is, in general, a terrible idea. It totally breaks
> any semblance of a security model on a Meltdown-affected CPU. So I
> think we should require CAP_SYS_RAWIO *and* that the system is booted
> with pti=allow_optout or something like that.
Uhh, I like that.
Maybe also taint the kernel ...
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jan 9, 2018 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> 2.Turning off PTI is, in general, a terrible idea. It totally breaks
> any semblance of a security model on a Meltdown-affected CPU. So I
> think we should require CAP_SYS_RAWIO *and* that the system is booted
> with pti=allow_optout or something like that.
Agreed, this shouldn't be default-available. Besides, your most
trusted processes are the ones most likely to be targeted for attack.
:(
-Kees
--
Kees Cook
Pixel Security
On Tue, Jan 09, 2018 at 10:29:40PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
> > 2.Turning off PTI is, in general, a terrible idea. It totally breaks
> > any semblance of a security model on a Meltdown-affected CPU. So I
> > think we should require CAP_SYS_RAWIO *and* that the system is booted
> > with pti=allow_optout or something like that.
>
> Uhh, I like that.
>
> Maybe also taint the kernel ...
Requiring a reboot just to fix a performance problem you've discovered
the hard way is not the most friendly way to help users I'm afraid.
However, definitely +1 on tainting!
Willy
On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 9, 2018 at 6:54 AM, Willy Tarreau <[email protected]> wrote:
> > On Tue, Jan 09, 2018 at 03:51:57PM +0100, Borislav Petkov wrote:
> >> On Tue, Jan 09, 2018 at 03:36:53PM +0100, Willy Tarreau wrote:
> >> > I see and am not particularly against this, but what use case do you
> >> > have in mind precisely ? I doubt it's just saving a few tens of bytes,
> >> > so probably you're more concerned about the potential risks this opens ?
> >> > But given we only allow this for CAP_SYS_RAWIO and these ones already
> >> > have access to /dev/mem and many other things, don't you think there
> >> > are much easier ways to dump kernel memory in this case than trying to
> >> > inject some meltdown code into the victim process ? Or maybe you have
> >> > other cases in mind that I'm not seeing.
> >>
> >> I'd like this to be config-controllable so that distros can make the
> >> decision whether/if they want to support the whole per-mm thing.
> >
> > OK.
> >
> >> Also, if CAP_SYS_RAWIO is going to protect, please make the
> >> ARCH_GET_NOPTI variant check it too.
> >
> > Interestingly I removed the check consecutive to the discussions. But
> > I think I'll simply remove the whole ARCH_GET_NOPTI as it has no real
> > value beyond initial development.
> >
>
> I've thought about this a bit more. Here are my thoughts:
>
> 1. I don't like it being per-mm. I think it should be a per-thread
> control so that a program can have a thread with PTI that runs
> less-trusted JavaScript and other network threads with PTI off.
Ingo suggested such use case as well. While I'm quite inclined to agree
with it, I'm just thinking, do we really have some processes both I/O
bound and executing Javascript or similar in a thread ? Well, thinking
about it, we have Lua in haproxy, we could imagine having Javascript
later when admins don't want to learn Lua. So that could make sense
(/me takes a sickness bag to throw up).
> Obviously we lose NX protection mm-wide if any threads have PTI off.
> I think the way to implement this is:
>
> Have this in struct mm_context:
>
> bool has_non_pti_thread;
>
> To turn PTI off on a thread:
>
> Take pagetable_lock.
> if (!has_non_pti_thread) {
> context.has_non_pti_thread = true;
> clear the NX bits;
> }
> drop pagetable_lock;
> set the TI flag;
Linus suggested that we refuse to turn off PTI if any thread was already
created and I really agree with this, and it's not incompatible with
what you have above. We could just turn it on again for certain threads.
> Fork clears the per-mm flag in the new mm. Exec clears it, too. I
> think that's all that's needed. Newly created threads always have PTI
> on.
Fork doesn't clear (exec indeed does). Fork clearing it would be
problematic as it would mean you can't do it on a deamon during startup.
> To turn PTI back on, just clear the TI flag.
>
> 2.Turning off PTI is, in general, a terrible idea. It totally breaks
> any semblance of a security model on a Meltdown-affected CPU.
Absolutely, but it recovers what matters more in *certain* workloads,
which is performance.
> So I
> think we should require CAP_SYS_RAWIO *and* that the system is booted
> with pti=allow_optout or something like that.
I'm really not fan of this. 1) it would require to reboot during the
peak hour to try to fix the problem. 2) the flag will end up being
deployed everywhere by default in environments flirting with performance
"just in case" so it will be rendered useless.
I'm fine with Boris' requirement that the kernel should be build with
the appropriate option to support this. If you're doing your own builds,
you can well take care of having the appropriate options (PTI+the right
to turn it off) and deploy such kernels where relevant.
Willy
On Tue, Jan 09, 2018 at 10:32:27PM +0100, Willy Tarreau wrote:
> Requiring a reboot just to fix a performance problem you've discovered
> the hard way is not the most friendly way to help users I'm afraid.
That's a very strange argument: if you know you'd need max perf, you
boot with pti=allow_optout.
Color me confused.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jan 9, 2018 at 1:41 PM, Willy Tarreau <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
>> So I
>> think we should require CAP_SYS_RAWIO *and* that the system is booted
>> with pti=allow_optout or something like that.
>
> I'm really not fan of this. 1) it would require to reboot during the
> peak hour to try to fix the problem. 2) the flag will end up being
> deployed everywhere by default in environments flirting with performance
> "just in case" so it will be rendered useless.
>
> I'm fine with Boris' requirement that the kernel should be build with
> the appropriate option to support this. If you're doing your own builds,
> you can well take care of having the appropriate options (PTI+the right
> to turn it off) and deploy such kernels where relevant.
IMO, run-time selection is always better than build-time selection.
e.g. a distro would build it in just in case anyone needs it, but the
vast majority of system this would be dangerous on. Therefore, make it
part of the kernel, but require it be enabled at boot.
-Kees
--
Kees Cook
Pixel Security
On Tue, Jan 09, 2018 at 01:50:10PM -0800, Kees Cook wrote:
> On Tue, Jan 9, 2018 at 1:41 PM, Willy Tarreau <[email protected]> wrote:
> > On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
> >> So I
> >> think we should require CAP_SYS_RAWIO *and* that the system is booted
> >> with pti=allow_optout or something like that.
> >
> > I'm really not fan of this. 1) it would require to reboot during the
> > peak hour to try to fix the problem. 2) the flag will end up being
> > deployed everywhere by default in environments flirting with performance
> > "just in case" so it will be rendered useless.
> >
> > I'm fine with Boris' requirement that the kernel should be build with
> > the appropriate option to support this. If you're doing your own builds,
> > you can well take care of having the appropriate options (PTI+the right
> > to turn it off) and deploy such kernels where relevant.
>
> IMO, run-time selection is always better than build-time selection.
> e.g. a distro would build it in just in case anyone needs it, but the
> vast majority of system this would be dangerous on. Therefore, make it
> part of the kernel, but require it be enabled at boot.
For all the rest we use sysctls then. suid_dumpable is a sysctl,
mmap_min_addr is a sysctl. That would be quite better. Having to reboot
all your LBs at the traffic peak just to pass an option you had never
heard of and you don't even know if it will work nor what the impact is
is really what will make our users loudly call us names about our design
choices :-/
Another benefit of the sysctl is that if it doesn't work you can turn it
off. The user who already had to reboot to set the option will definitely
not boot again to disable it if it didn't solve his problem!
Willy
On Tue, Jan 09, 2018 at 10:46:02PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 10:32:27PM +0100, Willy Tarreau wrote:
> > Requiring a reboot just to fix a performance problem you've discovered
> > the hard way is not the most friendly way to help users I'm afraid.
>
> That's a very strange argument: if you know you'd need max perf, you
> boot with pti=allow_optout.
>
> Color me confused.
That's very simple : you first know you need more perf when you see the
name of your boss on your phone asking what's happening with the site
suddenly crawling at the worst possible moment, when everyone is there
to see it dead. Performance is something that's tuned at runtime, always,
not via random reboots. When you have 10 servers running at 100% CPU,
the last thing you're thinking about is to remove one of them so that
the 9 remaining ones are at 110% while you reboot :-/
Willy
On Tue, Jan 09, 2018 at 11:06:05PM +0100, Willy Tarreau wrote:
> That's very simple : you first know you need more perf when you see the
> name of your boss on your phone asking what's happening with the site
Ok, lemme try to understand that:
so you boot with PTI *enabled*, you don't know about the whole PTI
fiasco and the performance impact it has because you obviously live on
another planet, your boss calls you and *then* you start looking to
increase performance and *then* you google and realize, shikes, you need
to turn off PTI for a specific process. And rebooting is not an option.
Oh, and you've built the kernel with the option to be able to disable
PTI so it's not like you haven't seen it already.
Sorry but I really really don't believe that is in any way close to
reality.
If it is, then this is the story of the sysadmin who, like an ostrich,
kept his head in the sand.
:-)
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Has anyone else noticed that CR3 in oopses is now basically a lie with
PTI? It shows CR3 at the time of the printing of the oops, not at the
time of the fault like the normal registers. That greatly limits its
usefulness in the dumps with PTI.
Should we carve out some space somewhere to stash it at entry so we can
dump it in oopses? Any preferences on where? We need somewhere to
write that's before the SWITCH_TO_KERNEL_CR3, most likely so
cpu_entry_area is a candidate. The trampoline stack should also have space.
On Tue, Jan 09, 2018 at 11:20:36PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 11:06:05PM +0100, Willy Tarreau wrote:
> > That's very simple : you first know you need more perf when you see the
> > name of your boss on your phone asking what's happening with the site
>
> Ok, lemme try to understand that:
>
> so you boot with PTI *enabled*, you don't know about the whole PTI
> fiasco and the performance impact it has because you obviously live on
> another planet, your boss calls you and *then* you start looking to
> increase performance and *then* you google and realize, shikes, you need
> to turn off PTI for a specific process. And rebooting is not an option.
Boris, please don't try to make me look like a fool when I'm trying to
explain a common process. The reality is that at many places the system's
load can easily vary 10-fold along a day, week or month. I even know a
place where just two days a year you see on graphs what looks like a wall
because previous values completely disappear. And how do sysadmins size
their infra in these environments ? They use the same sizing as what they
used last time, correcting the error margin. When the graphs are small,
they'll say "OK now that I'm running the patched kernel, the load
increased from 7 to 11%, that's not much, only 4 more percent for the fix"
(because sadly many people add percents). But once this load multiples by
10, it will be too late to figure that the old 70% did not become 74% but
110%. And yes, saying that it's the best moment to reboot so that you have
even less processing power, and you reboot each of your 10 servers only
one at a time waiting for 2 minutes while the RAID controller enumerates
the disks, it will definitely take more than 20 minutes to fix the problem.
I'm not artificially making this up, that's sadly something more common
than you probably expect in environments experiencing huge but predictible
variations.
> Oh, and you've built the kernel with the option to be able to disable
> PTI so it's not like you haven't seen it already.
No, your distro did. Please keep in mind that you were the one asking me
to have this option so that distros can enable it to please their users,
or possibly in fact to remove it to please the competitors.
I'm sorry Boris, but I really really disagree with the principle of having
to reboot to fix a performance issue which does not require such a reboot
by design and which is a pure software issue. Worse, the software decided
on purpose to kill the performance to keep you safe against risks that
some people will simply consider totally irrelevant to their use cases.
The other alternative is to continue to run 100% of the time with pti=off,
making it impossible to protect anything on the system. I regret, but quite
frankly if I have the opportunity to protect all processes but one or none
at all, I still consider the first option better overall as it significantly
reduces the surface of attack.
Thanks,
Willy
> On Jan 9, 2018, at 2:06 PM, Willy Tarreau <[email protected]> wrote:
>
>> On Tue, Jan 09, 2018 at 10:46:02PM +0100, Borislav Petkov wrote:
>>> On Tue, Jan 09, 2018 at 10:32:27PM +0100, Willy Tarreau wrote:
>>> Requiring a reboot just to fix a performance problem you've discovered
>>> the hard way is not the most friendly way to help users I'm afraid.
>>
>> That's a very strange argument: if you know you'd need max perf, you
>> boot with pti=allow_optout.
>>
>> Color me confused.
>
> That's very simple : you first know you need more perf when you see the
> name of your boss on your phone asking what's happening with the site
> suddenly crawling at the worst possible moment, when everyone is there
> to see it dead. Performance is something that's tuned at runtime, always,
> not via random reboots. When you have 10 servers running at 100% CPU,
> the last thing you're thinking about is to remove one of them so that
> the 9 remaining ones are at 110% while you reboot :-/
Here's another idea: make it a module
To enable it, you do modprobe pti_control allow_privileged_prctl=1.
On Tue, Jan 09, 2018 at 03:53:54PM -0800, Andy Lutomirski wrote:
> Here's another idea: make it a module
> To enable it, you do modprobe pti_control allow_privileged_prctl=1.
This could be an idea. I know that some people insist on disabling
modules because they find this more secure so they won't be able to
use this. But after all they have to stand by their choice : either
they want maximum security or they want maximum performance.
Willy
* Andy Lutomirski <[email protected]> wrote:
> On Tue, Jan 9, 2018 at 6:54 AM, Willy Tarreau <[email protected]> wrote:
> > On Tue, Jan 09, 2018 at 03:51:57PM +0100, Borislav Petkov wrote:
> >> On Tue, Jan 09, 2018 at 03:36:53PM +0100, Willy Tarreau wrote:
> >> > I see and am not particularly against this, but what use case do you
> >> > have in mind precisely ? I doubt it's just saving a few tens of bytes,
> >> > so probably you're more concerned about the potential risks this opens ?
> >> > But given we only allow this for CAP_SYS_RAWIO and these ones already
> >> > have access to /dev/mem and many other things, don't you think there
> >> > are much easier ways to dump kernel memory in this case than trying to
> >> > inject some meltdown code into the victim process ? Or maybe you have
> >> > other cases in mind that I'm not seeing.
> >>
> >> I'd like this to be config-controllable so that distros can make the
> >> decision whether/if they want to support the whole per-mm thing.
> >
> > OK.
> >
> >> Also, if CAP_SYS_RAWIO is going to protect, please make the
> >> ARCH_GET_NOPTI variant check it too.
> >
> > Interestingly I removed the check consecutive to the discussions. But
> > I think I'll simply remove the whole ARCH_GET_NOPTI as it has no real
> > value beyond initial development.
> >
>
> I've thought about this a bit more. Here are my thoughts:
>
> 1. I don't like it being per-mm. I think it should be a per-thread
> control so that a program can have a thread with PTI that runs
> less-trusted JavaScript and other network threads with PTI off.
> Obviously we lose NX protection mm-wide if any threads have PTI off.
> I think the way to implement this is:
Btw., the "NX protection", the NX bit set in the PTI kernel pagetables for the
user range really just matters for non-SMEP hardware, right? On SMEP a CPU in
kernel privilege mode cannot execute user pages, i.e. the fact that it's user
pages is already NX, guaranteed by the CPU.
And note how there's a happy circumstance for users, regarding SMEP and PTI NX:
- All Intel desktop/server CPUs currently sold and those built in the last ~3
years have SMEP enabled already, so are not affected.
- AMD CPUs don't have PTI enabled, so they already don't have NX for their user
pages - no change in behavior.
I.e.: non-issue and not a real constraint on the flexibility of this ABI, AFAICS -
it's "only" an implementational matter.
Thanks,
Ingo
* Willy Tarreau <[email protected]> wrote:
> When a syscall returns to userspace with pti_disable set, it means the
> current mm is configured to disable page table isolation (PTI). In this
> case, returns from kernel to user will not switch the CR3, leaving it
> to the kernel one which already maps both user and kernel pages. This
> avoids a TLB flush, and saves another one on next entry.
>
> Thanks to these changes, haproxy running under KVM went back from
> 12700 conn/s (without PCID) or 19700 (with PCID) to 23100 once loaded
> after calling prctl(), indicating that PTI has no measurable impact on
> this workload.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> v2:
> - use pti_disable instead of task flag
> ---
> arch/x86/entry/calling.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 2c0d3b5..5361a10 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -229,6 +229,11 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> + cmpb $0, PER_CPU_VAR(pti_disable)
> + jne .Lend_\@
Could you please do this small change for future iterations:
s/per-cpu
/per-CPU
... to make the spelling more consistent with the rest of the code base?
Thanks,
Ingo
* Willy Tarreau <[email protected]> wrote:
> + /* if we're already on the kernel PGD, we don't switch */
> + * If we're already on the kernel PGD, we don't switch,
> + * If we saved a kernel context on entry, we didn't switch the CR3,
It's hard enough to read assembly code, please use consistent capitalization:
s/if
/If
Thanks,
Ingo
On Wed, Jan 10, 2018 at 08:16:24AM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > + /* if we're already on the kernel PGD, we don't switch */
> > + * If we're already on the kernel PGD, we don't switch,
> > + * If we saved a kernel context on entry, we didn't switch the CR3,
>
> It's hard enough to read assembly code, please use consistent capitalization:
>
> s/if
> /If
Will do, thanks for the review ;-)
Willy
* Willy Tarreau <[email protected]> wrote:
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> + this_cpu_write(pti_disable,
> + next_p->mm && next_p->mm->context.pti_disable);
> +#endif
Another pet peeve, please write:
> + this_cpu_write(pti_disable, next_p->mm && next_p->mm->context.pti_disable);
or consider introducing an 'mm_next' local variable, set to next_p->mm, and use
that to shorten the sequence.
More importantly, any strong reasons why the flag is logic-inverted? I.e. why not
::pti_enabled?
Thanks,
Ingo
On Wed, Jan 10, 2018 at 08:15:10AM +0100, Ingo Molnar wrote:
> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> > + cmpb $0, PER_CPU_VAR(pti_disable)
> > + jne .Lend_\@
>
> Could you please do this small change for future iterations:
>
> s/per-cpu
> /per-CPU
>
> ... to make the spelling more consistent with the rest of the code base?
Now done as well, thank you. Do not hesitate to send future cosmetic
changes directly to me so that we keep a max of participants mostly
focused on the design choices. (and don't worry, I'll pick them, I
really appreciate these as well).
Thanks,
Willy
* Borislav Petkov <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 01:26:57PM -0800, Andy Lutomirski wrote:
> > 2.Turning off PTI is, in general, a terrible idea. It totally breaks
> > any semblance of a security model on a Meltdown-affected CPU. So I
> > think we should require CAP_SYS_RAWIO *and* that the system is booted
> > with pti=allow_optout or something like that.
>
> Uhh, I like that.
>
> Maybe also taint the kernel ...
Tainting the kernel is really a bad idea: a CAP_SYS_RAWIO user _already_ has the
privilege to directly write to the hardware. The binary makes use of that
privilege to allow execution of ring 3 code that has the theoretical ability to
_read_ kernel memory.
Unless we plan on tainting all uses of /dev/mem as well this is totally over the
top.
We could taint the kernel and warn prominently in the syslog when PTI is disabled
globally on the boot line though, if running on affected CPUs.
Something like:
"x86/intel: Page Table Isolation (PTI) is disabled globally. This allows unprivileged, untrusted code to exploit the Meltdown CPU bug to read kernel data."
Thanks,
Ingo
On Wed, Jan 10, 2018 at 08:19:51AM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> > + this_cpu_write(pti_disable,
> > + next_p->mm && next_p->mm->context.pti_disable);
> > +#endif
>
> Another pet peeve, please write:
>
> > + this_cpu_write(pti_disable, next_p->mm && next_p->mm->context.pti_disable);
>
> or consider introducing an 'mm_next' local variable, set to next_p->mm, and use
> that to shorten the sequence.
OK.
> More importantly, any strong reasons why the flag is logic-inverted? I.e. why not
> ::pti_enabled?
For me it's a matter of default case. Having a "pti_enabled" flag makes
one think the default is disabled and an action is required to turn it on.
With "pti_disabled", it becomes clearer that the default is enabled and an
action is required to turn it off. While it causes a double inversion for
the user due to the temporary choice of prctl name (we could have
ARCH_SET_PTI for example), I think it results on more readable code in
the sensitive parts like the asm one where these tests could possibly
end up inside #ifdefs. If we had "pit_enabled", something like this could
be confusing because it's not obvious whether this pti_enabled *enforces*
PTI or if its absence disables it :
#ifdef CONFIG_ALLOW_DISABLE_PTI
cmpb $0, PER_CPU_VAR(pti_enabled)
jz .Lend\@
#endif
But this is open to discussion of course.
Willy
* Borislav Petkov <[email protected]> wrote:
> Oh, and you've built the kernel with the option to be able to disable
> PTI so it's not like you haven't seen it already.
In general in many corporate environments requiring kernel reboots or kernel
rebuilds limits the real-world usability of any kernel feature we offer down to
"non-existent". Saying "build your own kernel or reboot" is excluding a large
subset of our real-world users.
Build and boot options are fine for developers and testing. Otherwise _everything_
not readily accessible when your distro kernel has booted up is essentially behind
a usability (and corporate policy) wall so steep that it's essentially
non-existent to many users.
So either we make this properly sysctl (and/or prctl) controllable, or just don't
do it at all.
Thanks,
Ingo
On Wed, Jan 10, 2018 at 08:31:28AM +0100, Ingo Molnar wrote:
>
> * Borislav Petkov <[email protected]> wrote:
>
> > Oh, and you've built the kernel with the option to be able to disable
> > PTI so it's not like you haven't seen it already.
>
> In general in many corporate environments requiring kernel reboots or kernel
> rebuilds limits the real-world usability of any kernel feature we offer down to
> "non-existent". Saying "build your own kernel or reboot" is excluding a large
> subset of our real-world users.
>
> Build and boot options are fine for developers and testing. Otherwise _everything_
> not readily accessible when your distro kernel has booted up is essentially behind
> a usability (and corporate policy) wall so steep that it's essentially
> non-existent to many users.
>
> So either we make this properly sysctl (and/or prctl) controllable, or just don't
> do it at all.
After having slept over it, I really prefer the sysctl+prctl approach.
It's much more consistent with the rest of the tunables which act
similarly. We have mmap_min_addr, mmap_rnd_bits, randomize_va_space, etc
All of them are here to trade some protections for something else (mostly
compatibility).
What I'd like to have would be a sysctl with 3 values :
- 0 : default disabled : arch_prctl() fails, this is the default
- 1 : forced enabled : arch_prctl() succeeds for CAP_SYS_RAWIO
- -1 : permanently disabled : fails and cannot be switched back to enabled.
This way the admin always has the last choice, it's possible to globally
disable it without having to fear that someone would enable it again if
desired, and it's possible to try if it helps without rebooting in
emergency situations.
Willy
* Willy Tarreau <[email protected]> wrote:
> On Wed, Jan 10, 2018 at 08:31:28AM +0100, Ingo Molnar wrote:
> >
> > * Borislav Petkov <[email protected]> wrote:
> >
> > > Oh, and you've built the kernel with the option to be able to disable
> > > PTI so it's not like you haven't seen it already.
> >
> > In general in many corporate environments requiring kernel reboots or kernel
> > rebuilds limits the real-world usability of any kernel feature we offer down to
> > "non-existent". Saying "build your own kernel or reboot" is excluding a large
> > subset of our real-world users.
> >
> > Build and boot options are fine for developers and testing. Otherwise _everything_
> > not readily accessible when your distro kernel has booted up is essentially behind
> > a usability (and corporate policy) wall so steep that it's essentially
> > non-existent to many users.
> >
> > So either we make this properly sysctl (and/or prctl) controllable, or just don't
> > do it at all.
>
> After having slept over it, I really prefer the sysctl+prctl approach.
> It's much more consistent with the rest of the tunables which act
> similarly. We have mmap_min_addr, mmap_rnd_bits, randomize_va_space, etc
> All of them are here to trade some protections for something else (mostly
> compatibility).
>
> What I'd like to have would be a sysctl with 3 values :
> - 0 : default disabled : arch_prctl() fails, this is the default
> - 1 : forced enabled : arch_prctl() succeeds for CAP_SYS_RAWIO
> - -1 : permanently disabled : fails and cannot be switched back to enabled.
Btw., I wouldn't call the value of 1 'forced enabled', it's simply enabled.
BTW., we might eventually also want to introduce a 'super flag' for all the
permanent disabling features, so that sysadmins/distros who want to default to
restrictive policies can set that and don't have to be aware of new tunables, and
this also protects against renames, etc.
Thanks,
Ingo
* Willy Tarreau <[email protected]> wrote:
> [...] If we had "pit_enabled", something like this could be confusing because
> it's not obvious whether this pti_enabled *enforces* PTI or if its absence
> disables it :
>
> cmpb $0, PER_CPU_VAR(pti_enabled)
> jz .Lend\@
The natural sequence would be:
cmpb $1, PER_CPU_VAR(pti_enabled)
jne .Lend\@
which is not confusing to me at all.
Thanks,
Ingo
On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
> - use pti_disable instead of task flag
> ---
> arch/x86/entry/calling.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 2c0d3b5..5361a10 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -229,6 +229,11 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> + cmpb $0, PER_CPU_VAR(pti_disable)
> + jne .Lend_\@
> +
> mov %cr3, \scratch_reg
So could you switch back to a task flag for this? That word is already
cache-hot on the exit path while your new variable is not.
On Wed, Jan 10, 2018 at 09:01:02AM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > [...] If we had "pit_enabled", something like this could be confusing because
> > it's not obvious whether this pti_enabled *enforces* PTI or if its absence
> > disables it :
> >
> > cmpb $0, PER_CPU_VAR(pti_enabled)
> > jz .Lend\@
>
> The natural sequence would be:
>
> cmpb $1, PER_CPU_VAR(pti_enabled)
> jne .Lend\@
>
> which is not confusing to me at all.
In fact I think I know now why it still poses me a problem : this
pti_enabled flag alone is not sufficient to enable PTI, it's just part
of the condition, as another part comes from the X86_FEATURE_PTI flag.
However, pti_disabled is sufficient to disable PTI so actually its
effect matches its name (note BTW that I called it "pti_disable" as a
verb indicating an action -- "I want to disable pti", and not as a past
form "pti is disabled").
Willy
* Willy Tarreau <[email protected]> wrote:
> On Wed, Jan 10, 2018 at 09:01:02AM +0100, Ingo Molnar wrote:
> >
> > * Willy Tarreau <[email protected]> wrote:
> >
> > > [...] If we had "pit_enabled", something like this could be confusing because
> > > it's not obvious whether this pti_enabled *enforces* PTI or if its absence
> > > disables it :
> > >
> > > cmpb $0, PER_CPU_VAR(pti_enabled)
> > > jz .Lend\@
> >
> > The natural sequence would be:
> >
> > cmpb $1, PER_CPU_VAR(pti_enabled)
> > jne .Lend\@
> >
> > which is not confusing to me at all.
>
> In fact I think I know now why it still poses me a problem : this
> pti_enabled flag alone is not sufficient to enable PTI, it's just part
> of the condition, as another part comes from the X86_FEATURE_PTI flag.
> However, pti_disabled is sufficient to disable PTI so actually its
> effect matches its name (note BTW that I called it "pti_disable" as a
> verb indicating an action -- "I want to disable pti", and not as a past
> form "pti is disabled").
If it's a verb then please name it in the proper order, i.e. 'disable_pti'.
I'm fine with that approach.
Thanks,
Ingo
On Wed, Jan 10, 2018 at 09:59:01AM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > On Wed, Jan 10, 2018 at 09:01:02AM +0100, Ingo Molnar wrote:
> > >
> > > * Willy Tarreau <[email protected]> wrote:
> > >
> > > > [...] If we had "pit_enabled", something like this could be confusing because
> > > > it's not obvious whether this pti_enabled *enforces* PTI or if its absence
> > > > disables it :
> > > >
> > > > cmpb $0, PER_CPU_VAR(pti_enabled)
> > > > jz .Lend\@
> > >
> > > The natural sequence would be:
> > >
> > > cmpb $1, PER_CPU_VAR(pti_enabled)
> > > jne .Lend\@
> > >
> > > which is not confusing to me at all.
> >
> > In fact I think I know now why it still poses me a problem : this
> > pti_enabled flag alone is not sufficient to enable PTI, it's just part
> > of the condition, as another part comes from the X86_FEATURE_PTI flag.
> > However, pti_disabled is sufficient to disable PTI so actually its
> > effect matches its name (note BTW that I called it "pti_disable" as a
> > verb indicating an action -- "I want to disable pti", and not as a past
> > form "pti is disabled").
>
> If it's a verb then please name it in the proper order, i.e. 'disable_pti'.
>
> I'm fine with that approach.
OK thanks.
Willy
On Wed, Jan 10, 2018 at 09:22:07AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
> > - use pti_disable instead of task flag
> > ---
> > arch/x86/entry/calling.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index 2c0d3b5..5361a10 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -229,6 +229,11 @@
> >
> > .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> > +
> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> > + cmpb $0, PER_CPU_VAR(pti_disable)
> > + jne .Lend_\@
> > +
> > mov %cr3, \scratch_reg
>
> So could you switch back to a task flag for this? That word is already
> cache-hot on the exit path while your new variable is not.
That's a good point. There's already been some demands for a per-thread
setting.
What I can propose then is to partially revert the changes to have this :
- arch_prctl() adjusts the task flag and not a per-mm variable anymore
(Linus, are you OK for this ?)
- arch_prctl() only accepts to perform the action if mm->mm_users == 1
so that we don't change the setting after having created threads ;
this way the task flag is replicated to all future threads ;
- later we may decide to permit re-enabling PTI per thread if it was
disabled.
If we agree on this, I'd like to propose to have two flags :
- TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
- TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
execve() would then simply do :
TIF_DISABLE_PTI_NOW = TIF_DISABLE_PTI_NEXT;
TIF_DISABLE_PTI_NEXT = 0;
The former would be used by applications using their own configuration.
The latter would be used by wrappers. This way we seem to cover the various
use cases. And we make this depend on a sysctl that allows the admin to
globally and permanently disable the feature and which is disabled by
default.
Any objection ?
Regards,
Willy
On Tue, Jan 09, 2018 at 11:40:09PM +0100, Willy Tarreau wrote:
> Boris, please don't try to make me look like a fool when I'm trying to
> explain a common process.
I haven't even intended to do that, sorry, maybe you're misunderstanding
me.
All I'm trying to say is booting with pti=allow_optout should be part of
the proper *setup* of the box. In the sense, the thing is kinda expected
to go to 100% and if performance is still not enough, to allow customers
to disable PTI per process for the price of diminished security.
But...
> No, your distro did. Please keep in mind that you were the one asking me
> to have this option so that distros can enable it to please their users,
> or possibly in fact to remove it to please the competitors.
... I was asking for this so that I can completely keep the code out of the
built kernel but from reading this thread, it sounds to me like we'd
need the full spectrum of options:
1. prohibit disabling of PTI
2. per-process PTI disabling
3. disable PTI on the system
and then show people how to do that and do that at runtime. Apparently,
it is important to people to be able to control that.
And also explain what each option means so that they can evaluate
themselves what they'd prefer.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Wed, Jan 10, 2018 at 08:25:08AM +0100, Ingo Molnar wrote:
> We could taint the kernel and warn prominently in the syslog when PTI is disabled
> globally on the boot line though, if running on affected CPUs.
>
> Something like:
>
> "x86/intel: Page Table Isolation (PTI) is disabled globally. This allows unprivileged, untrusted code to exploit the Meltdown CPU bug to read kernel data."
>
I think we should warn in the per-mm disabling case too. Not the same
text but a similar blurb about the trusted process becoming a high-value
target.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Hi Boris!
On Wed, Jan 10, 2018 at 03:42:39PM +0100, Borislav Petkov wrote:
> On Tue, Jan 09, 2018 at 11:40:09PM +0100, Willy Tarreau wrote:
> > Boris, please don't try to make me look like a fool when I'm trying to
> > explain a common process.
>
> I haven't even intended to do that, sorry, maybe you're misunderstanding
> me.
OK no worries :-)
> > No, your distro did. Please keep in mind that you were the one asking me
> > to have this option so that distros can enable it to please their users,
> > or possibly in fact to remove it to please the competitors.
>
> ... I was asking for this so that I can completely keep the code out of the
> built kernel but from reading this thread, it sounds to me like we'd
> need the full spectrum of options:
>
> 1. prohibit disabling of PTI
> 2. per-process PTI disabling
> 3. disable PTI on the system
>
> and then show people how to do that and do that at runtime. Apparently,
> it is important to people to be able to control that.
It suddenly comes to my mind that this sysctl I suggested this morning
could in fact serve both purposes and be easier to understand, with 4
values. But in fact I hardly find a valid reason for disabling it for
the whole system via a sysctl (I could be proven wrong though). For
example, when building a load generator for my lab I can easily build
my kernels without PTI and/or have pti=off on the cmdline, that's fine.
For production, disabling PTI system-wide when I'm supposed to know what
performance critical processes should be exempted doesn't appeal me very
much, especially if it can happen by accident. So in the end I think that
"pti=off" on the cmdline should be the only way to disable it system-wide
as it doesn't represent a reasonable production case. Disabling it per
process should be allowed via a sysctl, which would also be locked disabled
for safety purposes.
I'll go work on something like this to illustrate better, as I'm well
aware that I never explain well what I'm thinking about.
> And also explain what each option means so that they can evaluate
> themselves what they'd prefer.
Clear explanations are an important part of the solution, as we want
it to be safe by default and only let knowledgeable people turn it off
on purpose.
Thanks,
Willy
On Wed, Jan 10, 2018 at 03:45:06PM +0100, Borislav Petkov wrote:
> On Wed, Jan 10, 2018 at 08:25:08AM +0100, Ingo Molnar wrote:
> > We could taint the kernel and warn prominently in the syslog when PTI is disabled
> > globally on the boot line though, if running on affected CPUs.
> >
> > Something like:
> >
> > "x86/intel: Page Table Isolation (PTI) is disabled globally. This allows unprivileged, untrusted code to exploit the Meltdown CPU bug to read kernel data."
> >
>
> I think we should warn in the per-mm disabling case too. Not the same
> text but a similar blurb about the trusted process becoming a high-value
> target.
Well, we don't warn when /dev/mem is opened read-only, even not when
it's opened R/W, and it exposes the contents much better. Tainting is
first a support help so that developers don't waste time debugging
something that might have been altered. In this case nothing got
altered. At best(worst?) things might have been disclosed. That said
I'm all for at least tainting when running with pti=off at least to
educate users.
Willy
* Borislav Petkov <[email protected]> wrote:
> On Wed, Jan 10, 2018 at 08:25:08AM +0100, Ingo Molnar wrote:
> > We could taint the kernel and warn prominently in the syslog when PTI is disabled
> > globally on the boot line though, if running on affected CPUs.
> >
> > Something like:
> >
> > "x86/intel: Page Table Isolation (PTI) is disabled globally. This allows unprivileged, untrusted code to exploit the Meltdown CPU bug to read kernel data."
> >
>
> I think we should warn in the per-mm disabling case too. Not the same
> text but a similar blurb about the trusted process becoming a high-value
> target.
Ok - that's fine by me too, as long as it's a one time warning only.
Thanks,
Ingo
On Wed, Jan 10, 2018 at 04:39:46PM +0100, Willy Tarreau wrote:
> For production, disabling PTI system-wide when I'm supposed to know what
> performance critical processes should be exempted doesn't appeal me very
> much, especially if it can happen by accident. So in the end I think that
> "pti=off" on the cmdline should be the only way to disable it system-wide
> as it doesn't represent a reasonable production case. Disabling it per
> process should be allowed via a sysctl, which would also be locked disabled
> for safety purposes.
It still might make sense to be able to disable it system-wide without
having to reboot. Imagine a bunch of processes showing performance
regressions and you want to disable PTI completely to rule it out
causing that regression. Then you toggle the master switch.
With the appropriate loud warnings, of course.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Wed, Jan 10, 2018 at 05:09:22PM +0100, Borislav Petkov wrote:
> On Wed, Jan 10, 2018 at 04:39:46PM +0100, Willy Tarreau wrote:
> > For production, disabling PTI system-wide when I'm supposed to know what
> > performance critical processes should be exempted doesn't appeal me very
> > much, especially if it can happen by accident. So in the end I think that
> > "pti=off" on the cmdline should be the only way to disable it system-wide
> > as it doesn't represent a reasonable production case. Disabling it per
> > process should be allowed via a sysctl, which would also be locked disabled
> > for safety purposes.
>
> It still might make sense to be able to disable it system-wide without
> having to reboot. Imagine a bunch of processes showing performance
> regressions and you want to disable PTI completely to rule it out
> causing that regression. Then you toggle the master switch.
Well, indeed. It will never be 100% equivalent to pti=off however since
the alternative code will remain in place, but why not. Or maybe we have
a way to change the alternatives at run time by changing a sysctl, but
that doesn't please me a lot. I'll check this after the rest however,
as I'm not sure about the code implications in the entry code (i.e. we'd
rather not check a system wide variable, or we might need another per-CPU
one). We could also just mention that the setting only applies to future
processes, which will be much easier and probably sufficient.
Willy
On Wed, Jan 10, 2018 at 05:19:29PM +0100, Willy Tarreau wrote:
> Well, indeed. It will never be 100% equivalent to pti=off however since
> the alternative code will remain in place, but why not. Or maybe we have
> a way to change the alternatives at run time by changing a sysctl,
Not yet. We need to think about it.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Wed, Jan 10, 2018 at 1:11 AM, Willy Tarreau <[email protected]> wrote:
> On Wed, Jan 10, 2018 at 09:22:07AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
>> > - use pti_disable instead of task flag
>> > ---
>> > arch/x86/entry/calling.h | 5 +++++
>> > 1 file changed, 5 insertions(+)
>> >
>> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> > index 2c0d3b5..5361a10 100644
>> > --- a/arch/x86/entry/calling.h
>> > +++ b/arch/x86/entry/calling.h
>> > @@ -229,6 +229,11 @@
>> >
>> > .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>> > +
>> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
>> > + cmpb $0, PER_CPU_VAR(pti_disable)
>> > + jne .Lend_\@
>> > +
>> > mov %cr3, \scratch_reg
>>
>> So could you switch back to a task flag for this? That word is already
>> cache-hot on the exit path while your new variable is not.
>
> That's a good point. There's already been some demands for a per-thread
> setting.
>
> What I can propose then is to partially revert the changes to have this :
>
> - arch_prctl() adjusts the task flag and not a per-mm variable anymore
> (Linus, are you OK for this ?)
>
> - arch_prctl() only accepts to perform the action if mm->mm_users == 1
> so that we don't change the setting after having created threads ;
> this way the task flag is replicated to all future threads ;
>
> - later we may decide to permit re-enabling PTI per thread if it was
> disabled.
>
> If we agree on this, I'd like to propose to have two flags :
>
> - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
> - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
I really dislike state that isn't cleared on execve(). I'm assuming
that this is so you can run time pwn_me_without_pti whatever? Surely
LD_PRELOAD can do this, too?
Hi Andy,
On Wed, Jan 10, 2018 at 11:21:15AM -0800, Andy Lutomirski wrote:
> > If we agree on this, I'd like to propose to have two flags :
> >
> > - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
> > - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
>
> I really dislike state that isn't cleared on execve(). I'm assuming
> that this is so you can run time pwn_me_without_pti whatever?
Yes exactly. I've just sent a 3rd series with an example code for this.
In fact it's not that the state is not cleared by execve(), it's that
it's set for the next execve() which then resets it.
> Surely LD_PRELOAD can do this, too?
That was one of my other proposals. I really don't know if LD_PRELOAD
fits anyone's usage for such things (static/setuid binaries, complication
to pass variables maybe).
Please take a look and tell me if you still dislike it or not.
thanks!
Willy
On Wed, Jan 10, 2018 at 11:39 AM, Willy Tarreau <[email protected]> wrote:
> Hi Andy,
>
> On Wed, Jan 10, 2018 at 11:21:15AM -0800, Andy Lutomirski wrote:
>> > If we agree on this, I'd like to propose to have two flags :
>> >
>> > - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
>> > - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
>>
>> I really dislike state that isn't cleared on execve(). I'm assuming
>> that this is so you can run time pwn_me_without_pti whatever?
>
> Yes exactly. I've just sent a 3rd series with an example code for this.
> In fact it's not that the state is not cleared by execve(), it's that
> it's set for the next execve() which then resets it.
>
>> Surely LD_PRELOAD can do this, too?
>
> That was one of my other proposals. I really don't know if LD_PRELOAD
> fits anyone's usage for such things (static/setuid binaries, complication
> to pass variables maybe).
>
> Please take a look and tell me if you still dislike it or not.
>
Adding flags that persist across execve() of setuid stuff is IMO even
worse. It just makes reasoning about the system's security model
really nasty.
On Wed, Jan 10, 2018 at 11:21 AM, Andy Lutomirski <[email protected]> wrote:
>
> I really dislike state that isn't cleared on execve(). I'm assuming
> that this is so you can run time pwn_me_without_pti whatever? Surely
> LD_PRELOAD can do this, too?
Andy, what the hell is wrong with you?
You are actively trying to screw this whole interface up, aren't you?
LD_PRELOAD cannot work for a wrapper, for the simple reason that it
runs in the same context as the process. So if you want to say "I want
to run this process without PTI", but you don't want to run the
process with elevated privileges, LD_PRELOAD doesn't work.
It's like saying "why do we need 'sudo' - let's juat make a LD_PRELOAD
that sets uid to zero instead".
The "let's do it per thread" made no sense either, since that's
fundamentally not how page tables work, and it's complete broken shit.
And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
no-PTI interface is that it
(a) inherits on fork/exec, so that you don't have to worry about how
something is implemented (think "I want to run this kernel build
without the PTI overhead", but also "I want to run this system daemon
without PTI").
(b) actual domain changes clear it (ie suid, whatever).
that make it useful for random uses of "I trust service XYZ".
So I'm NAK'ing this whole series on the grounds that it has several
completely insane semantics and really need to be clarified, and where
actual usage needs to be thought about a lot more.
Linus
On Wed, Jan 10, 2018 at 11:50 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jan 10, 2018 at 11:21 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> I really dislike state that isn't cleared on execve(). I'm assuming
>> that this is so you can run time pwn_me_without_pti whatever? Surely
>> LD_PRELOAD can do this, too?
>
> Andy, what the hell is wrong with you?
>
> You are actively trying to screw this whole interface up, aren't you?
>
> LD_PRELOAD cannot work for a wrapper, for the simple reason that it
> runs in the same context as the process. So if you want to say "I want
> to run this process without PTI", but you don't want to run the
> process with elevated privileges, LD_PRELOAD doesn't work.
Oh, right, duh. Brain was off.
> The "let's do it per thread" made no sense either, since that's
> fundamentally not how page tables work, and it's complete broken shit.
I still disagree with you here. The whole concept of per-thread or
per-mm or per-whatever PTI disablement is if the admin for some reason
trusts some piece of code not to try to exploit Meltdown. But just
imagine a program like a web browser. The browser will do some
performance critical stuff (networking) and some
absolutely-no-fucking-way-would-I-turn-off-PTI stuff (running
scripts). So per-thread seems totally sensible to me. No one sane
would ever do this for a web browser, but I can easily imagine it for
something like a web *server* or even a database server.
Just logically, too, per-thread is the obvious semantics. Whether we
rewrite CR3 when we go to usermode is a thing affecting that thread.
The only reason the mm has anything to do with it is the NX trick.
On 01/09/2018 04:56 AM, Willy Tarreau wrote:
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -214,6 +214,11 @@
> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> mov %cr3, \scratch_reg
> +
> + /* if we're already on the kernel PGD, we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Lend_\@
> +
> ADJUST_KERNEL_CR3 \scratch_reg
Willy, this is not specific to your patch, but it is one of the first
*changes* to this code since Spectre, thus I'm bringing it up in this
thread.
The code already has some, but new conditional branches give me the
willies. None of them take the form that can be used to exploit
Spectre/Variant1 that I can see, but I do think we need to start talking
about why this is not vulnerable and probably documenting it in the
entry code.
Spectre/V1 (conditional branches)
* Data reads in entry/exit when you have the user CR3 must be in
cpu_entry_area and readable to a Meltdown exploit, anyway. That
implies that there is no data to be leaked in the address space
at all at this point.
* Conditional branches in the entry/exit code with a kernel CR3 value
are the only concern. It is safe, however, if the data being checked
is not user-controllable.
Spectre/V2 (indirect branches)
* Indirect Branches in the entry code are forbidden because of
Spectre/V2. Retpolines or other mitigations (IBRS) must be used
instead.
Anybody disagree?
In this case, the data being checked (CR3) is not user-controllable and
there are no indirect branches. So this code is OK.
On Wed, Jan 10, 2018 at 11:50:46AM -0800, Linus Torvalds wrote:
> And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
> no-PTI interface is that it
>
> (a) inherits on fork/exec, so that you don't have to worry about how
> something is implemented (think "I want to run this kernel build
> without the PTI overhead", but also "I want to run this system daemon
> without PTI").
>
> (b) actual domain changes clear it (ie suid, whatever).
>
> that make it useful for random uses of "I trust service XYZ".
OK. Do you want to see something *only* based on a wrapper (i.e. works
only after execve) or can we let the application apply the change to
itself ? I would also like to let applications re-enable the protection
for processes they're going to exec and not necessarily trust.
> So I'm NAK'ing this whole series on the grounds that it has several
> completely insane semantics and really need to be clarified, and where
> actual usage needs to be thought about a lot more.
In fact we were trying to limit the risk of propagating the protection
removal too far, and leave it only on the sensitive process which really
requires it. But your example of "running the kernel build without PTI"
makes sense from a user's perspective, and it completely contradicts
our initial assumptions.
After all I don't think the NOW vs NEXT is so fundamentally broken. We
could think about having one option for the current process only (which
is cleared by execve) so that applications can apply it to themselves
only without having to wonder about clearing it, and another one which
is only for wrappers and which passes execve(). For now I considered
that we could stop at the first execve, but if I just remove the
clearing of the NEXT flag, it matches your requirement for the kernel
build. After this it's just a matter of naming and placing them on the
mm rather than thread.
Willy
On Wed, Jan 10, 2018 at 12:29:17PM -0800, Dave Hansen wrote:
> On 01/09/2018 04:56 AM, Willy Tarreau wrote:
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -214,6 +214,11 @@
> > .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> > mov %cr3, \scratch_reg
> > +
> > + /* if we're already on the kernel PGD, we don't switch */
> > + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> > + jz .Lend_\@
> > +
> > ADJUST_KERNEL_CR3 \scratch_reg
>
> Willy, this is not specific to your patch, but it is one of the first
> *changes* to this code since Spectre, thus I'm bringing it up in this
> thread.
>
> The code already has some, but new conditional branches give me the
> willies. None of them take the form that can be used to exploit
> Spectre/Variant1 that I can see, but I do think we need to start talking
> about why this is not vulnerable and probably documenting it in the
> entry code.
>
> Spectre/V1 (conditional branches)
> * Data reads in entry/exit when you have the user CR3 must be in
> cpu_entry_area and readable to a Meltdown exploit, anyway. That
> implies that there is no data to be leaked in the address space
> at all at this point.
> * Conditional branches in the entry/exit code with a kernel CR3 value
> are the only concern. It is safe, however, if the data being checked
> is not user-controllable.
>
> Spectre/V2 (indirect branches)
> * Indirect Branches in the entry code are forbidden because of
> Spectre/V2. Retpolines or other mitigations (IBRS) must be used
> instead.
>
> Anybody disagree?
>
> In this case, the data being checked (CR3) is not user-controllable and
> there are no indirect branches. So this code is OK.
Just let me know what comment you'd like me to add there. I suppose
something like "This code has code has been checked against Spectre/v1/v2
attacks ; it's safe as CR3 is not user-controllable and there's no
indirect branches" ?
Thanks,
Willy
On 01/10/2018 10:42 PM, Willy Tarreau wrote:
> On Wed, Jan 10, 2018 at 11:50:46AM -0800, Linus Torvalds wrote:
>> And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
>> no-PTI interface is that it
>>
>> (a) inherits on fork/exec, so that you don't have to worry about how
>> something is implemented (think "I want to run this kernel build
>> without the PTI overhead", but also "I want to run this system daemon
>> without PTI").
>>
>> (b) actual domain changes clear it (ie suid, whatever).
>>
>> that make it useful for random uses of "I trust service XYZ".
> OK. Do you want to see something *only* based on a wrapper (i.e. works
> only after execve) or can we let the application apply the change to
> itself ? I would also like to let applications re-enable the protection
> for processes they're going to exec and not necessarily trust.
I don't think we need a "NOW" and "NEXT" mode, at least initially. The
"NEXT" semantics are going to be tricky and I think "NOW" is good enough
Whatever we do, we'll need this PTI-disable flag to be able cross
exeve() so that a wrapper a la nice(1) work. Initially, I think the
default should be that it survives fork(). There are just too many
things out there that "start up" by doing a shell script that calls a
python script, that calls a...
Without the wrapper support, we're _basically_ stuck using this only in
newly-compiled binaries. That's going to make it much less likely to
get used.
The inheritance also gives an app a way to re-enable protections for
children, just from a _second_ wrapper. That's nice because it means we
don't initially need a "NEXT" ABI.
So, I'd do this:
1. Do the arch_prctl() (but ask the ARM guys what they want too)
2. Enabled for an entire process (not thread)
3. Inherited across fork/exec
4. Cleared on setuid() and friends
5. I'm sure the security folks have/want a way to force it on forever
Next, if we decide that we have things that both don't want PTI's
protections and are forking things not covered by #4, we can add some
"child opt out" in the prctl(), plus maybe marking binaries somehow.
Please don't forget to add ways to tell if this feature is on/off in
/proc or whatever. I think we also need to be able to dump the actual
CR3 value that we entered the kernel with before we start doing too much
other funky stuff with the entry code.
Hi Dave,
On Thu, Jan 11, 2018 at 07:29:30AM -0800, Dave Hansen wrote:
> I don't think we need a "NOW" and "NEXT" mode, at least initially. The
> "NEXT" semantics are going to be tricky and I think "NOW" is good enough
In fact I thought the NEXT one would bring us a nice benefit which is that
we start the new process knowing the flag's value so we can decide whether
or not to apply _PAGE_NX on the pgd from the start, and never touch it
anymore.
> Whatever we do, we'll need this PTI-disable flag to be able cross
> exeve() so that a wrapper a la nice(1) work.
Absolutely!
> Initially, I think the
> default should be that it survives fork(). There are just too many
> things out there that "start up" by doing a shell script that calls a
> python script, that calls a...
Not only that, simply daemons, like most services are!
> Without the wrapper support, we're _basically_ stuck using this only in
> newly-compiled binaries. That's going to make it much less likely to
> get used.
I know, that's why I kept considering that option despite not really
needing it for my own use case.
> The inheritance also gives an app a way to re-enable protections for
> children, just from a _second_ wrapper. That's nice because it means we
> don't initially need a "NEXT" ABI.
>
> So, I'd do this:
> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> 2. Enabled for an entire process (not thread)
> 3. Inherited across fork/exec
> 4. Cleared on setuid() and friends
This one causes me a problem : some daemons already take care of dropping
privileges after the initial fork() for the sake of security. Haproxy
typically does this at boot :
- parse config
- chroot to /var/empty
- setuid(dedicated_uid)
- fork()
This ensures the process is properly isolated and hard enough to break out
of. So I'd really like this setuid() not to anihilate all we've done.
Probably that we want to drop it on suid binaries however, though I'm
having doubts about the benefits, because if the binary already allows
an intruder to inject its own meltdown code, you're quite screwed anyway.
> 5. I'm sure the security folks have/want a way to force it on forever
Sure! That's what I implemented using the sysctl.
> Next, if we decide that we have things that both don't want PTI's
> protections and are forking things not covered by #4, we can add some
> "child opt out" in the prctl(), plus maybe marking binaries somehow.
I was really thinking about using the "NOW" for this compard to the NEXT.
But I don't know what it could imply for the pgd not having the _PAGE_NX.
> Please don't forget to add ways to tell if this feature is on/off in
> /proc or whatever.
Very good idea, and it will be much more convenient than using the GET
prctl that I didn't like.
> I think we also need to be able to dump the actual
> CR3 value that we entered the kernel with before we start doing too much
> other funky stuff with the entry code.
When you say dump, you mean save it somewhere in a per_cpu variable ?
Willy
On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>> I think we also need to be able to dump the actual
>> CR3 value that we entered the kernel with before we start doing too much
>> other funky stuff with the entry code.
> When you say dump, you mean save it somewhere in a per_cpu variable ?
Yeah, I think a per-cpu variable is fine for now. But, that only gives
you a dump from a single entry to the kernel. Ideally, it would be nice
to have a stack of them so you could do things like debug
syscall->fault->oops. Was it the syscall entry or the fault entry that
lead to the oops?
But, the stack gets really fun because of NMIs.
I'm sure Andy Lutomirski has some ideas too.
On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
<[email protected]> wrote:
> On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>>> I think we also need to be able to dump the actual
>>> CR3 value that we entered the kernel with before we start doing too much
>>> other funky stuff with the entry code.
>> When you say dump, you mean save it somewhere in a per_cpu variable ?
>
> Yeah, I think a per-cpu variable is fine for now. But, that only gives
> you a dump from a single entry to the kernel. Ideally, it would be nice
> to have a stack of them so you could do things like debug
> syscall->fault->oops. Was it the syscall entry or the fault entry that
> lead to the oops?
>
> But, the stack gets really fun because of NMIs.
>
> I'm sure Andy Lutomirski has some ideas too.
I was thinking that maybe we should add a new field or two to pt_regs.
They could store CR2 and maybe CR3 as well. I'd also like to expose
the error code of exceptions in stack traces. We should get this
integrated right into the unwinder.
On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <[email protected]> wrote:
> Hi Dave,
>
> On Thu, Jan 11, 2018 at 07:29:30AM -0800, Dave Hansen wrote:
>> I don't think we need a "NOW" and "NEXT" mode, at least initially. The
>> "NEXT" semantics are going to be tricky and I think "NOW" is good enough
>
> In fact I thought the NEXT one would bring us a nice benefit which is that
> we start the new process knowing the flag's value so we can decide whether
> or not to apply _PAGE_NX on the pgd from the start, and never touch it
> anymore.
>
>> Whatever we do, we'll need this PTI-disable flag to be able cross
>> exeve() so that a wrapper a la nice(1) work.
>
> Absolutely!
>
>> Initially, I think the
>> default should be that it survives fork(). There are just too many
>> things out there that "start up" by doing a shell script that calls a
>> python script, that calls a...
>
> Not only that, simply daemons, like most services are!
>
>> Without the wrapper support, we're _basically_ stuck using this only in
>> newly-compiled binaries. That's going to make it much less likely to
>> get used.
>
> I know, that's why I kept considering that option despite not really
> needing it for my own use case.
>
>> The inheritance also gives an app a way to re-enable protections for
>> children, just from a _second_ wrapper. That's nice because it means we
>> don't initially need a "NEXT" ABI.
>>
>> So, I'd do this:
>> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
>> 2. Enabled for an entire process (not thread)
>> 3. Inherited across fork/exec
>> 4. Cleared on setuid() and friends
>
> This one causes me a problem : some daemons already take care of dropping
> privileges after the initial fork() for the sake of security. Haproxy
> typically does this at boot :
>
> - parse config
> - chroot to /var/empty
> - setuid(dedicated_uid)
> - fork()
>
> This ensures the process is properly isolated and hard enough to break out
> of. So I'd really like this setuid() not to anihilate all we've done.
> Probably that we want to drop it on suid binaries however, though I'm
> having doubts about the benefits, because if the binary already allows
> an intruder to inject its own meltdown code, you're quite screwed anyway.
>
>> 5. I'm sure the security folks have/want a way to force it on forever
>
> Sure! That's what I implemented using the sysctl.
>
All of these proposals have serious issues. For example, suppose I
have a setuid program called nopti that works like this:
$ nopti some_program
nopti verifies that some_program is trustworthy and runs it (as the
real uid of nopti's user) with PTI off. Now we have all the usual
problems: you can easily break out using ptrace(), for example. And
LD_PRELOAD gets this wrong. Et.
So I think that no-pti mode is a privilege as opposed to a mode per
se. If you can turn off PTI, then you have the ability to read all of
kernel memory So maybe we should treat it as such. Add a capability
CAP_DISABLE_PTI. If you have that capability (globally), then you can
use the arch_prctl() or regular prctl() or whatever to turn PTI on.
If you lose the cap, you lose no-pti mode as well. If an LSM wants to
block it, it can use existing mechanisms.
As for per-mm vs per-thread, let's make it only switchable in
single-threaded processes for now and inherited when threads are
created. We can change that if and when demand for the ability to
change it shows up.
(Another reason for per-thread instead of per-mm: as a per-mm thing,
you can't set it up for your descendents using vfork(); prctl();
exec(), and the latter is how your average language runtime that
spawns subprocesses would want to do it.
Hi Andy!
On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <[email protected]> wrote:
> >> So, I'd do this:
> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> >> 2. Enabled for an entire process (not thread)
> >> 3. Inherited across fork/exec
> >> 4. Cleared on setuid() and friends
> >
> > This one causes me a problem : some daemons already take care of dropping
> > privileges after the initial fork() for the sake of security. Haproxy
> > typically does this at boot :
> >
> > - parse config
> > - chroot to /var/empty
> > - setuid(dedicated_uid)
> > - fork()
> >
> > This ensures the process is properly isolated and hard enough to break out
> > of. So I'd really like this setuid() not to anihilate all we've done.
> > Probably that we want to drop it on suid binaries however, though I'm
> > having doubts about the benefits, because if the binary already allows
> > an intruder to inject its own meltdown code, you're quite screwed anyway.
> >
> >> 5. I'm sure the security folks have/want a way to force it on forever
> >
> > Sure! That's what I implemented using the sysctl.
> >
>
> All of these proposals have serious issues. For example, suppose I
> have a setuid program called nopti that works like this:
>
> $ nopti some_program
>
> nopti verifies that some_program is trustworthy and runs it (as the
> real uid of nopti's user) with PTI off. Now we have all the usual
> problems: you can easily break out using ptrace(), for example. And
> LD_PRELOAD gets this wrong. Et.
Fine, we can drop it on suid binaries as I mentionned.
> So I think that no-pti mode is a privilege as opposed to a mode per
> se. If you can turn off PTI, then you have the ability to read all of
> kernel memory So maybe we should treat it as such. Add a capability
> CAP_DISABLE_PTI. If you have that capability (globally), then you can
> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
> If you lose the cap, you lose no-pti mode as well.
I disagree on this, because the only alternative I have is to decide
to keep my process running as root, which is even worse, as root can
much more easily escape from a chroot jail for example, or access
/dev/mem and read all the memory as well. Also tell Linus that he'll
have to build his kernels as root ;-)
The arch_prctl() calls I proposed only allow to turn PTI off for
privileged users but any user can turn it back on. For me it's
important. A process might trust itself for its own use, but such
processes will rarely trust external processes in case they need to
perform an occasional fork+exec. Imagine for example a static web
server requiring to achieve the highest possible performance and
having to occasionally call logrotate to rotate+compress the logs.
It's better if the process knows how to turn PTI back on before
calling this.
> If an LSM wants to block it, it can use existing mechanisms.
>
> As for per-mm vs per-thread, let's make it only switchable in
> single-threaded processes for now and inherited when threads are
> created.
That's exactly what it does for now, but Linus doesn't like it at all.
So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
point regarding the pgd and _PAGE_NX setting. point Now at least we know
the change is minimal if we have a good reason for doing differently
later.
> We can change that if and when demand for the ability to
> change it shows up.
>
> (Another reason for per-thread instead of per-mm: as a per-mm thing,
> you can't set it up for your descendents using vfork(); prctl();
> exec(), and the latter is how your average language runtime that
> spawns subprocesses would want to do it.
That's indeed the benefit it provides for now since I actually had
to *add* code to execve() to disable it then.
Cheers,
Willy
On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <[email protected]> wrote:
> Hi Andy!
>
> On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <[email protected]> wrote:
>> >> So, I'd do this:
>> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
>> >> 2. Enabled for an entire process (not thread)
>> >> 3. Inherited across fork/exec
>> >> 4. Cleared on setuid() and friends
>> >
>> > This one causes me a problem : some daemons already take care of dropping
>> > privileges after the initial fork() for the sake of security. Haproxy
>> > typically does this at boot :
>> >
>> > - parse config
>> > - chroot to /var/empty
>> > - setuid(dedicated_uid)
>> > - fork()
>> >
>> > This ensures the process is properly isolated and hard enough to break out
>> > of. So I'd really like this setuid() not to anihilate all we've done.
>> > Probably that we want to drop it on suid binaries however, though I'm
>> > having doubts about the benefits, because if the binary already allows
>> > an intruder to inject its own meltdown code, you're quite screwed anyway.
>> >
>> >> 5. I'm sure the security folks have/want a way to force it on forever
>> >
>> > Sure! That's what I implemented using the sysctl.
>> >
>>
>> All of these proposals have serious issues. For example, suppose I
>> have a setuid program called nopti that works like this:
>>
>> $ nopti some_program
>>
>> nopti verifies that some_program is trustworthy and runs it (as the
>> real uid of nopti's user) with PTI off. Now we have all the usual
>> problems: you can easily break out using ptrace(), for example. And
>> LD_PRELOAD gets this wrong. Et.
>
> Fine, we can drop it on suid binaries as I mentionned.
>
>> So I think that no-pti mode is a privilege as opposed to a mode per
>> se. If you can turn off PTI, then you have the ability to read all of
>> kernel memory So maybe we should treat it as such. Add a capability
>> CAP_DISABLE_PTI. If you have that capability (globally), then you can
>> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
>> If you lose the cap, you lose no-pti mode as well.
>
> I disagree on this, because the only alternative I have is to decide
> to keep my process running as root, which is even worse, as root can
> much more easily escape from a chroot jail for example, or access
> /dev/mem and read all the memory as well. Also tell Linus that he'll
> have to build his kernels as root ;-)
Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient"
capability and let it be inherited by all descendents even if
unprivileged. This was all very carefully designed so that a program
that inherited an ambient capability but tries to run a
less-privileged child using pre-4.3 techniques will correctly drop the
ambient capability, which is *exactly* what we want for PTI.
So I stand by my suggestion. Linus could still do:
$ nopti make -j512
and have it work, but trying to ptrace() the make process from outside
the nopti process tree (and without doing nopti ptrace) will fail, as
expected. (Unless root does the ptrace, again as expected.)
>
> The arch_prctl() calls I proposed only allow to turn PTI off for
> privileged users but any user can turn it back on. For me it's
> important. A process might trust itself for its own use, but such
> processes will rarely trust external processes in case they need to
> perform an occasional fork+exec. Imagine for example a static web
> server requiring to achieve the highest possible performance and
> having to occasionally call logrotate to rotate+compress the logs.
> It's better if the process knows how to turn PTI back on before
> calling this.
In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
the ability to have PTI off. If you have PTI off, you can turn it
back in using prctl() or whatever. So you call prctl() (to turn PTI
back on) or capset() (to turn it on and drop the ability to turn it
off).
How exactly do you plan to efficiently call logrotate from your
webserver if the flag is per-mm, though? You don't want to fork() in
your fancy server because fork() write-protects the whole address
space. So you use clone() or vfork() (like posix_spawn() does
internally on a sane libc), but now you can't turn PTI back on if it's
per-mm because you haven't changed your mm.
I really really think it should be per thread.
>
>> If an LSM wants to block it, it can use existing mechanisms.
>>
>> As for per-mm vs per-thread, let's make it only switchable in
>> single-threaded processes for now and inherited when threads are
>> created.
>
> That's exactly what it does for now, but Linus doesn't like it at all.
> So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> point regarding the pgd and _PAGE_NX setting. point Now at least we know
> the change is minimal if we have a good reason for doing differently
> later.
Yuck, I hate this. Are you really going to wire it up complete with
all the IPIs needed to get the thing synced up right? it's also going
to run slower no matter what, I think, because you'll have to sync the
per-mm state back to the TI flags on context switches.
Linus, can you explain again why you think PTI should be a per-mm
thing? That is, why do you think it's useful and why do you think it
makes logical sense from a user's POV? I think the implementation is
easier and saner for per-thread. Also, if I we use a capability bit
for it, making it per-mm gets really weird.
--Andy
On Thu, Jan 11, 2018 at 09:53:26AM -0800, Andy Lutomirski wrote:
> >> So I think that no-pti mode is a privilege as opposed to a mode per
> >> se. If you can turn off PTI, then you have the ability to read all of
> >> kernel memory So maybe we should treat it as such. Add a capability
> >> CAP_DISABLE_PTI. If you have that capability (globally), then you can
> >> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
> >> If you lose the cap, you lose no-pti mode as well.
> >
> > I disagree on this, because the only alternative I have is to decide
> > to keep my process running as root, which is even worse, as root can
> > much more easily escape from a chroot jail for example, or access
> > /dev/mem and read all the memory as well. Also tell Linus that he'll
> > have to build his kernels as root ;-)
>
> Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient"
> capability and let it be inherited by all descendents even if
> unprivileged. This was all very carefully designed so that a program
> that inherited an ambient capability but tries to run a
> less-privileged child using pre-4.3 techniques will correctly drop the
> ambient capability, which is *exactly* what we want for PTI.
Ah thanks for explaining what these "ambient" capabilities are, I saw
the term a few times but never looked closer.
> So I stand by my suggestion. Linus could still do:
>
> $ nopti make -j512
>
> and have it work, but trying to ptrace() the make process from outside
> the nopti process tree (and without doing nopti ptrace) will fail, as
> expected. (Unless root does the ptrace, again as expected.)
This may be reasonable.
> > The arch_prctl() calls I proposed only allow to turn PTI off for
> > privileged users but any user can turn it back on. For me it's
> > important. A process might trust itself for its own use, but such
> > processes will rarely trust external processes in case they need to
> > perform an occasional fork+exec. Imagine for example a static web
> > server requiring to achieve the highest possible performance and
> > having to occasionally call logrotate to rotate+compress the logs.
> > It's better if the process knows how to turn PTI back on before
> > calling this.
>
> In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
> the ability to have PTI off. If you have PTI off, you can turn it
> back in using prctl() or whatever. So you call prctl() (to turn PTI
> back on) or capset() (to turn it on and drop the ability to turn it
> off).
Hmmm OK. I still don't like much to conflate the "turn back on" between
the two distinct calls. If the capability grants you the right to act
on prctl(), it should not perform the action in your back when disabled.
It may even lead people to care less about it which is not a good practise.
> How exactly do you plan to efficiently call logrotate from your
> webserver if the flag is per-mm, though? You don't want to fork() in
> your fancy server because fork() write-protects the whole address
> space.
I'm not following you, what's the problem here ? I mean most programs
do fork() then close() a few FDs that were not marked CLOEXEC, then
execve(). The purpose is to avoid changing existing programs too much.
> So you use clone() or vfork() (like posix_spawn() does
> internally on a sane libc), but now you can't turn PTI back on if it's
> per-mm because you haven't changed your mm.
But as soon as you write anything it's cloned, right ? Ie you write in
the stack. I could say bullshit here, but surely we have a way to split
them.
> I really really think it should be per thread.
It could be a good argument here.
> >> As for per-mm vs per-thread, let's make it only switchable in
> >> single-threaded processes for now and inherited when threads are
> >> created.
> >
> > That's exactly what it does for now, but Linus doesn't like it at all.
> > So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> > point regarding the pgd and _PAGE_NX setting. point Now at least we know
> > the change is minimal if we have a good reason for doing differently
> > later.
>
> Yuck, I hate this. Are you really going to wire it up complete with
> all the IPIs needed to get the thing synced up right? it's also going
> to run slower no matter what, I think, because you'll have to sync the
> per-mm state back to the TI flags on context switches.
At this point I'm lost, all I can do is trust you guys once you agree
on a solution :-)
> Linus, can you explain again why you think PTI should be a per-mm
> thing? That is, why do you think it's useful and why do you think it
> makes logical sense from a user's POV? I think the implementation is
> easier and saner for per-thread. Also, if I we use a capability bit
> for it, making it per-mm gets really weird.
thanks,
Willy
On 01/11/2018 10:05 AM, Willy Tarreau wrote:
>> That's exactly what it does for now, but Linus doesn't like it at all.
>> So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
>> point regarding the pgd and _PAGE_NX setting. point Now at least we know
>> the change is minimal if we have a good reason for doing differently
>> later.
> Yuck, I hate this. Are you really going to wire it up complete with
> all the IPIs needed to get the thing synced up right?
Well, on the bright side, we don't need IPIs when _removing_ NX. We can
just handle those like a spurious fault.
But, when re-enabling it, we need all the TLB flushing for all the CPUs
that have run with the un-NX'd page tables. I guess that's the one
benefit if you come up with an API that only allows "disable PTI" while
a task is running but leaves execve()/fork() as places that implicitly
reenable PTI.
On Thu, Jan 11, 2018 at 09:02:55AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
> <[email protected]> wrote:
> > On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> >>> I think we also need to be able to dump the actual
> >>> CR3 value that we entered the kernel with before we start doing too much
> >>> other funky stuff with the entry code.
> >> When you say dump, you mean save it somewhere in a per_cpu variable ?
> >
> > Yeah, I think a per-cpu variable is fine for now. But, that only gives
> > you a dump from a single entry to the kernel. Ideally, it would be nice
> > to have a stack of them so you could do things like debug
> > syscall->fault->oops. Was it the syscall entry or the fault entry that
> > lead to the oops?
> >
> > But, the stack gets really fun because of NMIs.
> >
> > I'm sure Andy Lutomirski has some ideas too.
>
> I was thinking that maybe we should add a new field or two to pt_regs.
> They could store CR2 and maybe CR3 as well. I'd also like to expose
> the error code of exceptions in stack traces. We should get this
> integrated right into the unwinder.
hmm. Exposing cr3 to user space will make it trivial for user process
to know whether kpti is active. Not sure how exploitable such
information leak.
On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <[email protected]> wrote:
>> As for per-mm vs per-thread, let's make it only switchable in
>> single-threaded processes for now and inherited when threads are
>> created.
>
> That's exactly what it does for now, but Linus doesn't like it at all.
Just to clarify: I definitely want the part where it is only
switchable in single-threaded mode, and I actually do want it
"inherited" by threads when they do get created.
It's just that my mental model for threads is not that they "inherit"
the PTI flag, it's that they simply share it. But that's more of a
difference in "views" than in actual behavior.
>> (Another reason for per-thread instead of per-mm: as a per-mm thing,
>> you can't set it up for your descendents using vfork(); prctl();
>> exec(), and the latter is how your average language runtime that
>> spawns subprocesses would want to do it.
>
> That's indeed the benefit it provides for now since I actually had
> to *add* code to execve() to disable it then.
So the "vfork()" case is indeed interesting, but I don't think it's
all that relevant.
Why?
If you do the PTI on/off operation *before* the vfork(), nothing is
different. The vfork() by definition ends up having the same PTI
state, since it has the same VM. But that's actually 100% expected,
and it matches the fork() behavior too: the PTI state should be just
copied by a fork(), since fork isn't any protection domain.
And *after* you've done a vfork(), you can't do a PTI on/off, because
now the VM has multiple users, which is 100% equivalent to the thread
case that we already all agreed should be disallowed. So no, you can't
do "vfork -> setnopti -> exec', but that is in no way different from
any of the *other* things you cannot do in between vfork and execve.
And in a wrapper that sets nopti, you wouldn't want to use vfork
anyway. You wouldn't even want to use *fork*. You'd just do "set
nopti" and then execve(). That's the whole point of the wrapper.
So vfork() is worth _mentioning_, but I don't think there is any
actual issue there. Quite the reverse - it acts exactly as expected.
The main thing that should be special for PTI on/off is "execve()".
That's the one that may force PTI on again, because of a security
boundary.
The other case may be the CLONE_NEW* operations. I *think* they are
noops as far as PTI settings would be, but I think people should think
about them.
Linus
On Thu, Jan 11, 2018 at 10:25 AM, Linus Torvalds
<[email protected]> wrote:
>
> The other case may be the CLONE_NEW* operations. I *think* they are
> noops as far as PTI settings would be, but I think people should think
> about them.
Oh, and yes, I think the npti flag should also break ptrace(). I do
agree with Andy that it's a "capability", although I do not think it
should actually be implemented as one.
Linus
On 01/11/2018 10:21 AM, Alexei Starovoitov wrote:
>> I was thinking that maybe we should add a new field or two to pt_regs.
>> They could store CR2 and maybe CR3 as well. I'd also like to expose
>> the error code of exceptions in stack traces. We should get this
>> integrated right into the unwinder.
> hmm. Exposing cr3 to user space will make it trivial for user process
> to know whether kpti is active. Not sure how exploitable such
> information leak.
It also gives userspace a pretty valuable physical address to go after.
That, plus a KASLR defeat gives you a known-valuable virtual address to
target. That's no good.
I think CR3 is a non-starter.
On Thu, Jan 11, 2018 at 10:15 AM, Dave Hansen
<[email protected]> wrote:
>
> Well, on the bright side, we don't need IPIs when _removing_ NX. We can
> just handle those like a spurious fault.
I think I agree.
> But, when re-enabling it, we need all the TLB flushing for all the CPUs
> that have run with the un-NX'd page tables.
Actually, I really don't think we should even allow "re-enable PTI".
The only thing that re-enables PTI is a completely new page table,
notably "execve()".
And I think that is when the "NOW" vs "NEXT" *may* make sense. Not for
enabling PTI, but if we want to have a "disable PTI", I think it
should act on the next execve().
And one reason I think we want that behavior is that once you've
disabled PTI, I don't think the double page tables would necessarily
even exist, and I don't think we should try to re-populate them. A
noPTI process might simply *have* just the single page table.
That wouldn't be the first implementation, but I think the interface
should be designed for that kind of thing in mind, where nopti really
means "stop doing two page tables for this process". And that may make
it *impossible* to re-enable PTI for this process, simply because we
don't have the required double-page PGD allocation at all.
Linus
On Thu, Jan 11, 2018 at 10:21:49AM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 11, 2018 at 09:02:55AM -0800, Andy Lutomirski wrote:
> > On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
> > <[email protected]> wrote:
> > > On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> > >>> I think we also need to be able to dump the actual
> > >>> CR3 value that we entered the kernel with before we start doing too much
> > >>> other funky stuff with the entry code.
> > >> When you say dump, you mean save it somewhere in a per_cpu variable ?
> > >
> > > Yeah, I think a per-cpu variable is fine for now. But, that only gives
> > > you a dump from a single entry to the kernel. Ideally, it would be nice
> > > to have a stack of them so you could do things like debug
> > > syscall->fault->oops. Was it the syscall entry or the fault entry that
> > > lead to the oops?
> > >
> > > But, the stack gets really fun because of NMIs.
> > >
> > > I'm sure Andy Lutomirski has some ideas too.
> >
> > I was thinking that maybe we should add a new field or two to pt_regs.
> > They could store CR2 and maybe CR3 as well. I'd also like to expose
> > the error code of exceptions in stack traces. We should get this
> > integrated right into the unwinder.
>
> hmm. Exposing cr3 to user space will make it trivial for user process
> to know whether kpti is active. Not sure how exploitable such
> information leak.
It's already trivial to detect PTI from user space.
--
Josh
On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>> 4. Cleared on setuid() and friends
> This one causes me a problem : some daemons already take care of dropping
> privileges after the initial fork() for the sake of security. Haproxy
> typically does this at boot :
>
> - parse config
> - chroot to /var/empty
> - setuid(dedicated_uid)
> - fork()
This makes me a _bit_ nervous. I think Andy touched on this, but I'll
say it another way: you want PTI turned off because you trust an app to
be good, but you also drop permissions because it is exposed to an
environment where you do *not* fully trust it.
I'm not sure how you reconcile that.
If your proxy gets compromised, and pti is turned off, you are entirely
exposed to meltdown from that process. I don't know exactly what you
are doing, but isn't this proxy sitting there shuffling untrusted user
data around all day?
On Thu, Jan 11, 2018 at 10:32 AM, Josh Poimboeuf <[email protected]> wrote:
> On Thu, Jan 11, 2018 at 10:21:49AM -0800, Alexei Starovoitov wrote:
>>
>> hmm. Exposing cr3 to user space will make it trivial for user process
>> to know whether kpti is active. Not sure how exploitable such
>> information leak.
>
> It's already trivial to detect PTI from user space.
I agree. I don't think the whole "is PTI on" is all that much of a secret.
But exposing all of %cr3 to user space is completely unacceptable.
That's just about *the* best attack vector information you could give
to somebody, and would make KASLR almost totally uninteresting. If you
have access to the page directory pointer, and some other weakness
that allows you to access it, you're golden. You can do anything.
Linus
On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>> hmm. Exposing cr3 to user space will make it trivial for user process
>> to know whether kpti is active. Not sure how exploitable such
>> information leak.
> It's already trivial to detect PTI from user space.
Do tell.
On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
<[email protected]> wrote:
> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>>> hmm. Exposing cr3 to user space will make it trivial for user process
>>> to know whether kpti is active. Not sure how exploitable such
>>> information leak.
>> It's already trivial to detect PTI from user space.
>
> Do tell.
One way to do it is to just run the attack, and see if you get something.
So it's not really "is PTI enabled", but a "is meltdown there". Then
you just use that together with cpuinfo to decide if PTI is enabled.
So I think Josh is 100% right. Detecting PTI on/off is not hard.
But that does *not* mean that %cr3 isn't secret. %cr3 should
definitely never *ever* be accessible to user space.
Linus
On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> <[email protected]> wrote:
>> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>>>> hmm. Exposing cr3 to user space will make it trivial for user process
>>>> to know whether kpti is active. Not sure how exploitable such
>>>> information leak.
>>> It's already trivial to detect PTI from user space.
>> Do tell.
> One way to do it is to just run the attack, and see if you get something.
Not judging how trivial (or not) the attack is, I was hoping for
something that was *not* the attack itself. :)
I'd love to have a tool that tells you for sure "KPTI enabled or not",
but I'd also love to have it be something I can easily distribute
without it being handled like a WMD.
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> > <[email protected]> wrote:
> >> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >>>> hmm. Exposing cr3 to user space will make it trivial for user process
> >>>> to know whether kpti is active. Not sure how exploitable such
> >>>> information leak.
> >>> It's already trivial to detect PTI from user space.
> >> Do tell.
> > One way to do it is to just run the attack, and see if you get something.
>
> Not judging how trivial (or not) the attack is, I was hoping for
> something that was *not* the attack itself. :)
>
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
Instead of the meltdown attack, why not just run the KASLR attack, with
prefetches + cache timing?
Something like this (I haven't tested it though):
https://github.com/IAIK/prefetch/blob/master/addrspace/addrspace.c
Andrea also created such a tool, but IIRC, it requires knowing a kernel
address, so it wouldn't work with KASLR. It could probably be extended
to scan kernel space until it finds something.
We could maybe even add such a tool to the kernel tree.
--
Josh
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
You mean this:
https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
?
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thu, Jan 11, 2018 at 10:38:07AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >> hmm. Exposing cr3 to user space will make it trivial for user process
> >> to know whether kpti is active. Not sure how exploitable such
> >> information leak.
> > It's already trivial to detect PTI from user space.
>
> Do tell.
Probably because meltdown doesn't work anymore ? :-)
Willy
On Thu, Jan 11, 2018 at 10:57 AM, Dave Hansen
<[email protected]> wrote:
>
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
As Josh points out, the whole meltdown excitement actually already
generated a number of projects that aren't the attack, but show some
of the effects.
But Thomas does have a whole sysfs series to show "mitigation status
for various hardware bugs".
Linus
On 01/11/2018 11:07 AM, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
>> I'd love to have a tool that tells you for sure "KPTI enabled or not",
>> but I'd also love to have it be something I can easily distribute
>> without it being handled like a WMD.
> You mean this:
>
> https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
I meant that works across all the kpti implementations. Those with gunk
in /sys, /proc/cpuinfo, or with nothing at all like the original KAISER
patches I posted.
And, yes, I want a pony. I just though someone had such a pony handy
and it was trivial.
Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
OG.
On Thu, Jan 11, 2018 at 8:17 PM, Dave Hansen
<[email protected]> wrote:
> On 01/11/2018 11:07 AM, Borislav Petkov wrote:
>> On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
>>> I'd love to have a tool that tells you for sure "KPTI enabled or not",
>>> but I'd also love to have it be something I can easily distribute
>>> without it being handled like a WMD.
>> You mean this:
>>
>> https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
>
> I meant that works across all the kpti implementations. Those with gunk
> in /sys, /proc/cpuinfo, or with nothing at all like the original KAISER
> patches I posted.
>
> And, yes, I want a pony. I just though someone had such a pony handy
> and it was trivial.
On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
I think only if you had a baseline measurement to compare against.
--
Josh
> On Jan 11, 2018, at 10:26 AM, Linus Torvalds <[email protected]> wrote:
>
> On Thu, Jan 11, 2018 at 10:25 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> The other case may be the CLONE_NEW* operations. I *think* they are
>> noops as far as PTI settings would be, but I think people should think
>> about them.
>
> Oh, and yes, I think the npti flag should also break ptrace(). I do
> agree with Andy that it's a "capability", although I do not think it
> should actually be implemented as one.
For all that Linux capabilities are crap, nopti walks like one and quacks like one. It needs to affect ptrace() permissions, it needs a way to disable it systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all of this without tons of churn, auditing, and a whole new configuration thingy for each LSM. And I avoids permanently polluting ptrace checks, the LSM interface, etc for what is, essentially, a performance hack to work around a blatant error in the design of some CPUs.
Plus, with ambient caps, we already did the nasty part of the with and finished all the relevant bikeshedding.
So I'd rather just hold my nose and add the new capability bit.
On Thu, 11 Jan 2018 13:26:01 -0600
Josh Poimboeuf <[email protected]> wrote:
> On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> > Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
>
> I think only if you had a baseline measurement to compare against.
getuid can also be cached by a smart libc. getppid() is what micro
benchmarks seem to favour.
Alan
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> > <[email protected]> wrote:
> >> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >>>> hmm. Exposing cr3 to user space will make it trivial for user process
> >>>> to know whether kpti is active. Not sure how exploitable such
> >>>> information leak.
> >>> It's already trivial to detect PTI from user space.
> >> Do tell.
> > One way to do it is to just run the attack, and see if you get something.
>
> Not judging how trivial (or not) the attack is, I was hoping for
> something that was *not* the attack itself. :)
the attack can also be inconclusive.
I'm playing with an idea to conditionally switch into user cr3
after returning from syscall.
Like per task or per cpu counter or randomly tell syscall return to
keep kernel cr3 while next interrupt will bring task back to user cr3.
Public exploits use syscalls to keep kernel memory in L1 and
with such hack they see partial kernel reads and cannot really tell
which are the kernel bytes in the mix.
If there is a fast way for an attacker to know that after
syscall kpti is off for this task then such conditional kpti on/off
will be completely pointless.
It's not 100% secure obviously. Sort-of best effort to bring back
most of syscall performance without thinking which task should or
should not be allowed to toggle kpti flag.
On 01/11/2018 09:02 AM, Andy Lutomirski wrote:
>> But, the stack gets really fun because of NMIs.
>>
>> I'm sure Andy Lutomirski has some ideas too.
> I was thinking that maybe we should add a new field or two to pt_regs.
> They could store CR2 and maybe CR3 as well. I'd also like to expose
> the error code of exceptions in stack traces. We should get this
> integrated right into the unwinder.
The trampoline and (normal) interrupt stacks should be pretty doable.
It's the NMI mess that I'm worried about. I tried to change the stack
layout in there once and ran away screaming.
On Thu, Jan 11, 2018 at 07:34:31PM +0000, Alan Cox wrote:
> On Thu, 11 Jan 2018 13:26:01 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
> > On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> > > Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
> >
> > I think only if you had a baseline measurement to compare against.
>
> getuid can also be cached by a smart libc. getppid() is what micro
> benchmarks seem to favour.
I initially tried getpid() which I found to be cached by glibc, but I
switched to write(-1, "a", 1) in the example in the cover letter and
the test for 3 million runs roughly climbs from 200 ms to 900 ms under
kvm (with PCID this time, I didn't retest without).
Willy
On Thu, Jan 11, 2018 at 10:35:57AM -0800, Dave Hansen wrote:
> On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> >> 4. Cleared on setuid() and friends
> > This one causes me a problem : some daemons already take care of dropping
> > privileges after the initial fork() for the sake of security. Haproxy
> > typically does this at boot :
> >
> > - parse config
> > - chroot to /var/empty
> > - setuid(dedicated_uid)
> > - fork()
>
> This makes me a _bit_ nervous. I think Andy touched on this, but I'll
> say it another way: you want PTI turned off because you trust an app to
> be good, but you also drop permissions because it is exposed to an
> environment where you do *not* fully trust it.
>
> I'm not sure how you reconcile that.
>
> If your proxy gets compromised, and pti is turned off, you are entirely
> exposed to meltdown from that process.
Yes but security is not either black or white, it's about adding protections
that make sense wherever possible. Meltdown would "only" allow you to dump
the system's memory. Running as root would additionally allow you to modify
this memory, ptrace processes outside of the chroot to inject backdoors etc.
For me, and for a lot of users I guess, the choice is easily made.
> I don't know exactly what you
> are doing, but isn't this proxy sitting there shuffling untrusted user
> data around all day?
Yes definitely for some users, while it can be very trusted data for others.
For example those dealing with wire transfers have much more to care about
than someone reading their kernel memory when they can also read and modify
the amount of money in the transfers if they want!
And it has to do all this *fast*, that's one of its most touted qualities.
And that's even why people who care the most about performance always
install it on Linux because today it's the best OS for this job.
Again Dave, it's a tradeoff. People using such components generally know
what they are doing and are the ones to decide. Some of them prefer to run
as root because it allows them to bind to foreign addresses, others cannot
start as root because the local policy is strict about this, some do a lot
of things some of us would consider ugly but which are critical to their
acitvity. It's not up to me as the software's author nor to us as system
developers to tell our users how they will be much slower to be more
secure if they decide that they don't care about this *specific* risk.
Our job is to properly protect those who don't know they're at risk and
to help the other ones reduce the risk to the minimum needed for a given
task.
In this case the minimum can be pretty well summarized :
- PTI on by default on the system
- no PTI for the performance-sensitive process if the admin decides it
- switch to a harmless uid/gid couple once not needed to be root anymore
- if the process may execute external processes, give it an option to
disable PTI before or during execve() to avoid exposing the system to
what they could be doing.
I think that's pretty reasonable for a normal production environment.
Willy
On Thu, Jan 11, 2018 at 10:25:29AM -0800, Linus Torvalds wrote:
> Just to clarify: I definitely want the part where it is only
> switchable in single-threaded mode, and I actually do want it
> "inherited" by threads when they do get created.
OK this is what is currently done in series v3 because the TIF_* flags
are copied as-is to threads (unless I missed something). Even for
re-enabling it currently refuses it if mm_users > 1.
> It's just that my mental model for threads is not that they "inherit"
> the PTI flag, it's that they simply share it. But that's more of a
> difference in "views" than in actual behavior.
I see, thanks for explaining this point, I understand better your
concern now. Well, if we document that the current process' flag is
replicated as-is to any threads so that it is consistent across all
threads and that it may only be modified on all threads atomically,
which currently we can only achieve by doing it when there's a single
thread on an mm, I suspect it could match your mental model.
> If you do the PTI on/off operation *before* the vfork(), nothing is
> different. The vfork() by definition ends up having the same PTI
> state, since it has the same VM. But that's actually 100% expected,
> and it matches the fork() behavior too: the PTI state should be just
> copied by a fork(), since fork isn't any protection domain.
>
> And *after* you've done a vfork(), you can't do a PTI on/off, because
> now the VM has multiple users, which is 100% equivalent to the thread
> case that we already all agreed should be disallowed. So no, you can't
> do "vfork -> setnopti -> exec', but that is in no way different from
> any of the *other* things you cannot do in between vfork and execve.
That's where I like the principle of the NEXT ctl which can be per-
thread. The thread about to do an execve() cannot change its own flag
because it's entangled to the other ones sharing the same mm, but it
can change its own NEXT flag so that execve() starts with the specified
mode (typically PTI on in the example of log rotation for a server).
Quite honnestly for the NOW vs NEXT, I find the NOW convenient to avoid
a wrapper, but a program could also self-exec after setting the flag
(I've already done this to change thread stack sizes on certain
processes a long time ago and that's no real hassle). And given that
NOW cannot really re-adjust the PGD that was already assigned, maybe
in the end we should stick to this NEXT thing and wait for the next
execve() to apply the operation.
Willy
On Thu, Jan 11, 2018 at 01:28:18PM -0800, Linus Torvalds wrote:
> The traditional fast system call to test is getppid().
>
> write() goes through a lot more code.
Just tried getppid() now, it's relatively similar (slightly slower than
write(-1) though, maybe that one aborts very early) :
PTI=on : 920ms for 3 million calls
PTI=off (prctl) : 230ms for 3 million calls
PTI=off (boot) : 215ms for 3 million calls
The small difference between the last two very likely comes from the few
instructions avoided thanks to the alternatives when pti=off is used at
boot.
So yes here it's trivial to tell if it's on or off :-)
Willy
> I initially tried getpid() which I found to be cached by glibc, but I
> switched to write(-1, "a", 1) in the example in the cover letter and
> the test for 3 million runs roughly climbs from 200 ms to 900 ms under
> kvm (with PCID this time, I didn't retest without).
umask() is the fastest system call I think.
Also who cares about glibc caching?
static inline int sys_umask(int mask)
{
int rv;
asm volatile (
"syscall"
: "=a" (rv)
: "0" (95), "D" (mask)
: "rcx", "r11", "cc", "memory"
);
return rv;
}
int main(void)
{
unsigned int i;
for (i = 0; i < (1U << 24); i++) {
sys_umask(0);
}
return 0;
}
From: Willy Tarreau
> Sent: 09 January 2018 12:56
>
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current process. The state change is only
> allowed if the mm is not shared (no threads have been created yet).
>
> Setting the flag to disable the protection is subject to CAP_SYS_RAWIO.
> However it is possible to re-enable the protection without this privilege.
While a process with CAP_SYS_RAWIO can easily access kernel memory so
get little extra protection from PTI, you probably want to be able
to disable PTI without giving a process CAP_SYS_RAWIO.
No point leaving the door wide open.
If you are trying to find the size of the performance hit you also
need to be able to disable (and re-enable) PTI for all the current and
future threads of a given process.
David
On Fri, Jan 12, 2018 at 03:03:23PM +0000, David Laight wrote:
> From: Willy Tarreau
> > Sent: 09 January 2018 12:56
> >
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current process. The state change is only
> > allowed if the mm is not shared (no threads have been created yet).
> >
> > Setting the flag to disable the protection is subject to CAP_SYS_RAWIO.
> > However it is possible to re-enable the protection without this privilege.
>
> While a process with CAP_SYS_RAWIO can easily access kernel memory so
> get little extra protection from PTI, you probably want to be able
> to disable PTI without giving a process CAP_SYS_RAWIO.
> No point leaving the door wide open.
Sure but if it's only a wrapper you don't mind. Start with the capability,
disable PTI, drop the capability and that's done.
Willy
From: Linus Torvalds
> Sent: 11 January 2018 18:25
> On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <[email protected]> wrote:
> >> As for per-mm vs per-thread, let's make it only switchable in
> >> single-threaded processes for now and inherited when threads are
> >> created.
> >
> > That's exactly what it does for now, but Linus doesn't like it at all.
>
> Just to clarify: I definitely want the part where it is only
> switchable in single-threaded mode, and I actually do want it
> "inherited" by threads when they do get created.
...
You need to allow for libraries that create threads before main()
is called.
David
From: Willy Tarreau
> Sent: 11 January 2018 22:07
> On Thu, Jan 11, 2018 at 01:28:18PM -0800, Linus Torvalds wrote:
> > The traditional fast system call to test is getppid().
> >
> > write() goes through a lot more code.
>
> Just tried getppid() now, it's relatively similar (slightly slower than
> write(-1) though, maybe that one aborts very early) :
>
> PTI=on : 920ms for 3 million calls
> PTI=off (prctl) : 230ms for 3 million calls
> PTI=off (boot) : 215ms for 3 million calls
>
> The small difference between the last two very likely comes from the few
> instructions avoided thanks to the alternatives when pti=off is used at
> boot.
>
> So yes here it's trivial to tell if it's on or off :-)
A system call with a larger kernel memory footprint, and user
code that touches more pages, might show an even bigger difference
between PTI=on and PTI=off.
David
On Fri, Jan 12, 2018 at 8:27 AM, David Laight <[email protected]> wrote:
>
> You need to allow for libraries that create threads before main()
> is called.
I really don't think we do. I think the normal case is the wrapper.
Processes should never say "I'm so important that I'm disabling PTI".
That's crazy talk, and wrong.
It's wrong for all the usual reasons - everybody always thinks that
_their_ own work is so important and bug-free, and that things like
PTI are about protecting all those other imcompetent people.
No. Bullshit. Nobody should ever disable PTI for themselves, because
nobody is inherently trustworthy.
Instead, we have the case of something _external_ saying "this
process is so important that it should be started without PTI".
Linus
On Fri, Jan 12, 2018 at 09:55:45AM -0800, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 8:27 AM, David Laight <[email protected]> wrote:
> >
> > You need to allow for libraries that create threads before main()
> > is called.
>
> I really don't think we do. I think the normal case is the wrapper.
>
> Processes should never say "I'm so important that I'm disabling PTI".
> That's crazy talk, and wrong.
>
> It's wrong for all the usual reasons - everybody always thinks that
> _their_ own work is so important and bug-free, and that things like
> PTI are about protecting all those other imcompetent people.
>
> No. Bullshit. Nobody should ever disable PTI for themselves, because
> nobody is inherently trustworthy.
>
> Instead, we have the case of something _external_ saying "this
> process is so important that it should be started without PTI".
I totally agree, and what I initially envisionned (for my use case)
was a config option with a scary enough name if we couldn't have the
wrapper. But the wrapper brings the long term benefit of allowing us
to do what we want with the pgd, which is a nice add-on.
Willy
* Andy Lutomirski <[email protected]> wrote:
> > Oh, and yes, I think the npti flag should also break ptrace(). I do agree with
> > Andy that it's a "capability", although I do not think it should actually be
> > implemented as one.
>
> For all that Linux capabilities are crap, nopti walks like one and quacks like
> one. It needs to affect ptrace() permissions, it needs a way to disable it
> systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all
> of this without tons of churn, auditing, and a whole new configuration thingy
> for each LSM. And I avoids permanently polluting ptrace checks, the LSM
> interface, etc for what is, essentially, a performance hack to work around a
> blatant error in the design of some CPUs.
>
> Plus, with ambient caps, we already did the nasty part of the with and finished
> all the relevant bikeshedding.
>
> So I'd rather just hold my nose and add the new capability bit.
Those all seem pretty valid arguments to me.
Thanks,
Ingo
> On Jan 12, 2018, at 12:22 PM, Ingo Molnar <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
>>> Oh, and yes, I think the npti flag should also break ptrace(). I do agree with
>>> Andy that it's a "capability", although I do not think it should actually be
>>> implemented as one.
>>
>> For all that Linux capabilities are crap, nopti walks like one and quacks like
>> one. It needs to affect ptrace() permissions, it needs a way to disable it
>> systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all
>> of this without tons of churn, auditing, and a whole new configuration thingy
>> for each LSM. And I avoids permanently polluting ptrace checks, the LSM
>> interface, etc for what is, essentially, a performance hack to work around a
>> blatant error in the design of some CPUs.
>>
>> Plus, with ambient caps, we already did the nasty part of the with and finished
>> all the relevant bikeshedding.
>>
>> So I'd rather just hold my nose and add the new capability bit.
>
> Those all seem pretty valid arguments to me.
>
>
FWIW, if we take this approach, then either dropping the capability should turn PTI back on or we need to deal with the corner case of PTI off and capability not present. The latter is a bit awkward but not necessarily a show stopper. I think that all we need to do is to update the ptrace rules and maybe make PTI turn back on when we execve. At least there's no need to muck around with LSM hooks.
On Fri, Jan 12, 2018 at 01:18:06PM -0800, Andy Lutomirski wrote:
> FWIW, if we take this approach, then either dropping the capability should
> turn PTI back on or we need to deal with the corner case of PTI off and
> capability not present. The latter is a bit awkward but not necessarily a
> show stopper. I think that all we need to do is to update the ptrace rules
> and maybe make PTI turn back on when we execve. At least there's no need to
> muck around with LSM hooks.
That's my point as well, just the same principle as the "NEXT" prctl : only
perform changes on execve(). At least we're sure to deal with something
consistent and it's the right moment for deciding on _PAGE_NX.
Willy