2018-01-08 16:12:48

by Willy Tarreau

[permalink] [raw]
Subject: [PATCH RFC 0/4] Per-task PTI activation

Hi!

I could experiment a bit with the possibility to enable/disable PTI per
task. Please keep in mind that it's not my area of experitise at all, but
doing so I could recover the initial performance without disabling PTI on
the whole system.

So what I did in this series consists in the following :
- addition of a new per-task TIF_NOPTI flag. Please note that I'm not
proud of the way I did it, as 32 flags were already taken. The flags
are declared as "long" so there are 32 more flags available on x86_64
but C and asm disagree on the type of 1<<32 so I had to declare the
hex value by hand... By the way I even suspect that _TIF_FSCHECK is
wrong once cast to a long, I think it causes sign extension into the
32 upper bits since it's supposed to be signed.

- addition of a set of arch_prctl() calls (ARCH_GET_NOPTI and
ARCH_SET_NOPTI), to check and change the activation of the
protection. The change requires CAP_SYS_RAWIO and can be done in
a wrapper (that's how I tested)

- the user PGD was marked with _PAGE_NX to prevent an accidental leak
of CR3 from not being detected. I obviously had to disable this since
in this case we do want such a user task to run without switching the
PGD. I think this could be performed per-task maybe. Another approach
might consist in dealing with 3 PGDs and using a different one for
unprotected tasks but that really starts to sound overkill.

- upon return to userspace, I check if the task's flags contain the
new TIF_NOPTI or not. If it does contain it, then we don't switch
the CR3.

- upon entry into the kernel from userspace, we can't access the task's
flags but we can already check if CR3 points to the kernel or user PGD,
and we refrain from switching if it's already the system one.

By doing so I could recover the initial performance of haproxy in a VM,
going from 12400 connections per second to 21000 once started with this
trivial wrapper :

#include <asm/prctl.h>
#include <sys/prctl.h>

#ifndef ARCH_SET_NOPTI
#define ARCH_SET_NOPTI 0x1022
#endif

int main(int argc, char **argv)
{
arch_prctl(ARCH_SET_NOPTI, 1);
argv++;
return execvp(argv[0], argv);
}

I have not yet run it on real hardware. Before trying to go a bit further
I'd like to know if such an approach is acceptable or if I'm doing anything
stupid and looking in the wrong direction.

Thanks!
Willy


Willy Tarreau (4):
x86/thread_info: add TIF_NOPTI to disable PTI per task
x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to
enable/disable PTI
x86/pti: don't mark the user PGD with _PAGE_NX.
x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/thread_info.h | 8 ++++++++
arch/x86/include/uapi/asm/prctl.h | 3 +++
arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
arch/x86/mm/pti.c | 2 ++
5 files changed, 60 insertions(+)

--
1.7.12.1


2018-01-08 16:12:55

by Willy Tarreau

[permalink] [raw]
Subject: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task

This flag indicates that the task will not use isolated page tables.

Signed-off-by: Willy Tarreau <[email protected]>
---
arch/x86/include/asm/thread_info.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0022333..2f92cf1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -126,6 +126,14 @@ struct thread_info {
#define _TIF_X32 (1 << TIF_X32)
#define _TIF_FSCHECK (1 << TIF_FSCHECK)

+/* The following flags only exist on x86-64. We can't use the shift anymore
+ * due to C using signed ints by default and asm using unsigned longs.
+ */
+#ifdef CONFIG_X86_64
+# define TIF_NOPTI 32 /* disable PTI for this task */
+# define _TIF_NOPTI 0x0000000100000000
+#endif
+
/*
* work to do in syscall_trace_enter(). Also includes TIF_NOHZ for
* enter_from_user_mode()
--
1.7.12.1

2018-01-08 16:13:03

by Willy Tarreau

[permalink] [raw]
Subject: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

This allows to report the current state of the PTI protection and to
enable or disable it for the current task.

Signed-off-by: Willy Tarreau <[email protected]>
---
arch/x86/include/uapi/asm/prctl.h | 3 +++
arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9..1f1b5bc 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
#define ARCH_GET_CPUID 0x1011
#define ARCH_SET_CPUID 0x1012

+#define ARCH_GET_NOPTI 0x1021
+#define ARCH_SET_NOPTI 0x1022
+
#define ARCH_MAP_VDSO_X32 0x2001
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c754662..1686d3d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
+ case ARCH_GET_NOPTI: {
+ unsigned long flag;
+
+ printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
+ flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
+ ret = put_user(flag, (unsigned long __user *)arg2);
+ break;
+ }
+
+ case ARCH_SET_NOPTI:
+ if (!capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
+ printk(KERN_DEBUG "set1: task=%p ti=%p fl=%16lx doit=%d arg2=%ld\n", task, task_thread_info(task), task_thread_info(task)->flags, doit, arg2);
+
+ if (doit) {
+ if (arg2)
+ task_thread_info(task)->flags |= _TIF_NOPTI;
+ else
+ task_thread_info(task)->flags &= ~_TIF_NOPTI;
+
+ printk(KERN_DEBUG "set2: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
+ }
+ break;

#ifdef CONFIG_CHECKPOINT_RESTORE
# ifdef CONFIG_X86_X32_ABI
--
1.7.12.1

2018-01-08 16:13:11

by Willy Tarreau

[permalink] [raw]
Subject: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

Since we're going to keep running on the same PGD when returning to
userspace for certain performance-critical tasks, we'll need the user
pages to be executable. So this code disables the extra protection
that was added consisting in marking user pages _PAGE_NX so that this
pgd remains usable for userspace.

Note: it isn't necessarily the best approach, but one way or another
if we want to be able to return to userspace from the kernel,
we'll have to have this executable anyway. Another approach
might consist in using another pgd for userland+kernel but
the current core really looks like an extra careful measure
to catch early bugs if any.

Signed-off-by: Willy Tarreau <[email protected]>
---
arch/x86/mm/pti.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 43d4a4a..9e2dca0 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -135,9 +135,11 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
* - we don't have NX support
* - we're clearing the PGD (i.e. the new pgd is not present).
*/
+#if 0
if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
(__supported_pte_mask & _PAGE_NX))
pgd.pgd |= _PAGE_NX;
+#endif

/* return the copy of the PGD we want the kernel to use: */
return pgd;
--
1.7.12.1

2018-01-08 16:13:26

by Willy Tarreau

[permalink] [raw]
Subject: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

If a task has the TIF_NOPTI flag set, it doesn't want to experience
page table isolation. In this case, returns from kernel to user will
not switch the CR3, leaving it to the kernel one which already maps
both user and kernel pages. Upon entry in the kernel, we can't check
this flag so we simply check if CR3 was pointing to the kernel's PGD,
indicating an earlier absence of switch, and in this case we don't
change it.

Thanks to these changes, haproxy running under KVM went back from
12400 conn/s to 21000 once loaded after calling prctl().

Signed-off-by: Willy Tarreau <[email protected]>
---
arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 45a63e0..054b8b7 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/jump_label.h>
+#include <asm/thread_info.h>
#include <asm/unwind_hints.h>
#include <asm/cpufeatures.h>
#include <asm/page_types.h>
@@ -214,6 +215,11 @@
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
+
+ /* if we're already on the kernel PGD, we don't switch */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+ jz .Lend_\@
+
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
.Lend_\@:
@@ -224,6 +230,12 @@

.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ /* "NOPTI" taskflag avoids the switch */
+ movq PER_CPU_VAR(current_task), \scratch_reg
+ btq $TIF_NOPTI, TASK_TI_flags(\scratch_reg)
+ jc .Lend_\@
+
mov %cr3, \scratch_reg

ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -262,6 +274,13 @@
ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
movq %cr3, \scratch_reg
movq \scratch_reg, \save_reg
+
+ /* if we're already on the kernel PGD, we don't switch,
+ * we just save the current cr3.
+ */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+ jz .Ldone_\@
+
/*
* Is the "switch mask" all zero? That means that both of
* these are zero:
@@ -284,6 +303,10 @@
.macro RESTORE_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI

+ /* if we saved a kernel context, we didn't switch so we don't switch */
+ testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
+ jz .Lend_\@
+
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

/*
--
1.7.12.1

2018-01-08 16:49:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> + if (doit) {
> + if (arg2)
> + task_thread_info(task)->flags |= _TIF_NOPTI;
> + else
> + task_thread_info(task)->flags &= ~_TIF_NOPTI;

{set,clear}_thread_flag() please, the above is not SMP safe.

2018-01-08 16:57:12

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 05:49:00PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> > + if (doit) {
> > + if (arg2)
> > + task_thread_info(task)->flags |= _TIF_NOPTI;
> > + else
> > + task_thread_info(task)->flags &= ~_TIF_NOPTI;
>
> {set,clear}_thread_flag() please, the above is not SMP safe.

Oops, thank you. I initially thought about it initially but didn't
know the names of the functions to use, and have left it there
thinking I'd fix it later. Usual source of bugs :-/

Bah and I left my debugging printk() in the patch as well! I'll
rework this a bit.

Thanks,
Willy

2018-01-08 16:57:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task

On Mon, 8 Jan 2018, Willy Tarreau wrote:

> This flag indicates that the task will not use isolated page tables.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/include/asm/thread_info.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 0022333..2f92cf1 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -126,6 +126,14 @@ struct thread_info {
> #define _TIF_X32 (1 << TIF_X32)
> #define _TIF_FSCHECK (1 << TIF_FSCHECK)
>
> +/* The following flags only exist on x86-64. We can't use the shift anymore

Please do not use this horrible comment syle

And what's wrong with (1UL << 32)?

> + * due to C using signed ints by default and asm using unsigned longs.
> + */
> +#ifdef CONFIG_X86_64
> +# define TIF_NOPTI 32 /* disable PTI for this task */

No tail comments please.

> +# define _TIF_NOPTI 0x0000000100000000
> +#endif
> +

Thanks,

tglx

2018-01-08 16:59:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> I could experiment a bit with the possibility to enable/disable PTI per
> task. Please keep in mind that it's not my area of experitise at all, but
> doing so I could recover the initial performance without disabling PTI on
> the whole system.

This cc list is way too small. Please at *least* include me and Andy on
this kind of stuff.

2018-01-08 17:02:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/include/uapi/asm/prctl.h | 3 +++
> arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
> #define ARCH_GET_CPUID 0x1011
> #define ARCH_SET_CPUID 0x1012
>
> +#define ARCH_GET_NOPTI 0x1021
> +#define ARCH_SET_NOPTI 0x1022
> +
> #define ARCH_MAP_VDSO_X32 0x2001
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> ret = put_user(base, (unsigned long __user *)arg2);
> break;
> }
> + case ARCH_GET_NOPTI: {
> + unsigned long flag;
> +
> + printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> + flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> + ret = put_user(flag, (unsigned long __user *)arg2);
> + break;

Per task is really an odd choice. That should be per process I think, but
that of course needs synchronization of some form. Aside of that we need to
think about fork().

Thanks,

tglx


2018-01-08 17:03:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.
>
> Note: it isn't necessarily the best approach, but one way or another
> if we want to be able to return to userspace from the kernel,
> we'll have to have this executable anyway. Another approach
> might consist in using another pgd for userland+kernel but
> the current core really looks like an extra careful measure
> to catch early bugs if any.

I don't like this.

I think the prctl() should apply to an entire process, not to a thread.
If it applies to a process, you can unpoison the PGD. I even had code
to do this in an earlier version of the (whole system) runtime PTI
on/off stuff.

Why are you even posting half-baked hacks like this now? Is there
something super-pressing about this set that we need to lock in a new
ABI now?

2018-01-08 17:04:06

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task

On Mon, Jan 08, 2018 at 05:57:11PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
>
> > This flag indicates that the task will not use isolated page tables.
> >
> > Signed-off-by: Willy Tarreau <[email protected]>
> > ---
> > arch/x86/include/asm/thread_info.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> > index 0022333..2f92cf1 100644
> > --- a/arch/x86/include/asm/thread_info.h
> > +++ b/arch/x86/include/asm/thread_info.h
> > @@ -126,6 +126,14 @@ struct thread_info {
> > #define _TIF_X32 (1 << TIF_X32)
> > #define _TIF_FSCHECK (1 << TIF_FSCHECK)
> >
> > +/* The following flags only exist on x86-64. We can't use the shift anymore
>
> Please do not use this horrible comment syle

You mean, the fact that there is no '/*' alone on the first line or
anything else ?

> And what's wrong with (1UL << 32)?

It fails when inherited in assembly parts. Initially the test was based
on "testq $(_TIF_NOPTI), reg" and 1UL doesn't parse there, which is why
I had to abandon it. In the latest patch I dropped "testq" for "bt" so
I didn't need the mask anymore and it wouldn't be a problem... until
someone wants to use it again.


> > + * due to C using signed ints by default and asm using unsigned longs.
> > + */
> > +#ifdef CONFIG_X86_64
> > +# define TIF_NOPTI 32 /* disable PTI for this task */
>
> No tail comments please.

OK but I did exactly like is done for all other flags above :-/

Thanks,
Willy

2018-01-08 17:05:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI


(Expanded the Cc: list.)

* Willy Tarreau <[email protected]> wrote:

> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/include/uapi/asm/prctl.h | 3 +++
> arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
> #define ARCH_GET_CPUID 0x1011
> #define ARCH_SET_CPUID 0x1012
>
> +#define ARCH_GET_NOPTI 0x1021
> +#define ARCH_SET_NOPTI 0x1022
> +
> #define ARCH_MAP_VDSO_X32 0x2001
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> ret = put_user(base, (unsigned long __user *)arg2);
> break;
> }
> + case ARCH_GET_NOPTI: {
> + unsigned long flag;
> +
> + printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> + flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> + ret = put_user(flag, (unsigned long __user *)arg2);
> + break;
> + }
> +
> + case ARCH_SET_NOPTI:
> + if (!capable(CAP_SYS_RAWIO))
> + return -EPERM;
> +
> + printk(KERN_DEBUG "set1: task=%p ti=%p fl=%16lx doit=%d arg2=%ld\n", task, task_thread_info(task), task_thread_info(task)->flags, doit, arg2);
> +
> + if (doit) {
> + if (arg2)
> + task_thread_info(task)->flags |= _TIF_NOPTI;
> + else
> + task_thread_info(task)->flags &= ~_TIF_NOPTI;
> +
> + printk(KERN_DEBUG "set2: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> + }
> + break;

Btw., we could enforce the CAP_SYS_RAWIO permission check only if it's _clearing_
the PTI flag.

I.e. this would allow apps and runtime environments to opt into PTI, without
having to rely on external security frameworks getting it right.

Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked
as 'Meltdown safe': should an explicit request to turn on PTI be honored by the
kernel? Should that be some sort of separate 'force PTI on' attribute?

Thanks,

Ingo

2018-01-08 17:05:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.
>
> Note: it isn't necessarily the best approach, but one way or another
> if we want to be able to return to userspace from the kernel,
> we'll have to have this executable anyway. Another approach
> might consist in using another pgd for userland+kernel but
> the current core really looks like an extra careful measure
> to catch early bugs if any.

I surely want to keep that as a safety measure. The entry code is simple to
get wrong and running with the wrong pagetables by a silly mistake and
thereby undoing the protection is surely not what we want.

Need to find a free time slot to think about that.

Thanks,

tglx

2018-01-08 17:06:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

On Mon, Jan 08, 2018 at 08:59:54AM -0800, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > I could experiment a bit with the possibility to enable/disable PTI per
> > task. Please keep in mind that it's not my area of experitise at all, but
> > doing so I could recover the initial performance without disabling PTI on
> > the whole system.
>
> This cc list is way too small. Please at *least* include me and Andy on
> this kind of stuff.

You're welcome Dave. In fact you were the two I initially copied from
another thread and seeing the huge amount of CCs I preferred to stick
to recent discussions and cc x86 thinking all people involved were there.

Feel free to suggest a wider list. Maybe all the CCs of the latest PTI
patch ?

Thanks,
Willy

2018-01-08 17:10:15

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 06:02:35PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> >
> > Signed-off-by: Willy Tarreau <[email protected]>
> > ---
> > arch/x86/include/uapi/asm/prctl.h | 3 +++
> > arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
> > 2 files changed, 27 insertions(+)
> >
> > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9..1f1b5bc 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -10,6 +10,9 @@
> > #define ARCH_GET_CPUID 0x1011
> > #define ARCH_SET_CPUID 0x1012
> >
> > +#define ARCH_GET_NOPTI 0x1021
> > +#define ARCH_SET_NOPTI 0x1022
> > +
> > #define ARCH_MAP_VDSO_X32 0x2001
> > #define ARCH_MAP_VDSO_32 0x2002
> > #define ARCH_MAP_VDSO_64 0x2003
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index c754662..1686d3d 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> > ret = put_user(base, (unsigned long __user *)arg2);
> > break;
> > }
> > + case ARCH_GET_NOPTI: {
> > + unsigned long flag;
> > +
> > + printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> > + flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> > + ret = put_user(flag, (unsigned long __user *)arg2);
> > + break;
>
> Per task is really an odd choice. That should be per process I think, but
> that of course needs synchronization of some form. Aside of that we need to
> think about fork().

I also wondered how to do it and had no idea, but I wanted to start with
something, also keeping in mind that I didn't want to risk losing in
extra checks what we managed to previously save. I agree that it's not
the best we can have. That said, other features seem to work like this,
like TIF_NOTSC and TIF_NOCPUID, so it didn't look too odd at first glance.

Willy

2018-01-08 17:11:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI


* Willy Tarreau <[email protected]> wrote:

> If a task has the TIF_NOPTI flag set, it doesn't want to experience
> page table isolation. In this case, returns from kernel to user will
> not switch the CR3, leaving it to the kernel one which already maps
> both user and kernel pages. Upon entry in the kernel, we can't check
> this flag so we simply check if CR3 was pointing to the kernel's PGD,
> indicating an earlier absence of switch, and in this case we don't
> change it.
>
> Thanks to these changes, haproxy running under KVM went back from
> 12400 conn/s to 21000 once loaded after calling prctl().
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)

Just a few nits:

>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 45a63e0..054b8b7 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -1,5 +1,6 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <linux/jump_label.h>
> +#include <asm/thread_info.h>
> #include <asm/unwind_hints.h>
> #include <asm/cpufeatures.h>
> #include <asm/page_types.h>
> @@ -214,6 +215,11 @@
> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> mov %cr3, \scratch_reg
> +
> + /* if we're already on the kernel PGD, we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Lend_\@
> +
> ADJUST_KERNEL_CR3 \scratch_reg
> mov \scratch_reg, %cr3
> .Lend_\@:
> @@ -224,6 +230,12 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* "NOPTI" taskflag avoids the switch */
> + movq PER_CPU_VAR(current_task), \scratch_reg
> + btq $TIF_NOPTI, TASK_TI_flags(\scratch_reg)
> + jc .Lend_\@

s/"NOPTI" taskflag
/The "NOPTI" task flag

> +
> mov %cr3, \scratch_reg
>
> ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> @@ -262,6 +274,13 @@
> ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
> movq %cr3, \scratch_reg
> movq \scratch_reg, \save_reg
> +
> + /* if we're already on the kernel PGD, we don't switch,
> + * we just save the current cr3.
> + */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Ldone_\@

Proper kernel comment style please. Also:

s/cr3/CR3

> +
> /*
> * Is the "switch mask" all zero? That means that both of
> * these are zero:
> @@ -284,6 +303,10 @@
> .macro RESTORE_CR3 scratch_reg:req save_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> + /* if we saved a kernel context, we didn't switch so we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
> + jz .Lend_\@

Maybe clarify this a bit? How about:

/*
* If we saved a kernel context on entry, we didn't switch the CR3,
* so we don't need to restore it on the way out either:
*/

or so?

Thanks,

Ingo

2018-01-08 17:13:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation


* Dave Hansen <[email protected]> wrote:

> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > I could experiment a bit with the possibility to enable/disable PTI per
> > task. Please keep in mind that it's not my area of experitise at all, but
> > doing so I could recover the initial performance without disabling PTI on
> > the whole system.
>
> This cc list is way too small. Please at *least* include me and Andy on
> this kind of stuff.

Yeah. The full x86-PTI Cc list that I typically add to new PTI commits is:

Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>

Thanks,

Ingo

2018-01-08 17:14:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task


* Willy Tarreau <[email protected]> wrote:

> On Mon, Jan 08, 2018 at 05:57:11PM +0100, Thomas Gleixner wrote:
> > On Mon, 8 Jan 2018, Willy Tarreau wrote:
> >
> > > This flag indicates that the task will not use isolated page tables.
> > >
> > > Signed-off-by: Willy Tarreau <[email protected]>
> > > ---
> > > arch/x86/include/asm/thread_info.h | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> > > index 0022333..2f92cf1 100644
> > > --- a/arch/x86/include/asm/thread_info.h
> > > +++ b/arch/x86/include/asm/thread_info.h
> > > @@ -126,6 +126,14 @@ struct thread_info {
> > > #define _TIF_X32 (1 << TIF_X32)
> > > #define _TIF_FSCHECK (1 << TIF_FSCHECK)
> > >
> > > +/* The following flags only exist on x86-64. We can't use the shift anymore
> >
> > Please do not use this horrible comment syle
>
> You mean, the fact that there is no '/*' alone on the first line or
> anything else ?

please use the customary (multi-line) comment style:

/*
* Comment .....
* ...... goes here.
*/

specified in Documentation/CodingStyle.

Thanks,

Ingo

2018-01-08 17:17:29

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

[ expanded the Cc list a bit ]

On Mon, Jan 08, 2018 at 09:03:36AM -0800, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > Since we're going to keep running on the same PGD when returning to
> > userspace for certain performance-critical tasks, we'll need the user
> > pages to be executable. So this code disables the extra protection
> > that was added consisting in marking user pages _PAGE_NX so that this
> > pgd remains usable for userspace.
> >
> > Note: it isn't necessarily the best approach, but one way or another
> > if we want to be able to return to userspace from the kernel,
> > we'll have to have this executable anyway. Another approach
> > might consist in using another pgd for userland+kernel but
> > the current core really looks like an extra careful measure
> > to catch early bugs if any.
>
> I don't like this.

This is the purpose of the review.

> I think the prctl() should apply to an entire process, not to a thread.

As I mentionned in another mail, I didn't know how to do it, even less
how to do it fast enough so that we didn't add more cycles to the syscall
code.

> If it applies to a process, you can unpoison the PGD. I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
>
> Why are you even posting half-baked hacks like this now? Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?

No need to lock in or whatever. It's just that a number of us simply
cannot use the current protection due to the huge cost it comes with
for their specific workload, and having to choose between performance
or protection remains a problem. Having a bit more available time and
being directly concerned by this problem I tried to propose something
to 1) see if there was any hope and 2) possibly help things move forward
in this direction. The patches are marked RFC, they're for discussing,
not for merging.

Willy

2018-01-08 17:17:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

On 01/08/2018 09:06 AM, Willy Tarreau wrote:
> On Mon, Jan 08, 2018 at 08:59:54AM -0800, Dave Hansen wrote:
>> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>>> I could experiment a bit with the possibility to enable/disable PTI per
>>> task. Please keep in mind that it's not my area of experitise at all, but
>>> doing so I could recover the initial performance without disabling PTI on
>>> the whole system.
>> This cc list is way too small. Please at *least* include me and Andy on
>> this kind of stuff.
> You're welcome Dave. In fact you were the two I initially copied from
> another thread and seeing the huge amount of CCs I preferred to stick
> to recent discussions and cc x86 thinking all people involved were there.

x86@ is only the maintainers. It isn't a general mailing list.

> Feel free to suggest a wider list. Maybe all the CCs of the latest PTI
> patch ?

get_maintainer.pl is your friend. For patch 4/4 it gives this:

> Thomas Gleixner <[email protected]> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),commit_signer:6/8=75%,authored:1/8=12%)
> Ingo Molnar <[email protected]> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),commit_signer:7/8=88%)
> "H. Peter Anvin" <[email protected]> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> [email protected] (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Borislav Petkov <[email protected]> (commit_signer:2/8=25%)
> Andy Lutomirski <[email protected]> (commit_signer:2/8=25%,authored:2/8=25%,added_lines:21/187=11%,removed_lines:52/67=78%)
> Peter Zijlstra <[email protected]> (commit_signer:2/8=25%,authored:2/8=25%,added_lines:87/187=47%,removed_lines:15/67=22%)
> Dave Hansen <[email protected]> (authored:1/8=12%,added_lines:66/187=35%)
> Josh Poimboeuf <[email protected]> (authored:1/8=12%)
> [email protected] (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))

Which is shockingly close to what Ingo suggested.

2018-01-08 17:18:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI


* Thomas Gleixner <[email protected]> wrote:

> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> >
> > Signed-off-by: Willy Tarreau <[email protected]>
> > ---
> > arch/x86/include/uapi/asm/prctl.h | 3 +++
> > arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
> > 2 files changed, 27 insertions(+)
> >
> > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9..1f1b5bc 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -10,6 +10,9 @@
> > #define ARCH_GET_CPUID 0x1011
> > #define ARCH_SET_CPUID 0x1012
> >
> > +#define ARCH_GET_NOPTI 0x1021
> > +#define ARCH_SET_NOPTI 0x1022
> > +
> > #define ARCH_MAP_VDSO_X32 0x2001
> > #define ARCH_MAP_VDSO_32 0x2002
> > #define ARCH_MAP_VDSO_64 0x2003
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index c754662..1686d3d 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> > ret = put_user(base, (unsigned long __user *)arg2);
> > break;
> > }
> > + case ARCH_GET_NOPTI: {
> > + unsigned long flag;
> > +
> > + printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> > + flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> > + ret = put_user(flag, (unsigned long __user *)arg2);
> > + break;
>
> Per task is really an odd choice. That should be per process I think, but
> that of course needs synchronization of some form. Aside of that we need to
> think about fork().

So per task (thread) is the most natural approach to low level asm flaggery.

Making it per thread also makes some sense conceptually: in a complex
multi-threaded runtime implementation some threads might never execute
'untrusted' code, some might. No need to penalize the 'server' threads.

Not sure we want that complexity though, and while it _should_ work I think,
mostly, there might be some unexpected implications.

Thanks,

Ingo

2018-01-08 17:19:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 06:05:31PM +0100, Ingo Molnar wrote:
> Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked
> as 'Meltdown safe': should an explicit request to turn on PTI be honored by the
> kernel? Should that be some sort of separate 'force PTI on' attribute?

AMD should not have FEATURE_PTI enabled, and thus not end up in any code
that cares about TIF_NOPTI.

2018-01-08 17:20:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

Please cc Andy on this stuff. I can't imagine patching entry_64.S at
this point without cc'ing him. *Surely* you didn't even bother to run
get_maintainer.pl on this.

> @@ -214,6 +215,11 @@
> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> mov %cr3, \scratch_reg
> +
> + /* if we're already on the kernel PGD, we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Lend_\@
> +
> ADJUST_KERNEL_CR3 \scratch_reg
> mov \scratch_reg, %cr3
> .Lend_\@:

This is an optimization that we can do generally without your feature.
Actually, it would be a welcome bit of benchmarking if you could see if
just this hunk helps your workload.

You touched on it in the description, but this is a *very* clever way to
do what you need without needing to look at the task flag at
user->kernel entry which also happens to be a place you don't have
task_struct mapped. It *greatly* simplifies what this would have to do
otherwise.

That needs calling out specifically though.

2018-01-08 17:21:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.


* Dave Hansen <[email protected]> wrote:

> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > Since we're going to keep running on the same PGD when returning to
> > userspace for certain performance-critical tasks, we'll need the user
> > pages to be executable. So this code disables the extra protection
> > that was added consisting in marking user pages _PAGE_NX so that this
> > pgd remains usable for userspace.
> >
> > Note: it isn't necessarily the best approach, but one way or another
> > if we want to be able to return to userspace from the kernel,
> > we'll have to have this executable anyway. Another approach
> > might consist in using another pgd for userland+kernel but
> > the current core really looks like an extra careful measure
> > to catch early bugs if any.
>
> I don't like this.
>
> I think the prctl() should apply to an entire process, not to a thread.
> If it applies to a process, you can unpoison the PGD. I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
>
> Why are you even posting half-baked hacks like this now? Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?

Arguably it was posted as an RFC patch-set, to get feedback early on.

The motivation is clear enough from the announcement I think: to speed up the
haproxy performance almost two-fold, without sacrificing the overall security
given by PTI against the Meltdown attack. haproxy does not require PTI, as it
never executes untrusted code.

Thanks,

Ingo

2018-01-08 17:23:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On 01/08/2018 09:17 AM, Willy Tarreau wrote:
>> I think the prctl() should apply to an entire process, not to a thread.
>
> As I mentionned in another mail, I didn't know how to do it, even less
> how to do it fast enough so that we didn't add more cycles to the syscall
> code.

You can _implement_ it with a task thread if you want. Just spray it
across all threads at the prctl()-time instead of a single thread.
It'll take a wee bit of locking.

I just don't think the API should apply to a single thread.

2018-01-08 17:26:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, 8 Jan 2018, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
> > Per task is really an odd choice. That should be per process I think, but
> > that of course needs synchronization of some form. Aside of that we need to
> > think about fork().
>
> So per task (thread) is the most natural approach to low level asm flaggery.

Well, yes and no. PTI is a property of the mm/pgdir and that's process
wide.

> Making it per thread also makes some sense conceptually: in a complex
> multi-threaded runtime implementation some threads might never execute
> 'untrusted' code, some might. No need to penalize the 'server' threads.

If one thread runs untrusted code then your 'trusted' thread is not longer
trusted either as they share everything.

Thanks,

tglx

2018-01-08 17:26:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI


* Peter Zijlstra <[email protected]> wrote:

> On Mon, Jan 08, 2018 at 06:05:31PM +0100, Ingo Molnar wrote:
> > Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked
> > as 'Meltdown safe': should an explicit request to turn on PTI be honored by the
> > kernel? Should that be some sort of separate 'force PTI on' attribute?
>
> AMD should not have FEATURE_PTI enabled, and thus not end up in any code
> that cares about TIF_NOPTI.

I know, this is the status quo.

Nevertheless:

- if someone disbelieves AMD's claims and wants to force-enable it, should it be
possible without patching the kernel?

- or if someone wants to test it on AMD to increase test coverage. pti=on will
already be force-enable it on AMD CPUs.

Likewise, there's the counter part on the app level PTI disabling/enabling
ABI functionality as well:

- should there be a way for sysadmins to force PTI enabled, even on apps that
want to turn it off?

- should there be a way for sysadmins to force PTI disabled, even for apps that
want to turn it on?

If we decide that we want to allow fine-grained, per app control of PTI, then all
of these look valid scenarios to me.

Thanks,

Ingo

2018-01-08 17:28:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On 01/08/2018 09:05 AM, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
>> Since we're going to keep running on the same PGD when returning to
>> userspace for certain performance-critical tasks, we'll need the user
>> pages to be executable. So this code disables the extra protection
>> that was added consisting in marking user pages _PAGE_NX so that this
>> pgd remains usable for userspace.
>>
>> Note: it isn't necessarily the best approach, but one way or another
>> if we want to be able to return to userspace from the kernel,
>> we'll have to have this executable anyway. Another approach
>> might consist in using another pgd for userland+kernel but
>> the current core really looks like an extra careful measure
>> to catch early bugs if any.
>
> I surely want to keep that as a safety measure. The entry code is simple to
> get wrong and running with the wrong pagetables by a silly mistake and
> thereby undoing the protection is surely not what we want.
>
> Need to find a free time slot to think about that.

This does get immensely easier if we choose a mode at exec() (or fork()
even) and never change it. The prctl() _could_ just be a flag to tell
what your children should do.

2018-01-08 17:30:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On Mon, Jan 08, 2018 at 09:23:49AM -0800, Dave Hansen wrote:
> On 01/08/2018 09:17 AM, Willy Tarreau wrote:
> >> I think the prctl() should apply to an entire process, not to a thread.
> >
> > As I mentionned in another mail, I didn't know how to do it, even less
> > how to do it fast enough so that we didn't add more cycles to the syscall
> > code.
>
> You can _implement_ it with a task thread if you want. Just spray it
> across all threads at the prctl()-time instead of a single thread.
> It'll take a wee bit of locking.
>
> I just don't think the API should apply to a single thread.

It is surprisingly hard to find all tasks that share an mm. Finding all
threads in a threadgroup is easy, but we have CLONE_THREAD and CLONE_VM
as separate bits.

In any case, aside from that, setting this remotely is indeed
'intersting'.

2018-01-08 17:46:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI


* Thomas Gleixner <[email protected]> wrote:

> On Mon, 8 Jan 2018, Ingo Molnar wrote:
> > * Thomas Gleixner <[email protected]> wrote:
> > > Per task is really an odd choice. That should be per process I think, but
> > > that of course needs synchronization of some form. Aside of that we need to
> > > think about fork().
> >
> > So per task (thread) is the most natural approach to low level asm flaggery.
>
> Well, yes and no. PTI is a property of the mm/pgdir and that's process
> wide.

Yes and no: the contents of CR3 is a property of the _thread_ (CPU), so we do have
the implementational degree of freedom of executing user-space with either the
user+kernel PGD or the user-only PGD.

The question I'm asking is whether it makes sense to offer this - the
_possibility_ is clearly there unless I'm missing something.

> > Making it per thread also makes some sense conceptually: in a complex
> > multi-threaded runtime implementation some threads might never execute
> > 'untrusted' code, some might. No need to penalize the 'server' threads.
>
> If one thread runs untrusted code then your 'trusted' thread is not longer
> trusted either as they share everything.

I think you are missing the wider context here: a number of runtime environments
(browsers, JIT compilers, etc.) are able to run "untrusted code" while still
applying heavy sandboxing constraints.

In some of those runtime models it's very much possible to have untrusted code in
the same address space as trusted code. BPF is an example for such a runtime
model: we can run user-defined BPF JIT-ed code in kernel context.

Thanks,

Ingo

2018-01-08 17:49:32

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On Mon, Jan 08, 2018 at 06:30:32PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 09:23:49AM -0800, Dave Hansen wrote:
> > On 01/08/2018 09:17 AM, Willy Tarreau wrote:
> > >> I think the prctl() should apply to an entire process, not to a thread.
> > >
> > > As I mentionned in another mail, I didn't know how to do it, even less
> > > how to do it fast enough so that we didn't add more cycles to the syscall
> > > code.
> >
> > You can _implement_ it with a task thread if you want. Just spray it
> > across all threads at the prctl()-time instead of a single thread.
> > It'll take a wee bit of locking.
> >
> > I just don't think the API should apply to a single thread.
>
> It is surprisingly hard to find all tasks that share an mm. Finding all
> threads in a threadgroup is easy, but we have CLONE_THREAD and CLONE_VM
> as separate bits.
>
> In any case, aside from that, setting this remotely is indeed
> 'intersting'.

Then couldn't we instead detect that there's more than one thread in
the process and refuse to apply prctl() to prevent the behaviour from
becoming inconsistent ? This would seem reasonable after all, we want
to do this very early upon startup, it probably doesn't make sense to
change one's mind after threads have been created (or maybe only to
re-enable protection on some of them ?).

Willy

2018-01-08 17:50:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.


* Dave Hansen <[email protected]> wrote:

> On 01/08/2018 09:05 AM, Thomas Gleixner wrote:
> > On Mon, 8 Jan 2018, Willy Tarreau wrote:
> >> Since we're going to keep running on the same PGD when returning to
> >> userspace for certain performance-critical tasks, we'll need the user
> >> pages to be executable. So this code disables the extra protection
> >> that was added consisting in marking user pages _PAGE_NX so that this
> >> pgd remains usable for userspace.
> >>
> >> Note: it isn't necessarily the best approach, but one way or another
> >> if we want to be able to return to userspace from the kernel,
> >> we'll have to have this executable anyway. Another approach
> >> might consist in using another pgd for userland+kernel but
> >> the current core really looks like an extra careful measure
> >> to catch early bugs if any.
> >
> > I surely want to keep that as a safety measure. The entry code is simple to
> > get wrong and running with the wrong pagetables by a silly mistake and
> > thereby undoing the protection is surely not what we want.
> >
> > Need to find a free time slot to think about that.
>
> This does get immensely easier if we choose a mode at exec() (or fork()
> even) and never change it. The prctl() _could_ just be a flag to tell
> what your children should do.

Switching PTI on/off for a whole process would be nightmarish.

The simplest model is indeed child inheritance tree propagation - plus perhaps the
ability for a thread to change its *own* PTI status, which obviously doesn't
create any deep "process lookup" or cross-CPU complications.

( Note that here I only mean "simple to implement" - we might decide to not offer
the ABI. )

Thanks,

Ingo

2018-01-08 17:51:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/include/uapi/asm/prctl.h | 3 +++
> arch/x86/kernel/process_64.c | 24 ++++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
> #define ARCH_GET_CPUID 0x1011
> #define ARCH_SET_CPUID 0x1012
>
> +#define ARCH_GET_NOPTI 0x1021
> +#define ARCH_SET_NOPTI 0x1022
> +
> #define ARCH_MAP_VDSO_X32 0x2001
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> ret = put_user(base, (unsigned long __user *)arg2);
> break;
> }
> + case ARCH_GET_NOPTI: {

Please add a CONFIG_ item for this so that people can disable those.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-01-08 17:54:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <[email protected]> wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.

So I really think that this needs to be done up-front to avoid a lot
of complexity. And per mm.

If the process is already threaded (so the mm has multiple users),
it's too late to start playing games with PTI.

In fact, maybe the whole thing needs to be controlled before "exec"
happens, so that we have the knowledge as we build up the mm, rather
than being "runtime" dynamic at all.

But in no case should you even try to handle the multi-threaded case -
just error out for trying to change the PTI setting.

So make the thing per-mm, and then at task switch time as you switch
mms, you set the bit in a percpu variable for testing at kernel entry.

Linus

2018-01-08 18:12:51

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

On Mon, Jan 08, 2018 at 09:20:08AM -0800, Dave Hansen wrote:
> > @@ -214,6 +215,11 @@
> > .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> > mov %cr3, \scratch_reg
> > +
> > + /* if we're already on the kernel PGD, we don't switch */
> > + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> > + jz .Lend_\@
> > +
> > ADJUST_KERNEL_CR3 \scratch_reg
> > mov \scratch_reg, %cr3
> > .Lend_\@:
>
> This is an optimization that we can do generally without your feature.
> Actually, it would be a welcome bit of benchmarking if you could see if
> just this hunk helps your workload.

Well, this alone provides no gains at all, as I can see when running
haproxy without the nopti wrapper. And it makes sense, as there is no
reason without this patch series for coming from userspace with CR3
pointing to the kernel's PGD.

> You touched on it in the description, but this is a *very* clever way to
> do what you need without needing to look at the task flag at
> user->kernel entry which also happens to be a place you don't have
> task_struct mapped. It *greatly* simplifies what this would have to do
> otherwise.

In fact when I saw my kernel fail to boot when trying to read the task
flag, I quickly realized I needed something else to check before
entering the kernel ;-)

Willy

2018-01-08 18:22:43

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <[email protected]> wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
>
> So I really think that this needs to be done up-front to avoid a lot
> of complexity. And per mm.
>
> If the process is already threaded (so the mm has multiple users),
> it's too late to start playing games with PTI.
>
> In fact, maybe the whole thing needs to be controlled before "exec"
> happens, so that we have the knowledge as we build up the mm, rather
> than being "runtime" dynamic at all.

In fact I initially wanted to start with a prctl() flag that acts upon
exec because it looked simpler. But then I realized that this would
always require a wrapper and that it's not necessarily more convenient.
It even makes permission checks more complicated from an administration
perspective, and I'd hate to see such a wrapper ending up setuid...

> But in no case should you even try to handle the multi-threaded case -
> just error out for trying to change the PTI setting.

I totally agree here. Like Ingo says, there *may* be useful cases for
this but I'm not sure we're prepared to address a new class of bugs
caused by this yet. And now I can get rid of GET_NOPTI, it was just for
debugging.

> So make the thing per-mm, and then at task switch time as you switch
> mms, you set the bit in a percpu variable for testing at kernel entry.

I'll see how to do that, this is not yet 100% clear to me, I'm still
discovering (this code has immensely changed since last time I *really*
dug into it). So I suspect I'll have to set this variable in
__switch_to() based on this other MM flag.

I'll study this.

Thanks,
Willy

2018-01-08 18:25:45

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

> The simplest model is indeed child inheritance tree propagation - plus perhaps the
> ability for a thread to change its *own* PTI status, which obviously doesn't
> create any deep "process lookup" or cross-CPU complications.
>
> ( Note that here I only mean "simple to implement" - we might decide to not offer
> the ABI. )

I still think cgroups are the best model for this. In particular it
naturally fits things like containers, or network facing apps that fork
helpers.

Secondly when you are looking at barrier semantics between client/client
a cgroup is much more natural as a way to group processes together who
don't need to be protected from each other as they are trusting each
other. (Or we could just harcode this based upon ptraceability ?)

Alan

2018-01-08 18:35:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.


* Alan Cox <[email protected]> wrote:

> > The simplest model is indeed child inheritance tree propagation - plus perhaps the
> > ability for a thread to change its *own* PTI status, which obviously doesn't
> > create any deep "process lookup" or cross-CPU complications.
> >
> > ( Note that here I only mean "simple to implement" - we might decide to not offer
> > the ABI. )
>
> I still think cgroups are the best model for this. In particular it
> naturally fits things like containers, or network facing apps that fork
> helpers.

I think the suggested exec() time inheritance model would naturally also cover
cgroups (without tying the ABI to cgroups) - as containers typically get inherited
from a single binary. A bit like how various personality bits get propagated.

Thanks,

Ingo

2018-01-08 18:35:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On Mon, Jan 8, 2018 at 10:25 AM, Alan Cox <[email protected]> wrote:
>
> I still think cgroups are the best model for this. In particular it
> naturally fits things like containers, or network facing apps that fork
> helpers.
>
> Secondly when you are looking at barrier semantics between client/client
> a cgroup is much more natural as a way to group processes together who
> don't need to be protected from each other as they are trusting each
> other. (Or we could just harcode this based upon ptraceability ?)

I agree that cgroups would be fairly natural, but I do think we could
look at things like simply trusted users too ("running as root? Yeah,
we're not going to try to protect the kernel from you") and/or trusted
binaries.

But all of those things are likely things that can easily be
determined at execve() time.

Linus

2018-01-08 18:44:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On 01/08/2018 09:50 AM, Ingo Molnar wrote:
>> This does get immensely easier if we choose a mode at exec() (or fork()
>> even) and never change it. The prctl() _could_ just be a flag to tell
>> what your children should do.
> Switching PTI on/off for a whole process would be nightmarish.

Yeah, totally. I meant "future children" aka. things *after* fork().

2018-01-08 20:36:28

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

[ increased the CC list this time ]

On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <[email protected]> wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
>
> So I really think that this needs to be done up-front to avoid a lot
> of complexity. And per mm.
>
> If the process is already threaded (so the mm has multiple users),
> it's too late to start playing games with PTI.
>
> In fact, maybe the whole thing needs to be controlled before "exec"
> happens, so that we have the knowledge as we build up the mm, rather
> than being "runtime" dynamic at all.
>
> But in no case should you even try to handle the multi-threaded case -
> just error out for trying to change the PTI setting.
>
> So make the thing per-mm, and then at task switch time as you switch
> mms, you set the bit in a percpu variable for testing at kernel entry.

So I did something like this (will have to remerge the awful patches
and remove the printks before resending). In short, now here's what it
does :

- added a new x86 flag : "mm->context.pti_disable", depends on
CONFIG_PAGE_TABLE_ISOLATION.
- the new prctl() also depends on this config setting.
- prctl() refuses any change if mm->mm_users > 1
- prctl() refuses to set nopti if !CAP_SYS_RAWIO, but clearing it is
fine without (Ingo's idea)
- __switch_to() sets a new "pti_disable" per-cpu variable to the copy
of mm->context.pti_disable
- entry code in SWITCH_TO_USER_CR3_NOSTACK now checks
PER_CPU_VAR(pti_disable)

First tests show that it still works. One main difference I immediately
observed is that it stops at execve(). This means that it will not be
possible to implement a wrapper to enable the bypass, but on the other
hand it guarantees that any execve() even from a so called "trusted"
process doesn't accidently expose a victim program. So there are pros
and cons here.

I'm personally fine with both the wrapper and the code changes. But I'm
in the easiest situation, working with opensource code that I can easily
update to accommodate the changes. Other users might have a different
opinion here.

Another option could be to have a per-task (and really task here) flag
is only passed to execve() to mention that the per-mm pti_disable has
to be set in the new mm (and which would clear the task flag). But this
mechanism would always require a wrapper. Or we could have both.

I'll clean up my patches tomorrow morning and will post an update. Ideas
and objections welcome in the mean time ;-)

Cheers,
Willy

2018-01-08 20:49:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
>
> > So make the thing per-mm, and then at task switch time as you switch
> > mms, you set the bit in a percpu variable for testing at kernel entry.
>
> I'll see how to do that, this is not yet 100% clear to me, I'm still
> discovering (this code has immensely changed since last time I *really*
> dug into it). So I suspect I'll have to set this variable in
> __switch_to() based on this other MM flag.

The right thing is probably the cpu entry area because that's what you can
access before switching CR3 if PTI is enabled for the task

Thanks,

tglx

2018-01-08 21:04:04

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI

On Mon, Jan 08, 2018 at 09:49:23PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> >
> > > So make the thing per-mm, and then at task switch time as you switch
> > > mms, you set the bit in a percpu variable for testing at kernel entry.
> >
> > I'll see how to do that, this is not yet 100% clear to me, I'm still
> > discovering (this code has immensely changed since last time I *really*
> > dug into it). So I suspect I'll have to set this variable in
> > __switch_to() based on this other MM flag.
>
> The right thing is probably the cpu entry area because that's what you can
> access before switching CR3 if PTI is enabled for the task

In fact I didn't need any such thing, because when coming from user
space I can simply check CR3.12 : if 0, we had PTI disabled for the
task, otherwise it's enabled.

I'll repost the cleaned up series ASAP, unfortunately I need to go now.

Thanks,
Willy

2018-01-08 23:01:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> If a task has the TIF_NOPTI flag set, it doesn't want to experience
> page table isolation. In this case, returns from kernel to user will
> not switch the CR3, leaving it to the kernel one which already maps
> both user and kernel pages. Upon entry in the kernel, we can't check
> this flag so we simply check if CR3 was pointing to the kernel's PGD,
> indicating an earlier absence of switch, and in this case we don't
> change it.
>
> Thanks to these changes, haproxy running under KVM went back from
> 12400 conn/s to 21000 once loaded after calling prctl().
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 45a63e0..054b8b7 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -1,5 +1,6 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <linux/jump_label.h>
> +#include <asm/thread_info.h>
> #include <asm/unwind_hints.h>
> #include <asm/cpufeatures.h>
> #include <asm/page_types.h>
> @@ -214,6 +215,11 @@
> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> mov %cr3, \scratch_reg
> +
> + /* if we're already on the kernel PGD, we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Lend_\@
> +
> ADJUST_KERNEL_CR3 \scratch_reg
> mov \scratch_reg, %cr3
> .Lend_\@:
> @@ -224,6 +230,12 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* "NOPTI" taskflag avoids the switch */
> + movq PER_CPU_VAR(current_task), \scratch_reg
> + btq $TIF_NOPTI, TASK_TI_flags(\scratch_reg)
> + jc .Lend_\@
> +
> mov %cr3, \scratch_reg
>
> ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> @@ -262,6 +274,13 @@
> ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
> movq %cr3, \scratch_reg
> movq \scratch_reg, \save_reg
> +
> + /* if we're already on the kernel PGD, we don't switch,
> + * we just save the current cr3.
> + */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> + jz .Ldone_\@
> +
> /*
> * Is the "switch mask" all zero? That means that both of
> * these are zero:
> @@ -284,6 +303,10 @@
> .macro RESTORE_CR3 scratch_reg:req save_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> + /* if we saved a kernel context, we didn't switch so we don't switch */
> + testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
> + jz .Lend_\@
> +

I haven't looked that carefully, but this seems generally sane.

2018-01-08 23:05:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On 01/08/2018 09:03 AM, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>> Since we're going to keep running on the same PGD when returning to
>> userspace for certain performance-critical tasks, we'll need the user
>> pages to be executable. So this code disables the extra protection
>> that was added consisting in marking user pages _PAGE_NX so that this
>> pgd remains usable for userspace.
>>
>> Note: it isn't necessarily the best approach, but one way or another
>> if we want to be able to return to userspace from the kernel,
>> we'll have to have this executable anyway. Another approach
>> might consist in using another pgd for userland+kernel but
>> the current core really looks like an extra careful measure
>> to catch early bugs if any.
>
> I don't like this.
>
> I think the prctl() should apply to an entire process, not to a thread.
> If it applies to a process, you can unpoison the PGD. I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
>
> Why are you even posting half-baked hacks like this now? Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?
>

I vote per-thread.

Anyway, we can easily sync the NX-clearing: just catch the spurious page
fault and clear the bit. Avoiding infinite loops will need a bit of
thought, but it's surely doable.

Or we set a per-mm flag saying "no NX", then do synchronize_sched() or
similar if we were the first to set it (or take the pagetable lock),
then clear all the NX bits. Again, needs some care, but doable.

FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware,
which is fantastic for Spectre resistance and general hardening.
Turning it off totally defeats that, which hurts a bit.

Also, Kees should be CC'd here.

2018-01-08 23:09:03

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

On Mon, Jan 8, 2018 at 3:05 PM, Andy Lutomirski <[email protected]> wrote:
> On 01/08/2018 09:03 AM, Dave Hansen wrote:
>>
>> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>>>
>>> Since we're going to keep running on the same PGD when returning to
>>> userspace for certain performance-critical tasks, we'll need the user
>>> pages to be executable. So this code disables the extra protection
>>> that was added consisting in marking user pages _PAGE_NX so that this
>>> pgd remains usable for userspace.
>>>
>>> Note: it isn't necessarily the best approach, but one way or another
>>> if we want to be able to return to userspace from the kernel,
>>> we'll have to have this executable anyway. Another approach
>>> might consist in using another pgd for userland+kernel but
>>> the current core really looks like an extra careful measure
>>> to catch early bugs if any.
>>
>>
>> I don't like this.
>>
>> I think the prctl() should apply to an entire process, not to a thread.
>> If it applies to a process, you can unpoison the PGD. I even had code
>> to do this in an earlier version of the (whole system) runtime PTI
>> on/off stuff.
>>
>> Why are you even posting half-baked hacks like this now? Is there
>> something super-pressing about this set that we need to lock in a new
>> ABI now?
>>
>
> I vote per-thread.
>
> Anyway, we can easily sync the NX-clearing: just catch the spurious page
> fault and clear the bit. Avoiding infinite loops will need a bit of
> thought, but it's surely doable.
>
> Or we set a per-mm flag saying "no NX", then do synchronize_sched() or
> similar if we were the first to set it (or take the pagetable lock), then
> clear all the NX bits. Again, needs some care, but doable.
>
> FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware, which is
> fantastic for Spectre resistance and general hardening. Turning it off
> totally defeats that, which hurts a bit.
>
> Also, Kees should be CC'd here.

Please please keep the NX. As mentioned, this gets us emulated SMEP
for "free" with PTI, which Linux has needed for years.

-Kees

--
Kees Cook
Pixel Security

2018-01-09 04:22:17

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.

Hi Andy,

On Mon, Jan 08, 2018 at 03:05:48PM -0800, Andy Lutomirski wrote:
> On 01/08/2018 09:03 AM, Dave Hansen wrote:
> > On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> I vote per-thread.

The per-mm approach that Linus suggested doesn't look bad either and
makes quite some sense.

> Anyway, we can easily sync the NX-clearing: just catch the spurious page
> fault and clear the bit. Avoiding infinite loops will need a bit of
> thought, but it's surely doable.

That's an excellent idea, eventhough I have no idea how to implement it :-)

> Or we set a per-mm flag saying "no NX", then do synchronize_sched() or
> similar if we were the first to set it (or take the pagetable lock), then
> clear all the NX bits. Again, needs some care, but doable.
>
> FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware, which is
> fantastic for Spectre resistance and general hardening.

Yes I figured exactly this when I faced this protection!

> Turning it off totally defeats that, which hurts a bit.

I agree, that's why I'd like it to be conditional. Probably that with
your idea of catching the page fault and the per-mm flag it would work
quite well, but before being able to do this I still have a lot to
explore :-/

> Also, Kees should be CC'd here.

Yes I've added him and you (and a few others) in CC of all forthcoming
patches. Sorry for not adding you initially, I simply wanted to share
a quick experiment and initiate a discussion.

Willy

2018-01-09 15:32:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

Willy Tarreau <[email protected]> writes:

> Hi!
>
> I could experiment a bit with the possibility to enable/disable PTI per
> task. Please keep in mind that it's not my area of experitise at all, but
> doing so I could recover the initial performance without disabling PTI on
> the whole system.
>
> So what I did in this series consists in the following :
> - addition of a new per-task TIF_NOPTI flag. Please note that I'm not
> proud of the way I did it, as 32 flags were already taken. The flags
> are declared as "long" so there are 32 more flags available on x86_64
> but C and asm disagree on the type of 1<<32 so I had to declare the
> hex value by hand... By the way I even suspect that _TIF_FSCHECK is
> wrong once cast to a long, I think it causes sign extension into the
> 32 upper bits since it's supposed to be signed.
>
> - addition of a set of arch_prctl() calls (ARCH_GET_NOPTI and
> ARCH_SET_NOPTI), to check and change the activation of the
> protection. The change requires CAP_SYS_RAWIO and can be done in
> a wrapper (that's how I tested)
>
> - the user PGD was marked with _PAGE_NX to prevent an accidental leak
> of CR3 from not being detected. I obviously had to disable this since
> in this case we do want such a user task to run without switching the
> PGD. I think this could be performed per-task maybe. Another approach
> might consist in dealing with 3 PGDs and using a different one for
> unprotected tasks but that really starts to sound overkill.
>
> - upon return to userspace, I check if the task's flags contain the
> new TIF_NOPTI or not. If it does contain it, then we don't switch
> the CR3.
>
> - upon entry into the kernel from userspace, we can't access the task's
> flags but we can already check if CR3 points to the kernel or user PGD,
> and we refrain from switching if it's already the system one.
>
> By doing so I could recover the initial performance of haproxy in a VM,
> going from 12400 connections per second to 21000 once started with this
> trivial wrapper :
>
> #include <asm/prctl.h>
> #include <sys/prctl.h>
>
> #ifndef ARCH_SET_NOPTI
> #define ARCH_SET_NOPTI 0x1022
> #endif
>
> int main(int argc, char **argv)
> {
> arch_prctl(ARCH_SET_NOPTI, 1);
> argv++;
> return execvp(argv[0], argv);
> }
>
> I have not yet run it on real hardware. Before trying to go a bit further
> I'd like to know if such an approach is acceptable or if I'm doing anything
> stupid and looking in the wrong direction.

Before this goes much farther I want to point something out.

When I have kpti protecting me it is the applications with that connect
to the network I worry about. Until I get to a system with users that
don't trust each other local I don't have a reason to worry about these
attacks from local applications.

The dangerous scenario is someone exploting a buffer overflow, or
otherwise getting a network facing application to misbehave, and then
using these new attacks to assist in gaining privilege escalation.


Googling seems to indicate that there is about one issue a year found in
haproxy. So this is not an unrealistic concern for the case you
mention.


So unless I am seeing things wrong this is a patchset designed to drop
your defensense on the most vulnerable applications.


Disably protection on the most vunerable applications is not behavior
I would encourage. It seems better than disabling protection system
wide but only slightly. I definitely don't think this is something we
want applications disabling themselves.

Certainly this is something that should look at no-new-privs and if
no-new-privs is set not allow disabling this protection.

Eric




2018-01-09 16:02:28

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

Hi Eric,

On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
> The dangerous scenario is someone exploting a buffer overflow, or
> otherwise getting a network facing application to misbehave, and then
> using these new attacks to assist in gaining privilege escalation.

For most use cases sure. But for *some* use cases, if they can control
of the application, you've already lost everything you had. Private keys,
clear text traffic, etc. We're precisely talking about such applications
where the userspace is as much important as the kernel, and where there's
hardly anything left to lose once the application is cracked. However, a
significant performance drop on the application definitely is a problem,
first making it weaker when facing attacks, or even failing to deal with
traffic peaks.

> Googling seems to indicate that there is about one issue a year found in
> haproxy. So this is not an unrealistic concern for the case you
> mention.

I agree. But in practice, we had two exploitable bugs, one in 2002
(overflow in the logs), and one in 2014 requiring a purposely written
config which makes no pratical sense at all. Most other vulnerabilities
involve freezes, occasionally crashes, though that's even more rare.
And even with the two above, you just have one chance to try to exploit
it, if you get your pointer wrong, it dies and you have to wait for the
admin to restart it. In practice, seeing the process die is the worst
nightmare of admins as the service simply stops. I'm not saying we don't
want to defend them, we even chroot to an empty directory and drop
privileges to mitigate such a risk. But when the intruder is in the
process it's really too late.

> So unless I am seeing things wrong this is a patchset designed to drop
> your defensense on the most vulnerable applications.

In fact it can be seen very differently. By making it possible for exposed
but critical applications to share some risks with the rest of the system,
we also ensure they remain strong for their initial purpose and against
the most common types of attacks. And quite frankly we're not weakening
much given the risks already involved by the process itself.

What I'm describing represents a small category of processes in only
certain environments. Some database servers will have the same issue.
Imagine a Redis server for example, which normally is very fast and
easily saturates whatever network around it. Some DNS providers may
have the same problem when dealing with hundreds of thousands to
millions of UDP packets each second (not counting attacks).

All such services are critical in themselves, but the fact that we accept
to let them share the risks with the system doesn't mean they should be
running without the protections from the occasional operations guy just
allowed to connect there to verify if logs are full and to retrive stats.

> Disably protection on the most vunerable applications is not behavior
> I would encourage.

I'm not encouraging this behaviour either but right now the only option
for performance critical applications (even if they are vulnerable) is
to make the whole system vulnerable.

> It seems better than disabling protection system
> wide but only slightly. I definitely don't think this is something we
> want applications disabling themselves.

In fact that's what I liked with the wrapper approach, except that it
had the downside of being harder to manage in terms of administration
and we'd risk to see it used everywhere by default. The arch_prctl()
approach ensures that only applications where this is relevant can do
it. In the case of haproxy, I can trivially add a config option like
"disable-page-isolation" to let the admin enable it on purpose.

But I suspect there might be some performance critical applications that
cannot be patched, and that's where the wrapper could still provide some
value.

> Certainly this is something that should look at no-new-privs and if
> no-new-privs is set not allow disabling this protection.

I don't know what is "no-new-privs" and couldn't find info on it
unfortunately. Do you have a link please ?

Thanks!
Willy

2018-01-09 18:11:41

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

Is is possible to put per-task PTI control interface into cgroup or
other interfaces?  Enabling/disabling per-task PTI should be a decision
from the system administrator not the application itself.

On 2018/1/9 18:02, Willy Tarreau wrote:
> Hi Eric,
>
> On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
>> The dangerous scenario is someone exploting a buffer overflow, or
>> otherwise getting a network facing application to misbehave, and then
>> using these new attacks to assist in gaining privilege escalation.
> For most use cases sure. But for *some* use cases, if they can control
> of the application, you've already lost everything you had. Private keys,
> clear text traffic, etc. We're precisely talking about such applications
> where the userspace is as much important as the kernel, and where there's
> hardly anything left to lose once the application is cracked. However, a
> significant performance drop on the application definitely is a problem,
> first making it weaker when facing attacks, or even failing to deal with
> traffic peaks.
>
>> Googling seems to indicate that there is about one issue a year found in
>> haproxy. So this is not an unrealistic concern for the case you
>> mention.
> I agree. But in practice, we had two exploitable bugs, one in 2002
> (overflow in the logs), and one in 2014 requiring a purposely written
> config which makes no pratical sense at all. Most other vulnerabilities
> involve freezes, occasionally crashes, though that's even more rare.
> And even with the two above, you just have one chance to try to exploit
> it, if you get your pointer wrong, it dies and you have to wait for the
> admin to restart it. In practice, seeing the process die is the worst
> nightmare of admins as the service simply stops. I'm not saying we don't
> want to defend them, we even chroot to an empty directory and drop
> privileges to mitigate such a risk. But when the intruder is in the
> process it's really too late.
>
>> So unless I am seeing things wrong this is a patchset designed to drop
>> your defensense on the most vulnerable applications.
> In fact it can be seen very differently. By making it possible for exposed
> but critical applications to share some risks with the rest of the system,
> we also ensure they remain strong for their initial purpose and against
> the most common types of attacks. And quite frankly we're not weakening
> much given the risks already involved by the process itself.
>
> What I'm describing represents a small category of processes in only
> certain environments. Some database servers will have the same issue.
> Imagine a Redis server for example, which normally is very fast and
> easily saturates whatever network around it. Some DNS providers may
> have the same problem when dealing with hundreds of thousands to
> millions of UDP packets each second (not counting attacks).
>
> All such services are critical in themselves, but the fact that we accept
> to let them share the risks with the system doesn't mean they should be
> running without the protections from the occasional operations guy just
> allowed to connect there to verify if logs are full and to retrive stats.
>
>> Disably protection on the most vunerable applications is not behavior
>> I would encourage.
> I'm not encouraging this behaviour either but right now the only option
> for performance critical applications (even if they are vulnerable) is
> to make the whole system vulnerable.
>
>> It seems better than disabling protection system
>> wide but only slightly. I definitely don't think this is something we
>> want applications disabling themselves.
> In fact that's what I liked with the wrapper approach, except that it
> had the downside of being harder to manage in terms of administration
> and we'd risk to see it used everywhere by default. The arch_prctl()
> approach ensures that only applications where this is relevant can do
> it. In the case of haproxy, I can trivially add a config option like
> "disable-page-isolation" to let the admin enable it on purpose.
>
> But I suspect there might be some performance critical applications that
> cannot be patched, and that's where the wrapper could still provide some
> value.
>
>> Certainly this is something that should look at no-new-privs and if
>> no-new-privs is set not allow disabling this protection.
> I don't know what is "no-new-privs" and couldn't find info on it
> unfortunately. Do you have a link please ?
>
> Thanks!
> Willy

2018-01-09 21:08:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

Willy Tarreau <[email protected]> writes:

> Hi Eric,
>
> On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
>> The dangerous scenario is someone exploting a buffer overflow, or
>> otherwise getting a network facing application to misbehave, and then
>> using these new attacks to assist in gaining privilege escalation.
>
> For most use cases sure. But for *some* use cases, if they can control
> of the application, you've already lost everything you had. Private keys,
> clear text traffic, etc. We're precisely talking about such applications
> where the userspace is as much important as the kernel, and where there's
> hardly anything left to lose once the application is cracked. However, a
> significant performance drop on the application definitely is a problem,
> first making it weaker when facing attacks, or even failing to deal with
> traffic peaks.

>From reading the earlier emails it was not clear that all was lost if
they were compromomised. In that case this makes plenty of sense.


>> Googling seems to indicate that there is about one issue a year found in
>> haproxy. So this is not an unrealistic concern for the case you
>> mention.
>
> I agree. But in practice, we had two exploitable bugs, one in 2002
> (overflow in the logs), and one in 2014 requiring a purposely written
> config which makes no pratical sense at all. Most other vulnerabilities
> involve freezes, occasionally crashes, though that's even more rare.
> And even with the two above, you just have one chance to try to exploit
> it, if you get your pointer wrong, it dies and you have to wait for the
> admin to restart it. In practice, seeing the process die is the worst
> nightmare of admins as the service simply stops. I'm not saying we don't
> want to defend them, we even chroot to an empty directory and drop
> privileges to mitigate such a risk. But when the intruder is in the
> process it's really too late.
>
>> So unless I am seeing things wrong this is a patchset designed to drop
>> your defensense on the most vulnerable applications.
>
> In fact it can be seen very differently. By making it possible for exposed
> but critical applications to share some risks with the rest of the system,
> we also ensure they remain strong for their initial purpose and against
> the most common types of attacks. And quite frankly we're not weakening
> much given the risks already involved by the process itself.
>
> What I'm describing represents a small category of processes in only
> certain environments. Some database servers will have the same issue.
> Imagine a Redis server for example, which normally is very fast and
> easily saturates whatever network around it. Some DNS providers may
> have the same problem when dealing with hundreds of thousands to
> millions of UDP packets each second (not counting attacks).
>
> All such services are critical in themselves, but the fact that we accept
> to let them share the risks with the system doesn't mean they should be
> running without the protections from the occasional operations guy just
> allowed to connect there to verify if logs are full and to retrive
> stats.

Reasonable.

>> Disably protection on the most vunerable applications is not behavior
>> I would encourage.
>
> I'm not encouraging this behaviour either but right now the only option
> for performance critical applications (even if they are vulnerable) is
> to make the whole system vulnerable.
>
>> It seems better than disabling protection system
>> wide but only slightly. I definitely don't think this is something we
>> want applications disabling themselves.
>
> In fact that's what I liked with the wrapper approach, except that it
> had the downside of being harder to manage in terms of administration
> and we'd risk to see it used everywhere by default. The arch_prctl()
> approach ensures that only applications where this is relevant can do
> it. In the case of haproxy, I can trivially add a config option like
> "disable-page-isolation" to let the admin enable it on purpose.

How is that different from the option?

> But I suspect there might be some performance critical applications that
> cannot be patched, and that's where the wrapper could still provide some
> value.

I just don't want to encourage changning this option by default. As a
lot of applications get installed in home servers or other places where
they are not performance critical. At which point disabling the kpti
protection by default would be reducing the level of protection of
everything.

But ultimately I only brought this up so that people are thinking about
the other side of this. About how it will affect not the high
performance servers single function but how it will affect the millions
of little servers that do many things all from a single machine.

Certainly I would not want this enabled in a container or a virtual
private server. The capable(CAP_RAWIO) seems to handle that beautifully.

>> Certainly this is something that should look at no-new-privs and if
>> no-new-privs is set not allow disabling this protection.
>
> I don't know what is "no-new-privs" and couldn't find info on it
> unfortunately. Do you have a link please ?

Probably because I used dashes. The no new privs flag is documented
in:
Documentation/userspace-api/no_new_privs.rst

It is a sandboxing flag that guarantees a process can not gain
privileges after it has been set. You can search for PFA_NO_NEW_PRIVS
in sched.h if you want to see where it is defined.

Eric

2018-01-09 21:57:38

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] Per-task PTI activation

On Tue, Jan 09, 2018 at 03:07:07PM -0600, Eric W. Biederman wrote:
> > In fact that's what I liked with the wrapper approach, except that it
> > had the downside of being harder to manage in terms of administration
> > and we'd risk to see it used everywhere by default. The arch_prctl()
> > approach ensures that only applications where this is relevant can do
> > it. In the case of haproxy, I can trivially add a config option like
> > "disable-page-isolation" to let the admin enable it on purpose.
>
> How is that different from the option?

Not sure what is different from the option, so let me just enumerate the
two approaches I'm seeing :
- either a wrapper which work for the next execve() and not further.
The benefit is that applications do not need to be recompiled. The
problem is that it requires some changes to various places ranging
from init scripts to API wrappers and whatever to place this wrapper.
Also it can start to be perceived as the "wrapper that makes things
fast" and some admins might start to routinely use it like we use
taskset, and I'm not fond of this.

- the option I mentionned would be a configuration setting enabling
arch_prctl(). The option name should explicitly indicate that the
admin wants to screw up his system's security. For example,
"destroy-security-for-performance" is not the type of option you
leave enabled in your config templates. It *will* require to
recompile applications. I don't think it's a problem for the most
performance sensitive ones, but I may be biased by my own
experience.

- or a combination of the two if some admins need to support stuff
they can't rebuild (well LD_PRELOAD might be an option after all...)

> > But I suspect there might be some performance critical applications that
> > cannot be patched, and that's where the wrapper could still provide some
> > value.
>
> I just don't want to encourage changning this option by default.

That's the same for me. Even the prctl name should be scary enough. Boris
suggested tainting the kernel and that's a very good idea, it will even
ensure that the setting is not used by default in applications because
admins won't want to see their systems tainted for no reason.

> As a
> lot of applications get installed in home servers or other places where
> they are not performance critical. At which point disabling the kpti
> protection by default would be reducing the level of protection of
> everything.

Definitely. That's why I don't want to see the hard-coded prctl() either.

> But ultimately I only brought this up so that people are thinking about
> the other side of this. About how it will affect not the high
> performance servers single function but how it will affect the millions
> of little servers that do many things all from a single machine.

This is my concern as well, which is why I'm seeking the most balanced
approach I can think of.

> Certainly I would not want this enabled in a container or a virtual
> private server. The capable(CAP_RAWIO) seems to handle that beautifully.
>
> >> Certainly this is something that should look at no-new-privs and if
> >> no-new-privs is set not allow disabling this protection.
> >
> > I don't know what is "no-new-privs" and couldn't find info on it
> > unfortunately. Do you have a link please ?
>
> Probably because I used dashes. The no new privs flag is documented
> in:
> Documentation/userspace-api/no_new_privs.rst

Ah stupid me, I should have tried underscores as well. Thank you.

> It is a sandboxing flag that guarantees a process can not gain
> privileges after it has been set. You can search for PFA_NO_NEW_PRIVS
> in sched.h if you want to see where it is defined.

Oh that's very interesting, I wasn't aware of this! Definitely something
I need to set after dropping privileges in haproxy I guess ;-)

Thanks,
Willy