2018-07-20 16:23:39

by Joerg Roedel

Subject: [PATCH 0/3] PTI for x86-32 Fixes and Updates

Hi,

here are 3 patches which update the PTI-x86-32 patches recently merged
into the tip-tree. The patches are ordered by importance:

Patch 1: Very important, it fixes a vmalloc-fault in NMI context
when PTI is enabled. This is pretty unlikely to hit
when starting perf on an idle machine, which is why I
didn't find it earlier in my testing. I always started
'perf top' first :/ But when I start 'perf top' last
when the kernel-compile already runs, it hits almost
immediately.

Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
to also take VM86 into account. This is not strictly
necessary because the slow-path also works for VM86
mode but it is not how the code was intended to work.
And it breaks when Patch 3 is applied on-top.

Patch 3: Implement the reduced copying in the paranoid
entry/exit path as suggested by Andy Lutomirski while
reviewing version 7 of the original patches.

I have the x86/tip branch with these patches on-top running my test for
6h now, with no issues so far. So for now it looks like there are no
scheduling points or irq-enabled sections reached from the paranoid
entry/exit paths and we always return to the entry-stack we came from.

I keep the test running over the weekend at least.

Please review.

[ If Patch 1 looks good to the maintainers I suggest applying it soon,
before too many linux-next testers run into this issue. It is actually
the reason why I am sending out the patches _now_ instead of waiting
until next week, when the other two patches would have gotten more
testing from my side. ]

Thanks,

Joerg

Joerg Roedel (3):
perf/core: Make sure the ring-buffer is mapped in all page-tables
x86/entry/32: Check for VM86 mode in slow-path check
x86/entry/32: Copy only ptregs on paranoid entry/exit path

arch/x86/entry/entry_32.S | 82 ++++++++++++++++++++++++++-------------------
kernel/events/ring_buffer.c | 10 ++++++
2 files changed, 58 insertions(+), 34 deletions(-)

--
2.7.4



2018-07-20 16:23:31

by Joerg Roedel

Subject: [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path

From: Joerg Roedel <[email protected]>

The code that switches from entry- to task-stack when we
enter from kernel-mode copies the full entry-stack contents
to the task-stack.

That is because we don't trust the entry-stack contents.
But we actually can trust its contents if we are not
scheduled out between entry and exit.

So do less copying and move only the ptregs over to the
task-stack in this code-path.
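
For reference, the $(PTREGS_SIZE - 8) constants used below fall out of
the 32-bit pt_regs layout; roughly, from memory (the real struct in
arch/x86/include/asm/ptrace.h uses 16-bit segment fields plus padding;
each line here stands for 4-byte slots):

struct pt_regs {
	unsigned long bx, cx, dx, si, di, bp, ax;	/*  7 slots */
	unsigned long ds, es, fs, gs;			/*  4 slots */
	unsigned long orig_ax, ip, cs, flags;		/*  4 slots */
	unsigned long sp, ss;				/*  2 slots, not copied */
};

/*
 * 17 slots * 4 bytes = 68 bytes, so PTREGS_SIZE - 8 = 60 bytes, i.e.
 * 15 dwords for 'rep movsl' once %esp and %ss are left out.
 */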

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 70 +++++++++++++++++++++++++----------------------
1 file changed, 38 insertions(+), 32 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2767c62..90166b2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -469,33 +469,48 @@
* segment registers on the way back to user-space or when the
* sysenter handler runs with eflags.tf set.
*
- * When we switch to the task-stack here, we can't trust the
- * contents of the entry-stack anymore, as the exception handler
- * might be scheduled out or moved to another CPU. Therefore we
- * copy the complete entry-stack to the task-stack and set a
- * marker in the iret-frame (bit 31 of the CS dword) to detect
- * what we've done on the iret path.
+ * When we switch to the task-stack here, we extend the
+ * stack-frame we copy to include the entry-stack %esp and a
+ * pseudo %ss value so that we have a full ptregs struct on the
+ * stack. We set a marker in the frame (bit 31 of the CS dword).
*
- * On the iret path we copy everything back and switch to the
- * entry-stack, so that the interrupted kernel code-path
- * continues on the same stack it was interrupted with.
+ * On the iret path we read %esp from the PT_OLDESP slot on the
+ * stack and copy ptregs (except oldesp and oldss) to it, when
+ * we find the marker set. Then we switch to the %esp we read,
+ * so that the interrupted kernel code-path continues on the
+ * same stack it was interrupted with.
*
* Be aware that an NMI can happen anytime in this code.
*
+ * Register values here are:
+ *
* %esi: Entry-Stack pointer (same as %esp)
* %edi: Top of the task stack
* %eax: CR3 on kernel entry
*/

- /* Calculate number of bytes on the entry stack in %ecx */
- movl %esi, %ecx
+ /* Allocate full pt_regs on task-stack */
+ subl $PTREGS_SIZE, %edi
+
+ /* Switch to task-stack */
+ movl %edi, %esp

- /* %ecx to the top of entry-stack */
- andl $(MASK_entry_stack), %ecx
- addl $(SIZEOF_entry_stack), %ecx
+ /* Populate pt_regs on task-stack */
+ movl $__KERNEL_DS, PT_OLDSS(%esp) /* Check: Is this needed? */

- /* Number of bytes on the entry stack to %ecx */
- sub %esi, %ecx
+ /*
+ * Save entry-stack pointer on task-stack so that we can switch back to
+ * it on the iret path.
+ */
+ movl %esi, PT_OLDESP(%esp)
+
+ /* sizeof(pt_regs) minus space for %esp and %ss to %ecx */
+ movl $(PTREGS_SIZE - 8), %ecx
+
+ /* Copy rest */
+ shrl $2, %ecx
+ cld
+ rep movsl

/* Mark stackframe as coming from entry stack */
orl $CS_FROM_ENTRY_STACK, PT_CS(%esp)
@@ -505,16 +520,9 @@
* so that we can switch back to it before iret.
*/
testl $PTI_SWITCH_MASK, %eax
- jz .Lcopy_pt_regs_\@
+ jz .Lend_\@
orl $CS_FROM_USER_CR3, PT_CS(%esp)

- /*
- * %esi and %edi are unchanged, %ecx contains the number of
- * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
- * the stack-frame on task-stack and copy everything over
- */
- jmp .Lcopy_pt_regs_\@
-
.Lend_\@:
.endm

@@ -594,16 +602,14 @@
/* Clear marker from stack-frame */
andl $(~CS_FROM_ENTRY_STACK), PT_CS(%esp)

- /* Copy the remaining task-stack contents to entry-stack */
+ /*
+ * Copy the remaining 'struct pt_regs' to entry-stack. Leave out
+ * OLDESP and OLDSS as we didn't copy that over on entry.
+ */
movl %esp, %esi
- movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+ movl PT_OLDESP(%esp), %edi

- /* Bytes on the task-stack to ecx */
- movl PER_CPU_VAR(cpu_tss_rw + TSS_sp1), %ecx
- subl %esi, %ecx
-
- /* Allocate stack-frame on entry-stack */
- subl %ecx, %edi
+ movl $(PTREGS_SIZE - 8), %ecx

/*
* Save future stack-pointer, we must not switch until the
--
2.7.4


2018-07-20 16:24:18

by Joerg Roedel

Subject: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

From: Joerg Roedel <[email protected]>

The ring-buffer is accessed in the NMI handler, so we better
avoid faulting on it. Sync the vmalloc range with all
page-tables in system to make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI
enabled on x86-32:

WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the
PMDs are no longer shared between the page-tables, so the
vmalloc changes do not propagate automatically.
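
Background, roughly from memory: the PMDs stop being shared because the
PAE page-table code now has

	/* arch/x86/include/asm/pgtable-3level_types.h, approximately */
	#define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))

so with PTI every mm gets its own kernel PMD pages, which is why the
explicit sync is needed.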

Signed-off-by: Joerg Roedel <[email protected]>
---
kernel/events/ring_buffer.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..7b0e9aa 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)

vfree(base);
kfree(rb);
+
+ /* Make sure buffer is unmapped in all page-tables */
+ vmalloc_sync_all();
}

void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
if (!all_buf)
goto fail_all_buf;

+ /*
+ * The buffer is accessed in NMI handlers, make sure it is
+ * mapped in all page-tables in the system so that we don't
+ * fault on the range in an NMI handler.
+ */
+ vmalloc_sync_all();
+
rb->user_page = all_buf;
rb->data_pages[0] = all_buf + PAGE_SIZE;
if (nr_pages) {
--
2.7.4


2018-07-20 16:24:39

by Joerg Roedel

Subject: [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check

From: Joerg Roedel <[email protected]>

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to
go down the slow and paranoid entry path. The problem is
that this check also returns true when coming from VM86
mode. This is not a problem by itself, as the paranoid path
handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and
faster).

Extend the check to include VM86 mode. This also makes an
optimization of the paranoid path possible.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb4..2767c62 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
andl $(0x0000ffff), PT_CS(%esp)

/* Special case - entry from kernel mode via entry stack */
- testl $SEGMENT_RPL_MASK, PT_CS(%esp)
- jz .Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+ movl PT_EFLAGS(%esp), %ecx # mix EFLAGS and CS
+ movb PT_CS(%esp), %cl
+ andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+ movl PT_CS(%esp), %ecx
+ andl $SEGMENT_RPL_MASK, %ecx
+#endif
+ cmpl $USER_RPL, %ecx
+ jb .Lentry_from_kernel_\@

/* Bytes to copy */
movl $PTREGS_SIZE, %ecx
--
2.7.4


2018-07-20 17:08:08

by Andy Lutomirski

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

> On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
>
> From: Joerg Roedel <[email protected]>
>
> The ring-buffer is accessed in the NMI handler, so we better
> avoid faulting on it. Sync the vmalloc range with all
> page-tables in system to make sure everyone has it mapped.
>
> This fixes a WARN_ON_ONCE() that can be triggered with PTI
> enabled on x86-32:
>
> WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>
> This triggers because with PTI enabled on a PAE kernel the
> PMDs are no longer shared between the page-tables, so the
> vmalloc changes do not propagate automatically.

It seems like it would be much more robust to fix the vmalloc_fault()
code instead.

2018-07-20 17:11:49

by Andy Lutomirski

Subject: Re: [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path

On Fri, Jul 20, 2018 at 9:22 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> The code that switches from entry- to task-stack when we
> enter from kernel-mode copies the full entry-stack contents
> to the task-stack.
>
> That is because we don't trust the entry-stack contents.
> But we actually can trust its contents if we are not
> scheduled out between entry and exit.
>
> So do less copying and move only the ptregs over to the
> task-stack in this code-path.
>
> Suggested-by: Andy Lutomirski <[email protected]>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 70 +++++++++++++++++++++++++----------------------
> 1 file changed, 38 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 2767c62..90166b2 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -469,33 +469,48 @@
> * segment registers on the way back to user-space or when the
> * sysenter handler runs with eflags.tf set.
> *
> - * When we switch to the task-stack here, we can't trust the
> - * contents of the entry-stack anymore, as the exception handler
> - * might be scheduled out or moved to another CPU. Therefore we
> - * copy the complete entry-stack to the task-stack and set a
> - * marker in the iret-frame (bit 31 of the CS dword) to detect
> - * what we've done on the iret path.
> + * When we switch to the task-stack here, we extend the
> + * stack-frame we copy to include the entry-stack %esp and a
> + * pseudo %ss value so that we have a full ptregs struct on the
> + * stack. We set a marker in the frame (bit 31 of the CS dword).
> *
> - * On the iret path we copy everything back and switch to the
> - * entry-stack, so that the interrupted kernel code-path
> - * continues on the same stack it was interrupted with.
> + * On the iret path we read %esp from the PT_OLDESP slot on the
> + * stack and copy ptregs (except oldesp and oldss) to it, when
> + * we find the marker set. Then we switch to the %esp we read,
> + * so that the interrupted kernel code-path continues on the
> + * same stack it was interrupted with.


Can you give an example of the exact scenario in which any of this
copying happens and why it's needed? IMO you should just be able to
*run* on the entry stack without copying anything at all.

2018-07-20 17:50:08

by Joerg Roedel

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
> >
> > From: Joerg Roedel <[email protected]>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
>
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

The question is whether the NMI path is nesting-safe; if it is, we can
remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should be
nesting-safe on x86-32 because of the way the stack-switch happens
there. If it's also nesting-safe on x86-64, the warning there can be
removed.
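
For reference, the check in question is just the in_nmi() warning near
the top of vmalloc_fault() in arch/x86/mm/fault.c; roughly:

	/* Make sure we are in vmalloc area: */
	if (!(address >= VMALLOC_START && address < VMALLOC_END))
		return -1;

	/* The WARN_ON_ONCE() behind the splat above. */
	WARN_ON_ONCE(in_nmi());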

Or did you think of something else to fix there?


Thanks,

Joerg


2018-07-20 19:28:46

by Thomas Gleixner

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
> >
> > From: Joerg Roedel <[email protected]>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
>
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

Right, but now the obvious fix for the issue at hand is this. We surely
should revisit this.

Thanks,

tglx


2018-07-20 19:33:32

by Andy Lutomirski

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, Jul 20, 2018 at 10:48 AM, Joerg Roedel <[email protected]> wrote:
> On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
>> >
>> > From: Joerg Roedel <[email protected]>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> The question is whether the NMI path is nesting-safe; if it is, we can
> remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should be
> nesting-safe on x86-32 because of the way the stack-switch happens
> there. If it's also nesting-safe on x86-64, the warning there can be
> removed.
>
> Or did you think of something else to fix there?

I'm just reading your changelog, and you said the PMDs are no longer
shared between the page tables. So this presumably means that
vmalloc_fault() no longer actually works correctly on PTI systems. I
didn't read the code to figure out *why* it doesn't work, but throwing
random vmalloc_sync_all() calls around is wrong.

Or maybe the bug really just is the warning. The warning can probably go.

>
>
> Thanks,
>
> Joerg
>

2018-07-20 19:34:35

by Andy Lutomirski

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
>> >
>> > From: Joerg Roedel <[email protected]>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> Right, but now the obvious fix for the issue at hand is this. We surely
> should revisit this.

If you commit this under this reasoning, then please at least make it say:

/* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
is a workaround. */

Let's not have code in the kernel that pretends to make sense but is
actually voodoo magic that works around bugs elsewhere. It's no fun
to maintain down the road.

Subject: [tip:x86/pti] perf/core: Make sure the ring-buffer is mapped in all page-tables

Commit-ID: cdbaf0a372db2bc3c3127e8b63fd15bd6e6757ee
Gitweb: https://git.kernel.org/tip/cdbaf0a372db2bc3c3127e8b63fd15bd6e6757ee
Author: Joerg Roedel <[email protected]>
AuthorDate: Fri, 20 Jul 2018 18:22:22 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 21:32:08 +0200

perf/core: Make sure the ring-buffer is mapped in all page-tables

The ring-buffer is accessed in the NMI handler, so it's better to avoid
faulting on it. Sync the vmalloc range with all page-tables in system to
make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI enabled on
x86-32:

WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the PMDs are no
longer shared between the page-tables, so the vmalloc changes do not
propagate automatically.

Note: Andy said rightfully that we should try to fix the vmalloc code for
that case, but that's not a hot fix for the issue at hand.

Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Laight <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrea Arcangeli <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: "David H . Gutteridge" <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/events/ring_buffer.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf407e374..7b0e9aafafdf 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)

vfree(base);
kfree(rb);
+
+ /* Make sure buffer is unmapped in all page-tables */
+ vmalloc_sync_all();
}

void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
if (!all_buf)
goto fail_all_buf;

+ /*
+ * The buffer is accessed in NMI handlers, make sure it is
+ * mapped in all page-tables in the system so that we don't
+ * fault on the range in an NMI handler.
+ */
+ vmalloc_sync_all();
+
rb->user_page = all_buf;
rb->data_pages[0] = all_buf + PAGE_SIZE;
if (nr_pages) {

Subject: [tip:x86/pti] x86/entry/32: Check for VM86 mode in slow-path check

Commit-ID: 9cd342705877526f387cfcb5df8a964ab5873deb
Gitweb: https://git.kernel.org/tip/9cd342705877526f387cfcb5df8a964ab5873deb
Author: Joerg Roedel <[email protected]>
AuthorDate: Fri, 20 Jul 2018 18:22:23 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 21:32:08 +0200

x86/entry/32: Check for VM86 mode in slow-path check

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to go down the
slow and paranoid entry path. The problem is that this check also returns
true when coming from VM86 mode. This is not a problem by itself, as the
paranoid path handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and faster).

Extend the check to include VM86 mode. This also makes an optimization of
the paranoid path possible.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Laight <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrea Arcangeli <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: "David H . Gutteridge" <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/entry/entry_32.S | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb41e3c7..2767c625a52c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
andl $(0x0000ffff), PT_CS(%esp)

/* Special case - entry from kernel mode via entry stack */
- testl $SEGMENT_RPL_MASK, PT_CS(%esp)
- jz .Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+ movl PT_EFLAGS(%esp), %ecx # mix EFLAGS and CS
+ movb PT_CS(%esp), %cl
+ andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+ movl PT_CS(%esp), %ecx
+ andl $SEGMENT_RPL_MASK, %ecx
+#endif
+ cmpl $USER_RPL, %ecx
+ jb .Lentry_from_kernel_\@

/* Bytes to copy */
movl $PTREGS_SIZE, %ecx

2018-07-20 19:44:14

by Thomas Gleixner

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <[email protected]> wrote:
> > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> >> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
> >> >
> >> > From: Joerg Roedel <[email protected]>
> >> >
> >> > The ring-buffer is accessed in the NMI handler, so we better
> >> > avoid faulting on it. Sync the vmalloc range with all
> >> > page-tables in system to make sure everyone has it mapped.
> >> >
> >> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> >> > enabled on x86-32:
> >> >
> >> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >> >
> >> > This triggers because with PTI enabled on a PAE kernel the
> >> > PMDs are no longer shared between the page-tables, so the
> >> > vmalloc changes do not propagate automatically.
> >>
> >> It seems like it would be much more robust to fix the vmalloc_fault()
> >> code instead.
> >
> > Right, but now the obvious fix for the issue at hand is this. We surely
> > should revisit this.
>
> If you commit this under this reasoning, then please at least make it say:
>
> /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
> is a workaround. */
>
> Let's not have code in the kernel that pretends to make sense but is
> actually voodoo magic that works around bugs elsewhere. It's no fun
> to maintain down the road.

Fair enough. Lemme amend it. Joerg is looking into it, but I surely want to
get that stuff some exposure in next ASAP.

Thanks,

tglx


2018-07-20 19:54:46

by Thomas Gleixner

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, 20 Jul 2018, Thomas Gleixner wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <[email protected]> wrote:
> > > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > >> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <[email protected]> wrote:
> > >> >
> > >> > From: Joerg Roedel <[email protected]>
> > >> >
> > >> > The ring-buffer is accessed in the NMI handler, so we better
> > >> > avoid faulting on it. Sync the vmalloc range with all
> > >> > page-tables in system to make sure everyone has it mapped.
> > >> >
> > >> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > >> > enabled on x86-32:
> > >> >
> > >> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> > >> >
> > >> > This triggers because with PTI enabled on a PAE kernel the
> > >> > PMDs are no longer shared between the page-tables, so the
> > >> > vmalloc changes do not propagate automatically.
> > >>
> > >> It seems like it would be much more robust to fix the vmalloc_fault()
> > >> code instead.
> > >
> > > Right, but now the obvious fix for the issue at hand is this. We surely
> > > should revisit this.
> >
> > If you commit this under this reasoning, then please at least make it say:
> >
> > /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
> > is a workaround. */
> >
> > Let's not have code in the kernel that pretends to make sense but is
> > actually voodoo magic that works around bugs elsewhere. It's no fun
> > to maintain down the road.
>
> Fair enough. Lemme amend it. Joerg is looking into it, but I surely want to
> get that stuff some exposure in next ASAP.

Delta patch below.

Thanks.

tglx

8<-------------
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -815,8 +815,12 @@ static void rb_free_work(struct work_str
vfree(base);
kfree(rb);

- /* Make sure buffer is unmapped in all page-tables */
- vmalloc_sync_all();
+ /*
+ * FIXME: PAE workaround for vmalloc_fault(): Make sure buffer is
+ * unmapped in all page-tables.
+ */
+ if (IS_ENABLED(CONFIG_X86_PAE))
+ vmalloc_sync_all();
}

void rb_free(struct ring_buffer *rb)
@@ -844,11 +848,13 @@ struct ring_buffer *rb_alloc(int nr_page
goto fail_all_buf;

/*
- * The buffer is accessed in NMI handlers, make sure it is
- * mapped in all page-tables in the system so that we don't
- * fault on the range in an NMI handler.
+ * FIXME: PAE workaround for vmalloc_fault(): The buffer is
+ * accessed in NMI handlers, make sure it is mapped in all
+ * page-tables in the system so that we don't fault on the range in
+ * an NMI handler.
*/
- vmalloc_sync_all();
+ if (IS_ENABLED(CONFIG_X86_PAE))
+ vmalloc_sync_all();

rb->user_page = all_buf;
rb->data_pages[0] = all_buf + PAGE_SIZE;

Subject: [tip:x86/pti] x86/entry/32: Check for VM86 mode in slow-path check

Commit-ID: d5e84c21dbf5ea458897f88346dc979909eed913
Gitweb: https://git.kernel.org/tip/d5e84c21dbf5ea458897f88346dc979909eed913
Author: Joerg Roedel <[email protected]>
AuthorDate: Fri, 20 Jul 2018 18:22:23 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 22:33:41 +0200

x86/entry/32: Check for VM86 mode in slow-path check

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to go down the
slow and paranoid entry path. The problem is that this check also returns
true when coming from VM86 mode. This is not a problem by itself, as the
paranoid path handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and faster).

Extend the check to include VM86 mode. This also makes an optimization of
the paranoid path possible.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Laight <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrea Arcangeli <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: "David H . Gutteridge" <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/entry/entry_32.S | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb41e3c7..2767c625a52c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
andl $(0x0000ffff), PT_CS(%esp)

/* Special case - entry from kernel mode via entry stack */
- testl $SEGMENT_RPL_MASK, PT_CS(%esp)
- jz .Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+ movl PT_EFLAGS(%esp), %ecx # mix EFLAGS and CS
+ movb PT_CS(%esp), %cl
+ andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+ movl PT_CS(%esp), %ecx
+ andl $SEGMENT_RPL_MASK, %ecx
+#endif
+ cmpl $USER_RPL, %ecx
+ jb .Lentry_from_kernel_\@

/* Bytes to copy */
movl $PTREGS_SIZE, %ecx

Subject: [tip:x86/pti] perf/core: Make sure the ring-buffer is mapped in all page-tables

Commit-ID: 77754cfa09a6c528c38cbca9ee4cc4f7cf6ad6f2
Gitweb: https://git.kernel.org/tip/77754cfa09a6c528c38cbca9ee4cc4f7cf6ad6f2
Author: Joerg Roedel <[email protected]>
AuthorDate: Fri, 20 Jul 2018 18:22:22 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 22:33:41 +0200

perf/core: Make sure the ring-buffer is mapped in all page-tables

The ring-buffer is accessed in the NMI handler, so it's better to avoid
faulting on it. Sync the vmalloc range with all page-tables in system to
make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI enabled on
x86-32:

WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the PMDs are no
longer shared between the page-tables, so the vmalloc changes do not
propagate automatically.

Note: Andy said rightfully that we should try to fix the vmalloc code for
that case, but that's not a hot fix for the issue at hand.

Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Laight <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrea Arcangeli <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: "David H . Gutteridge" <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/events/ring_buffer.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf407e374..df2d8cf0072c 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,13 @@ static void rb_free_work(struct work_struct *work)

vfree(base);
kfree(rb);
+
+ /*
+ * FIXME: PAE workaround for vmalloc_fault(): Make sure buffer is
+ * unmapped in all page-tables.
+ */
+ if (IS_ENABLED(CONFIG_X86_PAE))
+ vmalloc_sync_all();
}

void rb_free(struct ring_buffer *rb)
@@ -840,6 +847,15 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
if (!all_buf)
goto fail_all_buf;

+ /*
+ * FIXME: PAE workaround for vmalloc_fault(): The buffer is
+ * accessed in NMI handlers, make sure it is mapped in all
+ * page-tables in the system so that we don't fault on the range in
+ * an NMI handler.
+ */
+ if (IS_ENABLED(CONFIG_X86_PAE))
+ vmalloc_sync_all();
+
rb->user_page = all_buf;
rb->data_pages[0] = all_buf + PAGE_SIZE;
if (nr_pages) {

2018-07-20 21:37:54

by Jörg Rödel

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
> I'm just reading your changelog, and you said the PMDs are no longer
> shared between the page tables. So this presumably means that
> vmalloc_fault() no longer actually works correctly on PTI systems. I
> didn't read the code to figure out *why* it doesn't work, but throwing
> random vmalloc_sync_all() calls around is wrong.

Hmm, so the whole point of vmalloc_fault() is to sync changes from
swapper_pg_dir to process page-tables when the relevant parts of the
kernel page-table are not shared, no?

That is also the reason we don't see this on 64 bit, because there these
parts *are* shared.
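
For reference, the x86-32 side of vmalloc_fault() is essentially just a
PMD sync; sketched from memory, details may differ from the actual
arch/x86/mm/fault.c:

static noinline int vmalloc_fault(unsigned long address)
{
	unsigned long pgd_paddr;
	pmd_t *pmd_k;
	pte_t *pte_k;

	/* Make sure we are in vmalloc area: */
	if (!(address >= VMALLOC_START && address < VMALLOC_END))
		return -1;

	WARN_ON_ONCE(in_nmi());	/* the warning discussed below */

	/*
	 * Copy the missing PMD entry for 'address' from the reference
	 * page-table (swapper_pg_dir) into the page-table we are
	 * currently running on.
	 */
	pgd_paddr = read_cr3_pa();
	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
	if (!pmd_k)
		return -1;

	if (pmd_large(*pmd_k))
		return 0;

	pte_k = pte_offset_kernel(pmd_k, address);
	if (!pte_present(*pte_k))
		return -1;

	return 0;
}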

So with that reasoning vmalloc_fault() works as designed, except that
a warning is issued when it happens in the NMI path. That warning comes
from

ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI

which went into 2.6.37 and was added because the NMI handlers were not
nesting-safe back then. Reason probably was that the handler on 64 bit
has to use an IST stack and a nested NMI would overwrite the stack of
the upper handler. We don't have this problem on 32 bit as a nested NMI
will not do another stack-switch there.

I am not sure about 64 bit, but there is a lot of assembly magic to make
NMIs nesting-safe, so I guess the problem should be gone there too.


Regards,

Joerg

2018-07-20 21:44:58

by Joerg Roedel

Subject: Re: [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path

[ Re-sending because I accidentally replied only to Andy ]

On Fri, Jul 20, 2018 at 10:09:26AM -0700, Andy Lutomirski wrote:
> Can you give an example of the exact scenario in which any of this
> copying happens and why it's needed? IMO you should just be able to
> *run* on the entry stack without copying anything at all.

One example is when we execute RESTORE_REGS on the path back to
user-space and get an exception while loading the user segment
registers.

When that happens we are already on the entry-stack and on user-cr3.
There is no question that when we return from the exception we need to
get back to the entry-stack and user-cr3, even though we are returning
to kernel mode. Otherwise we enter user-space with kernel-cr3, or get a
page-fault and panic.

The exception runs through the common_exception path, and finally ends
up calling C code. And correct me if I am wrong, but calling into C code
from the entry-stack is a bad idea for multiple reasons.

The first reason is the size of the stack. We can make it larger, but how
large does it need to be?

The next problem is that current_pt_regs() doesn't work in the C code
when pt_regs are on the entry-stack.
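
Roughly, from memory, current_pt_regs() boils down to

	#define current_pt_regs()	task_pt_regs(current)

	/* x86: assumes the regs sit at the top of the task-stack */
	#define task_pt_regs(task) \
	({ \
		unsigned long __ptr = (unsigned long)task_stack_page(task); \
		__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING; \
		((struct pt_regs *)__ptr) - 1; \
	})

so it simply computes the wrong address when the real frame lives on
the entry-stack.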

These problems can all be solved, but it wouldn't be a robust solution
because when changes to the C code are made they are usually not tested
while on the entry-stack. That case is hard to trigger, so it can easily
break again.

For me, only the x86 selftests triggered all these corner-cases, but not
all developers run them on 32 bit when making changes to generic x86
code.

Regards,

Joerg

2018-07-20 22:21:42

by Andy Lutomirski

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables


> On Jul 20, 2018, at 11:37 AM, Joerg Roedel <[email protected]> wrote:
>
>> On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
>> I'm just reading your changelog, and you said the PMDs are no longer
>> shared between the page tables. So this presumably means that
>> vmalloc_fault() no longer actually works correctly on PTI systems. I
>> didn't read the code to figure out *why* it doesn't work, but throwing
>> random vmalloc_sync_all() calls around is wrong.
>
> Hmm, so the whole point of vmalloc_fault() is to sync changes from
> swapper_pg_dir to process page-tables when the relevant parts of the
> kernel page-table are not shared, no?
>
> That is also the reason we don't see this on 64 bit, because there these
> parts *are* shared.
>
> So with that reasoning vmalloc_fault() works as designed, except that
> a warning is issued when it happens in the NMI path. That warning comes
> from
>
> ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI
>
> which went into 2.6.37 and was added because the NMI handlers were not
> nesting-safe back then. Reason probably was that the handler on 64 bit
> has to use an IST stack and a nested NMI would overwrite the stack of
> the upper handler. We don't have this problem on 32 bit as a nested NMI
> will not do another stack-switch there.
>

Thanks for digging! The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out. But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.

I would remove the warning, re-test, and revert the other patch.

The one case we can’t handle in vmalloc_fault() is a fault on a stack access. I don’t expect this to be a problem for PTI. It was a problem for CONFIG_VMAP_STACK, though.

> I am not sure about 64 bit, but there is a lot of assembly magic to make
> NMIs nesting-safe, so I guess the problem should be gone there too.
>
>
> Regards,
>
> Joerg

2018-07-21 16:08:21

by Pavel Machek

Subject: Re: [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check

Hi!

> The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to
> go down the slow and paranoid entry path. The problem is
> that this check also returns true when coming from VM86
> mode. This is not a problem by itself, as the paranoid path
> handles VM86 stack-frames just fine, but it is not necessary
> as the normal code path handles VM86 mode as well (and
> faster).
>
> Extend the check to include VM86 mode. This also makes an
> optimization of the paranoid path possible.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 010cdb4..2767c62 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -414,8 +414,16 @@
> andl $(0x0000ffff), PT_CS(%esp)
>
> /* Special case - entry from kernel mode via entry stack */
> - testl $SEGMENT_RPL_MASK, PT_CS(%esp)
> - jz .Lentry_from_kernel_\@
> +#ifdef CONFIG_VM86
> + movl PT_EFLAGS(%esp), %ecx # mix EFLAGS and CS
> + movb PT_CS(%esp), %cl
> + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
> +#else
> + movl PT_CS(%esp), %ecx
> + andl $SEGMENT_RPL_MASK, %ecx
> +#endif
> + cmpl $USER_RPL, %ecx
> + jb .Lentry_from_kernel_\@

Would it make sense to jump to the slow path as we did, and then jump
back if VM86 is detected?

Because VM86 is not really used often these days, and moving partial
registers results in short code but IIRC it will be rather slow.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-21 21:08:50

by Linus Torvalds

Subject: Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

On Fri, Jul 20, 2018 at 3:20 PM Andy Lutomirski <[email protected]> wrote:
> Thanks for digging! The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out.
> But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.
>
> I would remove the warning, re-test, and revert the other patch.

Agreed. I don't think we have any issues with page faults during NMI
any more. Afaik the kprobe people depend on it.

That said, 64-bit mode has that scary PV-op case
(arch_flush_lazy_mmu_mode). Being PV mode, I can't find it in myself
to worry about it, I'm assuming it's ok.

Linus

2018-07-23 03:50:09

by David H. Gutteridge

Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Fri, 2018-07-20 at 18:22 +0200, Joerg Roedel wrote:
> Hi,
>
> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:
>
> Patch 1: Very important, it fixes a vmalloc-fault in NMI context
> when PTI is enabled. This is pretty unlikely to hit
> when starting perf on an idle machine, which is why I
> didn't find it earlier in my testing. I always started
> 'perf top' first :/ But when I start 'perf top' last
> when the kernel-compile already runs, it hits almost
> immediately.
>
> Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
> to also take VM86 into account. This is not strictly
> necessary because the slow-path also works for VM86
> mode but it is not how the code was intended to work.
> And it breaks when Patch 3 is applied on-top.
>
> Patch 3: Implement the reduced copying in the paranoid
> entry/exit path as suggested by Andy Lutomirski while
> reviewing version 7 of the original patches.
>
> I have the x86/tip branch with these patches on-top running my test for
> 6h now, with no issues so far. So for now it looks like there are no
> scheduling points or irq-enabled sections reached from the paranoid
> entry/exit paths and we always return to the entry-stack we came from.
>
> I keep the test running over the weekend at least.
>
> Please review.
>
> [ If Patch 1 looks good to the maintainers I suggest applying it soon,
> before too many linux-next testers run into this issue. It is actually
> the reason why I am sending out the patches _now_ instead of waiting
> until next week, when the other two patches would have gotten more
> testing from my side. ]
>
> Thanks,
>
> Joerg
>
> Joerg Roedel (3):
> perf/core: Make sure the ring-buffer is mapped in all page-tables
> x86/entry/32: Check for VM86 mode in slow-path check
> x86/entry/32: Copy only ptregs on paranoid entry/exit path
>
> arch/x86/entry/entry_32.S | 82 ++++++++++++++++++++++++++-------------------
> kernel/events/ring_buffer.c | 10 ++++++
> 2 files changed, 58 insertions(+), 34 deletions(-)

Hi Joerg,

I tested again using the tip "x86/pti" branch (with two of the three
patches in your change set already applied), and manually applied
your third patch on top of it (I see there was some debate about it,
but I thought I'd include it), plus I had to manually apply the patch
to fix booting (d1b47a7c9efcf3c3384b70f6e3c8f1423b44d8c7: "mm: don't
do zero_resv_unavail if memmap is not allocated"), since "x86/pti"
doesn't include it yet.

Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
driver. (This is the same VM definition I shared with you in a PM
back on Feb. 20th, except note that 4.18 kernels won't successfully
boot with QEMU's IDE device, so I'm using SATA instead. That's a
regression totally unrelated to your change sets, or to the general
booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

[drm] Found bochs VGA, ID 0xb0c0.
[drm] Framebuffer size 16384 kB @ 0xfd000000, mmio @ 0xfebd4000.
[TTM] Zone kernel: Available graphics memory: 390536 kiB
[TTM] Zone highmem: Available graphics memory: 4659530 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
------------[ cut here ]------------
kernel BUG at arch/x86/mm/fault.c:269!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G W 4.18.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00
00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d
b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3
EAX: 02788000 EBX: c85b6de8 ECX: 00000080 EDX: 00000000
ESI: 000fd000 EDI: fd0000f3 EBP: f4743994 ESP: f474397c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a00000 CR3: 347f6000 CR4: 000006f0
Call Trace:
__do_page_fault+0x340/0x4c0
do_page_fault+0x25/0xf0
? ttm_mem_reg_ioremap+0xe5/0x100 [ttm]
? kvm_async_pf_task_wait+0x1b0/0x1b0
do_async_page_fault+0x55/0x80
common_exception+0x13f/0x146
EIP: memset+0xb/0x20
Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b
5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89
d8 5b 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 90 66
EAX: 00000000 EBX: f7a00000 ECX: 00300000 EDX: 00000000
ESI: f4743b9c EDI: f7a00000 EBP: f4743a4c ESP: f4743a44
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
ttm_bo_move_memcpy+0x49a/0x4c0 [ttm]
? _cond_resched+0x17/0x40
ttm_bo_handle_move_mem+0x554/0x570 [ttm]
? ttm_bo_mem_space+0x211/0x440 [ttm]
ttm_bo_validate+0xf5/0x110 [ttm]
bochs_bo_pin+0xde/0x1c0 [bochs_drm]
bochsfb_create+0xce/0x310 [bochs_drm]
__drm_fb_helper_initial_config_and_unlock+0x1cc/0x470 [drm_kms_helper]
drm_fb_helper_initial_config+0x35/0x40 [drm_kms_helper]
bochs_fbdev_init+0x74/0x80 [bochs_drm]
bochs_load+0x7a/0x90 [bochs_drm]
drm_dev_register+0x103/0x180 [drm]
drm_get_pci_dev+0x80/0x160 [drm]
bochs_pci_probe+0xcb/0x100 [bochs_drm]
? bochs_load+0x90/0x90 [bochs_drm]
pci_device_probe+0xc7/0x160
driver_probe_device+0x2dc/0x470
__driver_attach+0xe1/0x110
? driver_probe_device+0x470/0x470
bus_for_each_dev+0x5a/0x90
driver_attach+0x19/0x20
? driver_probe_device+0x470/0x470
bus_add_driver+0x12f/0x230
? pci_bus_num_vf+0x20/0x20
driver_register+0x56/0xf0
? 0xf789c000
__pci_register_driver+0x3d/0x40
bochs_init+0x41/0x1000 [bochs_drm]
do_one_initcall+0x42/0x1a9
? free_unref_page_commit+0x6f/0xe0
? _cond_resched+0x17/0x40
? kmem_cache_alloc_trace+0x3b/0x1f0
? do_init_module+0x21/0x1dc
? do_init_module+0x21/0x1dc
do_init_module+0x50/0x1dc
load_module+0x22ae/0x2940
sys_finit_module+0x8a/0xe0
do_fast_syscall_32+0x7f/0x1b0
entry_SYSENTER_32+0x79/0xda
EIP: 0xb7eecd09
Code: 08 8b 80 64 cd ff ff 85 d2 74 02 89 02 5d c3 8b 04 24 c3 8b 0c 24
c3 8b 1c 24 c3 8b 3c 24 c3 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59
c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: ffffffda EBX: 00000010 ECX: b7afee75 EDX: 00000000
ESI: 019a91e0 EDI: 0199a6d0 EBP: 0199a760 ESP: bfba3d8c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: bochs_drm(+) drm_kms_helper ttm virtio_console drm
8139too virtio_pci serio_raw crc32c_intel virtio_ring 8139cp ata_generic
qemu_fw_cfg virtio mii floppy pata_acpi
---[ end trace 9fc4a94c280952eb ]---
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00
00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d
b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3
EAX: 02788000 EBX: c85b6de8 ECX: 00000080 EDX: 00000000
ESI: 000fd000 EDI: fd0000f3 EBP: f4743994 ESP: c85c1ddc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a00000 CR3: 347f6000 CR4: 000006f0

Regards,

Dave



2018-07-23 07:31:43

by Jörg Rödel

Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

Hey David,

On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> driver. (This is the same VM definition I shared with you in a PM
> back on Feb. 20th, except note that 4.18 kernels won't successfully
> boot with QEMU's IDE device, so I'm using SATA instead. That's a
> regression totally unrelated to your change sets, or to the general
> booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

Yes, this needs the fixes in the tip/x86/mm branch as well. Can you pull
that branch in and test again, please?


Thanks,

Joerg


2018-07-23 14:10:52

by Pavel Machek

Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

Hi!

> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:

It seems PTI is now in -next. I'll test that soon.

Meanwhile... it looks like gcc is not slowed down significantly, but
other stuff sees 30% .. 40% slowdowns... which is rather
significant.

Would it be possible to have per-process control of kpti? I have
some processes where trading off speed for security would make sense.

Best regards,
Pavel

cd ~/g/tui/nowcast
time ./nowcast -x (30%)
KPTI: 139.25user 73.65system 269.90 (4m29.901s) elapsed 78.88%CPU
133.35user 73.15system 228.80 (3m48.802s) elapsed 90.25%CPU
140.51user 74.21system 218.33 (3m38.338s) elapsed 98.34%CPU
133.85user 75.89system 212.02 (3m32.026s) elapsed 98.93%CPU (no chromium)
139.34user 75.00system 235.75 (3m55.752s) elapsed 90.92%CPU

4.18: 116.99user 43.79system 217.65 (3m37.653s) elapsed 73.87%CPU
115.14user 43.97system 178.85 (2m58.855s) elapsed 88.96%CPU
128.47user 47.22system 178.24 (2m58.245s) elapsed 98.57%CPU
132.30user 49.27system 184.40 (3m4.408s) elapsed 98.46%CPU
134.88user 48.59system 186.67 (3m6.673s) elapsed 98.29%CPU
132.15user 48.65system 524.68 (8m44.684s) elapsed 34.46%CPU
120.38user 45.45system 168.72 (2m48.720s) elapsed 98.29%CPU

time cat /dev/urandom | head -c 10000000 | bzip2 -9 - | wc -c (40%)
v4.18: 4.57user 0.23system 4.64 (0m4.644s) elapsed 103.53%CPU
4.86user 0.23system 4.95 (0m4.952s) elapsed 102.81%CPU
5.13user 0.22system 5.19 (0m5.190s) elapsed 103.14%CPU
KPTI: 6.39user 0.48system 6.74 (0m6.747s) elapsed 101.96%CPU
6.66user 0.41system 6.91 (0m6.912s) elapsed 102.51%CPU
6.53user 0.51system 6.91 (0m6.919s) elapsed 101.99%CPU

v4l-utils: make clean, time make
v4.18: 191.93user 11.00system 211.19 (3m31.191s) elapsed 96.09%CPU
221.21user 14.69system 248.73 (4m8.734s) elapsed 94.84%CPU
198.35user 11.61system 211.39 (3m31.392s) elapsed 99.32%CPU
204.87user 11.69system 217.97 (3m37.971s) elapsed 99.35%CPU
203.68user 11.88system 217.29 (3m37.291s) elapsed 99.20%CPU
KPTI: 156.45user 40.08system 204.77 (3m24.777s) elapsed 95.97%CPU
183.32user 38.64system 225.03 (3m45.031s) elapsed 98.63%CPU

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-23 19:02:04

by Linus Torvalds

Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
>
> Meanwhile... it looks like gcc is not slowed down significantly, but
> other stuff sees 30% .. 40% slowdowns... which is rather
> significant.

That is more or less expected.

Gcc spends about 90+% of its time in user space, and the system calls
it *does* do tend to be "real work" (open/read/etc). And modern gcc's
no longer have the pipe between cpp and cc1, so they don't have that
issue either (which would have shown the PTI slowdown a lot more)

Some other loads will do a lot more time traversing the user/kernel
boundary, and in 32-bit mode you won't be able to take advantage of
the address space ID's, so you really get the full effect.

> Would it be possible to have per-process control of kpti? I have
> some processes where trading off speed for security would make sense.

That was pretty extensively discussed, and no sane model for it was
ever agreed upon. Some people wanted it per-thread, others per-mm,
and it wasn't clear how to set it either and how it should inherit
across fork/exec, and what the namespace rules etc should be.

You absolutely need to inherit it (so that you can say "I trust this
session" or whatever), but at the same time you *don't* want to
inherit if you have a server you trust that then spawns user processes
(think "I want systemd to not have the overhead, but the user
processes it spawns obviously do need protection").

It was just a morass. Nothing came out of it. I guess people can
discuss it again, but it's not simple.

Linus

2018-07-23 21:40:33

by Pavel Machek

Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
>
> That is more or less expected.
>
> Gcc spends about 90+% of its time in user space, and the system calls
> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> no longer have the pipe between cpp and cc1, so they don't have that
> issue either (which would have shown the PTI slowdown a lot more)
>
> Some other loads will do a lot more time traversing the user/kernel
> boundary, and in 32-bit mode you won't be able to take advantage of
> the address space ID's, so you really get the full effect.

Understood. Just -- bzip2 should include quite a lot of time in
userspace, too.

> > Would it be possible to have per-process control of kpti? I have
> > some processes where trading off speed for security would make sense.
>
> That was pretty extensively discussed, and no sane model for it was
> ever agreed upon. Some people wanted it per-thread, others per-mm,
> and it wasn't clear how to set it either and how it should inherit
> across fork/exec, and what the namespace rules etc should be.
>
> You absolutely need to inherit it (so that you can say "I trust this
> session" or whatever), but at the same time you *don't* want to
> inherit if you have a server you trust that then spawns user processes
> (think "I want systemd to not have the overhead, but the user
> processes it spawns obviously do need protection").
>
> It was just a morass. Nothing came out of it. I guess people can
> discuss it again, but it's not simple.

I agree it is not easy. OTOH -- a 30% hit to user-visible performance is
a _lot_. That is worth spending man-years on... Ok, the problem is not
as severe on modern CPUs with address space IDs, but...

What I want is "if A can ptrace B, and B has pti disabled, A can have
pti disabled as well". Now.. I see someone may want to have it
per-thread, because for stuff like javascript JIT, thread may have
rights to call ptrace, but is unable to call ptrace because JIT
removed that ability... hmm...

But for now I'd like at least a "global" option for turning pti on/off
at runtime, for benchmarking. Let me see...

Something like this, or is it going to be way more complex? Does
anyone have a patch, by chance?

Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
.macro SWITCH_TO_USER_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI

+ cmpl $1, PER_CPU_VAR(pti_enabled)
+ jne .Lend_\@
+
movl %cr3, \scratch_reg
orl $PTI_SWITCH_MASK, \scratch_reg
movl \scratch_reg, %cr3
@@ -176,6 +179,8 @@
testl $SEGMENT_RPL_MASK, PT_CS(%esp)
jz .Lend_\@
.endif
+ cmpl $1, PER_CPU_VAR(pti_enabled)
+ jne .Lend_\@
/* On user-cr3? */
movl %cr3, %eax
testl $PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
*/
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ cmpl $1, PER_CPU_VAR(pti_enabled)
+ jne .Lend_\@
+
movl %cr3, \scratch_reg
/* Test if we are already on kernel CR3 */
testl $PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
*/
ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI

+ cmpl $1, PER_CPU_VAR(pti_enabled)
+ jne .Lswitched_\@
+
testl $PTI_SWITCH_MASK, \cr3_reg
jz .Lswitched_\@

diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
#define CPU_ENTRY_AREA_TOT_SIZE (CPU_ENTRY_AREA_SIZE * NR_CPUS)

DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);

extern void setup_cpu_entry_areas(void);
extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
#endif

+DEFINE_PER_CPU(int, pti_enabled);
+
+
#ifdef CONFIG_X86_64
/*
* Special IST stacks which the CPU switches to when it calls

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-23 21:52:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates



> On Jul 23, 2018, at 2:38 PM, Pavel Machek <[email protected]> wrote:
>
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
>>>
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>>
>> That is more or less expected.
>>
>> Gcc spends about 90+% of its time in user space, and the system calls
>> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
>> no longer have the pipe between cpp and cc1, so they don't have that
>> issue either (which would have shown the PTI slowdown a lot more)
>>
>> Some other loads will do a lot more time traversing the user/kernel
>> boundary, and in 32-bit mode you won't be able to take advantage of
>> the address space ID's, so you really get the full effect.
>
> Understood. Just -- bzip2 should include quite a lot of time in
> userspace, too.
>
>>> Would it be possible to have per-process control of kpti? I have
>>> some processes where trading of speed for security would make sense.
>>
>> That was pretty extensively discussed, and no sane model for it was
>> ever agreed upon. Some people wanted it per-thread, others per-mm,
>> and it wasn't clear how to set it either and how it should inherit
>> across fork/exec, and what the namespace rules etc should be.
>>
>> You absolutely need to inherit it (so that you can say "I trust this
>> session" or whatever), but at the same time you *don't* want to
>> inherit if you have a server you trust that then spawns user processes
>> (think "I want systemd to not have the overhead, but the user
>> processes it spawns obviously do need protection").
>>
>> It was just a morass. Nothing came out of it. I guess people can
>> discuss it again, but it's not simple.
>
> I agree it is not easy. OTOH -- 30% of user-visible performance is a
> _lot_. That is worth spending man-years on... Ok, problem is not as
> severe on modern CPUs with address space ID's, but...
>
> What I want is "if A can ptrace B, and B has pti disabled, A can have
> pti disabled as well". Now.. I see someone may want to have it
> per-thread, because for stuff like javascript JIT, thread may have
> rights to call ptrace, but is unable to call ptrace because JIT
> removed that ability... hmm...

No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

2018-07-23 21:57:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

Hi!

> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now.. I see someone may want to have it
> > per-thread, because for stuff like javascript JIT, thread may have
> > rights to call ptrace, but is unable to call ptrace because JIT
> > removed that ability... hmm...
>
> No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

Yeah, the web browser threads that run javascript code should have PTI
on. But maybe I want the rest of the web browser with PTI off.

So... yes, I see why someone may want it per-thread (and not
per-process).

I guess per-process would be good enough for me. Actually, maybe even
per-uid. I don't have any fancy security here, so anything running as
uid 0 or 1000 is close enough to trusted.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-23 22:01:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
> But for now I'd like at least "global" option of turning pti on/off
> during runtime for benchmarking. Let me see...
>
> Something like this, or is it going to be way more complex? Does
> anyone have patch by chance?

RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

--
Josh
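
[ For illustration, a minimal sketch of how such a global runtime toggle
  could be wired through stop_machine(); this is not the RHEL/CentOS
  code, and the identifiers below (__pti_set_all, pti_runtime_set) are
  made up. It only flips the per-cpu flag that Pavel's draft checks in
  the entry macros; any fixup of the user page tables is deliberately
  left out.

/*
 * Illustrative sketch only -- not a real implementation.
 * Flip the (hypothetical) per-cpu pti_enabled flag on every CPU while
 * all CPUs are held in stop_machine(), so no CPU is inside the entry
 * code with a half-updated view of the flag.
 */
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/stop_machine.h>

DECLARE_PER_CPU(int, pti_enabled);	/* defined in common.c in Pavel's draft */

static int __pti_set_all(void *data)
{
	int enable = *(int *)data;
	int cpu;

	/* Runs on one CPU while the others spin in stop_machine(). */
	for_each_online_cpu(cpu)
		per_cpu(pti_enabled, cpu) = enable;

	return 0;
}

/* Would be called from a sysctl or debugfs write handler. */
static int pti_runtime_set(int enable)
{
	return stop_machine(__pti_set_all, &enable, NULL);
}

  Whether it is actually safe to simply skip the CR3 switch after PTI
  has been enabled once (stale user page tables, the NX-for-userspace
  Dave mentions below) is exactly the part this sketch does not answer. ]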

2018-07-23 22:09:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On 07/23/2018 02:59 PM, Josh Poimboeuf wrote:
> On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
>> But for now I'd like at least "global" option of turning pti on/off
>> during runtime for benchmarking. Let me see...
>>
>> Something like this, or is it going to be way more complex? Does
>> anyone have patch by chance?
> RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

Let's not forget PTI's NX-for-userspace in the kernel page tables. That
provides Spectre v2 mitigation as well as Meltdown mitigation.

2018-07-24 13:40:47

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
>
> That is more or less expected.

Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test
in a loop, I get that (on v4.18) too.

That tells me that something is wrong with the machine I'm using for
benchmarking. Whether KPTI is enabled can still be measured with the
bzip2 pipeline, but the effect is far more subtle.

Pavel

pavel@amd:~$ while true; do time cat /dev/urandom | head -c 10000000 |
bzip2 -9 - | wc -c ; done
10044031
3.87user 0.91system 4.62 (0m4.622s) elapsed 103.48%CPU
10044234
4.03user 0.82system 4.68 (0m4.688s) elapsed 103.67%CPU
10043664
4.28user 0.85system 4.99 (0m4.994s) elapsed 102.90%CPU
10045959
4.43user 0.85system 5.12 (0m5.121s) elapsed 103.44%CPU
10043829
4.50user 0.89system 5.22 (0m5.228s) elapsed 103.22%CPU
10044296
4.65user 0.93system 5.39 (0m5.398s) elapsed 103.61%CPU
10045311
4.76user 0.93system 5.47 (0m5.479s) elapsed 103.98%CPU
10043819
4.81user 0.93system 5.55 (0m5.556s) elapsed 103.37%CPU
10045097
4.72user 1.04system 5.59 (0m5.597s) elapsed 103.01%CPU
10044012
4.86user 0.97system 5.68 (0m5.684s) elapsed 102.79%CPU
10044569
4.93user 0.96system 5.72 (0m5.728s) elapsed 102.92%CPU
10044141
4.94user 0.98system 5.75 (0m5.752s) elapsed 102.97%CPU
10043695
4.97user 0.95system 5.76 (0m5.768s) elapsed 102.87%CPU
10045690
5.12user 0.94system 5.90 (0m5.901s) elapsed 102.79%CPU
10045153
5.06user 1.00system 5.88 (0m5.883s) elapsed 103.21%CPU
10044560
5.10user 1.01system 5.92 (0m5.927s) elapsed 103.31%CPU
10044845
5.17user 0.99system 5.96 (0m5.960s) elapsed 103.44%CPU
10043884
5.15user 1.03system 6.00 (0m6.004s) elapsed 103.14%CPU
10044286
5.18user 1.01system 6.00 (0m6.002s) elapsed 103.40%CPU
10045749
5.00user 1.22system 6.04 (0m6.044s) elapsed 102.98%CPU
10044098
5.22user 1.02system 6.05 (0m6.053s) elapsed 103.21%CPU
10045326
5.20user 1.01system 6.04 (0m6.048s) elapsed 102.72%CPU
10042365
5.22user 1.03system 6.06 (0m6.061s) elapsed 103.30%CPU
10043952
5.24user 1.00system 6.06 (0m6.069s) elapsed 102.97%CPU
10044569
5.30user 1.00system 6.09 (0m6.099s) elapsed 103.46%CPU
10043241
5.26user 1.00system 6.09 (0m6.097s) elapsed 102.79%CPU
10044797
5.30user 1.01system 6.11 (0m6.114s) elapsed 103.46%CPU
10043711
5.25user 1.02system 6.09 (0m6.093s) elapsed 103.03%CPU
10043882
5.31user 1.01system 6.13 (0m6.131s) elapsed 103.28%CPU
10043571
5.26user 1.05system 6.13 (0m6.133s) elapsed 103.06%CPU
10044742
5.29user 1.03system 6.12 (0m6.122s) elapsed 103.25%CPU
10044170
5.35user 1.04system 6.18 (0m6.183s) elapsed 103.60%CPU
10043542
5.22user 1.12system 6.17 (0m6.172s) elapsed 102.89%CPU
10042985
5.25user 1.13system 6.19 (0m6.193s) elapsed 103.09%CPU
10044102
5.36user 1.01system 6.17 (0m6.177s) elapsed 103.17%CPU
10044609
5.48user 0.99system 6.28 (0m6.284s) elapsed 103.11%CPU
10045185
5.40user 1.03system 6.23 (0m6.236s) elapsed 103.29%CPU
10044444
5.41user 1.06system 6.25 (0m6.255s) elapsed 103.55%CPU
10044859
5.35user 1.04system 6.20 (0m6.201s) elapsed 103.17%CPU
10045613

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-24 14:41:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates



> On Jul 24, 2018, at 6:39 AM, Pavel Machek <[email protected]> wrote:
>
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
>>>
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>>
>> That is more or less expected.
>
> Ok, so I was wrong. bzip2 showed 30% slowdown, but running test in a
> loop, I get (on v4.18) that, too.
>
>

...

The obvious cause would be thermal issues, which are increasingly common in laptops. You could get cycle counts from perf stat, perhaps.

2018-07-24 21:20:03

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon 2018-07-23 14:50:50, Andy Lutomirski wrote:
>
>
> > On Jul 23, 2018, at 2:38 PM, Pavel Machek <[email protected]> wrote:
> >
> >> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> >>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <[email protected]> wrote:
> >>>
> >>> Meanwhile... it looks like gcc is not slowed down significantly, but
> >>> other stuff sees 30% .. 40% slowdowns... which is rather
> >>> significant.
> >>
> >> That is more or less expected.
> >>
> >> Gcc spends about 90+% of its time in user space, and the system calls
> >> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> >> no longer have the pipe between cpp and cc1, so they don't have that
> >> issue either (which would have shown the PTI slowdown a lot more)
> >>
> >> Some other loads will do a lot more time traversing the user/kernel
> >> boundary, and in 32-bit mode you won't be able to take advantage of
> >> the address space ID's, so you really get the full effect.
> >
> > Understood. Just -- bzip2 should include quite a lot of time in
> > userspace, too.
> >
> >>> Would it be possible to have per-process control of kpti? I have
> >>> some processes where trading of speed for security would make sense.
> >>
> >> That was pretty extensively discussed, and no sane model for it was
> >> ever agreed upon. Some people wanted it per-thread, others per-mm,
> >> and it wasn't clear how to set it either and how it should inherit
> >> across fork/exec, and what the namespace rules etc should be.
> >>
> >> You absolutely need to inherit it (so that you can say "I trust this
> >> session" or whatever), but at the same time you *don't* want to
> >> inherit if you have a server you trust that then spawns user processes
> >> (think "I want systemd to not have the overhead, but the user
> >> processes it spawns obviously do need protection").
> >>
> >> It was just a morass. Nothing came out of it. I guess people can
> >> discuss it again, but it's not simple.
> >
> > I agree it is not easy. OTOH -- 30% of user-visible performance is a
> > _lot_. That is worth spending man-years on... Ok, problem is not as
> > severe on modern CPUs with address space ID's, but...
> >
> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now.. I see someone may want to have it
> > per-thread, because for stuff like javascript JIT, thread may have
> > rights to call ptrace, but is unable to call ptrace because JIT
> > removed that ability... hmm...
>
> No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

Ok, you are right. It is more tricky than I thought.

Still, I probably don't need to run greps and cats with PTI
on. Chromium (etc.) probably needs it. A Python interpreter running my
own code probably does not.

Yes, my Thinkpad X60 is probably thermal-throttled. It is not really a
new machine. I switched to a T40p for now :-).

What is the "worst" case people are seeing?
time dd if=/dev/zero of=/dev/null bs=1 count=10000000
can reproduce a 3x slowdown, but that's basically a microbenchmark.

A root-only per-process enable/disable of kpti should not be too hard
to do. Would there be interest in that?

Best regards,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
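
[ Side note on the dd microbenchmark above: a tiny C loop doing one-byte
  read()s from /dev/zero measures essentially the same thing -- the
  per-syscall entry/exit cost -- with less pipeline noise. A rough
  sketch, not from the thread:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const long iters = 10 * 1000 * 1000;
	struct timespec t0, t1;
	char buf;
	int fd = open("/dev/zero", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		/* one syscall per byte, like dd bs=1 */
		if (read(fd, &buf, 1) != 1) {
			perror("read");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f ns per read() syscall\n", secs / iters * 1e9);
	close(fd);
	return 0;
}

  Comparing the per-call number on the same kernel with PTI on and off
  (e.g. booting with nopti) gives the raw entry/exit cost without
  bzip2's user-space work diluting it. ]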



2018-07-26 03:48:48

by David H. Gutteridge

[permalink] [raw]
Subject: Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

On Mon, 2018-07-23 at 09:29 +0200, Joerg Roedel wrote:
> Hey David,
>
> On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> > Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> > driver. (This is the same VM definition I shared with you in a PM
> > back on Feb. 20th, except note that 4.18 kernels won't successfully
> > boot with QEMU's IDE device, so I'm using SATA instead. That's a
> > regression totally unrelated to your change sets, or to the general
> > booting issue with 4.18 RC5, since it occurs in vanilla RC4 as
> > well.)
>
> Yes, this needs the fixes in the tip/x86/mm branch as well. Can you
> pull that branch in and test again, please?

Sorry, I didn't realize I needed those changes, too. I've re-tested
with those applied and haven't encountered any issues. I'm now
re-testing again with your newer patch set from the 25th. No issues
so far with those, either; I'll confirm in that email thread after
the laptop has seen some more use.

Dave