2018-01-04 20:56:43

by Dave Hansen

Subject: [PATCH] x86/doc: add PTI description


This got kicked out of the PTI set as the implementation diverged
from its contents. I've updated it so it can hopefully rejoin the
set.

---

From: Dave Hansen <[email protected]>

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'nopti'.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Moritz Lipp <[email protected]>
Cc: Daniel Gruss <[email protected]>
Cc: Michael Schwarz <[email protected]>
Cc: Richard Fellner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
---

b/Documentation/admin-guide/kernel-parameters.txt | 11 +
b/Documentation/x86/pti.txt | 174 ++++++++++++++++++++++
2 files changed, 182 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc 2018-01-03 17:04:23.255028797 -0800
+++ b/Documentation/admin-guide/kernel-parameters.txt 2018-01-03 17:07:06.058028391 -0800
@@ -2712,8 +2712,6 @@
steal time is computed, but won't influence scheduler
behaviour

- nopti [X86-64] Disable kernel page table isolation
-
nolapic [X86-32,APIC] Do not enable or use the local APIC.

nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
@@ -3288,12 +3286,19 @@
pt. [PARIDE]
See Documentation/blockdev/paride.txt.

- pti= [X86_64]
+ pti= [X86_64] Disable Page Table Isolation of user and
+ kernel address spaces. Disabling this feature
+ removes hardening, but improves performance of
+ system calls and interrupts.
+
Control user/kernel address space isolation:
on - enable
off - disable
auto - default setting

+ nopti [X86_64]
+ Equivalent to pti=off
+
pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
diff -puN /dev/null Documentation/x86/pti.txt
--- /dev/null 2017-12-15 13:48:30.454245127 -0800
+++ b/Documentation/x86/pti.txt 2018-01-04 12:54:05.667850771 -0800
@@ -0,0 +1,174 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on kernel address information such
+as the "Meltdown" approach[2].
+
+To avoid leaking address information, we create a new, independent
+copy of the page tables which is used only when running userspace
+applications. When the kernel is entered via syscalls, interrupts or
+exceptions, page tables are switched to the full "kernel" copy. When
+the system switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see comments
+in pti.c).
+
+This approach helps to ensure that side-channel attacks that leverage
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameter (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page
+tables. The first copy is very similar to what would be present
+for a kernel without PTI. This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The userspace copy is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy. It maps only
+the kernel data needed to enter and exit the kernel. This data
+is entirely contained in the 'struct cpu_entry_area' structure
+which is placed in the fixmap and thus each CPU's copy of the
+area has a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal. The only difference is when the kernel
+makes entries in the top (PGD) level. In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important. But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+ a. Each process now needs an order-1 PGD instead of order-0.
+ (Consumes 4k per process).
+ b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+ aligned so that it can be mapped by setting a single PMD
+ entry. This consumes nearly 2MB of RAM once the kernel
+ is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+ a. CR3 manipulation to switch between the page table copies
+ must be done at interrupt, syscall, and exception entry
+ and exit (it can be skipped when the kernel is interrupted,
+ though.) Moves to CR3 are on the order of a hundred
+ cycles, and are required at every entry and every exit.
+ b. A "trampoline" must be used for SYSCALL entry. This
+ trampoline depends on a smaller set of resources than the
+ non-PTI SYSCALL entry code, so requires mapping fewer
+ things into the userspace page tables. The downside is
+ that stacks must be switched at entry time.
+ c. Global pages are disabled for all kernel structures not
+ mapped into both kernel and userspace page tables. This
+ feature of the MMU allows different processes to share TLB
+ entries mapping the kernel. Losing the feature means more
+ TLB misses after a context switch. The actual loss of
+ performance is very small, however, never exceeding 1%.
+ d. Process Context IDentifiers (PCID) is a CPU feature that
+ allows us to skip flushing the entire TLB when switching page
+ tables. This makes switching the page tables (at context
+ switch, or kernel entry/exit) cheaper. But, on systems with
+ PCID support, the context switch code must flush both the user
+ and kernel entries out of the TLB. The user PCID TLB flush is
+ deferred until the exit to userspace, minimizing the cost.
+ e. The userspace page tables must be populated for each new
+ process. Even without PTI, the shared kernel mappings
+ are created by copying top-level (PGD) entries into each
+ new process. But, with PTI, there are now *two* kernel
+ mappings: one in the kernel page tables that maps everything
+ and one for the entry/exit structures. At fork(), we need to
+ copy both.
+ f. In addition to the fork()-time copying, there must also
+ be an update to the userspace PGD any time a set_pgd() is done
+ on a PGD used to map userspace. This ensures that the kernel
+ and userspace copies always map the same userspace
+ memory.
+ g. On systems without PCID support, each CR3 write flushes
+ the entire TLB. That means that each syscall, interrupt
+ or exception flushes the TLB.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+ unless its value is actually changed.
+2. Allow PTI to be enabled/disabled at runtime in addition to the
+ boot-time switching.
+
+Testing
+=======
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+ (excluding MPX and protection_keys) in a loop on multiple CPUs for
+ several minutes. These tests frequently uncover corner cases in the
+ kernel entry code. In general, old kernels might cause these tests
+ themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+ frequent performance monitoring non-maskable interrupts (see "NMI"
+ in /proc/interrupts). This exercises the NMI entry/exit code which
+ is known to trigger bugs in code paths that did not expect to be
+ interrupted, including nested NMIs. Using "-c" boosts the rate of
+ NMIs, and using two -c with separate counters encourages nested NMIs
+ and less deterministic behavior.
+
+ while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+ This has been a lightly-tested code path and needs extra scrutiny.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code. Usually a bug in one of the
+ more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup. Bugs
+ in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,
+ like screwing up a page table switch. Also caused by
+ incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI. The NMI code is separate from main
+ interrupt handlers and can have bugs that do not affect
+ normal interrupts. Also caused by incorrectly mapping NMI
+ code. NMIs that interrupt the entry code must be very
+ careful and can be the cause of crashes that show up when
+ running perf.
+ * Kernel crashes at the first exit to userspace. entry_64.S
+ bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+ in entry_64.S that return to userspace are sometimes separate
+ from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+ faults upon page faults. Caused by touching non-pti-mapped
+ data in the entry code, or forgetting to switch to kernel
+ CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+ as mount(8) failing to mount the rootfs. These have
+ tended to be TLB invalidation issues. Usually invalidating
+ the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
+2. https://meltdownattack.com/meltdown.pdf
_
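
For readers following the "Page Table Management" text in the patch above,
here is a minimal userspace model of the PGD mirroring it describes. It is
only a sketch under stated assumptions, not the kernel implementation: the
helper names, the 512-entry PGD, the "lower 256 entries map userspace"
split and the bit positions are all chosen for illustration. It models an
order-1 PGD (kernel copy in the first 4k page, user copy in the second),
copies userspace entries into the user copy, and sets NX on the kernel's
view of userspace as mentioned in review of this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All constants below are assumptions made for this illustration. */
#define PAGE_SIZE        4096u
#define PTRS_PER_PGD     512u          /* PAGE_SIZE / sizeof(pgd_t) */
#define USER_PGD_ENTRIES 256u          /* lower half maps userspace */
#define PAGE_PRESENT     (1ULL << 0)
#define PAGE_USER        (1ULL << 2)
#define PAGE_NX          (1ULL << 63)

typedef uint64_t pgd_t;

/* Allocate an order-1 (8k, 8k-aligned) PGD: kernel copy, then user copy. */
static pgd_t *pgd_alloc_pair(void)
{
	pgd_t *pgd = aligned_alloc(2 * PAGE_SIZE, 2 * PAGE_SIZE);

	if (pgd)
		memset(pgd, 0, 2 * PAGE_SIZE);
	return pgd;
}

/* The user copy sits one page above the kernel copy: set bit 12. */
static pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
{
	return (pgd_t *)((uintptr_t)pgdp | PAGE_SIZE);
}

/* Does this slot fall in the half of the PGD that maps userspace? */
static int pgdp_maps_userspace(const pgd_t *pgdp)
{
	return ((uintptr_t)pgdp % PAGE_SIZE) / sizeof(pgd_t) < USER_PGD_ENTRIES;
}

/* Write a top-level entry, mirroring userspace entries into the user copy. */
static void set_pgd(pgd_t *pgdp, pgd_t val)
{
	if (pgdp_maps_userspace(pgdp)) {
		*kernel_to_user_pgdp(pgdp) = val;
		/* Kernel's view of userspace is made non-executable. */
		if ((val & (PAGE_PRESENT | PAGE_USER)) ==
		    (PAGE_PRESENT | PAGE_USER))
			val |= PAGE_NX;
	}
	*pgdp = val;
}
```

Because only the top-level entry is copied, both PGDs point at the same
lower-level tables, which is why there is a single set of PTEs to manage.
Kernel-half entries are deliberately not mirrored in this model; the
entry/exit mappings in the user copy would be populated separately.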


2018-01-04 21:46:48

by Thomas Gleixner

Subject: Re: [PATCH] x86/doc: add PTI description

On Thu, 4 Jan 2018, Dave Hansen wrote:
>
> - pti= [X86_64]
> + pti= [X86_64] Disable Page Table Isolation of user and

That description is definitely wrong....

> + kernel address spaces. Disabling this feature
> + removes hardening, but improves performance of
> + system calls and interrupts.
> +
> Control user/kernel address space isolation:
> on - enable
> off - disable
> auto - default setting
>
> + nopti [X86_64]
> + Equivalent to pti=off
> +
> pty.legacy_count=
> [KNL] Number of legacy pty's. Overwrites compiled-in
> default number.

Thanks,

tglx

2018-01-05 00:06:34

by Kees Cook

Subject: Re: [PATCH] x86/doc: add PTI description

On Thu, Jan 4, 2018 at 12:54 PM, Dave Hansen
<[email protected]> wrote:
> [...]
> +For new userspace mappings, the kernel makes the entries in its
> +page tables like normal. The only difference is when the kernel
> +makes entries in the top (PGD) level. In addition to setting the
> +entry in the main kernel PGD, a copy of the entry is made in the
> +userspace page tables' PGD.

It might be worth noting that NX is set in the kernel's view of the
userspace page tables.

> [...]
> +1. Increased Memory Use
> + a. Each process now needs an order-1 PGD instead of order-0.
> + (Consumes 4k per process).

"Consumes an additional 4k per process" ?

> [...]
> + d. Process Context IDentifiers (PCID) is a CPU feature that
> + allows us to skip flushing the entire TLB when switching page
> + tables. This makes switching the page tables (at context
> + switch, or kernel entry/exit) cheaper. But, on systems with
> + PCID support, the context switch code must flush both the user
> + and kernel entries out of the TLB. The user PCID TLB flush is
> + deferred until the exit to userspace, minimizing the cost.

Does this mean it's possible to bypass the NX on userspace pages?

> [...]
> + g. On systems without PCID support, each CR3 write flushes
> + the entire TLB. That means that each syscall, interrupt
> + or exception flushes the TLB.

Is it worth clarifying this for hardware support of PCID vs INVPCID?

Otherwise, looks good!

Reviewed-by: Kees Cook <[email protected]>

-Kees

--
Kees Cook
Pixel Security

2018-01-05 00:21:48

by Dave Hansen

Subject: Re: [PATCH] x86/doc: add PTI description

On 01/04/2018 04:06 PM, Kees Cook wrote:
>> + d. Process Context IDentifiers (PCID) is a CPU feature that
>> + allows us to skip flushing the entire TLB when switching page
>> + tables. This makes switching the page tables (at context
>> + switch, or kernel entry/exit) cheaper. But, on systems with
>> + PCID support, the context switch code must flush both the user
>> + and kernel entries out of the TLB. The user PCID TLB flush is
>> + deferred until the exit to userspace, minimizing the cost.
>
> Does this mean it's possible to bypass the NX on userspace pages?

I'll clarify this. The write to CR3 happens, but bit 63 gets set to
tell the CPU not to flush the TLB on the CR3 write.
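
Roughly, and only as a sketch: with PCID enabled, the low 12 bits of the
value written to CR3 carry the PCID, and setting bit 63 of that value asks
the CPU to keep the TLB entries tagged with that PCID. The helper and macro
names below are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the layout assumes CR4.PCIDE=1. */
#define CR3_PCID_MASK    0xFFFULL                /* bits 0-11: PCID */
#define CR3_ADDR_MASK    0x000FFFFFFFFFF000ULL   /* bits 12-51: PGD address */
#define CR3_PCID_NOFLUSH (1ULL << 63)            /* keep this PCID's TLB entries */

/* Compose the value for a CR3 write from PGD address, PCID and flush policy. */
static uint64_t build_cr3(uint64_t pgd_pa, uint16_t pcid, int noflush)
{
	uint64_t cr3 = (pgd_pa & CR3_ADDR_MASK) | (pcid & CR3_PCID_MASK);

	if (noflush)
		cr3 |= CR3_PCID_NOFLUSH;
	return cr3;
}
```

So the kernel entry/exit switch would use noflush=1 in the common case, and
noflush=0 only when that PCID's entries actually need to be invalidated.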

>> [...]
>> + g. On systems without PCID support, each CR3 write flushes
>> + the entire TLB. That means that each syscall, interrupt
>> + or exception flushes the TLB.
>
> Is it worth clarifying this for hardware support of PCID vs INVPCID?

I'll make changes based on the rest of your comments. Thanks for taking
a look!