This patch set adds support for the Memory Protection eXtensions
(MPX) feature found in future Intel processors. MPX is used in
conjunction with compiler changes to check memory references, and
can be used to catch buffer overflow or underflow.
For MPX to work, changes are required in the kernel, binutils and
compiler. No source changes are required for applications, just a
recompile.
There are a lot of moving parts that have to work together to make this all work right:
===== Example Compiler / Application / Kernel Interaction =====
1. Application developer compiles with -fmpx. The compiler will add the
instrumentation as well as some setup code called early after the app
starts. New instruction prefixes are noops for old CPUs.
2. That setup code allocates (virtual) space for the "bounds directory",
points the "bndcfgu" register to the directory and notifies the
kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will
be using MPX.
3. The kernel detects that the CPU has MPX, allows the new prctl() to
succeed, and notes the location of the bounds directory. Userspace is
expected to keep the bounds directory at that location. We note it
instead of reading it each time because the 'xsave' operation needed
to access the bounds directory register is an expensive operation.
4. If the application needs to spill bounds out of the 4 registers, it
issues a bndstx instruction. Since the bounds directory is empty at
this point, a bounds fault (#BR) is raised, the kernel allocates a
bounds table (in the user address space) and makes the relevant
entry in the bounds directory point to the new table. [1]
5. If the application violates the bounds specified in the bounds
registers, a separate kind of #BR is raised which will deliver a
signal with information about the violation in the 'struct siginfo'.
6. Whenever memory is freed, we know that it can no longer contain
valid pointers, and we attempt to free the associated space in the
bounds tables. If an entire table becomes unused, we will attempt
to free the table and remove the entry in the directory.
To summarize, there are essentially three things interacting here:
GCC with -fmpx:
* enables annotation of code with MPX instructions and prefixes
* inserts code early in the application to call in to the "gcc runtime"
GCC MPX Runtime:
* Checks for hardware MPX support in cpuid leaf
* allocates virtual space for the bounds directory (malloc()
essentially)
* points the hardware BNDCFGU register at the directory
* calls a new prctl() to notify the kernel to start managing the
bounds directories
Kernel MPX Code:
* Checks for hardware MPX support in cpuid leaf
* Handles #BR exceptions and sends SIGSEGV to the app when it violates
bounds, like during a buffer overflow.
* When bounds are spilled in to an unallocated bounds table, the kernel
notices in the #BR exception, allocates the virtual space, then
updates the bounds directory to point to the new table. It keeps
special track of the memory with a specific ->vm_ops for MPX.
* Frees unused bounds tables at the time that the memory they described
is unmapped. (See "cleanup unused bound tables")
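To make the hand-off concrete, here is a minimal, hypothetical sketch of the
userspace side of steps 2 and 3 above. It assumes the prctl() numbers
introduced by this series and glosses over the CPUID check and the XSAVE-area
setup that the real GCC runtime performs; it is not the actual runtime code.

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

/* prctl numbers as introduced by this patch set */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT	43
#define PR_MPX_DISABLE_MANAGEMENT	44
#endif

int main(void)
{
	/*
	 * A real runtime would first check CPUID for MPX, allocate the
	 * bounds directory, and point bndcfgu at it (via the XSAVE area)
	 * before making this call.
	 */
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0)) {
		/* fails (ENXIO) if the CPU lacks MPX or bndcfgu is not enabled */
		fprintf(stderr, "MPX management not enabled: %s\n",
			strerror(errno));
		return 1;
	}
	printf("kernel is now managing bounds tables\n");
	return 0;
}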
===== Testing =====
This patchset has been tested on a real internal hardware platform at Intel.
We have some simple unit tests in user space which directly call MPX
instructions to produce #BR faults, letting the kernel allocate bounds tables
and causing bounds violations. We also compiled several benchmarks with an
MPX-enabled compiler and ran them with this patch set. These tests uncovered
a number of bugs in this code.
1. For more info on why the kernel does these allocations, see the patch
"on-demand kernel allocation of bounds tables"
Future TODO items:
1) support 32-bit binaries on 64-bit kernels.
Changes since v1:
* check to see if #BR occurred in userspace or kernel space.
* use generic structures and macros as much as possible when
decoding MPX instructions.
Changes since v2:
* fix some compile warnings.
* update documentation.
Changes since v3:
* correct some syntax errors in the documentation, and document the
extended struct siginfo.
* kill the process when the error code of BNDSTATUS is 3.
* add some comments.
* remove new prctl() commands.
* fix some compile warnings for 32-bit.
Changes since v4:
* raise SIGBUS if the allocations of the bound tables fail.
Changes since v5:
* hook the unmap() path to clean up unused bounds tables, and use a
new prctl() command to register the bounds directory address in
struct mm_struct, so we can check whether a process is MPX enabled
during unmap().
* in order to precisely track MPX memory usage, add an MPX-specific
mmap interface and a VM_MPX flag to check whether a VMA is an MPX
bounds table.
* add macro cpu_has_mpx to do performance optimization.
* sync struct siginfo for mips with the generic version to avoid a
build issue.
Changes since v6:
* because arch_vma_name() was removed, this patchset has to set an
MPX-specific ->vm_ops to do the same thing.
* fix warnings for 32 bit arch.
* add more description into these patches.
Changes since v7:
* introduce VM_ARCH_2 flag.
* remove all of the pr_debug()s.
* fix prctl numbers in documentation.
* fix some bugs on bounds tables freeing.
Changes since v8:
* add new patch to rename cfg_reg_u and status_reg.
* add new patch to use disabled features from Dave's patches.
* add new patch to sync struct siginfo for IA64.
* rename two new prctl() commands to PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT, check whether the management of bounds
tables in kernel is enabled at #BR fault time, and add locking to
protect the access to 'bd_addr'.
* update the documentation file to add more content about on-demand
allocation of bounds tables, etc.
Qiaowei Ren (12):
x86, mpx: introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: rename cfg_reg_u and status_reg
x86, mpx: add MPX specific mmap interface
x86, mpx: add MPX to disabled features
x86, mpx: on-demand kernel allocation of bounds tables
mpx: extend siginfo structure to include bound violation information
mips: sync struct siginfo with general version
ia64: sync struct siginfo with general version
x86, mpx: decode MPX instruction to get bound violation information
x86, mpx: add prctl commands PR_MPX_ENABLE_MANAGEMENT,
PR_MPX_DISABLE_MANAGEMENT
x86, mpx: cleanup unused bound tables
x86, mpx: add documentation on Intel MPX
Documentation/x86/intel_mpx.txt | 245 +++++++++++++++
arch/ia64/include/uapi/asm/siginfo.h | 8 +-
arch/mips/include/uapi/asm/siginfo.h | 4 +
arch/x86/Kconfig | 4 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/mmu_context.h | 25 ++
arch/x86/include/asm/mpx.h | 101 ++++++
arch/x86/include/asm/processor.h | 22 ++-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/mpx.c | 488 ++++++++++++++++++++++++++++++
arch/x86/kernel/setup.c | 8 +
arch/x86/kernel/traps.c | 86 ++++++-
arch/x86/mm/Makefile | 2 +
arch/x86/mm/mpx.c | 385 +++++++++++++++++++++++
fs/exec.c | 2 +
fs/proc/task_mmu.c | 1 +
include/asm-generic/mmu_context.h | 11 +
include/linux/mm.h | 6 +
include/linux/mm_types.h | 3 +
include/uapi/asm-generic/siginfo.h | 9 +-
include/uapi/linux/prctl.h | 6 +
kernel/signal.c | 4 +
kernel/sys.c | 12 +
mm/mmap.c | 2 +
24 files changed, 1436 insertions(+), 7 deletions(-)
create mode 100644 Documentation/x86/intel_mpx.txt
create mode 100644 arch/x86/include/asm/mpx.h
create mode 100644 arch/x86/kernel/mpx.c
create mode 100644 arch/x86/mm/mpx.c
We have to do the allocation of bounds tables in the kernel (see the patch
"on-demand kernel allocation of bounds tables"). Moreover, if we want
to track MPX VMAs we need to be able to stick the new VM_MPX flag and an
MPX-specific vm_ops in the vm_area_struct.
But there is no suitable interface to do this in the current kernel.
Existing interfaces, like do_mmap_pgoff(), can not set a specific
->vm_ops in the vm_area_struct when a VMA is created. So, this patch
adds an MPX-specific mmap interface to do the allocation of bounds tables.
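For reference, the expected caller of mpx_mmap() is the #BR fault path added
later in this series. The sketch below is simplified from the "on-demand
kernel allocation of bounds tables" patch (the real code uses a cmpxchg on
the directory entry to handle races); it is illustrative, not the exact code.

/* Sketch: how the #BR handler is expected to use mpx_mmap() */
static int allocate_bt_sketch(long __user *bd_entry)
{
	unsigned long bt_addr;

	/* carve a bounds table out of the process address space */
	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
	if (IS_ERR_VALUE(bt_addr))
		return bt_addr;

	/* mark the table valid and hook it into the directory entry */
	bt_addr |= MPX_BD_ENTRY_VALID_FLAG;
	return put_user(bt_addr, bd_entry);
}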
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/Kconfig | 4 ++
arch/x86/include/asm/mpx.h | 38 +++++++++++++++++++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/mpx.c | 79 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 123 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/mpx.h
create mode 100644 arch/x86/mm/mpx.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4b663e1..e5bcc70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -243,6 +243,10 @@ config HAVE_INTEL_TXT
def_bool y
depends on INTEL_IOMMU && ACPI
+config X86_INTEL_MPX
+ def_bool y
+ depends on CPU_SUP_INTEL
+
config X86_32_SMP
def_bool y
depends on X86_32 && SMP
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
new file mode 100644
index 0000000..5725ac4
--- /dev/null
+++ b/arch/x86/include/asm/mpx.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_X86_MPX_H
+#define _ASM_X86_MPX_H
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef CONFIG_X86_64
+
+/* upper 28 bits [47:20] of the virtual address in 64-bit used to
+ * index into bounds directory (BD).
+ */
+#define MPX_BD_ENTRY_OFFSET 28
+#define MPX_BD_ENTRY_SHIFT 3
+/* bits [19:3] of the virtual address in 64-bit used to index into
+ * bounds table (BT).
+ */
+#define MPX_BT_ENTRY_OFFSET 17
+#define MPX_BT_ENTRY_SHIFT 5
+#define MPX_IGN_BITS 3
+
+#else
+
+#define MPX_BD_ENTRY_OFFSET 20
+#define MPX_BD_ENTRY_SHIFT 2
+#define MPX_BT_ENTRY_OFFSET 10
+#define MPX_BT_ENTRY_SHIFT 4
+#define MPX_IGN_BITS 2
+
+#endif
+
+#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
+#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+
+#define MPX_BNDSTA_ERROR_CODE 0x3
+
+unsigned long mpx_mmap(unsigned long len);
+
+#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 6a19ad9..ecfdc46 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -30,3 +30,5 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
new file mode 100644
index 0000000..e1b28e6
--- /dev/null
+++ b/arch/x86/mm/mpx.c
@@ -0,0 +1,79 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <asm/mpx.h>
+#include <asm/mman.h>
+#include <linux/sched/sysctl.h>
+
+static const char *mpx_mapping_name(struct vm_area_struct *vma)
+{
+ return "[mpx]";
+}
+
+static struct vm_operations_struct mpx_vma_ops = {
+ .name = mpx_mapping_name,
+};
+
+/*
+ * This is really a simplified "vm_mmap". It only handles MPX
+ * related maps, including bounds tables and the bounds directory.
+ *
+ * Here we can stick the new vm_flag VM_MPX in the vm_area_struct
+ * when creating a bounds table or bounds directory, in order to
+ * track MPX-specific memory.
+ */
+unsigned long mpx_mmap(unsigned long len)
+{
+ unsigned long ret;
+ unsigned long addr, pgoff;
+ struct mm_struct *mm = current->mm;
+ vm_flags_t vm_flags;
+ struct vm_area_struct *vma;
+
+ /* Only bounds table and bounds directory can be allocated here */
+ if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
+ return -EINVAL;
+
+ down_write(&mm->mmap_sem);
+
+ /* Too many mappings? */
+ if (mm->map_count > sysctl_max_map_count) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Obtain the address to map to. we verify (or select) it and ensure
+ * that it represents a valid section of the address space.
+ */
+ addr = get_unmapped_area(NULL, 0, len, 0, MAP_ANONYMOUS | MAP_PRIVATE);
+ if (addr & ~PAGE_MASK) {
+ ret = addr;
+ goto out;
+ }
+
+ vm_flags = VM_READ | VM_WRITE | VM_MPX |
+ mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+
+ /* Set pgoff according to addr for anon_vma */
+ pgoff = addr >> PAGE_SHIFT;
+
+ ret = mmap_region(NULL, addr, len, vm_flags, pgoff);
+ if (IS_ERR_VALUE(ret))
+ goto out;
+
+ vma = find_vma(mm, ret);
+ if (!vma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ vma->vm_ops = &mpx_vma_ops;
+
+ if (vm_flags & VM_LOCKED) {
+ up_write(&mm->mmap_sem);
+ mm_populate(ret, len);
+ return ret;
+ }
+
+out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
--
1.7.1
This patch adds the Documentation/x86/intel_mpx.txt file with some
information about Intel MPX.
Signed-off-by: Qiaowei Ren <[email protected]>
---
Documentation/x86/intel_mpx.txt | 245 +++++++++++++++++++++++++++++++++++++++
1 files changed, 245 insertions(+), 0 deletions(-)
create mode 100644 Documentation/x86/intel_mpx.txt
diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt
new file mode 100644
index 0000000..3c20a17
--- /dev/null
+++ b/Documentation/x86/intel_mpx.txt
@@ -0,0 +1,245 @@
+1. Intel(R) MPX Overview
+========================
+
+Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
+introduced into Intel Architecture. Intel MPX provides hardware features
+that can be used in conjunction with compiler changes to check memory
+references, for those references whose compile-time normal intentions are
+usurped at runtime due to buffer overflow or underflow.
+
+For more information, please refer to Intel(R) Architecture Instruction
+Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
+Extensions.
+
+Note: Currently no hardware with MPX ISA is available but it is always
+possible to use SDE (Intel(R) Software Development Emulator) instead, which
+can be downloaded from
+http://software.intel.com/en-us/articles/intel-software-development-emulator
+
+
+2. How to get the advantage of MPX
+==================================
+
+For MPX to work, changes are required in the kernel, binutils and compiler.
+No source changes are required for applications, just a recompile.
+
+There are a lot of moving parts that have to work together to make this
+all work right. The following is how we expect the compiler, application
+and kernel to work together.
+
+1) Application developer compiles with -fmpx. The compiler will add the
+ instrumentation as well as some setup code called early after the app
+ starts. New instruction prefixes are noops for old CPUs.
+2) That setup code allocates (virtual) space for the "bounds directory",
+ points the "bndcfgu" register to the directory and notifies the kernel
+ (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will be using
+ MPX.
+3) The kernel detects that the CPU has MPX, allows the new prctl() to
+ succeed, and notes the location of the bounds directory. Userspace is
+ expected to keep the bounds directory at that location. We note it
+ instead of reading it each time because the 'xsave' operation needed
+ to access the bounds directory register is an expensive operation.
+4) If the application needs to spill bounds out of the 4 registers, it
+ issues a bndstx instruction. Since the bounds directory is empty at
+ this point, a bounds fault (#BR) is raised, the kernel allocates a
+ bounds table (in the user address space) and makes the relevant entry
+ in the bounds directory point to the new table.
+5) If the application violates the bounds specified in the bounds registers,
+ a separate kind of #BR is raised which will deliver a signal with
+ information about the violation in the 'struct siginfo'.
+6) Whenever memory is freed, we know that it can no longer contain valid
+ pointers, and we attempt to free the associated space in the bounds
+ tables. If an entire table becomes unused, we will attempt to free
+ the table and remove the entry in the directory.
+
+To summarize, there are essentially three things interacting here:
+
+GCC with -fmpx:
+ * enables annotation of code with MPX instructions and prefixes
+ * inserts code early in the application to call in to the "gcc runtime"
+GCC MPX Runtime:
+ * Checks for hardware MPX support in cpuid leaf
+ * allocates virtual space for the bounds directory (malloc() essentially)
+ * points the hardware BNDCFGU register at the directory
+ * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
+ start managing the bounds directories
+Kernel MPX Code:
+ * Checks for hardware MPX support in cpuid leaf
+ * Handles #BR exceptions and sends SIGSEGV to the app when it violates
+ bounds, like during a buffer overflow.
+ * When bounds are spilled in to an unallocated bounds table, the kernel
+ notices in the #BR exception, allocates the virtual space, then
+ updates the bounds directory to point to the new table. It keeps
+ special track of the memory with a VM_MPX flag.
+ * Frees unused bounds tables at the time that the memory they described
+ is unmapped.
+
+
+3. How does MPX kernel code work
+================================
+
+Handling #BR faults caused by MPX
+---------------------------------
+
+When MPX is enabled, there are 2 new situations that can generate
+#BR faults.
+ * new bounds tables (BT) need to be allocated to save bounds.
+ * bounds violation caused by MPX instructions.
+
+We hook #BR handler to handle these two new situations.
+
+On-demand kernel allocation of bounds tables
+--------------------------------------------
+
+MPX only has 4 hardware registers for storing bounds information. If
+MPX-enabled code needs more than these 4 registers, it needs to spill
+them somewhere. It has two special instructions for this which allow
+the bounds to be moved between the bounds registers and some new "bounds
+tables".
+
+#BR exceptions are a new class of exceptions just for MPX. They are
+similar conceptually to a page fault and will be raised by the MPX
+hardware during both bounds violations or when the tables are not
+present. The kernel handles those #BR exceptions for not-present tables
+by carving the space out of the normal processes address space and then
+pointing the bounds-directory over to it.
+
+The tables need to be accessed and controlled by userspace because
+the instructions for moving bounds in and out of them are extremely
+frequent. They potentially happen every time a register points to
+memory. Any direct kernel involvement (like a syscall) to access the
+tables would obviously destroy performance.
+
+Why not do this in userspace? MPX does not strictly require anything in
+the kernel. It can theoretically be done completely from userspace. Here
+are a few ways this could be done. We don't think any of them are practical
+in the real-world, but here they are.
+
+Q: Can virtual space simply be reserved for the bounds tables so that we
+ never have to allocate them?
+A: MPX-enabled application will possibly create a lot of bounds tables in
+ process address space to save bounds information. These tables can take
+ up huge swaths of memory (as much as 80% of the memory on the system)
+ even if we clean them up aggressively. In the worst-case scenario, the
+ tables can be 4x the size of the data structure being tracked. IOW, a
+ 1-page structure can require 4 bounds-table pages. An X-GB virtual
+ area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
+ If we were to preallocate them for the 128TB of user virtual address
+ space, we would need to reserve 512TB+2GB, which is larger than the
+ entire virtual address space today. This means they can not be reserved
+ ahead of time. Also, a single process's pre-populated bounds directory
+ consumes 2GB of virtual *AND* physical memory. IOW, it's completely
+ infeasible to prepopulate bounds directories.
+
+Q: Can we preallocate bounds table space at the same time memory is
+ allocated which might contain pointers that might eventually need
+ bounds tables?
+A: This would work if we could hook the site of each and every memory
+ allocation syscall. This can be done for small, constrained applications.
+ But, it isn't practical at a larger scale since a given app has no
+ way of controlling how all the parts of the app might allocate memory
+ (think libraries). The kernel is really the only place to intercept
+ these calls.
+
+Q: Could a bounds fault be handed to userspace and the tables allocated
+ there in a signal handler instead of in the kernel?
+A: mmap() is not on the list of safe async handler functions and even
+ if mmap() would work it still requires locking or nasty tricks to
+ keep track of the allocation state there.
+
+Having ruled out all of the userspace-only approaches for managing
+bounds tables that we could think of, we create them on demand in
+the kernel.
+
+Decoding MPX instructions
+-------------------------
+
+If a #BR is generated due to a bounds violation caused by MPX, we
+need to decode the MPX instruction that caused it to get the violation
+address, and set this address into the extended struct siginfo.
+
+The _sigfault field of struct siginfo is extended as follows:
+
+87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
+88 struct {
+89 void __user *_addr; /* faulting insn/memory ref. */
+90 #ifdef __ARCH_SI_TRAPNO
+91 int _trapno; /* TRAP # which caused the signal */
+92 #endif
+93 short _addr_lsb; /* LSB of the reported address */
+94 struct {
+95 void __user *_lower;
+96 void __user *_upper;
+97 } _addr_bnd;
+98 } _sigfault;
+
+The '_addr' field refers to the violation address, and the new
+'_addr_bnd' field refers to the upper/lower bounds when a #BR is raised.
+
+Glibc will also be updated to support this new siginfo, so users
+can get the violation address and bounds when bounds violations occur.
+
+Cleanup unused bounds tables
+----------------------------
+
+When a BNDSTX instruction attempts to save bounds to a bounds directory
+entry marked as invalid, a #BR is generated. This is an indication that
+no bounds table exists for this entry. In this case the fault handler
+will allocate a new bounds table on demand.
+
+Since the kernel allocated those tables on-demand without userspace
+knowledge, it is also responsible for freeing them when the associated
+mappings go away.
+
+The solution for this issue is to hook do_munmap() to check whether the
+process is MPX enabled. If so, the bounds tables covered by the virtual
+address region being unmapped will also be freed.
+
+Adding new prctl commands
+-------------------------
+
+Two new prctl commands are added to enable and disable MPX bounds tables
+management in kernel.
+
+155 #define PR_MPX_ENABLE_MANAGEMENT 43
+156 #define PR_MPX_DISABLE_MANAGEMENT 44
+
+The runtime library in userspace is responsible for allocating the
+bounds directory, so the kernel has to use the XSAVE instruction to get
+the base of the bounds directory from the BNDCFG register.
+
+But XSAVE is expected to be very expensive. As a performance
+optimization, we get the base of the bounds directory during
+PR_MPX_ENABLE_MANAGEMENT and save it into struct mm_struct to be used
+in the future.
+
+
+4. Special rules
+================
+
+To make use of this facility and the kernel side of it, the following
+are some rules for what well-behaved userspace is expected to do.
+
+1) If userspace requires help from the kernel to manage bounds tables,
+it can not create bounds tables itself and point the bounds directory
+at them.
+
+When a #BR fault is produced due to an invalid entry, a bounds table will
+be created in the kernel on demand and the kernel will not forward this
+fault to userspace. So userspace can not receive #BR faults for invalid
+entries, and it is also unnecessary for users to create bounds tables
+themselves.
+
+Users can certainly allocate bounds tables and forcibly point the bounds
+directory at them through the XSAVE instruction, and then set the valid
+bit of a bounds entry to make it valid. But the kernel has no way to
+track the memory usage of these user-created bounds tables.
+
+2) Userspace can not take multiple bounds directory entries and point
+them at the same bounds table.
+
+The hardware allows users to do this. See "Intel(R) Architecture
+Instruction Set Extensions Programming Reference" (9.3.4) for more
+information.
+
+But if users do this, it would be possible for the kernel to unmap an
+in-use bounds table since it does not recognize sharing. So we will not
+support the case where multiple bounds directory entries point at the
+same bounds table.
--
1.7.1
This patch adds two prctl() commands to provide an explicit interaction
mechanism to enable or disable the management of bounds tables in the kernel,
including on-demand kernel allocation (See the patch "on-demand kernel
allocation of bounds tables") and cleanup (See the patch "cleanup unused
bound tables"). Applications do not strictly need the kernel to manage
bounds tables and we expect some applications to use MPX without taking
advantage of the kernel support. This means the kernel can not simply
infer whether an application needs bounds table management from the
MPX registers. prctl() is an explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace that it
requires the kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace no
longer wants the kernel's help. With PR_MPX_DISABLE_MANAGEMENT, the
kernel won't allocate or free bounds tables, even if the CPU supports
the MPX feature.
PR_MPX_ENABLE_MANAGEMENT will do an xsave, fetch the base address of
the bounds directory from the xsave buffer, and then cache it in the
new "bd_addr" field of struct mm_struct. PR_MPX_DISABLE_MANAGEMENT will
set "bd_addr" to an invalid address. We can then check "bd_addr" to
determine whether kernel management of bounds tables is enabled.
xsaves are expensive, so "bd_addr" is cached to reduce the number of
xsaves we have to do at munmap() time. But we still have to do an
xsave to get the value of BNDSTATUS at #BR fault time. In addition,
with this caching, userspace can't just move the bounds directory
around willy-nilly. For sane applications, the base address of the
bounds directory won't change, otherwise we would be in a world of
hurt. But we will still check at #BR fault time whether users have
changed it.
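As a usage note, the runtime side of these commands is expected to look
roughly like the sketch below (hypothetical, for illustration only):

	/* hand bounds-table management to the kernel; bndcfgu must already
	 * point at the bounds directory, or this fails with ENXIO */
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0))
		perror("PR_MPX_ENABLE_MANAGEMENT");

	/* ... later, tell the kernel to stop managing (and freeing) tables */
	if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0))
		perror("PR_MPX_DISABLE_MANAGEMENT");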
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 9 ++++
arch/x86/include/asm/mpx.h | 11 +++++
arch/x86/include/asm/processor.h | 18 +++++++
arch/x86/kernel/mpx.c | 88 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/setup.c | 8 +++
arch/x86/kernel/traps.c | 30 ++++++++++++-
arch/x86/mm/mpx.c | 25 +++-------
fs/exec.c | 2 +
include/asm-generic/mmu_context.h | 5 ++
include/linux/mm_types.h | 3 +
include/uapi/linux/prctl.h | 6 +++
kernel/sys.c | 12 +++++
12 files changed, 198 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 166af2a..e33ddb7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -10,6 +10,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
+#include <asm/mpx.h>
#ifndef CONFIG_PARAVIRT
#include <asm-generic/mm_hooks.h>
@@ -102,4 +103,12 @@ do { \
} while (0)
#endif
+static inline void arch_bprm_mm_init(struct mm_struct *mm,
+ struct vm_area_struct *vma)
+{
+#ifdef CONFIG_X86_INTEL_MPX
+ mm->bd_addr = MPX_INVALID_BOUNDS_DIR;
+#endif
+}
+
#endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 780af63..32f13f5 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -5,6 +5,12 @@
#include <asm/ptrace.h>
#include <asm/insn.h>
+/*
+ * NULL is theoretically a valid place to put the bounds
+ * directory, so point this at an invalid address.
+ */
+#define MPX_INVALID_BOUNDS_DIR ((void __user *)-1)
+
#ifdef CONFIG_X86_64
/* upper 28 bits [47:20] of the virtual address in 64-bit used to
@@ -43,6 +49,7 @@
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
#define MPX_BNDSTA_ERROR_CODE 0x3
+#define MPX_BNDCFG_ENABLE_FLAG 0x1
#define MPX_BD_ENTRY_VALID_FLAG 0x1
struct mpx_insn {
@@ -61,6 +68,10 @@ struct mpx_insn {
#define MAX_MPX_INSN_SIZE 15
+static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+ return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
+}
unsigned long mpx_mmap(unsigned long len);
#ifdef CONFIG_X86_INTEL_MPX
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 020142f..b35aefa 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -953,6 +953,24 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
extern int get_tsc_mode(unsigned long adr);
extern int set_tsc_mode(unsigned int val);
+/* Register/unregister a process' MPX related resource */
+#define MPX_ENABLE_MANAGEMENT(tsk) mpx_enable_management((tsk))
+#define MPX_DISABLE_MANAGEMENT(tsk) mpx_disable_management((tsk))
+
+#ifdef CONFIG_X86_INTEL_MPX
+extern int mpx_enable_management(struct task_struct *tsk);
+extern int mpx_disable_management(struct task_struct *tsk);
+#else
+static inline int mpx_enable_management(struct task_struct *tsk)
+{
+ return -EINVAL;
+}
+static inline int mpx_disable_management(struct task_struct *tsk)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_X86_INTEL_MPX */
+
extern u16 amd_get_nb_id(int cpu);
static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
index b7e4c0e..36df3a5 100644
--- a/arch/x86/kernel/mpx.c
+++ b/arch/x86/kernel/mpx.c
@@ -8,7 +8,78 @@
#include <linux/kernel.h>
#include <linux/syscalls.h>
+#include <linux/prctl.h>
#include <asm/mpx.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
+
+static __user void *task_get_bounds_dir(struct task_struct *tsk)
+{
+ struct xsave_struct *xsave_buf;
+
+ if (!cpu_feature_enabled(X86_FEATURE_MPX))
+ return MPX_INVALID_BOUNDS_DIR;
+
+ /*
+ * The bounds directory pointer is stored in a register
+ * only accessible if we first do an xsave.
+ */
+ fpu_xsave(&tsk->thread.fpu);
+ xsave_buf = &(tsk->thread.fpu.state->xsave);
+
+ /*
+ * Make sure the register looks valid by checking the
+ * enable bit.
+ */
+ if (!(xsave_buf->bndcsr.bndcfgu & MPX_BNDCFG_ENABLE_FLAG))
+ return MPX_INVALID_BOUNDS_DIR;
+
+ /*
+ * Lastly, mask off the low bits used for configuration
+ * flags, and return the address of the bounds directory.
+ */
+ return (void __user *)(unsigned long)
+ (xsave_buf->bndcsr.bndcfgu & MPX_BNDCFG_ADDR_MASK);
+}
+
+int mpx_enable_management(struct task_struct *tsk)
+{
+ struct mm_struct *mm = tsk->mm;
+ void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
+ int ret = 0;
+
+ /*
+ * The runtime in userspace will be responsible for allocation of
+ * the bounds directory. Then, it will save the base of the bounds
+ * directory into the XSAVE/XRSTOR save area and enable MPX through
+ * the XRSTOR instruction.
+ *
+ * fpu_xsave() is expected to be very expensive. Storing the bounds
+ * directory here means that we do not have to do xsave in the unmap
+ * path; we can just use mm->bd_addr instead.
+ */
+ bd_base = task_get_bounds_dir(tsk);
+ down_write(&mm->mmap_sem);
+ mm->bd_addr = bd_base;
+ if (mm->bd_addr == MPX_INVALID_BOUNDS_DIR)
+ ret = -ENXIO;
+
+ up_write(&mm->mmap_sem);
+ return ret;
+}
+
+int mpx_disable_management(struct task_struct *tsk)
+{
+ struct mm_struct *mm = current->mm;
+
+ if (!cpu_feature_enabled(X86_FEATURE_MPX))
+ return -ENXIO;
+
+ down_write(&mm->mmap_sem);
+ mm->bd_addr = MPX_INVALID_BOUNDS_DIR;
+ up_write(&mm->mmap_sem);
+ return 0;
+}
enum reg_type {
REG_TYPE_RM = 0,
@@ -283,6 +354,9 @@ static unsigned long mpx_insn_decode(struct mpx_insn *insn,
* With 32-bit mode, MPX_BD_SIZE_BYTES is 4MB, and the size of each
* bounds table is 16KB. With 64-bit mode, MPX_BD_SIZE_BYTES is 2GB,
* and the size of each bounds table is 4MB.
+ *
+ * This function will be called holding mmap_sem for write. And it
+ * will downgrade this write lock to read lock.
*/
static int allocate_bt(long __user *bd_entry)
{
@@ -304,6 +378,11 @@ static int allocate_bt(long __user *bd_entry)
bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
/*
+ * Accesses to the bounds directory can fault, so we
+ * need to downgrade the write lock to a read lock.
+ */
+ downgrade_write(&current->mm->mmap_sem);
+ /*
* Go poke the address of the new bounds table in to the
* bounds directory entry out in userspace memory. Note:
* we may race with another CPU instantiating the same table.
@@ -351,6 +430,15 @@ int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
unsigned long bd_entry, bd_base;
bd_base = xsave_buf->bndcsr.bndcfgu & MPX_BNDCFG_ADDR_MASK;
+
+ /*
+ * Make sure the bounds directory being pointed to by the
+ * configuration register agrees with the place userspace
+ * told us it was going to be. Otherwise, this -EINVAL return
+ * will cause a SIGSEGV.
+ */
+ if (bd_base != (unsigned long)current->mm->bd_addr)
+ return -EINVAL;
status = xsave_buf->bndcsr.bndstatus;
/*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 41ead8d..8a58c98 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -110,6 +110,7 @@
#include <asm/mce.h>
#include <asm/alternative.h>
#include <asm/prom.h>
+#include <asm/mpx.h>
/*
* max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -950,6 +951,13 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
+#ifdef CONFIG_X86_INTEL_MPX
+ /*
+ * NULL is theoretically a valid place to put the bounds
+ * directory, so point this at an invalid address.
+ */
+ init_mm.bd_addr = MPX_INVALID_BOUNDS_DIR;
+#endif
code_resource.start = __pa_symbol(_text);
code_resource.end = __pa_symbol(_etext)-1;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b2a916b..5e5b299 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -285,6 +285,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
struct xsave_struct *xsave_buf;
struct task_struct *tsk = current;
siginfo_t info;
+ int ret = 0;
prev_state = exception_enter();
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
@@ -312,8 +313,35 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
*/
switch (status & MPX_BNDSTA_ERROR_CODE) {
case 2: /* Bound directory has invalid entry. */
- if (do_mpx_bt_fault(xsave_buf))
+ down_write(&current->mm->mmap_sem);
+ /*
+ * Userspace never asked us to manage the bounds tables,
+ * so refuse to help.
+ */
+ if (!kernel_managing_mpx_tables(current->mm)) {
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
+ error_code, NULL);
+ up_write(&current->mm->mmap_sem);
+ goto exit;
+ }
+
+ ret = do_mpx_bt_fault(xsave_buf);
+ if (!ret || ret == -EFAULT) {
+ /*
+ * Either we successfully handled the bounds table
+ * fault, or the cmpxchg that updates the bounds
+ * directory entry failed.
+ *
+ * In either case, the write lock has been downgraded
+ * to a read lock in allocate_bt(), called from
+ * do_mpx_bt_fault().
+ */
+ up_read(&current->mm->mmap_sem);
+ goto exit;
+ }
+ if (ret)
force_sig(SIGSEGV, tsk);
+ up_write(&current->mm->mmap_sem);
break;
case 1: /* Bound violation. */
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e1b28e6..376f2ee 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -33,22 +33,16 @@ unsigned long mpx_mmap(unsigned long len)
if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
return -EINVAL;
- down_write(&mm->mmap_sem);
-
/* Too many mappings? */
- if (mm->map_count > sysctl_max_map_count) {
- ret = -ENOMEM;
- goto out;
- }
+ if (mm->map_count > sysctl_max_map_count)
+ return -ENOMEM;
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
addr = get_unmapped_area(NULL, 0, len, 0, MAP_ANONYMOUS | MAP_PRIVATE);
- if (addr & ~PAGE_MASK) {
- ret = addr;
- goto out;
- }
+ if (addr & ~PAGE_MASK)
+ return addr;
vm_flags = VM_READ | VM_WRITE | VM_MPX |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
@@ -58,22 +52,17 @@ unsigned long mpx_mmap(unsigned long len)
ret = mmap_region(NULL, addr, len, vm_flags, pgoff);
if (IS_ERR_VALUE(ret))
- goto out;
+ return ret;
vma = find_vma(mm, ret);
- if (!vma) {
- ret = -ENOMEM;
- goto out;
- }
+ if (!vma)
+ return -ENOMEM;
vma->vm_ops = &mpx_vma_ops;
if (vm_flags & VM_LOCKED) {
up_write(&mm->mmap_sem);
mm_populate(ret, len);
- return ret;
}
-out:
- up_write(&mm->mmap_sem);
return ret;
}
diff --git a/fs/exec.c b/fs/exec.c
index a2b42a9..16d1606 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -60,6 +60,7 @@
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
#include <asm/tlb.h>
+#include <asm/mpx.h>
#include <trace/events/task.h>
#include "internal.h"
@@ -277,6 +278,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
goto err;
mm->stack_vm = mm->total_vm = 1;
+ arch_bprm_mm_init(mm, vma);
up_write(&mm->mmap_sem);
bprm->p = vma->vm_end - sizeof(void *);
return 0;
diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index a7eec91..1f2a8f9 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -42,4 +42,9 @@ static inline void activate_mm(struct mm_struct *prev_mm,
{
}
+static inline void arch_bprm_mm_init(struct mm_struct *mm,
+ struct vm_area_struct *vma)
+{
+}
+
#endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..760aee3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -454,6 +454,9 @@ struct mm_struct {
bool tlb_flush_pending;
#endif
struct uprobes_state uprobes_state;
+#ifdef CONFIG_X86_INTEL_MPX
+ void __user *bd_addr; /* address of the bounds directory */
+#endif
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04..b7a8cf4 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,10 @@
#define PR_SET_THP_DISABLE 41
#define PR_GET_THP_DISABLE 42
+/*
+ * Tell the kernel to start/stop helping userspace manage bounds tables.
+ */
+#define PR_MPX_ENABLE_MANAGEMENT 43
+#define PR_MPX_DISABLE_MANAGEMENT 44
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index b663664..4713585 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -91,6 +91,12 @@
#ifndef SET_TSC_CTL
# define SET_TSC_CTL(a) (-EINVAL)
#endif
+#ifndef MPX_ENABLE_MANAGEMENT
+# define MPX_ENABLE_MANAGEMENT(a) (-EINVAL)
+#endif
+#ifndef MPX_DISABLE_MANAGEMENT
+# define MPX_DISABLE_MANAGEMENT(a) (-EINVAL)
+#endif
/*
* this is where the system-wide overflow UID and GID are defined, for
@@ -2009,6 +2015,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
me->mm->def_flags &= ~VM_NOHUGEPAGE;
up_write(&me->mm->mmap_sem);
break;
+ case PR_MPX_ENABLE_MANAGEMENT:
+ error = MPX_ENABLE_MANAGEMENT(me);
+ break;
+ case PR_MPX_DISABLE_MANAGEMENT:
+ error = MPX_DISABLE_MANAGEMENT(me);
+ break;
default:
error = -EINVAL;
break;
--
1.7.1
There are two mappings in play: 1. The mapping with the actual data,
which userspace is munmap()ing or brk()ing away, etc... 2. The mapping
for the bounds table *backing* the data (tagged with mpx_vma_ops,
see the patch "add MPX specific mmap interface").
If userspace uses the prctl() introduced earlier in this patchset to
enable kernel management of bounds tables, then when it unmaps the
first kind of mapping (the one with the actual data), the kernel needs
to free the mapping for the bounds table backing that data. This patch
calls arch_unmap() at the very end of do_munmap() to do so. It walks
the bounds directory to look at the entries covered by the data VMA,
unmaps the bounds table referenced from the directory, and then clears
the directory entry.
Unmapping of bounds tables is called under vm_munmap() of the data VMA.
So we have to check ->vm_ops to prevent recursion. This recursion represents
having bounds tables for bounds tables, which should not occur normally.
Being strict about it here helps ensure that we do not have an exploitable
stack overflow.
Once we unmap the bounds table, we would have a bounds directory entry
pointing at empty address space. That address space could now be allocated
for some other (random) use, and the MPX hardware would then try to
walk it as if it were a bounds table. That would be bad. So any
unmapping of a bounds table has to be accompanied by a corresponding
write to the bounds directory entry marking it invalid. That write to
the bounds directory can fault.
Since we are doing the freeing from munmap() (and other paths like it),
we hold mmap_sem for write. If we fault, the page fault handler will
attempt to acquire mmap_sem for read and we will deadlock. For now, to
avoid deadlock, we disable page faults while touching the bounds directory
entry. This keeps us from being able to free the tables in this case.
This deficiency will be addressed in later patches.
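To make the table accounting concrete (the numbers follow from the
MPX_BD_*/MPX_BT_* macros in asm/mpx.h): on 64-bit, bits [47:20] of a
virtual address index the bounds directory, so each directory entry (and
the 4MB bounds table it points to) covers 1MB of data address space.
Unmapping, say, a 16MB data region therefore touches 16 or 17 directory
entries; the at-most-two "edge" tables may be shared with neighboring
VMAs and only have their backing pages zapped, while the fully-covered
tables in the middle can be unmapped outright and their directory entries
cleared.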
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 16 ++
arch/x86/include/asm/mpx.h | 9 +
arch/x86/mm/mpx.c | 317 ++++++++++++++++++++++++++++++++++++
include/asm-generic/mmu_context.h | 6 +
mm/mmap.c | 2 +
5 files changed, 350 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e33ddb7..2b52d1b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -111,4 +111,20 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
#endif
}
+static inline void arch_unmap(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+#ifdef CONFIG_X86_INTEL_MPX
+ /*
+ * Userspace never asked us to manage the bounds tables,
+ * so refuse to help.
+ */
+ if (!kernel_managing_mpx_tables(current->mm))
+ return;
+
+ mpx_notify_unmap(mm, vma, start, end);
+#endif
+}
+
#endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 32f13f5..a1a0155 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -48,6 +48,13 @@
#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+#define MPX_BD_ENTRY_MASK ((1<<MPX_BD_ENTRY_OFFSET)-1)
+#define MPX_BT_ENTRY_MASK ((1<<MPX_BT_ENTRY_OFFSET)-1)
+#define MPX_GET_BD_ENTRY_OFFSET(addr) ((((addr)>>(MPX_BT_ENTRY_OFFSET+ \
+ MPX_IGN_BITS)) & MPX_BD_ENTRY_MASK) << MPX_BD_ENTRY_SHIFT)
+#define MPX_GET_BT_ENTRY_OFFSET(addr) ((((addr)>>MPX_IGN_BITS) & \
+ MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
+
#define MPX_BNDSTA_ERROR_CODE 0x3
#define MPX_BNDCFG_ENABLE_FLAG 0x1
#define MPX_BD_ENTRY_VALID_FLAG 0x1
@@ -73,6 +80,8 @@ static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
}
unsigned long mpx_mmap(unsigned long len);
+void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
#ifdef CONFIG_X86_INTEL_MPX
int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 376f2ee..dcc6621 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -1,7 +1,16 @@
+/*
+ * mpx.c - Memory Protection eXtensions
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ * Qiaowei Ren <[email protected]>
+ * Dave Hansen <[email protected]>
+ */
+
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <asm/mpx.h>
#include <asm/mman.h>
+#include <asm/mmu_context.h>
#include <linux/sched/sysctl.h>
static const char *mpx_mapping_name(struct vm_area_struct *vma)
@@ -13,6 +22,11 @@ static struct vm_operations_struct mpx_vma_ops = {
.name = mpx_mapping_name,
};
+int is_mpx_vma(struct vm_area_struct *vma)
+{
+ return (vma->vm_ops == &mpx_vma_ops);
+}
+
/*
* this is really a simplified "vm_mmap". it only handles mpx
* related maps, including bounds table and bounds directory.
@@ -66,3 +80,306 @@ unsigned long mpx_mmap(unsigned long len)
return ret;
}
+
+/*
+ * Get the base of bounds tables pointed by specific bounds
+ * directory entry.
+ */
+static int get_bt_addr(long __user *bd_entry, unsigned long *bt_addr)
+{
+ int ret;
+ int valid;
+
+ if (!access_ok(VERIFY_READ, (bd_entry), sizeof(*bd_entry)))
+ return -EFAULT;
+
+ pagefault_disable();
+ ret = get_user(*bt_addr, bd_entry);
+ pagefault_enable();
+ if (ret)
+ return ret;
+
+ valid = *bt_addr & MPX_BD_ENTRY_VALID_FLAG;
+ *bt_addr &= MPX_BT_ADDR_MASK;
+
+ /*
+ * When the kernel is managing bounds tables, a bounds directory
+ * entry will either have a valid address (plus the valid bit)
+ * *OR* be completely empty. If we see a !valid entry *and* some
+ * data in the address field, we know something is wrong. This
+ * -EINVAL return will cause a SIGSEGV.
+ */
+ if (!valid && *bt_addr)
+ return -EINVAL;
+ /*
+ * Not present is OK. It just means there was no bounds table
+ * for this memory, which is completely OK. Make sure to distinguish
+ * this from -EINVAL, which will cause a SEGV.
+ */
+ if (!valid)
+ return -ENOENT;
+
+ return 0;
+}
+
+/*
+ * Free the backing physical pages of bounds table 'bt_addr'.
+ * Assume start...end is within that bounds table.
+ */
+static int zap_bt_entries(struct mm_struct *mm,
+ unsigned long bt_addr,
+ unsigned long start, unsigned long end)
+{
+ struct vm_area_struct *vma;
+ unsigned long addr, len;
+
+ /*
+ * Find the first overlapping vma. If vma->vm_start > start, there
+ * will be a hole in the bounds table. This -EINVAL return will
+ * cause a SIGSEGV.
+ */
+ vma = find_vma(mm, start);
+ if (!vma || vma->vm_start > start)
+ return -EINVAL;
+
+ /*
+ * A NUMA policy on a VM_MPX VMA could cause this bounds table to
+ * be split. So we need to look across the entire 'start -> end'
+ * range of this bounds table, find all of the VM_MPX VMAs, and
+ * zap only those.
+ */
+ addr = start;
+ while (vma && vma->vm_start < end) {
+ /*
+ * If one VMA which is not for MPX is found in the middle
+ * of this bounds table, this -EINVAL return will cause a
+ * SIGSEGV.
+ */
+ if (!(vma->vm_flags & VM_MPX))
+ return -EINVAL;
+
+ len = min(vma->vm_end, end) - addr;
+ zap_page_range(vma, addr, len, NULL);
+
+ vma = vma->vm_next;
+ addr = vma->vm_start;
+ }
+
+ return 0;
+}
+
+static int unmap_single_bt(struct mm_struct *mm,
+ long __user *bd_entry, unsigned long bt_addr)
+{
+ unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+ unsigned long actual_old_val = 0;
+ int ret;
+
+ /*
+ * The whole bounds tables freeing code will be called under
+ * mm->mmap_sem write-lock. So the value of bd_entry could not
+ * be changed during this period, and 'actual_old_val' will be
+ * equal to 'expected_old_val'.
+ */
+ pagefault_disable();
+ ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
+ expected_old_val, 0);
+ pagefault_enable();
+ if (ret)
+ return ret;
+
+ /*
+ * Note, we are likely being called under do_munmap() already. To
+ * avoid recursion, do_munmap() will check whether it comes
+ * from a bounds table through the VM_MPX flag.
+ */
+ return do_munmap(mm, bt_addr, MPX_BT_SIZE_BYTES);
+}
+
+/*
+ * If the bounds table pointed by bounds directory 'bd_entry' is
+ * not shared, unmap this whole bounds table. Otherwise, only free
+ * those backing physical pages of bounds table entries covered
+ * in this virtual address region start...end.
+ */
+static int unmap_shared_bt(struct mm_struct *mm,
+ long __user *bd_entry, unsigned long start,
+ unsigned long end, bool prev_shared, bool next_shared)
+{
+ unsigned long bt_addr;
+ int ret;
+
+ ret = get_bt_addr(bd_entry, &bt_addr);
+ if (ret)
+ return ret;
+
+ if (prev_shared && next_shared)
+ ret = zap_bt_entries(mm, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+ else if (prev_shared)
+ ret = zap_bt_entries(mm, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
+ bt_addr+MPX_BT_SIZE_BYTES);
+ else if (next_shared)
+ ret = zap_bt_entries(mm, bt_addr, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+ else
+ ret = unmap_single_bt(mm, bd_entry, bt_addr);
+
+ return ret;
+}
+
+/*
+ * A virtual address region being munmap()ed might share bounds table
+ * with adjacent VMAs. We only need to free the backing physical
+ * memory of these shared bounds tables entries covered in this virtual
+ * address region.
+ */
+static int unmap_edge_bts(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ int ret;
+ long __user *bde_start, *bde_end;
+ struct vm_area_struct *prev, *next;
+ bool prev_shared = false, next_shared = false;
+
+ bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
+ bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+
+ /*
+ * Check whether bde_start and bde_end are shared with adjacent
+ * VMAs.
+ *
+ * We already unlinked the VMAs from the mm's rbtree so 'start'
+ * is guaranteed to be in a hole. This gets us the first VMA
+ * before the hole into 'prev' and the next VMA after the hole
+ * into 'next'.
+ */
+ next = find_vma_prev(mm, start, &prev);
+ if (prev && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(prev->vm_end-1))
+ == bde_start)
+ prev_shared = true;
+ if (next && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(next->vm_start))
+ == bde_end)
+ next_shared = true;
+
+ /*
+ * This virtual address region being munmap()ed is only
+ * covered by one bounds table.
+ *
+ * In this case, if this table is also shared with adjacent
+ * VMAs, only part of the backing physical memory of the bounds
+ * table needs to be freed. Otherwise the whole bounds table
+ * needs to be unmapped.
+ */
+ if (bde_start == bde_end) {
+ return unmap_shared_bt(mm, bde_start, start, end,
+ prev_shared, next_shared);
+ }
+
+ /*
+ * If more than one bounds tables are covered in this virtual
+ * address region being munmap()ed, we need to separately check
+ * whether bde_start and bde_end are shared with adjacent VMAs.
+ */
+ ret = unmap_shared_bt(mm, bde_start, start, end, prev_shared, false);
+ if (ret)
+ return ret;
+
+ ret = unmap_shared_bt(mm, bde_end, start, end, false, next_shared);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static int mpx_unmap_tables_for(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ int ret;
+ long __user *bd_entry, *bde_start, *bde_end;
+ unsigned long bt_addr;
+
+ /*
+ * "Edge" bounds tables are those which are being used by the region
+ * (start -> end), but that may be shared with adjacent areas. If they
+ * turn out to be completely unshared, they will be freed. If they are
+ * shared, we will free the backing store (like an MADV_DONTNEED) for
+ * areas used by this region.
+ */
+ ret = unmap_edge_bts(mm, start, end);
+ if (ret == -EFAULT)
+ return ret;
+
+ /*
+ * Only unmap the bounds tables that are
+ * 1. fully covered
+ * 2. not at the edges of the mapping, even if fully aligned
+ */
+ bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
+ bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+ for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
+ ret = get_bt_addr(bd_entry, &bt_addr);
+ /*
+ * A fault means we have to drop mmap_sem,
+ * perform the fault, and retry this somehow.
+ */
+ if (ret == -EFAULT)
+ return ret;
+ /*
+ * Any other issue (like a bad bounds-directory)
+ * we can try the next one.
+ */
+ if (ret)
+ continue;
+
+ ret = unmap_single_bt(mm, bd_entry, bt_addr);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/*
+ * Free unused bounds tables covered in a virtual address region being
+ * munmap()ed. Assume end > start.
+ *
+ * This function will be called by do_munmap(), and the VMAs covering
+ * the virtual address region start...end have already been split if
+ * necessary, and the 'vma' is the first vma in this range (start -> end).
+ */
+void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ int ret;
+
+ /*
+ * This will look across the entire 'start -> end' range,
+ * and find all of the non-VM_MPX VMAs.
+ *
+ * To avoid recursion, if a VM_MPX vma is found in the range
+ * (start->end), we will not continue follow-up work. This
+ * recursion represents having bounds tables for bounds tables,
+ * which should not occur normally. Being strict about it here
+ * helps ensure that we do not have an exploitable stack overflow.
+ */
+ do {
+ if (vma->vm_flags & VM_MPX)
+ return;
+ vma = vma->vm_next;
+ } while (vma && vma->vm_start < end);
+
+ ret = mpx_unmap_tables_for(mm, start, end);
+ if (ret == -EFAULT) {
+ /*
+ * FIXME: If access to the bounds directory faults, we need
+ * to handle that outside of the region where mmap_sem is held.
+ */
+ return;
+ }
+
+ if (ret == -EINVAL)
+ force_sig(SIGSEGV, current);
+}
diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index 1f2a8f9..aa2d8ba 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -47,4 +47,10 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
{
}
+static inline void arch_unmap(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+}
+
#endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index c0a3637..6246437 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2580,6 +2580,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
detach_vmas_to_be_unmapped(mm, vma, prev, end);
unmap_region(mm, vma, prev, start, end);
+ arch_unmap(mm, vma, start, end);
+
/* Fix up all other VM information */
remove_vma_list(mm, vma);
--
1.7.1
This patch sets the bound violation fields of the siginfo struct in the
#BR exception handler by decoding the user instruction and constructing
the faulting pointer.
This patch doesn't use the generic decoder; it implements a limited,
special-purpose decoder for MPX instructions, simply because the generic
decoder is very heavyweight, not just in terms of performance but in
terms of interface -- because it has to be.
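For illustration, here is a hypothetical userspace consumer of this
information. It assumes the si_lower/si_upper accessor macros added by the
siginfo patch earlier in this series are visible to userspace; it is a
sketch, not part of the patch.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Handler for the SIGSEGV delivered on an MPX bounds violation. */
static void bounds_handler(int sig, siginfo_t *info, void *ctx)
{
	/* fprintf() is not async-signal-safe; fine for a demo only */
	fprintf(stderr, "bounds violation at %p, bounds [%p, %p]\n",
		info->si_addr, info->si_lower, info->si_upper);
	_exit(1);
}

static void install_bounds_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = bounds_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);
}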
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mpx.h | 23 ++++
arch/x86/kernel/mpx.c | 299 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 6 +
3 files changed, 328 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index b7598ac..780af63 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <asm/ptrace.h>
+#include <asm/insn.h>
#ifdef CONFIG_X86_64
@@ -44,15 +45,37 @@
#define MPX_BNDSTA_ERROR_CODE 0x3
#define MPX_BD_ENTRY_VALID_FLAG 0x1
+struct mpx_insn {
+ struct insn_field rex_prefix; /* REX prefix */
+ struct insn_field modrm;
+ struct insn_field sib;
+ struct insn_field displacement;
+
+ unsigned char addr_bytes; /* effective address size */
+ unsigned char limit;
+ unsigned char x86_64;
+
+ const unsigned char *kaddr; /* kernel address of insn to analyze */
+ const unsigned char *next_byte;
+};
+
+#define MAX_MPX_INSN_SIZE 15
+
unsigned long mpx_mmap(unsigned long len);
#ifdef CONFIG_X86_INTEL_MPX
int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
+void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf);
#else
static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
{
return -EINVAL;
}
+static inline void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf)
+{
+}
#endif /* CONFIG_X86_INTEL_MPX */
#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
index 2103b5e..b7e4c0e 100644
--- a/arch/x86/kernel/mpx.c
+++ b/arch/x86/kernel/mpx.c
@@ -10,6 +10,275 @@
#include <linux/syscalls.h>
#include <asm/mpx.h>
+enum reg_type {
+ REG_TYPE_RM = 0,
+ REG_TYPE_INDEX,
+ REG_TYPE_BASE,
+};
+
+static unsigned long get_reg(struct mpx_insn *insn, struct pt_regs *regs,
+ enum reg_type type)
+{
+ int regno = 0;
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+ unsigned char sib = (unsigned char)insn->sib.value;
+
+ static const int regoff[] = {
+ offsetof(struct pt_regs, ax),
+ offsetof(struct pt_regs, cx),
+ offsetof(struct pt_regs, dx),
+ offsetof(struct pt_regs, bx),
+ offsetof(struct pt_regs, sp),
+ offsetof(struct pt_regs, bp),
+ offsetof(struct pt_regs, si),
+ offsetof(struct pt_regs, di),
+#ifdef CONFIG_X86_64
+ offsetof(struct pt_regs, r8),
+ offsetof(struct pt_regs, r9),
+ offsetof(struct pt_regs, r10),
+ offsetof(struct pt_regs, r11),
+ offsetof(struct pt_regs, r12),
+ offsetof(struct pt_regs, r13),
+ offsetof(struct pt_regs, r14),
+ offsetof(struct pt_regs, r15),
+#endif
+ };
+
+ switch (type) {
+ case REG_TYPE_RM:
+ regno = X86_MODRM_RM(modrm);
+ if (X86_REX_B(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ case REG_TYPE_INDEX:
+ regno = X86_SIB_INDEX(sib);
+ if (X86_REX_X(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ case REG_TYPE_BASE:
+ regno = X86_SIB_BASE(sib);
+ if (X86_REX_B(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ default:
+ break;
+ }
+
+ return regs_get_register(regs, regoff[regno]);
+}
+
+/*
+ * Return the address being referenced by the instruction:
+ * for mod=3, return the content of the rm register;
+ * for mod!=3, calculate the address using SIB and displacement.
+ */
+static unsigned long get_addr_ref(struct mpx_insn *insn, struct pt_regs *regs)
+{
+ unsigned long addr;
+ unsigned long base;
+ unsigned long indx;
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+ unsigned char sib = (unsigned char)insn->sib.value;
+
+ if (X86_MODRM_MOD(modrm) == 3) {
+ addr = get_reg(insn, regs, REG_TYPE_RM);
+ } else {
+ if (insn->sib.nbytes) {
+ base = get_reg(insn, regs, REG_TYPE_BASE);
+ indx = get_reg(insn, regs, REG_TYPE_INDEX);
+ addr = base + indx * (1 << X86_SIB_SCALE(sib));
+ } else {
+ addr = get_reg(insn, regs, REG_TYPE_RM);
+ }
+ addr += insn->displacement.value;
+ }
+
+ return addr;
+}
+
+/* Verify next sizeof(t) bytes can be on the same instruction */
+#define validate_next(t, insn, n) \
+ ((insn)->next_byte + sizeof(t) + n - (insn)->kaddr <= (insn)->limit)
+
+#define __get_next(t, insn) \
+({ \
+ t r = *(t *)insn->next_byte; \
+ insn->next_byte += sizeof(t); \
+ r; \
+})
+
+#define __peek_next(t, insn) \
+({ \
+ t r = *(t *)insn->next_byte; \
+ r; \
+})
+
+#define get_next(t, insn) \
+({ \
+ if (unlikely(!validate_next(t, insn, 0))) \
+ goto err_out; \
+ __get_next(t, insn); \
+})
+
+#define peek_next(t, insn) \
+({ \
+ if (unlikely(!validate_next(t, insn, 0))) \
+ goto err_out; \
+ __peek_next(t, insn); \
+})
+
+static void mpx_insn_get_prefixes(struct mpx_insn *insn)
+{
+ unsigned char b;
+
+ /* Decode legacy prefix and REX prefix */
+ b = peek_next(unsigned char, insn);
+ while (b != 0x0f) {
+ /*
+ * look for a rex prefix
+ * a REX prefix cannot be followed by a legacy prefix.
+ */
+ if (insn->x86_64 && ((b&0xf0) == 0x40)) {
+ insn->rex_prefix.value = b;
+ insn->rex_prefix.nbytes = 1;
+ insn->next_byte++;
+ break;
+ }
+
+ /* check the other legacy prefixes */
+ switch (b) {
+ case 0xf2:
+ case 0xf3:
+ case 0xf0:
+ case 0x64:
+ case 0x65:
+ case 0x2e:
+ case 0x3e:
+ case 0x26:
+ case 0x36:
+ case 0x66:
+ case 0x67:
+ insn->next_byte++;
+ break;
+ default: /* everything else is garbage */
+ goto err_out;
+ }
+ b = peek_next(unsigned char, insn);
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_modrm(struct mpx_insn *insn)
+{
+ insn->modrm.value = get_next(unsigned char, insn);
+ insn->modrm.nbytes = 1;
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_sib(struct mpx_insn *insn)
+{
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+
+ if (X86_MODRM_MOD(modrm) != 3 && X86_MODRM_RM(modrm) == 4) {
+ insn->sib.value = get_next(unsigned char, insn);
+ insn->sib.nbytes = 1;
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_displacement(struct mpx_insn *insn)
+{
+ unsigned char mod, rm, base;
+
+ /*
+ * Interpreting the modrm byte:
+ * mod = 00 - no displacement fields (exceptions below)
+ * mod = 01 - 1-byte displacement field
+ * mod = 10 - displacement field is 4 bytes
+ * mod = 11 - no memory operand
+ *
+ * mod != 11, r/m = 100 - SIB byte exists
+ * mod = 00, SIB base = 101 - displacement field is 4 bytes
+ * mod = 00, r/m = 101 - rip-relative addressing, displacement
+ * field is 4 bytes
+ */
+ mod = X86_MODRM_MOD(insn->modrm.value);
+ rm = X86_MODRM_RM(insn->modrm.value);
+ base = X86_SIB_BASE(insn->sib.value);
+ if (mod == 3)
+ return;
+ if (mod == 1) {
+ insn->displacement.value = get_next(unsigned char, insn);
+ insn->displacement.nbytes = 1;
+ } else if ((mod == 0 && rm == 5) || mod == 2 ||
+ (mod == 0 && base == 5)) {
+ insn->displacement.value = get_next(int, insn);
+ insn->displacement.nbytes = 4;
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_init(struct mpx_insn *insn, struct pt_regs *regs)
+{
+ unsigned char buf[MAX_MPX_INSN_SIZE];
+ int bytes;
+
+ memset(insn, 0, sizeof(*insn));
+
+ bytes = copy_from_user(buf, (void __user *)regs->ip, MAX_MPX_INSN_SIZE);
+ insn->limit = MAX_MPX_INSN_SIZE - bytes;
+ insn->kaddr = buf;
+ insn->next_byte = buf;
+
+ /*
+ * In 64-bit Mode, all Intel MPX instructions use 64-bit
+ * operands for bounds and 64 bit addressing, i.e. REX.W &
+ * 67H have no effect on data or address size.
+ *
+ * In compatibility and legacy modes (including 16-bit code
+ * segments, real and virtual 8086 modes) all Intel MPX
+ * instructions use 32-bit operands for bounds and 32 bit
+ * addressing.
+ */
+#ifdef CONFIG_X86_64
+ insn->x86_64 = 1;
+ insn->addr_bytes = 8;
+#else
+ insn->x86_64 = 0;
+ insn->addr_bytes = 4;
+#endif
+}
+
+static unsigned long mpx_insn_decode(struct mpx_insn *insn,
+ struct pt_regs *regs)
+{
+ mpx_insn_init(insn, regs);
+
+ /*
+ * In this case, we only need to decode bndcl/bndcn/bndcu,
+ * so we can use these private disassembly helpers to get
+ * the prefixes, modrm, sib, displacement, etc.
+ */
+ mpx_insn_get_prefixes(insn);
+ insn->next_byte += 2; /* ignore opcode */
+ mpx_insn_get_modrm(insn);
+ mpx_insn_get_sib(insn);
+ mpx_insn_get_displacement(insn);
+
+ return get_addr_ref(insn, regs);
+}
+
/*
* With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
* bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
@@ -99,3 +368,33 @@ int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
return allocate_bt((long __user *)bd_entry);
}
+
+/*
+ * If a bounds overflow occurs then a #BR is generated. The fault
+ * handler decodes the faulting MPX instruction to get the violation
+ * address and stores this address in the extended struct siginfo.
+ */
+void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf)
+{
+ struct mpx_insn insn;
+ uint8_t bndregno;
+ unsigned long addr_vio;
+
+ addr_vio = mpx_insn_decode(&insn, regs);
+
+ bndregno = X86_MODRM_REG(insn.modrm.value);
+ if (bndregno > 3)
+ return;
+
+ /* Note: the upper 32 bits are ignored in 32-bit mode. */
+ info->si_lower = (void __user *)(unsigned long)
+ (xsave_buf->bndregs.bndregs[2*bndregno]);
+ info->si_upper = (void __user *)(unsigned long)
+ (~xsave_buf->bndregs.bndregs[2*bndregno+1]);
+ info->si_addr_lsb = 0;
+ info->si_signo = SIGSEGV;
+ info->si_errno = 0;
+ info->si_code = SEGV_BNDERR;
+ info->si_addr = (void __user *)addr_vio;
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 611b6ec..b2a916b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -284,6 +284,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
unsigned long status;
struct xsave_struct *xsave_buf;
struct task_struct *tsk = current;
+ siginfo_t info;
prev_state = exception_enter();
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
@@ -316,6 +317,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
break;
case 1: /* Bound violation. */
+ do_mpx_bounds(regs, &info, xsave_buf);
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
+ error_code, &info);
+ break;
+
case 0: /* No exception caused by Intel MPX operations. */
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
break;
--
1.7.1
MPX-enabled applications using large swaths of memory can potentially
have large numbers of bounds tables in process address space to save
bounds information. These tables can take up huge swaths of memory
(as much as 80% of the memory on the system) even if we clean them
up aggressively. In the worst-case scenario, the tables can be 4x the
size of the data structure being tracked. IOW, a 1-page structure can
require 4 bounds-table pages.
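To make that worst case concrete, here is the arithmetic (an editorial sketch
assuming 64-bit mode; the 32-byte entry size comes from the four 8-byte fields
of a bounds-table entry and matches MPX_BT_ENTRY_SHIFT=5 in the headers):

	/* Illustrative worst-case math, 64-bit mode (not code from this series). */
	#define PTR_BYTES	8	/* one pointer per 8-byte slot           */
	#define BT_ENTRY_BYTES	32	/* lower, upper, pointer value, reserved */

	/* 4096 / 8 * 32 = 16384 bytes: one page of pointers can need
	 * four bounds-table pages. */
	#define WORST_CASE_BT_BYTES(data_bytes) \
		((data_bytes) / PTR_BYTES * BT_ENTRY_BYTES)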
Being this huge, our expectation is that folks using MPX are going to
be keen on figuring out how much memory is being dedicated to it. So
we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting merged
with normal VMAs. A new VM_ flag set only on MPX VMAs does both of
those things. With this flag, MPX bounds-table VMAs can be distinguished
from other VMAs, and userspace can also walk /proc/$pid/smaps to get
memory usage for MPX.
In addition to this flag, we also introduce a specific ->vm_ops for MPX
VMAs (see the patch "add MPX specific mmap interface"), but VMAs with
different ->vm_ops are not currently prevented from merging. We
understand that VM_ flags are scarce and are open to other options.
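For illustration, userspace could then spot MPX bounds-table VMAs by scanning
for the new "mp" flag in the VmFlags lines of /proc/$pid/smaps. A minimal
sketch (editorial, not part of this series; a real parser should tokenize the
flag list rather than substring-match):

	#include <stdio.h>
	#include <string.h>

	/* Count smaps entries whose VmFlags line contains the "mp" flag. */
	int count_mpx_vmas(const char *smaps_path)
	{
		FILE *f = fopen(smaps_path, "r");
		char line[512];
		int count = 0;

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "VmFlags:", 8) && strstr(line, " mp"))
				count++;
		}
		fclose(f);
		return count;
	}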
Signed-off-by: Qiaowei Ren <[email protected]>
---
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 6 ++++++
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..cc31520 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -549,6 +549,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_GROWSDOWN)] = "gd",
[ilog2(VM_PFNMAP)] = "pf",
[ilog2(VM_DENYWRITE)] = "dw",
+ [ilog2(VM_MPX)] = "mp",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..942be8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -127,6 +127,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
+#define VM_ARCH_2 0x02000000
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
#ifdef CONFIG_MEM_SOFT_DIRTY
@@ -154,6 +155,11 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */
#endif
+#if defined(CONFIG_X86)
+/* MPX specific bounds table or bounds directory */
+# define VM_MPX VM_ARCH_2
+#endif
+
#ifndef VM_GROWSUP
# define VM_GROWSUP VM_NONE
#endif
--
1.7.1
New bound-violation fields are added to the generic struct
siginfo. This impacts MIPS and IA64, which extend the generic
struct siginfo. This patch syncs the IA64 struct with the
generic version.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/ia64/include/uapi/asm/siginfo.h | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/ia64/include/uapi/asm/siginfo.h b/arch/ia64/include/uapi/asm/siginfo.h
index 4ea6225..bce9bc1 100644
--- a/arch/ia64/include/uapi/asm/siginfo.h
+++ b/arch/ia64/include/uapi/asm/siginfo.h
@@ -63,6 +63,10 @@ typedef struct siginfo {
unsigned int _flags; /* see below */
unsigned long _isr; /* isr */
short _addr_lsb; /* lsb of faulting address */
+ struct {
+ void __user *_lower;
+ void __user *_upper;
+ } _addr_bnd;
} _sigfault;
/* SIGPOLL */
@@ -110,9 +114,9 @@ typedef struct siginfo {
/*
* SIGSEGV si_codes
*/
-#define __SEGV_PSTKOVF (__SI_FAULT|3) /* paragraph stack overflow */
+#define __SEGV_PSTKOVF (__SI_FAULT|4) /* paragraph stack overflow */
#undef NSIGSEGV
-#define NSIGSEGV 3
+#define NSIGSEGV 4
#undef NSIGTRAP
#define NSIGTRAP 4
--
1.7.1
New bound-violation fields are added to the generic struct
siginfo. This impacts MIPS and IA64, which extend the generic
struct siginfo. This patch syncs the MIPS struct with the
generic version.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/mips/include/uapi/asm/siginfo.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/arch/mips/include/uapi/asm/siginfo.h b/arch/mips/include/uapi/asm/siginfo.h
index e811744..d08f83f 100644
--- a/arch/mips/include/uapi/asm/siginfo.h
+++ b/arch/mips/include/uapi/asm/siginfo.h
@@ -92,6 +92,10 @@ typedef struct siginfo {
int _trapno; /* TRAP # which caused the signal */
#endif
short _addr_lsb;
+ struct {
+ void __user *_lower;
+ void __user *_upper;
+ } _addr_bnd;
} _sigfault;
/* SIGPOLL, SIGXFSZ (To do ...) */
--
1.7.1
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
These exceptions (#BR) are conceptually similar to a page fault and are
raised by the MPX hardware both on bounds violations and when the tables
are not present. This patch handles the #BR exceptions for
not-present tables by carving the space out of the normal process's
address space (essentially calling the new mmap() interface introduced
earlier in this patch set) and then pointing the bounds directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they cannot be reserved ahead of time. Also, a
single process's pre-populated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mpx.h | 20 +++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/mpx.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 52 ++++++++++++++++++++++-
4 files changed, 173 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kernel/mpx.c
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 5725ac4..b7598ac 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -18,6 +18,8 @@
#define MPX_BT_ENTRY_SHIFT 5
#define MPX_IGN_BITS 3
+#define MPX_BD_ENTRY_TAIL 3
+
#else
#define MPX_BD_ENTRY_OFFSET 20
@@ -26,13 +28,31 @@
#define MPX_BT_ENTRY_SHIFT 4
#define MPX_IGN_BITS 2
+#define MPX_BD_ENTRY_TAIL 2
+
#endif
+#define MPX_BNDSTA_TAIL 2
+#define MPX_BNDCFG_TAIL 12
+#define MPX_BNDSTA_ADDR_MASK (~((1UL<<MPX_BNDSTA_TAIL)-1))
+#define MPX_BNDCFG_ADDR_MASK (~((1UL<<MPX_BNDCFG_TAIL)-1))
+#define MPX_BT_ADDR_MASK (~((1UL<<MPX_BD_ENTRY_TAIL)-1))
+
#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
#define MPX_BNDSTA_ERROR_CODE 0x3
+#define MPX_BD_ENTRY_VALID_FLAG 0x1
unsigned long mpx_mmap(unsigned long len);
+#ifdef CONFIG_X86_INTEL_MPX
+int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
+#else
+static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_X86_INTEL_MPX */
+
#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ada2e2d..9ece662 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_PREEMPT) += preempt.o
obj-y += process.o
obj-y += i387.o xsave.o
+obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-y += ptrace.o
obj-$(CONFIG_X86_32) += tls.o
obj-$(CONFIG_IA32_EMULATION) += tls.o
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
new file mode 100644
index 0000000..2103b5e
--- /dev/null
+++ b/arch/x86/kernel/mpx.c
@@ -0,0 +1,101 @@
+/*
+ * mpx.c - Memory Protection eXtensions
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ * Qiaowei Ren <[email protected]>
+ * Dave Hansen <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <asm/mpx.h>
+
+/*
+ * With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
+ * bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
+ * and the size of each bounds table is 4MB.
+ */
+static int allocate_bt(long __user *bd_entry)
+{
+ unsigned long bt_addr;
+ unsigned long expected_old_val = 0;
+ unsigned long actual_old_val = 0;
+ int ret = 0;
+
+ /*
+ * Carve the virtual space out of userspace for the new
+ * bounds table:
+ */
+ bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
+ if (IS_ERR((void *)bt_addr))
+ return PTR_ERR((void *)bt_addr);
+ /*
+ * Set the valid flag (kinda like _PAGE_PRESENT in a pte)
+ */
+ bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+
+ /*
+ * Go poke the address of the new bounds table in to the
+ * bounds directory entry out in userspace memory. Note:
+ * we may race with another CPU instantiating the same table.
+ * In that case the cmpxchg will see an unexpected
+ * 'actual_old_val'.
+ */
+ ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
+ expected_old_val, bt_addr);
+ if (ret)
+ goto out;
+
+ /*
+ * The user_atomic_cmpxchg_inatomic() will only return nonzero
+ * for faults, *not* if the cmpxchg itself fails. This verifies
+ * that the existing value was still empty like we expected.
+ *
+ * Note, we might get in here if there is a value in the existing
+ * bd_entry but it did not have the VALID_FLAG set. In that case
+ * we do _not_ replace it. We only replace completely empty
+ * entries.
+ */
+ if (expected_old_val != actual_old_val)
+ goto out;
+
+ return 0;
+
+out:
+ vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+ return ret;
+}
+
+/*
+ * When a BNDSTX instruction attempts to save bounds to a BD entry
+ * with the lack of the valid bit being set, a #BR is generated.
+ * This is an indication that no BT exists for this entry. In this
+ * case the fault handler will allocate a new BT.
+ *
+ * With 32-bit mode, the size of BD is 4MB, and the size of each
+ * bound table is 16KB. With 64-bit mode, the size of BD is 2GB,
+ * and the size of each bound table is 4MB.
+ */
+int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
+{
+ unsigned long status;
+ unsigned long bd_entry, bd_base;
+
+ bd_base = xsave_buf->bndcsr.bndcfgu & MPX_BNDCFG_ADDR_MASK;
+ status = xsave_buf->bndcsr.bndstatus;
+
+ /*
+ * The hardware provides the address of the missing or invalid
+ * entry via BNDSTATUS, so we don't have to go look it up.
+ */
+ bd_entry = status & MPX_BNDSTA_ADDR_MASK;
+ /*
+ * Make sure the directory entry is within where we think
+ * the directory is.
+ */
+ if ((bd_entry < bd_base) ||
+ (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
+ return -EINVAL;
+
+ return allocate_bt((long __user *)bd_entry);
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922..611b6ec 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -60,6 +60,7 @@
#include <asm/fixmap.h>
#include <asm/mach_traps.h>
#include <asm/alternative.h>
+#include <asm/mpx.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -228,7 +229,6 @@ dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error)
DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
-DO_ERROR(X86_TRAP_BR, SIGSEGV, "bounds", bounds)
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun",coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
@@ -278,6 +278,56 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
}
#endif
+dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
+{
+ enum ctx_state prev_state;
+ unsigned long status;
+ struct xsave_struct *xsave_buf;
+ struct task_struct *tsk = current;
+
+ prev_state = exception_enter();
+ if (notify_die(DIE_TRAP, "bounds", regs, error_code,
+ X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
+ goto exit;
+ conditional_sti(regs);
+
+ if (!user_mode(regs))
+ die("bounds", regs, error_code);
+
+ if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
+ /* The exception is not from Intel MPX */
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
+ goto exit;
+ }
+
+ fpu_xsave(&tsk->thread.fpu);
+ xsave_buf = &(tsk->thread.fpu.state->xsave);
+ status = xsave_buf->bndcsr.bndstatus;
+
+ /*
+ * The error code field of the BNDSTATUS register communicates status
+ * information of a bound range exception #BR or operation involving
+ * bound directory.
+ */
+ switch (status & MPX_BNDSTA_ERROR_CODE) {
+ case 2: /* Bound directory has invalid entry. */
+ if (do_mpx_bt_fault(xsave_buf))
+ force_sig(SIGSEGV, tsk);
+ break;
+
+ case 1: /* Bound violation. */
+ case 0: /* No exception caused by Intel MPX operations. */
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
+ break;
+
+ default:
+ die("bounds", regs, error_code);
+ }
+
+exit:
+ exception_exit(prev_state);
+}
+
dotraplinkage void
do_general_protection(struct pt_regs *regs, long error_code)
{
--
1.7.1
This patch adds new bound-violation fields to the siginfo
structure. si_lower and si_upper hold, respectively, the lower and
upper bounds that were in effect when the violation occurred.
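For illustration, a userspace consumer of these fields might look like the
sketch below (editorial example; it assumes the SEGV_BNDERR, si_lower and
si_upper definitions added by this patch are visible through the installed
uapi headers, and ignores async-signal-safety for brevity):

	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	static void br_handler(int sig, siginfo_t *info, void *ctx)
	{
		if (info->si_code == SEGV_BNDERR)
			fprintf(stderr, "bounds violation at %p, bounds [%p, %p]\n",
				info->si_addr, info->si_lower, info->si_upper);
		exit(1);
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = br_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);
		/* ... run MPX-instrumented code here ... */
		return 0;
	}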
Signed-off-by: Qiaowei Ren <[email protected]>
---
include/uapi/asm-generic/siginfo.h | 9 ++++++++-
kernel/signal.c | 4 ++++
2 files changed, 12 insertions(+), 1 deletions(-)
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ba5be7f..1e35520 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -91,6 +91,10 @@ typedef struct siginfo {
int _trapno; /* TRAP # which caused the signal */
#endif
short _addr_lsb; /* LSB of the reported address */
+ struct {
+ void __user *_lower;
+ void __user *_upper;
+ } _addr_bnd;
} _sigfault;
/* SIGPOLL */
@@ -131,6 +135,8 @@ typedef struct siginfo {
#define si_trapno _sifields._sigfault._trapno
#endif
#define si_addr_lsb _sifields._sigfault._addr_lsb
+#define si_lower _sifields._sigfault._addr_bnd._lower
+#define si_upper _sifields._sigfault._addr_bnd._upper
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
#ifdef __ARCH_SIGSYS
@@ -199,7 +205,8 @@ typedef struct siginfo {
*/
#define SEGV_MAPERR (__SI_FAULT|1) /* address not mapped to object */
#define SEGV_ACCERR (__SI_FAULT|2) /* invalid permissions for mapped object */
-#define NSIGSEGV 2
+#define SEGV_BNDERR (__SI_FAULT|3) /* failed address bound checks */
+#define NSIGSEGV 3
/*
* SIGBUS si_codes
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f0876f..2c403a4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2748,6 +2748,10 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
if (from->si_code == BUS_MCEERR_AR || from->si_code == BUS_MCEERR_AO)
err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);
#endif
+#ifdef SEGV_BNDERR
+ err |= __put_user(from->si_lower, &to->si_lower);
+ err |= __put_user(from->si_upper, &to->si_upper);
+#endif
break;
case __SI_CHLD:
err |= __put_user(from->si_pid, &to->si_pid);
--
1.7.1
This allows us to use cpu_feature_enabled(X86_FEATURE_MPX) as
both a runtime and compile-time check.
When CONFIG_X86_INTEL_MPX is disabled,
cpu_feature_enabled(X86_FEATURE_MPX) will evaluate at
compile-time to 0. If CONFIG_X86_INTEL_MPX=y, then the cpuid
flag will be checked at runtime.
This patch must be applied on top of another of Dave's commits:
381aa07a9b4e1f82969203e9e4863da2a157781d
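For illustration, a call site then just does the following (a fragment
mirroring the check used in do_bounds() elsewhere in this series):

	if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
		/*
		 * With CONFIG_X86_INTEL_MPX=n, X86_FEATURE_MPX is part of
		 * DISABLED_MASK9, so this test folds to a compile-time
		 * constant and the MPX-only code can be discarded.
		 */
		return;
	}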
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 97534a7..f226df0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -10,6 +10,12 @@
* cpu_feature_enabled().
*/
+#ifdef CONFIG_X86_INTEL_MPX
+# define DISABLE_MPX 0
+#else
+# define DISABLE_MPX (1<<(X86_FEATURE_MPX & 31))
+#endif
+
#ifdef CONFIG_X86_64
# define DISABLE_VME (1<<(X86_FEATURE_VME & 31))
# define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31))
@@ -34,6 +40,6 @@
#define DISABLED_MASK6 0
#define DISABLED_MASK7 0
#define DISABLED_MASK8 0
-#define DISABLED_MASK9 0
+#define DISABLED_MASK9 (DISABLE_MPX)
#endif /* _ASM_X86_DISABLED_FEATURES_H */
--
1.7.1
According to the Intel SDM, the MPX configuration and status registers
are named BNDCFGU and BNDSTATUS. This patch renames cfg_reg_u and
status_reg to bndcfgu and bndstatus to match.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/processor.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..020142f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -379,8 +379,8 @@ struct bndregs_struct {
} __packed;
struct bndcsr_struct {
- u64 cfg_reg_u;
- u64 status_reg;
+ u64 bndcfgu;
+ u64 bndstatus;
} __packed;
struct xsave_hdr_struct {
--
1.7.1
On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> + /*
> + * Go poke the address of the new bounds table in to the
> + * bounds directory entry out in userspace memory. Note:
> + * we may race with another CPU instantiating the same table.
> + * In that case the cmpxchg will see an unexpected
> + * 'actual_old_val'.
> + */
> + ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
> + expected_old_val, bt_addr);
This is a fully preemptible, non-atomic context, right?
So this wants a proper comment explaining why using
user_atomic_cmpxchg_inatomic() is the right thing to do here.
Thanks,
tglx
On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> This patch sets bound violation fields of siginfo struct in #BR
> exception handler by decoding the user instruction and constructing
> the faulting pointer.
>
> This patch does't use the generic decoder, and implements a limited
> special-purpose decoder to decode MPX instructions, simply because the
> generic decoder is very heavyweight not just in terms of performance
> but in terms of interface -- because it has to.
My question still stands: why is using the existing decoder an
issue? Performance is a complete non-issue in case of a bounds
violation, and the interface argument is just silly, really.
Thanks,
tglx
On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> +int mpx_enable_management(struct task_struct *tsk)
> +{
> + struct mm_struct *mm = tsk->mm;
> + void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
What's the point of initializing bd_base here? I had to look twice to
figure out that it gets overwritten by task_get_bounds_dir().
> @@ -285,6 +285,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> struct xsave_struct *xsave_buf;
> struct task_struct *tsk = current;
> siginfo_t info;
> + int ret = 0;
>
> prev_state = exception_enter();
> if (notify_die(DIE_TRAP, "bounds", regs, error_code,
> @@ -312,8 +313,35 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> */
> switch (status & MPX_BNDSTA_ERROR_CODE) {
> case 2: /* Bound directory has invalid entry. */
> - if (do_mpx_bt_fault(xsave_buf))
> + down_write(¤t->mm->mmap_sem);
The handling of mm->mmap_sem here is horrible. The only reason why you
want to hold mmap_sem write locked in the first place is that you want
to cover the allocation and the mm->bd_addr check.
I think it's wrong to tie this to mmap_sem in the first place. If MPX
is enabled then you should have mm->bd_addr and an explicit mutex to
protect it.
So the logic would look like this:
mutex_lock(&mm->bd_mutex);
if (!kernel_managed(mm))
do_trap();
else if (do_mpx_bt_fault())
force_sig();
mutex_unlock(&mm->bd_mutex);
No tricks with mmap_sem, no special return value handling. Straightforward
code instead of a convoluted and error-prone mess.
Hmm?
Thanks,
tglx
On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> Since we are doing the freeing from munmap() (and other paths like it),
> we hold mmap_sem for write. If we fault, the page fault handler will
> attempt to acquire mmap_sem for read and we will deadlock. For now, to
> avoid deadlock, we disable page faults while touching the bounds directory
> entry. This keeps us from being able to free the tables in this case.
> This deficiency will be addressed in later patches.
This is wrong to begin with. You need a proper design for addressing
this shortcoming in the first place. Retrofitting it into your
current design will just not work at all or end up with some major
mess.
The problem to solve is, that the kernel needs to unmap the bounds
table if the data mapping which it covers goes away.
You decided to do that at the end of unmap with mmap_sem write
held. As I explained last time, that's not an absolute requirement,
it's just a random choice. And that choice is obviously not a really
good one, as your own observation of the page fault issue proves.
So perhaps we should look at it the other way round. We already know,
before the actual unmap happens, what the start and the end of the
area are.
So we can be way smarter and do the following:
	mpx_pre_unmap(from, len);

	down_write(mmap_sem);
	do_unmap(from, len);
	up_write(mmap_sem);

	mpx_post_unmap(mpx_unmap);

int mpx_pre_unmap(...)
{
	down_write(mm->bd_sem);
	down_read(mm->mmap_sem);

	Here we do the following:

	1) Invalidate the bounds table entries for the to-be-unmapped area
	   in the bounds directory. This can fault as we only hold mmap_sem
	   for read.

	2) Mark the corresponding VMAs which should be unmapped and
	   removed. This can be done with mmap_sem held for read as the
	   VMA chain cannot be changed and the 'mark for removal' field is
	   protected by mm->bd_sem.

	   For each to-be-removed VMA we increment mm->bd_remove_vmas.

	   Holding mm->bd_sem also prevents a new bounds table from being
	   inserted, if we do the whole protection thing right.

	up_read(mm->mmap_sem);
}

void mpx_post_unmap(void)
{
	if (mm->bd_remove_vmas) {
		down_write(mm->mmap_sem);
		cleanup_to_be_removed_vmas();
		up_write(mm->mmap_sem);
	}
	up_write(mm->bd_sem);
}
The mpx_pre_unmap/post_unmap calls can either be added explicitly to the
relevant down_write/unmap/up_write code paths, or you can hide them behind
some wrapper function.
Now you need to acquire mm->bd_sem for the fault handling code as well
in order to serialize the creation of new bound table entries. In that
case a down_read(mm->bd_sem) is sufficient as the cmpxchg() prevents
the insertion of multiple tables for the same directory entries.
So we'd end up with the following serialization scheme:
prctl enable/disable management: down_write(mm->bd_sem);
bounds violation: down_read(mm->bd_sem);
directory entry allocation: down_read(mm->bd_sem);
directory entry removal: down_write(mm->bd_sem);
Now we need to check whether there is a potential deadlock lurking
around the corner. We get the following nesting scenarios:
- prctl enable/disable management: down_write(mm->bd_sem);
No mmap_sem nesting at all
- bounds violation: down_read(mm->bd_sem);
Might nest down_read(mm->mmap_sem) if the copy code from user space
faults.
- directory entry allocation: down_read(mm->bd_sem);
Nests down_write(mm->mmap_sem) for the VMA allocation.
Might nest down_read(mm->mmap_sem) if the cmpxchg() for the
directory entry faults
- directory entry removal: down_write(mm->bd_sem);
Might nest down_read(mm->mmap_sem) if the invalidation of the
directory entry faults
In other words here is the possible nesting:
#1 down_read(mm->bd_sem); down_read(mm->mmap_sem);
#2 down_read(mm->bd_sem); down_write(mm->mmap_sem);
#3 down_write(mm->bd_sem); down_read(mm->mmap_sem);
#4 down_write(mm->bd_sem); down_write(mm->mmap_sem);
We never execute any of those nested on a single task. The bounds
fault handler is never nested as it always comes from user space. So
we should be safe.
And this prevents all the nasty race conditions I discussed in
Portland with Dave, which happen when we try to do stuff outside of the
mmap_sem write-held region.
Performance-wise it's not an issue either. The prctl case is not a
hotpath anyway. The fault handlers get away with down_read(), and the
unmap code is heavyweight with mmap_sem write-held anyway.
Thoughts?
Thanks,
tglx
On Fri, 24 Oct 2014, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> > +int mpx_enable_management(struct task_struct *tsk)
> > +{
> > + struct mm_struct *mm = tsk->mm;
> > + void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
>
> What's the point of initializing bd_base here. I had to look twice to
> figure out that it gets overwritten by task_get_bounds_dir()
>
> > @@ -285,6 +285,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> > struct xsave_struct *xsave_buf;
> > struct task_struct *tsk = current;
> > siginfo_t info;
> > + int ret = 0;
> >
> > prev_state = exception_enter();
> > if (notify_die(DIE_TRAP, "bounds", regs, error_code,
> > @@ -312,8 +313,35 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> > */
> > switch (status & MPX_BNDSTA_ERROR_CODE) {
> > case 2: /* Bound directory has invalid entry. */
> > - if (do_mpx_bt_fault(xsave_buf))
> > + down_write(¤t->mm->mmap_sem);
>
> The handling of mm->mmap_sem here is horrible. The only reason why you
> want to hold mmap_sem write locked in the first place is that you want
> to cover the allocation and the mm->bd_addr check.
>
> I think it's wrong to tie this to mmap_sem in the first place. If MPX
> is enabled then you should have mm->bd_addr and an explicit mutex to
> protect it.
>
> So the logic would look like this:
>
> mutex_lock(&mm->bd_mutex);
> if (!kernel_managed(mm))
> do_trap();
> else if (do_mpx_bt_fault())
> force_sig();
> mutex_unlock(&mm->bd_mutex);
>
> No tricks with mmap_sem, no special return value handling. Straight
> forward code instead of a convoluted and error prone mess.
After thinking about the deallocation issue, this would be mm->bd_sem.
Thanks,
tglx
On 2014-10-24, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>
>> This patch sets bound violation fields of siginfo struct in #BR
>> exception handler by decoding the user instruction and constructing
>> the faulting pointer.
>>
>> This patch does't use the generic decoder, and implements a limited
>> special-purpose decoder to decode MPX instructions, simply because
>> the generic decoder is very heavyweight not just in terms of
>> performance but in terms of interface -- because it has to.
>
> My question still stands why using the existing decoder is an issue.
> Performance is a complete non issue in case of a bounds violation and
> the interface argument is just silly, really.
>
As hpa said, we only need to decode a few MPX instructions, including
BNDCL/BNDCU, and the general decoder looks a little heavyweight for
that. Peter, what do you think about it?
Thanks,
Qiaowei
On 2014-10-24, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>> +int mpx_enable_management(struct task_struct *tsk) {
>> + struct mm_struct *mm = tsk->mm;
>> + void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
>
> What's the point of initializing bd_base here. I had to look twice to
> figure out that it gets overwritten by task_get_bounds_dir()
>
I just wanted to call task_get_bounds_dir() outside of the region where
mm->mmap_sem is held.
>> @@ -285,6 +285,7 @@ dotraplinkage void do_bounds(struct pt_regs
>> *regs,
> long error_code)
>> struct xsave_struct *xsave_buf;
>> struct task_struct *tsk = current;
>> siginfo_t info;
>> + int ret = 0;
>>
>> prev_state = exception_enter();
>> if (notify_die(DIE_TRAP, "bounds", regs, error_code, @@ -312,8
>> +313,35 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long
> error_code)
>> */
>> switch (status & MPX_BNDSTA_ERROR_CODE) {
>> case 2: /* Bound directory has invalid entry. */
>> - if (do_mpx_bt_fault(xsave_buf))
>> + down_write(¤t->mm->mmap_sem);
>
> The handling of mm->mmap_sem here is horrible. The only reason why you
> want to hold mmap_sem write locked in the first place is that you want
> to cover the allocation and the mm->bd_addr check.
>
> I think it's wrong to tie this to mmap_sem in the first place. If MPX
> is enabled then you should have mm->bd_addr and an explicit mutex to protect it.
>
> So the logic would look like this:
>
> mutex_lock(&mm->bd_mutex);
> if (!kernel_managed(mm))
> do_trap(); else if (do_mpx_bt_fault()) force_sig();
> mutex_unlock(&mm->bd_mutex);
> No tricks with mmap_sem, no special return value handling. Straight
> forward code instead of a convoluted and error prone mess.
>
> Hmm?
>
I guess this is a good solution. If so, a new field 'bd_sem' would have to be added to struct mm_struct.
Thanks,
Qiaowei
On 10/24/2014 10:40 PM, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>> Since we are doing the freeing from munmap() (and other paths like it),
>> we hold mmap_sem for write. If we fault, the page fault handler will
>> attempt to acquire mmap_sem for read and we will deadlock. For now, to
>> avoid deadlock, we disable page faults while touching the bounds directory
>> entry. This keeps us from being able to free the tables in this case.
>> This deficiency will be addressed in later patches.
>
> This is wrong to begin with. You need a proper design for addressing
> this short coming in the first place. Retrofitting it into your
> current design will just not work at all or end up with some major
> mess.
>
> The problem to solve is, that the kernel needs to unmap the bounds
> table if the data mapping which it covers goes away.
>
> You decided to do that at the end of unmap with mmap_sem write
> held. As I explained last time, that's not an absolute requirement,
> it's just a random choice. And that choice is obvioulsy not a really
> good one as your own observation of the page fault issue proves.
>
> So perhaps we should look at it the other way round. We already know
> before the actual unmap happens what's the start and the end of the
> area.
>
> So we can be way smarter and do the following:
>
> mpx_pre_unmap(from, len);
>
> down_write(mmap_sem);
> do_unmap(from, len);
> up_write(mmap_sem);
>
> mpx_post_unmap(mpx_unmap);
>
> int mpx_pre_unmap(...)
> {
> down_write(mm->bd_sem);
>
> down_read(mm->mmap_sem);
>
> Here we do the following:
>
> 1) Invalidate the bounds table entries for the to be unmapped area in
> the bounds directory. This can fault as we only hold mmap sem for
> read.
>
> 2) Mark the corresponding VMAs which should be unmapped and
> removed. This can be done with mmap sem down read held as the
> VMA chain cannot be changed and the 'Mark for removal" field is
> protected by mm->bd_sem.
>
> For each to be removed VMA we increment mm->bd_remove_vmas
>
> Holding mm->bd_sem also prevents that a new bound table to be
> inserted, if we do the whole protection thing right.
>
> up_read(mm->mmap_sem);
> }
>
> void mpx_post_unmap(void)
> {
> if (mm->bd_remove_vmas) {
> down_write(mm->mmap_sem);
> cleanup_to_be_removed_vmas();
> up_write(mm->mmap_sem);
> }
>
> up_write(mm->bd_sem);
> }
>
If so, I guess there are some questions that need to be considered:
1) Almost all places which call do_munmap() will need to add
mpx_pre_unmap/post_unmap calls, like vm_munmap(), mremap(), shmdt(), etc.
2) Before the mpx_post_unmap() call, it is possible for the bounds tables
within mm->bd_remove_vmas to be re-used.
In this case, userspace may create a new mapping and access an address
covered by one of those bounds tables. During this window the hardware
will see that a bounds table already exists, so no fault is produced.
3) According to Dave, the bounds tables related to adjacent VMAs
within the start and the end possibly don't have to be fully unmapped;
we may only need to free part of the backing physical memory.
> The mpx_pre_unmap/post_unmap calls can be either added explicit to the
> relevant down_write/unmap/up_write code pathes or you hide it behind
> some wrapper function.
>
> Now you need to acquire mm->bd_sem for the fault handling code as well
> in order to serialize the creation of new bound table entries. In that
> case a down_read(mm->bd_sem) is sufficient as the cmpxchg() prevents
> the insertion of multiple tables for the same directory entries.
>
> So we'd end up with the following serialization scheme:
>
> prctl enable/disable management: down_write(mm->bd_sem);
>
> bounds violation: down_read(mm->bd_sem);
>
> directory entry allocation: down_read(mm->bd_sem);
>
> directory entry removal: down_write(mm->bd_sem);
>
> Now we need to check whether there is a potential deadlock lurking
> around the corner. We get the following nesting scenarios:
>
> - prctl enable/disable management: down_write(mm->bd_sem);
>
> No mmap_sem nesting at all
>
> - bounds violation: down_read(mm->bd_sem);
>
> Might nest down_read(mm->mmap_sem) if the copy code from user space
> faults.
>
> - directory entry allocation: down_read(mm->bd_sem);
>
> Nests down_write(mm->mmap_sem) for the VMA allocation.
>
> Might nest down_read(mm->mmap_sem) if the cmpxchg() for the
> directory entry faults
>
> - directory entry removal: down_write(mm->bd_sem);
>
> Might nest down_read(mm->mmap_sem) if the invalidation of the
> directory entry faults
>
> In other words here is the possible nesting:
>
> #1 down_read(mm->bd_sem); down_read(mm->mmap_sem);
>
> #2 down_read(mm->bd_sem); down_write(mm->mmap_sem);
>
> #3 down_write(mm->bd_sem); down_read(mm->mmap_sem);
>
> #4 down_write(mm->bd_sem); down_write(mm->mmap_sem);
>
> We never execute any of those nested on a single task. The bounds
> fault handler is never nested as it always comes from user space. So
> we should be safe.
>
> And this prevents all the nasty race conditions I discussed in
> Portland with Dave which happen when we try do to stuff outside of the
> mmap_sem write held region.
>
> Performance wise it's not an issue either. The prctl case is not a a
> hotpath anyway. The fault handlers get away with down_read() and the
> unmap code is heavyweight on mmap sem write held anyway.
>
> Thoughts?
>
> Thanks,
>
> tglx
>
On 10/24/2014 08:08 PM, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>> + /*
>> + * Go poke the address of the new bounds table in to the
>> + * bounds directory entry out in userspace memory. Note:
>> + * we may race with another CPU instantiating the same table.
>> + * In that case the cmpxchg will see an unexpected
>> + * 'actual_old_val'.
>> + */
>> + ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
>> + expected_old_val, bt_addr);
>
> This is fully preemptible non-atomic context, right?
>
> So this wants a proper comment, why using
> user_atomic_cmpxchg_inatomic() is the right thing to do here.
>
Well, we will address it.
Thanks,
Qiaowei
On Mon, 27 Oct 2014, Ren, Qiaowei wrote:
> On 2014-10-24, Thomas Gleixner wrote:
> > On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> >
> >> This patch sets bound violation fields of siginfo struct in #BR
> >> exception handler by decoding the user instruction and constructing
> >> the faulting pointer.
> >>
> >> This patch does't use the generic decoder, and implements a limited
> >> special-purpose decoder to decode MPX instructions, simply because
> >> the generic decoder is very heavyweight not just in terms of
> >> performance but in terms of interface -- because it has to.
> >
> > My question still stands why using the existing decoder is an issue.
> > Performance is a complete non issue in case of a bounds violation and
> > the interface argument is just silly, really.
> >
>
> As hpa said, we only need to decode several mpx instructions
> including BNDCL/BNDCU, and general decoder looks like a little
> heavy. Peter, what do you think about it?
You're repeating yourself. Care to read the discussion about this from
the last round of review again?
Thanks,
tglx
On Mon, 27 Oct 2014, Ren, Qiaowei wrote:
> On 2014-10-24, Thomas Gleixner wrote:
> > On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> >> +int mpx_enable_management(struct task_struct *tsk) {
> >> + struct mm_struct *mm = tsk->mm;
> >> + void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
> >
> > What's the point of initializing bd_base here. I had to look twice to
> > figure out that it gets overwritten by task_get_bounds_dir()
> >
>
> I just want to put task_get_bounds_dir() outside mm->mmap_sem holding.
What you want is not interesting at all. What's interesting is what
you do and what you send for review.
Thanks,
tglx
On Mon, 27 Oct 2014, Ren Qiaowei wrote:
> If so, I guess that there are some questions needed to be considered:
>
> 1) Almost all palces which call do_munmap() will need to add
> mpx_pre_unmap/post_unmap calls, like vm_munmap(), mremap(), shmdt(), etc..
What's the problem with that?
> 2) before mpx_post_unmap() call, it is possible for those bounds tables within
> mm->bd_remove_vmas to be re-used.
>
> In this case, userspace may do new mapping and access one address which will
> cover one of those bounds tables. During this period, HW will check if one
> bounds table exist, if yes one fault won't be produced.
Errm. Before user space can use the bounds table for the new mapping
it needs to add the entries, right? So:
CPU 0                                   CPU 1

down_write(mm->bd_sem);
mpx_pre_unmap();
  clear bounds directory entries
unmap();
                                        map()
                                        write_bounds_entry()
                                        trap()
                                          down_read(mm->bd_sem);
mpx_post_unmap();
up_write(mm->bd_sem);
                                          allocate_bounds_table();
That's the whole point of bd_sem.
> 3) According to Dave, those bounds tables related to adjacent VMAs within the
> start and the end possibly don't have to be fully unmmaped, and we only need
> free the part of backing physical memory.
Care to explain why that's a problem?
Thanks,
tglx
On 10/28/2014 04:49 AM, Thomas Gleixner wrote:
> On Mon, 27 Oct 2014, Ren Qiaowei wrote:
>> If so, I guess that there are some questions needed to be considered:
>>
>> 1) Almost all palces which call do_munmap() will need to add
>> mpx_pre_unmap/post_unmap calls, like vm_munmap(), mremap(), shmdt(), etc..
>
> What's the problem with that?
>
For example:

shmdt()
	down_write(mm->mmap_sem);
	vma = find_vma();
	while (vma)
		do_munmap();
	up_write(mm->mmap_sem);

We cannot simply add mpx_pre_unmap() before do_munmap() or before
down_write(), and it seems a little hard to change shmdt() to match
this solution, right?
>> 2) before mpx_post_unmap() call, it is possible for those bounds tables within
>> mm->bd_remove_vmas to be re-used.
>>
>> In this case, userspace may do new mapping and access one address which will
>> cover one of those bounds tables. During this period, HW will check if one
>> bounds table exist, if yes one fault won't be produced.
>
> Errm. Before user space can use the bounds table for the new mapping
> it needs to add the entries, right? So:
>
> CPU 0 CPU 1
>
> down_write(mm->bd_sem);
> mpx_pre_unmap();
> clear bounds directory entries
> unmap();
> map()
> write_bounds_entry()
> trap()
> down_read(mm->bd_sem);
> mpx_post_unmap();
> up_write(mm->bd_sem);
> allocate_bounds_table();
>
> That's the whole point of bd_sem.
>
Yes. Got it.
>> 3) According to Dave, those bounds tables related to adjacent VMAs within the
>> start and the end possibly don't have to be fully unmmaped, and we only need
>> free the part of backing physical memory.
>
> Care to explain why that's a problem?
>
I guess you mean a new field, mm->bd_remove_vmas, should be added to
struct mm_struct, right?
For the VMAs where we only need to free part of the backing physical
memory, we cannot clear the bounds directory entries and would also have
to mark the range of backing physical memory within the VMA. If so,
maybe too many new fields would end up being added to the mm struct, right?
Thanks,
Qiaowei
On 10/28/2014 04:38 AM, Thomas Gleixner wrote:
> On Mon, 27 Oct 2014, Ren, Qiaowei wrote:
>> On 2014-10-24, Thomas Gleixner wrote:
>>> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>>>> +int mpx_enable_management(struct task_struct *tsk) {
>>>> + struct mm_struct *mm = tsk->mm;
>>>> + void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
>>>
>>> What's the point of initializing bd_base here. I had to look twice to
>>> figure out that it gets overwritten by task_get_bounds_dir()
>>>
>>
>> I just want to put task_get_bounds_dir() outside mm->mmap_sem holding.
>
> What you want is not interesting at all. What's interesting is what
> you do and what you send for review.
>
I see. Thanks.
- Qiaowei
On 10/28/2014 04:36 AM, Thomas Gleixner wrote:
> On Mon, 27 Oct 2014, Ren, Qiaowei wrote:
>> On 2014-10-24, Thomas Gleixner wrote:
>>> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>>>
>>>> This patch sets bound violation fields of siginfo struct in #BR
>>>> exception handler by decoding the user instruction and constructing
>>>> the faulting pointer.
>>>>
>>>> This patch does't use the generic decoder, and implements a limited
>>>> special-purpose decoder to decode MPX instructions, simply because
>>>> the generic decoder is very heavyweight not just in terms of
>>>> performance but in terms of interface -- because it has to.
>>>
>>> My question still stands why using the existing decoder is an issue.
>>> Performance is a complete non issue in case of a bounds violation and
>>> the interface argument is just silly, really.
>>>
>>
>> As hpa said, we only need to decode several mpx instructions
>> including BNDCL/BNDCU, and general decoder looks like a little
>> heavy. Peter, what do you think about it?
>
> You're repeating yourself. Care to read the discussion about this from
> the last round of review again?
>
Ok. I will go through it again. Thanks.
- Qiaowei
On Tue, 28 Oct 2014, Ren Qiaowei wrote:
> On 10/28/2014 04:49 AM, Thomas Gleixner wrote:
> > On Mon, 27 Oct 2014, Ren Qiaowei wrote:
> > > If so, I guess that there are some questions needed to be considered:
> > >
> > > 1) Almost all palces which call do_munmap() will need to add
> > > mpx_pre_unmap/post_unmap calls, like vm_munmap(), mremap(), shmdt(), etc..
> >
> > What's the problem with that?
> >
>
> For example:
>
> shmdt()
> down_write(mm->mmap_sem);
> vma = find_vma();
> while (vma)
> do_munmap();
> up_write(mm->mmap_sem);
>
> We could not simply add mpx_pre_unmap() before do_munmap() or down_write().
> And seems like it is a little hard for shmdt() to be changed to match this
> solution, right?
Everything which does not fall into place right away seems to be a
little hard, heavyweight, or whatever excuse you have for it.
It's not that hard, really. We can simply split out the search code
into a separate function and use it for both problems.
Yes, it is quite some work to do, but it's straightforward.
> > > 3) According to Dave, those bounds tables related to adjacent VMAs within
> > > the
> > > start and the end possibly don't have to be fully unmmaped, and we only
> > > need
> > > free the part of backing physical memory.
> >
> > Care to explain why that's a problem?
> >
>
> I guess you mean one new field mm->bd_remove_vmas should be added into staruct
> mm, right?
That was just to demonstrate the approach. I'm giving you a hint how
to do it, I'm not telling you what the exact solution will be. If I
need to do that, then I can implement it myself right away.
> For those VMAs which we only need to free part of backing physical memory, we
> could not clear bounds directory entries and should also mark the range of
> backing physical memory within this vma. If so, maybe there are too many new
> fields which will be added into mm struct, right?
If we need more data to carry over from pre to post, we can allocate a
proper data structure and just add a pointer to it to mm. And it's not
written in stone that you need to carry that information from pre to
post. You could do the unmap/zap work in the pre phase already and
reduce mpx_post_unmap() to up_write(mm->bd_sem).
I gave you an idea and the center point of that idea is to have a
separate rwsem to protect against the various races, fault handling
etc. You still have to think about the implementation details.
Thanks,
tglx
On 10/24/2014 05:08 AM, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>> + /*
>> + * Go poke the address of the new bounds table in to the
>> + * bounds directory entry out in userspace memory. Note:
>> + * we may race with another CPU instantiating the same table.
>> + * In that case the cmpxchg will see an unexpected
>> + * 'actual_old_val'.
>> + */
>> + ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
>> + expected_old_val, bt_addr);
>
> This is fully preemptible non-atomic context, right?
>
> So this wants a proper comment, why using
> user_atomic_cmpxchg_inatomic() is the right thing to do here.
Hey Thomas,
How's this for a new comment? Does it cover the points you think need
to be clarified?
====
The kernel has allocated a bounds table and needs to point the
(userspace-allocated) directory to it. The directory entry is the
*only* place we track that this table was allocated, so we essentially
use it instead of a kernel data structure for synchronization. A
copy_to_user()-style function would not give us the atomicity that we need.
If two threads race to instantiate a table, the cmpxchg ensures we know
which one lost the race and that the loser frees the table that they
just allocated.
On Tue, 28 Oct 2014, Dave Hansen wrote:
> On 10/24/2014 05:08 AM, Thomas Gleixner wrote:
> > On Sun, 12 Oct 2014, Qiaowei Ren wrote:
> >> + /*
> >> + * Go poke the address of the new bounds table in to the
> >> + * bounds directory entry out in userspace memory. Note:
> >> + * we may race with another CPU instantiating the same table.
> >> + * In that case the cmpxchg will see an unexpected
> >> + * 'actual_old_val'.
> >> + */
> >> + ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
> >> + expected_old_val, bt_addr);
> >
> > This is fully preemptible non-atomic context, right?
> >
> > So this wants a proper comment, why using
> > user_atomic_cmpxchg_inatomic() is the right thing to do here.
>
> Hey Thomas,
>
> How's this for a new comment? Does this cover the points you think need
> clarified?
>
> ====
>
> The kernel has allocated a bounds table and needs to point the
> (userspace-allocated) directory to it. The directory entry is the
> *only* place we track that this table was allocated, so we essentially
> use it instead of an kernel data structure for synchronization. A
> copy_to_user()-style function would not give us the atomicity that we need.
>
> If two threads race to instantiate a table, the cmpxchg ensures we know
> which one lost the race and that the loser frees the table that they
> just allocated.
Yup. That explains the cmpxchg.
The other thing which puzzled me was that it calls
user_atomic_cmpxchg_inatomic() but the context is not atomic at
all. It's fully preemptible, and we actually want it to be able to
handle the fault. The implementation does that; just the function
name itself suggests something different.
Thanks,
tglx
> +void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
> + struct xsave_struct *xsave_buf)
> +{
> + struct mpx_insn insn;
> + uint8_t bndregno;
> + unsigned long addr_vio;
> +
> + addr_vio = mpx_insn_decode(&insn, regs);
> +
> + bndregno = X86_MODRM_REG(insn.modrm.value);
> + if (bndregno > 3)
> + return;
> +
> + /* Note: the upper 32 bits are ignored in 32-bit mode. */
> + info->si_lower = (void __user *)(unsigned long)
> + (xsave_buf->bndregs.bndregs[2*bndregno]);
> + info->si_upper = (void __user *)(unsigned long)
> + (~xsave_buf->bndregs.bndregs[2*bndregno+1]);
> + info->si_addr_lsb = 0;
> + info->si_signo = SIGSEGV;
> + info->si_errno = 0;
> + info->si_code = SEGV_BNDERR;
> + info->si_addr = (void __user *)addr_vio;
> +}
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 611b6ec..b2a916b 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -284,6 +284,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> unsigned long status;
> struct xsave_struct *xsave_buf;
> struct task_struct *tsk = current;
> + siginfo_t info;
>
> prev_state = exception_enter();
> if (notify_die(DIE_TRAP, "bounds", regs, error_code,
> @@ -316,6 +317,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> break;
>
> case 1: /* Bound violation. */
> + do_mpx_bounds(regs, &info, xsave_buf);
> + do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
> + error_code, &info);
> + break;
> +
> case 0: /* No exception caused by Intel MPX operations. */
> do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
> break;
>
So, siginfo is stack-allocated here. do_mpx_bounds() can error out if
it sees an invalid bndregno. We still send the signal with &info
whether or not do_mpx_bounds() actually filled 'info' in.
Can't this leak some kernel stack out in the 'info'?
On 10/31/2014 06:38 AM, Dave Hansen wrote:
>> +void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
>> + struct xsave_struct *xsave_buf)
>> +{
>> + struct mpx_insn insn;
>> + uint8_t bndregno;
>> + unsigned long addr_vio;
>> +
>> + addr_vio = mpx_insn_decode(&insn, regs);
>> +
>> + bndregno = X86_MODRM_REG(insn.modrm.value);
>> + if (bndregno > 3)
>> + return;
>> +
>> + /* Note: the upper 32 bits are ignored in 32-bit mode. */
>> + info->si_lower = (void __user *)(unsigned long)
>> + (xsave_buf->bndregs.bndregs[2*bndregno]);
>> + info->si_upper = (void __user *)(unsigned long)
>> + (~xsave_buf->bndregs.bndregs[2*bndregno+1]);
>> + info->si_addr_lsb = 0;
>> + info->si_signo = SIGSEGV;
>> + info->si_errno = 0;
>> + info->si_code = SEGV_BNDERR;
>> + info->si_addr = (void __user *)addr_vio;
>> +}
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 611b6ec..b2a916b 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -284,6 +284,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>> unsigned long status;
>> struct xsave_struct *xsave_buf;
>> struct task_struct *tsk = current;
>> + siginfo_t info;
>>
>> prev_state = exception_enter();
>> if (notify_die(DIE_TRAP, "bounds", regs, error_code,
>> @@ -316,6 +317,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>> break;
>>
>> case 1: /* Bound violation. */
>> + do_mpx_bounds(regs, &info, xsave_buf);
>> + do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
>> + error_code, &info);
>> + break;
>> +
>> case 0: /* No exception caused by Intel MPX operations. */
>> do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
>> break;
>>
>
> So, siginfo is stack-allocated here. do_mpx_bounds() can error out if
> it sees an invalid bndregno. We still send the signal with &info
> whether or not do_mpx_bounds() actually filled 'info' in.
>
> Can't this leak some kernel stack out in the 'info'?
>
This should check the return value of do_mpx_bounds(); it needs to be fixed.
Thanks,
Qiaowei
On Fri, 31 Oct 2014, Ren Qiaowei wrote:
> On 10/31/2014 06:38 AM, Dave Hansen wrote:
> > > @@ -316,6 +317,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs,
> > > long error_code)
> > > break;
> > >
> > > case 1: /* Bound violation. */
> > > + do_mpx_bounds(regs, &info, xsave_buf);
> > > + do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
> > > + error_code, &info);
> > > + break;
> > > +
> > > case 0: /* No exception caused by Intel MPX operations. */
> > > do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code,
> > > NULL);
> > > break;
> > >
> >
> > So, siginfo is stack-allocated here. do_mpx_bounds() can error out if
> > it sees an invalid bndregno. We still send the signal with &info
> > whether or not do_mpx_bounds() actually filled 'info' in.
> >
> > Can't this leak some kernel stack out in the 'info'?
> >
>
> This should check the return value of do_mpx_bounds(); it needs to be fixed.
And how's that answering Dave's question about leaking stack information?
Thanks,
tglx
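For reference, one way to close that hole is to zero the siginfo up front
and only hand it to do_trap() when the decode actually succeeded. A sketch,
assuming do_mpx_bounds() were changed to return non-zero on failure (it is
void in the posted patch):

        case 1: /* Bound violation. */
                memset(&info, 0, sizeof(info));
                if (do_mpx_bounds(regs, &info, xsave_buf)) {
                        /*
                         * Could not decode the instruction: send a plain
                         * SIGSEGV rather than uninitialized stack contents.
                         */
                        do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
                                error_code, NULL);
                        break;
                }
                do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
                        error_code, &info);
                break;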
On 10/27/2014 01:36 PM, Thomas Gleixner wrote:
> You're repeating yourself. Care to read the discussion about this from
> the last round of review again?
OK, so here's a rewritten decoder. I think it's a lot more robust and
probably fixes a bug or two. This ends up saving ~70 lines of code out
of ~300 or so for the old patch.
I'll include this in the next series, but I'm posting it early and often
to make sure I'm on the right track.
There is also a preparatory patch or two, but they're small.
On Fri, 31 Oct 2014, Dave Hansen wrote:
> On 10/27/2014 01:36 PM, Thomas Gleixner wrote:
> > You're repeating yourself. Care to read the discussion about this from
> > the last round of review again?
>
> OK, so here's a rewritten decoder. I think it's a lot more robust and
> probably fixes a bug or two. This ends up saving ~70 lines of code out
> of ~300 or so for the old patch.
>
> I'll include this in the next series, but I'm posting it early and often
> to make sure I'm on the right track.
Had a short glance. This looks really very well done!
Thanks,
tglx
On 10/27/2014 01:49 PM, Thomas Gleixner wrote:
> Errm. Before user space can use the bounds table for the new mapping
> it needs to add the entries, right? So:
>
> CPU 0                               CPU 1
>
> down_write(mm->bd_sem);
> mpx_pre_unmap();
>   clear bounds directory entries
> unmap();
>                                     map()
>                                     write_bounds_entry()
>                                       trap()
>                                       down_read(mm->bd_sem);
> mpx_post_unmap();
> up_write(mm->bd_sem);
>                                       allocate_bounds_table();
>
> That's the whole point of bd_sem.
This does, indeed, seem to work for the normal munmap() cases. However,
take a look at shmdt(). We don't know the size of the segment being
unmapped until after we acquire mmap_sem for write, so we wouldn't be
able to do an mpx_pre_unmap() for those.
mremap() is similar. We don't know if the area got expanded (and we
don't need to modify bounds tables) or moved (and we need to free the
old location's tables) until well after we've taken mmap_sem for write.
I propose we keep mm->bd_sem. But I think that, during each of the
unmapping operations, we need to keep a list of the VMAs that got
unmapped, without freeing them. At up_write() time, we look at the
list; if it is empty, we just do an up_write() and we are done.
If *not* empty, downgrade_write(mm->mmap_sem), and do the work you
spelled out in mpx_pre_unmap() above: clear the bounds directory entries
and gather the VMAs while still holding mm->bd_sem for write.
Here's the other wrinkle: This would invert the ->bd_sem vs. ->mmap_sem
ordering (bd_sem nests outside mmap_sem with the above scheme). We
_could_ always acquire bd_sem for write whenever mmap_sem is acquired,
although that seems a bit heavyweight. I can't think of anything better
at the moment, though.
On Mon, 3 Nov 2014, Dave Hansen wrote:
> On 10/27/2014 01:49 PM, Thomas Gleixner wrote:
> > Errm. Before user space can use the bounds table for the new mapping
> > it needs to add the entries, right? So:
> >
> > CPU 0                               CPU 1
> >
> > down_write(mm->bd_sem);
> > mpx_pre_unmap();
> >   clear bounds directory entries
> > unmap();
> >                                     map()
> >                                     write_bounds_entry()
> >                                       trap()
> >                                       down_read(mm->bd_sem);
> > mpx_post_unmap();
> > up_write(mm->bd_sem);
> >                                       allocate_bounds_table();
> >
> > That's the whole point of bd_sem.
>
> This does, indeed, seem to work for the normal munmap() cases. However,
> take a look at shmdt(). We don't know the size of the segment being
> unmapped until after we acquire mmap_sem for write, so we wouldn't be
> able to do an mpx_pre_unmap() for those.
That's not really true. You can evaluate that information with
mmap_sem held for read as well. Nothing can change the mappings until
you drop it. So you could do:
down_write(mm->bd_sem);
down_read(mm->mmap_sem);
evaluate_size_of_shm_to_unmap();
clear_bounds_directory_entries();
up_read(mm->mmap_sem);
do_the_real_shm_unmap();
up_write(mm->bd_sem);
That should still be covered by the above scheme.
> mremap() is similar. We don't know if the area got expanded (and we
> don't need to modify bounds tables) or moved (and we need to free the
> old location's tables) until well after we've taken mmap_sem for write.
See above.
> I propose we keep mm->bd_sem. But I think that, during each of the
> unmapping operations, we need to keep a list of the VMAs that got
> unmapped, without freeing them. At up_write() time, we look at the
> list; if it is empty, we just do an up_write() and we are done.
>
> If *not* empty, downgrade_write(mm->mmap_sem), and do the work you
> spelled out in mpx_pre_unmap() above: clear the bounds directory entries
> and gather the VMAs while still holding mm->bd_sem for write.
>
> Here's the other wrinkle: This would invert the ->bd_sem vs. ->mmap_sem
> ordering (bd_sem nests outside mmap_sem with the above scheme). We
> _could_ always acquire bd_sem for write whenever mmap_sem is acquired,
> although that seems a bit heavyweight. I can't think of anything better
> at the moment, though.
That works as well. If it makes stuff simpler I'm all for it. But then
we should really replace down_write(mmap_sem) with a helper function
and add something to checkpatch.pl and to the coccinelle scripts to
catch new instances of open coded 'down_write(mmap_sem)'.
Thanks,
tglx
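A minimal sketch of such a helper, assuming the simple "always take bd_sem,
always outside mmap_sem" rule (mm->bd_sem is the field proposed in this
series, not something in today's tree):

        static inline void mm_down_write(struct mm_struct *mm)
        {
                /* bd_sem always nests outside mmap_sem. */
                down_write(&mm->bd_sem);
                down_write(&mm->mmap_sem);
        }

        static inline void mm_up_write(struct mm_struct *mm)
        {
                up_write(&mm->mmap_sem);
                up_write(&mm->bd_sem);
        }

Open-coded down_write(&mm->mmap_sem) would then be something for
checkpatch.pl / coccinelle to complain about.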
On 11/03/2014 01:29 PM, Thomas Gleixner wrote:
> On Mon, 3 Nov 2014, Dave Hansen wrote:
> That's not really true. You can evaluate that information with
> mmap_sem held for read as well. Nothing can change the mappings until
> you drop it. So you could do:
>
> down_write(mm->bd_sem);
> down_read(mm->mmap_sem);
> evaluate_size_of_shm_to_unmap();
> clear_bounds_directory_entries();
> up_read(mm->mmap_sem);
> do_the_real_shm_unmap();
> up_write(mm->bd_sem);
>
> That should still be covered by the above scheme.
Yep, that'll work. It just means rewriting the shmdt()/mremap() code to
do a "dry run" of sorts.
Do you have any concerns about adding another mutex to these paths?
munmap() isn't as hot a path as the allocation side, but it does
worry me a bit that we're going to perturb some workloads. We might
need to find a way to optimize out the bd_sem activity on processes that
never used MPX.
On Tue, 4 Nov 2014, Dave Hansen wrote:
> On 11/03/2014 01:29 PM, Thomas Gleixner wrote:
> > On Mon, 3 Nov 2014, Dave Hansen wrote:
>
> > That's not really true. You can evaluate that information with
> > mmap_sem held for read as well. Nothing can change the mappings until
> > you drop it. So you could do:
> >
> > down_write(mm->bd_sem);
> > down_read(mm->mmap_sem);
> > evaluate_size_of_shm_to_unmap();
> > clear_bounds_directory_entries();
> > up_read(mm->mmap_sem);
> > do_the_real_shm_unmap();
> > up_write(mm->bd_sem);
> >
> > That should still be covered by the above scheme.
>
> Yep, that'll work. It just means rewriting the shmdt()/mremap() code to
> do a "dry run" of sorts.
Right. So either that, or we hold bd_sem write-locked across all
write-locked mmap_sem sections. Dunno which solution is prettier :)
> Do you have any concerns about adding another mutex to these paths?
You mean bd_sem? I don't think it's an issue. You need to get mmap_sem
for write as well. So
> munmap() isn't as hot a path as the allocation side, but it does
> worry me a bit that we're going to perturb some workloads. We might
> need to find a way to optimize out the bd_sem activity on processes that
> never used MPX.
I think using mm->bd_addr as a conditional for the bd_sem/mpx activity
is good enough. You just need to make sure that you store the result
of the starting conditional and use it for the closing one as well.
        mpx = mpx_pre_unmap(mm);
        {
                if (!kernel_managing_bounds_tables(mm))
                        return 0;
                down_write(mm->bd_sem);
                ...
                return 1;
        }

        unmap();

        mpx_post_unmap(mm, mpx);
        {
                if (mpx) {
                        ....
                        up_write(mm->bd_sem);
                }
        }
So this serializes nicely with the bd_sem-protected write to
mm->bd_addr. There is a race there, but I don't think it matters. The
worst thing that can happen is a stale bounds table.
Thanks,
tglx
Instead of all of these games with dropping and reacquiring mmap_sem and
adding other locks, or deferring the work, why don't we just do a
get_user_pages()? Something along the lines of:
        while (1) {
                ret = cmpxchg(addr);
                if (!ret)
                        break;
                if (ret == -EFAULT)
                        get_user_pages(addr);
        }
Does anybody see a problem with that?
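Fleshed out a little, such a loop might look like the sketch below. The
wrapper name is made up, the caller is assumed to hold mmap_sem (which
get_user_pages() requires), get_user_pages() is used in its current
(tsk, mm, start, nr_pages, write, force, pages, vmas) form, and the
cmpxchg runs with pagefaults disabled so a missing page comes back as
-EFAULT instead of being handled behind our back:

        static int mpx_update_bd_entry(long __user *bd_entry,
                                       unsigned long expected_old_val,
                                       unsigned long new_val,
                                       unsigned long *actual_old_val)
        {
                long gup_ret;
                int ret;

                while (1) {
                        pagefault_disable();
                        ret = user_atomic_cmpxchg_inatomic(actual_old_val,
                                                           bd_entry,
                                                           expected_old_val,
                                                           new_val);
                        pagefault_enable();
                        if (ret != -EFAULT)
                                return ret;     /* 0 on success */

                        /* Directory page not present: fault it in, retry. */
                        gup_ret = get_user_pages(current, current->mm,
                                                 (unsigned long)bd_entry, 1,
                                                 1 /* write */, 0 /* force */,
                                                 NULL, NULL);
                        if (gup_ret != 1)
                                return gup_ret < 0 ? (int)gup_ret : -EFAULT;
                }
        }

The "did somebody else win the race?" check on *actual_old_val stays with
the caller, exactly as in the earlier allocation sketch.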
On Thu, 6 Nov 2014, Dave Hansen wrote:
> Instead of all of these games with dropping and reacquiring mmap_sem and
> adding other locks, or deferring the work, why don't we just do a
> get_user_pages()? Something along the lines of:
>
>         while (1) {
>                 ret = cmpxchg(addr);
>                 if (!ret)
>                         break;
>                 if (ret == -EFAULT)
>                         get_user_pages(addr);
>         }
>
> Does anybody see a problem with that?
You want to do that under mmap_sem write held, right? Not a problem per
se, except that you block normal faults for a possibly long time when
the page(s) need to be swapped in.
But yes, this might solve most of the issues at hand. Did not think
about GUP at all :(
Thanks,
tglx
On 11/11/2014 10:27 AM, Thomas Gleixner wrote:
> On Thu, 6 Nov 2014, Dave Hansen wrote:
>> Instead of all of these games with dropping and reacquiring mmap_sem and
>> adding other locks, or deferring the work, why don't we just do a
>> get_user_pages()? Something along the lines of:
>>
>>         while (1) {
>>                 ret = cmpxchg(addr);
>>                 if (!ret)
>>                         break;
>>                 if (ret == -EFAULT)
>>                         get_user_pages(addr);
>>         }
>>
>> Does anybody see a problem with that?
>
> You want to do that under mmap_sem write held, right? Not a problem per
> se, except that you block normal faults for a possibly long time when
> the page(s) need to be swapped in.
Yeah, it might hold mmap_sem for write while doing this in the unmap
path. But, that's only if the bounds directory entry has been swapped
out. There's only one 8-byte bounds directory entry for every 1MB of
data, so it _should_ be relatively rare. It would mean that nobody has
been accessing the 512MB swath of data covered by that one page of the
bounds directory (a 4k page holds 512 entries, each covering 1MB).
If it gets to be an issue, we can always add some code to fault it in
before mmap_sem is acquired.
FWIW, I believe we have a fairly long road ahead of us to optimize MPX
in practice. I have a list of things I want to go investigate, but I
have not looked into them in detail at all.
> But yes, this might solve most of the issues at hand. Did not think
> about GUP at all :(
Whew. Fixing it was getting nasty and complicated. :)
On Tue, 11 Nov 2014, Dave Hansen wrote:
> On 11/11/2014 10:27 AM, Thomas Gleixner wrote:
> > On Thu, 6 Nov 2014, Dave Hansen wrote:
> >> Instead of all of these games with dropping and reacquiring mmap_sem and
> >> adding other locks, or deferring the work, why don't we just do a
> >> get_user_pages()? Something along the lines of:
> >>
> >>         while (1) {
> >>                 ret = cmpxchg(addr);
> >>                 if (!ret)
> >>                         break;
> >>                 if (ret == -EFAULT)
> >>                         get_user_pages(addr);
> >>         }
> >>
> >> Does anybody see a problem with that?
> >
> > You want to do that under mmap_sem write held, right? Not a problem per
> > se, except that you block normal faults for a possibly long time when
> > the page(s) need to be swapped in.
>
> Yeah, it might hold mmap_sem for write while doing this in the unmap
> path. But, that's only if the bounds directory entry has been swapped
> out. There's only one 8-byte bounds directory entry for every 1MB of
> data, so it _should_ be relatively rare. It would mean that nobody has
> been accessing the 512MB swath of data covered by that one page of the
> bounds directory (a 4k page holds 512 entries, each covering 1MB).
>
> If it gets to be an issue, we can always add some code to fault it in
> before mmap_sem is acquired.
I don't think it's a real issue.
> FWIW, I believe we have a fairly long road ahead of us to optimize MPX
> in practice. I have a list of things I want to go investigate, but I
> have not looked into them in detail at all.
:)
> > But yes, this might solve most of the issues at hand. Did not think
> > about GUP at all :(
>
> Whew. Fixing it was getting nasty and complicated. :)
Indeed. Though I think that disentangling specific parts of MPX from
mmap_sem is still a worthwhile exercise. So not all of the complex
ideas we came up with during the discussion are lost in the pointless
complexity universe :)
Thanks,
tglx