This patchset adds support for the Memory Protection Extensions
(MPX) feature found in future Intel processors.
MPX can be used in conjunction with compiler changes to check memory
references whose compile-time intentions are usurped at runtime due
to buffer overflow or underflow.
MPX provides this capability at very low performance overhead for
newly compiled code, and provides compatibility mechanisms with legacy
software components. The MPX architecture is designed to allow a machine
to run both MPX-enabled software and legacy software that is MPX unaware.
In such a case, the legacy software does not benefit from MPX, but it
also does not experience any change in functionality or reduction in
performance.
More information about Intel MPX can be found in "Intel(R) Architecture
Instruction Set Extensions Programming Reference".
To get the advantage of MPX, changes are required in the OS kernel,
binutils, the compiler, and system libraries.
A new GCC option, -fmpx, is introduced to utilize MPX instructions.
Currently, GCC sources with MPX support are available in a separate
branch of the common GCC SVN repository. See the GCC SVN page
(http://gcc.gnu.org/svn.html) for details.
To get full protection, we had to add MPX instrumentation to all the
necessary Glibc routines (e.g. memcpy) written in assembly, and
compile Glibc with the MPX-enabled GCC compiler. MPX-enabled Glibc
sources can currently be found in the Glibc git repository.
Enabling an application to use MPX will generally not require source
code updates, but some runtime code, responsible for configuring and
enabling MPX, is needed in order to make use of it. For most
applications this runtime support will be available by linking to a
library supplied by the compiler, or it may come directly from the OS
once OS versions that support MPX are available.
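As an illustration of what this buys, a minimal sketch (not part of
this patchset; the -fmpx flag comes from the GCC notes above, the rest
of the setup is assumed) of the kind of bug MPX instrumentation is
meant to catch:

/* overflow.c: illustrative only. Built with an MPX-enabled toolchain
 * (e.g. "gcc -fmpx overflow.c" plus the instrumented Glibc described
 * above), the out-of-bounds store below raises a #BR and the process
 * gets SIGSEGV with si_code == SEGV_BNDERR instead of silently
 * corrupting memory.
 */
#include <stdlib.h>

int main(void)
{
        int *buf = malloc(16 * sizeof(int));

        buf[16] = 0;    /* one element past the end: a bounds violation */
        free(buf);
        return 0;
}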
MPX kernel code, namely this patchset, has two main responsibilities:
provide handlers for bounds faults (#BR), and manage bounds memory.
The high-level areas modified in the patchset are as follows:
1) struct siginfo is extended to include bound violation information.
2) two prctl() commands are added as a performance optimization.
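For example, a minimal userspace sketch (assuming the extended siginfo
and the SEGV_BNDERR si_code added later in this series; fprintf() in a
signal handler is used for illustration only, it is not
async-signal-safe) of how a handler could report a violation:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Assumes the si_lower/si_upper fields and SEGV_BNDERR introduced by
 * this patchset are visible through the installed uapi headers.
 */
static void br_handler(int sig, siginfo_t *info, void *ctx)
{
        if (info->si_code == SEGV_BNDERR)
                fprintf(stderr, "bounds violation at %p, bounds [%p, %p]\n",
                        info->si_addr, info->si_lower, info->si_upper);
        _exit(1);       /* returning would re-execute the faulting insn */
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = br_handler,
                .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGSEGV, &sa, NULL);
        /* ... run MPX-instrumented code here ... */
        return 0;
}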
Currently no hardware with the MPX ISA is available, but it is always
possible to use SDE (Intel(R) Software Development Emulator) instead,
which can be downloaded from
http://software.intel.com/en-us/articles/intel-software-development-emulator
Future TODO items:
1) support 32-bit binaries on 64-bit kernels.
Changes since v1:
* check to see if #BR occurred in userspace or kernel space.
* use generic structure and macro as much as possible when
decode mpx instructions.
Changes since v2:
* fix some compile warnings.
* update documentation.
Changes since v3:
* correct some syntax errors in the documentation, and document
the extended struct siginfo.
* kill the process when the error code of BNDSTATUS is 3.
* add some comments.
* remove new prctl() commands.
* fix some compile warnings for 32-bit.
Changes since v4:
* raise SIGBUS if the allocation of a bounds table fails.
Changes since v5:
* hook the unmap() path to clean up unused bounds tables, and use a
new prctl() command to register the bounds directory address in
struct mm_struct, to check whether a process is MPX enabled
during unmap().
* in order to track MPX memory usage precisely, add an MPX-specific
mmap interface and a VM_MPX flag to check whether a VMA
is an MPX bounds table.
* add the macro cpu_has_mpx as a performance optimization.
* sync struct siginfo for mips with the generic version to avoid a
build issue.
Qiaowei Ren (10):
x86, mpx: introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: add MPX specific mmap interface
x86, mpx: add macro cpu_has_mpx
x86, mpx: hook #BR exception handler to allocate bound tables
x86, mpx: extend siginfo structure to include bound violation
information
mips: sync struct siginfo with general version
x86, mpx: decode MPX instruction to get bound violation information
x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
x86, mpx: cleanup unused bound tables
x86, mpx: add documentation on Intel MPX
Documentation/x86/intel_mpx.txt | 127 +++++++++++
arch/mips/include/uapi/asm/siginfo.h | 4 +
arch/x86/Kconfig | 4 +
arch/x86/include/asm/cpufeature.h | 6 +
arch/x86/include/asm/mmu_context.h | 16 ++
arch/x86/include/asm/mpx.h | 91 ++++++++
arch/x86/include/asm/processor.h | 18 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/mpx.c | 413 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 62 +++++-
arch/x86/mm/Makefile | 2 +
arch/x86/mm/init_64.c | 2 +
arch/x86/mm/mpx.c | 247 ++++++++++++++++++++
fs/proc/task_mmu.c | 1 +
include/asm-generic/mmu_context.h | 6 +
include/linux/mm.h | 2 +
include/linux/mm_types.h | 3 +
include/uapi/asm-generic/siginfo.h | 9 +-
include/uapi/linux/prctl.h | 6 +
kernel/signal.c | 4 +
kernel/sys.c | 12 +
mm/mmap.c | 2 +
22 files changed, 1036 insertions(+), 2 deletions(-)
create mode 100644 Documentation/x86/intel_mpx.txt
create mode 100644 arch/x86/include/asm/mpx.h
create mode 100644 arch/x86/kernel/mpx.c
create mode 100644 arch/x86/mm/mpx.c
This patch adds an MPX-specific mmap interface, which only handles
MPX-related maps: bounds tables and the bounds directory.
In order to track MPX-specific memory usage, this interface sticks the
new vm_flag VM_MPX in the vm_area_struct when creating a bounds table
or the bounds directory.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/Kconfig | 4 +++
arch/x86/include/asm/mpx.h | 38 ++++++++++++++++++++++++++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/mpx.c | 58 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 102 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/mpx.h
create mode 100644 arch/x86/mm/mpx.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..0194790 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,10 @@ config HAVE_INTEL_TXT
def_bool y
depends on INTEL_IOMMU && ACPI
+config X86_INTEL_MPX
+ def_bool y
+ depends on CPU_SUP_INTEL
+
config X86_32_SMP
def_bool y
depends on X86_32 && SMP
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
new file mode 100644
index 0000000..5725ac4
--- /dev/null
+++ b/arch/x86/include/asm/mpx.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_X86_MPX_H
+#define _ASM_X86_MPX_H
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef CONFIG_X86_64
+
+/* The upper 28 bits [47:20] of the virtual address in 64-bit mode
+ * are used to index into the bounds directory (BD).
+ */
+#define MPX_BD_ENTRY_OFFSET 28
+#define MPX_BD_ENTRY_SHIFT 3
+/* Bits [19:3] of the virtual address in 64-bit mode are used to
+ * index into the bounds table (BT).
+ */
+#define MPX_BT_ENTRY_OFFSET 17
+#define MPX_BT_ENTRY_SHIFT 5
+#define MPX_IGN_BITS 3
+
+#else
+
+#define MPX_BD_ENTRY_OFFSET 20
+#define MPX_BD_ENTRY_SHIFT 2
+#define MPX_BT_ENTRY_OFFSET 10
+#define MPX_BT_ENTRY_SHIFT 4
+#define MPX_IGN_BITS 2
+
+#endif
+
+#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
+#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+
+#define MPX_BNDSTA_ERROR_CODE 0x3
+
+unsigned long mpx_mmap(unsigned long len);
+
+#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 6a19ad9..ecfdc46 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -30,3 +30,5 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
new file mode 100644
index 0000000..546c5d1
--- /dev/null
+++ b/arch/x86/mm/mpx.c
@@ -0,0 +1,58 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <asm/mpx.h>
+#include <asm/mman.h>
+#include <linux/sched/sysctl.h>
+
+/*
+ * This is really a simplified "vm_mmap". It only handles MPX-related
+ * maps, including bounds tables and the bounds directory.
+ *
+ * Here we can stick the new vm_flag VM_MPX in the vm_area_struct
+ * when creating a bounds table or bounds directory, in order to
+ * track MPX-specific memory.
+ */
+unsigned long mpx_mmap(unsigned long len)
+{
+ unsigned long ret;
+ unsigned long addr, pgoff;
+ struct mm_struct *mm = current->mm;
+ vm_flags_t vm_flags;
+
+ /* Only bounds table and bounds directory can be allocated here */
+ if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
+ return -EINVAL;
+
+ down_write(&mm->mmap_sem);
+
+ /* Too many mappings? */
+ if (mm->map_count > sysctl_max_map_count) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Obtain the address to map to. We verify (or select) it and ensure
+ * that it represents a valid section of the address space.
+ */
+ addr = get_unmapped_area(NULL, 0, len, 0, MAP_ANONYMOUS | MAP_PRIVATE);
+ if (addr & ~PAGE_MASK) {
+ ret = addr;
+ goto out;
+ }
+
+ vm_flags = VM_READ | VM_WRITE | VM_MPX |
+ mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+
+ /* Make the bounds tables and the bounds directory unlocked. */
+ if (vm_flags & VM_LOCKED)
+ vm_flags &= ~VM_LOCKED;
+
+ /* Set pgoff according to addr for anon_vma */
+ pgoff = addr >> PAGE_SHIFT;
+
+ ret = mmap_region(NULL, addr, len, vm_flags, pgoff);
+
+out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
--
1.7.1
This patch handles a #BR exception for non-existent tables by
carving the space out of the normal process's address space
(essentially calling mmap() from inside the kernel) and then
pointing the bounds directory over to it.
The tables need to be accessed and controlled by userspace
because the compiler generates instructions for MPX-enabled
code which frequently store and retrieve entries from the bounds
tables. Any direct kernel involvement (like a syscall) to access
the tables would destroy performance since these are so frequent.
The tables are carved out of userspace because we have no better
spot to put them. For each pointer which is being tracked by MPX,
the bounds tables contain 4 longs worth of data, and the tables
are indexed virtually. If we were to preallocate the tables, we
would theoretically need to allocate 4x the virtual space that
we have available for userspace somewhere else. We don't have
that room in the kernel address space.
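To make the "4x" argument concrete, here is the 64-bit arithmetic
implied by the MPX_* constants in this patch (a sketch; the entry
layout follows the prose above):

/* 64-bit mode, from asm/mpx.h in this patch:
 *
 *   MPX_BD_SIZE_BYTES = 1UL << (28 + 3) = 2^31 bytes = 2GB
 *        (2^28 directory entries, 8 bytes each)
 *   MPX_BT_SIZE_BYTES = 1UL << (17 + 5) = 2^22 bytes = 4MB
 *        (2^17 table entries, 32 bytes each)
 *
 * Each 32-byte table entry is the "4 longs worth of data" per
 * tracked pointer, which is where the ~4x virtual-space blowup
 * for full preallocation comes from.
 */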
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mpx.h | 20 ++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/mpx.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 56 ++++++++++++++++++++++++++++++++++++++-
4 files changed, 139 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kernel/mpx.c
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 5725ac4..b7598ac 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -18,6 +18,8 @@
#define MPX_BT_ENTRY_SHIFT 5
#define MPX_IGN_BITS 3
+#define MPX_BD_ENTRY_TAIL 3
+
#else
#define MPX_BD_ENTRY_OFFSET 20
@@ -26,13 +28,31 @@
#define MPX_BT_ENTRY_SHIFT 4
#define MPX_IGN_BITS 2
+#define MPX_BD_ENTRY_TAIL 2
+
#endif
+#define MPX_BNDSTA_TAIL 2
+#define MPX_BNDCFG_TAIL 12
+#define MPX_BNDSTA_ADDR_MASK (~((1UL<<MPX_BNDSTA_TAIL)-1))
+#define MPX_BNDCFG_ADDR_MASK (~((1UL<<MPX_BNDCFG_TAIL)-1))
+#define MPX_BT_ADDR_MASK (~((1UL<<MPX_BD_ENTRY_TAIL)-1))
+
#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
#define MPX_BNDSTA_ERROR_CODE 0x3
+#define MPX_BD_ENTRY_VALID_FLAG 0x1
unsigned long mpx_mmap(unsigned long len);
+#ifdef CONFIG_X86_INTEL_MPX
+int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
+#else
+static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_X86_INTEL_MPX */
+
#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d9600..3e81aed 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_PREEMPT) += preempt.o
obj-y += process.o
obj-y += i387.o xsave.o
+obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-y += ptrace.o
obj-$(CONFIG_X86_32) += tls.o
obj-$(CONFIG_IA32_EMULATION) += tls.o
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
new file mode 100644
index 0000000..4230c7b
--- /dev/null
+++ b/arch/x86/kernel/mpx.c
@@ -0,0 +1,63 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <asm/mpx.h>
+
+static int allocate_bt(long __user *bd_entry)
+{
+ unsigned long bt_addr, old_val = 0;
+ int ret = 0;
+
+ bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
+ if (IS_ERR((void *)bt_addr)) {
+ pr_err("Bounds table allocation failed at entry addr %p\n",
+ bd_entry);
+ return bt_addr;
+ }
+ bt_addr = (bt_addr & MPX_BT_ADDR_MASK) | MPX_BD_ENTRY_VALID_FLAG;
+
+ ret = user_atomic_cmpxchg_inatomic(&old_val, bd_entry, 0, bt_addr);
+ if (ret)
+ goto out;
+
+ /*
+ * There is an existing bounds table pointed at this bounds
+ * directory entry, so we need to free the bounds table
+ * allocated just now.
+ */
+ if (old_val)
+ goto out;
+
+ pr_debug("Allocate bounds table %lx at entry %p\n",
+ bt_addr, bd_entry);
+ return 0;
+
+out:
+ vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+ return ret;
+}
+
+/*
+ * When a BNDSTX instruction attempts to save bounds to a BD entry
+ * whose valid bit is not set, a #BR is generated.
+ * This is an indication that no BT exists for this entry. In this
+ * case the fault handler will allocate a new BT.
+ *
+ * In 32-bit mode, the size of the BD is 4MB, and the size of each
+ * bounds table is 16KB. In 64-bit mode, the size of the BD is 2GB,
+ * and the size of each bounds table is 4MB.
+ */
+int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
+{
+ unsigned long status;
+ unsigned long bd_entry, bd_base;
+
+ bd_base = xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ADDR_MASK;
+ status = xsave_buf->bndcsr.status_reg;
+
+ bd_entry = status & MPX_BNDSTA_ADDR_MASK;
+ if ((bd_entry < bd_base) ||
+ (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
+ return -EINVAL;
+
+ return allocate_bt((long __user *)bd_entry);
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f73b5d4..35b9b29 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -59,6 +59,7 @@
#include <asm/fixmap.h>
#include <asm/mach_traps.h>
#include <asm/alternative.h>
+#include <asm/mpx.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -213,7 +214,6 @@ dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
DO_ERROR_INFO(X86_TRAP_DE, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip )
DO_ERROR (X86_TRAP_OF, SIGSEGV, "overflow", overflow )
-DO_ERROR (X86_TRAP_BR, SIGSEGV, "bounds", bounds )
DO_ERROR_INFO(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->ip )
DO_ERROR (X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun )
DO_ERROR (X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS )
@@ -263,6 +263,60 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
}
#endif
+dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
+{
+ enum ctx_state prev_state;
+ unsigned long status;
+ struct xsave_struct *xsave_buf;
+ struct task_struct *tsk = current;
+
+ prev_state = exception_enter();
+ if (notify_die(DIE_TRAP, "bounds", regs, error_code,
+ X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
+ goto exit;
+ conditional_sti(regs);
+
+ if (!user_mode(regs))
+ die("bounds", regs, error_code);
+
+ if (!cpu_has_mpx) {
+ /* The exception is not from Intel MPX */
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
+ goto exit;
+ }
+
+ fpu_xsave(&tsk->thread.fpu);
+ xsave_buf = &(tsk->thread.fpu.state->xsave);
+ status = xsave_buf->bndcsr.status_reg;
+
+ /*
+ * The error code field of the BNDSTATUS register communicates status
+ * information of a bound range exception (#BR) or an operation
+ * involving the bounds directory.
+ */
+ switch (status & MPX_BNDSTA_ERROR_CODE) {
+ case 2:
+ /*
+ * Bound directory has invalid entry.
+ * No signal will be sent to the user space.
+ */
+ if (do_mpx_bt_fault(xsave_buf))
+ force_sig(SIGBUS, tsk);
+ break;
+
+ case 1: /* Bound violation. */
+ case 0: /* No exception caused by Intel MPX operations. */
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
+ break;
+
+ default:
+ die("bounds", regs, error_code);
+ }
+
+exit:
+ exception_exit(prev_state);
+}
+
dotraplinkage void __kprobes
do_general_protection(struct pt_regs *regs, long error_code)
{
--
1.7.1
Due to the new bound-violation fields added to struct siginfo,
this patch syncs the MIPS version with the generic one to avoid a
build issue.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/mips/include/uapi/asm/siginfo.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/arch/mips/include/uapi/asm/siginfo.h b/arch/mips/include/uapi/asm/siginfo.h
index e811744..d08f83f 100644
--- a/arch/mips/include/uapi/asm/siginfo.h
+++ b/arch/mips/include/uapi/asm/siginfo.h
@@ -92,6 +92,10 @@ typedef struct siginfo {
int _trapno; /* TRAP # which caused the signal */
#endif
short _addr_lsb;
+ struct {
+ void __user *_lower;
+ void __user *_upper;
+ } _addr_bnd;
} _sigfault;
/* SIGPOLL, SIGXFSZ (To do ...) */
--
1.7.1
This patch sets bound violation fields of siginfo struct in #BR
exception handler by decoding the user instruction and constructing
the faulting pointer.
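As a worked example of what this decoding does (the encoding is my own
breakdown per the ISA reference, not taken from this patch; treat it
as a sketch):

/* "bndcu (%rdx,%rcx,8), %bnd1" encodes as: f2 0f 1a 0c ca
 *
 *   f2    - legacy prefix, skipped by the prefix scan
 *   0f 1a - two opcode bytes, skipped after the prefixes
 *   0c    - ModRM: mod=00, reg=001 (bnd1), rm=100 (SIB follows)
 *   ca    - SIB: scale=3, index=001 (rcx), base=010 (rdx)
 *
 * The handler then reconstructs the referenced address as
 * rdx + rcx * (1 << 3) with no displacement, and reports it in
 * the new siginfo fields.
 */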
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mpx.h | 23 ++++
arch/x86/kernel/mpx.c | 294 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 6 +
3 files changed, 323 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index b7598ac..780af63 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <asm/ptrace.h>
+#include <asm/insn.h>
#ifdef CONFIG_X86_64
@@ -44,15 +45,37 @@
#define MPX_BNDSTA_ERROR_CODE 0x3
#define MPX_BD_ENTRY_VALID_FLAG 0x1
+struct mpx_insn {
+ struct insn_field rex_prefix; /* REX prefix */
+ struct insn_field modrm;
+ struct insn_field sib;
+ struct insn_field displacement;
+
+ unsigned char addr_bytes; /* effective address size */
+ unsigned char limit;
+ unsigned char x86_64;
+
+ const unsigned char *kaddr; /* kernel address of insn to analyze */
+ const unsigned char *next_byte;
+};
+
+#define MAX_MPX_INSN_SIZE 15
+
unsigned long mpx_mmap(unsigned long len);
#ifdef CONFIG_X86_INTEL_MPX
int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
+void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf);
#else
static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
{
return -EINVAL;
}
+static inline void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf)
+{
+}
#endif /* CONFIG_X86_INTEL_MPX */
#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
index 4230c7b..650b282 100644
--- a/arch/x86/kernel/mpx.c
+++ b/arch/x86/kernel/mpx.c
@@ -2,6 +2,270 @@
#include <linux/syscalls.h>
#include <asm/mpx.h>
+typedef enum {REG_TYPE_RM, REG_TYPE_INDEX, REG_TYPE_BASE} reg_type_t;
+static unsigned long get_reg(struct mpx_insn *insn, struct pt_regs *regs,
+ reg_type_t type)
+{
+ int regno = 0;
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+ unsigned char sib = (unsigned char)insn->sib.value;
+
+ static const int regoff[] = {
+ offsetof(struct pt_regs, ax),
+ offsetof(struct pt_regs, cx),
+ offsetof(struct pt_regs, dx),
+ offsetof(struct pt_regs, bx),
+ offsetof(struct pt_regs, sp),
+ offsetof(struct pt_regs, bp),
+ offsetof(struct pt_regs, si),
+ offsetof(struct pt_regs, di),
+#ifdef CONFIG_X86_64
+ offsetof(struct pt_regs, r8),
+ offsetof(struct pt_regs, r9),
+ offsetof(struct pt_regs, r10),
+ offsetof(struct pt_regs, r11),
+ offsetof(struct pt_regs, r12),
+ offsetof(struct pt_regs, r13),
+ offsetof(struct pt_regs, r14),
+ offsetof(struct pt_regs, r15),
+#endif
+ };
+
+ switch (type) {
+ case REG_TYPE_RM:
+ regno = X86_MODRM_RM(modrm);
+ if (X86_REX_B(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ case REG_TYPE_INDEX:
+ regno = X86_SIB_INDEX(sib);
+ if (X86_REX_X(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ case REG_TYPE_BASE:
+ regno = X86_SIB_BASE(sib);
+ if (X86_REX_B(insn->rex_prefix.value) == 1)
+ regno += 8;
+ break;
+
+ default:
+ break;
+ }
+
+ return regs_get_register(regs, regoff[regno]);
+}
+
+/*
+ * Return the address being referenced by the instruction:
+ * for rm=3, return the content of the rm reg;
+ * for rm!=3, calculate the address using SIB and Disp.
+ */
+static unsigned long get_addr_ref(struct mpx_insn *insn, struct pt_regs *regs)
+{
+ unsigned long addr;
+ unsigned long base;
+ unsigned long indx;
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+ unsigned char sib = (unsigned char)insn->sib.value;
+
+ if (X86_MODRM_MOD(modrm) == 3) {
+ addr = get_reg(insn, regs, REG_TYPE_RM);
+ } else {
+ if (insn->sib.nbytes) {
+ base = get_reg(insn, regs, REG_TYPE_BASE);
+ indx = get_reg(insn, regs, REG_TYPE_INDEX);
+ addr = base + indx * (1 << X86_SIB_SCALE(sib));
+ } else {
+ addr = get_reg(insn, regs, REG_TYPE_RM);
+ }
+ addr += insn->displacement.value;
+ }
+
+ return addr;
+}
+
+/* Verify next sizeof(t) bytes can be on the same instruction */
+#define validate_next(t, insn, n) \
+ ((insn)->next_byte + sizeof(t) + n - (insn)->kaddr <= (insn)->limit)
+
+#define __get_next(t, insn) \
+({ \
+ t r = *(t *)insn->next_byte; \
+ insn->next_byte += sizeof(t); \
+ r; \
+})
+
+#define __peek_next(t, insn) \
+({ \
+ t r = *(t *)insn->next_byte; \
+ r; \
+})
+
+#define get_next(t, insn) \
+({ \
+ if (unlikely(!validate_next(t, insn, 0))) \
+ goto err_out; \
+ __get_next(t, insn); \
+})
+
+#define peek_next(t, insn) \
+({ \
+ if (unlikely(!validate_next(t, insn, 0))) \
+ goto err_out; \
+ __peek_next(t, insn); \
+})
+
+static void mpx_insn_get_prefixes(struct mpx_insn *insn)
+{
+ unsigned char b;
+
+ /* Decode legacy prefix and REX prefix */
+ b = peek_next(unsigned char, insn);
+ while (b != 0x0f) {
+ /*
+ * Look for a REX prefix;
+ * a REX prefix cannot be followed by a legacy prefix.
+ */
+ if (insn->x86_64 && ((b&0xf0) == 0x40)) {
+ insn->rex_prefix.value = b;
+ insn->rex_prefix.nbytes = 1;
+ insn->next_byte++;
+ break;
+ }
+
+ /* check the other legacy prefixes */
+ switch (b) {
+ case 0xf2:
+ case 0xf3:
+ case 0xf0:
+ case 0x64:
+ case 0x65:
+ case 0x2e:
+ case 0x3e:
+ case 0x26:
+ case 0x36:
+ case 0x66:
+ case 0x67:
+ insn->next_byte++;
+ break;
+ default: /* everything else is garbage */
+ goto err_out;
+ }
+ b = peek_next(unsigned char, insn);
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_modrm(struct mpx_insn *insn)
+{
+ insn->modrm.value = get_next(unsigned char, insn);
+ insn->modrm.nbytes = 1;
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_sib(struct mpx_insn *insn)
+{
+ unsigned char modrm = (unsigned char)insn->modrm.value;
+
+ if (X86_MODRM_MOD(modrm) != 3 && X86_MODRM_RM(modrm) == 4) {
+ insn->sib.value = get_next(unsigned char, insn);
+ insn->sib.nbytes = 1;
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_get_displacement(struct mpx_insn *insn)
+{
+ unsigned char mod, rm, base;
+
+ /*
+ * Interpreting the modrm byte:
+ * mod = 00 - no displacement fields (exceptions below)
+ * mod = 01 - 1-byte displacement field
+ * mod = 10 - displacement field is 4 bytes
+ * mod = 11 - no memory operand
+ *
+ * mod != 11, r/m = 100 - SIB byte exists
+ * mod = 00, SIB base = 101 - displacement field is 4 bytes
+ * mod = 00, r/m = 101 - rip-relative addressing, displacement
+ * field is 4 bytes
+ */
+ mod = X86_MODRM_MOD(insn->modrm.value);
+ rm = X86_MODRM_RM(insn->modrm.value);
+ base = X86_SIB_BASE(insn->sib.value);
+ if (mod == 3)
+ return;
+ if (mod == 1) {
+ insn->displacement.value = get_next(unsigned char, insn);
+ insn->displacement.nbytes = 1;
+ } else if ((mod == 0 && rm == 5) || mod == 2 ||
+ (mod == 0 && base == 5)) {
+ insn->displacement.value = get_next(int, insn);
+ insn->displacement.nbytes = 4;
+ }
+
+err_out:
+ return;
+}
+
+static void mpx_insn_init(struct mpx_insn *insn, struct pt_regs *regs)
+{
+ unsigned char buf[MAX_MPX_INSN_SIZE];
+ int bytes;
+
+ memset(insn, 0, sizeof(*insn));
+
+ bytes = copy_from_user(buf, (void __user *)regs->ip, MAX_MPX_INSN_SIZE);
+ insn->limit = MAX_MPX_INSN_SIZE - bytes;
+ insn->kaddr = buf;
+ insn->next_byte = buf;
+
+ /*
+ * In 64-bit Mode, all Intel MPX instructions use 64-bit
+ * operands for bounds and 64 bit addressing, i.e. REX.W &
+ * 67H have no effect on data or address size.
+ *
+ * In compatibility and legacy modes (including 16-bit code
+ * segments, real and virtual 8086 modes) all Intel MPX
+ * instructions use 32-bit operands for bounds and 32 bit
+ * addressing.
+ */
+#ifdef CONFIG_X86_64
+ insn->x86_64 = 1;
+ insn->addr_bytes = 8;
+#else
+ insn->x86_64 = 0;
+ insn->addr_bytes = 4;
+#endif
+}
+
+static unsigned long mpx_insn_decode(struct mpx_insn *insn,
+ struct pt_regs *regs)
+{
+ mpx_insn_init(insn, regs);
+
+ /*
+ * In this case, we only need to decode bndcl/bndcn/bndcu,
+ * so we can use private disassembly interfaces to get the
+ * prefixes, modrm, sib, displacement, etc.
+ */
+ mpx_insn_get_prefixes(insn);
+ insn->next_byte += 2; /* ignore opcode */
+ mpx_insn_get_modrm(insn);
+ mpx_insn_get_sib(insn);
+ mpx_insn_get_displacement(insn);
+
+ return get_addr_ref(insn, regs);
+}
+
static int allocate_bt(long __user *bd_entry)
{
unsigned long bt_addr, old_val = 0;
@@ -61,3 +325,33 @@ int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
return allocate_bt((long __user *)bd_entry);
}
+
+/*
+ * If a bounds overflow occurs then a #BR is generated. The fault
+ * handler will decode the MPX instruction to get the violation
+ * address and set this address in the extended struct siginfo.
+ */
+void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
+ struct xsave_struct *xsave_buf)
+{
+ struct mpx_insn insn;
+ uint8_t bndregno;
+ unsigned long addr_vio;
+
+ addr_vio = mpx_insn_decode(&insn, regs);
+
+ bndregno = X86_MODRM_REG(insn.modrm.value);
+ if (bndregno > 3)
+ return;
+
+ /* Note: the upper 32 bits are ignored in 32-bit mode. */
+ info->si_lower = (void __user *)(unsigned long)
+ (xsave_buf->bndregs.bndregs[2*bndregno]);
+ info->si_upper = (void __user *)(unsigned long)
+ (~xsave_buf->bndregs.bndregs[2*bndregno+1]);
+ info->si_addr_lsb = 0;
+ info->si_signo = SIGSEGV;
+ info->si_errno = 0;
+ info->si_code = SEGV_BNDERR;
+ info->si_addr = (void __user *)addr_vio;
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 35b9b29..859cf50 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -269,6 +269,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
unsigned long status;
struct xsave_struct *xsave_buf;
struct task_struct *tsk = current;
+ siginfo_t info;
prev_state = exception_enter();
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
@@ -305,6 +306,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
break;
case 1: /* Bound violation. */
+ do_mpx_bounds(regs, &info, xsave_buf);
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs,
+ error_code, &info);
+ break;
+
case 0: /* No exception caused by Intel MPX operations. */
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
break;
--
1.7.1
This patch adds the Documentation/x86/intel_mpx.txt file with some
information about Intel MPX.
Signed-off-by: Qiaowei Ren <[email protected]>
---
Documentation/x86/intel_mpx.txt | 127 +++++++++++++++++++++++++++++++++++++++
1 files changed, 127 insertions(+), 0 deletions(-)
create mode 100644 Documentation/x86/intel_mpx.txt
diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt
new file mode 100644
index 0000000..1af9809
--- /dev/null
+++ b/Documentation/x86/intel_mpx.txt
@@ -0,0 +1,127 @@
+1. Intel(R) MPX Overview
+========================
+
+Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new
+capability introduced into Intel Architecture. Intel MPX provides
+hardware features that can be used in conjunction with compiler
+changes to check memory references whose compile-time intentions
+are usurped at runtime due to buffer overflow or underflow.
+
+For more information, please refer to Intel(R) Architecture
+Instruction Set Extensions Programming Reference, Chapter 9:
+Intel(R) Memory Protection Extensions.
+
+Note: Currently no hardware with MPX ISA is available but it is always
+possible to use SDE (Intel(R) Software Development Emulator) instead,
+which can be downloaded from
+http://software.intel.com/en-us/articles/intel-software-development-emulator
+
+
+2. How does MPX kernel code work
+================================
+
+Handling #BR faults caused by MPX
+---------------------------------
+
+When MPX is enabled, there are 2 new situations that can generate
+#BR faults.
+ * bounds violation caused by MPX instructions.
+ * new bounds tables (BT) need to be allocated to save bounds.
+
+We hook the #BR handler to handle these two new situations.
+
+Decoding MPX instructions
+-------------------------
+
+If a #BR is generated due to a bounds violation caused by MPX,
+we need to decode the MPX instruction to get the violation address
+and set this address in the extended struct siginfo.
+
+The _sigfault field of struct siginfo is extended as follows:
+
+87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
+88 struct {
+89 void __user *_addr; /* faulting insn/memory ref. */
+90 #ifdef __ARCH_SI_TRAPNO
+91 int _trapno; /* TRAP # which caused the signal */
+92 #endif
+93 short _addr_lsb; /* LSB of the reported address */
+94 struct {
+95 void __user *_lower;
+96 void __user *_upper;
+97 } _addr_bnd;
+98 } _sigfault;
+
+The '_addr' field refers to the violation address, and the new
+'_addr_bnd' field refers to the lower/upper bounds when a #BR is
+raised.
+
+Glibc will also be updated to support this new siginfo, so users
+can get the violation address and bounds when bounds violations
+occur.
+
+Freeing unused bounds tables
+----------------------------
+
+When a BNDSTX instruction attempts to save bounds to a bounds directory
+entry marked as invalid, a #BR is generated. This is an indication that
+no bounds table exists for this entry. In this case the fault handler
+will allocate a new bounds table on demand.
+
+Since the kernel allocated those tables on-demand without userspace
+knowledge, it is also responsible for freeing them when the associated
+mappings go away.
+
+The solution is to hook do_munmap() and check whether the process is
+MPX enabled. If so, the bounds tables covered by the virtual address
+region being unmapped will be freed as well.
+
+Adding new prctl commands
+-------------------------
+
+The runtime library in userspace is responsible for allocating the
+bounds directory, so the kernel has to use the XSAVE instruction to
+get the base of the bounds directory from the BNDCFG register.
+
+But XSAVE is expected to be very expensive. As a performance
+optimization, we add new prctl commands to fetch the base of the
+bounds directory once and cache it for future use.
+
+Two new prctl commands are added to register and unregister MPX-related
+resources.
+
+155 #define PR_MPX_REGISTER 43
+156 #define PR_MPX_UNREGISTER 44
+
+The base of the bounds directory is saved in mm_struct during
+PR_MPX_REGISTER command execution. This member can then be used to
+check whether an application is MPX enabled.
+
+
+3. Tips
+=======
+
+1) Users are not allowed to create bounds tables and point the bounds
+directory at them from userspace. In fact, it is not even necessary
+for users to create bounds tables in userspace.
+
+When a #BR fault is produced due to an invalid entry, a bounds table
+will be created in the kernel on demand and the kernel will not forward
+this fault to userspace. So userspace can't receive a #BR fault for an
+invalid entry, and there is no need for users to create bounds tables
+themselves.
+
+Certainly, users could allocate bounds tables and forcibly point the
+bounds directory at them through the XSAVE instruction, and then set
+the valid bit of the bounds entry. But we have no way to track the
+memory usage of these user-created bounds tables, so this behaviour
+is outlawed here.
+
+2) We will not support the case that multiple bounds directory entries
+are pointed at the same bounds table.
+
+The hardware allows multiple bounds directory entries to point at the
+same bounds table. For more information, see "Intel(R) Architecture
+Instruction Set Extensions Programming Reference" (9.3.4).
+
+If userspace did this, it would be possible for the kernel to unmap an
+in-use bounds table, since it does not recognize sharing. So this
+behavior is also outlawed here.
--
1.7.1
When a user memory region is unmapped, the related bounds tables
become unused and need to be released as well. This patch cleans up
these unused bounds tables by hooking the unmap path.
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 16 +++
arch/x86/include/asm/mpx.h | 9 ++
arch/x86/mm/mpx.c | 189 ++++++++++++++++++++++++++++++++++++
include/asm-generic/mmu_context.h | 6 +
mm/mmap.c | 2 +
5 files changed, 222 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index be12c53..af70d4f 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
+#include <asm/mpx.h>
#ifndef CONFIG_PARAVIRT
#include <asm-generic/mm_hooks.h>
@@ -96,4 +97,19 @@ do { \
} while (0)
#endif
+static inline void arch_unmap(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+#ifdef CONFIG_X86_INTEL_MPX
+ /*
+ * Check whether this vma comes from an MPX-enabled application.
+ * If so, release the bounds tables related to this vma.
+ */
+ if (mm->bd_addr && !(vma->vm_flags & VM_MPX))
+ mpx_unmap(mm, start, end);
+
+#endif
+}
+
#endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 6cb0853..e848a74 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -42,6 +42,13 @@
#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+#define MPX_BD_ENTRY_MASK ((1<<MPX_BD_ENTRY_OFFSET)-1)
+#define MPX_BT_ENTRY_MASK ((1<<MPX_BT_ENTRY_OFFSET)-1)
+#define MPX_GET_BD_ENTRY_OFFSET(addr) ((((addr)>>(MPX_BT_ENTRY_OFFSET+ \
+ MPX_IGN_BITS)) & MPX_BD_ENTRY_MASK) << MPX_BD_ENTRY_SHIFT)
+#define MPX_GET_BT_ENTRY_OFFSET(addr) ((((addr)>>MPX_IGN_BITS) & \
+ MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
+
#define MPX_BNDSTA_ERROR_CODE 0x3
#define MPX_BNDCFG_ENABLE_FLAG 0x1
#define MPX_BD_ENTRY_VALID_FLAG 0x1
@@ -63,6 +70,8 @@ struct mpx_insn {
#define MAX_MPX_INSN_SIZE 15
unsigned long mpx_mmap(unsigned long len);
+void mpx_unmap(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
#ifdef CONFIG_X86_INTEL_MPX
int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 546c5d1..fd05cd4 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -2,6 +2,7 @@
#include <linux/syscalls.h>
#include <asm/mpx.h>
#include <asm/mman.h>
+#include <asm/mmu_context.h>
#include <linux/sched/sysctl.h>
/*
@@ -56,3 +57,191 @@ out:
up_write(&mm->mmap_sem);
return ret;
}
+
+/*
+ * Get the base of the bounds table pointed to by a specific
+ * bounds directory entry.
+ */
+static int get_bt_addr(long __user *bd_entry, unsigned long *bt_addr,
+ unsigned int *valid)
+{
+ if (get_user(*bt_addr, bd_entry))
+ return -EFAULT;
+
+ *valid = *bt_addr & MPX_BD_ENTRY_VALID_FLAG;
+ *bt_addr &= MPX_BT_ADDR_MASK;
+
+ /*
+ * If this bounds directory entry is nonzero but its valid bit
+ * is clear, a SIGSEGV is raised due to this unexpected
+ * situation.
+ */
+ if (!(*valid) && *bt_addr)
+ force_sig(SIGSEGV, current);
+
+ pr_debug("get_bt: BD Entry (%p) - Table (%lx,%d)\n",
+ bd_entry, *bt_addr, *valid);
+ return 0;
+}
+
+/*
+ * Free the backing physical pages of bounds table 'bt_addr'.
+ * Assume start...end is within that bounds table.
+ */
+static void zap_bt_entries(struct mm_struct *mm, unsigned long bt_addr,
+ unsigned long start, unsigned long end)
+{
+ struct vm_area_struct *vma;
+
+ /* Find the vma which overlaps this bounds table */
+ vma = find_vma(mm, bt_addr);
+ if (!vma || vma->vm_start > bt_addr ||
+ vma->vm_end < bt_addr+MPX_BT_SIZE_BYTES)
+ return;
+
+ zap_page_range(vma, start, end, NULL);
+ pr_debug("Bound table de-allocation %lx (%lx, %lx)\n",
+ bt_addr, start, end);
+}
+
+static void unmap_single_bt(struct mm_struct *mm, long __user *bd_entry,
+ unsigned long bt_addr)
+{
+ if (user_atomic_cmpxchg_inatomic(&bt_addr, bd_entry,
+ bt_addr | MPX_BD_ENTRY_VALID_FLAG, 0))
+ return;
+
+ pr_debug("Bound table de-allocation %lx at entry addr %p\n",
+ bt_addr, bd_entry);
+ /*
+ * To avoid recursion, do_munmap() will check whether the area
+ * comes from a bounds table via the VM_MPX flag.
+ */
+ do_munmap(mm, bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+}
+
+/*
+ * If the bounds table pointed to by bounds directory 'bd_entry' is
+ * not shared, unmap this whole bounds table. Otherwise, only free
+ * those backing physical pages of bounds table entries covered
+ * in this virtual address region start...end.
+ */
+static void unmap_shared_bt(struct mm_struct *mm, long __user *bd_entry,
+ unsigned long start, unsigned long end,
+ bool prev_shared, bool next_shared)
+{
+ unsigned long bt_addr;
+ unsigned int bde_valid = 0;
+
+ if (get_bt_addr(bd_entry, &bt_addr, &bde_valid) || !bde_valid)
+ return;
+
+ if (prev_shared && next_shared)
+ zap_bt_entries(mm, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(end-1));
+ else if (prev_shared)
+ zap_bt_entries(mm, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
+ bt_addr+MPX_BT_SIZE_BYTES);
+ else if (next_shared)
+ zap_bt_entries(mm, bt_addr, bt_addr,
+ bt_addr+MPX_GET_BT_ENTRY_OFFSET(end-1));
+ else
+ unmap_single_bt(mm, bd_entry, bt_addr);
+}
+
+/*
+ * A virtual address region being munmap()ed might share bounds table
+ * with adjacent VMAs. We only need to free the backing physical
+ * memory of these shared bounds tables entries covered in this virtual
+ * address region.
+ *
+ * the VMAs covering the virtual address region start...end have already
+ * been split if necessary and removed from the VMA list.
+ */
+static void unmap_side_bts(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
+{
+ long __user *bde_start, *bde_end;
+ struct vm_area_struct *prev, *next;
+ bool prev_shared = false, next_shared = false;
+
+ bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
+ bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+
+ /*
+ * Check whether bde_start and bde_end are shared with adjacent
+ * VMAs. Because the VMAs covering the virtual address region
+ * start...end have already been removed from the VMA list, if
+ * next is not NULL it will satisfy start < end <= next->vm_start.
+ * And if prev is not NULL, prev->vm_end <= start < end.
+ */
+ next = find_vma_prev(mm, start, &prev);
+ if (prev && MPX_GET_BD_ENTRY_OFFSET(prev->vm_end-1) == (long)bde_start)
+ prev_shared = true;
+ if (next && MPX_GET_BD_ENTRY_OFFSET(next->vm_start) == (long)bde_end)
+ next_shared = true;
+
+ /*
+ * This virtual address region being munmap()ed is only
+ * covered by one bounds table.
+ *
+ * In this case, if this table is also shared with adjacent
+ * VMAs, only part of the backing physical memory of the bounds
+ * table needs to be freed. Otherwise the whole bounds table
+ * needs to be unmapped.
+ */
+ if (bde_start == bde_end) {
+ unmap_shared_bt(mm, bde_start, start, end,
+ prev_shared, next_shared);
+ return;
+ }
+
+ /*
+ * If more than one bounds table is covered by this virtual
+ * address region being munmap()ed, we need to check separately
+ * whether bde_start and bde_end are shared with adjacent VMAs.
+ */
+ unmap_shared_bt(mm, bde_start, start, end, prev_shared, false);
+ unmap_shared_bt(mm, bde_end, start, end, false, next_shared);
+}
+
+/*
+ * Free unused bounds tables covered in a virtual address region being
+ * munmap()ed. Assume end > start.
+ *
+ * This function will be called by do_munmap(), and the VMAs covering
+ * the virtual address region start...end have already been split if
+ * necessary and removed from the VMA list.
+ */
+void mpx_unmap(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ long __user *bd_entry, *bde_start, *bde_end;
+ unsigned long bt_addr;
+ unsigned int bde_valid;
+
+ pr_debug("addr(%lx, %lx) related Bounds Table will be released\n",
+ start, end);
+ /*
+ * Unmap the bounds tables pointed to by the start/end bounds
+ * directory entries, or only free part of their backing physical
+ * memory if they are shared with adjacent VMAs.
+ */
+ unmap_side_bts(mm, start, end);
+
+ /*
+ * Unmap those bounds tables which are entirely covered by this
+ * virtual address region.
+ */
+ bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
+ bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+ for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
+ if (get_bt_addr(bd_entry, &bt_addr, &bde_valid))
+ return;
+ if (!bde_valid)
+ continue;
+ unmap_single_bt(mm, bd_entry, bt_addr);
+ }
+}
diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index a7eec91..ac558ca 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -42,4 +42,10 @@ static inline void activate_mm(struct mm_struct *prev_mm,
{
}
+static inline void arch_unmap(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+}
+
#endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index b1202cf..8d8ac65 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2558,6 +2558,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
/* Fix up all other VM information */
remove_vma_list(mm, vma);
+ arch_unmap(mm, vma, start, end);
+
return 0;
}
--
1.7.1
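Aside: to illustrate the directory/table index arithmetic in the unmap
patch above, a worked sketch for one sample 64-bit address (my own
arithmetic, derived from the MPX_GET_*_OFFSET macros):

/* addr = 0x7f1234567890
 *
 * MPX_GET_BD_ENTRY_OFFSET(addr)
 *        = ((addr >> 20) & ((1UL << 28) - 1)) << 3
 *        = 0x7f12345 << 3 = 0x3f891a28
 * MPX_GET_BT_ENTRY_OFFSET(addr)
 *        = ((addr >> 3) & ((1UL << 17) - 1)) << 5
 *        = 0xcf12 << 5 = 0x19e240
 *
 * That is, bits [47:20] select the bounds directory entry and
 * bits [19:3] select the bounds table entry, matching the
 * comments in asm/mpx.h.
 */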
This patch adds the PR_MPX_REGISTER and PR_MPX_UNREGISTER prctl()
commands. These commands can be used to register and unregister MPX-
related resources on the x86 platform.
The base of the bounds directory is saved in mm_struct during
PR_MPX_REGISTER command execution. This member can then be used to
check whether an application is MPX enabled.
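A minimal usage sketch (assuming the uapi values added by this patch;
error handling elided):

#include <sys/prctl.h>

#ifndef PR_MPX_REGISTER
#define PR_MPX_REGISTER   43    /* values from this patch's prctl.h */
#define PR_MPX_UNREGISTER 44
#endif

/* Must run after the MPX runtime has allocated the bounds directory
 * and enabled MPX via XRSTOR, since the kernel reads the directory
 * base out of BNDCFG. The extra arguments are unused on the kernel
 * side shown below.
 */
static int mpx_runtime_register(void)
{
        return prctl(PR_MPX_REGISTER, 0, 0, 0, 0);
}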
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/mpx.h | 1 +
arch/x86/include/asm/processor.h | 18 ++++++++++++
arch/x86/kernel/mpx.c | 56 ++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 3 ++
include/uapi/linux/prctl.h | 6 ++++
kernel/sys.c | 12 ++++++++
6 files changed, 96 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 780af63..6cb0853 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -43,6 +43,7 @@
#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
#define MPX_BNDSTA_ERROR_CODE 0x3
+#define MPX_BNDCFG_ENABLE_FLAG 0x1
#define MPX_BD_ENTRY_VALID_FLAG 0x1
struct mpx_insn {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a4ea023..6e0966e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -952,6 +952,24 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
extern int get_tsc_mode(unsigned long adr);
extern int set_tsc_mode(unsigned int val);
+/* Register/unregister a process' MPX related resource */
+#define MPX_REGISTER(tsk) mpx_register((tsk))
+#define MPX_UNREGISTER(tsk) mpx_unregister((tsk))
+
+#ifdef CONFIG_X86_INTEL_MPX
+extern int mpx_register(struct task_struct *tsk);
+extern int mpx_unregister(struct task_struct *tsk);
+#else
+static inline int mpx_register(struct task_struct *tsk)
+{
+ return -EINVAL;
+}
+static inline int mpx_unregister(struct task_struct *tsk)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_X86_INTEL_MPX */
+
extern u16 amd_get_nb_id(int cpu);
static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
index 650b282..d8a2a09 100644
--- a/arch/x86/kernel/mpx.c
+++ b/arch/x86/kernel/mpx.c
@@ -1,6 +1,62 @@
#include <linux/kernel.h>
#include <linux/syscalls.h>
+#include <linux/prctl.h>
#include <asm/mpx.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
+
+/*
+ * This should only be called when cpuid has been checked
+ * and we are sure that MPX is available.
+ */
+static __user void *task_get_bounds_dir(struct task_struct *tsk)
+{
+ struct xsave_struct *xsave_buf;
+
+ fpu_xsave(&tsk->thread.fpu);
+ xsave_buf = &(tsk->thread.fpu.state->xsave);
+ if (!(xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ENABLE_FLAG))
+ return NULL;
+
+ return (void __user *)(xsave_buf->bndcsr.cfg_reg_u &
+ MPX_BNDCFG_ADDR_MASK);
+}
+
+int mpx_register(struct task_struct *tsk)
+{
+ struct mm_struct *mm = tsk->mm;
+
+ if (!cpu_has_mpx)
+ return -EINVAL;
+
+ /*
+ * The runtime in userspace is responsible for allocating the
+ * bounds directory. It then saves the base of the bounds
+ * directory into the XSAVE/XRSTOR save area and enables MPX
+ * through the XRSTOR instruction.
+ *
+ * fpu_xsave() is expected to be very expensive. As a performance
+ * optimization, here we get the base of the bounds directory
+ * and save it in mm_struct for future use.
+ */
+ mm->bd_addr = task_get_bounds_dir(tsk);
+ if (!mm->bd_addr)
+ return -EINVAL;
+
+ pr_debug("MPX BD base address %p\n", mm->bd_addr);
+ return 0;
+}
+
+int mpx_unregister(struct task_struct *tsk)
+{
+ struct mm_struct *mm = current->mm;
+
+ if (!cpu_has_mpx)
+ return -EINVAL;
+
+ mm->bd_addr = NULL;
+ return 0;
+}
typedef enum {REG_TYPE_RM, REG_TYPE_INDEX, REG_TYPE_BASE} reg_type_t;
static unsigned long get_reg(struct mpx_insn *insn, struct pt_regs *regs,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8967e20..54b8011 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -454,6 +454,9 @@ struct mm_struct {
bool tlb_flush_pending;
#endif
struct uprobes_state uprobes_state;
+#ifdef CONFIG_X86_INTEL_MPX
+ void __user *bd_addr; /* address of the bounds directory */
+#endif
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04..ce86fa9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,10 @@
#define PR_SET_THP_DISABLE 41
#define PR_GET_THP_DISABLE 42
+/*
+ * Register/unregister MPX-related resources.
+ */
+#define PR_MPX_REGISTER 43
+#define PR_MPX_UNREGISTER 44
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index fba0f29..ed1a03c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -91,6 +91,12 @@
#ifndef SET_TSC_CTL
# define SET_TSC_CTL(a) (-EINVAL)
#endif
+#ifndef MPX_REGISTER
+# define MPX_REGISTER(a) (-EINVAL)
+#endif
+#ifndef MPX_UNREGISTER
+# define MPX_UNREGISTER(a) (-EINVAL)
+#endif
/*
* this is where the system-wide overflow UID and GID are defined, for
@@ -2011,6 +2017,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
me->mm->def_flags &= ~VM_NOHUGEPAGE;
up_write(&me->mm->mmap_sem);
break;
+ case PR_MPX_REGISTER:
+ error = MPX_REGISTER(me);
+ break;
+ case PR_MPX_UNREGISTER:
+ error = MPX_UNREGISTER(me);
+ break;
default:
error = -EINVAL;
break;
--
1.7.1
This patch adds new bound-violation fields to the siginfo structure.
si_lower and si_upper are, respectively, the lower and upper bounds
in effect when a bound violation occurs.
Signed-off-by: Qiaowei Ren <[email protected]>
---
include/uapi/asm-generic/siginfo.h | 9 ++++++++-
kernel/signal.c | 4 ++++
2 files changed, 12 insertions(+), 1 deletions(-)
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ba5be7f..1e35520 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -91,6 +91,10 @@ typedef struct siginfo {
int _trapno; /* TRAP # which caused the signal */
#endif
short _addr_lsb; /* LSB of the reported address */
+ struct {
+ void __user *_lower;
+ void __user *_upper;
+ } _addr_bnd;
} _sigfault;
/* SIGPOLL */
@@ -131,6 +135,8 @@ typedef struct siginfo {
#define si_trapno _sifields._sigfault._trapno
#endif
#define si_addr_lsb _sifields._sigfault._addr_lsb
+#define si_lower _sifields._sigfault._addr_bnd._lower
+#define si_upper _sifields._sigfault._addr_bnd._upper
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
#ifdef __ARCH_SIGSYS
@@ -199,7 +205,8 @@ typedef struct siginfo {
*/
#define SEGV_MAPERR (__SI_FAULT|1) /* address not mapped to object */
#define SEGV_ACCERR (__SI_FAULT|2) /* invalid permissions for mapped object */
-#define NSIGSEGV 2
+#define SEGV_BNDERR (__SI_FAULT|3) /* failed address bound checks */
+#define NSIGSEGV 3
/*
* SIGBUS si_codes
diff --git a/kernel/signal.c b/kernel/signal.c
index 6ea13c0..0fcf749 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2773,6 +2773,10 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
if (from->si_code == BUS_MCEERR_AR || from->si_code == BUS_MCEERR_AO)
err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);
#endif
+#ifdef SEGV_BNDERR
+ err |= __put_user(from->si_lower, &to->si_lower);
+ err |= __put_user(from->si_upper, &to->si_upper);
+#endif
break;
case __SI_CHLD:
err |= __put_user(from->si_pid, &to->si_pid);
--
1.7.1
As a performance optimization, this patch adds the macro cpu_has_mpx,
which directly evaluates to 0 when MPX is not supported by the
kernel.
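For instance (a sketch of the intent, using a call site from the prctl
patch in this series):

/* In mpx_register(), the guard
 *
 *        if (!cpu_has_mpx)
 *                return -EINVAL;
 *
 * becomes "if (!0)" when CONFIG_X86_INTEL_MPX is off, so the
 * compiler can drop the MPX path entirely; with the config on, it
 * falls back to the usual boot_cpu_has(X86_FEATURE_MPX) test.
 */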
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e265ff9..f302d08 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -339,6 +339,12 @@ extern const char * const x86_power_flags[32];
#define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
#define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
+#ifdef CONFIG_X86_INTEL_MPX
+#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
+#else
+#define cpu_has_mpx 0
+#endif /* CONFIG_X86_INTEL_MPX */
+
#ifdef CONFIG_X86_64
#undef cpu_has_vme
--
1.7.1
An MPX-enabled application will possibly create a lot of bounds tables
in its process address space to store bounds information. These tables
can take up huge swaths of memory (as much as 80% of the memory on
the system) even if we clean them up aggressively. Being this huge,
we need a way to track their memory use. If we want to track them,
we essentially have two options:
1. walk the multi-GB (in virtual space) bounds directory to locate
all the VMAs and walk them
2. Find a way to distinguish MPX bounds-table VMAs from normal
anonymous VMAs and use some existing mechanism to walk them
We expect (1) to be prohibitively expensive. For (2), we only
need a single bit, and we've chosen to use a VM_ flag. We understand
that these are scarce and are open to other options.
There is one potential hybrid approach: check the bounds directory
entry for any anonymous VMA that could possibly contain a bounds table.
This is less expensive than (1), but still requires reading a pointer
out of userspace for every VMA that we iterate over.
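With the flag in place, bounds-table VMAs become visible from
userspace via the "[mpx]" name from arch_vma_name() and the "mp" bit
in the smaps VmFlags line, both added below. For example (the address
and output shape are hypothetical; the 4MB span matches
MPX_BT_SIZE_BYTES):

        $ grep mpx /proc/$pid/maps
        7fff10000000-7fff10400000 rw-p 00000000 00:00 0    [mpx]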
Signed-off-by: Qiaowei Ren <[email protected]>
---
arch/x86/mm/init_64.c | 2 ++
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 2 ++
3 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f35c66c..2d41679 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1223,6 +1223,8 @@ int in_gate_area_no_mm(unsigned long addr)
const char *arch_vma_name(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_MPX)
+ return "[mpx]";
if (vma->vm_mm && vma->vm_start == (long)vma->vm_mm->context.vdso)
return "[vdso]";
if (vma == &gate_vma)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 442177b..09266bd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -543,6 +543,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_GROWSDOWN)] = "gd",
[ilog2(VM_PFNMAP)] = "pf",
[ilog2(VM_DENYWRITE)] = "dw",
+ [ilog2(VM_MPX)] = "mp",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d677706..029c716 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -127,6 +127,8 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
+/* MPX specific bounds table or bounds directory (x86) */
+#define VM_MPX 0x02000000
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
#ifdef CONFIG_MEM_SOFT_DIRTY
--
1.7.1
On Wed, Jun 18, 2014 at 05:44:09PM +0800, Qiaowei Ren wrote:
> In order to do performance optimization, this patch adds macro
> cpu_has_mpx which will directly return 0 when MPX is not supported
> by kernel.
>
> Signed-off-by: Qiaowei Ren <[email protected]>
> ---
> arch/x86/include/asm/cpufeature.h | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index e265ff9..f302d08 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -339,6 +339,12 @@ extern const char * const x86_power_flags[32];
> #define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
> #define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
>
> +#ifdef CONFIG_X86_INTEL_MPX
> +#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
I think we don't want those macros anymore because they're obfuscating
the code. You should use static_cpu_has instead.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Wed, Jun 18, 2014 at 05:44:13PM +0800, Qiaowei Ren wrote:
> This patch sets bound violation fields of siginfo struct in #BR
> exception handler by decoding the user instruction and constructing
> the faulting pointer.
>
> Signed-off-by: Qiaowei Ren <[email protected]>
> ---
> arch/x86/include/asm/mpx.h | 23 ++++
> arch/x86/kernel/mpx.c | 294 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/traps.c | 6 +
> 3 files changed, 323 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index b7598ac..780af63 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -3,6 +3,7 @@
>
> #include <linux/types.h>
> #include <asm/ptrace.h>
> +#include <asm/insn.h>
>
> #ifdef CONFIG_X86_64
>
> @@ -44,15 +45,37 @@
> #define MPX_BNDSTA_ERROR_CODE 0x3
> #define MPX_BD_ENTRY_VALID_FLAG 0x1
>
> +struct mpx_insn {
> + struct insn_field rex_prefix; /* REX prefix */
> + struct insn_field modrm;
> + struct insn_field sib;
> + struct insn_field displacement;
> +
> + unsigned char addr_bytes; /* effective address size */
> + unsigned char limit;
> + unsigned char x86_64;
> +
> + const unsigned char *kaddr; /* kernel address of insn to analyze */
> + const unsigned char *next_byte;
> +};
> +
> +#define MAX_MPX_INSN_SIZE 15
> +
> unsigned long mpx_mmap(unsigned long len);
>
> #ifdef CONFIG_X86_INTEL_MPX
> int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
> +void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
> + struct xsave_struct *xsave_buf);
> #else
> static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
> {
> return -EINVAL;
> }
> +static inline void do_mpx_bounds(struct pt_regs *regs, siginfo_t *info,
> + struct xsave_struct *xsave_buf)
> +{
> +}
> #endif /* CONFIG_X86_INTEL_MPX */
>
> #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
> index 4230c7b..650b282 100644
> --- a/arch/x86/kernel/mpx.c
> +++ b/arch/x86/kernel/mpx.c
> @@ -2,6 +2,270 @@
> #include <linux/syscalls.h>
> #include <asm/mpx.h>
>
> +typedef enum {REG_TYPE_RM, REG_TYPE_INDEX, REG_TYPE_BASE} reg_type_t;
> +static unsigned long get_reg(struct mpx_insn *insn, struct pt_regs *regs,
> + reg_type_t type)
> +{
> + int regno = 0;
> + unsigned char modrm = (unsigned char)insn->modrm.value;
> + unsigned char sib = (unsigned char)insn->sib.value;
> +
> + static const int regoff[] = {
> + offsetof(struct pt_regs, ax),
> + offsetof(struct pt_regs, cx),
> + offsetof(struct pt_regs, dx),
> + offsetof(struct pt_regs, bx),
> + offsetof(struct pt_regs, sp),
> + offsetof(struct pt_regs, bp),
> + offsetof(struct pt_regs, si),
> + offsetof(struct pt_regs, di),
> +#ifdef CONFIG_X86_64
> + offsetof(struct pt_regs, r8),
> + offsetof(struct pt_regs, r9),
> + offsetof(struct pt_regs, r10),
> + offsetof(struct pt_regs, r11),
> + offsetof(struct pt_regs, r12),
> + offsetof(struct pt_regs, r13),
> + offsetof(struct pt_regs, r14),
> + offsetof(struct pt_regs, r15),
> +#endif
> + };
> +
> + switch (type) {
> + case REG_TYPE_RM:
> + regno = X86_MODRM_RM(modrm);
> + if (X86_REX_B(insn->rex_prefix.value) == 1)
> + regno += 8;
> + break;
> +
> + case REG_TYPE_INDEX:
> + regno = X86_SIB_INDEX(sib);
> + if (X86_REX_X(insn->rex_prefix.value) == 1)
> + regno += 8;
> + break;
> +
> + case REG_TYPE_BASE:
> + regno = X86_SIB_BASE(sib);
> + if (X86_REX_B(insn->rex_prefix.value) == 1)
> + regno += 8;
> + break;
> +
> + default:
> + break;
> + }
> +
> + return regs_get_register(regs, regoff[regno]);
> +}
> +
> +/*
> + * Return the address being referenced by the instruction:
> + * for rm == 3, return the content of the rm reg;
> + * for rm != 3, calculate the address using SIB and Disp.
> + */
> +static unsigned long get_addr_ref(struct mpx_insn *insn, struct pt_regs *regs)
> +{
> + unsigned long addr;
> + unsigned long base;
> + unsigned long indx;
> + unsigned char modrm = (unsigned char)insn->modrm.value;
> + unsigned char sib = (unsigned char)insn->sib.value;
> +
> + if (X86_MODRM_MOD(modrm) == 3) {
> + addr = get_reg(insn, regs, REG_TYPE_RM);
> + } else {
> + if (insn->sib.nbytes) {
> + base = get_reg(insn, regs, REG_TYPE_BASE);
> + indx = get_reg(insn, regs, REG_TYPE_INDEX);
> + addr = base + indx * (1 << X86_SIB_SCALE(sib));
> + } else {
> + addr = get_reg(insn, regs, REG_TYPE_RM);
> + }
> + addr += insn->displacement.value;
> + }
> +
> + return addr;
> +}
> +
> +/* Verify next sizeof(t) bytes can be on the same instruction */
> +#define validate_next(t, insn, n) \
> + ((insn)->next_byte + sizeof(t) + n - (insn)->kaddr <= (insn)->limit)
> +
> +#define __get_next(t, insn) \
> +({ \
> + t r = *(t *)insn->next_byte; \
> + insn->next_byte += sizeof(t); \
> + r; \
> +})
> +
> +#define __peek_next(t, insn) \
> +({ \
> + t r = *(t *)insn->next_byte; \
> + r; \
> +})
> +
> +#define get_next(t, insn) \
> +({ \
> + if (unlikely(!validate_next(t, insn, 0))) \
> + goto err_out; \
> + __get_next(t, insn); \
> +})
> +
> +#define peek_next(t, insn) \
> +({ \
> + if (unlikely(!validate_next(t, insn, 0))) \
> + goto err_out; \
> + __peek_next(t, insn); \
> +})
> +
> +static void mpx_insn_get_prefixes(struct mpx_insn *insn)
> +{
> + unsigned char b;
> +
> + /* Decode legacy prefix and REX prefix */
> + b = peek_next(unsigned char, insn);
> + while (b != 0x0f) {
> + /*
> + * Look for a REX prefix. A REX prefix cannot be
> + * followed by a legacy prefix.
> + */
> + if (insn->x86_64 && ((b&0xf0) == 0x40)) {
> + insn->rex_prefix.value = b;
> + insn->rex_prefix.nbytes = 1;
> + insn->next_byte++;
> + break;
> + }
> +
> + /* check the other legacy prefixes */
> + switch (b) {
> + case 0xf2:
> + case 0xf3:
> + case 0xf0:
> + case 0x64:
> + case 0x65:
> + case 0x2e:
> + case 0x3e:
> + case 0x26:
> + case 0x36:
> + case 0x66:
> + case 0x67:
> + insn->next_byte++;
> + break;
> + default: /* everything else is garbage */
> + goto err_out;
> + }
> + b = peek_next(unsigned char, insn);
> + }
> +
> +err_out:
> + return;
> +}
> +
> +static void mpx_insn_get_modrm(struct mpx_insn *insn)
> +{
> + insn->modrm.value = get_next(unsigned char, insn);
> + insn->modrm.nbytes = 1;
> +
> +err_out:
> + return;
> +}
> +
> +static void mpx_insn_get_sib(struct mpx_insn *insn)
> +{
> + unsigned char modrm = (unsigned char)insn->modrm.value;
> +
> + if (X86_MODRM_MOD(modrm) != 3 && X86_MODRM_RM(modrm) == 4) {
> + insn->sib.value = get_next(unsigned char, insn);
> + insn->sib.nbytes = 1;
> + }
> +
> +err_out:
> + return;
> +}
> +
> +static void mpx_insn_get_displacement(struct mpx_insn *insn)
> +{
> + unsigned char mod, rm, base;
> +
> + /*
> + * Interpreting the modrm byte:
> + * mod = 00 - no displacement fields (exceptions below)
> + * mod = 01 - 1-byte displacement field
> + * mod = 10 - displacement field is 4 bytes
> + * mod = 11 - no memory operand
> + *
> + * mod != 11, r/m = 100 - SIB byte exists
> + * mod = 00, SIB base = 101 - displacement field is 4 bytes
> + * mod = 00, r/m = 101 - rip-relative addressing, displacement
> + * field is 4 bytes
> + */
> + mod = X86_MODRM_MOD(insn->modrm.value);
> + rm = X86_MODRM_RM(insn->modrm.value);
> + base = X86_SIB_BASE(insn->sib.value);
> + if (mod == 3)
> + return;
> + if (mod == 1) {
> + insn->displacement.value = get_next(unsigned char, insn);
> + insn->displacement.nbytes = 1;
> + } else if ((mod == 0 && rm == 5) || mod == 2 ||
> + (mod == 0 && base == 5)) {
> + insn->displacement.value = get_next(int, insn);
> + insn->displacement.nbytes = 4;
> + }
> +
> +err_out:
> + return;
> +}
> +
> +static void mpx_insn_init(struct mpx_insn *insn, struct pt_regs *regs)
> +{
> + unsigned char buf[MAX_MPX_INSN_SIZE];
> + int bytes;
> +
> + memset(insn, 0, sizeof(*insn));
> +
> + bytes = copy_from_user(buf, (void __user *)regs->ip, MAX_MPX_INSN_SIZE);
> + insn->limit = MAX_MPX_INSN_SIZE - bytes;
> + insn->kaddr = buf;
> + insn->next_byte = buf;
> +
> + /*
> + * In 64-bit Mode, all Intel MPX instructions use 64-bit
> + * operands for bounds and 64 bit addressing, i.e. REX.W &
> + * 67H have no effect on data or address size.
> + *
> + * In compatibility and legacy modes (including 16-bit code
> + * segments, real and virtual 8086 modes) all Intel MPX
> + * instructions use 32-bit operands for bounds and 32 bit
> + * addressing.
> + */
> +#ifdef CONFIG_X86_64
> + insn->x86_64 = 1;
> + insn->addr_bytes = 8;
> +#else
> + insn->x86_64 = 0;
> + insn->addr_bytes = 4;
> +#endif
> +}
> +
> +static unsigned long mpx_insn_decode(struct mpx_insn *insn,
> + struct pt_regs *regs)
> +{
> + mpx_insn_init(insn, regs);
> +
> + /*
> + * In this case, we only need to decode bndcl/bndcn/bndcu,
> + * so we can use private disassembly interfaces to get the
> + * prefixes, modrm, sib, displacement, etc.
> + */
> + mpx_insn_get_prefixes(insn);
> + insn->next_byte += 2; /* ignore opcode */
> + mpx_insn_get_modrm(insn);
> + mpx_insn_get_sib(insn);
> + mpx_insn_get_displacement(insn);
> +
> + return get_addr_ref(insn, regs);
> +}
> +
This whole insn decoding machinery above looks like it was adapted from
arch/x86/lib/insn.c. You should merge it with the generic code in insn.c
instead of homegrowing it here only for the purposes of MPX. And if it
doesn't work for your needs, you should extend the generic code to do
so. I think we even talked about this last time.
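A rough sketch of what that could look like on top of the generic
decoder (mpx_decode_generic() is a made-up name; insn_init(),
insn_get_length() and insn_complete() are the existing
arch/x86/lib/insn.c interfaces):

        #include <asm/insn.h>

        /* sketch only, not a drop-in replacement */
        static int mpx_decode_generic(struct insn *insn, const void *buf)
        {
                insn_init(insn, buf, 1);        /* 1 == x86_64 */
                insn_get_length(insn);          /* prefixes .. immediates */
                return insn_complete(insn) ? 0 : -EINVAL;
        }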
Also, make sure you run all your patches through checkpatch.pl before
submitting.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 06/18/2014 02:57 AM, Borislav Petkov wrote:
>> > @@ -339,6 +339,12 @@ extern const char * const x86_power_flags[32];
>> > #define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
>> > #define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
>> >
>> > +#ifdef CONFIG_X86_INTEL_MPX
>> > +#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
> I think we don't want those macros anymore because they're obfuscating
> the code. You should use static_cpu_has instead.
It looks like static_cpu_has() is the right thing to use instead of
boot_cpu_has(). But, this doesn't just obfuscate things.
We actually _want_ the compiler to cull code out when the config option
is off. Things like do_bounds() will see code savings with _some_ kind
of #ifdef rather than using static_cpu_has().
So, we can either use the well worn, consistent with other features in
x86, cpu_has_$foo approach. Or, we can roll our own macros.
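E.g., a sketch of the rolled-our-own variant (the #else stub is what
lets gcc cull the code):

        #ifdef CONFIG_X86_INTEL_MPX
        #define cpu_has_mpx     static_cpu_has(X86_FEATURE_MPX)
        #else
        #define cpu_has_mpx     0
        #endif

With the config option off, any 'if (cpu_has_mpx)' block becomes
statically dead and gets thrown away.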
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> This patchset adds support for the Memory Protection Extensions
> (MPX) feature found in future Intel processors.
It's very important to note that this is a very different patch set than
the last one. The way we are freeing the unused bounds tables is
_completely different (9/10), and needs some very heavy mm reviews. I'm
sure Qiaowei will cc linux-mm@ next time.
We're also not asking that this be merged in its current state. The
32-bit binary on 64-bit kernel issue is a show stopper for merging, but
we're trying to post early and often.
On Wed, Jun 18, 2014 at 07:35:17AM -0700, Dave Hansen wrote:
> On 06/18/2014 02:57 AM, Borislav Petkov wrote:
> >> > @@ -339,6 +339,12 @@ extern const char * const x86_power_flags[32];
> >> > #define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
> >> > #define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
> >> >
> >> > +#ifdef CONFIG_X86_INTEL_MPX
> >> > +#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
> > I think we don't want those macros anymore because they're obfuscating
> > the code. You should use static_cpu_has instead.
>
> It looks like static_cpu_has() is the right thing to use instead of
> boot_cpu_has(). But, this doesn't just obfuscate things.
>
> We actually _want_ the compiler to cull code out when the config option
> is off. Things like do_bounds() will see code savings with _some_ kind
> of #ifdef rather than using static_cpu_has().
Why?
Practically, distros will have it enabled anyway (you have X86_INTEL_MPX
depend on CPU_SUP_INTEL).
Are you talking about the minuscule percentage of people building their
own kernels?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 06/18/2014 07:44 AM, Borislav Petkov wrote:
> On Wed, Jun 18, 2014 at 07:35:17AM -0700, Dave Hansen wrote:
>> On 06/18/2014 02:57 AM, Borislav Petkov wrote:
>>>>> @@ -339,6 +339,12 @@ extern const char * const x86_power_flags[32];
>>>>> #define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
>>>>> #define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
>>>>>
>>>>> +#ifdef CONFIG_X86_INTEL_MPX
>>>>> +#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
>>> I think we don't want those macros anymore because they're obfuscating
>>> the code. You should use static_cpu_has instead.
>>
>> It looks like static_cpu_has() is the right thing to use instead of
>> boot_cpu_has(). But, this doesn't just obfuscate things.
>>
>> We actually _want_ the compiler to cull code out when the config option
>> is off. Things like do_bounds() will see code savings with _some_ kind
>> of #ifdef rather than using static_cpu_has().
>
> Why?
Are you seriously asking why we would want to cull out code guaranteed
to be unused?
People are going to want to turn this off at compile time. When they
do, I want as much of the code to go away as is reasonably possible.
Adding a single-line #ifdef in a header qualifies as a pretty decent
tradeoff in my book.
I'm sure Qiaowei will do an experiment and show us what the code savings
are.
> Practically, distros will have it enabled anyway (you have X86_INTEL_MPX
> depend on CPU_SUP_INTEL).
>
> Are you talking about the minuscule percentage of people building their
> own kernels?
The minuscule number of people not using a distro kernel? Like, every
Android and Chrome device in the world? How about the cloud providers
with millions of servers?
On 06/18/2014 07:44 AM, Borislav Petkov wrote:
>
> Why?
>
> Practically, distros will have it enabled anyway (you have X86_INTEL_MPX
> depend on CPU_SUP_INTEL).
>
> Are you talking about the minuscule percentage of people building their
> own kernels?
>
We have people working hard to shave kilobytes off the smallest
configurations, because they want to fit into a megabyte of memory with
the kernel and application.
Let's try to not make their life harder than it needs to be.
-hpa
On 06/18/2014 07:35 AM, Dave Hansen wrote:
>
> It looks like static_cpu_has() is the right thing to use instead of
> boot_cpu_has(). But, this doesn't just obfuscate things.
>
> We actually _want_ the compiler to cull code out when the config option
> is off. Things like do_bounds() will see code savings with _some_ kind
> of #ifdef rather than using static_cpu_has().
>
> So, we can either use the well worn, consistent with other features in
> x86, cpu_has_$foo approach. Or, we can roll our own macros.
>
We could do something like:
#define MPX_ENABLED (IS_ENABLED(CONFIG_X86_MPX) && \
        static_cpu_has(X86_FEATURE_MPX))
-hpa
On Wed, Jun 18, 2014 at 07:58:21AM -0700, Dave Hansen wrote:
> Are you seriously asking why we would want to cull out code guaranteed
> to be unused?
Yes, I am. Don't get me wrong - I'm all for not enabling code which is
unused - I'm just questioning the actual usage case here.
Because you have this:
+config X86_INTEL_MPX
+ def_bool y
+ depends on CPU_SUP_INTEL
which means, even on !MPX Intels, you need to explicitly say "no" here.
Is this how the configuration is supposed to be done?
Or do you need to add help text to explain to people not to enable this
on !MPX machines after looking at /proc/cpuinfo first?
Am I close?
> The minuscule number of people not using a distro kernel? Like, every
> Android and Chrome device in the world? How about the cloud providers
> with millions of servers?
Cloud people won't be benefiting from MPX?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Wed, Jun 18, 2014 at 08:00:29AM -0700, H. Peter Anvin wrote:
> We have people working hard to shave kilobytes off the smallest
> configurations, because they want to fit into a megabyte of memory
> with the kernel and application.
>
> Let's try to not make their life harder than it needs to be.
Oh sure - I'm simply asking how this is going to be handled practically
by asking stupid questions. You know me :)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 06/18/2014 08:25 AM, Borislav Petkov wrote:
> +config X86_INTEL_MPX
> + def_bool y
> + depends on CPU_SUP_INTEL
>
> which means, even on !MPX Intels, you need to explicitly say "no" here.
> Is this how the configuration is supposed to be done?
I think we need to promote this to a real option with real help text.
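Maybe something along these lines (the help text is just a placeholder):

        config X86_INTEL_MPX
                prompt "Intel MPX (Memory Protection Extensions)"
                def_bool y
                depends on CPU_SUP_INTEL
                ---help---
                  Say Y here to enable kernel support for Intel Memory
                  Protection Extensions. If unsure, say Y.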
On 06/18/2014 07:59 AM, H. Peter Anvin wrote:
> On 06/18/2014 07:35 AM, Dave Hansen wrote:
>> It looks like static_cpu_has() is the right thing to use instead of
>> boot_cpu_has(). But, this doesn't just obfuscate things.
>>
>> We actually _want_ the compiler to cull code out when the config option
>> is off. Things like do_bounds() will see code savings with _some_ kind
>> of #ifdef rather than using static_cpu_has().
>>
>> So, we can either use the well worn, consistent with other features in
>> x86, cpu_has_$foo approach. Or, we can roll our own macros.
>
> We could do something like:
>
> #define MPX_ENABLED (IS_ENABLED(CONFIG_X86_MPX) && \
>         static_cpu_has(X86_FEATURE_MPX))
How about something like the attached patch?
This lets us use static_cpu_has() for the checks, and allows us to
easily add new checks for other features that might be compile-time
disabled.
On Wed, Jun 18, 2014 at 09:25:57AM -0700, Dave Hansen wrote:
> On 06/18/2014 07:59 AM, H. Peter Anvin wrote:
> > On 06/18/2014 07:35 AM, Dave Hansen wrote:
> >> It looks like static_cpu_has() is the right thing to use instead of
> >> boot_cpu_has(). But, this doesn't just obfuscate things.
> >>
> >> We actually _want_ the compiler to cull code out when the config option
> >> is off. Things like do_bounds() will see code savings with _some_ kind
> >> of #ifdef rather than using static_cpu_has().
> >>
> >> So, we can either use the well worn, consistent with other features in
> >> x86, cpu_has_$foo approach. Or, we can roll our own macros.
> >
> > We could do something like:
> >
> > #define MPX_ENABLED (IS_ENABLED(CONFIG_X86_MPX) && \
> >         static_cpu_has(X86_FEATURE_MPX))
>
> How about something like the attached patch?
>
> This lets us use static_cpu_has() for the checks, and allows us to
> easily add new checks for other features that might be compile-time
> disabled.
>
>
> ---
>
> b/arch/x86/include/asm/cpufeature.h | 26 ++++++++++++++++++++------
> b/arch/x86/kernel/mpx.c | 4 ++--
> b/arch/x86/kernel/traps.c | 2 +-
> 3 files changed, 23 insertions(+), 9 deletions(-)
>
> diff -puN arch/x86/include/asm/cpufeature.h~x86-disabled_mask arch/x86/include/asm/cpufeature.h
> --- a/arch/x86/include/asm/cpufeature.h~x86-disabled_mask 2014-06-18 08:48:41.329750895 -0700
> +++ b/arch/x86/include/asm/cpufeature.h 2014-06-18 09:19:19.143546973 -0700
> @@ -339,12 +339,6 @@ extern const char * const x86_power_flag
> #define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU)
> #define cpu_has_topoext boot_cpu_has(X86_FEATURE_TOPOEXT)
>
> -#ifdef CONFIG_X86_INTEL_MPX
> -#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX)
> -#else
> -#define cpu_has_mpx 0
> -#endif /* CONFIG_X86_INTEL_MPX */
> -
> #ifdef CONFIG_X86_64
>
> #undef cpu_has_vme
> @@ -367,6 +361,22 @@ extern const char * const x86_power_flag
>
> #endif /* CONFIG_X86_64 */
>
> +/*
> + * Add features and their corresponding config options here
> + * if you want to have the compiler optimize out code that
> + * uses them.
> + *
> + * You should not use this function directly. Use
> + * static_cpu_has() so that you also benefit from alternatives
> + * when the features are enabled.
> + */
> +static __always_inline int __cpu_feature_compile_enabled(u16 bit)
> +{
> + if (bit == X86_FEATURE_MPX)
> + return IS_ENABLED(CONFIG_X86_MPX);
Right, this should be CONFIG_X86_INTEL_MPX but that's details.
I guess this might grow into a big if-else noodle case but when that
happens, we probably could add some sort of mapping between X86_FEATURE
bits and CONFIG_ items or so.
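Something like this, say (sketch, name made up):

        static __always_inline int cpu_feature_config_enabled(u16 bit)
        {
                switch (bit) {
                case X86_FEATURE_MPX:
                        return IS_ENABLED(CONFIG_X86_INTEL_MPX);
                default:
                        return 1;       /* not gated by a config option */
                }
        }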
Then, you could probably add this call to the enclosing macros
static_cpu_has{,_safe} so that it gets evaluated first. The advantage is
that the IS_ENABLED test would work very early - always, actually - even
before alternatives have run and thus catch the cases where the feature
is config-disabled.
Otherwise, asm looks ok.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 2014-06-18, Borislav Petkov wrote:
> On Wed, Jun 18, 2014 at 05:44:13PM +0800, Qiaowei Ren wrote:
>
> This whole insn decoding machinery above looks like adapted from
> arch/x86/lib/insn.c. You should merge it with the generic code in
> insn.c instead of homegrowing it here only for the purposes of MPX.
> And if it doesn't work for your needs, you should extend the
> generic code to do so. I think we even talked about this last time.
>
> Also, make sure you run all your patches through checkpatch.pl before
> submitting.
>
Petkov, as we discussed on the initial version of this patchset, the generic insn framework didn't work out well, and I have tried to use the generic struct insn, insn_field, etc. where they bring obvious benefits.
Thanks,
Qiaowei
On Thu, Jun 19, 2014 at 01:13:48AM +0000, Ren, Qiaowei wrote:
> On 2014-06-18, Borislav Petkov wrote:
> > On Wed, Jun 18, 2014 at 05:44:13PM +0800, Qiaowei Ren wrote:
> >
> > This whole insn decoding machinery above looks like adapted from
> > arch/x86/lib/insn.c. You should merge it with the generic code in
> > insn.c instead of homegrowing it here only for the purposes of MPX.
> > And if it doesn't work for your needs, you should extend the
> > generic code to do so.
> Petkov, as we discussed on initial version of this patchset, general
> insn framework didn't work out well and I have tried to use generic
> struct insn, insn_field, etc. for obvious benefits.
Let me repeat myself: "And if it doesn't work for your needs, you should
extend the generic code to do so."
We don't do homegrown almost-copies of generic code.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 2014-06-19, Borislav Petkov wrote:
> On Thu, Jun 19, 2014 at 01:13:48AM +0000, Ren, Qiaowei wrote:
>> On 2014-06-18, Borislav Petkov wrote:
>>> On Wed, Jun 18, 2014 at 05:44:13PM +0800, Qiaowei Ren wrote:
>>>
>>> This whole insn decoding machinery above looks like adapted from
>>> arch/x86/lib/insn.c. You should merge it with the generic code in
>>> insn.c instead of homegrowing it here only for the purposes of MPX.
>>> And if it doesn't work for your needs, you should extend
>>> the generic code to do so.
>
>> Petkov, as we discussed on initial version of this patchset, general
>> insn framework didn't work out well and I have tried to use generic
>> struct insn, insn_field, etc. for obvious benefits.
>
> Let me repeat myself: "And if it doesn't work for your needs, you
> should extend the generic code to do so."
>
> We don't do homegrown almost-copies of generic code.
>
I see. If possible, I will be very happy to use or extend the generic code. But due to the extra overhead the generic path would add to MPX, I had to use MPX-specific decoding as a performance optimization.
You can check the discussion on this: https://lkml.org/lkml/2014/1/11/190
Thanks,
Qiaowei
On 06/18/2014 11:53 PM, Ren, Qiaowei wrote:
> On 2014-06-19, Borislav Petkov wrote:
>> On Thu, Jun 19, 2014 at 01:13:48AM +0000, Ren, Qiaowei wrote:
>>> On 2014-06-18, Borislav Petkov wrote:
>>>> On Wed, Jun 18, 2014 at 05:44:13PM +0800, Qiaowei Ren wrote:
>>>>
>>>> This whole insn decoding machinery above looks like adapted from
>>>> arch/x86/lib/insn.c. You should merge it with the generic code in
>>>> insn.c instead of homegrowing it here only for the purposes of MPX.
>>>> And if it doesn't work for your needs, you should extend
>>>> the generic code to do so.
>>
>>> Petkov, as we discussed on initial version of this patchset, general
>>> insn framework didn't work out well and I have tried to use generic
>>> struct insn, insn_field, etc. for obvious benefits.
>>
>> Let me repeat myself: "And if it doesn't work for your needs, you
>> should extend the generic code to do so."
>>
>> We don't do homegrown almost-copies of generic code.
>>
> I see. If possible, I will be very happy to use or extend generic
> code. But due to extra overhead caused by MPX, I have to use MPX
> specific decoding to do performance optimization.
Could you please support this position with some data? I'm a bit
skeptical that instruction decoding is going to be a
performance-critical path.
I also don't see the extra field that you talked about in the previous
thread? What's the extra field? I see a 'limit' vs. 'length', but you
don't use 'length' at all, so I think you can use it instead, or at
least union it.
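I.e., a sketch of the union idea (placement inside struct insn is
illustrative only):

        struct insn {
                /* ... existing fields ... */
                union {
                        unsigned char length;   /* generic decode path */
                        unsigned char limit;    /* MPX path: bytes available */
                };
        };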
I've taken a quick stab at trying to consolidate things. I think I may
have screwed up this:
insn->limit = MAX_MPX_INSN_SIZE - bytes;
Qiaowei, is there anything fundamentally broken with what I've got here?
On 06/19/2014 10:04 AM, Dave Hansen wrote:
>
> Could you please support this position with some data? I'm a bit
> skeptical that instruction decoding is going to be a
> performance-critical path.
>
> I also don't see the extra field that you talked about in the previous
> thread? What's the extra field? I see a 'limit' vs. 'length', but you
> don't use 'length' at all, so I think you can use it instead, or at
> least union it.
>
> I've taken a quick stab at trying to consolidate things. I think I may
> have screwed up this:
>
> insn->limit = MAX_MPX_INSN_SIZE - bytes;
>
> Qiaowei, is there anything fundamentally broken with what I've got here?
>
So I encouraged Qiaowei to do a limited special-purpose decoder, simply
because the glue to use the generic decoder was almost as large. I am
overall not a huge fan of using the generic decoder in constrained
situations, because the generic decoder is very heavyweight not just in
terms of performance but in terms of interface -- because it has to be.
-hpa
On 06/18/2014 09:25 AM, Dave Hansen wrote:
> On 06/18/2014 07:59 AM, H. Peter Anvin wrote:
>> On 06/18/2014 07:35 AM, Dave Hansen wrote:
>>> It looks like static_cpu_has() is the right thing to use instead of
>>> boot_cpu_has(). But, this doesn't just obfuscate things.
>>>
>>> We actually _want_ the compiler to cull code out when the config option
>>> is off. Things like do_bounds() will see code savings with _some_ kind
>>> of #ifdef rather than using static_cpu_has().
>>>
>>> So, we can either use the well worn, consistent with other features in
>>> x86, cpu_has_$foo approach. Or, we can roll our own macros.
>>
>> We could do something like:
>>
>> #define MPX_ENABLED (IS_ENABLED(CONFIG_X86_MPX) && \
>>         static_cpu_has(X86_FEATURE_MPX))
>
> How about something like the attached patch?
>
> This lets us use static_cpu_has() for the checks, and allows us to
> easily add new checks for other features that might be compile-time
> disabled.
>
Hmm... I would like something similar to required-features.h, which
reflects features which *cannot* be enabled or will always be ignored; we
actually already have a handful of those.
-hpa
On 06/19/2014 11:02 AM, H. Peter Anvin wrote:
> On 06/18/2014 09:25 AM, Dave Hansen wrote:
>> How about something like the attached patch?
>>
>> This lets us use static_cpu_has() for the checks, and allows us to
>> easily add new checks for other features that might be compile-time
>> disabled.
>
> Hmm... I would like something similar to required-features.h, which
> reflects features which *cannot* be enabled or will always be ignored; we
> actually already have a handful of those.
Could you elaborate a bit? I'll try and include them in the approach to
make sure it works broadly.
Is there a benefit to the required-features.h approach that's missing
from mine? I _believe_ all of the compiler optimization around
__builtin_constant_p() continues to work with the inline function
instead of the #defines and bitmasks. I think the inline function
approach is a bit easier to work with.
Could the required-features.h approach just be from a time before
__builtin_constant_p() worked well across inlines?
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> +static __user void *task_get_bounds_dir(struct task_struct *tsk)
> +{
> + struct xsave_struct *xsave_buf;
> +
> + fpu_xsave(&tsk->thread.fpu);
> + xsave_buf = &(tsk->thread.fpu.state->xsave);
> + if (!(xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ENABLE_FLAG))
> + return NULL;
> +
> + return (void __user *)(xsave_buf->bndcsr.cfg_reg_u &
> + MPX_BNDCFG_ADDR_MASK);
> +}
This spits out a warning on 32-bit:
arch/x86/kernel/mpx.c: In function 'task_get_bounds_dir':
arch/x86/kernel/mpx.c:21:9: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
On 2014-06-20, H. Peter Anvin wrote:
> On 06/19/2014 10:04 AM, Dave Hansen wrote:
>>
>> Could you please support this position with some data? I'm a bit
>> skeptical that instruction decoding is going to be a
>> performance-critical path.
>>
>> I also don't see the extra field that you talked about in the
>> previous thread? What's the extra field? I see a 'limit' vs.
>> 'length', but you don't use 'length' at all, so I think you can use
>> it instead, or at least union it.
>>
>> I've taken a quick stab at trying to consolidate things. I think I
>> may have screwed up this:
>>
>> insn->limit = MAX_MPX_INSN_SIZE - bytes;
>>
>> Qiaowei, is there anything fundamentally broken with what I've got here?
>>
>
First, the instruction bytes are fetched from the user pointer stored in 'ip', and then validate_next() uses 'limit' to make sure that the next sizeof(t) bytes still belong to the same instruction.
As hpa said, the generic decoder, including struct insn and the decoding implementation, is very heavyweight because it has to be. So MPX-specific decoding should be the better choice.
> So I encouraged Qiaowei to do a limited special-purpose decoder,
> simply because the glue to use the generic decoder was almost as
> large. I am overall not a huge fan of using the generic decoder in
> constrained situations, because the generic decoder is very heavyweight
> not just in terms of performance but in terms of interface -- because it has to be.
>
Thanks,
Qiaowei
On 06/19/2014 11:50 AM, Dave Hansen wrote:
> On 06/19/2014 11:02 AM, H. Peter Anvin wrote:
>> On 06/18/2014 09:25 AM, Dave Hansen wrote:
>>> How about something like the attached patch?
>>>
>>> This lets us use static_cpu_has() for the checks, and allows us to
>>> easily add new checks for other features that might be compile-time
>>> disabled.
>>
>> Hmm... I would like something similar to required-features.h which
>> reflect features which *cannot* be enabled or will always be ignored; we
>> actually already have a handful of those
>
> Could you elaborate a bit? I'll try and include them in the approach to
> make sure it works broadly.
>
> Is there a benefit to the required-features.h approach that's missing
> from mine? I _believe_ all of the compiler optimization around
> __builtin_constant_p() continues to work with the inline function
> instead of the #defines and bitmasks. I think the inline function
> approach is a bit easier to work with.
>
> Could the required-features.h approach just be from a time before
> __builtin_constant_p() worked well across inlines?
>
Not so much. What I don't want is one approach for doing things in one
direction and another approach for the other direction.
-hpa
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> This patch adds one MPX specific mmap interface, which only handles
> mpx related maps, including bounds table and bounds directory.
>
> In order to track MPX specific memory usage, this interface is added
> to stick the new vm_flag VM_MPX in the vm_area_struct when creating a
> bounds table or bounds directory.
I imagine the linux-mm people would want to think about any new vm flag.
Why is this needed?
>
> Signed-off-by: Qiaowei Ren <[email protected]>
> ---
> arch/x86/Kconfig | 4 +++
> arch/x86/include/asm/mpx.h | 38 ++++++++++++++++++++++++++++
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/mpx.c | 58 ++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 102 insertions(+), 0 deletions(-)
> create mode 100644 arch/x86/include/asm/mpx.h
> create mode 100644 arch/x86/mm/mpx.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 25d2c6f..0194790 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -237,6 +237,10 @@ config HAVE_INTEL_TXT
> def_bool y
> depends on INTEL_IOMMU && ACPI
>
> +config X86_INTEL_MPX
> + def_bool y
> + depends on CPU_SUP_INTEL
> +
> config X86_32_SMP
> def_bool y
> depends on X86_32 && SMP
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> new file mode 100644
> index 0000000..5725ac4
> --- /dev/null
> +++ b/arch/x86/include/asm/mpx.h
> @@ -0,0 +1,38 @@
> +#ifndef _ASM_X86_MPX_H
> +#define _ASM_X86_MPX_H
> +
> +#include <linux/types.h>
> +#include <asm/ptrace.h>
> +
> +#ifdef CONFIG_X86_64
> +
> +/* upper 28 bits [47:20] of the virtual address in 64-bit used to
> + * index into bounds directory (BD).
> + */
> +#define MPX_BD_ENTRY_OFFSET 28
> +#define MPX_BD_ENTRY_SHIFT 3
> +/* bits [19:3] of the virtual address in 64-bit used to index into
> + * bounds table (BT).
> + */
> +#define MPX_BT_ENTRY_OFFSET 17
> +#define MPX_BT_ENTRY_SHIFT 5
> +#define MPX_IGN_BITS 3
> +
> +#else
> +
> +#define MPX_BD_ENTRY_OFFSET 20
> +#define MPX_BD_ENTRY_SHIFT 2
> +#define MPX_BT_ENTRY_OFFSET 10
> +#define MPX_BT_ENTRY_SHIFT 4
> +#define MPX_IGN_BITS 2
> +
> +#endif
> +
> +#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
> +#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
> +
> +#define MPX_BNDSTA_ERROR_CODE 0x3
> +
> +unsigned long mpx_mmap(unsigned long len);
> +
> +#endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 6a19ad9..ecfdc46 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -30,3 +30,5 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
> obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
>
> obj-$(CONFIG_MEMTEST) += memtest.o
> +
> +obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> new file mode 100644
> index 0000000..546c5d1
> --- /dev/null
> +++ b/arch/x86/mm/mpx.c
> @@ -0,0 +1,58 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <asm/mpx.h>
> +#include <asm/mman.h>
> +#include <linux/sched/sysctl.h>
> +
> +/*
> + * This is really a simplified "vm_mmap". It only handles MPX
> + * related maps, i.e. bounds tables and the bounds directory.
> + *
> + * Here we can stick the new vm_flag VM_MPX in the vm_area_struct
> + * when creating a bounds table or bounds directory, in order to
> + * track MPX-specific memory.
> + */
> +unsigned long mpx_mmap(unsigned long len)
> +{
> + unsigned long ret;
> + unsigned long addr, pgoff;
> + struct mm_struct *mm = current->mm;
> + vm_flags_t vm_flags;
> +
> + /* Only bounds table and bounds directory can be allocated here */
> + if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
> + return -EINVAL;
> +
> + down_write(&mm->mmap_sem);
> +
> + /* Too many mappings? */
> + if (mm->map_count > sysctl_max_map_count) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + /* Obtain the address to map to. we verify (or select) it and ensure
> + * that it represents a valid section of the address space.
> + */
> + addr = get_unmapped_area(NULL, 0, len, 0, MAP_ANONYMOUS | MAP_PRIVATE);
> + if (addr & ~PAGE_MASK) {
> + ret = addr;
> + goto out;
> + }
> +
> + vm_flags = VM_READ | VM_WRITE | VM_MPX |
> + mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +
> + /* Make bounds tables and bounds directory unlocked. */
> + if (vm_flags & VM_LOCKED)
> + vm_flags &= ~VM_LOCKED;
Why? I would expect MCL_FUTURE to lock these.
--Andy
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> This patch handles a #BR exception for non-existent tables by
> carving the space out of the normal process's address space
> (essentially calling mmap() from inside the kernel) and then
> pointing the bounds-directory over to it.
>
> The tables need to be accessed and controlled by userspace
> because the compiler generates instructions for MPX-enabled
> code which frequently store and retrieve entries from the bounds
> tables. Any direct kernel involvement (like a syscall) to access
> the tables would destroy performance since these are so frequent.
>
> The tables are carved out of userspace because we have no better
> spot to put them. For each pointer which is being tracked by MPX,
> the bounds tables contain 4 longs worth of data, and the tables
> are indexed virtually. If we were to preallocate the tables, we
> would theoretically need to allocate 4x the virtual space that
> we have available for userspace somewhere else. We don't have
> that room in the kernel address space.
>
> Signed-off-by: Qiaowei Ren <[email protected]>
> ---
> arch/x86/include/asm/mpx.h | 20 ++++++++++++++
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/mpx.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/traps.c | 56 ++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 139 insertions(+), 1 deletions(-)
> create mode 100644 arch/x86/kernel/mpx.c
>
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index 5725ac4..b7598ac 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -18,6 +18,8 @@
> #define MPX_BT_ENTRY_SHIFT 5
> #define MPX_IGN_BITS 3
>
> +#define MPX_BD_ENTRY_TAIL 3
> +
> #else
>
> #define MPX_BD_ENTRY_OFFSET 20
> @@ -26,13 +28,31 @@
> #define MPX_BT_ENTRY_SHIFT 4
> #define MPX_IGN_BITS 2
>
> +#define MPX_BD_ENTRY_TAIL 2
> +
> #endif
>
> +#define MPX_BNDSTA_TAIL 2
> +#define MPX_BNDCFG_TAIL 12
> +#define MPX_BNDSTA_ADDR_MASK (~((1UL<<MPX_BNDSTA_TAIL)-1))
> +#define MPX_BNDCFG_ADDR_MASK (~((1UL<<MPX_BNDCFG_TAIL)-1))
> +#define MPX_BT_ADDR_MASK (~((1UL<<MPX_BD_ENTRY_TAIL)-1))
> +
> #define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
> #define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
>
> #define MPX_BNDSTA_ERROR_CODE 0x3
> +#define MPX_BD_ENTRY_VALID_FLAG 0x1
>
> unsigned long mpx_mmap(unsigned long len);
>
> +#ifdef CONFIG_X86_INTEL_MPX
> +int do_mpx_bt_fault(struct xsave_struct *xsave_buf);
> +#else
> +static inline int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
> +{
> + return -EINVAL;
> +}
> +#endif /* CONFIG_X86_INTEL_MPX */
> +
> #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index f4d9600..3e81aed 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -41,6 +41,7 @@ obj-$(CONFIG_PREEMPT) += preempt.o
>
> obj-y += process.o
> obj-y += i387.o xsave.o
> +obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
> obj-y += ptrace.o
> obj-$(CONFIG_X86_32) += tls.o
> obj-$(CONFIG_IA32_EMULATION) += tls.o
> diff --git a/arch/x86/kernel/mpx.c b/arch/x86/kernel/mpx.c
> new file mode 100644
> index 0000000..4230c7b
> --- /dev/null
> +++ b/arch/x86/kernel/mpx.c
> @@ -0,0 +1,63 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <asm/mpx.h>
> +
> +static int allocate_bt(long __user *bd_entry)
> +{
> + unsigned long bt_addr, old_val = 0;
> + int ret = 0;
> +
> + bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
> + if (IS_ERR((void *)bt_addr)) {
> + pr_err("Bounds table allocation failed at entry addr %p\n",
> + bd_entry);
> + return bt_addr;
> + }
> + bt_addr = (bt_addr & MPX_BT_ADDR_MASK) | MPX_BD_ENTRY_VALID_FLAG;
> +
> + ret = user_atomic_cmpxchg_inatomic(&old_val, bd_entry, 0, bt_addr);
> + if (ret)
> + goto out;
> +
> + /*
> + * there is an existing bounds table pointed at this bounds
> + * directory entry, and so we need to free the bounds table
> + * allocated just now.
> + */
> + if (old_val)
> + goto out;
> +
> + pr_debug("Allocate bounds table %lx at entry %p\n",
> + bt_addr, bd_entry);
> + return 0;
> +
> +out:
> + vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
> + return ret;
> +}
> +
> +/*
> + * When a BNDSTX instruction attempts to save bounds to a BD entry
> + * whose valid bit is not set, a #BR is generated.
> + * This is an indication that no BT exists for this entry. In this
> + * case the fault handler will allocate a new BT.
> + *
> + * In 32-bit mode, the size of the BD is 4MB, and the size of each
> + * bounds table is 16KB. In 64-bit mode, the size of the BD is 2GB,
> + * and the size of each bounds table is 4MB.
> + */
> +int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
> +{
> + unsigned long status;
> + unsigned long bd_entry, bd_base;
> +
> + bd_base = xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ADDR_MASK;
> + status = xsave_buf->bndcsr.status_reg;
> +
> + bd_entry = status & MPX_BNDSTA_ADDR_MASK;
> + if ((bd_entry < bd_base) ||
> + (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
> + return -EINVAL;
> +
> + return allocate_bt((long __user *)bd_entry);
> +}
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index f73b5d4..35b9b29 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -59,6 +59,7 @@
> #include <asm/fixmap.h>
> #include <asm/mach_traps.h>
> #include <asm/alternative.h>
> +#include <asm/mpx.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
> @@ -213,7 +214,6 @@ dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
>
> DO_ERROR_INFO(X86_TRAP_DE, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip )
> DO_ERROR (X86_TRAP_OF, SIGSEGV, "overflow", overflow )
> -DO_ERROR (X86_TRAP_BR, SIGSEGV, "bounds", bounds )
> DO_ERROR_INFO(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->ip )
> DO_ERROR (X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun )
> DO_ERROR (X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS )
> @@ -263,6 +263,60 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> }
> #endif
>
> +dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> +{
> + enum ctx_state prev_state;
> + unsigned long status;
> + struct xsave_struct *xsave_buf;
> + struct task_struct *tsk = current;
> +
> + prev_state = exception_enter();
> + if (notify_die(DIE_TRAP, "bounds", regs, error_code,
> + X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
> + goto exit;
> + conditional_sti(regs);
> +
> + if (!user_mode(regs))
> + die("bounds", regs, error_code);
Does this need to be user_mode_vm?
> +
> + if (!cpu_has_mpx) {
> + /* The exception is not from Intel MPX */
> + do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
> + goto exit;
> + }
> +
> + fpu_xsave(&tsk->thread.fpu);
> + xsave_buf = &(tsk->thread.fpu.state->xsave);
> + status = xsave_buf->bndcsr.status_reg;
> +
> + /*
> + * The error code field of the BNDSTATUS register communicates status
> + * information of a bound range exception #BR or operation involving
> + * bound directory.
> + */
> + switch (status & MPX_BNDSTA_ERROR_CODE) {
> + case 2:
> + /*
> + * Bound directory has invalid entry.
> + * No signal will be sent to the user space.
This comment is a lie.
> + */
> + if (do_mpx_bt_fault(xsave_buf))
> + force_sig(SIGBUS, tsk);
Would it make sense to assign and use a new si_code value here?
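E.g. (purely a sketch; BUS_ADRERR here is just a placeholder choice):

        siginfo_t info;

        memset(&info, 0, sizeof(info));
        info.si_signo = SIGBUS;
        info.si_code  = BUS_ADRERR;     /* or a new MPX-specific value */
        info.si_addr  = (void __user *)(status & MPX_BNDSTA_ADDR_MASK);
        force_sig_info(SIGBUS, &info, tsk);

That way userspace could at least tell this apart from a garden-variety
SIGBUS.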
--Andy
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> When a user memory region is unmapped, the related bounds tables
> become unused and need to be released as well. This patch cleans up
> these unused bounds tables by hooking the unmap path.
This is not a full review at all -- I don't understand the mm stuff well
enough.
> +
> + /*
> + * If this bounds directory entry is nonzero, and meanwhile
> + * the valid bit is zero, one SIGSEGV will be produced due to
> + * this unexpected situation.
> + */
> + if (!(*valid) && *bt_addr)
> + force_sig(SIGSEGV, current);
force_sig from a syscall handler seems rather weird.
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> This patch adds the PR_MPX_REGISTER and PR_MPX_UNREGISTER prctl()
> commands. These commands can be used to register and unregister MPX
> related resource on the x86 platform.
>
> The base of the bounds directory is set into mm_struct during
> PR_MPX_REGISTER command execution. This member can be used to
> check whether one application is mpx enabled.
The register and unregister operations seem to be almost the same thing.
How about just PR_SYNC_MPX?
Also, would this make more sense in arch_prctl?
--Andy
On 06/23/2014 12:49 PM, Andy Lutomirski wrote:
> On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
>> This patch adds one MPX specific mmap interface, which only handles
>> mpx related maps, including bounds table and bounds directory.
>>
>> In order to track MPX specific memory usage, this interface is added
>> to stick the new vm_flag VM_MPX in the vm_area_struct when creating a
>> bounds table or bounds directory.
>
> I imagine the linux-mm people would want to think about any new vm flag.
> Why is this needed?
These tables can take huge amounts of memory. In the worst-case
scenario, the tables can be 4x the size of the data structure being
tracked. IOW, a 1-page structure can require 4 bounds-table pages.
My expectation is that folks using MPX are going to be keen on figuring
out how much memory is being dedicated to it. With this feature, plus
some grepping in /proc/$pid/smaps one could take a pretty good stab at it.
I know VM flags are scarce, and I'm open to other ways to skin this cat.
On Mon, Jun 23, 2014 at 1:03 PM, Dave Hansen <[email protected]> wrote:
> On 06/23/2014 12:49 PM, Andy Lutomirski wrote:
>> On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
>>> This patch adds one MPX specific mmap interface, which only handles
>>> mpx related maps, including bounds table and bounds directory.
>>>
>>> In order to track MPX specific memory usage, this interface is added
>>> to stick the new vm_flag VM_MPX in the vm_area_struct when creating a
>>> bounds table or bounds directory.
>>
>> I imagine the linux-mm people would want to think about any new vm flag.
>> Why is this needed?
>
> These tables can take huge amounts of memory. In the worst-case
> scenario, the tables can be 4x the size of the data structure being
> tracked. IOW, a 1-page structure can require 4 bounds-table pages.
>
> My expectation is that folks using MPX are going to be keen on figuring
> out how much memory is being dedicated to it. With this feature, plus
> some grepping in /proc/$pid/smaps one could take a pretty good stab at it.
>
> I know VM flags are scarce, and I'm open to other ways to skin this cat.
>
Can the new vm_operation "name" be used for this? The magic "always
written to core dumps" feature might need to be reconsidered.
There's also arch_vma_name, but I just finished removing for x86, and
I'd be a little sad to see it come right back.
--Andy
On 06/23/2014 01:00 PM, Andy Lutomirski wrote:
> On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
>> > This patch adds the PR_MPX_REGISTER and PR_MPX_UNREGISTER prctl()
>> > commands. These commands can be used to register and unregister MPX
>> > related resource on the x86 platform.
>> >
>> > The base of the bounds directory is set into mm_struct during
>> > PR_MPX_REGISTER command execution. This member can be used to
>> > check whether one application is mpx enabled.
> The register and unregister operations seem to be almost the same thing.
> How about just PR_SYNC_MPX?
That wouldn't support a usage model where userspace wanted to keep using
MPX, but wanted the kernel to butt out and not try to free unused bounds
tables. That's not super-important, but it does lose some level of
flexibility.
FWIW, I think it would also be handy to support a PR_MPX_DISABLE prctl
too. That way, a wrapper program could set a flag that any children
could notice if they try a PR_MPX_REGISTER. That way we could
software-disable MPX in cases in a process tree where it was not wanted.
On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
> Can the new vm_operation "name" be used for this? The magic "always
> written to core dumps" feature might need to be reconsidered.
One thing I'd like to avoid is an MPX vma getting merged with a non-MPX
vma. I don't see any code to prevent two VMAs with different
vm_ops->names from getting merged. That seems like a bit of a design
oversight for ->name. Right?
Thinking out loud a bit... There are also some more complicated but more
performant cleanup mechanisms that I'd like to go after in the future.
Given a page, we might want to figure out if it is an MPX page or not.
I wonder if we'll ever collide with some other user of vm_ops->name. It
looks fairly narrowly used at the moment, but would this keep us from
putting these pages on, say, a tmpfs mount? Doesn't look that way at
the moment.
On Mon, Jun 23, 2014 at 1:28 PM, Dave Hansen <[email protected]> wrote:
> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>> Can the new vm_operation "name" be used for this? The magic "always
>> written to core dumps" feature might need to be reconsidered.
>
> One thing I'd like to avoid is an MPX vma getting merged with a non-MPX
> vma. I don't see any code to prevent two VMAs with different
> vm_ops->names from getting merged. That seems like a bit of a design
> oversight for ->name. Right?
AFAIK there are no ->name users that don't also set ->close, for
exactly that reason. I'd be okay with adding a check for ->name, too.
Hmm. If MPX vmas had a real struct file attached, this would all come
for free. Maybe vmas with non-default vm_ops and file != NULL should
never be mergeable?
>
> Thinking out loud a bit... There are also some more complicated but more
> performant cleanup mechanisms that I'd like to go after in the future.
> Given a page, we might want to figure out if it is an MPX page or not.
> I wonder if we'll ever collide with some other user of vm_ops->name. It
> looks fairly narrowly used at the moment, but would this keep us from
> putting these pages on, say, a tmpfs mount? Doesn't look that way at
> the moment.
You could always check the vm_ops pointer to see if it's MPX.
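E.g. (sketch; mpx_vm_ops is hypothetical):

        static inline bool is_mpx_vma(struct vm_area_struct *vma)
        {
                return vma->vm_ops == &mpx_vm_ops;
        }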
One feature I've wanted: a way to have special per-process vmas that
can be easily found. For example, I want to be able to efficiently
find out where the vdso and vvar vmas are. I don't think this is
currently supported.
--Andy
On Jun 23, 2014 1:09 PM, "Dave Hansen" <[email protected]> wrote:
>
> On 06/23/2014 01:00 PM, Andy Lutomirski wrote:
> > On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> >> > This patch adds the PR_MPX_REGISTER and PR_MPX_UNREGISTER prctl()
> >> > commands. These commands can be used to register and unregister MPX
> >> > related resource on the x86 platform.
> >> >
> >> > The base of the bounds directory is set into mm_struct during
> >> > PR_MPX_REGISTER command execution. This member can be used to
> >> > check whether one application is mpx enabled.
> > The register and unregister operations seem to be almost the same thing.
> > How about just PR_SYNC_MPX?
>
> That wouldn't support a usage model where userspace wanted to keep using
> MPX, but wanted the kernel to butt out and not try to free unused bounds
> tables. That's not super-important, but it does lose some level of
> flexibility.
>
Hmm. How about PR_SET/GET_MPX_BOUNDS_TABLE, to update the kernel's
copy. No fpu magic needed.
This has an added benefit: CRIU will need updating for MPX, and
they'll appreciate having the required interface already exist.
(They'll want a way to allocate "MPX" memory, too, but that's probably
somewhat less important, and it won't result in duplicated
functionality.)
> FWIW, I think it would also be handy to support a PR_MPX_DISABLE prctl
> too. That way, a wrapper program could set a flag that any children
> could notice if they try a PR_MPX_REGISTER. That way we could
> software-disable MPX in cases in a process tree where it was not wanted.
Seccomp can do this :)
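E.g. from userspace with libseccomp (sketch; the PR_MPX_REGISTER value
isn't assigned yet, so it's a parameter here):

        #include <errno.h>
        #include <seccomp.h>

        static int deny_mpx_register(int pr_mpx_register)
        {
                scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
                int rc;

                if (!ctx)
                        return -ENOMEM;
                /* make prctl(PR_MPX_REGISTER, ...) fail with EPERM */
                rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                                      SCMP_SYS(prctl), 1,
                                      SCMP_A0(SCMP_CMP_EQ, pr_mpx_register));
                if (!rc)
                        rc = seccomp_load(ctx);
                seccomp_release(ctx);
                return rc;
        }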
--Andy
On 06/23/2014 03:00 PM, Andy Lutomirski wrote:
> Hmm. How about PR_SET/GET_MPX_BOUNDS_TABLE, to update the kernel's
> copy. No fpu magic needed.
>
> This has an added benefit: CRIU will need updating for MPX, and
> they'll appreciate having the required interface already exist.
> (They'll want a way to allocate "MPX" memory, too, but that's probably
> somewhat less important, and it won't result in duplicated
> functionality.)
I like the idea of the most minimal interface possible. If the kernel
ever needed or wanted to cache more of the register setup, we wouldn't
need to change the interface. For CRIU, I don't know much about the
phases of how it sets itself up, but I guess the difference would be
whether userspace has to do a register save and restore and a prctl() or
just a plain prctl() with extra arguments. Doesn't seem fundamentally
different to me.
BTW, it's not a pointer to a bounds table, it's the bounds directory.
There are two levels of the tables.
On Mon, Jun 23, 2014 at 4:42 PM, Dave Hansen <[email protected]> wrote:
> On 06/23/2014 03:00 PM, Andy Lutomirski wrote:
>> Hmm. How about PR_SET/GET_MPX_BOUNDS_TABLE, to update the kernel's
>> copy. No fpu magic needed.
>>
>> This has an added benefit: CRIU will need updating for MPX, and
>> they'll appreciate having the required interface already exist.
>> (They'll want a way to allocate "MPX" memory, too, but that's probably
>> somewhat less important, and it won't result in duplicated
>> functionality.)
>
> I like the idea of the most minimal interface possible. If the kernel
> ever needed or wanted to cache more of the register setup, we wouldn't
> need to change the interface. For CRIU, I don't know much about the
> phases of how it sets itself up, but I guess the difference would be
> whether userspace has to do a register save and restore and a prctl() or
> just a plain prctl() with extra arguments. Doesn't seem fundamentally
> different to me.
>
> BTW, it's not a pointer to a bounds table, it's the bounds directory.
> There are two levels of the tables.
I suspect that the existence of the get operation matters more. What
if the checkpointed process has the cached copy out of sync with the
register copy? More realistically, what if the checkpointed process
doesn't want to use kernel MPX assistance at all? CRIU won't be able
to detect this with the current proposed interface.
--Andy
On 06/23/2014 05:01 PM, Andy Lutomirski wrote:
> I suspect that the existence of the get operation matters more. What
> if the checkpointed process has the cached copy out of sync with the
> register copy? More realistically, what if the checkpointed process
> doesn't want to use kernel MPX assistance at all? CRIU won't be able
> to detect this with the current proposed interface.
That's fair criticism.
Pavel, do you have any concrete suggestions on what kind of interface
would be nice for checkpoint/restart? Or is this something you will
just tackle down the road some other way?
On 2014-06-24, Andy Lutomirski wrote:
> On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
>> + /*
>> + * The error code field of the BNDSTATUS register communicates status
>> + * information of a bound range exception #BR or operation involving
>> + * bound directory.
>> + */
>> + switch (status & MPX_BNDSTA_ERROR_CODE) {
>> + case 2:
>> + /*
>> + * Bound directory has invalid entry.
>> + * No signal will be sent to the user space.
>
> This comment is a lie.
>
Hmm, thanks.
>> + */
>> + if (do_mpx_bt_fault(xsave_buf))
>> + force_sig(SIGBUS, tsk);
>
> Would it make sense to assign and use a new si_code value here?
>
There is a new si_code SEGV_BNDERR for bounds violations reported by MPX. But in this case the failure comes from the allocation of the bounds table, so I guess it is not necessary to add another new si_code value.
Thanks,
Qiaowei
On 2014-06-24, Andy Lutomirski wrote:
>> + /* Make bounds tables and bouds directory unlocked. */
>> + if (vm_flags & VM_LOCKED)
>> + vm_flags &= ~VM_LOCKED;
>
> Why? I would expect MCL_FUTURE to lock these.
>
Andy, I was just a little confused about LOCKED & POPULATE earlier, and I thought VM_LOCKED was not necessary for MPX-specific bounds tables. Now this check should be removed, and there should be an mm_populate() for the VM_LOCKED case after mmap_region():
if (!IS_ERR_VALUE(addr) && (vm_flags & VM_LOCKED))
mm_populate(addr, len);
Thanks,
Qiaowei
On 2014-06-24, Andy Lutomirski wrote:
>> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>>> Can the new vm_operation "name" be used for this? The magic "always
>>> written to core dumps" feature might need to be reconsidered.
>>
>> One thing I'd like to avoid is an MPX vma getting merged with a
>> non-MPX vma. I don't see any code to prevent two VMAs with
>> different vm_ops->names from getting merged. That seems like a bit
>> of a design oversight for ->name. Right?
>
> AFAIK there are no ->name users that don't also set ->close, for
> exactly that reason. I'd be okay with adding a check for ->name, too.
>
> Hmm. If MPX vmas had a real struct file attached, this would all come
> for free. Maybe vmas with non-default vm_ops and file != NULL should
> never be mergeable?
>
>>
>> Thinking out loud a bit... There are also some more complicated but
>> more performant cleanup mechanisms that I'd like to go after in the future.
>> Given a page, we might want to figure out if it is an MPX page or not.
>> I wonder if we'll ever collide with some other user of vm_ops->name.
>> It looks fairly narrowly used at the moment, but would this keep us
>> from putting these pages on, say, a tmpfs mount? Doesn't look that
>> way at the moment.
>
> You could always check the vm_ops pointer to see if it's MPX.
>
> One feature I've wanted: a way to have special per-process vmas that
> can be easily found. For example, I want to be able to efficiently
> find out where the vdso and vvar vmas are. I don't think this is currently supported.
>
Andy, if you add a check for ->name to avoid MPX vmas being merged with non-MPX vmas, I guess the workflow should be as follows (using _install_special_mapping to get a new vma):
unsigned long mpx_mmap(unsigned long len)
{
......
static struct vm_special_mapping mpx_mapping = {
.name = "[mpx]",
.pages = no_pages,
};
.......
vma = _install_special_mapping(mm, addr, len, vm_flags, &mpx_mapping);
......
}
Then, we could check the ->name to see if the VMA is MPX specific. Right?
Thanks,
Qiaowei
On Mon, Jun 23, 2014 at 10:53 PM, Ren, Qiaowei <[email protected]> wrote:
> On 2014-06-24, Andy Lutomirski wrote:
>>> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>>>> Can the new vm_operation "name" be used for this? The magic "always
>>>> written to core dumps" feature might need to be reconsidered.
>>>
>>> One thing I'd like to avoid is an MPX vma getting merged with a
>>> non-MPX vma. I don't see any code to prevent two VMAs with
>>> different vm_ops->names from getting merged. That seems like a bit
>>> of a design oversight for ->name. Right?
>>
>> AFAIK there are no ->name users that don't also set ->close, for
>> exactly that reason. I'd be okay with adding a check for ->name, too.
>>
>> Hmm. If MPX vmas had a real struct file attached, this would all come
>> for free. Maybe vmas with non-default vm_ops and file != NULL should
>> never be mergeable?
>>
>>>
>>> Thinking out loud a bit... There are also some more complicated but
>>> more performant cleanup mechanisms that I'd like to go after in the future.
>>> Given a page, we might want to figure out if it is an MPX page or not.
>>> I wonder if we'll ever collide with some other user of vm_ops->name.
>>> It looks fairly narrowly used at the moment, but would this keep us
>>> from putting these pages on, say, a tmpfs mount? Doesn't look that
>>> way at the moment.
>>
>> You could always check the vm_ops pointer to see if it's MPX.
>>
>> One feature I've wanted: a way to have special per-process vmas that
>> can be easily found. For example, I want to be able to efficiently
>> find out where the vdso and vvar vmas are. I don't think this is currently supported.
>>
> Andy, if you add a check for ->name to avoid MPX vmas getting merged with non-MPX vmas, I guess the workflow should be as follows (use _install_special_mapping to get a new vma):
>
> unsigned long mpx_mmap(unsigned long len)
> {
> 	......
> 	static struct vm_special_mapping mpx_mapping = {
> 		.name = "[mpx]",
> 		.pages = no_pages,
> 	};
> 	.......
> 	vma = _install_special_mapping(mm, addr, len, vm_flags, &mpx_mapping);
> 	......
> }
>
> Then, we could check the ->name to see if the VMA is MPX specific. Right?
Does this actually create a vma backed with real memory? Doesn't this
need to go through anon_vma or something? _install_special_mapping
completely prevents merging.
Possibly silly question: would it make more sense to just create one
giant vma for the MPX tables and only populate pieces of it as needed?
This wouldn't work for 32-bit code, but maybe we don't care. (I see
no reason why it couldn't work for x32, though.)
(I don't really understand how anonymous memory works at all. I'm not
an mm person.)
--Andy
On 2014-06-25, Andy Lutomirski wrote:
> On Mon, Jun 23, 2014 at 10:53 PM, Ren, Qiaowei <[email protected]>
> wrote:
>> On 2014-06-24, Andy Lutomirski wrote:
>>>> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>>>>> Can the new vm_operation "name" be used for this? The magic
>>>>> "always written to core dumps" feature might need to be reconsidered.
>>>>
>>>> One thing I'd like to avoid is an MPX vma getting merged with a
>>>> non-MPX vma. I don't see any code to prevent two VMAs with
>>>> different vm_ops->names from getting merged. That seems like a
>>>> bit of a design oversight for ->name. Right?
>>>
>>> AFAIK there are no ->name users that don't also set ->close, for
>>> exactly that reason. I'd be okay with adding a check for ->name, too.
>>>
>>> Hmm. If MPX vmas had a real struct file attached, this would all
>>> come for free. Maybe vmas with non-default vm_ops and file != NULL
>>> should never be mergeable?
>>>
>>>>
>>>> Thinking out loud a bit... There are also some more complicated
>>>> but more performant cleanup mechanisms that I'd like to go after in the future.
>>>> Given a page, we might want to figure out if it is an MPX page or not.
>>>> I wonder if we'll ever collide with some other user of vm_ops->name.
>>>> It looks fairly narrowly used at the moment, but would this keep
>>>> us from putting these pages on, say, a tmpfs mount? Doesn't look
>>>> that way at the moment.
>>>
>>> You could always check the vm_ops pointer to see if it's MPX.
>>>
>>> One feature I've wanted: a way to have special per-process vmas that
>>> can be easily found. For example, I want to be able to efficiently
>>> find out where the vdso and vvar vmas are. I don't think this is
>>> currently supported.
>>>
>> Andy, if you add a check for ->name to avoid MPX vmas getting merged
>> with non-MPX vmas, I guess the workflow should be as follows (use
>> _install_special_mapping to get a new vma):
>>
>> unsigned long mpx_mmap(unsigned long len)
>> {
>> 	......
>> 	static struct vm_special_mapping mpx_mapping = {
>> 		.name = "[mpx]",
>> 		.pages = no_pages,
>> 	};
>> 	.......
>> 	vma = _install_special_mapping(mm, addr, len, vm_flags, &mpx_mapping);
>> 	......
>> }
>>
>> Then, we could check the ->name to see if the VMA is MPX specific. Right?
>
> Does this actually create a vma backed with real memory? Doesn't this
> need to go through anon_vma or something? _install_special_mapping
> completely prevents merging.
>
Hmm, _install_special_mapping should completely prevent merging, even among MPX vmas.
So, could you tell me how to set the MPX-specific ->name on the vma when it is created? It seems I could not find such an interface.
Thanks,
Qiaowei
On Tue, Jun 24, 2014 at 6:40 PM, Ren, Qiaowei <[email protected]> wrote:
> On 2014-06-25, Andy Lutomirski wrote:
>> On Mon, Jun 23, 2014 at 10:53 PM, Ren, Qiaowei <[email protected]>
>> wrote:
>>> On 2014-06-24, Andy Lutomirski wrote:
>>>>> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>>>>>> Can the new vm_operation "name" be used for this? The magic
>>>>>> "always written to core dumps" feature might need to be reconsidered.
>>>>>
>>>>> One thing I'd like to avoid is an MPX vma getting merged with a
>>>>> non-MPX vma. I don't see any code to prevent two VMAs with
>>>>> different vm_ops->names from getting merged. That seems like a
>>>>> bit of a design oversight for ->name. Right?
>>>>
>>>> AFAIK there are no ->name users that don't also set ->close, for
>>>> exactly that reason. I'd be okay with adding a check for ->name, too.
>>>>
>>>> Hmm. If MPX vmas had a real struct file attached, this would all
>>>> come for free. Maybe vmas with non-default vm_ops and file != NULL
>>>> should never be mergeable?
>>>>
>>>>>
>>>>> Thinking out loud a bit... There are also some more complicated
>>>>> but more performant cleanup mechanisms that I'd like to go after in the future.
>>>>> Given a page, we might want to figure out if it is an MPX page or not.
>>>>> I wonder if we'll ever collide with some other user of vm_ops->name.
>>>>> It looks fairly narrowly used at the moment, but would this keep
>>>>> us from putting these pages on, say, a tmpfs mount? Doesn't look
>>>>> that way at the moment.
>>>>
>>>> You could always check the vm_ops pointer to see if it's MPX.
>>>>
>>>> One feature I've wanted: a way to have special per-process vmas that
>>>> can be easily found. For example, I want to be able to efficiently
>>>> find out where the vdso and vvar vmas are. I don't think this is
>>>> currently supported.
>>>>
>>> Andy, if you add a check for ->name to avoid MPX vmas getting merged
>>> with non-MPX vmas, I guess the workflow should be as follows (use
>>> _install_special_mapping to get a new vma):
>>>
>>> unsigned long mpx_mmap(unsigned long len)
>>> {
>>> 	......
>>> 	static struct vm_special_mapping mpx_mapping = {
>>> 		.name = "[mpx]",
>>> 		.pages = no_pages,
>>> 	};
>>> 	.......
>>> 	vma = _install_special_mapping(mm, addr, len, vm_flags, &mpx_mapping);
>>> 	......
>>> }
>>>
>>> Then, we could check the ->name to see if the VMA is MPX specific. Right?
>>
>> Does this actually create a vma backed with real memory? Doesn't this
>> need to go through anon_vma or something? _install_special_mapping
>> completely prevents merging.
>>
> Hmm, _install_special_mapping should completely prevent merging, even among MPX vmas.
>
> So, could you tell me how to set the MPX-specific ->name on the vma when it is created? It seems I could not find such an interface.
You may need to add one.
I'd suggest posting a new thread to linux-mm describing what you need
and asking how to do it.
--Andy
On Wed, Jun 25, 2014 at 2:04 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Jun 24, 2014 at 6:40 PM, Ren, Qiaowei <[email protected]> wrote:
>> On 2014-06-25, Andy Lutomirski wrote:
>>> On Mon, Jun 23, 2014 at 10:53 PM, Ren, Qiaowei <[email protected]>
>>> wrote:
>>>> On 2014-06-24, Andy Lutomirski wrote:
>>>>>> On 06/23/2014 01:06 PM, Andy Lutomirski wrote:
>>>>>>> Can the new vm_operation "name" be used for this? The magic
>>>>>>> "always written to core dumps" feature might need to be reconsidered.
>>>>>>
>>>>>> One thing I'd like to avoid is an MPX vma getting merged with a
>>>>>> non-MPX vma. I don't see any code to prevent two VMAs with
>>>>>> different vm_ops->names from getting merged. That seems like a
>>>>>> bit of a design oversight for ->name. Right?
>>>>>
>>>>> AFAIK there are no ->name users that don't also set ->close, for
>>>>> exactly that reason. I'd be okay with adding a check for ->name, too.
>>>>>
>>>>> Hmm. If MPX vmas had a real struct file attached, this would all
>>>>> come for free. Maybe vmas with non-default vm_ops and file != NULL
>>>>> should never be mergeable?
>>>>>
>>>>>>
>>>>>> Thinking out loud a bit... There are also some more complicated
>>>>>> but more performant cleanup mechanisms that I'd like to go after in the future.
>>>>>> Given a page, we might want to figure out if it is an MPX page or not.
>>>>>> I wonder if we'll ever collide with some other user of vm_ops->name.
>>>>>> It looks fairly narrowly used at the moment, but would this keep
>>>>>> us from putting these pages on, say, a tmpfs mount? Doesn't look
>>>>>> that way at the moment.
>>>>>
>>>>> You could always check the vm_ops pointer to see if it's MPX.
>>>>>
>>>>> One feature I've wanted: a way to have special per-process vmas that
>>>>> can be easily found. For example, I want to be able to efficiently
>>>>> find out where the vdso and vvar vmas are. I don't think this is
>>>>> currently supported.
>>>>>
>>>> Andy, if you add a check for ->name to avoid MPX vmas getting merged
>>>> with non-MPX vmas, I guess the workflow should be as follows (use
>>>> _install_special_mapping to get a new vma):
>>>>
>>>> unsigned long mpx_mmap(unsigned long len)
>>>> {
>>>> 	......
>>>> 	static struct vm_special_mapping mpx_mapping = {
>>>> 		.name = "[mpx]",
>>>> 		.pages = no_pages,
>>>> 	};
>>>> 	.......
>>>> 	vma = _install_special_mapping(mm, addr, len, vm_flags, &mpx_mapping);
>>>> 	......
>>>> }
>>>>
>>>> Then, we could check the ->name to see if the VMA is MPX specific. Right?
>>>
>>> Does this actually create a vma backed with real memory? Doesn't this
>>> need to go through anon_vma or something? _install_special_mapping
>>> completely prevents merging.
>>>
>> Hmm, _install_special_mapping should completely prevent merging, even among MPX vmas.
>>
>> So, could you tell me how to set the MPX-specific ->name on the vma when it is created? It seems I could not find such an interface.
>
> You may need to add one.
>
> I'd suggest posting a new thread to linux-mm describing what you need
> and asking how to do it.
Hmm. the memfd_create thing may be able to do this for you. If you
created a per-mm memfd and mapped it, it all just might work.
--Andy
On 06/25/2014 02:04 PM, Andy Lutomirski wrote:
> On Tue, Jun 24, 2014 at 6:40 PM, Ren, Qiaowei <[email protected]> wrote:
>> Hmm, _install_special_mapping should completely prevent merging, even among MPX vmas.
>>
>> So, could you tell me how to set the MPX-specific ->name on the vma when it is created? It seems I could not find such an interface.
>
> You may need to add one.
>
> I'd suggest posting a new thread to linux-mm describing what you need
> and asking how to do it.
I shared this with Qiaowei privately, but might as well repeat myself
here in case anyone wants to set me straight.
Most of the interfaces that set vm_ops do it in the file_operations
->mmap op. Nobody sets ->vm_ops on anonymous VMAs, so we're in
uncharted territory.
My suggestion: you can either plumb a new API down into mmap_region()
to get the VMA or set ->vm_ops, or just call find_vma() after
mmap_region() or get_unmapped_area() and set it manually. Just make
sure you still have mmap_sem held over the whole thing.
I think I prefer just setting ->vm_ops directly, even though it's a wee
bit of a hack to create something just to look it up a moment later.
Oh, well.
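A hedged sketch of that second variant, with illustrative names (mpx_vm_ops, mpx_vma_name, and the wrapper function are not from the patch; the caller is assumed to hold mm->mmap_sem for write across the whole sequence):

#include <linux/mm.h>

static const char *mpx_vma_name(struct vm_area_struct *vma)
{
	return "[mpx]";
}

static const struct vm_operations_struct mpx_vm_ops = {
	.name = mpx_vma_name,
};

static unsigned long mpx_mmap_and_mark(struct mm_struct *mm,
				       unsigned long addr, unsigned long len,
				       vm_flags_t vm_flags, unsigned long pgoff)
{
	addr = mmap_region(NULL, addr, len, vm_flags, pgoff);
	if (!IS_ERR_VALUE(addr)) {
		/* the "wee bit of a hack": look the new vma right back up */
		struct vm_area_struct *vma = find_vma(mm, addr);

		if (vma)
			vma->vm_ops = &mpx_vm_ops;	/* mark it as MPX */
	}
	return addr;
}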
On 06/25/2014 02:05 PM, Andy Lutomirski wrote:
> Hmm. the memfd_create thing may be able to do this for you. If you
> created a per-mm memfd and mapped it, it all just might work.
memfd_create() seems to bring a fair amount of baggage along (the fd
part :) if all we want is a marker. Really, all we need is _a_ bit, and
some way to plumb to userspace the RSS values of VMAs with that bit set.
Creating and mmap()'ing a fd seems a rather roundabout way to get there.
On Wed, Jun 25, 2014 at 2:45 PM, Dave Hansen <[email protected]> wrote:
> On 06/25/2014 02:05 PM, Andy Lutomirski wrote:
>> Hmm. the memfd_create thing may be able to do this for you. If you
>> created a per-mm memfd and mapped it, it all just might work.
>
> memfd_create() seems to bring a fair amount of baggage along (the fd
> part :) if all we want is a marker. Really, all we need is _a_ bit, and
> some way to plumb to userspace the RSS values of VMAs with that bit set.
>
> Creating and mmap()'ing a fd seems a rather roundabout way to get there.
Hmm. So does VM_MPX, though. If this stuff were done entirely in
userspace, then memfd_create would be exactly the right solution, I
think.
Would it work to just scan the bound directory to figure out how many
bound tables exist?
--Andy
On 06/26/2014 03:19 PM, Andy Lutomirski wrote:
> On Wed, Jun 25, 2014 at 2:45 PM, Dave Hansen <[email protected]> wrote:
>> On 06/25/2014 02:05 PM, Andy Lutomirski wrote:
>>> Hmm. the memfd_create thing may be able to do this for you. If you
>>> created a per-mm memfd and mapped it, it all just might work.
>>
>> memfd_create() seems to bring a fair amount of baggage along (the fd
>> part :) if all we want is a marker. Really, all we need is _a_ bit, and
>> some way to plumb to userspace the RSS values of VMAs with that bit set.
>>
>> Creating and mmap()'ing a fd seems a rather roundabout way to get there.
>
> Hmm. So does VM_MPX, though. If this stuff were done entirely in
> userspace, then memfd_create would be exactly the right solution, I
> think.
>
> Would it work to just scan the bound directory to figure out how many
> bound tables exist?
Theoretically, perhaps.
Practically, the bounds directory is 2GB, and it is likely to be very
sparse. You would have to walk the page tables finding where pages were
mapped, then search the mapped pages for bounds table entries.
Assuming that it was aligned and minimally populated, that's a *MINIMUM*
search looking for a PGD entry, then you have to look at 512 PUD
entries. A full search would have to look at half a million ptes.
That's just finding out how sparse the first level of the tables is,
and whether they were empty, before you've looked at a byte of actual data.
We could keep another, parallel, data structure that handles this better
other than the hardware tables. Like, say, an rbtree that stores ranges
of virtual addresses. We could call them vm_area_somethings ... wait a
sec... we have a structure like that. ;)
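For reference, the arithmetic behind the "half a million" figure (just the numbers cited above, assuming 4KB pages and a 2GB bounds directory):

#include <stdio.h>

int main(void)
{
	unsigned long bd_size = 1UL << 31;	/* 2GB bounds directory */
	unsigned long page_size = 1UL << 12;	/* 4KB pages */

	/* a full scan examines one pte per potentially mapped page */
	printf("ptes to examine: %lu\n", bd_size / page_size);	/* 524288 */
	return 0;
}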
On Thu, Jun 26, 2014 at 3:58 PM, Dave Hansen <[email protected]> wrote:
> On 06/26/2014 03:19 PM, Andy Lutomirski wrote:
>> On Wed, Jun 25, 2014 at 2:45 PM, Dave Hansen <[email protected]> wrote:
>>> On 06/25/2014 02:05 PM, Andy Lutomirski wrote:
>>>> Hmm. the memfd_create thing may be able to do this for you. If you
>>>> created a per-mm memfd and mapped it, it all just might work.
>>>
>>> memfd_create() seems to bring a fair amount of baggage along (the fd
>>> part :) if all we want is a marker. Really, all we need is _a_ bit, and
>>> some way to plumb to userspace the RSS values of VMAs with that bit set.
>>>
>>> Creating and mmap()'ing a fd seems a rather roundabout way to get there.
>>
>> Hmm. So does VM_MPX, though. If this stuff were done entirely in
>> userspace, then memfd_create would be exactly the right solution, I
>> think.
>>
>> Would it work to just scan the bound directory to figure out how many
>> bound tables exist?
>
> Theoretically, perhaps.
>
> Practically, the bounds directory is 2GB, and it is likely to be very
> sparse. You would have to walk the page tables finding where pages were
> mapped, then search the mapped pages for bounds table entries.
>
> Assuming that it was aligned and minimally populated, that's a *MINIMUM*
> search looking for a PGD entry, then you have to look at 512 PUD
> entries. A full search would have to look at half a million ptes.
> That's just finding out how sparse the first level of the tables is,
> and whether they were empty, before you've looked at a byte of actual data.
>
> We could keep another, parallel, data structure that handles this better
> other than the hardware tables. Like, say, an rbtree that stores ranges
> of virtual addresses. We could call them vm_area_somethings ... wait a
> sec... we have a structure like that. ;)
>
>
So here's my mental image of how I might do this if I were doing it
entirely in userspace: I'd create a file or memfd for the bound tables
and another for the bound directory. These files would be *huge*: the
bound directory file would be 2GB and the bounds table file would be
2^48 bytes or whatever it is. (Maybe even bigger?)
Then I'd just map pieces of those files wherever they'd need to be,
and I'd make the mappings sparse. I suspect that you don't actually
want a vma for each piece of bound table that gets mapped -- the space
of vmas could end up incredibly sparse. So I'd at least map (in the
vma sense, not the pte sense) an entire bound table at a time. And
I'd probably just map the bound directory in one big piece.
Then I'd populate it in the fault handler.
This is almost what the code is doing, I think, modulo the files.
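Concretely, a hedged userspace sketch of that scheme for the bounds-directory side (assuming the proposed memfd_create() syscall, which had not been merged at the time of this thread, and glossing over the fork semantics discussed next):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* proposed syscall, not yet in any released kernel at this point */
int memfd_create(const char *name, unsigned int flags);

int main(void)
{
	uint64_t bd_size = 1ULL << 31;	/* 2GB bounds directory file */

	int fd = memfd_create("mpx-bd", 0);
	if (fd < 0 || ftruncate(fd, (off_t)bd_size))
		return 1;

	/*
	 * Map the whole directory in one big piece; the file stays
	 * sparse, so pages only materialize when they are faulted in.
	 */
	void *bd = mmap(NULL, bd_size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);

	/* MAP_SHARED is the catch: see the killer problem below */
	return bd == MAP_FAILED;
}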
This has one killer problem: these mappings need to be private (cowed
on fork). So memfd is no good. There's got to be an easyish way to
modify the mm code to allow anonymous maps with vm_ops. Maybe a new
mmap_region parameter or something? Maybe even a special anon_vma,
but I don't really understand how those work.
Also, egads: what happens when a bound table entry is associated with
a MAP_SHARED page?
--Andy
On 06/26/2014 04:15 PM, Andy Lutomirski wrote:
> So here's my mental image of how I might do this if I were doing it
> entirely in userspace: I'd create a file or memfd for the bound tables
> and another for the bound directory. These files would be *huge*: the
> bound directory file would be 2GB and the bounds table file would be
> 2^48 bytes or whatever it is. (Maybe even bigger?)
>
> Then I'd just map pieces of those files wherever they'd need to be,
> and I'd make the mappings sparse. I suspect that you don't actually
> want a vma for each piece of bound table that gets mapped -- the space
> of vmas could end up incredibly sparse. So I'd at least map (in the
> vma sense, not the pte sense) an entire bound table at a time. And
> I'd probably just map the bound directory in one big piece.
>
> Then I'd populate it in the fault handler.
>
> This is almost what the code is doing, I think, modulo the files.
>
> This has one killer problem: these mappings need to be private (cowed
> on fork). So memfd is no good.
This essentially uses the page cache's radix tree as a parallel data
structure in order to keep a vaddr->mpx_vma map. That's not a bad idea,
but it is a parallel data structure that does not handle copy-on-write
very well.
I'm pretty sure we need the semantics that anonymous memory provides.
> There's got to be an easyish way to
> modify the mm code to allow anonymous maps with vm_ops. Maybe a new
> mmap_region parameter or something? Maybe even a special anon_vma,
> but I don't really understand how those work.
Yeah, we very well might end up having to go down that path.
> Also, egads: what happens when a bound table entry is associated with
> a MAP_SHARED page?
Bounds table entries are for pointers. Do we keep pointers inside of
MAP_SHARED-mapped things? :)
On Thu, Jun 26, 2014 at 5:19 PM, Dave Hansen <[email protected]> wrote:
> On 06/26/2014 04:15 PM, Andy Lutomirski wrote:
>> So here's my mental image of how I might do this if I were doing it
>> entirely in userspace: I'd create a file or memfd for the bound tables
>> and another for the bound directory. These files would be *huge*: the
>> bound directory file would be 2GB and the bounds table file would be
>> 2^48 bytes or whatever it is. (Maybe even bigger?)
>>
>> Then I'd just map pieces of those files wherever they'd need to be,
>> and I'd make the mappings sparse. I suspect that you don't actually
>> want a vma for each piece of bound table that gets mapped -- the space
>> of vmas could end up incredibly sparse. So I'd at least map (in the
>> vma sense, not the pte sense) an entire bound table at a time. And
>> I'd probably just map the bound directory in one big piece.
>>
>> Then I'd populate it in the fault handler.
>>
>> This is almost what the code is doing, I think, modulo the files.
>>
>> This has one killer problem: these mappings need to be private (cowed
>> on fork). So memfd is no good.
>
> This essentially uses the page cache's radix tree as a parallel data
> structure in order to keep a vaddr->mpx_vma map. That's not a bad idea,
> but it is a parallel data structure that does not handle copy-on-write
> very well.
>
> I'm pretty sure we need the semantics that anonymous memory provides.
>
>> There's got to be an easyish way to
>> modify the mm code to allow anonymous maps with vm_ops. Maybe a new
>> mmap_region parameter or something? Maybe even a special anon_vma,
>> but I don't really understand how those work.
>
> Yeah, we very well might end up having to go down that path.
>
>> Also, egads: what happens when a bound table entry is associated with
>> a MAP_SHARED page?
>
> Bounds table entries are for pointers. Do we keep pointers inside of
> MAP_SHARED-mapped things? :)
Sure, if it's MAP_SHARED | MAP_ANONYMOUS. For example:
struct thing {
	struct thing *next;
};

struct thing *storage = mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...);
storage[0].next = &storage[1];
fork();
I'm not suggesting that this needs to *work* in the first incarnation of this :)
--Andy
On 06/26/2014 05:26 PM, Andy Lutomirski wrote:
> On Thu, Jun 26, 2014 at 5:19 PM, Dave Hansen <[email protected]> wrote:
>> On 06/26/2014 04:15 PM, Andy Lutomirski wrote:
>>> Also, egads: what happens when a bound table entry is associated with
>>> a MAP_SHARED page?
>>
>> Bounds table entries are for pointers. Do we keep pointers inside of
>> MAP_SHARED-mapped things? :)
>
> Sure, if it's MAP_SHARED | MAP_ANONYMOUS. For example:
>
> struct thing {
> 	struct thing *next;
> };
>
> struct thing *storage = mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...);
> storage[0].next = &storage[1];
> fork();
>
> I'm not suggesting that this needs to *work* in the first incarnation of this :)
I'm not sure I'm seeing the issue.
I'm claiming that we need COW behavior for the bounds tables, at least
by default. If userspace knows enough about the ways that it is using
the tables and knows how to share them, let it go to town. The kernel
will permit this kind of usage model, but we simply won't be helping
with the management of the tables when userspace creates them.
You've demonstrated a case where userspace might theoretically want
to share bounds tables (although I think it's pretty dangerous).
It's equally possible that userspace might *not* want to share the
tables, for instance if one process narrowed the bounds and the
other did not.
On 06/27/2014 10:34 AM, Dave Hansen wrote:
> I'm claiming that we need COW behavior for the bounds tables, at least
> by default. If userspace knows enough about the ways that it is using
> the tables and knows how to share them, let it go to town. The kernel
> will permit this kind of usage model, but we simply won't be helping
> with the management of the tables when userspace creates them.
Actually, this is another reason we need to mark VMAs as being
MPX-related explicitly instead of inferring it from the tables. If
userspace does something really specialized like this, the kernel does
not want to confuse these VMAs with the ones it created.
On Fri, Jun 27, 2014 at 10:42 AM, Dave Hansen <[email protected]> wrote:
> On 06/27/2014 10:34 AM, Dave Hansen wrote:
>> I'm claiming that we need COW behavior for the bounds tables, at least
>> by default. If userspace knows enough about the ways that it is using
>> the tables and knows how to share them, let it go to town. The kernel
>> will permit this kind of usage model, but we simply won't be helping
>> with the management of the tables when userspace creates them.
>
> Actually, this is another reason we need to mark VMAs as being
> MPX-related explicitly instead of inferring it from the tables. If
> userspace does something really specialized like this, the kernel does
> not want to confuse these VMAs with the ones it created.
>
Good point.
--Andy
On 06/18/2014 02:44 AM, Qiaowei Ren wrote:
> +	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
> +	if (IS_ERR((void *)bt_addr)) {
> +		pr_err("Bounds table allocation failed at entry addr %p\n",
> +			bd_entry);
> +		return bt_addr;
> +	}
Don't forget to remove the unnecessary pr_*() calls for the next version.
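With the message dropped, the check would reduce to something like this sketch:

	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
	if (IS_ERR((void *)bt_addr))
		return bt_addr;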