Hi all,
Here is v6 of the XPFO set; see v5 discussion here:
https://lkml.org/lkml/2017/8/9/803
Changelogs are in the individual patch notes, but the highlights are:
* add primitives for ensuring memory areas are mapped (although these are quite
ugly, using stack allocation; I'm open to better suggestions, see the sketch
after this list)
* instead of not flushing caches, re-map pages using the above
* TLB flushing is much more correct (i.e. we're always flushing everything
everywhere). I suspect we may be able to back this off in some cases, but I'm
still trying to collect performance numbers to prove this is worth doing.
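For reference, the pattern callers end up with looks something like the below
(a sketch of what the arm64 patches in this series do; the on-stack VLA for
the mapping array is the ugly part):

    unsigned long num_pages = XPFO_NUM_PAGES(addr, size);
    void *mapping[num_pages];	/* stack allocation */

    xpfo_temp_map(addr, size, mapping, sizeof(mapping[0]) * num_pages);
    /* ... touch the linear map alias of addr, e.g. a dcache flush ... */
    xpfo_temp_unmap(addr, size, mapping, sizeof(mapping[0]) * num_pages);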
I have no TODOs left for this set myself, other than fixing whatever review
feedback people have. Thoughts and testing welcome!
Cheers,
Tycho
Juerg Haefliger (6):
mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
swiotlb: Map the buffer if it was unmapped by XPFO
arm64/mm: Add support for XPFO
arm64/mm, xpfo: temporarily map dcache regions
arm64/mm: Add support for XPFO to swiotlb
lkdtm: Add test for XPFO
Tycho Andersen (5):
mm: add MAP_HUGETLB support to vm_mmap
x86: always set IF before oopsing from page fault
xpfo: add primitives for mapping underlying memory
arm64/mm: disable section/contiguous mappings if XPFO is enabled
mm: add a user_virt_to_phys symbol
Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/cacheflush.h | 11 +
arch/arm64/mm/Makefile | 2 +
arch/arm64/mm/dma-mapping.c | 32 +--
arch/arm64/mm/flush.c | 7 +
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/xpfo.c | 127 +++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 25 +++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/fault.c | 6 +
arch/x86/mm/pageattr.c | 22 +-
arch/x86/mm/xpfo.c | 171 +++++++++++++++
drivers/misc/Makefile | 1 +
drivers/misc/lkdtm.h | 5 +
drivers/misc/lkdtm_core.c | 3 +
drivers/misc/lkdtm_xpfo.c | 194 +++++++++++++++++
include/linux/highmem.h | 15 +-
include/linux/mm.h | 2 +
include/linux/xpfo.h | 79 +++++++
lib/swiotlb.c | 3 +-
mm/Makefile | 1 +
mm/mmap.c | 19 +-
mm/page_alloc.c | 2 +
mm/page_ext.c | 4 +
mm/util.c | 32 +++
mm/xpfo.c | 273 ++++++++++++++++++++++++
security/Kconfig | 19 ++
29 files changed, 1005 insertions(+), 57 deletions(-)
create mode 100644 arch/arm64/mm/xpfo.c
create mode 100644 arch/x86/mm/xpfo.c
create mode 100644 drivers/misc/lkdtm_xpfo.c
create mode 100644 include/linux/xpfo.h
create mode 100644 mm/xpfo.c
--
2.11.0
From: Juerg Haefliger <[email protected]>
Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).
v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
CC: [email protected]
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/Makefile | 2 ++
arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index dfd908630631..44fa44ef02ec 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -121,6 +121,7 @@ config ARM64
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_XPFO
help
ARM 64-bit (AArch64) Linux support.
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 9b0ba191e48e..22e5cab543d8 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n
obj-$(CONFIG_KASAN) += kasan_init.o
KASAN_SANITIZE_kasan_init.o := n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index 000000000000..678e2be848eb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,58 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/tlbflush.h>
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ pte_t *pte = lookup_address((unsigned long)kaddr);
+
+ set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+ unsigned long kaddr = (unsigned long)page_address(page);
+ unsigned long size = PAGE_SIZE;
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
--
2.11.0
From: Juerg Haefliger <[email protected]>
This test simply reads from userspace memory via the kernel's linear
map.
v6: * drop an #ifdef, just let the test fail if XPFO is not supported
    * add an XPFO_SMP test to check that when one CPU does an xpfo unmap of
      an address, that address can't accidentally be used by other CPUs.
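For anyone wanting to reproduce: the tests are exposed through lkdtm's usual
debugfs interface, so (assuming the standard lkdtm setup) writing e.g.
XPFO_READ_USER to /sys/kernel/debug/provoke-crash/DIRECT should oops an XPFO
kernel, and just log a FAIL on a kernel without XPFO.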
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
drivers/misc/Makefile | 1 +
drivers/misc/lkdtm.h | 5 ++
drivers/misc/lkdtm_core.c | 3 +
drivers/misc/lkdtm_xpfo.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 203 insertions(+)
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index b0b766416306..8447b42a447d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -62,6 +62,7 @@ lkdtm-$(CONFIG_LKDTM) += lkdtm_heap.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_perms.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_rodata_objcopy.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_usercopy.o
+lkdtm-$(CONFIG_LKDTM) += lkdtm_xpfo.o
KCOV_INSTRUMENT_lkdtm_rodata.o := n
diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 3b4976396ec4..34a6ee37f216 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -64,4 +64,9 @@ void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
void lkdtm_USERCOPY_STACK_BEYOND(void);
void lkdtm_USERCOPY_KERNEL(void);
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
#endif
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 42d2b8e31e6b..9544e329de4b 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -235,6 +235,9 @@ struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
CRASHTYPE(USERCOPY_STACK_BEYOND),
CRASHTYPE(USERCOPY_KERNEL),
+ CRASHTYPE(XPFO_READ_USER),
+ CRASHTYPE(XPFO_READ_USER_HUGE),
+ CRASHTYPE(XPFO_SMP),
};
diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
new file mode 100644
index 000000000000..d903063bdd0b
--- /dev/null
+++ b/drivers/misc/lkdtm_xpfo.c
@@ -0,0 +1,194 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include <linux/cpumask.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/xpfo.h>
+#include <linux/kthread.h>
+
+#include <linux/delay.h>
+#include <linux/sched/task.h>
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+ unsigned long user_addr, user_data = XPFO_DATA;
+
+ user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+ PROT_READ | PROT_WRITE | PROT_EXEC,
+ flags, 0);
+ if (user_addr >= TASK_SIZE) {
+ pr_warn("Failed to allocate user memory\n");
+ return 0;
+ }
+
+ if (copy_to_user((void __user *)user_addr, &user_data,
+ sizeof(user_data))) {
+ pr_warn("copy_to_user failed\n");
+ goto free_user;
+ }
+
+ return user_addr;
+
+free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+ return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+ phys_addr_t phys_addr;
+ void *virt_addr;
+
+ phys_addr = user_virt_to_phys(user_addr);
+ if (!phys_addr) {
+ pr_warn("Failed to get physical address of user memory\n");
+ return NULL;
+ }
+
+ virt_addr = phys_to_virt(phys_addr);
+ if (phys_addr != virt_to_phys(virt_addr)) {
+ pr_warn("Physical address of user memory seems incorrect\n");
+ return NULL;
+ }
+
+ return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+ pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+ if (*(unsigned long *)virt_addr == XPFO_DATA)
+ pr_err("FAIL: Bad read succeeded?!\n");
+ else
+ pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+ unsigned long user_addr, *kernel;
+
+ user_addr = do_map(flags);
+ if (!user_addr) {
+ pr_err("FAIL: map failed\n");
+ return;
+ }
+
+ kernel = user_to_kernel(user_addr);
+ if (!kernel) {
+ pr_err("FAIL: user to kernel conversion failed\n");
+ goto free_user;
+ }
+
+ read_map(kernel);
+
+free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
+}
+
+struct smp_arg {
+ unsigned long *virt_addr;
+ unsigned int cpu;
+};
+
+static int smp_reader(void *parg)
+{
+ struct smp_arg *arg = parg;
+ unsigned long *virt_addr;
+
+ if (arg->cpu != smp_processor_id()) {
+ pr_err("FAIL: scheduled on wrong CPU?\n");
+ return 0;
+ }
+
+ virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
+ read_map(virt_addr);
+
+ return 0;
+}
+
+#ifdef CONFIG_X86
+#define XPFO_SMP_KILLED SIGKILL
+#elif defined(CONFIG_ARM64)
+#define XPFO_SMP_KILLED SIGSEGV
+#else
+#error unsupported arch
+#endif
+
+/* The idea here is to read from the kernel's map on a different thread than
+ * did the mapping (and thus the TLB flushing), to make sure that the page
+ * faults on other cores too.
+ */
+void lkdtm_XPFO_SMP(void)
+{
+ unsigned long user_addr, *virt_addr;
+ struct task_struct *thread;
+ int ret;
+ struct smp_arg arg;
+
+ if (num_online_cpus() < 2) {
+ pr_err("not enough to do a multi cpu test\n");
+ return;
+ }
+
+ arg.virt_addr = NULL;
+ arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
+ thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
+ if (IS_ERR(thread)) {
+ pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
+ return;
+ }
+
+ kthread_bind(thread, arg.cpu);
+ get_task_struct(thread);
+ wake_up_process(thread);
+
+ user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
+ if (!user_addr)
+ goto kill_thread;
+
+ virt_addr = user_to_kernel(user_addr);
+ if (!virt_addr) {
+ /*
+ * let's store something that will fail, so we can unblock the
+ * thread
+ */
+ smp_store_release(&arg.virt_addr, &arg);
+ goto free_user;
+ }
+
+ smp_store_release(&arg.virt_addr, virt_addr);
+
+ /* there must be a better way to do this. */
+ while (1) {
+ if (thread->exit_state)
+ break;
+ msleep_interruptible(100);
+ }
+
+free_user:
+ if (user_addr)
+ vm_munmap(user_addr, PAGE_SIZE);
+
+kill_thread:
+ ret = kthread_stop(thread);
+ if (ret != XPFO_SMP_KILLED)
+ pr_err("FAIL: thread wasn't killed: %d\n", ret);
+ put_task_struct(thread);
+}
--
2.11.0
From: Juerg Haefliger <[email protected]>
If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.
v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly
CC: [email protected]
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/mm/flush.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 21a8d828cbf4..e335e3fd4fca 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/xpfo.h>
#include <asm/cacheflush.h>
#include <asm/cache.h>
@@ -28,9 +29,15 @@
void sync_icache_aliases(void *kaddr, unsigned long len)
{
unsigned long addr = (unsigned long)kaddr;
+ unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+ void *mapping[num_pages];
if (icache_is_aliasing()) {
+ xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+ xpfo_temp_unmap(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
flush_icache_range(addr, addr + len);
--
2.11.0
We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.
v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
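For context, the LKDTM test in this series uses it roughly like this to find
the linear map alias of a user page (a sketch, error handling omitted):

    phys_addr_t phys = user_virt_to_phys(user_addr);
    void *linear = phys_to_virt(phys);	/* kernel linear map alias */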
CC: [email protected]
CC: [email protected]
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
arch/arm64/mm/xpfo.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/xpfo.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/xpfo.h | 5 +++++
3 files changed, 113 insertions(+)
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
index 342a9ccb93c1..94a667d94e15 100644
--- a/arch/arm64/mm/xpfo.c
+++ b/arch/arm64/mm/xpfo.c
@@ -74,3 +74,54 @@ void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
xpfo_temp_unmap(addr, size, mapping, sizeof(mapping[0]) * num_pages);
}
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(current->mm, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+
+ if (pud_sect(*pud) || !pud_present(*pud)) {
+ phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+ offset = addr & ~PUD_MASK;
+ goto out;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+
+ if (pmd_sect(*pmd) || !pmd_present(*pmd)) {
+ phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+ offset = addr & ~PMD_MASK;
+ goto out;
+ }
+
+ pte = pte_offset_kernel(pmd, addr);
+ phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ offset = addr & ~PAGE_MASK;
+
+out:
+ return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 6794d6724ab5..d24cf2c600e8 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -112,3 +112,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(current->mm, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+ phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+ offset = addr & ~P4D_MASK;
+ goto out;
+ }
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+
+ if (pud_large(*pud) || !pud_present(*pud)) {
+ phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+ offset = addr & ~PUD_MASK;
+ goto out;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+
+ if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+ phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+ offset = addr & ~PMD_MASK;
+ goto out;
+ }
+
+ pte = pte_offset_kernel(pmd, addr);
+ phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ offset = addr & ~PAGE_MASK;
+
+out:
+ return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1693af1a0293..be72da5fba26 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -19,6 +19,7 @@
#ifdef CONFIG_XPFO
#include <linux/dma-mapping.h>
+#include <linux/types.h>
extern struct page_ext_operations page_xpfo_ops;
@@ -45,6 +46,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
bool xpfo_enabled(void);
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -69,6 +72,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
static inline bool xpfo_enabled(void) { return false; }
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
--
2.11.0
From: Juerg Haefliger <[email protected]>
Pages that are unmapped by XPFO need to be mapped before the
__dma_{map,unmap}_area() operations and unmapped again afterwards (to restore
the original state), in order to prevent fatal page faults.
v6: * use the hoisted out temporary mapping code instead
CC: [email protected]
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/include/asm/cacheflush.h | 11 +++++++++++
arch/arm64/mm/dma-mapping.c | 32 ++++++++++++++++----------------
arch/arm64/mm/xpfo.c | 18 ++++++++++++++++++
include/linux/xpfo.h | 2 ++
4 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d74a284abdc2..b6a462e3b2f9 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -93,6 +93,17 @@ extern void __dma_map_area(const void *, size_t, int);
extern void __dma_unmap_area(const void *, size_t, int);
extern void __dma_flush_area(const void *, size_t);
+#ifdef CONFIG_XPFO
+#include <linux/xpfo.h>
+#define _dma_map_area(addr, size, dir) \
+ xpfo_dma_map_unmap_area(true, addr, size, dir)
+#define _dma_unmap_area(addr, size, dir) \
+ xpfo_dma_map_unmap_area(false, addr, size, dir)
+#else
+#define _dma_map_area(addr, size, dir) __dma_map_area(addr, size, dir)
+#define _dma_unmap_area(addr, size, dir) __dma_unmap_area(addr, size, dir)
+#endif
+
/*
* Copy user data from/to a page which is mapped into a different
* processes address space. Really, we want to allow our "user
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index f27d4dd04384..a79f200786ab 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -204,7 +204,7 @@ static dma_addr_t __swiotlb_map_page(struct device *dev, struct page *page,
dev_addr = swiotlb_map_page(dev, page, offset, size, dir, attrs);
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
return dev_addr;
}
@@ -216,7 +216,7 @@ static void __swiotlb_unmap_page(struct device *dev, dma_addr_t dev_addr,
{
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
swiotlb_unmap_page(dev, dev_addr, size, dir, attrs);
}
@@ -231,8 +231,8 @@ static int __swiotlb_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
for_each_sg(sgl, sg, ret, i)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
return ret;
}
@@ -248,8 +248,8 @@ static void __swiotlb_unmap_sg_attrs(struct device *dev,
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
swiotlb_unmap_sg_attrs(dev, sgl, nelems, dir, attrs);
}
@@ -258,7 +258,7 @@ static void __swiotlb_sync_single_for_cpu(struct device *dev,
enum dma_data_direction dir)
{
if (!is_device_dma_coherent(dev))
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
swiotlb_sync_single_for_cpu(dev, dev_addr, size, dir);
}
@@ -268,7 +268,7 @@ static void __swiotlb_sync_single_for_device(struct device *dev,
{
swiotlb_sync_single_for_device(dev, dev_addr, size, dir);
if (!is_device_dma_coherent(dev))
- __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
}
static void __swiotlb_sync_sg_for_cpu(struct device *dev,
@@ -280,8 +280,8 @@ static void __swiotlb_sync_sg_for_cpu(struct device *dev,
if (!is_device_dma_coherent(dev))
for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
swiotlb_sync_sg_for_cpu(dev, sgl, nelems, dir);
}
@@ -295,8 +295,8 @@ static void __swiotlb_sync_sg_for_device(struct device *dev,
swiotlb_sync_sg_for_device(dev, sgl, nelems, dir);
if (!is_device_dma_coherent(dev))
for_each_sg(sgl, sg, nelems, i)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
}
static int __swiotlb_mmap_pfn(struct vm_area_struct *vma,
@@ -758,7 +758,7 @@ static void __iommu_sync_single_for_cpu(struct device *dev,
return;
phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
- __dma_unmap_area(phys_to_virt(phys), size, dir);
+ _dma_unmap_area(phys_to_virt(phys), size, dir);
}
static void __iommu_sync_single_for_device(struct device *dev,
@@ -771,7 +771,7 @@ static void __iommu_sync_single_for_device(struct device *dev,
return;
phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
- __dma_map_area(phys_to_virt(phys), size, dir);
+ _dma_map_area(phys_to_virt(phys), size, dir);
}
static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
@@ -811,7 +811,7 @@ static void __iommu_sync_sg_for_cpu(struct device *dev,
return;
for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(sg_virt(sg), sg->length, dir);
+ _dma_unmap_area(sg_virt(sg), sg->length, dir);
}
static void __iommu_sync_sg_for_device(struct device *dev,
@@ -825,7 +825,7 @@ static void __iommu_sync_sg_for_device(struct device *dev,
return;
for_each_sg(sgl, sg, nelems, i)
- __dma_map_area(sg_virt(sg), sg->length, dir);
+ _dma_map_area(sg_virt(sg), sg->length, dir);
}
static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
index 678e2be848eb..342a9ccb93c1 100644
--- a/arch/arm64/mm/xpfo.c
+++ b/arch/arm64/mm/xpfo.c
@@ -11,8 +11,10 @@
* the Free Software Foundation.
*/
+#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/module.h>
+#include <linux/xpfo.h>
#include <asm/tlbflush.h>
@@ -56,3 +58,19 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
+
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned long num_pages = XPFO_NUM_PAGES(addr, size);
+ void *mapping[num_pages];
+
+ xpfo_temp_map(addr, size, mapping, sizeof(mapping[0]) * num_pages);
+
+ if (map)
+ __dma_map_area(addr, size, dir);
+ else
+ __dma_unmap_area(addr, size, dir);
+
+ xpfo_temp_unmap(addr, size, mapping, sizeof(mapping[0]) * num_pages);
+}
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 304b104ec637..d37a06c9d62c 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -18,6 +18,8 @@
#ifdef CONFIG_XPFO
+#include <linux/dma-mapping.h>
+
extern struct page_ext_operations page_xpfo_ops;
void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
--
2.11.0
XPFO doesn't support section/contiguous mappings yet, so let's disable them
if XPFO is turned on.
Thanks to Laura Abbott for the simplification from v5, and Mark Rutland for
pointing out we need NO_CONT_MAPPINGS too.
CC: [email protected]
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/mm/mmu.c | 2 +-
include/linux/xpfo.h | 4 ++++
mm/xpfo.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..34bb95303cce 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -420,7 +420,7 @@ static void __init map_mem(pgd_t *pgd)
struct memblock_region *reg;
int flags = 0;
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
/*
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index d37a06c9d62c..1693af1a0293 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -43,6 +43,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
size_t mapping_len);
+bool xpfo_enabled(void);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -65,6 +67,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
}
+static inline bool xpfo_enabled(void) { return false; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index f79075bf7d65..25fba05d01bd 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -70,6 +70,12 @@ struct page_ext_operations page_xpfo_ops = {
.init = init_xpfo,
};
+bool __init xpfo_enabled(void)
+{
+ return !xpfo_disabled;
+}
+EXPORT_SYMBOL(xpfo_enabled);
+
static inline struct xpfo *lookup_xpfo(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
--
2.11.0
From: Juerg Haefliger <[email protected]>
v6: * guard against lookup_xpfo() returning NULL
CC: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
---
include/linux/xpfo.h | 4 ++++
lib/swiotlb.c | 3 ++-
mm/xpfo.c | 15 +++++++++++++++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 442c58ee930e..04590d1dcefa 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -30,6 +30,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
void xpfo_free_pages(struct page *page, int order);
+bool xpfo_page_is_unmapped(struct page *page);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -37,6 +39,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
static inline void xpfo_free_pages(struct page *page, int order) { }
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index a8d74a733a38..d4fee5ca2d9e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -420,8 +420,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
{
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+ struct page *page = pfn_to_page(pfn);
- if (PageHighMem(pfn_to_page(pfn))) {
+ if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
/* The buffer does not have a mapping. Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index bff24afcaa2e..cdbcbac582d5 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -220,3 +220,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
spin_unlock(&xpfo->maplock);
}
EXPORT_SYMBOL(xpfo_kunmap);
+
+bool xpfo_page_is_unmapped(struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return false;
+
+ xpfo = lookup_xpfo(page);
+ if (unlikely(!xpfo) || !xpfo->inited)
+ return false;
+
+ return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
--
2.11.0
In some cases (arm64 DMA and data cache flushes), the underlying pages we need
to operate on may have been unmapped by XPFO. Here are some primitives useful
for ensuring the underlying memory is mapped/unmapped in the face of XPFO.
Signed-off-by: Tycho Andersen <[email protected]>
---
include/linux/xpfo.h | 22 ++++++++++++++++++++++
mm/xpfo.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 04590d1dcefa..304b104ec637 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -32,6 +32,15 @@ void xpfo_free_pages(struct page *page, int order);
bool xpfo_page_is_unmapped(struct page *page);
+#define XPFO_NUM_PAGES(addr, size) \
+ (PFN_UP((unsigned long) (addr) + (size)) - \
+ PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+ size_t mapping_len);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -41,6 +50,19 @@ static inline void xpfo_free_pages(struct page *page, int order) { }
static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+ void **mapping, size_t mapping_len)
+{
+}
+
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index cdbcbac582d5..f79075bf7d65 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -13,6 +13,7 @@
* the Free Software Foundation.
*/
+#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/page_ext.h>
@@ -235,3 +236,32 @@ bool xpfo_page_is_unmapped(struct page *page)
return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
}
EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+ struct page *page = virt_to_page(addr);
+ int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+ memset(mapping, 0, mapping_len);
+
+ for (i = 0; i < num_pages; i++) {
+ if (page_to_virt(page + i) >= addr + size)
+ break;
+
+ if (xpfo_page_is_unmapped(page + i))
+ mapping[i] = kmap_atomic(page + i);
+ }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+ int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+ for (i = 0; i < num_pages; i++)
+ if (mapping[i])
+ kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
--
2.11.0
Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:
Aug 23 19:30:27 xpfo kernel: [ 38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [ 38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [ 38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [ 38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [ 38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [ 38.307243] dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [ 38.307665] ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [ 38.308139] __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [ 38.308593] exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [ 38.309046] ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [ 38.309677] do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [ 38.310078] ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [ 38.310604] ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [ 38.311045] ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [ 38.311680] rewind_stack_do_exit+0x17/0x20
To be safe, let's just always enable irqs.
The particular case I'm hitting is:
Aug 23 19:30:27 xpfo kernel: [ 38.278615] __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [ 38.278617] bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [ 38.278618] __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [ 38.278620] ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [ 38.278623] ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [ 38.278624] trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [ 38.278625] do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [ 38.278627] async_page_fault+0x28/0x30
i.e. when the fault is in kernel space and has been triggered by XPFO.
Signed-off-by: Tycho Andersen <[email protected]>
CC: [email protected]
---
arch/x86/mm/fault.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2a1fa10c6a98..7572ad4dae70 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -864,6 +864,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
+ /*
+ * We're about to oops, which might kill the task. Make sure we're
+ * allowed to sleep.
+ */
+ flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
}
--
2.11.0
From: Juerg Haefliger <[email protected]>
This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.
Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
- two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
- a reference counter to balance kmap/kunmap operations.
- a lock to serialize access to the XPFO fields.
This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
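To illustrate what this means for the rest of the kernel: code that
legitimately needs to touch a user-owned page keeps using kmap()/kunmap() (or
the atomic variants), which this patch hooks. A rough sketch (page and buffer
are just placeholders):

    /* page is owned by user space, so its physmap entry has been removed */
    void *kaddr = kmap_atomic(page);	/* xpfo_kmap(): restore the mapping,
					 * no TLB flush needed */
    memcpy(buffer, kaddr, PAGE_SIZE);
    kunmap_atomic(kaddr);		/* xpfo_kunmap(): unmap again and flush
					 * the TLB once the map count hits 0 */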
v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
__split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
pass it
CC: [email protected]
Suggested-by: Vasileios P. Kemerlis <[email protected]>
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Marco Benatto <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 25 +++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/pageattr.c | 22 +--
arch/x86/mm/xpfo.c | 114 ++++++++++++
include/linux/highmem.h | 15 +-
include/linux/xpfo.h | 42 +++++
mm/Makefile | 1 +
mm/page_alloc.c | 2 +
mm/page_ext.c | 4 +
mm/xpfo.c | 222 ++++++++++++++++++++++++
security/Kconfig | 19 ++
13 files changed, 449 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..444d83183f75 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2736,6 +2736,8 @@
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
+ noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on.
+
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 323cb065be5e..d78a0d538900 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -185,6 +185,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select ARCH_SUPPORTS_XPFO if X86_64
config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..c2eb40f7a74b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1238,6 +1238,31 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
return __pte_access_permitted(pud_val(pud), write);
}
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+ unsigned long *vaddr;
+ pgd_t *pgd;
+ pgprot_t mask_set;
+ pgprot_t mask_clr;
+ unsigned long numpages;
+ int flags;
+ unsigned long pfn;
+ unsigned force_split : 1;
+ int curpage;
+ struct page **pages;
+};
+
+
+int
+try_preserve_large_page(pte_t *kpte, unsigned long address,
+ struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+ struct page *base);
+
#include <asm-generic/pgtable.h>
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 0fbdcb64f9f8..89ba6d25fb51 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -39,3 +39,4 @@ obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 757b0bcdf712..f25d07191e60 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -27,28 +27,12 @@
#include <asm/set_memory.h>
/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
- unsigned long *vaddr;
- pgd_t *pgd;
- pgprot_t mask_set;
- pgprot_t mask_clr;
- unsigned long numpages;
- int flags;
- unsigned long pfn;
- unsigned force_split : 1;
- int curpage;
- struct page **pages;
-};
-
-/*
* Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
* using cpa_lock. So that we don't allow any other cpu, with stale large tlb
* entries change the page attribute in parallel to some other cpu
* splitting a large page entry along with changing the attribute.
*/
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
#define CPA_FLUSHTLB 1
#define CPA_ARRAY 2
@@ -512,7 +496,7 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
#endif
}
-static int
+int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
@@ -648,7 +632,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
return do_split;
}
-static int
+int
__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
struct page *base)
{
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..6794d6724ab5
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ unsigned int level;
+ pgprot_t msk_clr;
+ pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+ if (unlikely(!pte)) {
+ WARN(1, "xpfo: invalid address %p\n", kaddr);
+ return;
+ }
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+ break;
+ case PG_LEVEL_2M:
+ case PG_LEVEL_1G: {
+ struct cpa_data cpa = { };
+ int do_split;
+
+ if (level == PG_LEVEL_2M)
+ msk_clr = pmd_pgprot(*(pmd_t*)pte);
+ else
+ msk_clr = pud_pgprot(*(pud_t*)pte);
+
+ cpa.vaddr = kaddr;
+ cpa.pages = &page;
+ cpa.mask_set = prot;
+ cpa.mask_clr = msk_clr;
+ cpa.numpages = 1;
+ cpa.flags = 0;
+ cpa.curpage = 0;
+ cpa.force_split = 0;
+
+
+ do_split = try_preserve_large_page(pte, (unsigned long)kaddr,
+ &cpa);
+ if (do_split) {
+ struct page *base;
+
+ base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 0);
+ if (!base) {
+ WARN(1, "xpfo: failed to split large page\n");
+ break;
+ }
+
+ if (!debug_pagealloc_enabled())
+ spin_lock(&cpa_lock);
+ if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
+ WARN(1, "xpfo: failed to split large page\n");
+ if (!debug_pagealloc_enabled())
+ spin_unlock(&cpa_lock);
+ }
+
+ break;
+ }
+ case PG_LEVEL_512G:
+ /* fallthrough, splitting infrastructure doesn't
+ * support 512G pages. */
+ default:
+ WARN(1, "xpfo: unsupported page level %x\n", level);
+ }
+
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+ int level;
+ unsigned long size, kaddr;
+
+ kaddr = (unsigned long)page_address(page);
+
+ if (unlikely(!lookup_address(kaddr, &level))) {
+ WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
+ return;
+ }
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ size = PAGE_SIZE;
+ break;
+ case PG_LEVEL_2M:
+ size = PMD_SIZE;
+ break;
+ case PG_LEVEL_1G:
+ size = PUD_SIZE;
+ break;
+ default:
+ WARN(1, "xpfo: unsupported page level %x\n", level);
+ return;
+ }
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/hardirq.h>
+#include <linux/xpfo.h>
#include <asm/cacheflush.h>
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
#ifndef ARCH_HAS_KMAP
static inline void *kmap(struct page *page)
{
+ void *kaddr;
+
might_sleep();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}
static inline void kunmap(struct page *page)
{
+ xpfo_kunmap(page_address(page), page);
}
static inline void *kmap_atomic(struct page *page)
{
+ void *kaddr;
+
preempt_disable();
pagefault_disable();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}
#define kmap_atomic_prot(page, prot) kmap_atomic(page)
static inline void __kunmap_atomic(void *addr)
{
+ xpfo_kunmap(addr, virt_to_page(addr));
pagefault_enable();
preempt_enable();
}
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..442c58ee930e
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ * Tycho Andersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+ enum dma_data_direction dir);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..0be67cac8f6c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1423da8dd16f..09fdf1bad21f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1059,6 +1059,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order, 0);
kernel_map_pages(page, 1 << order, 0);
kasan_free_pages(page, order);
+ xpfo_free_pages(page, order);
return true;
}
@@ -1758,6 +1759,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
kernel_map_pages(page, 1 << order, 1);
kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
+ xpfo_alloc_pages(page, order, gfp_flags);
set_page_owner(page, order, gfp_flags);
}
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..4899df1f5d66 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
#include <linux/kmemleak.h>
#include <linux/page_owner.h>
#include <linux/page_idle.h>
+#include <linux/xpfo.h>
/*
* struct page extension
@@ -65,6 +66,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_XPFO
+ &page_xpfo_ops,
+#endif
};
static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..bff24afcaa2e
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,222 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ * Tycho Andersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+ XPFO_PAGE_USER, /* Page is allocated to user-space */
+ XPFO_PAGE_UNMAPPED, /* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+ unsigned long flags; /* Page state */
+ bool inited; /* Map counter and lock initialized */
+ atomic_t mapcount; /* Counter for balancing map/unmap requests */
+ spinlock_t maplock; /* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+ xpfo_disabled = true;
+
+ return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+ if (xpfo_disabled) {
+ printk(KERN_INFO "XPFO disabled\n");
+ return false;
+ }
+
+ return true;
+}
+
+static void init_xpfo(void)
+{
+ printk(KERN_INFO "XPFO enabled\n");
+ static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+ .size = sizeof(struct xpfo),
+ .need = need_xpfo,
+ .init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext)) {
+ WARN(1, "xpfo: failed to get page ext");
+ return NULL;
+ }
+
+ return (void *)page_ext + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+ int i, flush_tlb = 0;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+ if (!xpfo)
+ continue;
+
+ WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+ "xpfo: unmapped page being allocated\n");
+
+ /* Initialize the map lock and map counter */
+ if (unlikely(!xpfo->inited)) {
+ spin_lock_init(&xpfo->maplock);
+ atomic_set(&xpfo->mapcount, 0);
+ xpfo->inited = true;
+ }
+ WARN(atomic_read(&xpfo->mapcount),
+ "xpfo: already mapped page being allocated\n");
+
+ if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+ /*
+ * Tag the page as a user page and flush the TLB if it
+ * was previously allocated to the kernel.
+ */
+ if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ flush_tlb = 1;
+ } else {
+ /* Tag the page as a non-user (kernel) page */
+ clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+ }
+ }
+
+ if (flush_tlb)
+ xpfo_flush_kernel_tlb(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+ int i;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+ if (!xpfo || unlikely(!xpfo->inited)) {
+ /*
+ * The page was allocated before page_ext was
+ * initialized, so it is a kernel page.
+ */
+ continue;
+ }
+
+ /*
+ * Map the page back into the kernel if it was previously
+ * allocated to user space.
+ */
+ if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
+ clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ set_kpte(page_address(page + i), page + i,
+ PAGE_KERNEL);
+ }
+ }
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (!xpfo || unlikely(!xpfo->inited) ||
+ !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock(&xpfo->maplock);
+
+ /*
+ * The page was previously allocated to user space, so map it back
+ * into the kernel. No TLB flush required.
+ */
+ if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+ test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ set_kpte(kaddr, page, PAGE_KERNEL);
+
+ spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (!xpfo || unlikely(!xpfo->inited) ||
+ !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock(&xpfo->maplock);
+
+ /*
+ * The page is to be allocated back to user space, so unmap it from the
+ * kernel, flush the TLB and tag it as a user page.
+ */
+ if (atomic_dec_return(&xpfo->mapcount) == 0) {
+ WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+ "xpfo: unmapping already unmapped page\n");
+ set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ set_kpte(kaddr, page, __pgprot(0));
+ xpfo_flush_kernel_tlb(page, 0);
+ }
+
+ spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..be5145eeed7d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
source security/keys/Kconfig
+config ARCH_SUPPORTS_XPFO
+ bool
+
+config XPFO
+ bool "Enable eXclusive Page Frame Ownership (XPFO)"
+ default n
+ depends on ARCH_SUPPORTS_XPFO
+ select PAGE_EXTENSION
+ help
+ This option offers protection against 'ret2dir' kernel attacks.
+ When enabled, every time a page frame is allocated to user space, it
+ is unmapped from the direct mapped RAM region in kernel space
+ (physmap). Similarly, when a page frame is freed/reclaimed, it is
+ mapped back to physmap.
+
+ There is a slight performance impact when this option is enabled.
+
+ If in doubt, say "N".
+
config SECURITY_DMESG_RESTRICT
bool "Restrict unprivileged access to the kernel syslog"
default n
--
2.11.0
vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's add MAP_HUGETLB support to vm_mmap.
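A caller then looks roughly like the lkdtm test in this series (a sketch):

    unsigned long addr;

    addr = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0);
    if (addr >= TASK_SIZE)
        return;	/* errors come back as -errno cast to unsigned long */

with the length transparently rounded up to the huge page size by
map_hugetlb_setup().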
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
include/linux/mm.h | 2 ++
mm/mmap.c | 19 +------------------
mm/util.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1f6c95f3496..5dfe009adcb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2141,6 +2141,8 @@ struct vm_unmapped_area_info {
extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
/*
* Search for an unmapped address range.
*
diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..f24fc14808e1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1490,24 +1490,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
- struct hstate *hs;
-
- hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
- if (!hs)
- return -EINVAL;
-
- len = ALIGN(len, huge_page_size(hs));
- /*
- * VM_NORESERVE is used because the reservations will be
- * taken when vm_ops->mmap() is called
- * A dummy user value is used because we are not locking
- * memory so no accounting is necessary
- */
- file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
- VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
- (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 9ecddf568fe3..93c253512aaa 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -340,6 +340,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
return ret;
}
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+ struct user_struct *user = NULL;
+ struct hstate *hs;
+
+ hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ if (!hs)
+ return ERR_PTR(-EINVAL);
+
+ *len = ALIGN(*len, huge_page_size(hs));
+
+ /*
+ * VM_NORESERVE is used because the reservations will be
+ * taken when vm_ops->mmap() is called
+ * A dummy user value is used because we are not locking
+ * memory so no accounting is necessary
+ */
+ return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+ VM_NORESERVE,
+ &user, HUGETLB_ANONHUGE_INODE,
+ (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -349,6 +372,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
+ if (flag & MAP_HUGETLB) {
+ if (file)
+ return -EINVAL;
+
+ file = map_hugetlb_setup(&len, flag);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
}
EXPORT_SYMBOL(vm_mmap);
--
2.11.0
> - if (PageHighMem(pfn_to_page(pfn))) {
> + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
Please don't sprinkle xpfo details over various bits of code.
Just add a helper with a descriptive name, e.g.
page_is_unmapped()
that also includes the highmem case, as that will easily document
what this check is doing.
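For concreteness, the suggested helper would presumably look something like:

    static inline bool page_is_unmapped(struct page *page)
    {
        return PageHighMem(page) || xpfo_page_is_unmapped(page);
    }

so that swiotlb_bounce() only needs one self-describing check.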
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Tycho Andersen
> Sent: Thursday, September 7, 2017 10:36 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; Marco Benatto
> <[email protected]>; Juerg Haefliger
> <[email protected]>; [email protected]; Tycho Andersen
> <[email protected]>
> Subject: [PATCH v6 03/11] mm, x86: Add support for eXclusive Page Frame
> Ownership (XPFO)
>
> From: Juerg Haefliger <[email protected]>
>
> This patch adds support for XPFO which protects against 'ret2dir' kernel attacks.
> The basic idea is to enforce exclusive ownership of page frames by either the
> kernel or userspace, unless explicitly requested by the kernel. Whenever a page
> destined for userspace is allocated, it is unmapped from physmap (the kernel's
> page table). When such a page is reclaimed from userspace, it is mapped back to
> physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping,
> specifically:
> - two flags to distinguish user vs. kernel pages and to tag unmapped
> pages.
> - a reference counter to balance kmap/kunmap operations.
> - a lock to serialize access to the XPFO fields.
>
> This patch is based on the work of Vasileios P. Kemerlis et al. who published their
> work in this paper:
> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
> the tlb entry on all CPUs when unmapping it in kunmap
> * handle lookup_page_ext()/lookup_xpfo() returning NULL
> * drop lots of BUG()s in favor of WARN()
> * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
> __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
> pass it
>
> CC: [email protected]
> Suggested-by: Vasileios P. Kemerlis <[email protected]>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Marco Benatto <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 2 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 25 +++
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/pageattr.c | 22 +--
> arch/x86/mm/xpfo.c | 114 ++++++++++++
> include/linux/highmem.h | 15 +-
> include/linux/xpfo.h | 42 +++++
> mm/Makefile | 1 +
> mm/page_alloc.c | 2 +
> mm/page_ext.c | 4 +
> mm/xpfo.c | 222 ++++++++++++++++++++++++
> security/Kconfig | 19 ++
> 13 files changed, 449 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index d9c171ce4190..444d83183f75 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2736,6 +2736,8 @@
>
> nox2apic [X86-64,APIC] Do not enable x2APIC mode.
>
> + noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on.
> +
> cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
> CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
> Some features depend on CPU0. Known dependencies
<... snip>
A bit more description for system administrators would be very useful.
Perhaps something like:
noxpfo [XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO)
Physical pages mapped into user applications will also be mapped
in the kernel's address space as if CONFIG_XPFO was not enabled.
Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo.
On Thu, Sep 07, 2017 at 11:10:15AM -0700, Christoph Hellwig wrote:
> > - if (PageHighMem(pfn_to_page(pfn))) {
> > + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>
> Please don't sprinkle xpfo details over various bits of code.
>
> Just add a helper with a descriptive name, e.g.
>
> page_is_unmapped()
>
> that also includes the highmem case, as that will easily document
> what this check is doing.
Will do, thanks.
Patch 7 has a similar feel to this one, I can add a wrapper around
__clean_dcache_area_pou() if that makes sense?
Tycho
On Thu, Sep 07, 2017 at 06:33:09PM +0000, Ralph Campbell wrote:
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2736,6 +2736,8 @@
> >
> > nox2apic [X86-64,APIC] Do not enable x2APIC mode.
> >
> > + noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on.
> > +
> > cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
> > CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
> > Some features depend on CPU0. Known dependencies
> <... snip>
>
> A bit more description for system administrators would be very useful.
> Perhaps something like:
>
> noxpfo [XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO)
> Physical pages mapped into user applications will also be mapped
> in the kernel's address space as if CONFIG_XPFO was not enabled.
>
> Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo.
Nice catch, thanks. I'll fix both.
Cheers,
Tycho
On Thu, Sep 7, 2017 at 10:36 AM, Tycho Andersen <[email protected]> wrote:
> From: Juerg Haefliger <[email protected]>
>
> This test simply reads from userspace memory via the kernel's linear
> map.
>
> v6: * drop an #ifdef, just let the test fail if XPFO is not supported
> * add XPFO_SMP test to try and test the case when one CPU does an xpfo
> unmap of an address, that it can't be used accidentally by other
> CPUs.
This is very close! Thanks for the updates. :) Notes below...
>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Tested-by: Marco Benatto <[email protected]>
> ---
> drivers/misc/Makefile | 1 +
> drivers/misc/lkdtm.h | 5 ++
> drivers/misc/lkdtm_core.c | 3 +
> drivers/misc/lkdtm_xpfo.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 203 insertions(+)
>
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index b0b766416306..8447b42a447d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -62,6 +62,7 @@ lkdtm-$(CONFIG_LKDTM) += lkdtm_heap.o
> lkdtm-$(CONFIG_LKDTM) += lkdtm_perms.o
> lkdtm-$(CONFIG_LKDTM) += lkdtm_rodata_objcopy.o
> lkdtm-$(CONFIG_LKDTM) += lkdtm_usercopy.o
> +lkdtm-$(CONFIG_LKDTM) += lkdtm_xpfo.o
>
> KCOV_INSTRUMENT_lkdtm_rodata.o := n
>
> diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
> index 3b4976396ec4..34a6ee37f216 100644
> --- a/drivers/misc/lkdtm.h
> +++ b/drivers/misc/lkdtm.h
> @@ -64,4 +64,9 @@ void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
> void lkdtm_USERCOPY_STACK_BEYOND(void);
> void lkdtm_USERCOPY_KERNEL(void);
>
> +/* lkdtm_xpfo.c */
> +void lkdtm_XPFO_READ_USER(void);
> +void lkdtm_XPFO_READ_USER_HUGE(void);
> +void lkdtm_XPFO_SMP(void);
> +
> #endif
> diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
> index 42d2b8e31e6b..9544e329de4b 100644
> --- a/drivers/misc/lkdtm_core.c
> +++ b/drivers/misc/lkdtm_core.c
> @@ -235,6 +235,9 @@ struct crashtype crashtypes[] = {
> CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
> CRASHTYPE(USERCOPY_STACK_BEYOND),
> CRASHTYPE(USERCOPY_KERNEL),
> + CRASHTYPE(XPFO_READ_USER),
> + CRASHTYPE(XPFO_READ_USER_HUGE),
> + CRASHTYPE(XPFO_SMP),
> };
>
>
> diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
> new file mode 100644
> index 000000000000..d903063bdd0b
> --- /dev/null
> +++ b/drivers/misc/lkdtm_xpfo.c
> @@ -0,0 +1,194 @@
> +/*
> + * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
> + */
> +
> +#include "lkdtm.h"
> +
> +#include <linux/cpumask.h>
> +#include <linux/mman.h>
> +#include <linux/uaccess.h>
> +#include <linux/xpfo.h>
> +#include <linux/kthread.h>
> +
> +#include <linux/delay.h>
> +#include <linux/sched/task.h>
> +
> +#define XPFO_DATA 0xdeadbeef
> +
> +static unsigned long do_map(unsigned long flags)
> +{
> + unsigned long user_addr, user_data = XPFO_DATA;
> +
> + user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
> + PROT_READ | PROT_WRITE | PROT_EXEC,
> + flags, 0);
> + if (user_addr >= TASK_SIZE) {
> + pr_warn("Failed to allocate user memory\n");
> + return 0;
> + }
> +
> + if (copy_to_user((void __user *)user_addr, &user_data,
> + sizeof(user_data))) {
> + pr_warn("copy_to_user failed\n");
> + goto free_user;
> + }
> +
> + return user_addr;
> +
> +free_user:
> + vm_munmap(user_addr, PAGE_SIZE);
> + return 0;
> +}
> +
> +static unsigned long *user_to_kernel(unsigned long user_addr)
> +{
> + phys_addr_t phys_addr;
> + void *virt_addr;
> +
> + phys_addr = user_virt_to_phys(user_addr);
> + if (!phys_addr) {
> + pr_warn("Failed to get physical address of user memory\n");
> + return NULL;
> + }
> +
> + virt_addr = phys_to_virt(phys_addr);
> + if (phys_addr != virt_to_phys(virt_addr)) {
> + pr_warn("Physical address of user memory seems incorrect\n");
> + return NULL;
> + }
> +
> + return virt_addr;
> +}
> +
> +static void read_map(unsigned long *virt_addr)
> +{
> + pr_info("Attempting bad read from kernel address %p\n", virt_addr);
> + if (*(unsigned long *)virt_addr == XPFO_DATA)
> + pr_err("FAIL: Bad read succeeded?!\n");
> + else
> + pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
> +}
> +
> +static void read_user_with_flags(unsigned long flags)
> +{
> + unsigned long user_addr, *kernel;
> +
> + user_addr = do_map(flags);
> + if (!user_addr) {
> + pr_err("FAIL: map failed\n");
> + return;
> + }
> +
> + kernel = user_to_kernel(user_addr);
> + if (!kernel) {
> + pr_err("FAIL: user to kernel conversion failed\n");
> + goto free_user;
> + }
> +
> + read_map(kernel);
> +
> +free_user:
> + vm_munmap(user_addr, PAGE_SIZE);
> +}
> +
> +/* Read from userspace via the kernel's linear map. */
> +void lkdtm_XPFO_READ_USER(void)
> +{
> + read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
> +}
> +
> +void lkdtm_XPFO_READ_USER_HUGE(void)
> +{
> + read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
> +}
> +
> +struct smp_arg {
> + unsigned long *virt_addr;
> + unsigned int cpu;
> +};
> +
> +static int smp_reader(void *parg)
> +{
> + struct smp_arg *arg = parg;
> + unsigned long *virt_addr;
> +
> + if (arg->cpu != smp_processor_id()) {
> + pr_err("FAIL: scheduled on wrong CPU?\n");
> + return 0;
> + }
> +
> + virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
> + read_map(virt_addr);
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_X86
> +#define XPFO_SMP_KILLED SIGKILL
> +#elif CONFIG_ARM64
> +#define XPFO_SMP_KILLED SIGSEGV
> +#else
> +#error unsupported arch
> +#endif
This will fail the build for other architectures, so I would just do
this as a single if/else:
#ifdef CONFIG_ARM64
# define XPFO_SMP_KILLED SIGSEGV
#else
# define XPFO_SMP_KILLED SIGKILL
#endif
> +
> +/* The idea here is to read from the kernel's map on a different thread than
Comment style nit: leading /*\n please...
> + * did the mapping (and thus the TLB flushing), to make sure that the page
> + * faults on other cores too.
> + */
> +void lkdtm_XPFO_SMP(void)
> +{
> + unsigned long user_addr, *virt_addr;
> + struct task_struct *thread;
> + int ret;
> + struct smp_arg arg;
> +
> + if (num_online_cpus() < 2) {
> + pr_err("not enough to do a multi cpu test\n");
> + return;
> + }
> +
> + arg.virt_addr = NULL;
> + arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
> + thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
> + if (IS_ERR(thread)) {
> + pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
> + return;
> + }
> +
> + kthread_bind(thread, arg.cpu);
> + get_task_struct(thread);
> + wake_up_process(thread);
> +
> + user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
> + if (!user_addr)
> + goto kill_thread;
> +
> + virt_addr = user_to_kernel(user_addr);
> + if (!virt_addr) {
> + /*
> + * let's store something that will fail, so we can unblock the
> + * thread
> + */
> + smp_store_release(&arg.virt_addr, &arg);
> + goto free_user;
> + }
> +
> + smp_store_release(&arg.virt_addr, virt_addr);
> +
> + /* there must be a better way to do this. */
> + while (1) {
> + if (thread->exit_state)
> + break;
> + msleep_interruptible(100);
> + }
I don't like infinite loops. How about giving this a 1 second max runtime?
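Something along these lines would bound the wait (a sketch; it keeps the existing exit_state polling and caps it at roughly one second):

	int i;

	for (i = 0; i < 10; i++) {
		if (thread->exit_state)
			break;
		msleep_interruptible(100);	/* ~1s total across 10 iterations */
	}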
> +
> +free_user:
> + if (user_addr)
> + vm_munmap(user_addr, PAGE_SIZE);
> +
> +kill_thread:
> + ret = kthread_stop(thread);
> + if (ret != XPFO_SMP_KILLED)
> + pr_err("FAIL: thread wasn't killed: %d\n", ret);
> + put_task_struct(thread);
> +}
> --
> 2.11.0
>
Otherwise it looks great, thanks!
-Kees
--
Kees Cook
Pixel Security
On Thu, Sep 07, 2017 at 12:44:14PM -0600, Tycho Andersen wrote:
> On Thu, Sep 07, 2017 at 11:10:15AM -0700, Christoph Hellwig wrote:
> > > - if (PageHighMem(pfn_to_page(pfn))) {
> > > + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >
> > Please don't sprinkle xpfo details over various bits of code.
> >
> > Just add a helper with a descriptive name, e.g.
> >
> > page_is_unmapped()
> >
> > that also includes the highmem case, as that will easily document
> > what this check is doing.
>
> Will do, thanks.
>
> Patch 7 has a similar feel to this one, I can add a wrapper around
> __clean_dcache_area_pou() if that makes sense?
That one is in low-level ARM code so I'm not that worried.
But in general it seems like we should simply have one interface
to check if a page has a kernel mapping or not, and map/unmap it
if not, instead of distinguishing between highmem and xpfo.
On Thu, Sep 07, 2017 at 11:35:59AM -0600, Tycho Andersen wrote:
> vm_mmap is exported, which means kernel modules can use it. In particular,
> for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
> let's support it via vm_mmap.
> } else if (flags & MAP_HUGETLB) {
> + file = map_hugetlb_setup(&len, flags);
> if (IS_ERR(file))
> return PTR_ERR(file);
> }
It seems like you should remove this hunk entirely and make all
MAP_HUGETLB calls go through vm_mmap.
I think this patch needs to be split into the generic mm code, and
the x86 arch code at least.
> +/*
> + * The current flushing context - we pass it instead of 5 arguments:
> + */
> +struct cpa_data {
> + unsigned long *vaddr;
> + pgd_t *pgd;
> + pgprot_t mask_set;
> + pgprot_t mask_clr;
> + unsigned long numpages;
> + int flags;
> + unsigned long pfn;
> + unsigned force_split : 1;
> + int curpage;
> + struct page **pages;
> +};
Fitting these 10 variables into 5 arguments would require an awesome
compression scheme anyway :)
> + if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
Overly long line.
> +#include <linux/xpfo.h>
>
> #include <asm/cacheflush.h>
>
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
> #ifndef ARCH_HAS_KMAP
> static inline void *kmap(struct page *page)
> {
> + void *kaddr;
> +
> might_sleep();
> - return page_address(page);
> + kaddr = page_address(page);
> + xpfo_kmap(kaddr, page);
> + return kaddr;
> }
>
> static inline void kunmap(struct page *page)
> {
> + xpfo_kunmap(page_address(page), page);
> }
>
> static inline void *kmap_atomic(struct page *page)
> {
> + void *kaddr;
> +
> preempt_disable();
> pagefault_disable();
> - return page_address(page);
> + kaddr = page_address(page);
> + xpfo_kmap(kaddr, page);
> + return kaddr;
> }
It seems to me like we should simply direct to pure xpfo
implementations for the !HIGHMEM && XPFO case. - that is
just have the prototypes for kmap, kunmap and co in
linux/highmem.h and implement them in xpfo under those names.
Instead of sprinkling them around.
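In other words, for that configuration linux/highmem.h would only carry prototypes, roughly like the sketch below (illustrative; this is not what the posted patch does):

#if defined(CONFIG_XPFO) && !defined(CONFIG_HIGHMEM)
/* Bodies would live in mm/xpfo.c so the xpfo logic is not open-coded here. */
void *kmap(struct page *page);
void kunmap(struct page *page);
void *kmap_atomic(struct page *page);
#endif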
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
s/inited/initialized/g ?
> + bool "Enable eXclusive Page Frame Ownership (XPFO)"
> + default n
default n is the default, so you can remove this line.
> +/*
> + * Lookup the page table entry for a virtual address and return a pointer to
> + * the entry. Based on x86 tree.
> + */
> +static pte_t *lookup_address(unsigned long addr)
Seems like this should be moved to common arm64 mm code and used by
kernel_page_present.
On Thu, Sep 07, 2017 at 11:36:08AM -0600, Tycho Andersen wrote:
> We need something like this for testing XPFO. Since it's architecture
> specific, putting it in the test code is slightly awkward, so let's make it
> an arch-specific symbol and export it for use in LKDTM.
We really should not add an export for this.
I think you'll want to just open code it in your test module.
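Open-coding it in the test module could look something like the sketch below (it uses get_user_pages_fast() rather than walking page tables by hand; the function name is made up for the example and needs <linux/mm.h> and <asm/io.h>):

static phys_addr_t lkdtm_user_virt_to_phys(unsigned long addr)
{
	struct page *page;
	phys_addr_t phys = 0;

	/* Pin the page backing the user address, translate, then release it. */
	if (get_user_pages_fast(addr, 1, 0, &page) == 1) {
		phys = page_to_phys(page) + offset_in_page(addr);
		put_page(page);
	}

	return phys;
}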
On Fri, Sep 08, 2017 at 12:51:40AM -0700, Christoph Hellwig wrote:
> > +#include <linux/xpfo.h>
> >
> > #include <asm/cacheflush.h>
> >
> > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
> > #ifndef ARCH_HAS_KMAP
> > static inline void *kmap(struct page *page)
> > {
> > + void *kaddr;
> > +
> > might_sleep();
> > - return page_address(page);
> > + kaddr = page_address(page);
> > + xpfo_kmap(kaddr, page);
> > + return kaddr;
> > }
> >
> > static inline void kunmap(struct page *page)
> > {
> > + xpfo_kunmap(page_address(page), page);
> > }
> >
> > static inline void *kmap_atomic(struct page *page)
> > {
> > + void *kaddr;
> > +
> > preempt_disable();
> > pagefault_disable();
> > - return page_address(page);
> > + kaddr = page_address(page);
> > + xpfo_kmap(kaddr, page);
> > + return kaddr;
> > }
>
> It seems to me like we should simply direct to pure xpfo
> implementations for the !HIGHMEM && XPFO case. - that is
> just have the prototypes for kmap, kunmap and co in
> linux/highmem.h and implement them in xpfo under those names.
>
> Instead of sprinkling them around.
Ok, IIUC we'll still need a #ifdef CONFIG_XPFO in this file, but at
least the implementations here won't have a diff. I'll make this
change, and all the others you've suggested.
Thanks!
Tycho
On Fri, Sep 8, 2017 at 12:55 AM, Christoph Hellwig <[email protected]> wrote:
> On Thu, Sep 07, 2017 at 11:36:08AM -0600, Tycho Andersen wrote:
>> We need something like this for testing XPFO. Since it's architecture
>> specific, putting it in the test code is slightly awkward, so let's make it
>> an arch-specific symbol and export it for use in LKDTM.
>
> We really should not add an export for this.
>
> I think you'll want to just open code it in your test module.
Isn't that going to be fragile? Why not an export?
-Kees
--
Kees Cook
Pixel Security
On Fri, Sep 08, 2017 at 12:53:47AM -0700, Christoph Hellwig wrote:
> > +/*
> > + * Lookup the page table entry for a virtual address and return a pointer to
> > + * the entry. Based on x86 tree.
> > + */
> > +static pte_t *lookup_address(unsigned long addr)
>
> Seems like this should be moved to common arm64 mm code and used by
> kernel_page_present.
Sounds good, I'll include something like the patch below in the next
series.
Unfortunately, adding an implementation of lookup_address seems to be
slightly more complicated than necessary, because of the xen piece. We
have to define lookup_address() with the level parameter, but it's not
obvious to me how to name the page levels. So for now I've just left it as
a WARN() if someone supplies it.
It seems like xen still does need this to be defined, because if I
define it without level:
drivers/xen/xenbus/xenbus_client.c: In function ‘xenbus_unmap_ring_vfree_pv’:
drivers/xen/xenbus/xenbus_client.c:760:4: error: too many arguments to function ‘lookup_address’
lookup_address(addr, &level)).maddr;
^~~~~~~~~~~~~~
In file included from ./arch/arm64/include/asm/page.h:37:0,
from ./include/linux/mmzone.h:20,
from ./include/linux/gfp.h:5,
from ./include/linux/mm.h:9,
from drivers/xen/xenbus/xenbus_client.c:33:
./arch/arm64/include/asm/pgtable-types.h:67:15: note: declared here
extern pte_t *lookup_address(unsigned long addr);
^~~~~~~~~~~~~~
I've cc-d the xen folks, maybe they can suggest a way to untangle it?
Alternatively, if someone can suggest a good naming scheme for the
page levels, I can just do that.
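For what it's worth, one possible naming scheme would simply mirror the arm64 page-table levels, e.g. (purely hypothetical):

enum arm64_pg_level {
	ARM64_PG_LEVEL_PTE,
	ARM64_PG_LEVEL_PMD,
	ARM64_PG_LEVEL_PUD,
	ARM64_PG_LEVEL_PGD,
};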
Cheers,
Tycho
>From 0b3be95873e3e8caa39fa50efc0d06d57fc6eb5e Mon Sep 17 00:00:00 2001
From: Tycho Andersen <[email protected]>
Date: Fri, 8 Sep 2017 10:43:26 -0600
Subject: [PATCH] arm64: add lookup_address()
Similarly to x86, let's add lookup_address() and use it in
kernel_page_present(). We'll use it in the next patch for the
implementation of XPFO as well.
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/include/asm/pgtable-types.h | 2 ++
arch/arm64/mm/pageattr.c | 47 ++++++++++++++++++++--------------
include/xen/arm/page.h | 10 --------
3 files changed, 30 insertions(+), 29 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h
index 345a072b5856..fad3db5a673f 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -64,4 +64,6 @@ typedef struct { pteval_t pgprot; } pgprot_t;
#include <asm-generic/5level-fixup.h>
#endif
+extern pte_t *lookup_address(unsigned long addr, unsigned int *level);
+
#endif /* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a682a0a2a0fa..437a12485873 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -138,6 +138,32 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
__pgprot(PTE_VALID));
}
+pte_t *lookup_address(unsigned long addr, unsigned int *level)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ if (unlikely(level)) {
+ WARN(1, "level unused on arm64\n");
+ *level = 0;
+ }
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ return pte_offset_kernel(pmd, addr);
+}
+
#ifdef CONFIG_DEBUG_PAGEALLOC
void __kernel_map_pages(struct page *page, int numpages, int enable)
{
@@ -156,29 +182,12 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
*/
bool kernel_page_present(struct page *page)
{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
unsigned long addr = (unsigned long)page_address(page);
+	pte_t *pte = lookup_address(addr, NULL);
- pgd = pgd_offset_k(addr);
- if (pgd_none(*pgd))
- return false;
-
- pud = pud_offset(pgd, addr);
- if (pud_none(*pud))
- return false;
- if (pud_sect(*pud))
- return true;
-
- pmd = pmd_offset(pud, addr);
- if (pmd_none(*pmd))
+ if (!pte)
return false;
- if (pmd_sect(*pmd))
- return true;
- pte = pte_offset_kernel(pmd, addr);
return pte_valid(*pte);
}
#endif /* CONFIG_HIBERNATION */
diff --git a/include/xen/arm/page.h b/include/xen/arm/page.h
index 415dbc6e43fd..6adc2a955340 100644
--- a/include/xen/arm/page.h
+++ b/include/xen/arm/page.h
@@ -84,16 +84,6 @@ static inline xmaddr_t arbitrary_virt_to_machine(void *vaddr)
BUG();
}
-/* TODO: this shouldn't be here but it is because the frontend drivers
- * are using it (its rolled in headers) even though we won't hit the code path.
- * So for right now just punt with this.
- */
-static inline pte_t *lookup_address(unsigned long address, unsigned int *level)
-{
- BUG();
- return NULL;
-}
-
extern int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
struct gnttab_map_grant_ref *kmap_ops,
struct page **pages, unsigned int count);
--
2.11.0
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> +static inline struct xpfo *lookup_xpfo(struct page *page)
> +{
> + struct page_ext *page_ext = lookup_page_ext(page);
> +
> + if (unlikely(!page_ext)) {
> + WARN(1, "xpfo: failed to get page ext");
> + return NULL;
> + }
> +
> + return (void *)page_ext + page_xpfo_ops.offset;
> +}
> +
Just drop the WARN. On my arm64 UEFI machine this spews warnings
under most normal operation. This should be normal for some
situations but I haven't had the time to dig into why this
is so pronounced on arm64.
Thanks,
Laura
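With the WARN dropped, the helper reduces to something like this (sketch):

static inline struct xpfo *lookup_xpfo(struct page *page)
{
	struct page_ext *page_ext = lookup_page_ext(page);

	/* page_ext can be NULL in some situations (per the report above); just bail. */
	if (unlikely(!page_ext))
		return NULL;

	return (void *)page_ext + page_xpfo_ops.offset;
}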
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> XPFO doesn't support section/contiguous mappings yet, so let's disable it
> if XPFO is turned on.
>
> Thanks to Laura Abbot for the simplification from v5, and Mark Rutland for
> pointing out we need NO_CONT_MAPPINGS too.
>
> CC: [email protected]
> Signed-off-by: Tycho Andersen <[email protected]>
This should just be folded into the initial arm64 patch
since it basically bugs out otherwise.
Thanks,
Laura
> ---
> arch/arm64/mm/mmu.c | 2 +-
> include/linux/xpfo.h | 4 ++++
> mm/xpfo.c | 6 ++++++
> 3 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index f1eb15e0e864..34bb95303cce 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -420,7 +420,7 @@ static void __init map_mem(pgd_t *pgd)
> struct memblock_region *reg;
> int flags = 0;
>
> - if (debug_pagealloc_enabled())
> + if (debug_pagealloc_enabled() || xpfo_enabled())
> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> /*
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index d37a06c9d62c..1693af1a0293 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -43,6 +43,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
> void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
> size_t mapping_len);
>
> +bool xpfo_enabled(void);
> +
> #else /* !CONFIG_XPFO */
>
> static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> @@ -65,6 +67,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
> }
>
>
> +static inline bool xpfo_enabled(void) { return false; }
> +
> #endif /* CONFIG_XPFO */
>
> #endif /* _LINUX_XPFO_H */
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index f79075bf7d65..25fba05d01bd 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -70,6 +70,12 @@ struct page_ext_operations page_xpfo_ops = {
> .init = init_xpfo,
> };
>
> +bool __init xpfo_enabled(void)
> +{
> + return !xpfo_disabled;
> +}
> +EXPORT_SYMBOL(xpfo_enabled);
> +
> static inline struct xpfo *lookup_xpfo(struct page *page)
> {
> struct page_ext *page_ext = lookup_page_ext(page);
>
Hi Juerg,
[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.13]
[cannot apply to mmotm/master next-20170908]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/Add-support-for-eXclusive-Page-Frame-Ownership/20170910-073030
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa
All error/warnings (new ones prefixed by >>):
drivers/misc/lkdtm_xpfo.c: In function 'user_to_kernel':
>> drivers/misc/lkdtm_xpfo.c:54:2: error: implicit declaration of function 'phys_to_virt' [-Werror=implicit-function-declaration]
virt_addr = phys_to_virt(phys_addr);
^
drivers/misc/lkdtm_xpfo.c:54:12: warning: assignment makes pointer from integer without a cast
virt_addr = phys_to_virt(phys_addr);
^
>> drivers/misc/lkdtm_xpfo.c:55:2: error: implicit declaration of function 'virt_to_phys' [-Werror=implicit-function-declaration]
if (phys_addr != virt_to_phys(virt_addr)) {
^
drivers/misc/lkdtm_xpfo.c: At top level:
>> drivers/misc/lkdtm_xpfo.c:128:7: warning: "CONFIG_ARM64" is not defined [-Wundef]
#elif CONFIG_ARM64
^
>> drivers/misc/lkdtm_xpfo.c:131:2: error: #error unsupported arch
#error unsupported arch
^
drivers/misc/lkdtm_xpfo.c: In function 'lkdtm_XPFO_SMP':
>> drivers/misc/lkdtm_xpfo.c:191:13: error: 'XPFO_SMP_KILLED' undeclared (first use in this function)
if (ret != XPFO_SMP_KILLED)
^
drivers/misc/lkdtm_xpfo.c:191:13: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
vim +/phys_to_virt +54 drivers/misc/lkdtm_xpfo.c
42
43 static unsigned long *user_to_kernel(unsigned long user_addr)
44 {
45 phys_addr_t phys_addr;
46 void *virt_addr;
47
48 phys_addr = user_virt_to_phys(user_addr);
49 if (!phys_addr) {
50 pr_warn("Failed to get physical address of user memory\n");
51 return NULL;
52 }
53
> 54 virt_addr = phys_to_virt(phys_addr);
> 55 if (phys_addr != virt_to_phys(virt_addr)) {
56 pr_warn("Physical address of user memory seems incorrect\n");
57 return NULL;
58 }
59
60 return virt_addr;
61 }
62
63 static void read_map(unsigned long *virt_addr)
64 {
65 pr_info("Attempting bad read from kernel address %p\n", virt_addr);
66 if (*(unsigned long *)virt_addr == XPFO_DATA)
67 pr_err("FAIL: Bad read succeeded?!\n");
68 else
69 pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
70 }
71
72 static void read_user_with_flags(unsigned long flags)
73 {
74 unsigned long user_addr, *kernel;
75
76 user_addr = do_map(flags);
77 if (!user_addr) {
78 pr_err("FAIL: map failed\n");
79 return;
80 }
81
82 kernel = user_to_kernel(user_addr);
83 if (!kernel) {
84 pr_err("FAIL: user to kernel conversion failed\n");
85 goto free_user;
86 }
87
88 read_map(kernel);
89
90 free_user:
91 vm_munmap(user_addr, PAGE_SIZE);
92 }
93
94 /* Read from userspace via the kernel's linear map. */
95 void lkdtm_XPFO_READ_USER(void)
96 {
97 read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
98 }
99
100 void lkdtm_XPFO_READ_USER_HUGE(void)
101 {
102 read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
103 }
104
105 struct smp_arg {
106 unsigned long *virt_addr;
107 unsigned int cpu;
108 };
109
110 static int smp_reader(void *parg)
111 {
112 struct smp_arg *arg = parg;
113 unsigned long *virt_addr;
114
115 if (arg->cpu != smp_processor_id()) {
116 pr_err("FAIL: scheduled on wrong CPU?\n");
117 return 0;
118 }
119
120 virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
121 read_map(virt_addr);
122
123 return 0;
124 }
125
126 #ifdef CONFIG_X86
127 #define XPFO_SMP_KILLED SIGKILL
> 128 #elif CONFIG_ARM64
129 #define XPFO_SMP_KILLED SIGSEGV
130 #else
> 131 #error unsupported arch
132 #endif
133
134 /* The idea here is to read from the kernel's map on a different thread than
135 * did the mapping (and thus the TLB flushing), to make sure that the page
136 * faults on other cores too.
137 */
138 void lkdtm_XPFO_SMP(void)
139 {
140 unsigned long user_addr, *virt_addr;
141 struct task_struct *thread;
142 int ret;
143 struct smp_arg arg;
144
145 if (num_online_cpus() < 2) {
146 pr_err("not enough to do a multi cpu test\n");
147 return;
148 }
149
150 arg.virt_addr = NULL;
151 arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
152 thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
153 if (IS_ERR(thread)) {
154 pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
155 return;
156 }
157
158 kthread_bind(thread, arg.cpu);
159 get_task_struct(thread);
160 wake_up_process(thread);
161
162 user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
163 if (!user_addr)
164 goto kill_thread;
165
166 virt_addr = user_to_kernel(user_addr);
167 if (!virt_addr) {
168 /*
169 * let's store something that will fail, so we can unblock the
170 * thread
171 */
172 smp_store_release(&arg.virt_addr, &arg);
173 goto free_user;
174 }
175
176 smp_store_release(&arg.virt_addr, virt_addr);
177
178 /* there must be a better way to do this. */
179 while (1) {
180 if (thread->exit_state)
181 break;
182 msleep_interruptible(100);
183 }
184
185 free_user:
186 if (user_addr)
187 vm_munmap(user_addr, PAGE_SIZE);
188
189 kill_thread:
190 ret = kthread_stop(thread);
> 191 if (ret != XPFO_SMP_KILLED)
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Tycho,
On 2017/9/8 1:36, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping,
> specifically:
> - two flags to distinguish user vs. kernel pages and to tag unmapped
> pages.
> - a reference counter to balance kmap/kunmap operations.
> - a lock to serialize access to the XPFO fields.
>
> This patch is based on the work of Vasileios P. Kemerlis et al. who
> published their work in this paper:
> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> [...]
> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> +{
> + int i, flush_tlb = 0;
> + struct xpfo *xpfo;
> +
> + if (!static_branch_unlikely(&xpfo_inited))
> + return;
> +
> + for (i = 0; i < (1 << order); i++) {
> + xpfo = lookup_xpfo(page + i);
> + if (!xpfo)
> + continue;
> +
> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> + "xpfo: unmapped page being allocated\n");
> +
> + /* Initialize the map lock and map counter */
> + if (unlikely(!xpfo->inited)) {
> + spin_lock_init(&xpfo->maplock);
> + atomic_set(&xpfo->mapcount, 0);
> + xpfo->inited = true;
> + }
> + WARN(atomic_read(&xpfo->mapcount),
> + "xpfo: already mapped page being allocated\n");
> +
> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> + /*
> + * Tag the page as a user page and flush the TLB if it
> + * was previously allocated to the kernel.
> + */
> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> + flush_tlb = 1;
I'm not sure whether I'm missing something, but when the page was previously
allocated to the kernel, shouldn't we unmap it from the physmap (the kernel's
page table) here, since we are now allocating the page to userspace?
Yisheng Xie
Thanks
On Fri, Sep 08, 2017 at 08:44:22AM -0700, Kees Cook wrote:
> On Fri, Sep 8, 2017 at 12:55 AM, Christoph Hellwig <[email protected]> wrote:
> > On Thu, Sep 07, 2017 at 11:36:08AM -0600, Tycho Andersen wrote:
> >> We need something like this for testing XPFO. Since it's architecture
> >> specific, putting it in the test code is slightly awkward, so let's make it
> >> an arch-specific symbol and export it for use in LKDTM.
> >
> > We really should not add an export for this.
> >
> > I think you'll want to just open code it in your test module.
>
> Isn't that going to be fragile? Why not an export?
It is a little fragile, but it is functionality not needed at all by
the kernel, so we should not add it to the kernel image and/or export
it.
Hi Tycho,
On 2017/9/8 1:35, Tycho Andersen wrote:
> Hi all,
>
> Here is v6 of the XPFO set; see v5 discussion here:
> https://lkml.org/lkml/2017/8/9/803
>
> Changelogs are in the individual patch notes, but the highlights are:
> * add primitives for ensuring memory areas are mapped (although these are quite
> ugly, using stack allocation; I'm open to better suggestions)
> * instead of not flushing caches, re-map pages using the above
> * TLB flushing is much more correct (i.e. we're always flushing everything
> everywhere). I suspect we may be able to back this off in some cases, but I'm
> still trying to collect performance numbers to prove this is worth doing.
>
> I have no TODOs left for this set myself, other than fixing whatever review
> feedback people have. Thoughts and testing welcome!
According to the paper by Vasileios P. Kemerlis et al., the mainline kernel
does not set the protection of the physmap (direct map area) to RW(X), so do
we really need XPFO to protect against ret2dir attacks?
Thanks
Yisheng xie
>
> Cheers,
>
> Tycho
>
> Juerg Haefliger (6):
> mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
> swiotlb: Map the buffer if it was unmapped by XPFO
> arm64/mm: Add support for XPFO
> arm64/mm, xpfo: temporarily map dcache regions
> arm64/mm: Add support for XPFO to swiotlb
> lkdtm: Add test for XPFO
>
> Tycho Andersen (5):
> mm: add MAP_HUGETLB support to vm_mmap
> x86: always set IF before oopsing from page fault
> xpfo: add primitives for mapping underlying memory
> arm64/mm: disable section/contiguous mappings if XPFO is enabled
> mm: add a user_virt_to_phys symbol
>
> Documentation/admin-guide/kernel-parameters.txt | 2 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/cacheflush.h | 11 +
> arch/arm64/mm/Makefile | 2 +
> arch/arm64/mm/dma-mapping.c | 32 +--
> arch/arm64/mm/flush.c | 7 +
> arch/arm64/mm/mmu.c | 2 +-
> arch/arm64/mm/xpfo.c | 127 +++++++++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 25 +++
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/fault.c | 6 +
> arch/x86/mm/pageattr.c | 22 +-
> arch/x86/mm/xpfo.c | 171 +++++++++++++++
> drivers/misc/Makefile | 1 +
> drivers/misc/lkdtm.h | 5 +
> drivers/misc/lkdtm_core.c | 3 +
> drivers/misc/lkdtm_xpfo.c | 194 +++++++++++++++++
> include/linux/highmem.h | 15 +-
> include/linux/mm.h | 2 +
> include/linux/xpfo.h | 79 +++++++
> lib/swiotlb.c | 3 +-
> mm/Makefile | 1 +
> mm/mmap.c | 19 +-
> mm/page_alloc.c | 2 +
> mm/page_ext.c | 4 +
> mm/util.c | 32 +++
> mm/xpfo.c | 273 ++++++++++++++++++++++++
> security/Kconfig | 19 ++
> 29 files changed, 1005 insertions(+), 57 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm_xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
>
Hi Yisheng,
On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> > +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> > +{
> > + int i, flush_tlb = 0;
> > + struct xpfo *xpfo;
> > +
> > + if (!static_branch_unlikely(&xpfo_inited))
> > + return;
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + xpfo = lookup_xpfo(page + i);
> > + if (!xpfo)
> > + continue;
> > +
> > + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> > + "xpfo: unmapped page being allocated\n");
> > +
> > + /* Initialize the map lock and map counter */
> > + if (unlikely(!xpfo->inited)) {
> > + spin_lock_init(&xpfo->maplock);
> > + atomic_set(&xpfo->mapcount, 0);
> > + xpfo->inited = true;
> > + }
> > + WARN(atomic_read(&xpfo->mapcount),
> > + "xpfo: already mapped page being allocated\n");
> > +
> > + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> > + /*
> > + * Tag the page as a user page and flush the TLB if it
> > + * was previously allocated to the kernel.
> > + */
> > + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> > + flush_tlb = 1;
>
> I'm not sure whether I am miss anything, however, when the page was previously allocated
> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate
> the page to user now
Yes, I think you're right. Oddly, the XPFO_READ_USER test works
correctly for me, but I think (?) should not because of this bug...
Tycho
Hi Yisheng,
On Mon, Sep 11, 2017 at 06:34:45PM +0800, Yisheng Xie wrote:
> Hi Tycho ,
>
> On 2017/9/8 1:35, Tycho Andersen wrote:
> > Hi all,
> >
> > Here is v6 of the XPFO set; see v5 discussion here:
> > https://lkml.org/lkml/2017/8/9/803
> >
> > Changelogs are in the individual patch notes, but the highlights are:
> > * add primitives for ensuring memory areas are mapped (although these are quite
> > ugly, using stack allocation; I'm open to better suggestions)
> > * instead of not flushing caches, re-map pages using the above
> > * TLB flushing is much more correct (i.e. we're always flushing everything
> > everywhere). I suspect we may be able to back this off in some cases, but I'm
> > still trying to collect performance numbers to prove this is worth doing.
> >
> > I have no TODOs left for this set myself, other than fixing whatever review
> > feedback people have. Thoughts and testing welcome!
>
> According to the paper of Vasileios P. Kemerlis et al, the mainline kernel
> will not set the Pro. of physmap(direct map area) to RW(X), so do we really
> need XPFO to protect from ret2dir attack?
I guess you're talking about section 4.3? They mention that that x86
only gets rw, but that aarch64 is rwx still.
But in either case this still provides access protection, similar to
SMAP. Also, if I understand things correctly the protections are
unmanaged, so a page that had the +x bit set at some point, it could
be used for ret2dir.
Cheers,
Tycho
On Sat, Sep 09, 2017 at 08:35:17AM -0700, Laura Abbott wrote:
> On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> > +static inline struct xpfo *lookup_xpfo(struct page *page)
> > +{
> > + struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > + if (unlikely(!page_ext)) {
> > + WARN(1, "xpfo: failed to get page ext");
> > + return NULL;
> > + }
> > +
> > + return (void *)page_ext + page_xpfo_ops.offset;
> > +}
> > +
>
> Just drop the WARN. On my arm64 UEFI machine this spews warnings
> under most normal operation. This should be normal for some
> situations but I haven't had the time to dig into why this
> is so pronounced on arm64.
Will do, thanks! If you figure out under what conditions it's normal,
I'd be curious :)
Tycho
On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> Hi Yisheng,
>
> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
>>> +{
>>> + int i, flush_tlb = 0;
>>> + struct xpfo *xpfo;
>>> +
>>> + if (!static_branch_unlikely(&xpfo_inited))
>>> + return;
>>> +
>>> + for (i = 0; i < (1 << order); i++) {
>>> + xpfo = lookup_xpfo(page + i);
>>> + if (!xpfo)
>>> + continue;
>>> +
>>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
>>> + "xpfo: unmapped page being allocated\n");
>>> +
>>> + /* Initialize the map lock and map counter */
>>> + if (unlikely(!xpfo->inited)) {
>>> + spin_lock_init(&xpfo->maplock);
>>> + atomic_set(&xpfo->mapcount, 0);
>>> + xpfo->inited = true;
>>> + }
>>> + WARN(atomic_read(&xpfo->mapcount),
>>> + "xpfo: already mapped page being allocated\n");
>>> +
>>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
>>> + /*
>>> + * Tag the page as a user page and flush the TLB if it
>>> + * was previously allocated to the kernel.
>>> + */
>>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
>>> + flush_tlb = 1;
>>
>> I'm not sure whether I am miss anything, however, when the page was previously allocated
>> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate
>> the page to user now
>>
> Yes, I think you're right. Oddly, the XPFO_READ_USER test works
> correctly for me, but I think (?) should not because of this bug...
IIRC, this is an optimization carried forward from the initial
implementation. The assumption is that the kernel will map the user
buffer so it's not unmapped on allocation but only on the first (and
subsequent) call of kunmap. I.e.:
- alloc -> noop
- kmap -> noop
- kunmap -> unmapped from the kernel
- kmap -> mapped into the kernel
- kunmap -> unmapped from the kernel
and so on until:
- free -> mapped back into the kernel
I'm not sure if that makes sense though, since it leaves a window.
...Juerg
> Tycho
>
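For context, the kunmap side of the scheme Juerg describes would look roughly like the sketch below (lookup_xpfo(), the XPFO_PAGE_* flags and xpfo_flush_kernel_tlb() are from this series; set_kpte() stands in for the per-arch helper that updates a kernel PTE; locking and the static-key check are omitted, so this is not the actual implementation):

static void xpfo_kunmap_sketch(void *kaddr, struct page *page)
{
	struct xpfo *xpfo = lookup_xpfo(page);

	if (!xpfo || !test_bit(XPFO_PAGE_USER, &xpfo->flags))
		return;

	/* Only drop the physmap mapping once the last kernel user is gone. */
	if (atomic_dec_return(&xpfo->mapcount) == 0) {
		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
		set_kpte(kaddr, page, __pgprot(0));
		xpfo_flush_kernel_tlb(page, 0);
	}
}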
On Mon, Sep 11, 2017 at 06:03:55PM +0200, Juerg Haefliger wrote:
>
>
> On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> > Hi Yisheng,
> >
> > On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> >>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> >>> +{
> >>> + int i, flush_tlb = 0;
> >>> + struct xpfo *xpfo;
> >>> +
> >>> + if (!static_branch_unlikely(&xpfo_inited))
> >>> + return;
> >>> +
> >>> + for (i = 0; i < (1 << order); i++) {
> >>> + xpfo = lookup_xpfo(page + i);
> >>> + if (!xpfo)
> >>> + continue;
> >>> +
> >>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> >>> + "xpfo: unmapped page being allocated\n");
> >>> +
> >>> + /* Initialize the map lock and map counter */
> >>> + if (unlikely(!xpfo->inited)) {
> >>> + spin_lock_init(&xpfo->maplock);
> >>> + atomic_set(&xpfo->mapcount, 0);
> >>> + xpfo->inited = true;
> >>> + }
> >>> + WARN(atomic_read(&xpfo->mapcount),
> >>> + "xpfo: already mapped page being allocated\n");
> >>> +
> >>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> >>> + /*
> >>> + * Tag the page as a user page and flush the TLB if it
> >>> + * was previously allocated to the kernel.
> >>> + */
> >>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> >>> + flush_tlb = 1;
> >>
> >> I'm not sure whether I am miss anything, however, when the page was previously allocated
> >> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate
> >> the page to user now
> >>
> > Yes, I think you're right. Oddly, the XPFO_READ_USER test works
> > correctly for me, but I think (?) should not because of this bug...
>
> IIRC, this is an optimization carried forward from the initial
> implementation. The assumption is that the kernel will map the user
> buffer so it's not unmapped on allocation but only on the first (and
Does the kernel always map it, though? e.g. in the case of
XPFO_READ_USER, I'm not sure where the kernel would do a kmap() of the
test's user buffer.
Tycho
> subsequent) call of kunmap. I.e.:
> - alloc -> noop
> - kmap -> noop
> - kunmap -> unmapped from the kernel
> - kmap -> mapped into the kernel
> - kunmap -> unmapped from the kernel
> and so on until:
> - free -> mapped back into the kernel
>
> I'm not sure if that make sense though since it leaves a window.
>
> ...Juerg
>
>
>
> > Tycho
> >
Hi all,
On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote:
>
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> + int level;
> + unsigned long size, kaddr;
> +
> + kaddr = (unsigned long)page_address(page);
> +
> + if (unlikely(!lookup_address(kaddr, &level))) {
> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> + return;
> + }
> +
> + switch (level) {
> + case PG_LEVEL_4K:
> + size = PAGE_SIZE;
> + break;
> + case PG_LEVEL_2M:
> + size = PMD_SIZE;
> + break;
> + case PG_LEVEL_1G:
> + size = PUD_SIZE;
> + break;
> + default:
> + WARN(1, "xpfo: unsupported page level %x\n", level);
> + return;
> + }
> +
> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
Marco was testing and got the stack trace below. The issue is that on x86,
flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since
this is called from xpfo_kmap/unmap in this interrupt handler, the WARN()
triggers.
I'm not sure what to do about this -- based on the discussion in v6 we need to
flush the TLBs for all CPUs -- but we can't do that with interrupts disabled,
which basically means with this we wouldn't be able to map/unmap pages in
interrupts.
Any thoughts?
Tycho
[ 2.712912] ------------[ cut here ]------------
[ 2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414
smp_call_function_many+0x9a/0x270
[ 2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl
drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt
virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp
virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw
floppy dm_mirror dm_region_hash dm_log dm_mod
[ 2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8
[ 2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.9.3-1.fc25 04/01/2014
[ 2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000
[ 2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270
[ 2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046
[ 2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001
[ 2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320
[ 2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063
[ 2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8
[ 2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320
[ 2.712950] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000)
knlGS:0000000000000000
[ 2.712951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0
[ 2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2.712955] Call Trace:
[ 2.712959] <IRQ>
[ 2.712964] ? x86_configure_nx+0x50/0x50
[ 2.712966] on_each_cpu+0x2d/0x60
[ 2.712967] flush_tlb_kernel_range+0x79/0x80
[ 2.712969] xpfo_flush_kernel_tlb+0xaa/0xe0
[ 2.712975] xpfo_kunmap+0xa8/0xc0
[ 2.712981] swiotlb_bounce+0xd1/0x1c0
[ 2.712982] swiotlb_tbl_unmap_single+0x10f/0x120
[ 2.712984] unmap_single+0x20/0x30
[ 2.712985] swiotlb_unmap_sg_attrs+0x46/0x70
[ 2.712991] __ata_qc_complete+0xfa/0x150 [libata]
[ 2.712994] ata_qc_complete+0xd2/0x2e0 [libata]
[ 2.712998] ata_hsm_qc_complete+0x6f/0x90 [libata]
[ 2.713004] ata_sff_hsm_move+0xae/0x6b0 [libata]
[ 2.713009] __ata_sff_port_intr+0x8e/0x100 [libata]
[ 2.713013] ata_bmdma_port_intr+0x2f/0xd0 [libata]
[ 2.713019] ata_bmdma_interrupt+0x161/0x1b0 [libata]
[ 2.713022] __handle_irq_event_percpu+0x3c/0x190
[ 2.713024] handle_irq_event_percpu+0x32/0x80
[ 2.713026] handle_irq_event+0x3b/0x60
[ 2.713027] handle_edge_irq+0x8f/0x190
[ 2.713029] handle_irq+0xab/0x120
[ 2.713032] ? _local_bh_enable+0x21/0x30
[ 2.713039] do_IRQ+0x48/0xd0
[ 2.713040] common_interrupt+0x93/0x93
[ 2.713042] RIP: 0010:native_safe_halt+0x6/0x10
[ 2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffc1
[ 2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000
[ 2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000
[ 2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000
[ 2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000
[ 2.713048] </IRQ>
[ 2.713050] default_idle+0x1e/0x100
[ 2.713052] arch_cpu_idle+0xf/0x20
[ 2.713053] default_idle_call+0x2c/0x40
[ 2.713055] do_idle+0x158/0x1e0
[ 2.713056] cpu_startup_entry+0x73/0x80
[ 2.713058] rest_init+0xb8/0xc0
[ 2.713070] start_kernel+0x4a2/0x4c3
[ 2.713072] ? set_init_arg+0x5a/0x5a
[ 2.713074] ? early_idt_handler_array+0x120/0x120
[ 2.713075] x86_64_start_reservations+0x2a/0x2c
[ 2.713077] x86_64_start_kernel+0x14c/0x16f
[ 2.713079] secondary_startup_64+0x9f/0x9f
[ 2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c
41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0
00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20
b5 62
[ 2.713105] ---[ end trace 4d101d4c176c16b0 ]---
I think the point here is that, since we are running in interrupt context, we
are using k{map,unmap}_atomic instead of k{map,unmap}, but after the
flush_tlb_kernel_range() change we don't have an XPFO atomic version of either.
My suggestion here is, at least for x86, to split this into xpfo_k{un}map
and xpfo_k{un}map_atomic, where the atomic variants would only flush the local TLB.
This would create a window where the same page would be mapped both
for user and kernel on other CPUs due to stale remote TLB entries.
Is there any other suggestion for this scenario?
Thanks,
- Marco Benatto
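A minimal sketch of that idea for x86 (assumption: __flush_tlb_one() is used because it only flushes the local CPU, which is exactly the stale-remote-TLB trade-off mentioned above; the helper name is hypothetical):

static void xpfo_flush_kernel_tlb_local(struct page *page, int order)
{
	unsigned long kaddr = (unsigned long)page_address(page);
	int i;

	/* Local CPU only: no IPIs, so this is safe with interrupts disabled. */
	for (i = 0; i < (1 << order); i++)
		__flush_tlb_one(kaddr + i * PAGE_SIZE);
}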
On Mon, Sep 11, 2017 at 3:32 PM, Tycho Andersen <[email protected]> wrote:
> Hi all,
>
> On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote:
>>
>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>> +{
>> + int level;
>> + unsigned long size, kaddr;
>> +
>> + kaddr = (unsigned long)page_address(page);
>> +
>> + if (unlikely(!lookup_address(kaddr, &level))) {
>> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
>> + return;
>> + }
>> +
>> + switch (level) {
>> + case PG_LEVEL_4K:
>> + size = PAGE_SIZE;
>> + break;
>> + case PG_LEVEL_2M:
>> + size = PMD_SIZE;
>> + break;
>> + case PG_LEVEL_1G:
>> + size = PUD_SIZE;
>> + break;
>> + default:
>> + WARN(1, "xpfo: unsupported page level %x\n", level);
>> + return;
>> + }
>> +
>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>
> Marco was testing and got the stack trace below. The issue is that on x86,
> flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since
> this is called from xpfo_kmap/unmap in this interrupt handler, the WARN()
> triggers.
>
> I'm not sure what to do about this -- based on the discussion in v6 we need to
> flush the TLBs for all CPUs -- but we can't do that with interrupts disabled,
> which basically means with this we wouldn't be able to map/unmap pages in
> interrupts.
>
> Any thoughts?
>
> Tycho
>
> [ 2.712912] ------------[ cut here ]------------
> [ 2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414
> smp_call_function_many+0x9a/0x270
> [ 2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl
> drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt
> virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp
> virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw
> floppy dm_mirror dm_region_hash dm_log dm_mod
> [ 2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8
> [ 2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.3-1.fc25 04/01/2014
> [ 2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000
> [ 2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270
> [ 2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046
> [ 2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001
> [ 2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320
> [ 2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063
> [ 2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8
> [ 2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320
> [ 2.712950] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000)
> knlGS:0000000000000000
> [ 2.712951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0
> [ 2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2.712955] Call Trace:
> [ 2.712959] <IRQ>
> [ 2.712964] ? x86_configure_nx+0x50/0x50
> [ 2.712966] on_each_cpu+0x2d/0x60
> [ 2.712967] flush_tlb_kernel_range+0x79/0x80
> [ 2.712969] xpfo_flush_kernel_tlb+0xaa/0xe0
> [ 2.712975] xpfo_kunmap+0xa8/0xc0
> [ 2.712981] swiotlb_bounce+0xd1/0x1c0
> [ 2.712982] swiotlb_tbl_unmap_single+0x10f/0x120
> [ 2.712984] unmap_single+0x20/0x30
> [ 2.712985] swiotlb_unmap_sg_attrs+0x46/0x70
> [ 2.712991] __ata_qc_complete+0xfa/0x150 [libata]
> [ 2.712994] ata_qc_complete+0xd2/0x2e0 [libata]
> [ 2.712998] ata_hsm_qc_complete+0x6f/0x90 [libata]
> [ 2.713004] ata_sff_hsm_move+0xae/0x6b0 [libata]
> [ 2.713009] __ata_sff_port_intr+0x8e/0x100 [libata]
> [ 2.713013] ata_bmdma_port_intr+0x2f/0xd0 [libata]
> [ 2.713019] ata_bmdma_interrupt+0x161/0x1b0 [libata]
> [ 2.713022] __handle_irq_event_percpu+0x3c/0x190
> [ 2.713024] handle_irq_event_percpu+0x32/0x80
> [ 2.713026] handle_irq_event+0x3b/0x60
> [ 2.713027] handle_edge_irq+0x8f/0x190
> [ 2.713029] handle_irq+0xab/0x120
> [ 2.713032] ? _local_bh_enable+0x21/0x30
> [ 2.713039] do_IRQ+0x48/0xd0
> [ 2.713040] common_interrupt+0x93/0x93
> [ 2.713042] RIP: 0010:native_safe_halt+0x6/0x10
> [ 2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffc1
> [ 2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000
> [ 2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000
> [ 2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000
> [ 2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000
> [ 2.713048] </IRQ>
> [ 2.713050] default_idle+0x1e/0x100
> [ 2.713052] arch_cpu_idle+0xf/0x20
> [ 2.713053] default_idle_call+0x2c/0x40
> [ 2.713055] do_idle+0x158/0x1e0
> [ 2.713056] cpu_startup_entry+0x73/0x80
> [ 2.713058] rest_init+0xb8/0xc0
> [ 2.713070] start_kernel+0x4a2/0x4c3
> [ 2.713072] ? set_init_arg+0x5a/0x5a
> [ 2.713074] ? early_idt_handler_array+0x120/0x120
> [ 2.713075] x86_64_start_reservations+0x2a/0x2c
> [ 2.713077] x86_64_start_kernel+0x14c/0x16f
> [ 2.713079] secondary_startup_64+0x9f/0x9f
> [ 2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c
> 41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0
> 00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20
> b5 62
> [ 2.713105] ---[ end trace 4d101d4c176c16b0 ]---
--
Marco Antonio Benatto
Linux user ID: #506236
Hi Tycho,
On 2017/9/11 23:02, Tycho Andersen wrote:
> Hi Yisheng,
>
> On Mon, Sep 11, 2017 at 06:34:45PM +0800, Yisheng Xie wrote:
>> Hi Tycho ,
>>
>> On 2017/9/8 1:35, Tycho Andersen wrote:
>>> Hi all,
>>>
>>> Here is v6 of the XPFO set; see v5 discussion here:
>>> https://lkml.org/lkml/2017/8/9/803
>>>
>>> Changelogs are in the individual patch notes, but the highlights are:
>>> * add primitives for ensuring memory areas are mapped (although these are quite
>>> ugly, using stack allocation; I'm open to better suggestions)
>>> * instead of not flushing caches, re-map pages using the above
>>> * TLB flushing is much more correct (i.e. we're always flushing everything
>>> everywhere). I suspect we may be able to back this off in some cases, but I'm
>>> still trying to collect performance numbers to prove this is worth doing.
>>>
>>> I have no TODOs left for this set myself, other than fixing whatever review
>>> feedback people have. Thoughts and testing welcome!
>>
>> According to the paper of Vasileios P. Kemerlis et al, the mainline kernel
>> will not set the Pro. of physmap(direct map area) to RW(X), so do we really
>> need XPFO to protect from ret2dir attack?
>
> I guess you're talking about section 4.3?
Yes
> They mention that that x86
> only gets rw, but that aarch64 is rwx still.
IIRC, in the v4.13 kernel the aarch64 direct map is not rwx; I will check it.
>
> But in either case this still provides access protection, similar to
> SMAP. Also, if I understand things correctly the protections are
> unmanaged, so a page that had the +x bit set at some point, it could
> be used for ret2dir.
So you mean that the protection of the direct map area may be changed to +x, and a ret2dir attack can then use it?
Thanks
Yisheng Xie
On 09/12/2017 09:07 AM, Yisheng Xie wrote:
> Hi Tycho,
>
> On 2017/9/11 23:02, Tycho Andersen wrote:
>> Hi Yisheng,
>>
>> On Mon, Sep 11, 2017 at 06:34:45PM +0800, Yisheng Xie wrote:
>>> Hi Tycho ,
>>>
>>> On 2017/9/8 1:35, Tycho Andersen wrote:
>>>> Hi all,
>>>>
>>>> Here is v6 of the XPFO set; see v5 discussion here:
>>>> https://lkml.org/lkml/2017/8/9/803
>>>>
>>>> Changelogs are in the individual patch notes, but the highlights are:
>>>> * add primitives for ensuring memory areas are mapped (although these are quite
>>>> ugly, using stack allocation; I'm open to better suggestions)
>>>> * instead of not flushing caches, re-map pages using the above
>>>> * TLB flushing is much more correct (i.e. we're always flushing everything
>>>> everywhere). I suspect we may be able to back this off in some cases, but I'm
>>>> still trying to collect performance numbers to prove this is worth doing.
>>>>
>>>> I have no TODOs left for this set myself, other than fixing whatever review
>>>> feedback people have. Thoughts and testing welcome!
>>>
>>> According to the paper of Vasileios P. Kemerlis et al, the mainline kernel
>>> will not set the Pro. of physmap(direct map area) to RW(X), so do we really
>>> need XPFO to protect from ret2dir attack?
>>
>> I guess you're talking about section 4.3?
> Yes
>
>> They mention that that x86
>> only gets rw, but that aarch64 is rwx still.
> IIRC, the in kernel of v4.13 the aarch64 is not rwx, I will check it.
>
>>
>> But in either case this still provides access protection, similar to
>> SMAP. Also, if I understand things correctly the protections are
>> unmanaged, so a page that had the +x bit set at some point, it could
>> be used for ret2dir.
> So you means that the Pro. of direct map area maybe changed to +x, then ret2dir attack can use it?
XPFO protects against malicious reads from userspace (potentially
accessing sensitive data). I've also been told by a security expert that
ROP attacks are still possible even if user space memory is
non-executable. XPFO is supposed to prevent that but I haven't been able
to confirm this. It's way out of my comfort zone.
...Juerg
> Thanks
> Yisheng Xie
>
>
On 2017/9/12 0:03, Juerg Haefliger wrote:
>
>
> On 09/11/2017 04:50 PM, Tycho Andersen wrote:
>> Hi Yisheng,
>>
>> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
>>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
>>>> +{
>>>> + int i, flush_tlb = 0;
>>>> + struct xpfo *xpfo;
>>>> +
>>>> + if (!static_branch_unlikely(&xpfo_inited))
>>>> + return;
>>>> +
>>>> + for (i = 0; i < (1 << order); i++) {
>>>> + xpfo = lookup_xpfo(page + i);
>>>> + if (!xpfo)
>>>> + continue;
>>>> +
>>>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
>>>> + "xpfo: unmapped page being allocated\n");
>>>> +
>>>> + /* Initialize the map lock and map counter */
>>>> + if (unlikely(!xpfo->inited)) {
>>>> + spin_lock_init(&xpfo->maplock);
>>>> + atomic_set(&xpfo->mapcount, 0);
>>>> + xpfo->inited = true;
>>>> + }
>>>> + WARN(atomic_read(&xpfo->mapcount),
>>>> + "xpfo: already mapped page being allocated\n");
>>>> +
>>>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
>>>> + /*
>>>> + * Tag the page as a user page and flush the TLB if it
>>>> + * was previously allocated to the kernel.
>>>> + */
>>>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
>>>> + flush_tlb = 1;
>>>
>>> I'm not sure whether I am miss anything, however, when the page was previously allocated
>>> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate
>>> the page to user now
>>>
>> Yes, I think you're right. Oddly, the XPFO_READ_USER test works
Hi Tycho,
Could you share this test? I'd like to know how it works.
Thanks
>> correctly for me, but I think (?) should not because of this bug...
>
> IIRC, this is an optimization carried forward from the initial
> implementation.
Hi Juerg,
Hmm, if the link below is the first version, then it seems this has existed since the first version:
https://patchwork.kernel.org/patch/8437451/
> The assumption is that the kernel will map the user
> buffer so it's not unmapped on allocation but only on the first (and
> subsequent) call of kunmap.
IMO, before a page is allocated it is in the buddy system, which means it is free
and has no mapping other than the direct map. So if the page is allocated to
userspace, XPFO should unmap it from the direct map; otherwise ret2dir may work in
the window before the page is freed. Or maybe I'm still missing something.
Thanks
Yisheng Xie
> I.e.:
> - alloc -> noop
> - kmap -> noop
> - kunmap -> unmapped from the kernel
> - kmap -> mapped into the kernel
> - kunmap -> unmapped from the kernel
> and so on until:
> - free -> mapped back into the kernel
>
> I'm not sure if that make sense though since it leaves a window.
>
> ...Juerg
>
>
>
>> Tycho
>>
>
> .
>
On 2017/9/12 15:40, Juerg Haefliger wrote:
>
>
> On 09/12/2017 09:07 AM, Yisheng Xie wrote:
>> Hi Tycho,
>>
>> On 2017/9/11 23:02, Tycho Andersen wrote:
>>> Hi Yisheng,
>>>
>>> On Mon, Sep 11, 2017 at 06:34:45PM +0800, Yisheng Xie wrote:
>>>> Hi Tycho ,
>>>>
>>>> On 2017/9/8 1:35, Tycho Andersen wrote:
>>>>> Hi all,
>>>>>
>>>>> Here is v6 of the XPFO set; see v5 discussion here:
>>>>> https://lkml.org/lkml/2017/8/9/803
>>>>>
>>>>> Changelogs are in the individual patch notes, but the highlights are:
>>>>> * add primitives for ensuring memory areas are mapped (although these are quite
>>>>> ugly, using stack allocation; I'm open to better suggestions)
>>>>> * instead of not flushing caches, re-map pages using the above
>>>>> * TLB flushing is much more correct (i.e. we're always flushing everything
>>>>> everywhere). I suspect we may be able to back this off in some cases, but I'm
>>>>> still trying to collect performance numbers to prove this is worth doing.
>>>>>
>>>>> I have no TODOs left for this set myself, other than fixing whatever review
>>>>> feedback people have. Thoughts and testing welcome!
>>>>
>>>> According to the paper of Vasileios P. Kemerlis et al, the mainline kernel
>>>> will not set the Pro. of physmap(direct map area) to RW(X), so do we really
>>>> need XPFO to protect from ret2dir attack?
>>>
>>> I guess you're talking about section 4.3?
>> Yes
>>
>>> They mention that that x86
>>> only gets rw, but that aarch64 is rwx still.
>> IIRC, the in kernel of v4.13 the aarch64 is not rwx, I will check it.
>>
>>>
>>> But in either case this still provides access protection, similar to
>>> SMAP. Also, if I understand things correctly the protections are
>>> unmanaged, so a page that had the +x bit set at some point, it could
>>> be used for ret2dir.
>> So you means that the Pro. of direct map area maybe changed to +x, then ret2dir attack can use it?
>
> XPFO protects against malicious reads from userspace (potentially
> accessing sensitive data).
This sounds reasonable to me.
> I've also been told by a security expert that
> ROP attacks are still possible even if user space memory is
> non-executable. XPFO is supposed to prevent that but I haven't been able
> to confirm this. It's way out of my comfort zone.
It is also rather outside my knowledge, and I am just trying hard to understand it. Thanks
so much for the kind explanation. Hopefully a security expert can give a more detailed explanation?
Thanks
Yisheng Xie
>
> ...Juerg
On Tue, Sep 12, 2017 at 04:05:22PM +0800, Yisheng Xie wrote:
>
>
> On 2017/9/12 0:03, Juerg Haefliger wrote:
> >
> >
> > On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> >> Hi Yisheng,
> >>
> >> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> >>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> >>>> +{
> >>>> + int i, flush_tlb = 0;
> >>>> + struct xpfo *xpfo;
> >>>> +
> >>>> + if (!static_branch_unlikely(&xpfo_inited))
> >>>> + return;
> >>>> +
> >>>> + for (i = 0; i < (1 << order); i++) {
> >>>> + xpfo = lookup_xpfo(page + i);
> >>>> + if (!xpfo)
> >>>> + continue;
> >>>> +
> >>>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> >>>> + "xpfo: unmapped page being allocated\n");
> >>>> +
> >>>> + /* Initialize the map lock and map counter */
> >>>> + if (unlikely(!xpfo->inited)) {
> >>>> + spin_lock_init(&xpfo->maplock);
> >>>> + atomic_set(&xpfo->mapcount, 0);
> >>>> + xpfo->inited = true;
> >>>> + }
> >>>> + WARN(atomic_read(&xpfo->mapcount),
> >>>> + "xpfo: already mapped page being allocated\n");
> >>>> +
> >>>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> >>>> + /*
> >>>> + * Tag the page as a user page and flush the TLB if it
> >>>> + * was previously allocated to the kernel.
> >>>> + */
> >>>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> >>>> + flush_tlb = 1;
> >>>
> >>> I'm not sure whether I am miss anything, however, when the page was previously allocated
> >>> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate
> >>> the page to user now
> >>>
> >> Yes, I think you're right. Oddly, the XPFO_READ_USER test works
>
> Hi Tycho,
> Could you share this test? I'd like to know how it works.
See the last patch in the series.
> >> correctly for me, but I think (?) should not because of this bug...
> >
> > IIRC, this is an optimization carried forward from the initial
> > implementation.
> Hi Juerg,
>
> hmm.. If below is the first version, then it seems this exist from the first version:
> https://patchwork.kernel.org/patch/8437451/
>
> > The assumption is that the kernel will map the user
> > buffer so it's not unmapped on allocation but only on the first (and
> > subsequent) call of kunmap.
>
> IMO, before a page is allocated, it is in buddy system, which means it is free
> and no other 'map' on the page except direct map. Then if the page is allocated
> to user, XPFO should unmap the direct map. otherwise the ret2dir may works at
> this window before it is freed. Or maybe I'm still missing anything.
I agree that it seems broken. I'm just not sure why the test doesn't
fail. It's certainly worth understanding.
Tycho
Hi Yisheng,
> On Tue, Sep 12, 2017 at 04:05:22PM +0800, Yisheng Xie wrote:
> > IMO, before a page is allocated, it is in buddy system, which means it is free
> > and no other 'map' on the page except direct map. Then if the page is allocated
> > to user, XPFO should unmap the direct map. otherwise the ret2dir may works at
> > this window before it is freed. Or maybe I'm still missing anything.
>
> I agree that it seems broken. I'm just not sure why the test doesn't
> fail. It's certainly worth understanding.
Ok, so I think what's going on is that the page *is* mapped and unmapped by the
kernel as Juerg described, but only in certain cases. See prep_new_page(),
which has the following:
if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
for (i = 0; i < (1 << order); i++)
clear_highpage(page + i);
clear_highpage() maps and unmaps the pages, so that's why xpfo works with this
set.
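
(For reference, clear_highpage() is roughly the following, and with this series
the kmap_atomic()/kunmap_atomic() calls are what end up going through
xpfo_kmap()/xpfo_kunmap(), i.e. they re-map and then re-unmap the page:)

        /* include/linux/highmem.h, roughly: */
        static inline void clear_highpage(struct page *page)
        {
                void *kaddr = kmap_atomic(page);  /* -> xpfo_kmap(): map into the kernel */
                clear_page(kaddr);                /* zero the new page */
                kunmap_atomic(kaddr);             /* -> xpfo_kunmap(): unmap again */
        }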
I tried with CONFIG_PAGE_POISONING_ZERO=y and page_poison=y, and the
XPFO_READ_USER test does not fail, i.e. the read succeeds. So, I think we need
to include this zeroing condition in xpfo_alloc_pages(), something like the
patch below. Unfortunately, this fails to boot for me, probably for an
unrelated reason that I'll look into.
Thanks a lot!
Tycho
From bfc21a6438cf8c56741af94cac939f1b0f63752c Mon Sep 17 00:00:00 2001
From: Tycho Andersen <[email protected]>
Date: Tue, 12 Sep 2017 12:06:41 -0600
Subject: [PATCH] draft of unmapping patch
Signed-off-by: Tycho Andersen <[email protected]>
---
include/linux/xpfo.h | 5 +++--
mm/compaction.c | 2 +-
mm/internal.h | 2 +-
mm/page_alloc.c | 10 ++++++----
mm/xpfo.c | 10 ++++++++--
5 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index b24be9ac4a2d..c991bf7f051d 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -29,7 +29,7 @@ void xpfo_flush_kernel_tlb(struct page *page, int order);
void xpfo_kmap(void *kaddr, struct page *page);
void xpfo_kunmap(void *kaddr, struct page *page);
-void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
void xpfo_free_pages(struct page *page, int order);
bool xpfo_page_is_unmapped(struct page *page);
@@ -49,7 +49,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
-static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp,
+ bool will_map) { }
static inline void xpfo_free_pages(struct page *page, int order) { }
static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
diff --git a/mm/compaction.c b/mm/compaction.c
index fb548e4c7bd4..9a222258e65c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -76,7 +76,7 @@ static void map_pages(struct list_head *list)
order = page_private(page);
nr_pages = 1 << order;
- post_alloc_hook(page, order, __GFP_MOVABLE);
+ post_alloc_hook(page, order, __GFP_MOVABLE, false);
if (order)
split_page(page, order);
diff --git a/mm/internal.h b/mm/internal.h
index 4ef49fc55e58..1a0331ec2b2d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -165,7 +165,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
unsigned int order);
extern void prep_compound_page(struct page *page, unsigned int order);
extern void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags);
+ gfp_t gfp_flags, bool will_map);
extern int user_min_free_kbytes;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 09fdf1bad21f..f73809847c58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1750,7 +1750,7 @@ static bool check_new_pages(struct page *page, unsigned int order)
}
inline void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, bool will_map)
{
set_page_private(page, 0);
set_page_refcounted(page);
@@ -1759,18 +1759,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
kernel_map_pages(page, 1 << order, 1);
kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
- xpfo_alloc_pages(page, order, gfp_flags);
+ xpfo_alloc_pages(page, order, gfp_flags, will_map);
set_page_owner(page, order, gfp_flags);
}
+extern bool xpfo_test;
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
unsigned int alloc_flags)
{
int i;
+ bool needs_zero = !free_pages_prezeroed() && (gfp_flags & __GFP_ZERO);
- post_alloc_hook(page, order, gfp_flags);
+ post_alloc_hook(page, order, gfp_flags, needs_zero);
- if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
+ if (needs_zero)
for (i = 0; i < (1 << order); i++)
clear_highpage(page + i);
diff --git a/mm/xpfo.c b/mm/xpfo.c
index ca5d4d1838f9..dd25e24213fe 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -86,7 +86,7 @@ static inline struct xpfo *lookup_xpfo(struct page *page)
return (void *)page_ext + page_xpfo_ops.offset;
}
-void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
{
int i, flush_tlb = 0;
struct xpfo *xpfo;
@@ -116,8 +116,14 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
* Tag the page as a user page and flush the TLB if it
* was previously allocated to the kernel.
*/
- if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ bool was_user = !test_and_set_bit(XPFO_PAGE_USER,
+ &xpfo->flags);
+
+ if (was_user || !will_map) {
+ set_kpte(page_address(page + i), page + i,
+ __pgprot(0));
flush_tlb = 1;
+ }
} else {
/* Tag the page as a non-user (kernel) page */
clear_bit(XPFO_PAGE_USER, &xpfo->flags);
--
2.11.0
Hi Tycho,
On 2017/9/13 2:13, Tycho Andersen wrote:
> Hi Yisheng,
>
>> On Tue, Sep 12, 2017 at 04:05:22PM +0800, Yisheng Xie wrote:
>>> IMO, before a page is allocated, it is in buddy system, which means it is free
>>> and no other 'map' on the page except direct map. Then if the page is allocated
>>> to user, XPFO should unmap the direct map. otherwise the ret2dir may works at
>>> this window before it is freed. Or maybe I'm still missing anything.
>>
>> I agree that it seems broken. I'm just not sure why the test doesn't
>> fail. It's certainly worth understanding.
>
> Ok, so I think what's going on is that the page *is* mapped and unmapped by the
> kernel as Juerg described, but only in certain cases. See prep_new_page(),
> which has the following:
>
> if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
> for (i = 0; i < (1 << order); i++)
> clear_highpage(page + i);
>
> clear_highpage() maps and unmaps the pages, so that's why xpfo works with this
> set.
Oh, I really missed this point; we need to zero the memory before userspace gets it.
Thanks a lot for figuring this out.
>
> I tried with CONFIG_PAGE_POISONING_ZERO=y and page_poison=y, and the
> XPFO_READ_USER test does not fail, i.e. the read succeeds. So, I think we need
> to include this zeroing condition in xpfo_alloc_pages(), something like the
> patch below. Unfortunately, this fails to boot for me, probably for an
> unrelated reason that I'll look into.
Yes, it seems this case needs fixing, and I am also a little puzzled about why the boot fails.
Thanks
Yisheng Xie
>
> Thanks a lot!
>
> Tycho
>
>
> From bfc21a6438cf8c56741af94cac939f1b0f63752c Mon Sep 17 00:00:00 2001
> From: Tycho Andersen <[email protected]>
> Date: Tue, 12 Sep 2017 12:06:41 -0600
> Subject: [PATCH] draft of unmapping patch
>
> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> include/linux/xpfo.h | 5 +++--
> mm/compaction.c | 2 +-
> mm/internal.h | 2 +-
> mm/page_alloc.c | 10 ++++++----
> mm/xpfo.c | 10 ++++++++--
> 5 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index b24be9ac4a2d..c991bf7f051d 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -29,7 +29,7 @@ void xpfo_flush_kernel_tlb(struct page *page, int order);
>
> void xpfo_kmap(void *kaddr, struct page *page);
> void xpfo_kunmap(void *kaddr, struct page *page);
> -void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
> void xpfo_free_pages(struct page *page, int order);
>
> bool xpfo_page_is_unmapped(struct page *page);
> @@ -49,7 +49,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
>
> static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> -static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp,
> + bool will_map) { }
> static inline void xpfo_free_pages(struct page *page, int order) { }
>
> static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index fb548e4c7bd4..9a222258e65c 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -76,7 +76,7 @@ static void map_pages(struct list_head *list)
> order = page_private(page);
> nr_pages = 1 << order;
>
> - post_alloc_hook(page, order, __GFP_MOVABLE);
> + post_alloc_hook(page, order, __GFP_MOVABLE, false);
> if (order)
> split_page(page, order);
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 4ef49fc55e58..1a0331ec2b2d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -165,7 +165,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
> unsigned int order);
> extern void prep_compound_page(struct page *page, unsigned int order);
> extern void post_alloc_hook(struct page *page, unsigned int order,
> - gfp_t gfp_flags);
> + gfp_t gfp_flags, bool will_map);
> extern int user_min_free_kbytes;
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 09fdf1bad21f..f73809847c58 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1750,7 +1750,7 @@ static bool check_new_pages(struct page *page, unsigned int order)
> }
>
> inline void post_alloc_hook(struct page *page, unsigned int order,
> - gfp_t gfp_flags)
> + gfp_t gfp_flags, bool will_map)
> {
> set_page_private(page, 0);
> set_page_refcounted(page);
> @@ -1759,18 +1759,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> kernel_map_pages(page, 1 << order, 1);
> kernel_poison_pages(page, 1 << order, 1);
> kasan_alloc_pages(page, order);
> - xpfo_alloc_pages(page, order, gfp_flags);
> + xpfo_alloc_pages(page, order, gfp_flags, will_map);
> set_page_owner(page, order, gfp_flags);
> }
>
> +extern bool xpfo_test;
> static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> unsigned int alloc_flags)
> {
> int i;
> + bool needs_zero = !free_pages_prezeroed() && (gfp_flags & __GFP_ZERO);
>
> - post_alloc_hook(page, order, gfp_flags);
> + post_alloc_hook(page, order, gfp_flags, needs_zero);
>
> - if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
> + if (needs_zero)
> for (i = 0; i < (1 << order); i++)
> clear_highpage(page + i);
>
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index ca5d4d1838f9..dd25e24213fe 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -86,7 +86,7 @@ static inline struct xpfo *lookup_xpfo(struct page *page)
> return (void *)page_ext + page_xpfo_ops.offset;
> }
>
> -void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
> {
> int i, flush_tlb = 0;
> struct xpfo *xpfo;
> @@ -116,8 +116,14 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> * Tag the page as a user page and flush the TLB if it
> * was previously allocated to the kernel.
> */
> - if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> + bool was_user = !test_and_set_bit(XPFO_PAGE_USER,
> + &xpfo->flags);
> +
> + if (was_user || !will_map) {
> + set_kpte(page_address(page + i), page + i,
> + __pgprot(0));
> flush_tlb = 1;
> + }
> } else {
> /* Tag the page as a non-user (kernel) page */
> clear_bit(XPFO_PAGE_USER, &xpfo->flags);
>
Hi,
CC Juergen, Boris and Stefano.
On 08/09/17 18:24, Tycho Andersen wrote:
> On Fri, Sep 08, 2017 at 12:53:47AM -0700, Christoph Hellwig wrote:
>>> +/*
>>> + * Lookup the page table entry for a virtual address and return a pointer to
>>> + * the entry. Based on x86 tree.
>>> + */
>>> +static pte_t *lookup_address(unsigned long addr)
>>
>> Seems like this should be moved to common arm64 mm code and used by
>> kernel_page_present.
>
> Sounds good, I'll include something like the patch below in the next
> series.
>
> Unfortunately, adding an implementation of lookup_address seems to be
> slightly more complicated than necessary, because of the xen piece. We
> have to define lookup_address() with the level parameter, but it's not
> obvious to me to name the page levels. So for now I've just left it as
> a WARN() if someone supplies it.
>
> It seems like xen still does need this to be defined, because if I
> define it without level:
>
> drivers/xen/xenbus/xenbus_client.c: In function ‘xenbus_unmap_ring_vfree_pv’:
> drivers/xen/xenbus/xenbus_client.c:760:4: error: too many arguments to function ‘lookup_address’
> lookup_address(addr, &level)).maddr;
> ^~~~~~~~~~~~~~
> In file included from ./arch/arm64/include/asm/page.h:37:0,
> from ./include/linux/mmzone.h:20,
> from ./include/linux/gfp.h:5,
> from ./include/linux/mm.h:9,
> from drivers/xen/xenbus/xenbus_client.c:33:
> ./arch/arm64/include/asm/pgtable-types.h:67:15: note: declared here
> extern pte_t *lookup_address(unsigned long addr);
> ^~~~~~~~~~~~~~
>
> I've cc-d the xen folks, maybe they can suggest a way to untangle it?
> Alternatively, if someone can suggest a good naming scheme for the
> page levels, I can just do that.
The implementation of lookup_address(...) on ARM for Xen (see
include/xen/arm/page.h) is just a BUG(). This is because this code
should never be called (only used for x86 PV code).
Furthermore, the xenbus client does not use the level at all; it is only there to
cope with the x86 version of lookup_address().
So one way to solve the problem would be to introduce
xen_lookup_address(addr) that would be implemented as:
- on x86
unsigned int level;
return lookup_address(addr, &level).maddr;
- on ARM
BUG();
With that there would be no prototype clash and avoid introducing a
level parameter.
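A minimal sketch of that idea (xen_lookup_address() is a made-up name, and I'm
returning the pte pointer here; converting it to a machine address would stay at
the existing xenbus call site):

        /* x86, e.g. somewhere under arch/x86/include/asm/xen/ (sketch): */
        static inline pte_t *xen_lookup_address(unsigned long addr)
        {
                unsigned int level;

                return lookup_address(addr, &level);
        }

        /* arm/arm64, e.g. in include/xen/arm/page.h (sketch): */
        static inline pte_t *xen_lookup_address(unsigned long addr)
        {
                /* only reachable from x86 PV-only code paths */
                BUG();
                return NULL;
        }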
Cheers,
--
Julien Grall
On 14/09/17 12:41, Julien Grall wrote:
> Hi,
>
> CC Juergen, Boris and Stefano.
>
> On 08/09/17 18:24, Tycho Andersen wrote:
>> On Fri, Sep 08, 2017 at 12:53:47AM -0700, Christoph Hellwig wrote:
>>>> +/*
>>>> + * Lookup the page table entry for a virtual address and return a
>>>> pointer to
>>>> + * the entry. Based on x86 tree.
>>>> + */
>>>> +static pte_t *lookup_address(unsigned long addr)
>>>
>>> Seems like this should be moved to common arm64 mm code and used by
>>> kernel_page_present.
>>
>> Sounds good, I'll include something like the patch below in the next
>> series.
>>
>> Unfortunately, adding an implementation of lookup_address seems to be
>> slightly more complicated than necessary, because of the xen piece. We
>> have to define lookup_address() with the level parameter, but it's not
>> obvious to me to name the page levels. So for now I've just left it as
>> a WARN() if someone supplies it.
>>
>> It seems like xen still does need this to be defined, because if I
>> define it without level:
>>
>> drivers/xen/xenbus/xenbus_client.c: In function
>> ‘xenbus_unmap_ring_vfree_pv’:
>> drivers/xen/xenbus/xenbus_client.c:760:4: error: too many arguments to
>> function ‘lookup_address’
>> lookup_address(addr, &level)).maddr;
>> ^~~~~~~~~~~~~~
>> In file included from ./arch/arm64/include/asm/page.h:37:0,
>> from ./include/linux/mmzone.h:20,
>> from ./include/linux/gfp.h:5,
>> from ./include/linux/mm.h:9,
>> from drivers/xen/xenbus/xenbus_client.c:33:
>> ./arch/arm64/include/asm/pgtable-types.h:67:15: note: declared here
>> extern pte_t *lookup_address(unsigned long addr);
>> ^~~~~~~~~~~~~~
>>
>> I've cc-d the xen folks, maybe they can suggest a way to untangle it?
>> Alternatively, if someone can suggest a good naming scheme for the
>> page levels, I can just do that.
>
> The implementation of lookup_address(...) on ARM for Xen (see
> include/xen/arm/page.h) is just a BUG(). This is because this code
> should never be called (only used for x86 PV code).
>
> Furthermore, xenbus client does not use at all the level. It is just to
> cope with the x86 version of lookup_address.
>
> So one way to solve the problem would be to introduce
> xen_lookup_address(addr) that would be implemented as:
> - on x86
> unsigned int level;
>
> return lookup_address(addr, &level).maddr;
> - on ARM
> BUG();
>
> With that there would be no prototype clash and avoid introducing a
> level parameter.
I'd rather add some #ifdef CONFIG_XEN_PV and remove the *_pv functions
from ARM completely this way. I'm sending a patch soon...
Juergen
Hi,
On Thu, Sep 07, 2017 at 11:36:03AM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> provide a hook for updating a single kernel page table entry (which is
> required by the generic XPFO code).
>
> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 ++
> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 61 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index dfd908630631..44fa44ef02ec 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -121,6 +121,7 @@ config ARM64
> select SPARSE_IRQ
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_XPFO
A bit of a nit, but this list is (intended to be) organised alphabetically.
Could you please try to retain that?
i.e. place this between ARCH_SUPPORTS_NUMA_BALANCING and
ARCH_WANT_COMPAT_IPC_PARSE_VERSION.
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 9b0ba191e48e..22e5cab543d8 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n
>
> obj-$(CONFIG_KASAN) += kasan_init.o
> KASAN_SANITIZE_kasan_init.o := n
> +
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> new file mode 100644
> index 000000000000..678e2be848eb
> --- /dev/null
> +++ b/arch/arm64/mm/xpfo.c
> @@ -0,0 +1,58 @@
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + * Juerg Haefliger <[email protected]>
> + * Vasileios P. Kemerlis <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/tlbflush.h>
> +
> +/*
> + * Lookup the page table entry for a virtual address and return a pointer to
> + * the entry. Based on x86 tree.
> + */
Is this intended for kernel VAs, user VAs, or both?
There are different constraints for fiddling with either (e.g. holding
mmap_sem), so we should be clear regarding the intended use-case.
> +static pte_t *lookup_address(unsigned long addr)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> +
> + pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd))
> + return NULL;
> +
> + pud = pud_offset(pgd, addr);
> + if (pud_none(*pud))
> + return NULL;
What if it's not none, but not a table?
I think we should check pud_sect() here, and/or pud_bad().
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd))
> + return NULL;
Likewise.
> +
> + return pte_offset_kernel(pmd, addr);
> +}
Given this expects a pte, it might make more sense to call this
lookup_address_pte() to make that clear.
> +
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> + pte_t *pte = lookup_address((unsigned long)kaddr);
> +
> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
We can get NULL from lookup_address(), so this doesn't look right.
If NULL implies an error, drop a BUG_ON(!pte) before the set_pte.
> +}
> +
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> + unsigned long kaddr = (unsigned long)page_address(page);
> + unsigned long size = PAGE_SIZE;
unsigned long size = PAGE_SIZE << order;
> +
> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
... and this can be simpler.
I haven't brought myself back up to speed, so it might not be possible, but I
still think it would be preferable for XPFO to call flush_tlb_kernel_range()
directly.
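i.e., the wrapper would reduce to something like (sketch):

        inline void xpfo_flush_kernel_tlb(struct page *page, int order)
        {
                unsigned long kaddr = (unsigned long)page_address(page);
                unsigned long size = PAGE_SIZE << order;

                flush_tlb_kernel_range(kaddr, kaddr + size);
        }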
Thanks,
Mark.
On Thu, Sep 07, 2017 at 11:36:05AM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault, so let's temporarily map the region, flush the cache, and then
> unmap it.
>
> v6: actually flush in the face of xpfo, and temporarily map the underlying
> memory so it can be flushed correctly
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/mm/flush.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 21a8d828cbf4..e335e3fd4fca 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -20,6 +20,7 @@
> #include <linux/export.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> +#include <linux/xpfo.h>
>
> #include <asm/cacheflush.h>
> #include <asm/cache.h>
> @@ -28,9 +29,15 @@
> void sync_icache_aliases(void *kaddr, unsigned long len)
> {
> unsigned long addr = (unsigned long)kaddr;
> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> + void *mapping[num_pages];
>
> if (icache_is_aliasing()) {
> + xpfo_temp_map(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
> __clean_dcache_area_pou(kaddr, len);
> + xpfo_temp_unmap(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
Does this create the mapping in-place?
Can we not just kmap_atomic() an alias? Or is there a problem with that?
Thanks,
Mark.
On Thu, Sep 07, 2017 at 11:36:08AM -0600, Tycho Andersen wrote:
> We need someting like this for testing XPFO. Since it's architecture
> specific, putting it in the test code is slightly awkward, so let's make it
> an arch-specific symbol and export it for use in LKDTM.
>
> v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
>
> CC: [email protected]
> CC: [email protected]
> Signed-off-by: Tycho Andersen <[email protected]>
> Tested-by: Marco Benatto <[email protected]>
> ---
> arch/arm64/mm/xpfo.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/mm/xpfo.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/xpfo.h | 5 +++++
> 3 files changed, 113 insertions(+)
>
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> index 342a9ccb93c1..94a667d94e15 100644
> --- a/arch/arm64/mm/xpfo.c
> +++ b/arch/arm64/mm/xpfo.c
> @@ -74,3 +74,54 @@ void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
>
> xpfo_temp_unmap(addr, size, mapping, sizeof(mapping[0]) * num_pages);
> }
> +
> +/* Convert a user space virtual address to a physical address.
> + * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
> + * arch/x86/mm/pageattr.c
> + */
When can this be called? What prevents concurrent modification of the user page
tables?
i.e. must mmap_sem be held?
> +phys_addr_t user_virt_to_phys(unsigned long addr)
Does this really need to be architecture specific?
Core mm code manages to walk user page tables just fine...
> +{
> + phys_addr_t phys_addr;
> + unsigned long offset;
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(current->mm, addr);
> + if (pgd_none(*pgd))
> + return 0;
Can we please separate the address and return value? e.g. pass the PA by
reference and return an error code.
AFAIK, zero is a valid PA, and even if the tables exist, they might point there
in the presence of an error.
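For example, a calling convention along these lines (sketch; the error handling
is illustrative):

        int user_virt_to_phys(unsigned long addr, phys_addr_t *pa);

        /* at the call site: */
        phys_addr_t pa;

        if (user_virt_to_phys(addr, &pa))
                return;         /* lookup failed */
        /* pa is usable here, even if it happens to be 0 */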
> +
> + p4d = p4d_offset(pgd, addr);
> + if (p4d_none(*p4d))
> + return 0;
> +
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud))
> + return 0;
> +
> + if (pud_sect(*pud) || !pud_present(*pud)) {
> + phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
Was there some problem with:
phys_addr = pmd_page_paddr(*pud);
... and similar for the other levels?
... I'd rather introduce new helpers than more open-coded calculations.
Thanks,
Mark.
Hi Mark,
On Thu, Sep 14, 2017 at 07:34:02PM +0100, Mark Rutland wrote:
> On Thu, Sep 07, 2017 at 11:36:08AM -0600, Tycho Andersen wrote:
> > We need someting like this for testing XPFO. Since it's architecture
> > specific, putting it in the test code is slightly awkward, so let's make it
> > an arch-specific symbol and export it for use in LKDTM.
> >
> > v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
> >
> > CC: [email protected]
> > CC: [email protected]
> > Signed-off-by: Tycho Andersen <[email protected]>
> > Tested-by: Marco Benatto <[email protected]>
> > ---
> > arch/arm64/mm/xpfo.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> > arch/x86/mm/xpfo.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/xpfo.h | 5 +++++
> > 3 files changed, 113 insertions(+)
> >
> > diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> > index 342a9ccb93c1..94a667d94e15 100644
> > --- a/arch/arm64/mm/xpfo.c
> > +++ b/arch/arm64/mm/xpfo.c
> > @@ -74,3 +74,54 @@ void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
> >
> > xpfo_temp_unmap(addr, size, mapping, sizeof(mapping[0]) * num_pages);
> > }
> > +
> > +/* Convert a user space virtual address to a physical address.
> > + * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
> > + * arch/x86/mm/pageattr.c
> > + */
>
> When can this be called? What prevents concurrent modification of the user page
> tables?
>
> i.e. must mmap_sem be held?
Yes, it should be. Since we're moving this back into the lkdtm test
code I think it's less important, since nothing should be modifying
the tables while the thread is doing the lookup, but I'll add it in
the next version.
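i.e. something like this around the lookup in the lkdtm test (sketch, variable
names illustrative):

        down_read(&current->mm->mmap_sem);
        phys = user_virt_to_phys(user_addr);
        up_read(&current->mm->mmap_sem);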
> > +phys_addr_t user_virt_to_phys(unsigned long addr)
>
> Does this really need to be architecture specific?
>
> Core mm code manages to walk user page tables just fine...
I think so since we don't support section mappings right now, so
p*d_sect will always be false.
> > +{
> > + phys_addr_t phys_addr;
> > + unsigned long offset;
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > + pte_t *pte;
> > +
> > + pgd = pgd_offset(current->mm, addr);
> > + if (pgd_none(*pgd))
> > + return 0;
>
> Can we please separate the address and return value? e.g. pass the PA by
> reference and return an error code.
>
> AFAIK, zero is a valid PA, and even if the tables exist, they might point there
> in the presence of an error.
Sure, I'll rearrange this.
> > +
> > + p4d = p4d_offset(pgd, addr);
> > + if (p4d_none(*p4d))
> > + return 0;
> > +
> > + pud = pud_offset(p4d, addr);
> > + if (pud_none(*pud))
> > + return 0;
> > +
> > + if (pud_sect(*pud) || !pud_present(*pud)) {
> > + phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
>
> Was there some problem with:
>
> phys_addr = pmd_page_paddr(*pud);
>
> ... and similar for the other levels?
>
> ... I'd rather introduce new helpers than more open-coded calculations.
I wasn't aware of these; we could define a similar set of functions
for x86 and then make it not arch-specific.
I wonder if we could also use follow_page_pte(), since we know that
the page is always present (given that it's been allocated).
Unfortunately follow_page_pte() is not exported.
Tycho
Hi Mark,
On Thu, Sep 14, 2017 at 07:22:08PM +0100, Mark Rutland wrote:
> Hi,
>
> On Thu, Sep 07, 2017 at 11:36:03AM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> > provide a hook for updating a single kernel page table entry (which is
> > required by the generic XPFO code).
> >
> > v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
> >
> > CC: [email protected]
> > Signed-off-by: Juerg Haefliger <[email protected]>
> > Signed-off-by: Tycho Andersen <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/mm/Makefile | 2 ++
> > arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 61 insertions(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index dfd908630631..44fa44ef02ec 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -121,6 +121,7 @@ config ARM64
> > select SPARSE_IRQ
> > select SYSCTL_EXCEPTION_TRACE
> > select THREAD_INFO_IN_TASK
> > + select ARCH_SUPPORTS_XPFO
>
> A bit of a nit, but this list is (intended to be) organised alphabetically.
> Could you please try to retain that?
>
> i.e. place this between ARCH_SUPPORTS_NUMA_BALANCING and
> ARCH_WANT_COMPAT_IPC_PARSE_VERSION.
Will do.
> > help
> > ARM 64-bit (AArch64) Linux support.
> >
> > diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> > index 9b0ba191e48e..22e5cab543d8 100644
> > --- a/arch/arm64/mm/Makefile
> > +++ b/arch/arm64/mm/Makefile
> > @@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n
> >
> > obj-$(CONFIG_KASAN) += kasan_init.o
> > KASAN_SANITIZE_kasan_init.o := n
> > +
> > +obj-$(CONFIG_XPFO) += xpfo.o
> > diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> > new file mode 100644
> > index 000000000000..678e2be848eb
> > --- /dev/null
> > +++ b/arch/arm64/mm/xpfo.c
> > @@ -0,0 +1,58 @@
> > +/*
> > + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + * Juerg Haefliger <[email protected]>
> > + * Vasileios P. Kemerlis <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +/*
> > + * Lookup the page table entry for a virtual address and return a pointer to
> > + * the entry. Based on x86 tree.
> > + */
>
> Is this intended for kernel VAs, user VAs, or both?
>
> There are different constraints for fiddling with either (e.g. holding
> mmap_sem), so we should be clear regarding the intended use-case.
kernel only; I can add a comment noting this.
> > +static pte_t *lookup_address(unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > +
> > + pgd = pgd_offset_k(addr);
> > + if (pgd_none(*pgd))
> > + return NULL;
> > +
> > + pud = pud_offset(pgd, addr);
> > + if (pud_none(*pud))
> > + return NULL;
>
> What if it's not none, but not a table?
>
> I think we should check pud_sect() here, and/or pud_bad().
>
> > +
> > + pmd = pmd_offset(pud, addr);
> > + if (pmd_none(*pmd))
> > + return NULL;
>
> Likewise.
In principle skipping the pud_sect() check should be okay, because we say that
XPFO doesn't support section mappings yet. I'll add a check for pud_bad().
However, Christoph suggested that we move this to common code and
there it won't be okay.
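For the arm64 walk, the extra checks would look roughly like (sketch):

        pud = pud_offset(pgd, addr);
        if (pud_none(*pud) || pud_bad(*pud))
                return NULL;

        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
                return NULL;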
> > +
> > + return pte_offset_kernel(pmd, addr);
> > +}
>
> Given this expects a pte, it might make more sense to call this
> lookup_address_pte() to make that clear.
>
> > +
> > +/* Update a single kernel page table entry */
> > +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> > +{
> > + pte_t *pte = lookup_address((unsigned long)kaddr);
> > +
> > + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
>
> We can get NULL from lookup_address(), so this doesn't look right.
>
> If NULL implies an error, drop a BUG_ON(!pte) before the set_pte.
It does, I'll add this (as a WARN() and then no-op), thanks.
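i.e. roughly (sketch):

        inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
        {
                pte_t *pte = lookup_address((unsigned long)kaddr);

                if (WARN_ON(!pte))
                        return;

                set_pte(pte, pfn_pte(page_to_pfn(page), prot));
        }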
> > +}
> > +
> > +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> > +{
> > + unsigned long kaddr = (unsigned long)page_address(page);
> > + unsigned long size = PAGE_SIZE;
>
> unsigned long size = PAGE_SIZE << order;
>
> > +
> > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>
> ... and this can be simpler.
>
> I haven't brought myself back up to speed, so it might not be possible, but I
> still think it would be preferable for XPFO to call flush_tlb_kernel_range()
> directly.
I don't think we can, since on x86 it uses smp functions, and in some
cases those aren't safe.
Cheers,
Tycho
On Thu, Sep 14, 2017 at 07:25:56PM +0100, Mark Rutland wrote:
> On Thu, Sep 07, 2017 at 11:36:05AM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > If the page is unmapped by XPFO, a data cache flush results in a fatal
> > page fault, so let's temporarily map the region, flush the cache, and then
> > unmap it.
> >
> > v6: actually flush in the face of xpfo, and temporarily map the underlying
> > memory so it can be flushed correctly
> >
> > CC: [email protected]
> > Signed-off-by: Juerg Haefliger <[email protected]>
> > Signed-off-by: Tycho Andersen <[email protected]>
> > ---
> > arch/arm64/mm/flush.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> > index 21a8d828cbf4..e335e3fd4fca 100644
> > --- a/arch/arm64/mm/flush.c
> > +++ b/arch/arm64/mm/flush.c
> > @@ -20,6 +20,7 @@
> > #include <linux/export.h>
> > #include <linux/mm.h>
> > #include <linux/pagemap.h>
> > +#include <linux/xpfo.h>
> >
> > #include <asm/cacheflush.h>
> > #include <asm/cache.h>
> > @@ -28,9 +29,15 @@
> > void sync_icache_aliases(void *kaddr, unsigned long len)
> > {
> > unsigned long addr = (unsigned long)kaddr;
> > + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> > + void *mapping[num_pages];
> >
> > if (icache_is_aliasing()) {
> > + xpfo_temp_map(kaddr, len, mapping,
> > + sizeof(mapping[0]) * num_pages);
> > __clean_dcache_area_pou(kaddr, len);
> > + xpfo_temp_unmap(kaddr, len, mapping,
> > + sizeof(mapping[0]) * num_pages);
>
> Does this create the mapping in-place?
>
> Can we not just kmap_atomic() an alias? Or is there a problem with that?
I think what we really want is something like vmap(); looking at
xpfo_temp_map(), it seems like the implementation is completely wrong.
I wonder if what you mentioned at LSS is possible though: doing cache
management with userspace primitives instead of mapping the region
just to flush it.
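A vmap()-based helper would look roughly like this (sketch only; the interface
below takes a page array rather than the (addr, size, mapping) arguments the
current helpers use):

        #include <linux/vmalloc.h>

        /* map @count pages contiguously into vmalloc space */
        static void *xpfo_temp_map_pages(struct page **pages, unsigned int count)
        {
                return vmap(pages, count, VM_MAP, PAGE_KERNEL);
        }

        static void xpfo_temp_unmap_pages(void *vaddr)
        {
                vunmap(vaddr);
        }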
Cheers,
Tycho
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
...
> Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
I'm looking for the code where it's unmapped at allocation to userspace.
I see TLB flushing and 'struct xpfo' manipulation, but I don't see the
unmapping. Where is that occurring?
How badly does this hurt performance? Since we (generally) have
different migrate types for user and kernel allocation, I can imagine
that a given page *generally* doesn't oscillate between user and
kernel-allocated, but I'm curious how it works in practice. Doesn't the
IPI load from the TLB flushes eat you alive?
It's a bit scary to have such a deep code path under the main allocator.
This all seems insanely expensive. It will *barely* work on an
allocation-heavy workload on a desktop. I'm pretty sure the locking
will just fall over entirely on any reasonably-sized server.
I really have to wonder whether there are better ret2dir defenses than
this. The allocator just seems like the *wrong* place to be doing this
because it's such a hot path.
> + cpa.vaddr = kaddr;
> + cpa.pages = &page;
> + cpa.mask_set = prot;
> + cpa.mask_clr = msk_clr;
> + cpa.numpages = 1;
> + cpa.flags = 0;
> + cpa.curpage = 0;
> + cpa.force_split = 0;
> +
> +
> + do_split = try_preserve_large_page(pte, (unsigned
Is this safe to do without a TLB flush? I thought we had plenty of bugs
in CPUs around having multiple entries for the same page in the TLB at
once. We're *REALLY* careful when we split large pages for THP, and I'm
surprised we don't do the same here.
Why do you even bother keeping large pages around? Won't the entire
kernel just degrade to using 4k everywhere, eventually?
> + if (do_split) {
> + struct page *base;
> +
> + base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK,
Ugh, GFP_ATOMIC. That's nasty. Do you really want this allocation to
fail all the time? GFP_ATOMIC could really be called
GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :)
You probably want to do what the THP code does here and keep a spare
page around, then allocate it before you take the locks.
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> + int level;
> + unsigned long size, kaddr;
> +
> + kaddr = (unsigned long)page_address(page);
> +
> + if (unlikely(!lookup_address(kaddr, &level))) {
> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> + return;
> + }
> +
> + switch (level) {
> + case PG_LEVEL_4K:
> + size = PAGE_SIZE;
> + break;
> + case PG_LEVEL_2M:
> + size = PMD_SIZE;
> + break;
> + case PG_LEVEL_1G:
> + size = PUD_SIZE;
> + break;
> + default:
> + WARN(1, "xpfo: unsupported page level %x\n", level);
> + return;
> + }
> +
> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +}
I'm not sure flush_tlb_kernel_range() is the best primitive to be
calling here.
Let's say you walk the page tables and find level=PG_LEVEL_1G. You call
flush_tlb_kernel_range(), you will be above
tlb_single_page_flush_ceiling, and you will do a full TLB flush. But,
with a 1GB page, you could have just used a single INVLPG and skipped
the global flush.
I guess the cost of the IPI is way more than the flush itself, but it's
still a shame to toss the entire TLB when you don't have to.
I also think the TLB flush should be done closer to the page table
manipulation that it is connected to. It's hard to figure out whether
the flush is the right one otherwise.
Also, the "(1 << order) * size" thing looks goofy to me. Let's say you
are flushing a order=1 (8k) page and its mapped in a 1GB mapping. You
flush 2GB. Is that intentional?
> +
> +void xpfo_free_pages(struct page *page, int order)
> +{
...
> + /*
> + * Map the page back into the kernel if it was previously
> + * allocated to user space.
> + */
> + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> + set_kpte(page_address(page + i), page + i,
> + PAGE_KERNEL);
> + }
> + }
> +}
This seems like a bad idea, performance-wise. Kernel and userspace
pages tend to be separated by migrate types. So, a given physical page
will tend to be used as kernel *or* for userspace. With this nugget,
every time a userspace page is freed, we will go to the trouble of
making it *back* into a kernel page. Then, when it is allocated again
(probably as userspace), we will re-make it into a userspace page. That
seems horribly inefficient.
Also, this weakens the security guarantees. Let's say you're mounting a
ret2dir attack. You populate a page with your evil data and you know
the kernel address for the page. All you have to do is coordinate your
attack with freeing the page. You can control when it gets freed. Now,
the xpfo_free_pages() helpfully just mapped your attack code back into
the kernel.
Why not *just* do these moves at allocation time?
Hi Dave,
Thanks for taking a look!
On Wed, Sep 20, 2017 at 08:48:36AM -0700, Dave Hansen wrote:
> On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> ...
> > Whenever a page destined for userspace is allocated, it is
> > unmapped from physmap (the kernel's page table). When such a page is
> > reclaimed from userspace, it is mapped back to physmap.
>
> I'm looking for the code where it's unmapped at allocation to userspace.
> I see TLB flushing and 'struct xpfo' manipulation, but I don't see the
> unmapping. Where is that occurring?
This is discussed here: https://lkml.org/lkml/2017/9/11/289, but you're right
that it's wrong in some cases. I've fixed it up for v7:
https://lkml.org/lkml/2017/9/12/512
> How badly does this hurt performance? Since we (generally) have
> different migrate types for user and kernel allocation, I can imagine
> that a given page *generally* doesn't oscillate between user and
> kernel-allocated, but I'm curious how it works in practice. Doesn't the
> IPI load from the TLB flushes eat you alive?
>
> It's a bit scary to have such a deep code path under the main allocator.
>
> This all seems insanely expensive. It will *barely* work on an
> allocation-heavy workload on a desktop. I'm pretty sure the locking
> will just fall over entirely on any reasonably-sized server.
Basically, yes :(. I presented some numbers at LSS, but the gist was a 2.4x
slowdown on a 24 core/48 thread Xeon E5-2650, and a 1.4x
slowdown on a 4 core/8 thread E3-1240. The story seems a little bit
better on ARM, but I'm struggling to get it to boot on a box with more
than 4 cores, so I can't draw a better picture yet.
> I really have to wonder whether there are better ret2dir defenses than
> this. The allocator just seems like the *wrong* place to be doing this
> because it's such a hot path.
This might be crazy, but what if we defer flushing of the kernel
ranges until just before we return to userspace? We'd still manipulate
the prot/xpfo bits for the pages, but then just keep a list of which
ranges need to be flushed, and do the right thing before we return.
This leaves a little window between the actual allocation and the
flush, but userspace would need another thread in its threadgroup to
predict the next allocation, write the bad stuff there, and do the
exploit all in that window.
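Very rough sketch of what I mean (all names hypothetical, and this only batches
the local CPU's pending ranges):

        struct xpfo_pending_flush {
                unsigned long start;
                unsigned long end;      /* end == 0 means nothing pending */
        };
        static DEFINE_PER_CPU(struct xpfo_pending_flush, xpfo_pending);

        /* called instead of flushing immediately when a kpte is changed */
        static void xpfo_defer_flush(unsigned long start, unsigned long end)
        {
                struct xpfo_pending_flush *p = this_cpu_ptr(&xpfo_pending);

                if (!p->end) {
                        p->start = start;
                        p->end = end;
                } else {
                        p->start = min(p->start, start);
                        p->end = max(p->end, end);
                }
        }

        /* called on the return-to-userspace path */
        void xpfo_flush_pending(void)
        {
                struct xpfo_pending_flush *p = this_cpu_ptr(&xpfo_pending);

                if (p->end) {
                        flush_tlb_kernel_range(p->start, p->end);
                        p->end = 0;
                }
        }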
I'm of course open to other suggestions. I'm new :)
> > + cpa.vaddr = kaddr;
> > + cpa.pages = &page;
> > + cpa.mask_set = prot;
> > + cpa.mask_clr = msk_clr;
> > + cpa.numpages = 1;
> > + cpa.flags = 0;
> > + cpa.curpage = 0;
> > + cpa.force_split = 0;
> > +
> > +
> > + do_split = try_preserve_large_page(pte, (unsigned
>
> Is this safe to do without a TLB flush? I thought we had plenty of bugs
> in CPUs around having multiple entries for the same page in the TLB at
> once. We're *REALLY* careful when we split large pages for THP, and I'm
> surprised we don't do the same here.
It looks like on some code paths we do flush, and some we don't.
Sounds like it's not safe to do without a flush, so I'll see about
adding one.
> Why do you even bother keeping large pages around? Won't the entire
> kernel just degrade to using 4k everywhere, eventually?
Isn't that true of large pages in general? Is there something about
xpfo that makes this worse? I thought this would only split things if
they had already been split somewhere else, and the protection can't
apply to the whole huge page.
> > + if (do_split) {
> > + struct page *base;
> > +
> > + base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK,
>
> Ugh, GFP_ATOMIC. That's nasty. Do you really want this allocation to
> fail all the time? GFP_ATOMIC could really be called
> GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :)
>
> You probably want to do what the THP code does here and keep a spare
> page around, then allocate it before you take the locks.
Sounds like a good idea, thanks.
> > +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> > +{
> > + int level;
> > + unsigned long size, kaddr;
> > +
> > + kaddr = (unsigned long)page_address(page);
> > +
> > + if (unlikely(!lookup_address(kaddr, &level))) {
> > + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> > + return;
> > + }
> > +
> > + switch (level) {
> > + case PG_LEVEL_4K:
> > + size = PAGE_SIZE;
> > + break;
> > + case PG_LEVEL_2M:
> > + size = PMD_SIZE;
> > + break;
> > + case PG_LEVEL_1G:
> > + size = PUD_SIZE;
> > + break;
> > + default:
> > + WARN(1, "xpfo: unsupported page level %x\n", level);
> > + return;
> > + }
> > +
> > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> > +}
>
> I'm not sure flush_tlb_kernel_range() is the best primitive to be
> calling here.
>
> Let's say you walk the page tables and find level=PG_LEVEL_1G. You call
> flush_tlb_kernel_range(), you will be above
> tlb_single_page_flush_ceiling, and you will do a full TLB flush. But,
> with a 1GB page, you could have just used a single INVLPG and skipped
> the global flush.
>
> I guess the cost of the IPI is way more than the flush itself, but it's
> still a shame to toss the entire TLB when you don't have to.
Ok, do you think it's worth making a new helper for others to use? Or
should I just keep the logic in this function?
> I also think the TLB flush should be done closer to the page table
> manipulation that it is connected to. It's hard to figure out whether
> the flush is the right one otherwise.
Yes, sounds good.
> Also, the "(1 << order) * size" thing looks goofy to me. Let's say you
> are flushing a order=1 (8k) page and its mapped in a 1GB mapping. You
> flush 2GB. Is that intentional?
I don't think so; seems like we should be flushing
(1 << order) * PAGE_SIZE instead.
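i.e. something like:

        flush_tlb_kernel_range(kaddr, kaddr + (PAGE_SIZE << order));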
> > +
> > +void xpfo_free_pages(struct page *page, int order)
> > +{
> ...
> > + /*
> > + * Map the page back into the kernel if it was previously
> > + * allocated to user space.
> > + */
> > + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> > + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> > + set_kpte(page_address(page + i), page + i,
> > + PAGE_KERNEL);
> > + }
> > + }
> > +}
>
> This seems like a bad idea, performance-wise. Kernel and userspace
> pages tend to be separated by migrate types. So, a given physical page
> will tend to be used as kernel *or* for userspace. With this nugget,
> every time a userspace page is freed, we will go to the trouble of
> making it *back* into a kernel page. Then, when it is allocated again
> (probably as userspace), we will re-make it into a userspace page. That
> seems horribly inefficient.
>
> Also, this weakens the security guarantees. Let's say you're mounting a
> ret2dir attack. You populate a page with your evil data and you know
> the kernel address for the page. All you have to do is coordinate your
> attack with freeing the page. You can control when it gets freed. Now,
> the xpfo_free_pages() helpfully just mapped your attack code back into
> the kernel.
>
> Why not *just* do these moves at allocation time?
Yes, this is a great point, thanks. I think this can be a no-op, and
with the fixed up v7 logic for alloc pages that I linked to above it
should work out correctly.
Tycho
On 09/20/2017 03:34 PM, Tycho Andersen wrote:
>> I really have to wonder whether there are better ret2dir defenses than
>> this. The allocator just seems like the *wrong* place to be doing this
>> because it's such a hot path.
>
> This might be crazy, but what if we defer flushing of the kernel
> ranges until just before we return to userspace? We'd still manipulate
> the prot/xpfo bits for the pages, but then just keep a list of which
> ranges need to be flushed, and do the right thing before we return.
> This leaves a little window between the actual allocation and the
> flush, but userspace would need another thread in its threadgroup to
> predict the next allocation, write the bad stuff there, and do the
> exploit all in that window.
I think the common case is still that you enter the kernel, allocate a
single page (or very few) and then exit. So, you don't really reduce
the total number of flushes.
Just think of this in terms of IPIs to do the remote TLB flushes. A CPU
can do roughly 1 million page faults and allocations a second. Say you
have a 2-socket x 28-core x 2-hyperthread system = 112 CPU threads. Each
flush has to IPI the other 111 CPUs, so that's 111M IPI interrupts/second,
just for the TLB flushes, *ON* *EACH* *CPU*.
I think the only thing that will really help here is if you batch the
allocations. For instance, you could make sure that the per-cpu-pageset
lists always contain either all kernel or all user data. Then remap the
entire list at once and do a single flush after the entire list is consumed.
>> Why do you even bother keeping large pages around? Won't the entire
>> kernel just degrade to using 4k everywhere, eventually?
>
> Isn't that true of large pages in general? Is there something about
> xpfo that makes this worse? I thought this would only split things if
> they had already been split somewhere else, and the protection can't
> apply to the whole huge page.
Even though the kernel gives out 4k pages, it still *maps* them in the
kernel linear direct map with the largest size available. My 16GB
laptop, for instance, has 3GB of 2MB transparent huge pages, but the
rest is used as 4k pages. Yet, from /proc/meminfo:
DirectMap4k: 665280 kB
DirectMap2M: 11315200 kB
DirectMap1G: 4194304 kB
Your code pretty much forces 4k pages coming out of the allocator to be
mapped with 4k mappings.
>>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>>> +{
>>> + int level;
>>> + unsigned long size, kaddr;
>>> +
>>> + kaddr = (unsigned long)page_address(page);
>>> +
>>> + if (unlikely(!lookup_address(kaddr, &level))) {
>>> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
>>> + return;
>>> + }
>>> +
>>> + switch (level) {
>>> + case PG_LEVEL_4K:
>>> + size = PAGE_SIZE;
>>> + break;
>>> + case PG_LEVEL_2M:
>>> + size = PMD_SIZE;
>>> + break;
>>> + case PG_LEVEL_1G:
>>> + size = PUD_SIZE;
>>> + break;
>>> + default:
>>> + WARN(1, "xpfo: unsupported page level %x\n", level);
>>> + return;
>>> + }
>>> +
>>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>>> +}
>>
>> I'm not sure flush_tlb_kernel_range() is the best primitive to be
>> calling here.
>>
>> Let's say you walk the page tables and find level=PG_LEVEL_1G. You call
>> flush_tlb_kernel_range(), you will be above
>> tlb_single_page_flush_ceiling, and you will do a full TLB flush. But,
>> with a 1GB page, you could have just used a single INVLPG and skipped
>> the global flush.
>>
>> I guess the cost of the IPI is way more than the flush itself, but it's
>> still a shame to toss the entire TLB when you don't have to.
>
> Ok, do you think it's worth making a new helper for others to use? Or
> should I just keep the logic in this function?
I'd just leave it in place. Most folks already have a PTE when they do
the invalidation, so this is a bit of a weirdo.
On 09/12/2017 11:13 AM, Tycho Andersen wrote:
> -void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
> {
> int i, flush_tlb = 0;
> struct xpfo *xpfo;
> @@ -116,8 +116,14 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> * Tag the page as a user page and flush the TLB if it
> * was previously allocated to the kernel.
> */
> - if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> + bool was_user = !test_and_set_bit(XPFO_PAGE_USER,
> + &xpfo->flags);
> +
> + if (was_user || !will_map) {
> + set_kpte(page_address(page + i), page + i,
> + __pgprot(0));
> flush_tlb = 1;
> + }
Shouldn't the "was_user" be "was_kernel"?
Also, the way this now works, let's say we have a nice, 2MB pmd_t (page
table entry) mapping a nice, 2MB page in the allocator. Then it gets
allocated to userspace. We do
for (i = 0; i < (1 << order); i++) {
...
set_kpte(page_address(page + i), page+i, __pgprot(0));
}
The set_kpte() will take the nice, 2MB mapping and break it down into
512 4k mappings, all pointing to a non-present PTE, in a newly-allocated
PTE page. So, you get the same result and waste 4k of memory in the
process, *AND* make it slower because we added a level to the page tables.
I think you actually want to make a single set_kpte() call at the end of
the function. That's faster and preserves the large page in the direct
mapping.
On Wed, Sep 20, 2017 at 04:46:41PM -0700, Dave Hansen wrote:
> On 09/12/2017 11:13 AM, Tycho Andersen wrote:
> > -void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> > +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
> > {
> > int i, flush_tlb = 0;
> > struct xpfo *xpfo;
> > @@ -116,8 +116,14 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> > * Tag the page as a user page and flush the TLB if it
> > * was previously allocated to the kernel.
> > */
> > - if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> > + bool was_user = !test_and_set_bit(XPFO_PAGE_USER,
> > + &xpfo->flags);
> > +
> > + if (was_user || !will_map) {
> > + set_kpte(page_address(page + i), page + i,
> > + __pgprot(0));
> > flush_tlb = 1;
> > + }
>
> Shouldn't the "was_user" be "was_kernel"?
Oof, yes, thanks.
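For reference, the renamed check would read something like:

        /*
         * !test_and_set_bit() means the bit was clear, i.e. the page
         * previously belonged to the kernel.
         */
        bool was_kernel = !test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags);

        if (was_kernel || !will_map) {
                set_kpte(page_address(page + i), page + i, __pgprot(0));
                flush_tlb = 1;
        }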
> Also, the way this now works, let's say we have a nice, 2MB pmd_t (page
> table entry) mapping a nice, 2MB page in the allocator. Then it gets
> allocated to userspace. We do
>
> for (i = 0; i < (1 << order); i++) {
> ...
> set_kpte(page_address(page + i), page+i, __pgprot(0));
> }
>
> The set_kpte() will take the nice, 2MB mapping and break it down into
> 512 4k mappings, all pointing to a non-present PTE, in a newly-allocated
> PTE page. So, you get the same result and waste 4k of memory in the
> process, *AND* make it slower because we added a level to the page tables.
>
> I think you actually want to make a single set_kpte() call at the end of
> the function. That's faster and preserves the large page in the direct
> mapping.
...and makes it easier to pair tlb flushes with changing the
protections. I guess we still need the for loop, because we need to
set/unset the xpfo bits as necessary, but I'll switch it to a single
set_kpte(). This also implies that the xpfo bits should all be the
same on every page in the mapping, which I think is true.
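Roughly (assuming set_kpte() grows an order-aware variant that can change
the whole range at once; untested):

        for (i = 0; i < (1 << order); i++) {
                /* per-page bookkeeping only: set/clear the XPFO bits */
                ...
        }

        /* one mapping change for the whole allocation, then one flush */
        set_kpte_order(page_address(page), page, __pgprot(0), order); /* stand-in */
        xpfo_flush_kernel_tlb(page, order);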
This will be a nice change, thanks!
Tycho
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> + /*
> + * Map the page back into the kernel if it was previously
> + * allocated to user space.
> + */
> + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> + set_kpte(page_address(page + i), page + i,
> + PAGE_KERNEL);
> + }
> + }
It might also be a really good idea to clear the page here. Otherwise,
the page still might have attack code in it and now it is mapped into
the kernel again, ready to be exploited.
Think of it this way: pages either hold trusted data and are mapped all
the time, or they hold potentially bad data and are mostly unmapped. If we
want to take a bad page and map it always, we have to make sure the
contents are not evil. 0's are not evil.
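For instance, a minimal tweak to the hunk above (reusing the patch's
set_kpte(); untested):

        if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
                clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
                set_kpte(page_address(page + i), page + i, PAGE_KERNEL);
                /*
                 * scrub any user-controlled contents before the kernel
                 * starts trusting this mapping again
                 */
                clear_page(page_address(page + i));
        }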
> static inline void *kmap(struct page *page)
> {
> + void *kaddr;
> +
> might_sleep();
> - return page_address(page);
> + kaddr = page_address(page);
> + xpfo_kmap(kaddr, page);
> + return kaddr;
> }
The time between kmap() and kunmap() is potentially really long. I think
we, for instance, keep some pages kmap()'d while we do I/O to them, or
wait for I/O elsewhere.
IOW, this will map predictable data at a predictable location and it
will do it for a long time. While that's better than the current state
(mapped always), it still seems rather risky.
Could you, for instance, turn kmap(page) into vmap(&page, 1, ...)? That
way, at least the address may be different each time. Even if an
attacker knows the physical address, they don't know where it will be
mapped.
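A minimal sketch of that idea (hypothetical helper names; error handling
and the matching kunmap() side are omitted):

#include <linux/vmalloc.h>

/*
 * Map the page at an address picked from the vmalloc area instead of its
 * fixed direct-map alias.
 */
static void *kmap_unpredictable(struct page *page)
{
        return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
}

static void kunmap_unpredictable(void *addr)
{
        vunmap(addr);
}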
On 09/20/2017 05:02 PM, Tycho Andersen wrote:
> ...and makes it easier to pair tlb flushes with changing the
> protections. I guess we still need the for loop, because we need to
> set/unset the xpfo bits as necessary, but I'll switch it to a single
> set_kpte(). This also implies that the xpfo bits should all be the
> same on every page in the mapping, which I think is true.
FWIW, it's a bit bonkers to keep all this duplicate xpfo metadata for
compound pages. You could probably get away with only keeping it for
the head page.
On Wed, Sep 20, 2017 at 04:21:15PM -0700, Dave Hansen wrote:
> On 09/20/2017 03:34 PM, Tycho Andersen wrote:
> >> I really have to wonder whether there are better ret2dir defenses than
> >> this. The allocator just seems like the *wrong* place to be doing this
> >> because it's such a hot path.
> >
> > This might be crazy, but what if we defer flushing of the kernel
> > ranges until just before we return to userspace? We'd still manipulate
> > the prot/xpfo bits for the pages, but then just keep a list of which
> > ranges need to be flushed, and do the right thing before we return.
> > This leaves a little window between the actual allocation and the
> > flush, but userspace would need another thread in its threadgroup to
> > predict the next allocation, write the bad stuff there, and do the
> > exploit all in that window.
>
> I think the common case is still that you enter the kernel, allocate a
> single page (or very few) and then exit. So, you don't really reduce
> the total number of flushes.
>
> Just think of this in terms of IPIs to do the remote TLB flushes. A CPU
> can do roughly 1 million page faults and allocations a second. Say you
> have a 2-socket x 28-core x 2 hyperthead system = 112 CPU threads.
> That's 111M IPI interrupts/second, just for the TLB flushes, *ON* *EACH*
> *CPU*.
Since we only need to flush when something switches from a userspace
to a kernel page or back, hopefully it's not this bad, but point
taken.
> I think the only thing that will really help here is if you batch the
> allocations. For instance, you could make sure that the per-cpu-pageset
> lists always contain either all kernel or all user data. Then remap the
> entire list at once and do a single flush after the entire list is consumed.
Just so I understand, the idea would be that we only flush when the
type of allocation alternates, so:
kmalloc(..., GFP_KERNEL);
kmalloc(..., GFP_KERNEL);
/* remap+flush here */
kmalloc(..., GFP_HIGHUSER);
/* remap+flush here */
kmalloc(..., GFP_KERNEL);
?
Tycho
On 09/20/2017 05:09 PM, Tycho Andersen wrote:
>> I think the only thing that will really help here is if you batch the
>> allocations. For instance, you could make sure that the per-cpu-pageset
>> lists always contain either all kernel or all user data. Then remap the
>> entire list at once and do a single flush after the entire list is consumed.
> Just so I understand, the idea would be that we only flush when the
> type of allocation alternates, so:
>
> kmalloc(..., GFP_KERNEL);
> kmalloc(..., GFP_KERNEL);
> /* remap+flush here */
> kmalloc(..., GFP_HIGHUSER);
> /* remap+flush here */
> kmalloc(..., GFP_KERNEL);
Not really. We keep a free list per migrate type, and a per_cpu_pages
(pcp) list per migratetype:
> struct per_cpu_pages {
> int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> int batch; /* chunk size for buddy add/remove */
>
> /* Lists of pages, one per migrate type stored on the pcp-lists */
> struct list_head lists[MIGRATE_PCPTYPES];
> };
The migratetype is derived from the GFP flags in
gfpflags_to_migratetype(). In general, GFP_HIGHUSER and GFP_KERNEL map to
different migratetypes, so they come from different free lists.
In your case above, the GFP_HIGHUSER allocations come through the
MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
MIGRATE_UNMOVABLE one. Since we add a bunch of pages to those lists at
once, you could do all the mapping/unmapping/flushing on a bunch of
pages at once.
Or, you could hook your code into the places where the migratetype of
memory is changed (set_pageblock_migratetype(), plus where we fall
back). Those changes are much more rare than page allocation.
At a high level, does this approach keep an attacker from being able to
determine the address of data in the linear map, or does it keep them
from being able to *exploit* it? Can you have a ret2dir attack if the
attacker doesn't know the address, for instance?
On Wed, Sep 20, 2017 at 05:28:11PM -0700, Dave Hansen wrote:
> At a high level, does this approach keep an attacker from being able to
> determine the address of data in the linear map, or does it keep them
> from being able to *exploit* it?
It keeps them from exploiting it, by faulting when a physmap alias is
used.
> Can you have a ret2dir attack if the attacker doesn't know the
> address, for instance?
Yes, through a technique similar to heap spraying. The original paper
has a study of this, section 5.2 outlines the attack and 7.2 describes
their success rate:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
Tycho
On Wed, Sep 20, 2017 at 05:27:02PM -0700, Dave Hansen wrote:
> On 09/20/2017 05:09 PM, Tycho Andersen wrote:
> >> I think the only thing that will really help here is if you batch the
> >> allocations. For instance, you could make sure that the per-cpu-pageset
> >> lists always contain either all kernel or all user data. Then remap the
> >> entire list at once and do a single flush after the entire list is consumed.
> > Just so I understand, the idea would be that we only flush when the
> > type of allocation alternates, so:
> >
> > kmalloc(..., GFP_KERNEL);
> > kmalloc(..., GFP_KERNEL);
> > /* remap+flush here */
> > kmalloc(..., GFP_HIGHUSER);
> > /* remap+flush here */
> > kmalloc(..., GFP_KERNEL);
>
> Not really. We keep a free list per migrate type, and a per_cpu_pages
> (pcp) list per migratetype:
>
> > struct per_cpu_pages {
> > int count; /* number of pages in the list */
> > int high; /* high watermark, emptying needed */
> > int batch; /* chunk size for buddy add/remove */
> >
> > /* Lists of pages, one per migrate type stored on the pcp-lists */
> > struct list_head lists[MIGRATE_PCPTYPES];
> > };
>
> The migratetype is derived from the GFP flags in
> gfpflags_to_migratetype(). In general, GFP_HIGHUSER and GFP_KERNEL come
> from different migratetypes, so they come from different free lists.
>
> In your case above, the GFP_HIGHUSER allocation come through the
> MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
> MIGRATE_UNMOVABLE one. Since we add a bunch of pages to those lists at
> once, you could do all the mapping/unmapping/flushing on a bunch of
> pages at once
>
> Or, you could hook your code into the places where the migratetype of
> memory is changed (set_pageblock_migratetype(), plus where we fall
> back). Those changes are much more rare than page allocation.
I see, thanks for all this discussion. It has been very helpful!
Tycho
On Tue, Nov 14, 2017 at 11:00:20PM -0800, Dave Hansen wrote:
> On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> > We don't need to kmap in order to access MOVABLE allocations. kmap is
> > only needed for HIGHMEM allocations. So there's nothing wrong with ext4
> > or set_bh_page().
>
> Yeah, it's definitely not _buggy_.
>
> Although, I do wonder what we should do about these for XPFO. Should we
> just stick a kmap() in there and comment it? What we really need is a
> mechanism to say "use this as a kernel page" and "stop using this as a
> kernel page". kmap() does that... kinda. It's not a perfect fit, but
> it's pretty close.
It'd be kind of funny if getting XPFO working better means improving
how well Linux runs on 32-bit machines with HIGHMEM. I think there's
always going to be interest in those -- ARM developed 36 bit physmem
before biting the bullet and going to arm64. Maybe OpenRISC will do
that next ;-)
On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> On Mon, Nov 13, 2017 at 02:46:25PM -0800, Dave Hansen wrote:
>> On 11/13/2017 02:20 PM, Dave Hansen wrote:
>>> On 11/09/2017 05:09 PM, Tycho Andersen wrote:
>>>> which I guess is from the additional flags in grow_dev_page() somewhere down
>>>> the stack. Anyway... it seems this is a kernel allocation that's using
>>>> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
>>>> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
>>>>
>>>> Do you have any ideas?
>>>
>>> It still has to do a kmap() or kmap_atomic() to be able to access it. I
>>> thought you hooked into that. Why isn't that path getting hit for these?
>>
>> Oh, this looks to be accessing data mapped by a buffer_head. It
>> (rudely) accesses data via:
>>
>> void set_bh_page(struct buffer_head *bh,
>> ...
>> bh->b_data = page_address(page) + offset;
>
> We don't need to kmap in order to access MOVABLE allocations. kmap is
> only needed for HIGHMEM allocations. So there's nothing wrong with ext4
> or set_bh_page().
Yeah, it's definitely not _buggy_.
Although, I do wonder what we should do about these for XPFO. Should we
just stick a kmap() in there and comment it? What we really need is a
mechanism to say "use this as a kernel page" and "stop using this as a
kernel page". kmap() does that... kinda. It's not a perfect fit, but
it's pretty close.
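Something like this, just to illustrate the bracketing (bh_write_bytes()
is made up, not anything in fs/buffer.c):

#include <linux/highmem.h>
#include <linux/string.h>

static void bh_write_bytes(struct page *page, unsigned long offset,
                           const void *src, size_t len)
{
        void *data = kmap(page);        /* "start using as a kernel page" */

        memcpy(data + offset, src, len);
        kunmap(page);                   /* "stop using as a kernel page" */
}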
On Mon, Nov 13, 2017 at 02:46:25PM -0800, Dave Hansen wrote:
> On 11/13/2017 02:20 PM, Dave Hansen wrote:
> > On 11/09/2017 05:09 PM, Tycho Andersen wrote:
> >> which I guess is from the additional flags in grow_dev_page() somewhere down
> >> the stack. Anyway... it seems this is a kernel allocation that's using
> >> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
> >> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
> >>
> >> Do you have any ideas?
> >
> > It still has to do a kmap() or kmap_atomic() to be able to access it. I
> > thought you hooked into that. Why isn't that path getting hit for these?
>
> Oh, this looks to be accessing data mapped by a buffer_head. It
> (rudely) accesses data via:
>
> void set_bh_page(struct buffer_head *bh,
> ...
> bh->b_data = page_address(page) + offset;
We don't need to kmap in order to access MOVABLE allocations. kmap is
only needed for HIGHMEM allocations. So there's nothing wrong with ext4
or set_bh_page().
On 11/13/2017 02:20 PM, Dave Hansen wrote:
> On 11/09/2017 05:09 PM, Tycho Andersen wrote:
>> which I guess is from the additional flags in grow_dev_page() somewhere down
>> the stack. Anyway... it seems this is a kernel allocation that's using
>> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
>> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
>>
>> Do you have any ideas?
>
> It still has to do a kmap() or kmap_atomic() to be able to access it. I
> thought you hooked into that. Why isn't that path getting hit for these?
Oh, this looks to be accessing data mapped by a buffer_head. It
(rudely) accesses data via:
void set_bh_page(struct buffer_head *bh,
...
bh->b_data = page_address(page) + offset;
On 11/09/2017 05:09 PM, Tycho Andersen wrote:
> which I guess is from the additional flags in grow_dev_page() somewhere down
> the stack. Anyway... it seems this is a kernel allocation that's using
> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
>
> Do you have any ideas?
It still has to do a kmap() or kmap_atomic() to be able to access it. I
thought you hooked into that. Why isn't that path getting hit for these?
Hi Dave,
On Wed, Sep 20, 2017 at 05:27:02PM -0700, Dave Hansen wrote:
> On 09/20/2017 05:09 PM, Tycho Andersen wrote:
> >> I think the only thing that will really help here is if you batch the
> >> allocations. For instance, you could make sure that the per-cpu-pageset
> >> lists always contain either all kernel or all user data. Then remap the
> >> entire list at once and do a single flush after the entire list is consumed.
> > Just so I understand, the idea would be that we only flush when the
> > type of allocation alternates, so:
> >
> > kmalloc(..., GFP_KERNEL);
> > kmalloc(..., GFP_KERNEL);
> > /* remap+flush here */
> > kmalloc(..., GFP_HIGHUSER);
> > /* remap+flush here */
> > kmalloc(..., GFP_KERNEL);
>
> Not really. We keep a free list per migrate type, and a per_cpu_pages
> (pcp) list per migratetype:
>
> > struct per_cpu_pages {
> > int count; /* number of pages in the list */
> > int high; /* high watermark, emptying needed */
> > int batch; /* chunk size for buddy add/remove */
> >
> > /* Lists of pages, one per migrate type stored on the pcp-lists */
> > struct list_head lists[MIGRATE_PCPTYPES];
> > };
>
> The migratetype is derived from the GFP flags in
> gfpflags_to_migratetype(). In general, GFP_HIGHUSER and GFP_KERNEL come
> from different migratetypes, so they come from different free lists.
>
> In your case above, the GFP_HIGHUSER allocation come through the
> MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
> MIGRATE_UNMOVABLE one. Since we add a bunch of pages to those lists at
> once, you could do all the mapping/unmapping/flushing on a bunch of
> pages at once
So I've been playing around with an implementation of this, which is basically:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d9c1b486e1f..47b46ff1148a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2348,6 +2348,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (is_migrate_cma(get_pcppage_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
+ xpfo_pcp_refill(page, migratetype, order);
}
/*
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 080235a2f129..b381d83c6e78 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -260,3 +265,85 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
kunmap_atomic(mapping[i]);
}
EXPORT_SYMBOL(xpfo_temp_unmap);
+
+void xpfo_pcp_refill(struct page *page, enum migratetype migratetype, int order)
+{
+ int i;
+ bool flush_tlb = false;
+
+ if (!static_branch_unlikely(&xpfo_initialized))
+ return;
+
+ for (i = 0; i < 1 << order; i++) {
+ struct xpfo *xpfo;
+
+ xpfo = lookup_xpfo(page + i);
+ if (!xpfo)
+ continue;
+
+ if (unlikely(!xpfo->initialized)) {
+ spin_lock_init(&xpfo->maplock);
+ atomic_set(&xpfo->mapcount, 0);
+ xpfo->initialized = true;
+ }
+
+ xpfo->trace.max_entries = 20;
+ xpfo->trace.skip = 1;
+ xpfo->trace.entries = xpfo->entries;
+ xpfo->trace.nr_entries = 0;
+ xpfo->trace2.max_entries = 20;
+ xpfo->trace2.skip = 1;
+ xpfo->trace2.entries = xpfo->entries2;
+ xpfo->trace2.nr_entries = 0;
+
+ xpfo->migratetype = migratetype;
+
+ save_stack_trace(&xpfo->trace);
+
+ if (migratetype == MIGRATE_MOVABLE) {
+			/* GFP_HIGHUSER */
+ set_kpte(page_address(page + i), page + i, __pgprot(0));
+ if (!test_and_set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ flush_tlb = true;
+ set_bit(XPFO_PAGE_USER, &xpfo->flags);
+ } else {
+ /*
+ * GFP_KERNEL and everything else; for now we just
+ * leave it mapped
+ */
+ set_kpte(page_address(page + i), page + i, PAGE_KERNEL);
+ if (test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ flush_tlb = true;
+ clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+ }
+ }
+
+ if (flush_tlb)
+ xpfo_flush_kernel_tlb(page, order);
+}
+
But I'm getting some faults:
[ 1.897311] BUG: unable to handle kernel paging request at ffff880139b75012
[ 1.898244] IP: ext4_fill_super+0x2f3b/0x33c0
[ 1.898827] PGD 1ea6067
[ 1.898828] P4D 1ea6067
[ 1.899170] PUD 1ea9067
[ 1.899508] PMD 119478063
[ 1.899850] PTE 139b75000
[ 1.900211]
[ 1.900760] Oops: 0000 [#1] SMP
[ 1.901160] Modules linked in:
[ 1.901565] CPU: 3 PID: 990 Comm: exe Not tainted 4.13.0+ #85
[ 1.902348] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1.903420] task: ffff88011ae7cb00 task.stack: ffffc9001a338000
[ 1.904108] RIP: 0010:ext4_fill_super+0x2f3b/0x33c0
[ 1.904649] RSP: 0018:ffffc9001a33bce0 EFLAGS: 00010246
[ 1.905240] RAX: 00000000000000f0 RBX: ffff880139b75000 RCX: ffffffff81c456b8
[ 1.906047] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246
[ 1.906874] RBP: ffffc9001a33bda8 R08: 0000000000000000 R09: 0000000000000183
[ 1.908053] R10: ffff88011a9e0800 R11: ffffffff818493e0 R12: ffff88011a9e0800
[ 1.908920] R13: ffff88011a9e6800 R14: 000000000077fefa R15: 0000000000000000
[ 1.909775] FS: 00007f8169747700(0000) GS:ffff880139d80000(0000) knlGS:0000000000000000
[ 1.910667] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.911293] CR2: ffff880139b75012 CR3: 000000011a965000 CR4: 00000000000006e0
[ 1.912050] Call Trace:
[ 1.912356] ? register_shrinker+0x80/0x90
[ 1.912826] mount_bdev+0x177/0x1b0
[ 1.913234] ? ext4_calculate_overhead+0x4a0/0x4a0
[ 1.913744] ext4_mount+0x10/0x20
[ 1.914115] mount_fs+0x2d/0x140
[ 1.914490] ? __alloc_percpu+0x10/0x20
[ 1.914903] vfs_kern_mount.part.20+0x58/0x110
[ 1.915394] do_mount+0x1cc/0xca0
[ 1.915758] ? _copy_from_user+0x6b/0xa0
[ 1.916198] ? memdup_user+0x3d/0x70
[ 1.916576] SyS_mount+0x93/0xe0
[ 1.916915] entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 1.917401] RIP: 0033:0x7f8169264b5a
[ 1.917777] RSP: 002b:00007fff6ce82bc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[ 1.918576] RAX: ffffffffffffffda RBX: 0000000000fb2030 RCX: 00007f8169264b5a
[ 1.919313] RDX: 00007fff6ce84e61 RSI: 00007fff6ce84e70 RDI: 00007fff6ce84e66
[ 1.920042] RBP: 0000000000008000 R08: 0000000000000000 R09: 0000000000000000
[ 1.920771] R10: 0000000000008001 R11: 0000000000000202 R12: 0000000000000000
[ 1.921512] R13: 0000000000000000 R14: 00007fff6ce82c70 R15: 0000000000445c20
[ 1.922254] Code: 83 ee 01 48 c7 c7 70 e6 97 81 e8 1d 0c e2 ff 48 89 de 48 c7 c7 a4 48 96 81 e8 0e 0c e2 ff 8b 85 5c ff ff ff 41 39 44 24 40 75 0e <f6> 43 12 04 41 0f 44 c7 89 85 5c ff ff ff 48 c7 c7 ad 48 96 81
[ 1.924489] RIP: ext4_fill_super+0x2f3b/0x33c0 RSP: ffffc9001a33bce0
[ 1.925334] CR2: ffff880139b75012
[ 1.942161] ---[ end trace fe884f328a0a7338 ]---
This is the code:
if ((grp == sbi->s_groups_count) &&
!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
in fs/ext4/super.c:ext4_check_descriptors() that's ultimately failing. It looks
like this allocation comes from sb_bread_unmovable(), which although it says
unmovable, seems to allocate the memory with:
MOVABLE
IO
NOFAIL
HARDWALL
DIRECT_RECLAIM
KSWAPD_RECLAIM
which I guess is from the additional flags in grow_dev_page() somewhere down
the stack. Anyway... it seems this is a kernel allocation that's using
MIGRATE_MOVABLE, so perhaps we need a more finely tuned heuristic than just
unmapping all MOVABLE allocations via xpfo and keeping all the others mapped.
Do you have any ideas?
> Or, you could hook your code into the places where the migratetype of
> memory is changed (set_pageblock_migratetype(), plus where we fall
> back). Those changes are much more rare than page allocation.
I guess this has the same issue, that sometimes the kernel allocates MOVABLE
stuff that it wants to use.
Thanks,
Tycho