2017-08-09 20:08:56

by Tycho Andersen

Subject: [PATCH v5 00/10] Add support for eXclusive Page Frame Ownership

Hi all,

Here's a v5 of the XPFO set. Changes from v4 are:

* huge page support now actually works on x86
* arm64 support, which boots on several different arm64 boards
* tests for huge page support via LKDTM as well (thanks to Kees for suggesting
  how to make this work)

Patch 2 contains some potentially controversial changes: it exposes cpa_lock
and lifts some other static functions out of arch/x86/mm/pageattr.c. There is
probably a better way to do this; thoughts welcome.

Still to do are:

* get it to work with non-64k pages on ARM
* get rid of the BUG()s, in favor of WARN or similar
* other things people come up with in this review

Please have a look. Thoughts welcome!

Previously: http://www.openwall.com/lists/kernel-hardening/2017/06/07/24

Tycho

Juerg Haefliger (8):
mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
swiotlb: Map the buffer if it was unmapped by XPFO
arm64: Add __flush_tlb_one()
arm64/mm: Add support for XPFO
arm64/mm: Disable section mappings if XPFO is enabled
arm64/mm: Don't flush the data cache if the page is unmapped by XPFO
arm64/mm: Add support for XPFO to swiotlb
lkdtm: Add test for XPFO

Tycho Andersen (2):
mm: add MAP_HUGETLB support to vm_mmap
mm: add a user_virt_to_phys symbol

Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/cacheflush.h | 11 ++
arch/arm64/include/asm/tlbflush.h | 8 +
arch/arm64/mm/Makefile | 2 +
arch/arm64/mm/dma-mapping.c | 32 ++--
arch/arm64/mm/flush.c | 5 +-
arch/arm64/mm/mmu.c | 14 +-
arch/arm64/mm/xpfo.c | 160 +++++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 23 +++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/pageattr.c | 24 +--
arch/x86/mm/xpfo.c | 153 +++++++++++++++++
drivers/misc/Makefile | 1 +
drivers/misc/lkdtm.h | 4 +
drivers/misc/lkdtm_core.c | 4 +
drivers/misc/lkdtm_xpfo.c | 62 +++++++
include/linux/highmem.h | 15 +-
include/linux/mm.h | 2 +
include/linux/xpfo.h | 47 +++++
lib/swiotlb.c | 3 +-
mm/Makefile | 1 +
mm/mmap.c | 19 +--
mm/page_alloc.c | 2 +
mm/page_ext.c | 4 +
mm/util.c | 32 ++++
mm/xpfo.c | 217 ++++++++++++++++++++++++
security/Kconfig | 19 +++
29 files changed, 810 insertions(+), 59 deletions(-)
create mode 100644 arch/arm64/mm/xpfo.c
create mode 100644 arch/x86/mm/xpfo.c
create mode 100644 drivers/misc/lkdtm_xpfo.c
create mode 100644 include/linux/xpfo.h
create mode 100644 mm/xpfo.c

--
2.11.0


2017-08-09 20:09:09

by Tycho Andersen

Subject: [PATCH v5 09/10] mm: add a user_virt_to_phys symbol

We need something like this for testing XPFO. Since it's architecture-specific,
putting it in the test code is slightly awkward, so let's make it an
arch-specific symbol and export it for use in LKDTM.
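
For context, a rough sketch of the intended use from a module, mirroring what
the LKDTM test in patch 10 does (illustration only; demo_read_linear_alias()
is a made-up name, not part of this series):

/*
 * Translate a user VA and touch the same memory through the kernel's
 * linear map. With XPFO enabled, the final read is expected to fault.
 */
static void demo_read_linear_alias(unsigned long user_addr)
{
	phys_addr_t phys = user_virt_to_phys(user_addr);
	void *linear;

	if (!phys)
		return;

	linear = phys_to_virt(phys);
	pr_info("value via linear map alias: %lx\n",
		*(unsigned long *)linear);
}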

Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
arch/arm64/mm/xpfo.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/xpfo.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/xpfo.h | 4 ++++
3 files changed, 112 insertions(+)

diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
index c4deb2b720cf..a221799a9242 100644
--- a/arch/arm64/mm/xpfo.c
+++ b/arch/arm64/mm/xpfo.c
@@ -107,3 +107,54 @@ inline void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,

local_irq_restore(flags);
}
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(current->mm, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+
+ if (pud_sect(*pud) || !pud_present(*pud)) {
+ phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+ offset = addr & ~PUD_MASK;
+ goto out;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+
+ if (pmd_sect(*pmd) || !pmd_present(*pmd)) {
+ phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+ offset = addr & ~PMD_MASK;
+ goto out;
+ }
+
+ pte = pte_offset_kernel(pmd, addr);
+ phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ offset = addr & ~PAGE_MASK;
+
+out:
+ return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 3635b37f2fc5..a1344f27406c 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -94,3 +94,60 @@ inline void xpfo_flush_kernel_page(struct page *page, int order)

flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(current->mm, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+ phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+ offset = addr & ~P4D_MASK;
+ goto out;
+ }
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+
+ if (pud_large(*pud) || !pud_present(*pud)) {
+ phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+ offset = addr & ~PUD_MASK;
+ goto out;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+
+ if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+ phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+ offset = addr & ~PMD_MASK;
+ goto out;
+ }
+
+ pte = pte_offset_kernel(pmd, addr);
+ phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ offset = addr & ~PAGE_MASK;
+
+out:
+ return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 6b61f7b820f4..449cd8dbf064 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -16,6 +16,8 @@

#ifdef CONFIG_XPFO

+#include <linux/types.h>
+
extern struct page_ext_operations page_xpfo_ops;

void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
@@ -29,6 +31,8 @@ void xpfo_free_pages(struct page *page, int order);

bool xpfo_page_is_unmapped(struct page *page);

+extern phys_addr_t user_virt_to_phys(unsigned long addr);
+
#else /* !CONFIG_XPFO */

static inline void xpfo_kmap(void *kaddr, struct page *page) { }
--
2.11.0

2017-08-09 20:09:06

by Tycho Andersen

Subject: [PATCH v5 07/10] arm64/mm: Don't flush the data cache if the page is unmapped by XPFO

From: Juerg Haefliger <[email protected]>

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault. So don't flush in that case.

Signed-off-by: Juerg Haefliger <[email protected]>
Tested-by: Tycho Andersen <[email protected]>
---
arch/arm64/mm/flush.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 21a8d828cbf4..e17a063b2df2 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/xpfo.h>

#include <asm/cacheflush.h>
#include <asm/cache.h>
@@ -30,7 +31,9 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
unsigned long addr = (unsigned long)kaddr;

if (icache_is_aliasing()) {
- __clean_dcache_area_pou(kaddr, len);
+ /* Don't flush if the page is unmapped by XPFO */
+ if (!xpfo_page_is_unmapped(virt_to_page(kaddr)))
+ __clean_dcache_area_pou(kaddr, len);
__flush_icache_all();
} else {
flush_icache_range(addr, addr + len);
--
2.11.0

2017-08-09 20:09:26

by Tycho Andersen

Subject: [PATCH v5 10/10] lkdtm: Add test for XPFO

From: Juerg Haefliger <[email protected]>

This test simply reads from userspace memory via the kernel's linear
map.

Huge pages are only supported on x86 right now, hence the ifdef.
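
As with the other LKDTM crash types, the new tests can typically be triggered
by writing their names (e.g. XPFO_READ_USER) to
/sys/kernel/debug/provoke-crash/DIRECT; with XPFO enabled the read is expected
to fault instead of returning the user data.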

Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
drivers/misc/Makefile | 1 +
drivers/misc/lkdtm.h | 4 +++
drivers/misc/lkdtm_core.c | 4 +++
drivers/misc/lkdtm_xpfo.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 71 insertions(+)

diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index b0b766416306..8447b42a447d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -62,6 +62,7 @@ lkdtm-$(CONFIG_LKDTM) += lkdtm_heap.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_perms.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_rodata_objcopy.o
lkdtm-$(CONFIG_LKDTM) += lkdtm_usercopy.o
+lkdtm-$(CONFIG_LKDTM) += lkdtm_xpfo.o

KCOV_INSTRUMENT_lkdtm_rodata.o := n

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 3b4976396ec4..fc53546113c1 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -64,4 +64,8 @@ void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
void lkdtm_USERCOPY_STACK_BEYOND(void);
void lkdtm_USERCOPY_KERNEL(void);

+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+
#endif
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 42d2b8e31e6b..9431f80886bc 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -235,6 +235,10 @@ struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
CRASHTYPE(USERCOPY_STACK_BEYOND),
CRASHTYPE(USERCOPY_KERNEL),
+ CRASHTYPE(XPFO_READ_USER),
+#ifdef CONFIG_X86
+ CRASHTYPE(XPFO_READ_USER_HUGE),
+#endif
};


diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
new file mode 100644
index 000000000000..c72509128eb3
--- /dev/null
+++ b/drivers/misc/lkdtm_xpfo.c
@@ -0,0 +1,62 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/xpfo.h>
+
+void read_user_with_flags(unsigned long flags)
+{
+ unsigned long user_addr, user_data = 0xdeadbeef;
+ phys_addr_t phys_addr;
+ void *virt_addr;
+
+ user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+ PROT_READ | PROT_WRITE | PROT_EXEC,
+ flags, 0);
+ if (user_addr >= TASK_SIZE) {
+ pr_warn("Failed to allocate user memory\n");
+ return;
+ }
+
+ if (copy_to_user((void __user *)user_addr, &user_data,
+ sizeof(user_data))) {
+ pr_warn("copy_to_user failed\n");
+ goto free_user;
+ }
+
+ phys_addr = user_virt_to_phys(user_addr);
+ if (!phys_addr) {
+ pr_warn("Failed to get physical address of user memory\n");
+ goto free_user;
+ }
+
+ virt_addr = phys_to_virt(phys_addr);
+ if (phys_addr != virt_to_phys(virt_addr)) {
+ pr_warn("Physical address of user memory seems incorrect\n");
+ goto free_user;
+ }
+
+ pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+ if (*(unsigned long *)virt_addr == user_data)
+ pr_info("Huh? Bad read succeeded?!\n");
+ else
+ pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
+
+ free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
+}
--
2.11.0

2017-08-09 20:09:04

by Tycho Andersen

Subject: [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

From: Juerg Haefliger <[email protected]>

XPFO (eXclusive Page Frame Ownership) doesn't support section mappings
yet, so disable section mappings if XPFO is turned on.
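(With section mappings disabled, the kernel's linear map is built from regular
page-size entries, which set_kpte() can update directly instead of having to
split block mappings at runtime.)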

Signed-off-by: Juerg Haefliger <[email protected]>
Tested-by: Tycho Andersen <[email protected]>
---
arch/arm64/mm/mmu.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..38026b3ccb46 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -176,6 +176,18 @@ static void alloc_init_cont_pte(pmd_t *pmd, unsigned long addr,
} while (addr = next, addr != end);
}

+static inline bool use_section_mapping(unsigned long addr, unsigned long next,
+ unsigned long phys)
+{
+ if (IS_ENABLED(CONFIG_XPFO))
+ return false;
+
+ if (((addr | next | phys) & ~SECTION_MASK) != 0)
+ return false;
+
+ return true;
+}
+
static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(void), int flags)
@@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
next = pmd_addr_end(addr, end);

/* try section mapping first */
- if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
+ if (use_section_mapping(addr, next, phys) &&
(flags & NO_BLOCK_MAPPINGS) == 0) {
pmd_set_huge(pmd, phys, prot);

--
2.11.0

2017-08-09 20:09:49

by Tycho Andersen

Subject: [PATCH v5 08/10] arm64/mm: Add support for XPFO to swiotlb

From: Juerg Haefliger <[email protected]>

Pages that are unmapped by XPFO need to be mapped before the
__dma_{map,unmap}_area() operations and unmapped again afterwards (to
restore the original state) in order to prevent fatal page faults.
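The patch does this by routing the existing __dma_{map,unmap}_area() callers
through new _dma_map_area()/_dma_unmap_area() wrappers which, with CONFIG_XPFO
enabled, temporarily kmap_atomic() any XPFO-unmapped pages around the cache
maintenance and kunmap_atomic() them again afterwards.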

Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
---
arch/arm64/include/asm/cacheflush.h | 11 +++++++++
arch/arm64/mm/dma-mapping.c | 32 +++++++++++++-------------
arch/arm64/mm/xpfo.c | 45 +++++++++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d74a284abdc2..b6a462e3b2f9 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -93,6 +93,17 @@ extern void __dma_map_area(const void *, size_t, int);
extern void __dma_unmap_area(const void *, size_t, int);
extern void __dma_flush_area(const void *, size_t);

+#ifdef CONFIG_XPFO
+#include <linux/xpfo.h>
+#define _dma_map_area(addr, size, dir) \
+ xpfo_dma_map_unmap_area(true, addr, size, dir)
+#define _dma_unmap_area(addr, size, dir) \
+ xpfo_dma_map_unmap_area(false, addr, size, dir)
+#else
+#define _dma_map_area(addr, size, dir) __dma_map_area(addr, size, dir)
+#define _dma_unmap_area(addr, size, dir) __dma_unmap_area(addr, size, dir)
+#endif
+
/*
* Copy user data from/to a page which is mapped into a different
* processes address space. Really, we want to allow our "user
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index f27d4dd04384..a79f200786ab 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -204,7 +204,7 @@ static dma_addr_t __swiotlb_map_page(struct device *dev, struct page *page,
dev_addr = swiotlb_map_page(dev, page, offset, size, dir, attrs);
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);

return dev_addr;
}
@@ -216,7 +216,7 @@ static void __swiotlb_unmap_page(struct device *dev, dma_addr_t dev_addr,
{
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
swiotlb_unmap_page(dev, dev_addr, size, dir, attrs);
}

@@ -231,8 +231,8 @@ static int __swiotlb_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
for_each_sg(sgl, sg, ret, i)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);

return ret;
}
@@ -248,8 +248,8 @@ static void __swiotlb_unmap_sg_attrs(struct device *dev,
if (!is_device_dma_coherent(dev) &&
(attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
swiotlb_unmap_sg_attrs(dev, sgl, nelems, dir, attrs);
}

@@ -258,7 +258,7 @@ static void __swiotlb_sync_single_for_cpu(struct device *dev,
enum dma_data_direction dir)
{
if (!is_device_dma_coherent(dev))
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
swiotlb_sync_single_for_cpu(dev, dev_addr, size, dir);
}

@@ -268,7 +268,7 @@ static void __swiotlb_sync_single_for_device(struct device *dev,
{
swiotlb_sync_single_for_device(dev, dev_addr, size, dir);
if (!is_device_dma_coherent(dev))
- __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
}

static void __swiotlb_sync_sg_for_cpu(struct device *dev,
@@ -280,8 +280,8 @@ static void __swiotlb_sync_sg_for_cpu(struct device *dev,

if (!is_device_dma_coherent(dev))
for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
swiotlb_sync_sg_for_cpu(dev, sgl, nelems, dir);
}

@@ -295,8 +295,8 @@ static void __swiotlb_sync_sg_for_device(struct device *dev,
swiotlb_sync_sg_for_device(dev, sgl, nelems, dir);
if (!is_device_dma_coherent(dev))
for_each_sg(sgl, sg, nelems, i)
- __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
- sg->length, dir);
+ _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
+ sg->length, dir);
}

static int __swiotlb_mmap_pfn(struct vm_area_struct *vma,
@@ -758,7 +758,7 @@ static void __iommu_sync_single_for_cpu(struct device *dev,
return;

phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
- __dma_unmap_area(phys_to_virt(phys), size, dir);
+ _dma_unmap_area(phys_to_virt(phys), size, dir);
}

static void __iommu_sync_single_for_device(struct device *dev,
@@ -771,7 +771,7 @@ static void __iommu_sync_single_for_device(struct device *dev,
return;

phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
- __dma_map_area(phys_to_virt(phys), size, dir);
+ _dma_map_area(phys_to_virt(phys), size, dir);
}

static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
@@ -811,7 +811,7 @@ static void __iommu_sync_sg_for_cpu(struct device *dev,
return;

for_each_sg(sgl, sg, nelems, i)
- __dma_unmap_area(sg_virt(sg), sg->length, dir);
+ _dma_unmap_area(sg_virt(sg), sg->length, dir);
}

static void __iommu_sync_sg_for_device(struct device *dev,
@@ -825,7 +825,7 @@ static void __iommu_sync_sg_for_device(struct device *dev,
return;

for_each_sg(sgl, sg, nelems, i)
- __dma_map_area(sg_virt(sg), sg->length, dir);
+ _dma_map_area(sg_virt(sg), sg->length, dir);
}

static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
index de03a652d48a..c4deb2b720cf 100644
--- a/arch/arm64/mm/xpfo.c
+++ b/arch/arm64/mm/xpfo.c
@@ -11,8 +11,10 @@
* the Free Software Foundation.
*/

+#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/module.h>
+#include <linux/xpfo.h>

#include <asm/tlbflush.h>

@@ -62,3 +64,46 @@ inline void xpfo_flush_kernel_page(struct page *page, int order)

flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
+
+inline void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+ int dir)
+{
+ unsigned long flags;
+ struct page *page = virt_to_page(addr);
+
+ /*
+ * +2 here because we really want
+ * ceil(size / PAGE_SIZE), not floor(), and one extra in case things are
+ * not page aligned
+ */
+ int i, possible_pages = size / PAGE_SIZE + 2;
+ void *buf[possible_pages];
+
+ memset(buf, 0, sizeof(void *) * possible_pages);
+
+ local_irq_save(flags);
+
+ /* Map the first page */
+ if (xpfo_page_is_unmapped(page))
+ buf[0] = kmap_atomic(page);
+
+ /* Map the remaining pages */
+ for (i = 1; i < possible_pages; i++) {
+ if (page_to_virt(page + i) >= addr + size)
+ break;
+
+ if (xpfo_page_is_unmapped(page + i))
+ buf[i] = kmap_atomic(page + i);
+ }
+
+ if (map)
+ __dma_map_area(addr, size, dir);
+ else
+ __dma_unmap_area(addr, size, dir);
+
+ for (i = 0; i < possible_pages; i++)
+ if (buf[i])
+ kunmap_atomic(buf[i]);
+
+ local_irq_restore(flags);
+}
--
2.11.0

2017-08-09 20:09:01

by Tycho Andersen

Subject: [PATCH v5 02/10] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

From: Juerg Haefliger <[email protected]>

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
- two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
- a reference counter to balance kmap/kunmap operations.
- a lock to serialize access to the XPFO fields.
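
Condensed, the kmap/kunmap balancing works roughly as below (sketch only; the
real code in mm/xpfo.c adds locking and handles pages allocated before
page_ext was initialized):

void xpfo_kmap(void *kaddr, struct page *page)
{
	struct xpfo *xpfo = lookup_xpfo(page);

	if (!test_bit(XPFO_PAGE_USER, &xpfo->flags))
		return;		/* kernel-owned page, always mapped */

	/* The first mapper puts the page back into the kernel's page table. */
	if (atomic_inc_return(&xpfo->mapcount) == 1 &&
	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
		set_kpte(kaddr, page, PAGE_KERNEL);
}

void xpfo_kunmap(void *kaddr, struct page *page)
{
	struct xpfo *xpfo = lookup_xpfo(page);

	if (!test_bit(XPFO_PAGE_USER, &xpfo->flags))
		return;

	/* The last unmapper removes the mapping and flushes the TLB entry. */
	if (atomic_dec_return(&xpfo->mapcount) == 0) {
		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
		set_kpte(kaddr, page, __pgprot(0));
		__flush_tlb_one((unsigned long)kaddr);
	}
}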

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <[email protected]>
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Marco Benatto <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 23 +++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/pageattr.c | 24 +--
arch/x86/mm/xpfo.c | 96 +++++++++++
include/linux/highmem.h | 15 +-
include/linux/xpfo.h | 39 +++++
mm/Makefile | 1 +
mm/page_alloc.c | 2 +
mm/page_ext.c | 4 +
mm/xpfo.c | 208 ++++++++++++++++++++++++
security/Kconfig | 19 +++
13 files changed, 413 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..444d83183f75 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2736,6 +2736,8 @@

nox2apic [X86-64,APIC] Do not enable x2APIC mode.

+ noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on.
+
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b7cf9e..f37d408ab1f2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -184,6 +184,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select ARCH_SUPPORTS_XPFO if X86_64

config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..0c20379c034c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1238,6 +1238,29 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
return __pte_access_permitted(pud_val(pud), write);
}

+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+ unsigned long *vaddr;
+ pgd_t *pgd;
+ pgprot_t mask_set;
+ pgprot_t mask_clr;
+ unsigned long numpages;
+ int flags;
+ unsigned long pfn;
+ unsigned force_split : 1;
+ int curpage;
+ struct page **pages;
+};
+
+
+int
+try_preserve_large_page(pte_t *kpte, unsigned long address,
+ struct cpa_data *cpa);
+int split_large_page(struct cpa_data *cpa, pte_t *kpte,
+ unsigned long address);
+
#include <asm-generic/pgtable.h>
#endif /* __ASSEMBLY__ */

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 0fbdcb64f9f8..89ba6d25fb51 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -39,3 +39,4 @@ obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o

+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 757b0bcdf712..0a40be4708e9 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -27,28 +27,12 @@
#include <asm/set_memory.h>

/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
- unsigned long *vaddr;
- pgd_t *pgd;
- pgprot_t mask_set;
- pgprot_t mask_clr;
- unsigned long numpages;
- int flags;
- unsigned long pfn;
- unsigned force_split : 1;
- int curpage;
- struct page **pages;
-};
-
-/*
* Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
* using cpa_lock. So that we don't allow any other cpu, with stale large tlb
* entries change the page attribute in parallel to some other cpu
* splitting a large page entry along with changing the attribute.
*/
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);

#define CPA_FLUSHTLB 1
#define CPA_ARRAY 2
@@ -512,7 +496,7 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
#endif
}

-static int
+int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
@@ -746,8 +730,8 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
return 0;
}

-static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
- unsigned long address)
+int split_large_page(struct cpa_data *cpa, pte_t *kpte,
+ unsigned long address)
{
struct page *base;

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..3635b37f2fc5
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ unsigned int level;
+ pgprot_t msk_clr;
+ pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+ BUG_ON(!pte);
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+ break;
+ case PG_LEVEL_2M:
+ /* We need to check if it's a 2M page or a 1GB page before retrieving
+ * the pgprot info, as each one is extracted from a different page
+ * table level. */
+ msk_clr = pmd_pgprot(*(pmd_t*)pte);
+ case PG_LEVEL_1G: {
+ struct cpa_data cpa;
+ int do_split;
+
+ msk_clr = pud_pgprot(*(pud_t*)pte);
+
+ memset(&cpa, 0, sizeof(cpa));
+ cpa.vaddr = kaddr;
+ cpa.pages = &page;
+ cpa.mask_set = prot;
+ cpa.mask_clr = msk_clr;
+ cpa.numpages = 1;
+ cpa.flags = 0;
+ cpa.curpage = 0;
+ cpa.force_split = 0;
+
+
+ do_split = try_preserve_large_page(pte, (unsigned long)kaddr, &cpa);
+ if (do_split) {
+ spin_lock(&cpa_lock);
+ BUG_ON(split_large_page(&cpa, pte, (unsigned long)kaddr));
+ spin_unlock(&cpa_lock);
+ }
+
+ break;
+ }
+ case PG_LEVEL_512G:
+ /* fallthrough, splitting infrastructure doesn't
+ * support 512G pages. */
+ default:
+ BUG();
+ }
+
+}
+
+inline void xpfo_flush_kernel_page(struct page *page, int order)
+{
+ int level;
+ unsigned long size, kaddr;
+
+ kaddr = (unsigned long)page_address(page);
+ lookup_address(kaddr, &level);
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ size = PAGE_SIZE;
+ break;
+ case PG_LEVEL_2M:
+ size = PMD_SIZE;
+ break;
+ case PG_LEVEL_1G:
+ size = PUD_SIZE;
+ break;
+ default:
+ BUG();
+ }
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/hardirq.h>
+#include <linux/xpfo.h>

#include <asm/cacheflush.h>

@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
#ifndef ARCH_HAS_KMAP
static inline void *kmap(struct page *page)
{
+ void *kaddr;
+
might_sleep();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}

static inline void kunmap(struct page *page)
{
+ xpfo_kunmap(page_address(page), page);
}

static inline void *kmap_atomic(struct page *page)
{
+ void *kaddr;
+
preempt_disable();
pagefault_disable();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}
#define kmap_atomic_prot(page, prot) kmap_atomic(page)

static inline void __kunmap_atomic(void *addr)
{
+ xpfo_kunmap(addr, virt_to_page(addr));
pagefault_enable();
preempt_enable();
}
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..1ff2d1976837
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size, int dir);
+void xpfo_flush_kernel_page(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..0be67cac8f6c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc32aa81f359..f83d8a384fde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1058,6 +1058,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order, 0);
kernel_map_pages(page, 1 << order, 0);
kasan_free_pages(page, order);
+ xpfo_free_pages(page, order);

return true;
}
@@ -1753,6 +1754,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
kernel_map_pages(page, 1 << order, 1);
kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
+ xpfo_alloc_pages(page, order, gfp_flags);
set_page_owner(page, order, gfp_flags);
}

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..4899df1f5d66 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
#include <linux/kmemleak.h>
#include <linux/page_owner.h>
#include <linux/page_idle.h>
+#include <linux/xpfo.h>

/*
* struct page extension
@@ -65,6 +66,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_XPFO
+ &page_xpfo_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..3cd45f68b5ad
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,208 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+ XPFO_PAGE_USER, /* Page is allocated to user-space */
+ XPFO_PAGE_UNMAPPED, /* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+ unsigned long flags; /* Page state */
+ bool inited; /* Map counter and lock initialized */
+ atomic_t mapcount; /* Counter for balancing map/unmap requests */
+ spinlock_t maplock; /* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+ xpfo_disabled = true;
+
+ return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+ if (xpfo_disabled) {
+ printk(KERN_INFO "XPFO disabled\n");
+ return false;
+ }
+
+ return true;
+}
+
+static void init_xpfo(void)
+{
+ printk(KERN_INFO "XPFO enabled\n");
+ static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+ .size = sizeof(struct xpfo),
+ .need = need_xpfo,
+ .init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+ return (void *)lookup_page_ext(page) + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+ int i, flush_tlb = 0;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+
+ BUG_ON(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags));
+
+ /* Initialize the map lock and map counter */
+ if (unlikely(!xpfo->inited)) {
+ spin_lock_init(&xpfo->maplock);
+ atomic_set(&xpfo->mapcount, 0);
+ xpfo->inited = true;
+ }
+ BUG_ON(atomic_read(&xpfo->mapcount));
+
+ if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+ /*
+ * Tag the page as a user page and flush the TLB if it
+ * was previously allocated to the kernel.
+ */
+ if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ flush_tlb = 1;
+ } else {
+ /* Tag the page as a non-user (kernel) page */
+ clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+ }
+ }
+
+ if (flush_tlb)
+ xpfo_flush_kernel_page(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+ int i;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+
+ if (unlikely(!xpfo->inited)) {
+ /*
+ * The page was allocated before page_ext was
+ * initialized, so it is a kernel page.
+ */
+ continue;
+ }
+
+ /*
+ * Map the page back into the kernel if it was previously
+ * allocated to user space.
+ */
+ if (test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags)) {
+ set_kpte(page_address(page + i), page + i,
+ PAGE_KERNEL);
+ }
+ }
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+ unsigned long flags;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (unlikely(!xpfo->inited) || !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock_irqsave(&xpfo->maplock, flags);
+
+ /*
+ * The page was previously allocated to user space, so map it back
+ * into the kernel. No TLB flush required.
+ */
+ if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+ test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ set_kpte(kaddr, page, PAGE_KERNEL);
+
+ spin_unlock_irqrestore(&xpfo->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+ unsigned long flags;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (unlikely(!xpfo->inited) || !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock_irqsave(&xpfo->maplock, flags);
+
+ /*
+ * The page is to be allocated back to user space, so unmap it from the
+ * kernel, flush the TLB and tag it as a user page.
+ */
+ if (atomic_dec_return(&xpfo->mapcount) == 0) {
+ BUG_ON(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags));
+ set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ set_kpte(kaddr, page, __pgprot(0));
+ __flush_tlb_one((unsigned long)kaddr);
+ }
+
+ spin_unlock_irqrestore(&xpfo->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..be5145eeed7d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"

source security/keys/Kconfig

+config ARCH_SUPPORTS_XPFO
+ bool
+
+config XPFO
+ bool "Enable eXclusive Page Frame Ownership (XPFO)"
+ default n
+ depends on ARCH_SUPPORTS_XPFO
+ select PAGE_EXTENSION
+ help
+ This option offers protection against 'ret2dir' kernel attacks.
+ When enabled, every time a page frame is allocated to user space, it
+ is unmapped from the direct mapped RAM region in kernel space
+ (physmap). Similarly, when a page frame is freed/reclaimed, it is
+ mapped back to physmap.
+
+ There is a slight performance impact when this option is enabled.
+
+ If in doubt, say "N".
+
config SECURITY_DMESG_RESTRICT
bool "Restrict unprivileged access to the kernel syslog"
default n
--
2.11.0

2017-08-09 20:10:26

by Tycho Andersen

Subject: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

From: Juerg Haefliger <[email protected]>

Signed-off-by: Juerg Haefliger <[email protected]>
Tested-by: Tycho Andersen <[email protected]>
---
include/linux/xpfo.h | 4 ++++
lib/swiotlb.c | 3 ++-
mm/xpfo.c | 9 +++++++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1ff2d1976837..6b61f7b820f4 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -27,6 +27,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
void xpfo_free_pages(struct page *page, int order);

+bool xpfo_page_is_unmapped(struct page *page);
+
#else /* !CONFIG_XPFO */

static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -34,6 +36,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
static inline void xpfo_free_pages(struct page *page, int order) { }

+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
#endif /* CONFIG_XPFO */

#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index a8d74a733a38..d4fee5ca2d9e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -420,8 +420,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
{
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+ struct page *page = pfn_to_page(pfn);

- if (PageHighMem(pfn_to_page(pfn))) {
+ if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
/* The buffer does not have a mapping. Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 3cd45f68b5ad..3f305f31a072 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -206,3 +206,12 @@ void xpfo_kunmap(void *kaddr, struct page *page)
spin_unlock_irqrestore(&xpfo->maplock, flags);
}
EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+ if (!static_branch_unlikely(&xpfo_inited))
+ return false;
+
+ return test_bit(XPFO_PAGE_UNMAPPED, &lookup_xpfo(page)->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
--
2.11.0

2017-08-09 20:10:24

by Tycho Andersen

Subject: [PATCH v5 04/10] arm64: Add __flush_tlb_one()

From: Juerg Haefliger <[email protected]>

Add a hook for flushing a single TLB entry on arm64.
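The generic XPFO code uses it in xpfo_kunmap() (mm/xpfo.c, added earlier in
the series) to invalidate the single linear-map entry it has just cleared.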

Signed-off-by: Juerg Haefliger <[email protected]>
Tested-by: Tycho Andersen <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index af1c76981911..8e0c49105d3e 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
isb();
}

+static inline void __flush_tlb_one(unsigned long addr)
+{
+ dsb(ishst);
+ __tlbi(vaae1is, addr >> 12);
+ dsb(ish);
+ isb();
+}
+
/*
* Used to invalidate the TLB (walk caches) corresponding to intermediate page
* table levels (pgd/pud/pmd).
--
2.11.0

2017-08-09 20:10:23

by Tycho Andersen

Subject: [PATCH v5 05/10] arm64/mm: Add support for XPFO

From: Juerg Haefliger <[email protected]>

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

At the moment, only 64k page sizes are supported.

Signed-off-by: Juerg Haefliger <[email protected]>
Tested-by: Tycho Andersen <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/Makefile | 2 ++
arch/arm64/mm/xpfo.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index dfd908630631..2ddae41e0793 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -121,6 +121,7 @@ config ARM64
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_XPFO if ARM64_64K_PAGES
help
ARM 64-bit (AArch64) Linux support.

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 9b0ba191e48e..22e5cab543d8 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n

obj-$(CONFIG_KASAN) += kasan_init.o
KASAN_SANITIZE_kasan_init.o := n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index 000000000000..de03a652d48a
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,64 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/tlbflush.h>
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ BUG_ON(pgd_bad(*pgd));
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ BUG_ON(pud_bad(*pud));
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ BUG_ON(pmd_bad(*pmd));
+
+ return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ pte_t *pte = lookup_address((unsigned long)kaddr);
+
+ set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_page(struct page *page, int order)
+{
+ unsigned long kaddr = (unsigned long)page_address(page);
+ unsigned long size = PAGE_SIZE;
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
--
2.11.0

2017-08-09 20:12:32

by Tycho Andersen

Subject: [PATCH v5 01/10] mm: add MAP_HUGETLB support to vm_mmap

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's add MAP_HUGETLB support to vm_mmap.
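
For illustration, the intended use from a module looks roughly like this
(sketch only, mirroring the LKDTM test added later in the series):

	unsigned long addr;

	addr = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0);
	if (addr >= TASK_SIZE) {
		pr_warn("hugetlb vm_mmap failed\n");
		return;
	}

	/* ... access the mapping ... */

	vm_munmap(addr, PAGE_SIZE);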

Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
---
include/linux/mm.h | 2 ++
mm/mmap.c | 19 +------------------
mm/util.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..e3bce22fe0f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2140,6 +2140,8 @@ struct vm_unmapped_area_info {
extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);

+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
/*
* Search for an unmapped address range.
*
diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..f24fc14808e1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1490,24 +1490,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
- struct hstate *hs;
-
- hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
- if (!hs)
- return -EINVAL;
-
- len = ALIGN(len, huge_page_size(hs));
- /*
- * VM_NORESERVE is used because the reservations will be
- * taken when vm_ops->mmap() is called
- * A dummy user value is used because we are not locking
- * memory so no accounting is necessary
- */
- file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
- VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
- (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 7b07ec852e01..aac2c58d6f52 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -340,6 +340,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
return ret;
}

+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+ struct user_struct *user = NULL;
+ struct hstate *hs;
+
+ hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ if (!hs)
+ return ERR_PTR(-EINVAL);
+
+ *len = ALIGN(*len, huge_page_size(hs));
+
+ /*
+ * VM_NORESERVE is used because the reservations will be
+ * taken when vm_ops->mmap() is called
+ * A dummy user value is used because we are not locking
+ * memory so no accounting is necessary
+ */
+ return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+ VM_NORESERVE,
+ &user, HUGETLB_ANONHUGE_INODE,
+ (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -349,6 +372,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;

+ if (flag & MAP_HUGETLB) {
+ if (file)
+ return -EINVAL;
+
+ file = map_hugetlb_setup(&len, flag);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
}
EXPORT_SYMBOL(vm_mmap);
--
2.11.0

2017-08-10 13:01:12

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

On Wed, Aug 09, 2017 at 02:07:48PM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Tested-by: Tycho Andersen <[email protected]>
> ---
> include/linux/xpfo.h | 4 ++++
> lib/swiotlb.c | 3 ++-
> mm/xpfo.c | 9 +++++++++
> 3 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index 1ff2d1976837..6b61f7b820f4 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -27,6 +27,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
> void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
> void xpfo_free_pages(struct page *page, int order);
>
> +bool xpfo_page_is_unmapped(struct page *page);
> +
> #else /* !CONFIG_XPFO */
>
> static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> @@ -34,6 +36,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
> static inline void xpfo_free_pages(struct page *page, int order) { }
>
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> #endif /* CONFIG_XPFO */
>
> #endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index a8d74a733a38..d4fee5ca2d9e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -420,8 +420,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> {
> unsigned long pfn = PFN_DOWN(orig_addr);
> unsigned char *vaddr = phys_to_virt(tlb_addr);
> + struct page *page = pfn_to_page(pfn);
>
> - if (PageHighMem(pfn_to_page(pfn))) {
> + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> /* The buffer does not have a mapping. Map it in and copy */
> unsigned int offset = orig_addr & ~PAGE_MASK;
> char *buffer;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index 3cd45f68b5ad..3f305f31a072 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -206,3 +206,12 @@ void xpfo_kunmap(void *kaddr, struct page *page)
> spin_unlock_irqrestore(&xpfo->maplock, flags);
> }
> EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> + if (!static_branch_unlikely(&xpfo_inited))
> + return false;
> +
> + return test_bit(XPFO_PAGE_UNMAPPED, &lookup_xpfo(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);

How can it be inline and 'EXPORT_SYMBOL' ? And why make it inline? It
surely does not need to be access that often?

> --
> 2.11.0
>

2017-08-10 13:11:18

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v5 08/10] arm64/mm: Add support for XPFO to swiotlb

On Wed, Aug 09, 2017 at 02:07:53PM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> Pages that are unmapped by XPFO need to be mapped before the
> __dma_{map,unmap}_area() operations and unmapped again afterwards (to
> restore the original state) in order to prevent fatal page faults.
>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/include/asm/cacheflush.h | 11 +++++++++
> arch/arm64/mm/dma-mapping.c | 32 +++++++++++++-------------
> arch/arm64/mm/xpfo.c | 45 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 72 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
> index d74a284abdc2..b6a462e3b2f9 100644
> --- a/arch/arm64/include/asm/cacheflush.h
> +++ b/arch/arm64/include/asm/cacheflush.h
> @@ -93,6 +93,17 @@ extern void __dma_map_area(const void *, size_t, int);
> extern void __dma_unmap_area(const void *, size_t, int);
> extern void __dma_flush_area(const void *, size_t);
>
> +#ifdef CONFIG_XPFO
> +#include <linux/xpfo.h>
> +#define _dma_map_area(addr, size, dir) \
> + xpfo_dma_map_unmap_area(true, addr, size, dir)
> +#define _dma_unmap_area(addr, size, dir) \
> + xpfo_dma_map_unmap_area(false, addr, size, dir)
> +#else
> +#define _dma_map_area(addr, size, dir) __dma_map_area(addr, size, dir)
> +#define _dma_unmap_area(addr, size, dir) __dma_unmap_area(addr, size, dir)
> +#endif
> +
> /*
> * Copy user data from/to a page which is mapped into a different
> * processes address space. Really, we want to allow our "user
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index f27d4dd04384..a79f200786ab 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -204,7 +204,7 @@ static dma_addr_t __swiotlb_map_page(struct device *dev, struct page *page,
> dev_addr = swiotlb_map_page(dev, page, offset, size, dir, attrs);
> if (!is_device_dma_coherent(dev) &&
> (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
> - __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> + _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
>
> return dev_addr;
> }
> @@ -216,7 +216,7 @@ static void __swiotlb_unmap_page(struct device *dev, dma_addr_t dev_addr,
> {
> if (!is_device_dma_coherent(dev) &&
> (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
> - __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> + _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> swiotlb_unmap_page(dev, dev_addr, size, dir, attrs);
> }
>
> @@ -231,8 +231,8 @@ static int __swiotlb_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
> if (!is_device_dma_coherent(dev) &&
> (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
> for_each_sg(sgl, sg, ret, i)
> - __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> - sg->length, dir);
> + _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> + sg->length, dir);
>
> return ret;
> }
> @@ -248,8 +248,8 @@ static void __swiotlb_unmap_sg_attrs(struct device *dev,
> if (!is_device_dma_coherent(dev) &&
> (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
> for_each_sg(sgl, sg, nelems, i)
> - __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> - sg->length, dir);
> + _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> + sg->length, dir);
> swiotlb_unmap_sg_attrs(dev, sgl, nelems, dir, attrs);
> }
>
> @@ -258,7 +258,7 @@ static void __swiotlb_sync_single_for_cpu(struct device *dev,
> enum dma_data_direction dir)
> {
> if (!is_device_dma_coherent(dev))
> - __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> + _dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> swiotlb_sync_single_for_cpu(dev, dev_addr, size, dir);
> }
>
> @@ -268,7 +268,7 @@ static void __swiotlb_sync_single_for_device(struct device *dev,
> {
> swiotlb_sync_single_for_device(dev, dev_addr, size, dir);
> if (!is_device_dma_coherent(dev))
> - __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> + _dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> }
>
> static void __swiotlb_sync_sg_for_cpu(struct device *dev,
> @@ -280,8 +280,8 @@ static void __swiotlb_sync_sg_for_cpu(struct device *dev,
>
> if (!is_device_dma_coherent(dev))
> for_each_sg(sgl, sg, nelems, i)
> - __dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> - sg->length, dir);
> + _dma_unmap_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> + sg->length, dir);
> swiotlb_sync_sg_for_cpu(dev, sgl, nelems, dir);
> }
>
> @@ -295,8 +295,8 @@ static void __swiotlb_sync_sg_for_device(struct device *dev,
> swiotlb_sync_sg_for_device(dev, sgl, nelems, dir);
> if (!is_device_dma_coherent(dev))
> for_each_sg(sgl, sg, nelems, i)
> - __dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> - sg->length, dir);
> + _dma_map_area(phys_to_virt(dma_to_phys(dev, sg->dma_address)),
> + sg->length, dir);
> }
>
> static int __swiotlb_mmap_pfn(struct vm_area_struct *vma,
> @@ -758,7 +758,7 @@ static void __iommu_sync_single_for_cpu(struct device *dev,
> return;
>
> phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> - __dma_unmap_area(phys_to_virt(phys), size, dir);
> + _dma_unmap_area(phys_to_virt(phys), size, dir);
> }
>
> static void __iommu_sync_single_for_device(struct device *dev,
> @@ -771,7 +771,7 @@ static void __iommu_sync_single_for_device(struct device *dev,
> return;
>
> phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> - __dma_map_area(phys_to_virt(phys), size, dir);
> + _dma_map_area(phys_to_virt(phys), size, dir);
> }
>
> static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
> @@ -811,7 +811,7 @@ static void __iommu_sync_sg_for_cpu(struct device *dev,
> return;
>
> for_each_sg(sgl, sg, nelems, i)
> - __dma_unmap_area(sg_virt(sg), sg->length, dir);
> + _dma_unmap_area(sg_virt(sg), sg->length, dir);
> }
>
> static void __iommu_sync_sg_for_device(struct device *dev,
> @@ -825,7 +825,7 @@ static void __iommu_sync_sg_for_device(struct device *dev,
> return;
>
> for_each_sg(sgl, sg, nelems, i)
> - __dma_map_area(sg_virt(sg), sg->length, dir);
> + _dma_map_area(sg_virt(sg), sg->length, dir);
> }
>
> static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> index de03a652d48a..c4deb2b720cf 100644
> --- a/arch/arm64/mm/xpfo.c
> +++ b/arch/arm64/mm/xpfo.c
> @@ -11,8 +11,10 @@
> * the Free Software Foundation.
> */
>
> +#include <linux/highmem.h>
> #include <linux/mm.h>
> #include <linux/module.h>
> +#include <linux/xpfo.h>
>
> #include <asm/tlbflush.h>
>
> @@ -62,3 +64,46 @@ inline void xpfo_flush_kernel_page(struct page *page, int order)
>
> flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> }
> +
> +inline void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,

And inline? You sure about that? It is quite a lot of code to duplicate
in all of those call-sites.

> + int dir)

Not enum dma_data_direction ?
> +{
> + unsigned long flags;
> + struct page *page = virt_to_page(addr);
> +
> + /*
> + * +2 here because we really want
> + * ceil(size / PAGE_SIZE), not floor(), and one extra in case things are
> + * not page aligned
> + */
> + int i, possible_pages = size / PAGE_SIZE + 2;

Could you use the PAGE_SHIFT macro instead? Or PFN_UP ?

And there is also the PAGE_ALIGN macro...

> + void *buf[possible_pages];

What if you just did 'void *buf[possible_pages] = { };'

Wouldn't that eliminate the need for the memset?

> +
> + memset(buf, 0, sizeof(void *) * possible_pages);
> +
> + local_irq_save(flags);

?? Why?

> +
> + /* Map the first page */
> + if (xpfo_page_is_unmapped(page))
> + buf[0] = kmap_atomic(page);

Especially in context of the lookup and kmap_atomic which can take
a bit of time.
> +
> + /* Map the remaining pages */
> + for (i = 1; i < possible_pages; i++) {
> + if (page_to_virt(page + i) >= addr + size)
> + break;
> +
> + if (xpfo_page_is_unmapped(page + i))
> + buf[i] = kmap_atomic(page + i);
> + }
> +
> + if (map)
> + __dma_map_area(addr, size, dir);
> + else
> + __dma_unmap_area(addr, size, dir);
> +
> + for (i = 0; i < possible_pages; i++)
> + if (buf[i])
> + kunmap_atomic(buf[i]);
> +
> + local_irq_restore(flags);
> +}
> --
> 2.11.0
>

2017-08-10 16:23:01

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

On Thu, Aug 10, 2017 at 09:01:06AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Aug 09, 2017 at 02:07:48PM -0600, Tycho Andersen wrote:
> > +inline bool xpfo_page_is_unmapped(struct page *page)
> > +{
> > + if (!static_branch_unlikely(&xpfo_inited))
> > + return false;
> > +
> > + return test_bit(XPFO_PAGE_UNMAPPED, &lookup_xpfo(page)->flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_page_is_unmapped);
>
> How can it be inline and 'EXPORT_SYMBOL' ? And why make it inline? It
> surely does not need to be accessed that often?

Good point. I'll drop inline from the next version, thanks!

Tycho

2017-08-10 16:37:48

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 08/10] arm64/mm: Add support for XPFO to swiotlb

Hi Konrad,

Thanks for taking a look!

On Thu, Aug 10, 2017 at 09:11:12AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Aug 09, 2017 at 02:07:53PM -0600, Tycho Andersen wrote:
> > +
> > +inline void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
>
> And inline? You sure about that? It is quite a lot of code to duplicate
> in all of those call-sites.
>
> > + int dir)
>
> Not enum dma_data_direction ?

I'll fix both of these, thanks.

> > +{
> > + unsigned long flags;
> > + struct page *page = virt_to_page(addr);
> > +
> > + /*
> > + * +2 here because we really want
> > + * ceil(size / PAGE_SIZE), not floor(), and one extra in case things are
> > + * not page aligned
> > + */
> > + int i, possible_pages = size / PAGE_SIZE + 2;
>
> Could you use the PAGE_SHIFT macro instead? Or PFN_UP ?
>
> And there is also the PAGE_ALIGN macro...
>
> > + void *buf[possible_pages];
>
> What if you just did 'void *buf[possible_pages] = { };'
>
> Wouldn't that eliminate the need for the memset?

gcc doesn't seem to like that:

arch/arm64//mm/xpfo.c: In function ‘xpfo_dma_map_unmap_area’:
arch/arm64//mm/xpfo.c:80:2: error: variable-sized object may not be initialized
void *buf[possible_pages] = {};
^~~~

I thought about putting this on the heap, but there's no real way to return
errors here if e.g. the kmalloc fails. I'm open to suggestions though, because
this is ugly.

> > +
> > + memset(buf, 0, sizeof(void *) * possible_pages);
> > +
> > + local_irq_save(flags);
>
> ?? Why?

I'm afraid I don't really know. I'll drop it for the next version, thanks!

Tycho
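
A rough sketch of what the reviewers' suggestions could look like (illustrative
only; the xpfo_num_pages() helper name is hypothetical and not part of the
patch). PFN_UP() gives the exact number of pages spanned by an unaligned
buffer, so the "+ 2" fudge factor goes away, and the direction argument
becomes the typed enum:

#include <linux/dma-direction.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Pages spanned by [addr, addr + size), with addr possibly unaligned. */
static inline unsigned int xpfo_num_pages(const void *addr, size_t size)
{
	return PFN_UP(offset_in_page(addr) + size);
}

void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
			     enum dma_data_direction dir);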

2017-08-11 17:25:19

by Laura Abbott

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> XPFO (eXclusive Page Frame Ownership) doesn't support section mappings
> yet, so disable it if XPFO is turned on.
>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Tested-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index f1eb15e0e864..38026b3ccb46 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -176,6 +176,18 @@ static void alloc_init_cont_pte(pmd_t *pmd, unsigned long addr,
> } while (addr = next, addr != end);
> }
>
> +static inline bool use_section_mapping(unsigned long addr, unsigned long next,
> + unsigned long phys)
> +{
> + if (IS_ENABLED(CONFIG_XPFO))
> + return false;
> +
> + if (((addr | next | phys) & ~SECTION_MASK) != 0)
> + return false;
> +
> + return true;
> +}
> +
> static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(void), int flags)
> @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> next = pmd_addr_end(addr, end);
>
> /* try section mapping first */
> - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
> + if (use_section_mapping(addr, next, phys) &&
> (flags & NO_BLOCK_MAPPINGS) == 0) {
> pmd_set_huge(pmd, phys, prot);
>
>

There is already similar logic to disable section mappings for
debug_pagealloc at the start of map_mem, can you take advantage
of that?

Thanks,
Laura

2017-08-11 18:01:52

by Laura Abbott

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 05/10] arm64/mm: Add support for XPFO

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> provide a hook for updating a single kernel page table entry (which is
> required by the generic XPFO code).
>
> At the moment, only 64k page sizes are supported.
>

Can you add a note somewhere explaining this limitation or what's
on the TODO list?

> Signed-off-by: Juerg Haefliger <[email protected]>
> Tested-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 ++
> arch/arm64/mm/xpfo.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 67 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index dfd908630631..2ddae41e0793 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -121,6 +121,7 @@ config ARM64
> select SPARSE_IRQ
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_XPFO if ARM64_64K_PAGES
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 9b0ba191e48e..22e5cab543d8 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n
>
> obj-$(CONFIG_KASAN) += kasan_init.o
> KASAN_SANITIZE_kasan_init.o := n
> +
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> new file mode 100644
> index 000000000000..de03a652d48a
> --- /dev/null
> +++ b/arch/arm64/mm/xpfo.c
> @@ -0,0 +1,64 @@
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + * Juerg Haefliger <[email protected]>
> + * Vasileios P. Kemerlis <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/tlbflush.h>
> +
> +/*
> + * Lookup the page table entry for a virtual address and return a pointer to
> + * the entry. Based on x86 tree.
> + */
> +static pte_t *lookup_address(unsigned long addr)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> +
> + pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd))
> + return NULL;
> +
> + BUG_ON(pgd_bad(*pgd));
> +
> + pud = pud_offset(pgd, addr);
> + if (pud_none(*pud))
> + return NULL;
> +
> + BUG_ON(pud_bad(*pud));
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd))
> + return NULL;
> +
> + BUG_ON(pmd_bad(*pmd));
> +
> + return pte_offset_kernel(pmd, addr);
> +}

We already have much of this logic implemented for kernel_page_present
in arch/arm64/mm/pageattr.c, we should move this into there and
make this common, similar to x86

> +
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> + pte_t *pte = lookup_address((unsigned long)kaddr);
> +
> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
> +}
> +
> +inline void xpfo_flush_kernel_page(struct page *page, int order)
> +{
> + unsigned long kaddr = (unsigned long)page_address(page);
> + unsigned long size = PAGE_SIZE;
> +
> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +}
>

2017-08-11 20:19:22

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 05/10] arm64/mm: Add support for XPFO

Hi Laura,

On Fri, Aug 11, 2017 at 11:01:46AM -0700, Laura Abbott wrote:
> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> > provide a hook for updating a single kernel page table entry (which is
> > required by the generic XPFO code).
> >
> > At the moment, only 64k page sizes are supported.
> >
>
> Can you add a note somewhere explaining this limitation or what's
> on the TODO list?

I have a little TODO list in the cover letter, and fixing this is on
it.

As for what the limitation is, I'm not really sure. When I enable e.g.
4k pages, it just hangs as soon as the bootloader branches to the
kernel, and doesn't print the kernel's hello world or anything. This
is much before XPFO's initialization code is even run, so it's
probably something simple, but I haven't figured out what yet.

Cheers,

Tycho

> > Signed-off-by: Juerg Haefliger <[email protected]>
> > Tested-by: Tycho Andersen <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/mm/Makefile | 2 ++
> > arch/arm64/mm/xpfo.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 67 insertions(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index dfd908630631..2ddae41e0793 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -121,6 +121,7 @@ config ARM64
> > select SPARSE_IRQ
> > select SYSCTL_EXCEPTION_TRACE
> > select THREAD_INFO_IN_TASK
> > + select ARCH_SUPPORTS_XPFO if ARM64_64K_PAGES
> > help
> > ARM 64-bit (AArch64) Linux support.
> >
> > diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> > index 9b0ba191e48e..22e5cab543d8 100644
> > --- a/arch/arm64/mm/Makefile
> > +++ b/arch/arm64/mm/Makefile
> > @@ -11,3 +11,5 @@ KASAN_SANITIZE_physaddr.o += n
> >
> > obj-$(CONFIG_KASAN) += kasan_init.o
> > KASAN_SANITIZE_kasan_init.o := n
> > +
> > +obj-$(CONFIG_XPFO) += xpfo.o
> > diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> > new file mode 100644
> > index 000000000000..de03a652d48a
> > --- /dev/null
> > +++ b/arch/arm64/mm/xpfo.c
> > @@ -0,0 +1,64 @@
> > +/*
> > + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + * Juerg Haefliger <[email protected]>
> > + * Vasileios P. Kemerlis <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +/*
> > + * Lookup the page table entry for a virtual address and return a pointer to
> > + * the entry. Based on x86 tree.
> > + */
> > +static pte_t *lookup_address(unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > +
> > + pgd = pgd_offset_k(addr);
> > + if (pgd_none(*pgd))
> > + return NULL;
> > +
> > + BUG_ON(pgd_bad(*pgd));
> > +
> > + pud = pud_offset(pgd, addr);
> > + if (pud_none(*pud))
> > + return NULL;
> > +
> > + BUG_ON(pud_bad(*pud));
> > +
> > + pmd = pmd_offset(pud, addr);
> > + if (pmd_none(*pmd))
> > + return NULL;
> > +
> > + BUG_ON(pmd_bad(*pmd));
> > +
> > + return pte_offset_kernel(pmd, addr);
> > +}
>
> We already have much of this logic implemented for kernel_page_present
> in arch/arm64/mm/pageattr.c, we should move this into there and
> make this common, similar to x86
>
> > +
> > +/* Update a single kernel page table entry */
> > +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> > +{
> > + pte_t *pte = lookup_address((unsigned long)kaddr);
> > +
> > + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
> > +}
> > +
> > +inline void xpfo_flush_kernel_page(struct page *page, int order)
> > +{
> > + unsigned long kaddr = (unsigned long)page_address(page);
> > + unsigned long size = PAGE_SIZE;
> > +
> > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> > +}
> >
>

2017-08-11 21:13:06

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

Hi Laura,

On Fri, Aug 11, 2017 at 10:25:14AM -0700, Laura Abbott wrote:
> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > XPFO (eXclusive Page Frame Ownership) doesn't support section mappings
> > yet, so disable it if XPFO is turned on.
> >
> > Signed-off-by: Juerg Haefliger <[email protected]>
> > Tested-by: Tycho Andersen <[email protected]>
> > ---
> > arch/arm64/mm/mmu.c | 14 +++++++++++++-
> > 1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index f1eb15e0e864..38026b3ccb46 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -176,6 +176,18 @@ static void alloc_init_cont_pte(pmd_t *pmd, unsigned long addr,
> > } while (addr = next, addr != end);
> > }
> >
> > +static inline bool use_section_mapping(unsigned long addr, unsigned long next,
> > + unsigned long phys)
> > +{
> > + if (IS_ENABLED(CONFIG_XPFO))
> > + return false;
> > +
> > + if (((addr | next | phys) & ~SECTION_MASK) != 0)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> > phys_addr_t phys, pgprot_t prot,
> > phys_addr_t (*pgtable_alloc)(void), int flags)
> > @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> > next = pmd_addr_end(addr, end);
> >
> > /* try section mapping first */
> > - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
> > + if (use_section_mapping(addr, next, phys) &&
> > (flags & NO_BLOCK_MAPPINGS) == 0) {
> > pmd_set_huge(pmd, phys, prot);
> >
> >
>
> There is already similar logic to disable section mappings for
> debug_pagealloc at the start of map_mem, can you take advantage
> of that?

You're suggesting something like this instead? Seems to work fine.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 38026b3ccb46..3b2c17bbbf12 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -434,6 +434,8 @@ static void __init map_mem(pgd_t *pgd)

if (debug_pagealloc_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+ if (IS_ENABLED(CONFIG_XPFO))
+ flags |= NO_BLOCK_MAPPINGS;

/*
* Take care not to create a writable alias for the

Cheers,

Tycho

2017-08-11 21:52:09

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

On Fri, Aug 11, 2017 at 03:13:02PM -0600, Tycho Andersen wrote:
> You're suggesting something like this instead? Seems to work fine.

And in fact, using this patch instead means that booting on 4k pages
works too... I guess because NO_BLOCK_MAPPINGS is looked at in a few
other places that matter too? Anyway, I'll use this patch instead,
thanks for the suggestion!

Tycho

2017-08-11 23:35:57

by Laura Abbott

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 00/10] Add support for eXclusive Page Frame Ownership

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> Hi all,
>
> Here's a v5 of the XPFO set. Changes from v4 are:
>
> * huge pages support actually works now on x86
> * arm64 support, which boots on several different arm64 boards
> * tests for hugepages support as well via LKDTM (thanks Kees for suggesting how
> to make this work)
>
> Patch 2 contains some potentially controversial stuff, exposing the cpa_lock
> and lifting some other static functions out; there is probably a better way to
> do this, thoughts welcome.
>
> Still to do are:
>
> * get it to work with non-64k pages on ARM
> * get rid of the BUG()s, in favor or WARN or similar
> * other things people come up with in this review
>
> Please have a look. Thoughts welcome!
>

I gave this a quick test on my arm64 machine and I see faults once
we hit userspace:

[ 4.439714] Unhandled fault: TLB conflict abort (0x96000030) at 0xffff800391440090
[ 4.447357] Internal error: : 96000030 [#1] SMP
[ 4.451875] Modules linked in:
[ 4.454924] CPU: 2 PID: 184 Comm: systemd Tainted: G W 4.13.0-rc4-xpfo+ #63
[ 4.462989] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
[ 4.472698] task: ffff8003e8d9fb00 task.stack: ffff8003f9fbc000
[ 4.478602] PC is at copy_page+0x48/0x110
[ 4.482601] LR is at __cpu_copy_user_page+0x28/0x48

I'll have to give this a closer look to see what's going on with the TLB flushing.

Thanks,
Laura


> Previously: http://www.openwall.com/lists/kernel-hardening/2017/06/07/24
>
> Tycho
>
> Juerg Haefliger (8):
> mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
> swiotlb: Map the buffer if it was unmapped by XPFO
> arm64: Add __flush_tlb_one()
> arm64/mm: Add support for XPFO
> arm64/mm: Disable section mappings if XPFO is enabled
> arm64/mm: Don't flush the data cache if the page is unmapped by XPFO
> arm64/mm: Add support for XPFO to swiotlb
> lkdtm: Add test for XPFO
>
> Tycho Andersen (2):
> mm: add MAP_HUGETLB support to vm_mmap
> mm: add a user_virt_to_phys symbol
>
> Documentation/admin-guide/kernel-parameters.txt | 2 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/cacheflush.h | 11 ++
> arch/arm64/include/asm/tlbflush.h | 8 +
> arch/arm64/mm/Makefile | 2 +
> arch/arm64/mm/dma-mapping.c | 32 ++--
> arch/arm64/mm/flush.c | 5 +-
> arch/arm64/mm/mmu.c | 14 +-
> arch/arm64/mm/xpfo.c | 160 +++++++++++++++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 23 +++
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/pageattr.c | 24 +--
> arch/x86/mm/xpfo.c | 153 +++++++++++++++++
> drivers/misc/Makefile | 1 +
> drivers/misc/lkdtm.h | 4 +
> drivers/misc/lkdtm_core.c | 4 +
> drivers/misc/lkdtm_xpfo.c | 62 +++++++
> include/linux/highmem.h | 15 +-
> include/linux/mm.h | 2 +
> include/linux/xpfo.h | 47 +++++
> lib/swiotlb.c | 3 +-
> mm/Makefile | 1 +
> mm/mmap.c | 19 +--
> mm/page_alloc.c | 2 +
> mm/page_ext.c | 4 +
> mm/util.c | 32 ++++
> mm/xpfo.c | 217 ++++++++++++++++++++++++
> security/Kconfig | 19 +++
> 29 files changed, 810 insertions(+), 59 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm_xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
>

2017-08-12 11:17:43

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

Hi,

On Fri, Aug 11, 2017 at 03:13:02PM -0600, Tycho Andersen wrote:
> On Fri, Aug 11, 2017 at 10:25:14AM -0700, Laura Abbott wrote:
> > On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > > @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> > > next = pmd_addr_end(addr, end);
> > >
> > > /* try section mapping first */
> > > - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
> > > + if (use_section_mapping(addr, next, phys) &&
> > > (flags & NO_BLOCK_MAPPINGS) == 0) {
> > > pmd_set_huge(pmd, phys, prot);
> > >
> > >
> >
> > There is already similar logic to disable section mappings for
> > debug_pagealloc at the start of map_mem, can you take advantage
> > of that?
>
> You're suggesting something like this instead? Seems to work fine.
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 38026b3ccb46..3b2c17bbbf12 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -434,6 +434,8 @@ static void __init map_mem(pgd_t *pgd)
>
> if (debug_pagealloc_enabled())
> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> + if (IS_ENABLED(CONFIG_XPFO))
> + flags |= NO_BLOCK_MAPPINGS;
>

IIUC, XPFO carves out individual pages just like DEBUG_PAGEALLOC, so you'll
also need NO_CONT_MAPPINGS.

Thanks
Mark.

2017-08-12 11:26:09

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> Add a hook for flushing a single TLB entry on arm64.
>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Tested-by: Tycho Andersen <[email protected]>
> ---
> arch/arm64/include/asm/tlbflush.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index af1c76981911..8e0c49105d3e 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
> isb();
> }
>
> +static inline void __flush_tlb_one(unsigned long addr)
> +{
> + dsb(ishst);
> + __tlbi(vaae1is, addr >> 12);
> + dsb(ish);
> + isb();
> +}

Is this going to be called by generic code?

It would be nice if we could drop 'kernel' into the name, to make it clear this
is intended to affect the kernel mappings, which have different maintenance
requirements to user mappings.

We should be able to implement this more simply as:

flush_tlb_kernel_page(unsigned long addr)
{
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}

Thanks,
Mark.

2017-08-12 11:57:44

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 07/10] arm64/mm: Don't flush the data cache if the page is unmapped by XPFO

On Wed, Aug 09, 2017 at 02:07:52PM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <[email protected]>
>
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault. So don't flush in that case.

Do you have an example callchain where that happens? We might need to shuffle
things around to cater for that case.

> @@ -30,7 +31,9 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
> unsigned long addr = (unsigned long)kaddr;
>
> if (icache_is_aliasing()) {
> - __clean_dcache_area_pou(kaddr, len);
> + /* Don't flush if the page is unmapped by XPFO */
> + if (!xpfo_page_is_unmapped(virt_to_page(kaddr)))
> + __clean_dcache_area_pou(kaddr, len);
> __flush_icache_all();
> } else {
> flush_icache_range(addr, addr + len);

I don't think this patch is correct. If data cache maintenance is required in
the absence of XPFO, I don't see why it wouldn't be required in the presence of
XPFO.

I'm not immediately sure why the non-aliasing case misses data cache
maintenance. I couldn't spot where that happens otherwise.

On a more general note, in future it would be good to Cc the arm64 maintainers
and the linux-arm-kernel mailing list for patches affecting arm64.

Thanks,
Mark.

2017-08-12 20:24:48

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] lkdtm: Add test for XPFO

Hi Juerg,

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.13-rc4]
[cannot apply to next-20170811]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/Add-support-for-eXclusive-Page-Frame-Ownership/20170813-035705
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: x86_64-randconfig-x016-201733 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All errors (new ones prefixed by >>):

drivers/misc/lkdtm_xpfo.c: In function 'read_user_with_flags':
>> drivers/misc/lkdtm_xpfo.c:31:14: error: implicit declaration of function 'user_virt_to_phys' [-Werror=implicit-function-declaration]
phys_addr = user_virt_to_phys(user_addr);
^~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/user_virt_to_phys +31 drivers/misc/lkdtm_xpfo.c

10
11 void read_user_with_flags(unsigned long flags)
12 {
13 unsigned long user_addr, user_data = 0xdeadbeef;
14 phys_addr_t phys_addr;
15 void *virt_addr;
16
17 user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
18 PROT_READ | PROT_WRITE | PROT_EXEC,
19 flags, 0);
20 if (user_addr >= TASK_SIZE) {
21 pr_warn("Failed to allocate user memory\n");
22 return;
23 }
24
25 if (copy_to_user((void __user *)user_addr, &user_data,
26 sizeof(user_data))) {
27 pr_warn("copy_to_user failed\n");
28 goto free_user;
29 }
30
> 31 phys_addr = user_virt_to_phys(user_addr);
32 if (!phys_addr) {
33 pr_warn("Failed to get physical address of user memory\n");
34 goto free_user;
35 }
36
37 virt_addr = phys_to_virt(phys_addr);
38 if (phys_addr != virt_to_phys(virt_addr)) {
39 pr_warn("Physical address of user memory seems incorrect\n");
40 goto free_user;
41 }
42
43 pr_info("Attempting bad read from kernel address %p\n", virt_addr);
44 if (*(unsigned long *)virt_addr == user_data)
45 pr_info("Huh? Bad read succeeded?!\n");
46 else
47 pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
48
49 free_user:
50 vm_munmap(user_addr, PAGE_SIZE);
51 }
52

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2017-08-12 21:05:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] lkdtm: Add test for XPFO

Hi Juerg,

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.13-rc4]
[cannot apply to next-20170811]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/Add-support-for-eXclusive-Page-Frame-Ownership/20170813-035705
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All error/warnings (new ones prefixed by >>):

drivers//misc/lkdtm_xpfo.c: In function 'read_user_with_flags':
drivers//misc/lkdtm_xpfo.c:31:2: error: implicit declaration of function 'user_virt_to_phys' [-Werror=implicit-function-declaration]
phys_addr = user_virt_to_phys(user_addr);
^
>> drivers//misc/lkdtm_xpfo.c:37:2: error: implicit declaration of function 'phys_to_virt' [-Werror=implicit-function-declaration]
virt_addr = phys_to_virt(phys_addr);
^
>> drivers//misc/lkdtm_xpfo.c:37:12: warning: assignment makes pointer from integer without a cast
virt_addr = phys_to_virt(phys_addr);
^
>> drivers//misc/lkdtm_xpfo.c:38:2: error: implicit declaration of function 'virt_to_phys' [-Werror=implicit-function-declaration]
if (phys_addr != virt_to_phys(virt_addr)) {
^
cc1: some warnings being treated as errors

vim +/phys_to_virt +37 drivers//misc/lkdtm_xpfo.c

10
11 void read_user_with_flags(unsigned long flags)
12 {
13 unsigned long user_addr, user_data = 0xdeadbeef;
14 phys_addr_t phys_addr;
15 void *virt_addr;
16
17 user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
18 PROT_READ | PROT_WRITE | PROT_EXEC,
19 flags, 0);
20 if (user_addr >= TASK_SIZE) {
21 pr_warn("Failed to allocate user memory\n");
22 return;
23 }
24
25 if (copy_to_user((void __user *)user_addr, &user_data,
26 sizeof(user_data))) {
27 pr_warn("copy_to_user failed\n");
28 goto free_user;
29 }
30
> 31 phys_addr = user_virt_to_phys(user_addr);
32 if (!phys_addr) {
33 pr_warn("Failed to get physical address of user memory\n");
34 goto free_user;
35 }
36
> 37 virt_addr = phys_to_virt(phys_addr);
> 38 if (phys_addr != virt_to_phys(virt_addr)) {
39 pr_warn("Physical address of user memory seems incorrect\n");
40 goto free_user;
41 }
42
43 pr_info("Attempting bad read from kernel address %p\n", virt_addr);
44 if (*(unsigned long *)virt_addr == user_data)
45 pr_info("Huh? Bad read succeeded?!\n");
46 else
47 pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
48
49 free_user:
50 vm_munmap(user_addr, PAGE_SIZE);
51 }
52

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2017-08-14 16:21:56

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] lkdtm: Add test for XPFO

On Sun, Aug 13, 2017 at 04:24:23AM +0800, kbuild test robot wrote:
> Hi Juerg,
>
> [auto build test ERROR on arm64/for-next/core]
> [also build test ERROR on v4.13-rc4]
> [cannot apply to next-20170811]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/Add-support-for-eXclusive-Page-Frame-Ownership/20170813-035705
> base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
> config: x86_64-randconfig-x016-201733 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> All errors (new ones prefixed by >>):
>
> drivers/misc/lkdtm_xpfo.c: In function 'read_user_with_flags':
> >> drivers/misc/lkdtm_xpfo.c:31:14: error: implicit declaration of function 'user_virt_to_phys' [-Werror=implicit-function-declaration]
> phys_addr = user_virt_to_phys(user_addr);
> ^~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors

These are both the same error; it looks like I forgot a dummy prototype
in the non-CONFIG_XPFO case. I'll fix it in the next version.

Tycho
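
A minimal sketch of the kind of fallback that cures this build error (the
exact placement in include/linux/xpfo.h is an assumption, not taken from the
patch):

#ifdef CONFIG_XPFO
phys_addr_t user_virt_to_phys(unsigned long addr);
#else
static inline phys_addr_t user_virt_to_phys(unsigned long addr)
{
	/* Returning 0 lets the LKDTM test's !phys_addr check bail out. */
	return 0;
}
#endif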

2017-08-14 16:23:09

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

On Sat, Aug 12, 2017 at 12:17:34PM +0100, Mark Rutland wrote:
> Hi,
>
> On Fri, Aug 11, 2017 at 03:13:02PM -0600, Tycho Andersen wrote:
> > On Fri, Aug 11, 2017 at 10:25:14AM -0700, Laura Abbott wrote:
> > > On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > > > @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> > > > next = pmd_addr_end(addr, end);
> > > >
> > > > /* try section mapping first */
> > > > - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
> > > > + if (use_section_mapping(addr, next, phys) &&
> > > > (flags & NO_BLOCK_MAPPINGS) == 0) {
> > > > pmd_set_huge(pmd, phys, prot);
> > > >
> > > >
> > >
> > > There is already similar logic to disable section mappings for
> > > debug_pagealloc at the start of map_mem, can you take advantage
> > > of that?
> >
> > You're suggesting something like this instead? Seems to work fine.
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 38026b3ccb46..3b2c17bbbf12 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -434,6 +434,8 @@ static void __init map_mem(pgd_t *pgd)
> >
> > if (debug_pagealloc_enabled())
> > flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > + if (IS_ENABLED(CONFIG_XPFO))
> > + flags |= NO_BLOCK_MAPPINGS;
> >
>
> IIUC, XPFO carves out individual pages just like DEBUG_PAGEALLOC, so you'll
> also need NO_CONT_MAPPINGS.

Yes, thanks!

Tycho
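
A sketch of the resulting map_mem() logic after folding in Mark's point about
NO_CONT_MAPPINGS (illustrative; the comment text here is not from the posted
hunk):

	if (debug_pagealloc_enabled())
		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

	/*
	 * XPFO splits the linear map down to page granularity, so rule
	 * out block and contiguous mappings, just as DEBUG_PAGEALLOC does.
	 */
	if (IS_ENABLED(CONFIG_XPFO))
		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;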

2017-08-14 16:35:40

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

Hi Mark,

On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > Add a hook for flushing a single TLB entry on arm64.
> >
> > Signed-off-by: Juerg Haefliger <[email protected]>
> > Tested-by: Tycho Andersen <[email protected]>
> > ---
> > arch/arm64/include/asm/tlbflush.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index af1c76981911..8e0c49105d3e 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
> > isb();
> > }
> >
> > +static inline void __flush_tlb_one(unsigned long addr)
> > +{
> > + dsb(ishst);
> > + __tlbi(vaae1is, addr >> 12);
> > + dsb(ish);
> > + isb();
> > +}
>
> Is this going to be called by generic code?

Yes, it's called in mm/xpfo.c:xpfo_kunmap.

> It would be nice if we could drop 'kernel' into the name, to make it clear this
> is intended to affect the kernel mappings, which have different maintenance
> requirements to user mappings.
>
> We should be able to implement this more simply as:
>
> flush_tlb_kernel_page(unsigned long addr)
> {
> flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> }

It's named __flush_tlb_one after the x86 (and a few other arches)
function of the same name. I can change it to flush_tlb_kernel_page,
but then we'll need some x86-specific code to map the name as well.

Maybe since it's called from generic code that's warranted though?
I'll change the implementation for now, let me know what you want to
do about the name.

Cheers,

Tycho

2017-08-14 16:51:59

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Mon, Aug 14, 2017 at 10:35:36AM -0600, Tycho Andersen wrote:
> Hi Mark,
>
> On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> > On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > > +static inline void __flush_tlb_one(unsigned long addr)
> > > +{
> > > + dsb(ishst);
> > > + __tlbi(vaae1is, addr >> 12);
> > > + dsb(ish);
> > > + isb();
> > > +}
> >
> > Is this going to be called by generic code?
>
> Yes, it's called in mm/xpfo.c:xpfo_kunmap.
>
> > It would be nice if we could drop 'kernel' into the name, to make it clear this
> > is intended to affect the kernel mappings, which have different maintenance
> > requirements to user mappings.

> It's named __flush_tlb_one after the x86 (and a few other arches)
> function of the same name. I can change it to flush_tlb_kernel_page,
> but then we'll need some x86-specific code to map the name as well.
>
> Maybe since it's called from generic code that's warranted though?
> I'll change the implementation for now, let me know what you want to
> do about the name.

I think it would be preferable to do so, to align with
flush_tlb_kernel_range(), which is an existing generic interface.

That said, is there any reason not to use flush_tlb_kernel_range()
directly?

Thanks,
Mark.

2017-08-14 16:55:35

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 07/10] arm64/mm: Don't flush the data cache if the page is unmapped by XPFO

On Sat, Aug 12, 2017 at 12:57:37PM +0100, Mark Rutland wrote:
> On Wed, Aug 09, 2017 at 02:07:52PM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > If the page is unmapped by XPFO, a data cache flush results in a fatal
> > page fault. So don't flush in that case.
>
> Do you have an example callchain where that happens? We might need to shuffle
> things around to cater for that case.
>
> > @@ -30,7 +31,9 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
> > unsigned long addr = (unsigned long)kaddr;
> >
> > if (icache_is_aliasing()) {
> > - __clean_dcache_area_pou(kaddr, len);
> > + /* Don't flush if the page is unmapped by XPFO */
> > + if (!xpfo_page_is_unmapped(virt_to_page(kaddr)))
> > + __clean_dcache_area_pou(kaddr, len);
> > __flush_icache_all();
> > } else {
> > flush_icache_range(addr, addr + len);
>
> I don't think this patch is correct. If data cache maintenance is required in
> the absence of XPFO, I don't see why it wouldn't be required in the presence of
> XPFO.
>
> I'm not immediately sure why the non-aliasing case misses data cache
> maintenance. I couldn't spot where that happens otherwise.

From another look, it seems that flush_icache_range() performs the
D-cache maintenance internally, so it's just a slightly misleading name.

However, __flush_icache_all() does not, and does need separate D-cache
maintenance.

Thanks,
Mark.

2017-08-14 17:01:28

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> On Mon, Aug 14, 2017 at 10:35:36AM -0600, Tycho Andersen wrote:
> > Hi Mark,
> >
> > On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> > > On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > > > +static inline void __flush_tlb_one(unsigned long addr)
> > > > +{
> > > > + dsb(ishst);
> > > > + __tlbi(vaae1is, addr >> 12);
> > > > + dsb(ish);
> > > > + isb();
> > > > +}
> > >
> > > Is this going to be called by generic code?
> >
> > Yes, it's called in mm/xpfo.c:xpfo_kunmap.
> >
> > > It would be nice if we could drop 'kernel' into the name, to make it clear this
> > > is intended to affect the kernel mappings, which have different maintenance
> > > requirements to user mappings.
>
> > It's named __flush_tlb_one after the x86 (and a few other arches)
> > function of the same name. I can change it to flush_tlb_kernel_page,
> > but then we'll need some x86-specific code to map the name as well.
> >
> > Maybe since it's called from generic code that's warranted though?
> > I'll change the implementation for now, let me know what you want to
> > do about the name.
>
> I think it would be preferable to do so, to align with
> flush_tlb_kernel_range(), which is an existing generic interface.
>
> That said, is there any reason not to use flush_tlb_kernel_range()
> directly?

I don't think so, I'll change the generic code to that and drop this
patch.

Thanks!

Tycho
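
For reference, a sketch of what the generic caller can do with the existing
interface once __flush_tlb_one() is dropped (the helper name here is
hypothetical):

/* Flush the kernel mapping of an order-N page without __flush_tlb_one(). */
static inline void xpfo_flush_kernel_tlb(struct page *page, int order)
{
	unsigned long kaddr = (unsigned long)page_address(page);

	flush_tlb_kernel_range(kaddr, kaddr + (PAGE_SIZE << order));
}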

2017-08-14 18:42:52

by Laura Abbott

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

On 08/14/2017 09:22 AM, Tycho Andersen wrote:
> On Sat, Aug 12, 2017 at 12:17:34PM +0100, Mark Rutland wrote:
>> Hi,
>>
>> On Fri, Aug 11, 2017 at 03:13:02PM -0600, Tycho Andersen wrote:
>>> On Fri, Aug 11, 2017 at 10:25:14AM -0700, Laura Abbott wrote:
>>>> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
>>>>> @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
>>>>> next = pmd_addr_end(addr, end);
>>>>>
>>>>> /* try section mapping first */
>>>>> - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
>>>>> + if (use_section_mapping(addr, next, phys) &&
>>>>> (flags & NO_BLOCK_MAPPINGS) == 0) {
>>>>> pmd_set_huge(pmd, phys, prot);
>>>>>
>>>>>
>>>>
>>>> There is already similar logic to disable section mappings for
>>>> debug_pagealloc at the start of map_mem, can you take advantage
>>>> of that?
>>>
>>> You're suggesting something like this instead? Seems to work fine.
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 38026b3ccb46..3b2c17bbbf12 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -434,6 +434,8 @@ static void __init map_mem(pgd_t *pgd)
>>>
>>> if (debug_pagealloc_enabled())
>>> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>> + if (IS_ENABLED(CONFIG_XPFO))
>>> + flags |= NO_BLOCK_MAPPINGS;
>>>
>>
>> IIUC, XPFO carves out individual pages just like DEBUG_PAGEALLOC, so you'll
>> also need NO_CONT_MAPPINGS.
>
> Yes, thanks!
>
> Tycho
>

Setting NO_CONT_MAPPINGS fixes the TLB conflict aborts I was seeing
on my machine.

Thanks,
Laura

2017-08-14 18:51:07

by Laura Abbott

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..3cd45f68b5ad
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,208 @@
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + * Juerg Haefliger <[email protected]>
> + * Vasileios P. Kemerlis <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +/* XPFO page state flags */
> +enum xpfo_flags {
> + XPFO_PAGE_USER, /* Page is allocated to user-space */
> + XPFO_PAGE_UNMAPPED, /* Page is unmapped from the linear map */
> +};
> +
> +/* Per-page XPFO house-keeping data */
> +struct xpfo {
> + unsigned long flags; /* Page state */
> + bool inited; /* Map counter and lock initialized */
> + atomic_t mapcount; /* Counter for balancing map/unmap requests */
> + spinlock_t maplock; /* Lock to serialize map/unmap requests */
> +};
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool xpfo_disabled __initdata;
> +
> +static int __init noxpfo_param(char *str)
> +{
> + xpfo_disabled = true;
> +
> + return 0;
> +}
> +
> +early_param("noxpfo", noxpfo_param);
> +
> +static bool __init need_xpfo(void)
> +{
> + if (xpfo_disabled) {
> + printk(KERN_INFO "XPFO disabled\n");
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> + printk(KERN_INFO "XPFO enabled\n");
> + static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> + .size = sizeof(struct xpfo),
> + .need = need_xpfo,
> + .init = init_xpfo,
> +};
> +
> +static inline struct xpfo *lookup_xpfo(struct page *page)
> +{
> + return (void *)lookup_page_ext(page) + page_xpfo_ops.offset;
> +}

lookup_page_ext can return NULL so this function and its callers
need to account for that.

Thanks,
Laura
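
A sketch of a NULL-safe lookup (illustrative, not the fix that was later
posted; callers would still have to handle the NULL return):

static inline struct xpfo *lookup_xpfo(struct page *page)
{
	struct page_ext *page_ext = lookup_page_ext(page);

	if (unlikely(!page_ext))
		return NULL;

	return (void *)page_ext + page_xpfo_ops.offset;
}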

2017-08-14 19:10:49

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] lkdtm: Add test for XPFO

On Wed, Aug 9, 2017 at 1:07 PM, Tycho Andersen <[email protected]> wrote:
> From: Juerg Haefliger <[email protected]>
>
> This test simply reads from userspace memory via the kernel's linear
> map.
>
> hugepages is only supported on x86 right now, hence the ifdef.

I'd prefer that the #ifdef is handled in the .c file. The result is
that all architectures will have the XPFO_READ_USER_HUGE test, but it
can just fail when not available. This means no changes are needed for
lkdtm in the future and the test provides an actual test of hugepages
coverage.

-Kees

--
Kees Cook
Pixel Security

2017-08-14 20:27:31

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 07/10] arm64/mm: Don't flush the data cache if the page is unmapped by XPFO

Hi Mark,

First, thanks for taking a look!

On Sat, Aug 12, 2017 at 12:57:37PM +0100, Mark Rutland wrote:
> On Wed, Aug 09, 2017 at 02:07:52PM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > If the page is unmapped by XPFO, a data cache flush results in a fatal
> > page fault. So don't flush in that case.
>
> Do you have an example callchain where that happens? We might need to shuffle
> things around to cater for that case.

Here's one from the other branch (i.e. xpfo_page_is_unmapped() is true):

[ 15.487293] CPU: 2 PID: 1633 Comm: plymouth Not tainted 4.13.0-rc4-c2+ #242
[ 15.487295] Hardware name: Hardkernel ODROID-C2 (DT)
[ 15.487297] Call trace:
[ 15.487313] [<ffff0000080884f0>] dump_backtrace+0x0/0x248
[ 15.487317] [<ffff00000808878c>] show_stack+0x14/0x20
[ 15.487324] [<ffff000008b3e1b8>] dump_stack+0x98/0xb8
[ 15.487329] [<ffff000008098bb4>] sync_icache_aliases+0x84/0x98
[ 15.487332] [<ffff000008098c74>] __sync_icache_dcache+0x64/0x88
[ 15.487337] [<ffff0000081d4814>] alloc_set_pte+0x4ec/0x6b8
[ 15.487342] [<ffff00000819d920>] filemap_map_pages+0x350/0x360
[ 15.487344] [<ffff0000081d4ccc>] do_fault+0x28c/0x568
[ 15.487347] [<ffff0000081d67a0>] __handle_mm_fault+0x410/0xd08
[ 15.487350] [<ffff0000081d7164>] handle_mm_fault+0xcc/0x1a8
[ 15.487352] [<ffff000008098580>] do_page_fault+0x270/0x380
[ 15.487355] [<ffff00000808128c>] do_mem_abort+0x3c/0x98
[ 15.487358] Exception stack(0xffff800061dabe20 to 0xffff800061dabf50)
[ 15.487362] be20: 0000000000000000 0000800062e19000 ffffffffffffffff 0000ffff8f64ddc8
[ 15.487365] be40: ffff800061dabe80 ffff000008238810 ffff800061d80330 0000000000000018
[ 15.487368] be60: ffffffffffffffff 0000ffff8f5ba958 ffff800061d803d0 ffff800067132e18
[ 15.487370] be80: 0000000000000000 ffff800061d80d08 0000000000000000 0000000000000019
[ 15.487373] bea0: 000000002bd3d0f0 0000000000000000 0000000000000019 ffff800067132e00
[ 15.487376] bec0: 0000000000000000 0000ffff8f657220 0000000000000000 0000000000000000
[ 15.487379] bee0: 8080808000000000 0000000000000000 0000000080808080 fefefeff6f6b6467
[ 15.487381] bf00: 7f7f7f7f7f7f7f7f 000000002bd3fb40 0101010101010101 0000000000000020
[ 15.487384] bf20: 00000000004072b0 00000000004072e0 0000000000000000 0000ffff8f6b2000
[ 15.487386] bf40: 0000ffff8f66b190 0000ffff8f576380
[ 15.487389] [<ffff000008082b74>] el0_da+0x20/0x24

> > @@ -30,7 +31,9 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
> > unsigned long addr = (unsigned long)kaddr;
> >
> > if (icache_is_aliasing()) {
> > - __clean_dcache_area_pou(kaddr, len);
> > + /* Don't flush if the page is unmapped by XPFO */
> > + if (!xpfo_page_is_unmapped(virt_to_page(kaddr)))
> > + __clean_dcache_area_pou(kaddr, len);
> > __flush_icache_all();
> > } else {
> > flush_icache_range(addr, addr + len);
>
> I don't think this patch is correct. If data cache maintenance is required in
> the absence of XPFO, I don't see why it wouldn't be required in the presence of
> XPFO.

Ok. I suppose we could re-map the page like we do for dma; or is there some
re-arrangement of things you can see that would help?

> I'm not immediately sure why the non-aliasing case misses data cache
> maintenance. I couldn't spot where that happens otherwise.
>
> On a more general note, in future it would be good to Cc the arm64 maintainers
> and the linux-arm-kernel mailing list for patches affecting arm64.

Yes, I thought about doing that for the series, but since it has x86 patches
too, I didn't want to spam everyone :). I'll just add x86/arm lists to CC in
the patches in the future. If there's some better way, let me know.

Tycho
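
A rough sketch of the "re-map like we do for dma" idea for
sync_icache_aliases() (illustrative only; it assumes, as in the DMA path, that
kmap_atomic() re-establishes the linear-map entry for an XPFO-unmapped page,
and it only covers the single-page case that the existing check handles):

	if (icache_is_aliasing()) {
		void *mapped = NULL;

		/* Temporarily map the page back so the D-cache op is safe. */
		if (xpfo_page_is_unmapped(virt_to_page(kaddr)))
			mapped = kmap_atomic(virt_to_page(kaddr));

		__clean_dcache_area_pou(kaddr, len);
		__flush_icache_all();

		if (mapped)
			kunmap_atomic(mapped);
	} else {
		flush_icache_range(addr, addr + len);
	}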

2017-08-14 20:28:48

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 06/10] arm64/mm: Disable section mappings if XPFO is enabled

On Mon, Aug 14, 2017 at 11:42:45AM -0700, Laura Abbott wrote:
> On 08/14/2017 09:22 AM, Tycho Andersen wrote:
> > On Sat, Aug 12, 2017 at 12:17:34PM +0100, Mark Rutland wrote:
> >> Hi,
> >>
> >> On Fri, Aug 11, 2017 at 03:13:02PM -0600, Tycho Andersen wrote:
> >>> On Fri, Aug 11, 2017 at 10:25:14AM -0700, Laura Abbott wrote:
> >>>> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> >>>>> @@ -190,7 +202,7 @@ static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
> >>>>> next = pmd_addr_end(addr, end);
> >>>>>
> >>>>> /* try section mapping first */
> >>>>> - if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
> >>>>> + if (use_section_mapping(addr, next, phys) &&
> >>>>> (flags & NO_BLOCK_MAPPINGS) == 0) {
> >>>>> pmd_set_huge(pmd, phys, prot);
> >>>>>
> >>>>>
> >>>>
> >>>> There is already similar logic to disable section mappings for
> >>>> debug_pagealloc at the start of map_mem, can you take advantage
> >>>> of that?
> >>>
> >>> You're suggesting something like this instead? Seems to work fine.
> >>>
> >>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >>> index 38026b3ccb46..3b2c17bbbf12 100644
> >>> --- a/arch/arm64/mm/mmu.c
> >>> +++ b/arch/arm64/mm/mmu.c
> >>> @@ -434,6 +434,8 @@ static void __init map_mem(pgd_t *pgd)
> >>>
> >>> if (debug_pagealloc_enabled())
> >>> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >>> + if (IS_ENABLED(CONFIG_XPFO))
> >>> + flags |= NO_BLOCK_MAPPINGS;
> >>>
> >>
> >> IIUC, XPFO carves out individual pages just like DEBUG_PAGEALLOC, so you'll
> >> also need NO_CONT_MAPPINGS.
> >
> > Yes, thanks!
> >
> > Tycho
> >
>
> Setting NO_CONT_MAPPINGS fixes the TLB conflict aborts I was seeing
> on my machine.

Great, thanks for testing! I've also fixed the lookup_page_ext bug you
noted in the other thread.

Tycho

2017-08-14 20:29:52

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] lkdtm: Add test for XPFO

On Mon, Aug 14, 2017 at 12:10:47PM -0700, Kees Cook wrote:
> On Wed, Aug 9, 2017 at 1:07 PM, Tycho Andersen <[email protected]> wrote:
> > From: Juerg Haefliger <[email protected]>
> >
> > This test simply reads from userspace memory via the kernel's linear
> > map.
> >
> > hugepages is only supported on x86 right now, hence the ifdef.
>
> I'd prefer that the #ifdef is handled in the .c file. The result is
> that all architectures will have the XPFO_READ_USER_HUGE test, but it
> can just fail when not available. This means no changes are needed for
> lkdtm in the future and the test provides an actual test of hugepages
> coverage.

If failing tests is okay, I think we can just drop that hunk entirely.
Everything compiles fine, it just doesn't work :). I'll do that for
the next version.

Tycho

2017-08-14 22:30:06

by Laura Abbott

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> + unsigned int level;
> + pgprot_t msk_clr;
> + pte_t *pte = lookup_address((unsigned long)kaddr, &level);
> +
> + BUG_ON(!pte);
> +
> + switch (level) {
> + case PG_LEVEL_4K:
> + set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> + break;
> + case PG_LEVEL_2M:
> + /* We need to check if it's a 2M page or 1GB page before retrieve
> + * pgprot info, as each one will be extracted from a different
> + * page table levels */
> + msk_clr = pmd_pgprot(*(pmd_t*)pte);
> + case PG_LEVEL_1G: {
> + struct cpa_data cpa;
> + int do_split;
> +
> + msk_clr = pud_pgprot(*(pud_t*)pte);
> +
> + memset(&cpa, 0, sizeof(cpa));
> + cpa.vaddr = kaddr;
> + cpa.pages = &page;
> + cpa.mask_set = prot;
> + cpa.mask_clr = msk_clr;
> + cpa.numpages = 1;
> + cpa.flags = 0;
> + cpa.curpage = 0;
> + cpa.force_split = 0;
> +
> +
> + do_split = try_preserve_large_page(pte, (unsigned long)kaddr, &cpa);
> + if (do_split) {
> + spin_lock(&cpa_lock);
> + BUG_ON(split_large_page(&cpa, pte, (unsigned long)kaddr));
> + spin_unlock(&cpa_lock);
> + }

This doesn't work in atomic contexts:

[ 28.263571] BUG: sleeping function called from invalid context at
mm/page_alloc.c:4048
[ 28.263575] in_atomic(): 1, irqs_disabled(): 1, pid: 2433, name:
gnome-terminal
[ 28.263576] INFO: lockdep is turned off.
[ 28.263578] irq event stamp: 0
[ 28.263580] hardirqs last enabled at (0): [< (null)>]
(null)
[ 28.263584] hardirqs last disabled at (0): [<ffffffff840af28a>]
copy_process.part.25+0x62a/0x1e90
[ 28.263587] softirqs last enabled at (0): [<ffffffff840af28a>]
copy_process.part.25+0x62a/0x1e90
[ 28.263588] softirqs last disabled at (0): [< (null)>]
(null)
[ 28.263591] CPU: 0 PID: 2433 Comm: gnome-terminal Tainted: G W
4.13.0-rc5-xpfo+ #86
[ 28.263592] Hardware name: LENOVO 20BTS1N700/20BTS1N700, BIOS
N14ET28W (1.06 ) 03/12/2015
[ 28.263593] Call Trace:
[ 28.263598] dump_stack+0x8e/0xd6
[ 28.263601] ___might_sleep+0x164/0x250
[ 28.263604] __might_sleep+0x4a/0x80
[ 28.263607] __alloc_pages_nodemask+0x2b3/0x3e0
[ 28.263611] alloc_pages_current+0x6a/0xe0
[ 28.263614] split_large_page+0x4e/0x360
[ 28.263618] set_kpte+0x12c/0x150
[ 28.263623] xpfo_kunmap+0x7e/0xa0
[ 28.263627] wp_page_copy+0x16e/0x800
[ 28.263631] do_wp_page+0x9a/0x580
[ 28.263633] __handle_mm_fault+0xb1c/0x1130
[ 28.263638] handle_mm_fault+0x178/0x350
[ 28.263641] __do_page_fault+0x26e/0x510
[ 28.263644] do_page_fault+0x30/0x80
[ 28.263647] page_fault+0x28/0x30


split_large_page calls alloc_page with GFP_KERNEL. switching to
use GFP_ATOMIC in this path works locally for me.

Thanks,
Laura

> +
> + break;
> + }
> + case PG_LEVEL_512G:
> + /* fallthrough, splitting infrastructure doesn't
> + * support 512G pages. */
> + default:
> + BUG();
> + }
> +
> +}

2017-08-15 03:47:21

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

Hi Laura,

On Mon, Aug 14, 2017 at 03:30:00PM -0700, Laura Abbott wrote:
> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > +/* Update a single kernel page table entry */
> > +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> > +{
> > + unsigned int level;
> > + pgprot_t msk_clr;
> > + pte_t *pte = lookup_address((unsigned long)kaddr, &level);
> > +
> > + BUG_ON(!pte);
> > +
> > + switch (level) {
> > + case PG_LEVEL_4K:
> > + set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> > + break;
> > + case PG_LEVEL_2M:
> > + /* We need to check if it's a 2M page or 1GB page before retrieve
> > + * pgprot info, as each one will be extracted from a different
> > + * page table levels */
> > + msk_clr = pmd_pgprot(*(pmd_t*)pte);
> > + case PG_LEVEL_1G: {
> > + struct cpa_data cpa;
> > + int do_split;
> > +
> > + msk_clr = pud_pgprot(*(pud_t*)pte);
> > +
> > + memset(&cpa, 0, sizeof(cpa));
> > + cpa.vaddr = kaddr;
> > + cpa.pages = &page;
> > + cpa.mask_set = prot;
> > + cpa.mask_clr = msk_clr;
> > + cpa.numpages = 1;
> > + cpa.flags = 0;
> > + cpa.curpage = 0;
> > + cpa.force_split = 0;
> > +
> > +
> > + do_split = try_preserve_large_page(pte, (unsigned long)kaddr, &cpa);
> > + if (do_split) {
> > + spin_lock(&cpa_lock);
> > + BUG_ON(split_large_page(&cpa, pte, (unsigned long)kaddr));
> > + spin_unlock(&cpa_lock);
> > + }
>
> This doesn't work in atomic contexts:
>
> [ 28.263571] BUG: sleeping function called from invalid context at
> mm/page_alloc.c:4048
> [ 28.263575] in_atomic(): 1, irqs_disabled(): 1, pid: 2433, name:
> gnome-terminal
> [ 28.263576] INFO: lockdep is turned off.
> [ 28.263578] irq event stamp: 0
> [ 28.263580] hardirqs last enabled at (0): [< (null)>]
> (null)
> [ 28.263584] hardirqs last disabled at (0): [<ffffffff840af28a>]
> copy_process.part.25+0x62a/0x1e90
> [ 28.263587] softirqs last enabled at (0): [<ffffffff840af28a>]
> copy_process.part.25+0x62a/0x1e90
> [ 28.263588] softirqs last disabled at (0): [< (null)>]
> (null)
> [ 28.263591] CPU: 0 PID: 2433 Comm: gnome-terminal Tainted: G W
> 4.13.0-rc5-xpfo+ #86
> [ 28.263592] Hardware name: LENOVO 20BTS1N700/20BTS1N700, BIOS N14ET28W
> (1.06 ) 03/12/2015
> [ 28.263593] Call Trace:
> [ 28.263598] dump_stack+0x8e/0xd6
> [ 28.263601] ___might_sleep+0x164/0x250
> [ 28.263604] __might_sleep+0x4a/0x80
> [ 28.263607] __alloc_pages_nodemask+0x2b3/0x3e0
> [ 28.263611] alloc_pages_current+0x6a/0xe0
> [ 28.263614] split_large_page+0x4e/0x360
> [ 28.263618] set_kpte+0x12c/0x150
> [ 28.263623] xpfo_kunmap+0x7e/0xa0
> [ 28.263627] wp_page_copy+0x16e/0x800
> [ 28.263631] do_wp_page+0x9a/0x580
> [ 28.263633] __handle_mm_fault+0xb1c/0x1130
> [ 28.263638] handle_mm_fault+0x178/0x350
> [ 28.263641] __do_page_fault+0x26e/0x510
> [ 28.263644] do_page_fault+0x30/0x80
> [ 28.263647] page_fault+0x28/0x30
>
>
> split_large_page calls alloc_page with GFP_KERNEL. switching to
> use GFP_ATOMIC in this path works locally for me.

Oof, thanks. I'll do that for the next version, and also CC x86 in
case they may have better suggestions.

Cheers,

Tycho

2017-08-15 03:51:42

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

On Mon, Aug 14, 2017 at 09:47:18PM -0600, Tycho Andersen wrote:
> I'll do that for the next version

Actually, looking closer, I think we just need to mirror
split_large_page()'s debug_pagealloc_enabled() checks in set_kpte():

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index a1344f27406c..c962bd7f34cc 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -54,9 +54,11 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)

do_split = try_preserve_large_page(pte, (unsigned long)kaddr, &cpa);
if (do_split) {
- spin_lock(&cpa_lock);
+ if (!debug_pagealloc_enabled())
+ spin_lock(&cpa_lock);
BUG_ON(split_large_page(&cpa, pte, (unsigned long)kaddr));
- spin_unlock(&cpa_lock);
+ if (!debug_pagealloc_enabled())
+ spin_unlock(&cpa_lock);
}

break;


Tycho

2017-08-15 09:40:20

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 07/10] arm64/mm: Don't flush the data cache if the page is unmapped by XPFO

On Mon, Aug 14, 2017 at 02:27:27PM -0600, Tycho Andersen wrote:
> Hi Mark,
>
> First, thanks for taking a look!
>
> On Sat, Aug 12, 2017 at 12:57:37PM +0100, Mark Rutland wrote:
> > On Wed, Aug 09, 2017 at 02:07:52PM -0600, Tycho Andersen wrote:
> > > From: Juerg Haefliger <[email protected]>
> > >
> > > If the page is unmapped by XPFO, a data cache flush results in a fatal
> > > page fault. So don't flush in that case.
> >
> > Do you have an example callchain where that happens? We might need to shuffle
> > things around to cater for that case.
>
> Here's one from the other branch (i.e. xpfo_page_is_unmapped() is true):
>
> [ 15.487293] CPU: 2 PID: 1633 Comm: plymouth Not tainted 4.13.0-rc4-c2+ #242
> [ 15.487295] Hardware name: Hardkernel ODROID-C2 (DT)
> [ 15.487297] Call trace:
> [ 15.487313] [<ffff0000080884f0>] dump_backtrace+0x0/0x248
> [ 15.487317] [<ffff00000808878c>] show_stack+0x14/0x20
> [ 15.487324] [<ffff000008b3e1b8>] dump_stack+0x98/0xb8
> [ 15.487329] [<ffff000008098bb4>] sync_icache_aliases+0x84/0x98
> [ 15.487332] [<ffff000008098c74>] __sync_icache_dcache+0x64/0x88
> [ 15.487337] [<ffff0000081d4814>] alloc_set_pte+0x4ec/0x6b8
> [ 15.487342] [<ffff00000819d920>] filemap_map_pages+0x350/0x360
> [ 15.487344] [<ffff0000081d4ccc>] do_fault+0x28c/0x568
> [ 15.487347] [<ffff0000081d67a0>] __handle_mm_fault+0x410/0xd08
> [ 15.487350] [<ffff0000081d7164>] handle_mm_fault+0xcc/0x1a8
> [ 15.487352] [<ffff000008098580>] do_page_fault+0x270/0x380
> [ 15.487355] [<ffff00000808128c>] do_mem_abort+0x3c/0x98
> [ 15.487358] Exception stack(0xffff800061dabe20 to 0xffff800061dabf50)
> [ 15.487362] be20: 0000000000000000 0000800062e19000 ffffffffffffffff 0000ffff8f64ddc8
> [ 15.487365] be40: ffff800061dabe80 ffff000008238810 ffff800061d80330 0000000000000018
> [ 15.487368] be60: ffffffffffffffff 0000ffff8f5ba958 ffff800061d803d0 ffff800067132e18
> [ 15.487370] be80: 0000000000000000 ffff800061d80d08 0000000000000000 0000000000000019
> [ 15.487373] bea0: 000000002bd3d0f0 0000000000000000 0000000000000019 ffff800067132e00
> [ 15.487376] bec0: 0000000000000000 0000ffff8f657220 0000000000000000 0000000000000000
> [ 15.487379] bee0: 8080808000000000 0000000000000000 0000000080808080 fefefeff6f6b6467
> [ 15.487381] bf00: 7f7f7f7f7f7f7f7f 000000002bd3fb40 0101010101010101 0000000000000020
> [ 15.487384] bf20: 00000000004072b0 00000000004072e0 0000000000000000 0000ffff8f6b2000
> [ 15.487386] bf40: 0000ffff8f66b190 0000ffff8f576380
> [ 15.487389] [<ffff000008082b74>] el0_da+0x20/0x24
>
> > > @@ -30,7 +31,9 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
> > > unsigned long addr = (unsigned long)kaddr;
> > >
> > > if (icache_is_aliasing()) {
> > > - __clean_dcache_area_pou(kaddr, len);
> > > + /* Don't flush if the page is unmapped by XPFO */
> > > + if (!xpfo_page_is_unmapped(virt_to_page(kaddr)))
> > > + __clean_dcache_area_pou(kaddr, len);
> > > __flush_icache_all();
> > > } else {
> > > flush_icache_range(addr, addr + len);
> >
> > I don't think this patch is correct. If data cache maintenance is required in
> > the absence of XPFO, I don't see why it wouldn't be required in the presence of
> > XPFO.
>
> Ok. I suppose we could do re-map like we do for dma; or is there some
> re-arrangement of things you can see that would help?

Creating a temporary mapping (which I guess you do for DMA) should work.

We might be able to perform the maintenance before we unmap the linear
map alias, but I guess this might not be possible if that logic is
too far removed from the PTE manipulation.
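
For the temporary-mapping option, a minimal sketch of the shape I mean
(the helper name is made up, vmap() can sleep so an atomic path would
need something else, and error handling is elided):

/* Sketch: do the D-cache maintenance through a temporary alias when the
 * linear-map alias of the page has been unmapped by XPFO.
 */
static void xpfo_temp_clean_dcache(struct page *page, unsigned long offset,
				   size_t len)
{
	void *alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

	if (!alias)
		return;
	__clean_dcache_area_pou(alias + offset, len);
	vunmap(alias);
}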

> > On a more general note, in future it would be good to Cc the arm64 maintainers
> > and the linux-arm-kernel mailing list for patches affecting arm64.
>
> Yes, I thought about doing that for the series, but since it has x86 patches
> too, I didn't want to spam everyone :). I'll just add x86/arm lists to CC in
> the patches in the future. If there's some better way, let me know.

That sounds fine; thanks.

Mark.

2017-08-23 16:58:46

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

Hi Mark,

On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> That said, is there any reason not to use flush_tlb_kernel_range()
> directly?

So it turns out that there is a difference between __flush_tlb_one() and
flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
via on_each_cpu(), whereas __flush_tlb_one() only flushes the local TLB (which
I think is enough here).
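
To make the distinction concrete, here is a stripped-down sketch (the
helper names are made up, and the real flush_tlb_kernel_range() also
loops over the whole range rather than a single address):

/* Local flush: invalidate a single address in this CPU's TLB only. */
static void flush_local_one(unsigned long addr)
{
	asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

static void ipi_flush_one(void *info)
{
	flush_local_one(*(unsigned long *)info);
}

/* Global flush: IPI every online CPU (via <linux/smp.h>) and have each
 * of them invalidate its own entry, waiting for all to finish. */
static void flush_all_one(unsigned long addr)
{
	on_each_cpu(ipi_flush_one, &addr, 1);
}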

As you might expect, this is quite a performance hit (at least under kvm), I
ran a little kernbench:

# __flush_tlb_one
Wed Aug 23 15:47:33 UTC 2017
4.13.0-rc5+
Average Half load -j 2 Run (std deviation):
Elapsed Time 50.3233 (1.82716)
User Time 87.1233 (1.26871)
System Time 15.36 (0.500899)
Percent CPU 203.667 (4.04145)
Context Switches 7350.33 (1339.65)
Sleeps 16008.3 (980.362)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 27.4267 (0.215019)
User Time 88.6983 (1.91501)
System Time 13.1933 (2.39488)
Percent CPU 286.333 (90.6083)
Context Switches 11393 (4509.14)
Sleeps 15764.7 (698.048)

# flush_tlb_kernel_range()
Wed Aug 23 16:00:03 UTC 2017
4.13.0-rc5+
Average Half load -j 2 Run (std deviation):
Elapsed Time 86.57 (1.06099)
User Time 103.25 (1.85475)
System Time 75.4433 (0.415852)
Percent CPU 205.667 (3.21455)
Context Switches 9363.33 (1361.57)
Sleeps 14703.3 (1439.12)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 51.27 (0.615873)
User Time 110.328 (7.93884)
System Time 74.06 (1.55788)
Percent CPU 288 (90.2197)
Context Switches 16557.5 (7930.01)
Sleeps 14774.7 (921.746)

So, I think we need to keep something like __flush_tlb_one around.
I'll call it flush_one_local_tlb() for now, and will cc x86@ on the
next version to see if they have any insight.

Cheers,

Tycho

2017-08-23 17:05:58

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> Hi Mark,
>
> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > That said, is there any reason not to use flush_tlb_kernel_range()
> > directly?
>
> So it turns out that there is a difference between __flush_tlb_one() and
> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> I think is enough here).

That sounds suspicious; I don't think that __flush_tlb_one() is
sufficient.

If you only do local TLB maintenance, then the page is left accessible
to other CPUs via the (stale) kernel mappings. i.e. the page isn't
exclusively mapped by userspace.

Thanks,
Mark.

2017-08-23 17:13:06

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Wed, Aug 23, 2017 at 06:04:43PM +0100, Mark Rutland wrote:
> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> > Hi Mark,
> >
> > On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > > That said, is there any reason not to use flush_tlb_kernel_range()
> > > directly?
> >
> > So it turns out that there is a difference between __flush_tlb_one() and
> > flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> > via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> > I think is enough here).
>
> That sounds suspicious; I don't think that __flush_tlb_one() is
> sufficient.
>
> If you only do local TLB maintenance, then the page is left accessible
> to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> exclusively mapped by userspace.

I thought so too, so I tried to test it with something like the patch
below. But it correctly failed for me when using __flush_tlb_one(). I
suppose I'm doing something wrong in the test, but I'm not sure what.

Tycho


>From 1d1b0a18d56cf1634072096231bfbaa96cb2aa16 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <[email protected]>
Date: Tue, 22 Aug 2017 18:07:12 -0600
Subject: [PATCH] add XPFO_SMP test

Signed-off-by: Tycho Andersen <[email protected]>
---
drivers/misc/lkdtm.h | 1 +
drivers/misc/lkdtm_core.c | 1 +
drivers/misc/lkdtm_xpfo.c | 139 ++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index fc53546113c1..34a6ee37f216 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -67,5 +67,6 @@ void lkdtm_USERCOPY_KERNEL(void);
/* lkdtm_xpfo.c */
void lkdtm_XPFO_READ_USER(void);
void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);

#endif
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 164bc404f416..9544e329de4b 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -237,6 +237,7 @@ struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(XPFO_READ_USER),
CRASHTYPE(XPFO_READ_USER_HUGE),
+ CRASHTYPE(XPFO_SMP),
};


diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
index c72509128eb3..7600fdcae22f 100644
--- a/drivers/misc/lkdtm_xpfo.c
+++ b/drivers/misc/lkdtm_xpfo.c
@@ -4,22 +4,27 @@

#include "lkdtm.h"

+#include <linux/cpumask.h>
#include <linux/mman.h>
#include <linux/uaccess.h>
#include <linux/xpfo.h>
+#include <linux/kthread.h>

-void read_user_with_flags(unsigned long flags)
+#include <linux/delay.h>
+#include <linux/sched/task.h>
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
{
- unsigned long user_addr, user_data = 0xdeadbeef;
- phys_addr_t phys_addr;
- void *virt_addr;
+ unsigned long user_addr, user_data = XPFO_DATA;

user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
PROT_READ | PROT_WRITE | PROT_EXEC,
flags, 0);
if (user_addr >= TASK_SIZE) {
pr_warn("Failed to allocate user memory\n");
- return;
+ return 0;
}

if (copy_to_user((void __user *)user_addr, &user_data,
@@ -28,25 +33,61 @@ void read_user_with_flags(unsigned long flags)
goto free_user;
}

+ return user_addr;
+
+free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+ return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+ phys_addr_t phys_addr;
+ void *virt_addr;
+
phys_addr = user_virt_to_phys(user_addr);
if (!phys_addr) {
pr_warn("Failed to get physical address of user memory\n");
- goto free_user;
+ return 0;
}

virt_addr = phys_to_virt(phys_addr);
if (phys_addr != virt_to_phys(virt_addr)) {
pr_warn("Physical address of user memory seems incorrect\n");
- goto free_user;
+ return 0;
}

+ return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
pr_info("Attempting bad read from kernel address %p\n", virt_addr);
- if (*(unsigned long *)virt_addr == user_data)
- pr_info("Huh? Bad read succeeded?!\n");
+ if (*(unsigned long *)virt_addr == XPFO_DATA)
+ pr_err("FAIL: Bad read succeeded?!\n");
else
- pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
+ pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+ unsigned long user_addr, *kernel;
+
+ user_addr = do_map(flags);
+ if (!user_addr) {
+ pr_err("FAIL: map failed\n");
+ return;
+ }
+
+ kernel = user_to_kernel(user_addr);
+ if (!kernel) {
+ pr_err("FAIL: user to kernel conversion failed\n");
+ goto free_user;
+ }
+
+ read_map(kernel);

- free_user:
+free_user:
vm_munmap(user_addr, PAGE_SIZE);
}

@@ -60,3 +101,79 @@ void lkdtm_XPFO_READ_USER_HUGE(void)
{
read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
}
+
+struct smp_arg {
+ struct completion map_done;
+ unsigned long *virt_addr;
+ unsigned int cpu;
+};
+
+static int smp_reader(void *parg)
+{
+ struct smp_arg *arg = parg;
+
+ if (arg->cpu != smp_processor_id()) {
+ pr_err("FAIL: scheduled on wrong CPU?\n");
+ return 0;
+ }
+
+ wait_for_completion(&arg->map_done);
+
+ if (arg->virt_addr)
+ read_map(arg->virt_addr);
+
+ return 0;
+}
+
+/* The idea here is to read from the kernel's map on a different thread than
+ * did the mapping (and thus the TLB flushing), to make sure that the page
+ * faults on other cores too.
+ */
+void lkdtm_XPFO_SMP(void)
+{
+ unsigned long user_addr;
+ struct task_struct *thread;
+ int ret;
+ struct smp_arg arg;
+
+ init_completion(&arg.map_done);
+
+ if (num_online_cpus() < 2) {
+ pr_err("not enough to do a multi cpu test\n");
+ return;
+ }
+
+ arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
+ thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
+ if (IS_ERR(thread)) {
+ pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
+ return;
+ }
+
+ kthread_bind(thread, arg.cpu);
+ get_task_struct(thread);
+ wake_up_process(thread);
+
+ user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
+ if (user_addr) {
+ arg.virt_addr = user_to_kernel(user_addr);
+ /* child thread checks for failure */
+ }
+
+ complete(&arg.map_done);
+
+ /* there must be a better way to do this. */
+ while (1) {
+ if (thread->exit_state)
+ break;
+ msleep_interruptible(100);
+ }
+
+ ret = kthread_stop(thread);
+ if (ret != SIGKILL)
+ pr_err("FAIL: thread wasn't killed: %d\n", ret);
+ put_task_struct(thread);
+
+ if (user_addr)
+ vm_munmap(user_addr, PAGE_SIZE);
+}
--
2.11.0

2017-08-24 15:46:35

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Wed, Aug 23, 2017 at 11:13:02AM -0600, Tycho Andersen wrote:
> On Wed, Aug 23, 2017 at 06:04:43PM +0100, Mark Rutland wrote:
> > On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> > > Hi Mark,
> > >
> > > On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > > > That said, is there any reason not to use flush_tlb_kernel_range()
> > > > directly?
> > >
> > > So it turns out that there is a difference between __flush_tlb_one() and
> > > flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> > > via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> > > I think is enough here).
> >
> > That sounds suspicious; I don't think that __flush_tlb_one() is
> > sufficient.
> >
> > If you only do local TLB maintenance, then the page is left accessible
> > to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> > exclusively mapped by userspace.
>
> I thought so too, so I tried to test it with something like the patch
> below. But it correctly failed for me when using __flush_tlb_one(). I
> suppose I'm doing something wrong in the test, but I'm not sure what.

I suspect the issue is that you use a completion to synchronise the
mapping.

The reader thread will block (i.e. we go into schedule() and something
else will run), and I guess that on x86 the context switch this entails
upon completion happens to invalidate the TLBs.

Instead, you could serialise the update with the reader doing:

/* spin until address is published to us */
addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);

read_map(addr);

... and the writer doing:

user_addr = do_map(...)

...

smp_store_release(&arg->virt_addr, user_addr);

There would still be a chance of a context-switch, but it wouldn't be
mandatory.
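
Applied to the test above, that would look something like this (a
sketch only: it publishes the kernel alias rather than the user address
since that is what read_map() wants, and on a failed map the writer
would need to publish some sentinel so the reader doesn't spin forever):

static int smp_reader(void *parg)
{
	struct smp_arg *arg = parg;
	unsigned long *addr;

	/* spin until the writer publishes the kernel alias */
	addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
	read_map(addr);
	return 0;
}

and on the writer side, instead of the completion:

	arg.virt_addr = NULL;
	wake_up_process(thread);
	user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
	if (user_addr)
		smp_store_release(&arg.virt_addr, user_to_kernel(user_addr));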

As an aside, it looks like DEBUG_PAGEALLOC on x86 has the problem w.r.t.
under-invalidating, judging by the comments in x86's
__kernel_map_pages(). It only invalidates the local TLBs, even though
it should do it on all CPUs.

Thanks,
Mark.

>
> Tycho
>
>
> From 1d1b0a18d56cf1634072096231bfbaa96cb2aa16 Mon Sep 17 00:00:00 2001
> From: Tycho Andersen <[email protected]>
> Date: Tue, 22 Aug 2017 18:07:12 -0600
> Subject: [PATCH] add XPFO_SMP test
>
> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> drivers/misc/lkdtm.h | 1 +
> drivers/misc/lkdtm_core.c | 1 +
> drivers/misc/lkdtm_xpfo.c | 139 ++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 130 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
> index fc53546113c1..34a6ee37f216 100644
> --- a/drivers/misc/lkdtm.h
> +++ b/drivers/misc/lkdtm.h
> @@ -67,5 +67,6 @@ void lkdtm_USERCOPY_KERNEL(void);
> /* lkdtm_xpfo.c */
> void lkdtm_XPFO_READ_USER(void);
> void lkdtm_XPFO_READ_USER_HUGE(void);
> +void lkdtm_XPFO_SMP(void);
>
> #endif
> diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
> index 164bc404f416..9544e329de4b 100644
> --- a/drivers/misc/lkdtm_core.c
> +++ b/drivers/misc/lkdtm_core.c
> @@ -237,6 +237,7 @@ struct crashtype crashtypes[] = {
> CRASHTYPE(USERCOPY_KERNEL),
> CRASHTYPE(XPFO_READ_USER),
> CRASHTYPE(XPFO_READ_USER_HUGE),
> + CRASHTYPE(XPFO_SMP),
> };
>
>
> diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
> index c72509128eb3..7600fdcae22f 100644
> --- a/drivers/misc/lkdtm_xpfo.c
> +++ b/drivers/misc/lkdtm_xpfo.c
> @@ -4,22 +4,27 @@
>
> #include "lkdtm.h"
>
> +#include <linux/cpumask.h>
> #include <linux/mman.h>
> #include <linux/uaccess.h>
> #include <linux/xpfo.h>
> +#include <linux/kthread.h>
>
> -void read_user_with_flags(unsigned long flags)
> +#include <linux/delay.h>
> +#include <linux/sched/task.h>
> +
> +#define XPFO_DATA 0xdeadbeef
> +
> +static unsigned long do_map(unsigned long flags)
> {
> - unsigned long user_addr, user_data = 0xdeadbeef;
> - phys_addr_t phys_addr;
> - void *virt_addr;
> + unsigned long user_addr, user_data = XPFO_DATA;
>
> user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
> PROT_READ | PROT_WRITE | PROT_EXEC,
> flags, 0);
> if (user_addr >= TASK_SIZE) {
> pr_warn("Failed to allocate user memory\n");
> - return;
> + return 0;
> }
>
> if (copy_to_user((void __user *)user_addr, &user_data,
> @@ -28,25 +33,61 @@ void read_user_with_flags(unsigned long flags)
> goto free_user;
> }
>
> + return user_addr;
> +
> +free_user:
> + vm_munmap(user_addr, PAGE_SIZE);
> + return 0;
> +}
> +
> +static unsigned long *user_to_kernel(unsigned long user_addr)
> +{
> + phys_addr_t phys_addr;
> + void *virt_addr;
> +
> phys_addr = user_virt_to_phys(user_addr);
> if (!phys_addr) {
> pr_warn("Failed to get physical address of user memory\n");
> - goto free_user;
> + return 0;
> }
>
> virt_addr = phys_to_virt(phys_addr);
> if (phys_addr != virt_to_phys(virt_addr)) {
> pr_warn("Physical address of user memory seems incorrect\n");
> - goto free_user;
> + return 0;
> }
>
> + return virt_addr;
> +}
> +
> +static void read_map(unsigned long *virt_addr)
> +{
> pr_info("Attempting bad read from kernel address %p\n", virt_addr);
> - if (*(unsigned long *)virt_addr == user_data)
> - pr_info("Huh? Bad read succeeded?!\n");
> + if (*(unsigned long *)virt_addr == XPFO_DATA)
> + pr_err("FAIL: Bad read succeeded?!\n");
> else
> - pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
> + pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
> +}
> +
> +static void read_user_with_flags(unsigned long flags)
> +{
> + unsigned long user_addr, *kernel;
> +
> + user_addr = do_map(flags);
> + if (!user_addr) {
> + pr_err("FAIL: map failed\n");
> + return;
> + }
> +
> + kernel = user_to_kernel(user_addr);
> + if (!kernel) {
> + pr_err("FAIL: user to kernel conversion failed\n");
> + goto free_user;
> + }
> +
> + read_map(kernel);
>
> - free_user:
> +free_user:
> vm_munmap(user_addr, PAGE_SIZE);
> }
>
> @@ -60,3 +101,79 @@ void lkdtm_XPFO_READ_USER_HUGE(void)
> {
> read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
> }
> +
> +struct smp_arg {
> + struct completion map_done;
> + unsigned long *virt_addr;
> + unsigned int cpu;
> +};
> +
> +static int smp_reader(void *parg)
> +{
> + struct smp_arg *arg = parg;
> +
> + if (arg->cpu != smp_processor_id()) {
> + pr_err("FAIL: scheduled on wrong CPU?\n");
> + return 0;
> + }
> +
> + wait_for_completion(&arg->map_done);
> +
> + if (arg->virt_addr)
> + read_map(arg->virt_addr);
> +
> + return 0;
> +}
> +
> +/* The idea here is to read from the kernel's map on a different thread than
> + * did the mapping (and thus the TLB flushing), to make sure that the page
> + * faults on other cores too.
> + */
> +void lkdtm_XPFO_SMP(void)
> +{
> + unsigned long user_addr;
> + struct task_struct *thread;
> + int ret;
> + struct smp_arg arg;
> +
> + init_completion(&arg.map_done);
> +
> + if (num_online_cpus() < 2) {
> + pr_err("not enough to do a multi cpu test\n");
> + return;
> + }
> +
> + arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
> + thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
> + if (IS_ERR(thread)) {
> + pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
> + return;
> + }
> +
> + kthread_bind(thread, arg.cpu);
> + get_task_struct(thread);
> + wake_up_process(thread);
> +
> + user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
> + if (user_addr) {
> + arg.virt_addr = user_to_kernel(user_addr);
> + /* child thread checks for failure */
> + }
> +
> + complete(&arg.map_done);
> +
> + /* there must be a better way to do this. */
> + while (1) {
> + if (thread->exit_state)
> + break;
> + msleep_interruptible(100);
> + }
> +
> + ret = kthread_stop(thread);
> + if (ret != SIGKILL)
> + pr_err("FAIL: thread wasn't killed: %d\n", ret);
> + put_task_struct(thread);
> +
> + if (user_addr)
> + vm_munmap(user_addr, PAGE_SIZE);
> +}
> --
> 2.11.0
>

2017-08-29 17:24:36

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

Hi Mark,

On Thu, Aug 24, 2017 at 04:45:19PM +0100, Mark Rutland wrote:
> On Wed, Aug 23, 2017 at 11:13:02AM -0600, Tycho Andersen wrote:
> > On Wed, Aug 23, 2017 at 06:04:43PM +0100, Mark Rutland wrote:
> > > On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> > > > Hi Mark,
> > > >
> > > > On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > > > > That said, is there any reason not to use flush_tlb_kernel_range()
> > > > > directly?
> > > >
> > > > So it turns out that there is a difference between __flush_tlb_one() and
> > > > flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> > > > via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> > > > I think is enough here).
> > >
> > > That sounds suspicious; I don't think that __flush_tlb_one() is
> > > sufficient.
> > >
> > > If you only do local TLB maintenance, then the page is left accessible
> > > to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> > > exclusively mapped by userspace.
> >
> > I thought so too, so I tried to test it with something like the patch
> > below. But it correctly failed for me when using __flush_tlb_one(). I
> > suppose I'm doing something wrong in the test, but I'm not sure what.
>
> I suspect the issue is that you use a completion to synchronise the
> mapping.
>
> The reader thread will block (i.e. it we go into schedule() and
> something else will run), and I guess that on x86, that the
> context-switch this entails upon completion happens to invalidate the
> TLBs.
>
> Instead, you could serialise the update with the reader doing:
>
> /* spin until address is published to us */
> addr = smp_cond_load_acquire(arg->virt_addr, VAL != NULL);
>
> read_map(addr);
>
> ... and the writer doing:
>
> user_addr = do_map(...)
>
> ...
>
> smp_store_release(arg->virt_addr, user_addr);
>
> There would still be a chance of a context-switch, but it wouldn't be
> mandatory.
>

Thanks for this. However, I've tried several different variations
(including just raw spinning), and it still seems to fail correctly
for me.

I'll see about just cleaning up the series with this round of feedback
and post the whole thing again with CCs to the arm/x86 lists where
appropriate, and hopefully someone can tell me what I'm doing wrong.

Thanks,

Tycho

> As an aside, it looks like DEBUG_PAGEALLOC on x86 has the problem w.r.t.
> under-invalidating, juding by the comments in x86's
> __kernel_map_pages(). It only invalidates the local TLBs, even though
> it should do it on all CPUs.
>
> Thanks,
> Mark.
>
> >
> > Tycho
> >
> >
> > From 1d1b0a18d56cf1634072096231bfbaa96cb2aa16 Mon Sep 17 00:00:00 2001
> > From: Tycho Andersen <[email protected]>
> > Date: Tue, 22 Aug 2017 18:07:12 -0600
> > Subject: [PATCH] add XPFO_SMP test
> >
> > Signed-off-by: Tycho Andersen <[email protected]>
> > ---
> > drivers/misc/lkdtm.h | 1 +
> > drivers/misc/lkdtm_core.c | 1 +
> > drivers/misc/lkdtm_xpfo.c | 139 ++++++++++++++++++++++++++++++++++++++++++----
> > 3 files changed, 130 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
> > index fc53546113c1..34a6ee37f216 100644
> > --- a/drivers/misc/lkdtm.h
> > +++ b/drivers/misc/lkdtm.h
> > @@ -67,5 +67,6 @@ void lkdtm_USERCOPY_KERNEL(void);
> > /* lkdtm_xpfo.c */
> > void lkdtm_XPFO_READ_USER(void);
> > void lkdtm_XPFO_READ_USER_HUGE(void);
> > +void lkdtm_XPFO_SMP(void);
> >
> > #endif
> > diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
> > index 164bc404f416..9544e329de4b 100644
> > --- a/drivers/misc/lkdtm_core.c
> > +++ b/drivers/misc/lkdtm_core.c
> > @@ -237,6 +237,7 @@ struct crashtype crashtypes[] = {
> > CRASHTYPE(USERCOPY_KERNEL),
> > CRASHTYPE(XPFO_READ_USER),
> > CRASHTYPE(XPFO_READ_USER_HUGE),
> > + CRASHTYPE(XPFO_SMP),
> > };
> >
> >
> > diff --git a/drivers/misc/lkdtm_xpfo.c b/drivers/misc/lkdtm_xpfo.c
> > index c72509128eb3..7600fdcae22f 100644
> > --- a/drivers/misc/lkdtm_xpfo.c
> > +++ b/drivers/misc/lkdtm_xpfo.c
> > @@ -4,22 +4,27 @@
> >
> > #include "lkdtm.h"
> >
> > +#include <linux/cpumask.h>
> > #include <linux/mman.h>
> > #include <linux/uaccess.h>
> > #include <linux/xpfo.h>
> > +#include <linux/kthread.h>
> >
> > -void read_user_with_flags(unsigned long flags)
> > +#include <linux/delay.h>
> > +#include <linux/sched/task.h>
> > +
> > +#define XPFO_DATA 0xdeadbeef
> > +
> > +static unsigned long do_map(unsigned long flags)
> > {
> > - unsigned long user_addr, user_data = 0xdeadbeef;
> > - phys_addr_t phys_addr;
> > - void *virt_addr;
> > + unsigned long user_addr, user_data = XPFO_DATA;
> >
> > user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
> > PROT_READ | PROT_WRITE | PROT_EXEC,
> > flags, 0);
> > if (user_addr >= TASK_SIZE) {
> > pr_warn("Failed to allocate user memory\n");
> > - return;
> > + return 0;
> > }
> >
> > if (copy_to_user((void __user *)user_addr, &user_data,
> > @@ -28,25 +33,61 @@ void read_user_with_flags(unsigned long flags)
> > goto free_user;
> > }
> >
> > + return user_addr;
> > +
> > +free_user:
> > + vm_munmap(user_addr, PAGE_SIZE);
> > + return 0;
> > +}
> > +
> > +static unsigned long *user_to_kernel(unsigned long user_addr)
> > +{
> > + phys_addr_t phys_addr;
> > + void *virt_addr;
> > +
> > phys_addr = user_virt_to_phys(user_addr);
> > if (!phys_addr) {
> > pr_warn("Failed to get physical address of user memory\n");
> > - goto free_user;
> > + return 0;
> > }
> >
> > virt_addr = phys_to_virt(phys_addr);
> > if (phys_addr != virt_to_phys(virt_addr)) {
> > pr_warn("Physical address of user memory seems incorrect\n");
> > - goto free_user;
> > + return 0;
> > }
> >
> > + return virt_addr;
> > +}
> > +
> > +static void read_map(unsigned long *virt_addr)
> > +{
> > pr_info("Attempting bad read from kernel address %p\n", virt_addr);
> > - if (*(unsigned long *)virt_addr == user_data)
> > - pr_info("Huh? Bad read succeeded?!\n");
> > + if (*(unsigned long *)virt_addr == XPFO_DATA)
> > + pr_err("FAIL: Bad read succeeded?!\n");
> > else
> > - pr_info("Huh? Bad read didn't fail but data is incorrect?!\n");
> > + pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
> > +}
> > +
> > +static void read_user_with_flags(unsigned long flags)
> > +{
> > + unsigned long user_addr, *kernel;
> > +
> > + user_addr = do_map(flags);
> > + if (!user_addr) {
> > + pr_err("FAIL: map failed\n");
> > + return;
> > + }
> > +
> > + kernel = user_to_kernel(user_addr);
> > + if (!kernel) {
> > + pr_err("FAIL: user to kernel conversion failed\n");
> > + goto free_user;
> > + }
> > +
> > + read_map(kernel);
> >
> > - free_user:
> > +free_user:
> > vm_munmap(user_addr, PAGE_SIZE);
> > }
> >
> > @@ -60,3 +101,79 @@ void lkdtm_XPFO_READ_USER_HUGE(void)
> > {
> > read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
> > }
> > +
> > +struct smp_arg {
> > + struct completion map_done;
> > + unsigned long *virt_addr;
> > + unsigned int cpu;
> > +};
> > +
> > +static int smp_reader(void *parg)
> > +{
> > + struct smp_arg *arg = parg;
> > +
> > + if (arg->cpu != smp_processor_id()) {
> > + pr_err("FAIL: scheduled on wrong CPU?\n");
> > + return 0;
> > + }
> > +
> > + wait_for_completion(&arg->map_done);
> > +
> > + if (arg->virt_addr)
> > + read_map(arg->virt_addr);
> > +
> > + return 0;
> > +}
> > +
> > +/* The idea here is to read from the kernel's map on a different thread than
> > + * did the mapping (and thus the TLB flushing), to make sure that the page
> > + * faults on other cores too.
> > + */
> > +void lkdtm_XPFO_SMP(void)
> > +{
> > + unsigned long user_addr;
> > + struct task_struct *thread;
> > + int ret;
> > + struct smp_arg arg;
> > +
> > + init_completion(&arg.map_done);
> > +
> > + if (num_online_cpus() < 2) {
> > + pr_err("not enough to do a multi cpu test\n");
> > + return;
> > + }
> > +
> > + arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
> > + thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
> > + if (IS_ERR(thread)) {
> > + pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
> > + return;
> > + }
> > +
> > + kthread_bind(thread, arg.cpu);
> > + get_task_struct(thread);
> > + wake_up_process(thread);
> > +
> > + user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
> > + if (user_addr) {
> > + arg.virt_addr = user_to_kernel(user_addr);
> > + /* child thread checks for failure */
> > + }
> > +
> > + complete(&arg.map_done);
> > +
> > + /* there must be a better way to do this. */
> > + while (1) {
> > + if (thread->exit_state)
> > + break;
> > + msleep_interruptible(100);
> > + }
> > +
> > + ret = kthread_stop(thread);
> > + if (ret != SIGKILL)
> > + pr_err("FAIL: thread wasn't killed: %d\n", ret);
> > + put_task_struct(thread);
> > +
> > + if (user_addr)
> > + vm_munmap(user_addr, PAGE_SIZE);
> > +}
> > --
> > 2.11.0
> >

2017-08-30 05:31:30

by Juerg Haefliger

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()



On 08/23/2017 07:04 PM, Mark Rutland wrote:
> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
>> Hi Mark,
>>
>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
>>> That said, is there any reason not to use flush_tlb_kernel_range()
>>> directly?
>>
>> So it turns out that there is a difference between __flush_tlb_one() and
>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
>> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
>> I think is enough here).
>
> That sounds suspicious; I don't think that __flush_tlb_one() is
> sufficient.
>
> If you only do local TLB maintenance, then the page is left accessible
> to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> exclusively mapped by userspace.

We flush all CPUs to get rid of stale entries when a new page is
allocated to userspace that was previously allocated to the kernel.
Is that the scenario you were thinking of?

...Juerg


> Thanks,
> Mark.
>



2017-08-30 16:47:28

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
>
>
> On 08/23/2017 07:04 PM, Mark Rutland wrote:
> > On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> >> Hi Mark,
> >>
> >> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> >>> That said, is there any reason not to use flush_tlb_kernel_range()
> >>> directly?
> >>
> >> So it turns out that there is a difference between __flush_tlb_one() and
> >> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> >> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> >> I think is enough here).
> >
> > That sounds suspicious; I don't think that __flush_tlb_one() is
> > sufficient.
> >
> > If you only do local TLB maintenance, then the page is left accessible
> > to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> > exclusively mapped by userspace.
>
> We flush all CPUs to get rid of stale entries when a new page is
> allocated to userspace that was previously allocated to the kernel.
> Is that the scenario you were thinking of?

I think there are two cases, the one you describe above, where the
pages are first allocated, and a second one, where e.g. the pages are
mapped into the kernel because of DMA or whatever. In the case you
describe above, I think we're doing the right thing (which is why my
test worked correctly, because it tested this case).

In the second case, when the pages are unmapped (i.e. the kernel is
done doing DMA), do we need to flush the other CPUs' TLBs? I think the
current code is not quite correct: if multiple tasks (CPUs) map the
pages, only the TLB of the last one is flushed when the mapping is
cleared, since the flush only happens when ->mapcount drops to zero,
leaving stale entries in the other TLBs. It's not clear to me what to
do about this case.
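
Concretely, the interleaving I'm worried about is something like this
(CPU numbers are only illustrative, and it assumes the flush at
mapcount == 0 is local-only, as in the __flush_tlb_one() discussion):

  CPU0: xpfo_kmap(page)   mapcount 0 -> 1, kernel PTE restored, CPU0 caches a TLB entry
  CPU1: xpfo_kmap(page)   mapcount 1 -> 2, CPU1 caches a TLB entry
  CPU0: xpfo_kunmap(page) mapcount 2 -> 1, nothing flushed
  CPU1: xpfo_kunmap(page) mapcount 1 -> 0, PTE cleared, only CPU1's TLB flushed

which leaves CPU0 with a stale, still-usable TLB entry for the kernel
alias of a page that is supposed to be exclusively owned by userspace.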

Thoughts?

Tycho

> ...Juerg
>
>
> > Thanks,
> > Mark.
> >
>



2017-08-31 09:43:59

by Juerg Haefliger

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On 08/30/2017 06:47 PM, Tycho Andersen wrote:
> On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
>>
>>
>> On 08/23/2017 07:04 PM, Mark Rutland wrote:
>>> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
>>>> Hi Mark,
>>>>
>>>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
>>>>> That said, is there any reason not to use flush_tlb_kernel_range()
>>>>> directly?
>>>>
>>>> So it turns out that there is a difference between __flush_tlb_one() and
>>>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
>>>> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
>>>> I think is enough here).
>>>
>>> That sounds suspicious; I don't think that __flush_tlb_one() is
>>> sufficient.
>>>
>>> If you only do local TLB maintenance, then the page is left accessible
>>> to other CPUs via the (stale) kernel mappings. i.e. the page isn't
>>> exclusively mapped by userspace.
>>
>> We flush all CPUs to get rid of stale entries when a new page is
>> allocated to userspace that was previously allocated to the kernel.
>> Is that the scenario you were thinking of?
>
> I think there are two cases, the one you describe above, where the
> pages are first allocated, and a second one, where e.g. the pages are
> mapped into the kernel because of DMA or whatever. In the case you
> describe above, I think we're doing the right thing (which is why my
> test worked correctly, because it tested this case).
>
> In the second case, when the pages are unmapped (i.e. the kernel is
> done doing DMA), do we need to flush the other CPUs TLBs? I think the
> current code is not quite correct, because if multiple tasks (CPUs)
> map the pages, only the TLB of the last one is flushed when the
> mapping is cleared, because the tlb is only flushed when ->mapcount
> drops to zero, leaving stale entries in the other TLBs. It's not clear
> to me what to do about this case.

For this to happen, multiple CPUs need to have the same userspace page
mapped at the same time. Is this a valid scenario?

...Juerg


> Thoughts?
>
> Tycho
>
>> ...Juerg
>>
>>
>>> Thanks,
>>> Mark.
>>>
>>
>
>
>



2017-08-31 09:48:45

by Mark Rutland

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

On Thu, Aug 31, 2017 at 11:43:53AM +0200, Juerg Haefliger wrote:
> On 08/30/2017 06:47 PM, Tycho Andersen wrote:
> > On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
> >>
> >>
> >> On 08/23/2017 07:04 PM, Mark Rutland wrote:
> >>> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> >>>> Hi Mark,
> >>>>
> >>>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> >>>>> That said, is there any reason not to use flush_tlb_kernel_range()
> >>>>> directly?
> >>>>
> >>>> So it turns out that there is a difference between __flush_tlb_one() and
> >>>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> >>>> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> >>>> I think is enough here).
> >>>
> >>> That sounds suspicious; I don't think that __flush_tlb_one() is
> >>> sufficient.
> >>>
> >>> If you only do local TLB maintenance, then the page is left accessible
> >>> to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> >>> exclusively mapped by userspace.
> >>
> >> We flush all CPUs to get rid of stale entries when a new page is
> >> allocated to userspace that was previously allocated to the kernel.
> >> Is that the scenario you were thinking of?
> >
> > I think there are two cases, the one you describe above, where the
> > pages are first allocated, and a second one, where e.g. the pages are
> > mapped into the kernel because of DMA or whatever. In the case you
> > describe above, I think we're doing the right thing (which is why my
> > test worked correctly, because it tested this case).
> >
> > In the second case, when the pages are unmapped (i.e. the kernel is
> > done doing DMA), do we need to flush the other CPUs TLBs? I think the
> > current code is not quite correct, because if multiple tasks (CPUs)
> > map the pages, only the TLB of the last one is flushed when the
> > mapping is cleared, because the tlb is only flushed when ->mapcount
> > drops to zero, leaving stale entries in the other TLBs. It's not clear
> > to me what to do about this case.
>
> For this to happen, multiple CPUs need to have the same userspace page
> mapped at the same time. Is this a valid scenario?

I believe so. I think you could trigger that with a multi-threaded
application running across several CPUs. All those threads would share
the same page tables.
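
As a trivial userspace illustration (not part of this series, just to
make the sharing concrete), something like this gives every CPU the
threads run on a reason to cache a TLB entry for the same user page:

/* build with: gcc -pthread; all threads dirty the same page, so each
 * CPU they run on can end up holding a TLB entry for it */
#include <pthread.h>
#include <string.h>

static char shared_buf[4096] __attribute__((aligned(4096)));

static void *toucher(void *arg)
{
	memset(shared_buf, (int)(long)arg, sizeof(shared_buf));
	return NULL;
}

int main(void)
{
	pthread_t threads[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, toucher, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	return 0;
}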

Thanks,
Mark.

2017-08-31 21:21:47

by Tycho Andersen

[permalink] [raw]
Subject: Re: [kernel-hardening] [PATCH v5 04/10] arm64: Add __flush_tlb_one()

Hi all,

On Thu, Aug 31, 2017 at 10:47:27AM +0100, Mark Rutland wrote:
> On Thu, Aug 31, 2017 at 11:43:53AM +0200, Juerg Haefliger wrote:
> > On 08/30/2017 06:47 PM, Tycho Andersen wrote:
> > > On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
> > >>
> > >>
> > >> On 08/23/2017 07:04 PM, Mark Rutland wrote:
> > >>> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> > >>>> Hi Mark,
> > >>>>
> > >>>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > >>>>> That said, is there any reason not to use flush_tlb_kernel_range()
> > >>>>> directly?
> > >>>>
> > >>>> So it turns out that there is a difference between __flush_tlb_one() and
> > >>>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all the TLBs
> > >>>> via on_each_cpu(), where as __flush_tlb_one() only flushes the local TLB (which
> > >>>> I think is enough here).
> > >>>
> > >>> That sounds suspicious; I don't think that __flush_tlb_one() is
> > >>> sufficient.
> > >>>
> > >>> If you only do local TLB maintenance, then the page is left accessible
> > >>> to other CPUs via the (stale) kernel mappings. i.e. the page isn't
> > >>> exclusively mapped by userspace.
> > >>
> > >> We flush all CPUs to get rid of stale entries when a new page is
> > >> allocated to userspace that was previously allocated to the kernel.
> > >> Is that the scenario you were thinking of?
> > >
> > > I think there are two cases, the one you describe above, where the
> > > pages are first allocated, and a second one, where e.g. the pages are
> > > mapped into the kernel because of DMA or whatever. In the case you
> > > describe above, I think we're doing the right thing (which is why my
> > > test worked correctly, because it tested this case).
> > >
> > > In the second case, when the pages are unmapped (i.e. the kernel is
> > > done doing DMA), do we need to flush the other CPUs TLBs? I think the
> > > current code is not quite correct, because if multiple tasks (CPUs)
> > > map the pages, only the TLB of the last one is flushed when the
> > > mapping is cleared, because the tlb is only flushed when ->mapcount
> > > drops to zero, leaving stale entries in the other TLBs. It's not clear
> > > to me what to do about this case.
> >
> > For this to happen, multiple CPUs need to have the same userspace page
> > mapped at the same time. Is this a valid scenario?
>
> I believe so. I think you could trigger that with a multi-threaded
> application running across several CPUs. All those threads would share
> the same page tables.

I played around with trying to track this per-cpu, and I'm not sure
there's a nice way to do it (see the patch below, and the comment
about correctness [never mind that this patch calls alloc_percpu from
a possibly atomic context]).

I think it may be best to just flush all the TLBs of the DMA range
when the last task unmaps it. This would leave a small exploitable
race where a task had mapped/unmapped the page, but some other task
still had it mapped.

If anyone has any better ideas please let me know, otherwise I'll just
flush all the TLBs when the use count drops to zero, and post the next
version Soon (TM).

Cheers,

Tycho



>From a3a8f9da00bed910e086805f3e71b9e5e1b898b4 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <[email protected]>
Date: Thu, 31 Aug 2017 15:03:06 -0600
Subject: [PATCH] draft of per-cpu flush flag

Signed-off-by: Tycho Andersen <[email protected]>
---
mm/xpfo.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 70 insertions(+), 8 deletions(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index 0b178ad5a37e..5f9aeaaa40d2 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -31,6 +31,8 @@ struct xpfo {
bool inited; /* Map counter and lock initialized */
atomic_t mapcount; /* Counter for balancing map/unmap requests */
spinlock_t maplock; /* Lock to serialize map/unmap requests */
+ void *mapped; /* per-cpu variable to indicate whether this
+ CPU has mapped this page or not */
};

DEFINE_STATIC_KEY_FALSE(xpfo_inited);
@@ -78,6 +80,43 @@ static inline struct xpfo *lookup_xpfo(struct page *page)
return (void *)page_ext + page_xpfo_ops.offset;
}

+/*
+ * Return the map status of this page. Note that the cpu needs to be pinned,
+ * either via get_cpu() or a spin lock.
+ */
+static bool xpfo_test_unmapped(struct xpfo *xpfo)
+{
+ if (!xpfo->mapped) {
+ return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ } else {
+ return *(bool *)per_cpu_ptr(xpfo->mapped,
+ smp_processor_id());
+ }
+}
+
+/*
+ * Set the unmapped status of this page. Returns the previous state. Note that
+ * the cpu needs to be pinned, either via get_cpu() or a spin lock.
+ */
+static bool xpfo_test_set_unmapped(struct xpfo *xpfo, bool unmapped)
+{
+ if (!xpfo->mapped) {
+ if (unmapped)
+ return test_and_set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ else
+ return test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ } else {
+ bool *p, prev;
+
+ p = per_cpu_ptr(xpfo->mapped, smp_processor_id());
+ prev = *p;
+ *p = unmapped;
+
+ return prev;
+ }
+}
+
+
void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
{
int i, flush_tlb = 0;
@@ -91,7 +130,7 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
if (!xpfo)
continue;

- WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+ WARN(xpfo_test_unmapped(xpfo),
"xpfo: unmapped page being allocated\n");

/* Initialize the map lock and map counter */
@@ -99,7 +138,9 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
spin_lock_init(&xpfo->maplock);
atomic_set(&xpfo->mapcount, 0);
xpfo->inited = true;
+ xpfo->mapped = NULL;
}
+
WARN(atomic_read(&xpfo->mapcount),
"xpfo: already mapped page being allocated\n");

@@ -168,13 +209,19 @@ void xpfo_kmap(void *kaddr, struct page *page)
return;

spin_lock(&xpfo->maplock);
+ if (!xpfo->mapped) {
+ xpfo->mapped = alloc_percpu(bool);
+ if (!xpfo->mapped)
+ WARN_ON("xpfo: percpu flag allocation failed\n");
+ }

/*
* The page was previously allocated to user space, so map it back
* into the kernel. No TLB flush required.
*/
- if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
- test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ xpfo_test_set_unmapped(xpfo, false);
+
+ if (atomic_inc_return(&xpfo->mapcount) == 1)
set_kpte(kaddr, page, PAGE_KERNEL);

spin_unlock(&xpfo->maplock);
@@ -205,10 +252,25 @@ void xpfo_kunmap(void *kaddr, struct page *page)
* The page is to be allocated back to user space, so unmap it from the
* kernel, flush the TLB and tag it as a user page.
*/
- if (atomic_dec_return(&xpfo->mapcount) == 0) {
- WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
- "xpfo: unmapping already unmapped page\n");
- set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ if (xpfo->mapped) {
+ /*
+ * We have a per-cpu map, and we know it is mapped on this
+ * process, so let's flush our local TLB.
+ */
+ xpfo_test_set_unmapped(xpfo, true);
+
+ /*
+ * I think this is incorrect -- the page should still be mapped
+ * by the other cpus, it's just the TLB entry here is a bit stale.
+ */
+ set_kpte(kaddr, page, __pgprot(0));
+ __flush_tlb_one((unsigned long) kaddr);
+ } else if (atomic_dec_return(&xpfo->mapcount) == 0) {
+ /*
+ * No per-cpu map, so let's just do a best effort and
+ * unmap/flush all the TLBs when the count reaches 0.
+ */
+ xpfo_test_set_unmapped(xpfo, true);
set_kpte(kaddr, page, __pgprot(0));
flush_tlb_kernel_range((unsigned long) kaddr,
(unsigned long) kaddr + PAGE_SIZE);
@@ -229,7 +291,7 @@ bool xpfo_page_is_unmapped(struct page *page)
if (unlikely(!xpfo) && !xpfo->inited)
return false;

- return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ return xpfo_test_unmapped(xpfo);
}
EXPORT_SYMBOL(xpfo_page_is_unmapped);

--
2.11.0

2017-09-20 16:20:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -420,8 +420,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> {
> unsigned long pfn = PFN_DOWN(orig_addr);
> unsigned char *vaddr = phys_to_virt(tlb_addr);
> + struct page *page = pfn_to_page(pfn);
>
> - if (PageHighMem(pfn_to_page(pfn))) {
> + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> /* The buffer does not have a mapping. Map it in and copy */
> unsigned int offset = orig_addr & ~PAGE_MASK;
> char *buffer;

This is a little scary. I wonder how many more of these are in the
kernel, like:

> static inline void *skcipher_map(struct scatter_walk *walk)
> {
> struct page *page = scatterwalk_page(walk);
>
> return (PageHighMem(page) ? kmap_atomic(page) : page_address(page)) +
> offset_in_page(walk->offset);
> }

Is there any better way to catch these? Like, can we add some debugging
to check for XPFO pages in __va()?
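
As a sketch of the kind of check I mean (the hook point and the config
symbol name are assumptions; xpfo_page_is_unmapped() is the helper from
this series):

/* Warn when code uses the linear-map alias of a page whose kernel
 * mapping XPFO has removed. CONFIG_XPFO is assumed to be the Kconfig
 * symbol added by the series.
 */
static inline void *xpfo_check_page_address(struct page *page, void *vaddr)
{
	if (IS_ENABLED(CONFIG_XPFO) && xpfo_page_is_unmapped(page))
		WARN_ONCE(1, "xpfo: linear alias %p of unmapped page used\n",
			  vaddr);
	return vaddr;
}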

2017-09-20 22:47:45

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

On Wed, Sep 20, 2017 at 09:19:56AM -0700, Dave Hansen wrote:
> On 08/09/2017 01:07 PM, Tycho Andersen wrote:
> > --- a/lib/swiotlb.c
> > +++ b/lib/swiotlb.c
> > @@ -420,8 +420,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> > {
> > unsigned long pfn = PFN_DOWN(orig_addr);
> > unsigned char *vaddr = phys_to_virt(tlb_addr);
> > + struct page *page = pfn_to_page(pfn);
> >
> > - if (PageHighMem(pfn_to_page(pfn))) {
> > + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> > /* The buffer does not have a mapping. Map it in and copy */
> > unsigned int offset = orig_addr & ~PAGE_MASK;
> > char *buffer;
>
> This is a little scary. I wonder how many more of these are in the
> kernel, like:

I don't know, but I assume several :)

> > static inline void *skcipher_map(struct scatter_walk *walk)
> > {
> > struct page *page = scatterwalk_page(walk);
> >
> > return (PageHighMem(page) ? kmap_atomic(page) : page_address(page)) +
> > offset_in_page(walk->offset);
> > }
>
> Is there any better way to catch these? Like, can we add some debugging
> to check for XPFO pages in __va()?

Yes, and perhaps also a debugging check in PageHighMem? Would __va
have caught either of the two cases you've pointed out?

Tycho

2017-09-20 23:25:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] swiotlb: Map the buffer if it was unmapped by XPFO

On 09/20/2017 03:47 PM, Tycho Andersen wrote:
>
>>> static inline void *skcipher_map(struct scatter_walk *walk)
>>> {
>>> struct page *page = scatterwalk_page(walk);
>>>
>>> return (PageHighMem(page) ? kmap_atomic(page) : page_address(page)) +
>>> offset_in_page(walk->offset);
>>> }
>> Is there any better way to catch these? Like, can we add some debugging
>> to check for XPFO pages in __va()?
> Yes, and perhaps also a debugging check in PageHighMem?

I'm not sure what PageHighMem() would check. It's OK to use as long as
you don't depend on the contents of the page.

> Would __va have caught either of the two cases you've pointed out?
Yes. __va() is what is eventually called by lowmem_page_address(),
which is only OK to call on things that are actually mapped into the kernel.
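
For reference, the chain in question is roughly the following
(paraphrased from the 4.13-era include/linux/mm.h, CONFIG_HIGHMEM=n
case):

static __always_inline void *lowmem_page_address(const struct page *page)
{
	return page_to_virt(page);	/* ends up as __va(PFN_PHYS(page_to_pfn(page))) */
}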