I am continuing to build on the work Juerg, Tycho and Julian have done
on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were flushed
actively after an XPFO TLB update. The benchmark for measuring
performance is a kernel build using parallel make. To get full
protection from ret2dir attacks, we must flush stale TLB entries, and
the penalty for doing so grows with the number of cores. On a desktop
class machine with only 4 cores, enabling TLB flush for stale entries
causes system time for "make -j4" to go up 2.614x, but on a larger
machine with 96 cores, system time with "make -j60" goes up 26.366x!
I have been working on reducing this performance penalty and have
implemented a change that has a large impact. When the XPFO code
flushes stale TLB entries, it does so on every CPU in the system,
including CPUs that may hold no matching TLB entries and CPUs that may
never be scheduled to run the userspace task that caused the flush.
The problem is made worse by the fact that, if the number of entries
being flushed exceeds tlb_single_page_flush_ceiling, the result is a
full TLB flush on every CPU. A rogue process can launch a ret2dir
attack only from a CPU whose TLB still holds the stale physmap mapping
for its pages. We can therefore defer the TLB flush on a CPU until a
process that would have required that flush is scheduled on that CPU.
I have added a cpumask to task_struct which is used to post a pending
TLB flush to CPUs other than the one the process is currently running
on. This cpumask is checked when the process migrates to a new CPU,
and the TLB is flushed at that time. I measured system time for
parallel make with an unmodified 4.20 kernel, with 4.20 plus the XPFO
patches before this optimization, and again after applying this
optimization. Here are the results:
Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all
4.20                        915.183s
4.20+XPFO                 24129.354s   26.366x
4.20+XPFO+Deferred flush   1216.987s    1.330x
Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all
4.20                        607.671s
4.20+XPFO                  1588.646s    2.614x
4.20+XPFO+Deferred flush    794.473s    1.307x
30+% overhead is still very high and there is room for improvement.
Dave Hansen had suggested batching TLB updates, and Tycho had created
an initial implementation, but I have not been able to get that to
work correctly. I am still working on it, and I suspect we will see a
noticeable improvement in performance with that. In the code I added,
I post a pending full TLB flush to all other CPUs even when the number
of TLB entries being flushed on the current CPU does not exceed
tlb_single_page_flush_ceiling. There has to be a better way to do
this; I just haven't found an efficient way to implement a delayed,
targeted TLB flush on the other CPUs.
I am not entirely sure if switch_mm_irqs_off() is indeed the right
place to perform the pending TLB flush for a CPU. Any feedback on
that would be very helpful. Delaying full TLB flushes on other CPUs
seems to help tremendously, so if there is a better way to implement
the same thing than what I have done in patch 16, I am open to
ideas.
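To make the approach concrete, here is the core of patch 16 condensed
into a sketch (the actual patch also bumps a vmstat counter and builds
the flush_tlb_info slightly differently; see the diff later in this
series):

	/* arch/x86/mm/tlb.c: flush locally now, post a pending flush
	 * for every other CPU.
	 */
	void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		/* Flush the local TLB, ranged or full, as usual. */
		if (end == TLB_FLUSH_ALL ||
		    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
			do_flush_tlb_all(NULL);
		} else {
			struct flush_tlb_info info = { .start = start, .end = end };

			do_kernel_range_flush(&info);
		}

		/* Mark all other CPUs as needing a full flush before
		 * they run this task again.
		 */
		cpumask_setall(&current->pending_xpfo_flush);
		cpumask_clear_cpu(smp_processor_id(),
				  &current->pending_xpfo_flush);
	}

	/* In switch_mm_irqs_off(): consume the pending flush when the
	 * task is scheduled on a CPU it flagged earlier.
	 */
	if (tsk && cpumask_test_and_clear_cpu(cpu, &tsk->pending_xpfo_flush))
		__flush_tlb_all();
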
Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence the RFC.
Since not flushing stale TLB entries creates a false sense of
security, I would recommend making the TLB flush mandatory and
eliminating the "xpfotlbflush" kernel parameter (patch "mm, x86: omit
TLB flushing by default for XPFO page table modifications").
What remains to be done beyond this patch series:
1. Performance improvements
2. Remove xpfotlbflush parameter
3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
from Juerg. I dropped it for now since the swiotlb code for ARM has
changed a lot in 4.20.
4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
CPUs" to other architectures besides x86.
---------------------------------------------------------
Juerg Haefliger (5):
mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
swiotlb: Map the buffer if it was unmapped by XPFO
arm64/mm: Add support for XPFO
arm64/mm, xpfo: temporarily map dcache regions
lkdtm: Add test for XPFO
Julian Stecklina (4):
mm, x86: omit TLB flushing by default for XPFO page table
modifications
xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
xpfo, mm: optimize spinlock usage in xpfo_kunmap
EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap
Khalid Aziz (2):
xpfo, mm: Fix hang when booting with "xpfotlbflush"
xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
Tycho Andersen (5):
mm: add MAP_HUGETLB support to vm_mmap
x86: always set IF before oopsing from page fault
xpfo: add primitives for mapping underlying memory
arm64/mm: disable section/contiguous mappings if XPFO is enabled
mm: add a user_virt_to_phys symbol
.../admin-guide/kernel-parameters.txt | 2 +
arch/arm64/Kconfig | 1 +
arch/arm64/mm/Makefile | 2 +
arch/arm64/mm/flush.c | 7 +
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/xpfo.c | 58 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 26 ++
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/fault.c | 10 +
arch/x86/mm/pageattr.c | 23 +-
arch/x86/mm/tlb.c | 27 ++
arch/x86/mm/xpfo.c | 171 ++++++++++++
drivers/misc/lkdtm/Makefile | 1 +
drivers/misc/lkdtm/core.c | 3 +
drivers/misc/lkdtm/lkdtm.h | 5 +
drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++
include/linux/highmem.h | 15 +-
include/linux/mm.h | 2 +
include/linux/mm_types.h | 8 +
include/linux/page-flags.h | 13 +
include/linux/sched.h | 9 +
include/linux/xpfo.h | 90 +++++++
include/trace/events/mmflags.h | 10 +-
kernel/dma/swiotlb.c | 3 +-
mm/Makefile | 1 +
mm/mmap.c | 19 +-
mm/page_alloc.c | 3 +
mm/util.c | 32 +++
mm/xpfo.c | 247 ++++++++++++++++++
security/Kconfig | 29 ++
32 files changed, 974 insertions(+), 43 deletions(-)
create mode 100644 arch/arm64/mm/xpfo.c
create mode 100644 arch/x86/mm/xpfo.c
create mode 100644 drivers/misc/lkdtm/xpfo.c
create mode 100644 include/linux/xpfo.h
create mode 100644 mm/xpfo.c
--
2.17.1
From: Tycho Andersen <[email protected]>
In some cases (for example, arm64 DMA and data cache flushes) we may
have unmapped the underlying pages needed for an operation via XPFO.
Here are some primitives useful for ensuring the underlying memory is
mapped/unmapped in the face of XPFO.
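For illustration, a caller that must touch memory which XPFO may have
unmapped from physmap would bracket the access roughly as follows
(this mirrors the arm64 cache flush patch later in this series; the
__clean_dcache_area_pou() call simply stands in for whatever
operation needs the mapping):

	unsigned long num_pages = XPFO_NUM_PAGES(kaddr, len);
	void *mapping[num_pages];

	/* Map any pages that XPFO has unmapped from physmap. */
	xpfo_temp_map(kaddr, len, mapping, sizeof(mapping[0]) * num_pages);

	/* Operate on the memory through its linear-map address. */
	__clean_dcache_area_pou(kaddr, len);

	/* Drop the temporary kmap_atomic() mappings again. */
	xpfo_temp_unmap(kaddr, len, mapping, sizeof(mapping[0]) * num_pages);
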
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
include/linux/xpfo.h | 22 ++++++++++++++++++++++
mm/xpfo.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index e38b823f44e3..2682a00ebbcb 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -37,6 +37,15 @@ void xpfo_free_pages(struct page *page, int order);
bool xpfo_page_is_unmapped(struct page *page);
+#define XPFO_NUM_PAGES(addr, size) \
+ (PFN_UP((unsigned long) (addr) + (size)) - \
+ PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+ size_t mapping_len);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -46,6 +55,19 @@ static inline void xpfo_free_pages(struct page *page, int order) { }
static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+ void **mapping, size_t mapping_len)
+{
+}
+
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index cdbcbac582d5..f79075bf7d65 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -13,6 +13,7 @@
* the Free Software Foundation.
*/
+#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/page_ext.h>
@@ -235,3 +236,32 @@ bool xpfo_page_is_unmapped(struct page *page)
return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
}
EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+ struct page *page = virt_to_page(addr);
+ int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+ memset(mapping, 0, mapping_len);
+
+ for (i = 0; i < num_pages; i++) {
+ if (page_to_virt(page + i) >= addr + size)
+ break;
+
+ if (xpfo_page_is_unmapped(page + i))
+ mapping[i] = kmap_atomic(page + i);
+ }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+ size_t mapping_len)
+{
+ int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+ for (i = 0; i < num_pages; i++)
+ if (mapping[i])
+ kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
--
2.17.1
From: Tycho Andersen <[email protected]>
We need something like this for testing XPFO. Since it's
architecture-specific, putting it in the test code is slightly
awkward, so let's make it an arch-specific symbol and export it for
use in LKDTM.
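For reference, the LKDTM test added later in this series uses the
symbol roughly like this to turn a user virtual address into its
linear-map alias (simplified from that test, with error handling
trimmed):

	phys_addr_t phys_addr = user_virt_to_phys(user_addr);
	unsigned long *virt_addr = phys_to_virt(phys_addr);

	/* With XPFO enabled, this read should fault because the page
	 * is unmapped from physmap while it is owned by user space.
	 */
	pr_info("Attempting bad read from kernel address %p\n", virt_addr);
	if (*virt_addr == XPFO_DATA)
		pr_err("FAIL: Bad read succeeded?!\n");
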
v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
CC: [email protected]
CC: [email protected]
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/x86/mm/xpfo.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/xpfo.h | 8 +++++++
2 files changed, 65 insertions(+)
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index d1f04ea533cd..bcdb2f2089d2 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -112,3 +112,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(current->mm, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+ phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+ offset = addr & ~P4D_MASK;
+ goto out;
+ }
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+
+ if (pud_large(*pud) || !pud_present(*pud)) {
+ phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+ offset = addr & ~PUD_MASK;
+ goto out;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+
+ if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+ phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+ offset = addr & ~PMD_MASK;
+ goto out;
+ }
+
+ pte = pte_offset_kernel(pmd, addr);
+ phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ offset = addr & ~PAGE_MASK;
+
+out:
+ return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 0c26836a24e1..d4b38ab8a633 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -23,6 +23,10 @@ struct page;
#ifdef CONFIG_XPFO
+#include <linux/dma-mapping.h>
+
+#include <linux/types.h>
+
extern struct page_ext_operations page_xpfo_ops;
void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
@@ -48,6 +52,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
bool xpfo_enabled(void);
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -72,6 +78,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
static inline bool xpfo_enabled(void) { return false; }
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
--
2.17.1
From: Juerg Haefliger <[email protected]>
v6: * guard against lookup_xpfo() returning NULL
CC: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
include/linux/xpfo.h | 4 ++++
kernel/dma/swiotlb.c | 3 ++-
mm/xpfo.c | 15 +++++++++++++++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index a39259ce0174..e38b823f44e3 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -35,6 +35,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
void xpfo_free_pages(struct page *page, int order);
+bool xpfo_page_is_unmapped(struct page *page);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -42,6 +44,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
static inline void xpfo_free_pages(struct page *page, int order) { }
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..820a54b57491 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
{
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+ struct page *page = pfn_to_page(pfn);
- if (PageHighMem(pfn_to_page(pfn))) {
+ if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
/* The buffer does not have a mapping. Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index bff24afcaa2e..cdbcbac582d5 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -220,3 +220,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
spin_unlock(&xpfo->maplock);
}
EXPORT_SYMBOL(xpfo_kunmap);
+
+bool xpfo_page_is_unmapped(struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return false;
+
+ xpfo = lookup_xpfo(page);
+	if (unlikely(!xpfo) || !xpfo->inited)
+ return false;
+
+ return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
--
2.17.1
From: Julian Stecklina <[email protected]>
We can reduce spinlock usage in xpfo_kmap() to the 0->1 transition of
the mapcount. This means that xpfo_kmap() can now race and that we
can get spurious page faults.
The page fault handler helps the system make forward progress by
fixing the page table instead of allowing repeated page faults until
the corresponding xpfo_kmap() goes through.
Model-checked with up to 4 concurrent callers with Spin.
Signed-off-by: Julian Stecklina <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Vasileios P. Kemerlis <[email protected]>
Cc: Juerg Haefliger <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Marco Benatto <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/x86/mm/fault.c | 4 ++++
include/linux/xpfo.h | 4 ++++
mm/xpfo.c | 50 +++++++++++++++++++++++++++++++++++++-------
3 files changed, 51 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ba51652fbd33..207081dcd572 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_recover_from_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/xpfo.h>
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1218,6 +1219,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (kprobes_fault(regs))
return;
+ if (xpfo_spurious_fault(address))
+ return;
+
/*
* Note, despite being a "bad area", there are quite a few
* acceptable reasons to get here, such as erratum fixups
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index ea5188882f49..58dd243637d2 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -54,6 +54,8 @@ bool xpfo_enabled(void);
phys_addr_t user_virt_to_phys(unsigned long addr);
+bool xpfo_spurious_fault(unsigned long addr);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_init_single_page(struct page *page) { }
@@ -81,6 +83,8 @@ static inline bool xpfo_enabled(void) { return false; }
static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+static inline bool xpfo_spurious_fault(unsigned long addr) { return false; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index dbf20efb0499..85079377c91d 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -119,6 +119,16 @@ void xpfo_free_pages(struct page *page, int order)
}
}
+static void xpfo_do_map(void *kaddr, struct page *page)
+{
+ spin_lock(&page->xpfo_lock);
+ if (PageXpfoUnmapped(page)) {
+ set_kpte(kaddr, page, PAGE_KERNEL);
+ ClearPageXpfoUnmapped(page);
+ }
+ spin_unlock(&page->xpfo_lock);
+}
+
void xpfo_kmap(void *kaddr, struct page *page)
{
if (!static_branch_unlikely(&xpfo_inited))
@@ -127,17 +137,12 @@ void xpfo_kmap(void *kaddr, struct page *page)
if (!PageXpfoUser(page))
return;
- spin_lock(&page->xpfo_lock);
-
/*
* The page was previously allocated to user space, so map it back
* into the kernel. No TLB flush required.
*/
- if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
- TestClearPageXpfoUnmapped(page))
- set_kpte(kaddr, page, PAGE_KERNEL);
-
- spin_unlock(&page->xpfo_lock);
+ if (atomic_inc_return(&page->xpfo_mapcount) == 1)
+ xpfo_do_map(kaddr, page);
}
EXPORT_SYMBOL(xpfo_kmap);
@@ -204,3 +209,34 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
kunmap_atomic(mapping[i]);
}
EXPORT_SYMBOL(xpfo_temp_unmap);
+
+bool xpfo_spurious_fault(unsigned long addr)
+{
+ struct page *page;
+ bool spurious;
+ int mapcount;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return false;
+
+ /* XXX Is this sufficient to guard against calling virt_to_page() on a
+ * virtual address that has no corresponding struct page? */
+ if (!virt_addr_valid(addr))
+ return false;
+
+ page = virt_to_page(addr);
+ mapcount = atomic_read(&page->xpfo_mapcount);
+ spurious = PageXpfoUser(page) && mapcount;
+
+ /* Guarantee forward progress in case xpfo_kmap() raced. */
+ if (spurious && PageXpfoUnmapped(page)) {
+ xpfo_do_map((void *)(addr & PAGE_MASK), page);
+ }
+
+ if (unlikely(!spurious))
+ printk("XPFO non-spurious fault %lx user=%d unmapped=%d mapcount=%d\n",
+ addr, PageXpfoUser(page), PageXpfoUnmapped(page),
+ mapcount);
+
+ return spurious;
+}
--
2.17.1
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace, not only on the current CPU but also on all other CPUs.
If the number of TLB entries to flush exceeds
tlb_single_page_flush_ceiling, this results in the entire TLB being
flushed on all CPUs. A malicious userspace app can exploit the dual
mapping of a physical page created by physmap only on the CPU it is
running on. There is no good reason to incur the very high cost of a
TLB flush on CPUs that may never run the malicious app or that do not
have any TLB entries for it. The cost of a full TLB flush goes up
dramatically on machines with high core counts.
When a page is unmapped from kernel space and mapped into userspace,
this patch flushes the relevant TLB entries for the current process
(or the entire TLB, depending upon the number of entries) on the
current CPU only, and posts a pending TLB flush on all other CPUs.
This pending TLB flush is posted for each task separately, and the
TLB is flushed on a CPU when a task with a pending flush posted for
that CPU is scheduled on it. This patch does two things: (1) it
potentially aggregates multiple TLB flushes into one, and (2) it
avoids TLB flushes on CPUs that never run the task that caused the
flush. This has a very significant impact, especially on machines
with large core counts. To illustrate this, the kernel was compiled
with parallel make on two classes of machines - a server with a high
core count and a large amount of memory, and a desktop class machine
with more modest specs. System time from the builds with a vanilla
4.20 kernel, with 4.20 plus the XPFO patches before this patch, and
after applying this patch is shown below:
Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all
4.20                        915.183s
4.20+XPFO                 24129.354s   26.366x
4.20+XPFO+Deferred flush   1216.987s    1.330x
Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all
4.20                        607.671s
4.20+XPFO                  1588.646s    2.614x
4.20+XPFO+Deferred flush    794.473s    1.307x
This patch could use more optimization. For instance, it posts a
pending full TLB flush for other CPUs even when the number of TLB
entries being flushed does not exceed tlb_single_page_flush_ceiling.
Batching more TLB entry flushes, as was suggested for an earlier
version of these patches, can help reduce such cases. The same
mechanism should be implemented for other architectures as well once
it is finalized.
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 27 +++++++++++++++++++++++++++
arch/x86/mm/xpfo.c | 2 +-
include/linux/sched.h | 9 +++++++++
4 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03b6b4c2238d..b04a501c850b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -319,6 +319,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
__flush_tlb_all();
}
#endif
+
+ /* If there is a pending TLB flush for this CPU due to XPFO
+ * flush, do it now.
+ */
+ if (tsk && cpumask_test_and_clear_cpu(cpu, &tsk->pending_xpfo_flush)) {
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ __flush_tlb_all();
+ }
+
this_cpu_write(cpu_tlbstate.is_lazy, false);
/*
@@ -801,6 +810,24 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
}
}
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+
+ /* Balance as user space task's flush, a bit conservative */
+ if (end == TLB_FLUSH_ALL ||
+ (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+ do_flush_tlb_all(NULL);
+ } else {
+ struct flush_tlb_info info;
+
+ info.start = start;
+ info.end = end;
+ do_kernel_range_flush(&info);
+ }
+	cpumask_setall(&current->pending_xpfo_flush);
+	cpumask_clear_cpu(smp_processor_id(), &current->pending_xpfo_flush);
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info info = {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index bcdb2f2089d2..5aa17cb2c813 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -110,7 +110,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
return;
}
- flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+ xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
}
/* Convert a user space virtual address to a physical address.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 291a9bd5b97f..ba298be3b5a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1206,6 +1206,15 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
+ /*
+ * When a full TLB flush is needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space, this TLB flush will be delayed until the
+ * task is scheduled on that CPU. Keep track of CPUs with
+ * pending full TLB flush forced by xpfo.
+ */
+ cpumask_t pending_xpfo_flush;
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
--
2.17.1
From: Juerg Haefliger <[email protected]>
This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.
Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
- two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
- a reference counter to balance kmap/kunmap operations.
- a lock to serialize access to the XPFO fields.
This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
__split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
pass it
CC: [email protected]
Suggested-by: Vasileios P. Kemerlis <[email protected]>
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Marco Benatto <[email protected]>
[[email protected]: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 26 ++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/pageattr.c | 23 +-
arch/x86/mm/xpfo.c | 114 +++++++++
include/linux/highmem.h | 15 +-
include/linux/xpfo.h | 47 ++++
mm/Makefile | 1 +
mm/page_alloc.c | 2 +
mm/page_ext.c | 4 +
mm/xpfo.c | 222 ++++++++++++++++++
security/Kconfig | 19 ++
13 files changed, 456 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/mm/xpfo.c
create mode 100644 include/linux/xpfo.h
create mode 100644 mm/xpfo.c
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..c4c62599f216 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,6 +2982,8 @@
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
+ noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on.
+
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d69d8cc6e57e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -207,6 +207,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select ARCH_SUPPORTS_XPFO if X86_64
config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..ad2d1792939d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
}
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+ unsigned long *vaddr;
+ pgd_t *pgd;
+ pgprot_t mask_set;
+ pgprot_t mask_clr;
+ unsigned long numpages;
+ int flags;
+ unsigned long pfn;
+ unsigned force_split : 1,
+ force_static_prot : 1;
+ int curpage;
+ struct page **pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+ struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+ struct page *base);
+
#include <asm-generic/pgtable.h>
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a1bcde35db4c..84002442ab61 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -26,23 +26,6 @@
#include <asm/pat.h>
#include <asm/set_memory.h>
-/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
- unsigned long *vaddr;
- pgd_t *pgd;
- pgprot_t mask_set;
- pgprot_t mask_clr;
- unsigned long numpages;
- int flags;
- unsigned long pfn;
- unsigned force_split : 1,
- force_static_prot : 1;
- int curpage;
- struct page **pages;
-};
-
enum cpa_warn {
CPA_CONFLICT,
CPA_PROTECT,
@@ -57,7 +40,7 @@ static const int cpa_warn_level = CPA_PROTECT;
* entries change the page attribute in parallel to some other cpu
* splitting a large page entry along with changing the attribute.
*/
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
#define CPA_FLUSHTLB 1
#define CPA_ARRAY 2
@@ -869,7 +852,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
return 0;
}
-static int should_split_large_page(pte_t *kpte, unsigned long address,
+int should_split_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
int do_split;
@@ -919,7 +902,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
set_pte(pte, pfn_pte(pfn, ref_prot));
}
-static int
+int
__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
struct page *base)
{
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..d1f04ea533cd
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ unsigned int level;
+ pgprot_t msk_clr;
+ pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+ if (unlikely(!pte)) {
+ WARN(1, "xpfo: invalid address %p\n", kaddr);
+ return;
+ }
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+ break;
+ case PG_LEVEL_2M:
+ case PG_LEVEL_1G: {
+ struct cpa_data cpa = { };
+ int do_split;
+
+ if (level == PG_LEVEL_2M)
+ msk_clr = pmd_pgprot(*(pmd_t*)pte);
+ else
+ msk_clr = pud_pgprot(*(pud_t*)pte);
+
+ cpa.vaddr = kaddr;
+ cpa.pages = &page;
+ cpa.mask_set = prot;
+ cpa.mask_clr = msk_clr;
+ cpa.numpages = 1;
+ cpa.flags = 0;
+ cpa.curpage = 0;
+ cpa.force_split = 0;
+
+
+ do_split = should_split_large_page(pte, (unsigned long)kaddr,
+ &cpa);
+ if (do_split) {
+ struct page *base;
+
+ base = alloc_pages(GFP_ATOMIC, 0);
+ if (!base) {
+ WARN(1, "xpfo: failed to split large page\n");
+ break;
+ }
+
+ if (!debug_pagealloc_enabled())
+ spin_lock(&cpa_lock);
+ if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
+ WARN(1, "xpfo: failed to split large page\n");
+ if (!debug_pagealloc_enabled())
+ spin_unlock(&cpa_lock);
+ }
+
+ break;
+ }
+ case PG_LEVEL_512G:
+ /* fallthrough, splitting infrastructure doesn't
+ * support 512G pages. */
+ default:
+ WARN(1, "xpfo: unsupported page level %x\n", level);
+ }
+
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+ int level;
+ unsigned long size, kaddr;
+
+ kaddr = (unsigned long)page_address(page);
+
+ if (unlikely(!lookup_address(kaddr, &level))) {
+ WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
+ return;
+ }
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ size = PAGE_SIZE;
+ break;
+ case PG_LEVEL_2M:
+ size = PMD_SIZE;
+ break;
+ case PG_LEVEL_1G:
+ size = PUD_SIZE;
+ break;
+ default:
+ WARN(1, "xpfo: unsupported page level %x\n", level);
+ return;
+ }
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0690679832d4..1fdae929e38b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/hardirq.h>
+#include <linux/xpfo.h>
#include <asm/cacheflush.h>
@@ -56,24 +57,34 @@ static inline struct page *kmap_to_page(void *addr)
#ifndef ARCH_HAS_KMAP
static inline void *kmap(struct page *page)
{
+ void *kaddr;
+
might_sleep();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}
static inline void kunmap(struct page *page)
{
+ xpfo_kunmap(page_address(page), page);
}
static inline void *kmap_atomic(struct page *page)
{
+ void *kaddr;
+
preempt_disable();
pagefault_disable();
- return page_address(page);
+ kaddr = page_address(page);
+ xpfo_kmap(kaddr, page);
+ return kaddr;
}
#define kmap_atomic_prot(page, prot) kmap_atomic(page)
static inline void __kunmap_atomic(void *addr)
{
+ xpfo_kunmap(addr, virt_to_page(addr));
pagefault_enable();
preempt_enable();
}
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..a39259ce0174
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ * Tycho Andersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#include <linux/types.h>
+#include <linux/dma-direction.h>
+
+struct page;
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+ enum dma_data_direction dir);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e99e1e6ae5ae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_HMM) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e95b5b7c9c3d..08e277790b5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1038,6 +1038,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order, 0);
kernel_map_pages(page, 1 << order, 0);
kasan_free_pages(page, order);
+ xpfo_free_pages(page, order);
return true;
}
@@ -1915,6 +1916,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
kernel_map_pages(page, 1 << order, 1);
kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
+ xpfo_alloc_pages(page, order, gfp_flags);
set_page_owner(page, order, gfp_flags);
}
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..38e5013dcb9a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -8,6 +8,7 @@
#include <linux/kmemleak.h>
#include <linux/page_owner.h>
#include <linux/page_idle.h>
+#include <linux/xpfo.h>
/*
* struct page extension
@@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_XPFO
+ &page_xpfo_ops,
+#endif
};
static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..bff24afcaa2e
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,222 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ * Tycho Andersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+ XPFO_PAGE_USER, /* Page is allocated to user-space */
+ XPFO_PAGE_UNMAPPED, /* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+ unsigned long flags; /* Page state */
+ bool inited; /* Map counter and lock initialized */
+ atomic_t mapcount; /* Counter for balancing map/unmap requests */
+ spinlock_t maplock; /* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+ xpfo_disabled = true;
+
+ return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+ if (xpfo_disabled) {
+ printk(KERN_INFO "XPFO disabled\n");
+ return false;
+ }
+
+ return true;
+}
+
+static void init_xpfo(void)
+{
+ printk(KERN_INFO "XPFO enabled\n");
+ static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+ .size = sizeof(struct xpfo),
+ .need = need_xpfo,
+ .init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext)) {
+ WARN(1, "xpfo: failed to get page ext");
+ return NULL;
+ }
+
+ return (void *)page_ext + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+ int i, flush_tlb = 0;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+ if (!xpfo)
+ continue;
+
+ WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+ "xpfo: unmapped page being allocated\n");
+
+ /* Initialize the map lock and map counter */
+ if (unlikely(!xpfo->inited)) {
+ spin_lock_init(&xpfo->maplock);
+ atomic_set(&xpfo->mapcount, 0);
+ xpfo->inited = true;
+ }
+ WARN(atomic_read(&xpfo->mapcount),
+ "xpfo: already mapped page being allocated\n");
+
+ if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+ /*
+ * Tag the page as a user page and flush the TLB if it
+ * was previously allocated to the kernel.
+ */
+ if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ flush_tlb = 1;
+ } else {
+ /* Tag the page as a non-user (kernel) page */
+ clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+ }
+ }
+
+ if (flush_tlb)
+ xpfo_flush_kernel_tlb(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+ int i;
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ xpfo = lookup_xpfo(page + i);
+ if (!xpfo || unlikely(!xpfo->inited)) {
+ /*
+ * The page was allocated before page_ext was
+ * initialized, so it is a kernel page.
+ */
+ continue;
+ }
+
+ /*
+ * Map the page back into the kernel if it was previously
+ * allocated to user space.
+ */
+ if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
+ clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ set_kpte(page_address(page + i), page + i,
+ PAGE_KERNEL);
+ }
+ }
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (!xpfo || unlikely(!xpfo->inited) ||
+ !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock(&xpfo->maplock);
+
+ /*
+ * The page was previously allocated to user space, so map it back
+ * into the kernel. No TLB flush required.
+ */
+ if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+ test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ set_kpte(kaddr, page, PAGE_KERNEL);
+
+ spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+ struct xpfo *xpfo;
+
+ if (!static_branch_unlikely(&xpfo_inited))
+ return;
+
+ xpfo = lookup_xpfo(page);
+
+ /*
+ * The page was allocated before page_ext was initialized (which means
+ * it's a kernel page) or it's allocated to the kernel, so nothing to
+ * do.
+ */
+ if (!xpfo || unlikely(!xpfo->inited) ||
+ !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ return;
+
+ spin_lock(&xpfo->maplock);
+
+ /*
+ * The page is to be allocated back to user space, so unmap it from the
+ * kernel, flush the TLB and tag it as a user page.
+ */
+ if (atomic_dec_return(&xpfo->mapcount) == 0) {
+ WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+ "xpfo: unmapping already unmapped page\n");
+ set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ set_kpte(kaddr, page, __pgprot(0));
+ xpfo_flush_kernel_tlb(page, 0);
+ }
+
+ spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index d9aa521b5206..8d0e4e303551 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
source security/keys/Kconfig
+config ARCH_SUPPORTS_XPFO
+ bool
+
+config XPFO
+ bool "Enable eXclusive Page Frame Ownership (XPFO)"
+ default n
+ depends on ARCH_SUPPORTS_XPFO
+ select PAGE_EXTENSION
+ help
+ This option offers protection against 'ret2dir' kernel attacks.
+ When enabled, every time a page frame is allocated to user space, it
+ is unmapped from the direct mapped RAM region in kernel space
+ (physmap). Similarly, when a page frame is freed/reclaimed, it is
+ mapped back to physmap.
+
+ There is a slight performance impact when this option is enabled.
+
+ If in doubt, say "N".
+
config SECURITY_DMESG_RESTRICT
bool "Restrict unprivileged access to the kernel syslog"
default n
--
2.17.1
From: Juerg Haefliger <[email protected]>
If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.
v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly
CC: [email protected]
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/arm64/mm/flush.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..f12f26b60319 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/xpfo.h>
#include <asm/cacheflush.h>
#include <asm/cache.h>
@@ -28,9 +29,15 @@
void sync_icache_aliases(void *kaddr, unsigned long len)
{
unsigned long addr = (unsigned long)kaddr;
+ unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+ void *mapping[num_pages];
if (icache_is_aliasing()) {
+ xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+ xpfo_temp_unmap(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
flush_icache_range(addr, addr + len);
--
2.17.1
From: Julian Stecklina <[email protected]>
Only the xpfo_kunmap() call that actually needs to unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap()
has increased the mapcount again. In that case, we can safely skip
modifying the page table.
Model-checked with up to 4 concurrent callers with Spin.
Signed-off-by: Julian Stecklina <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Vasileios P. Kemerlis <[email protected]>
Cc: Juerg Haefliger <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Marco Benatto <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
mm/xpfo.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/xpfo.c b/mm/xpfo.c
index cbfeafc2f10f..dbf20efb0499 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -149,22 +149,24 @@ void xpfo_kunmap(void *kaddr, struct page *page)
if (!PageXpfoUser(page))
return;
- spin_lock(&page->xpfo_lock);
-
/*
* The page is to be allocated back to user space, so unmap it from the
* kernel, flush the TLB and tag it as a user page.
*/
if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
- BUG_ON(PageXpfoUnmapped(page));
-#endif
- SetPageXpfoUnmapped(page);
- set_kpte(kaddr, page, __pgprot(0));
- xpfo_cond_flush_kernel_tlb(page, 0);
- }
+ spin_lock(&page->xpfo_lock);
- spin_unlock(&page->xpfo_lock);
+ /*
+ * In the case, where we raced with kmap after the
+ * atomic_dec_return, we must not nuke the mapping.
+ */
+ if (atomic_read(&page->xpfo_mapcount) == 0) {
+ SetPageXpfoUnmapped(page);
+ set_kpte(kaddr, page, __pgprot(0));
+ xpfo_cond_flush_kernel_tlb(page, 0);
+ }
+ spin_unlock(&page->xpfo_lock);
+ }
}
EXPORT_SYMBOL(xpfo_kunmap);
--
2.17.1
From: Julian Stecklina <[email protected]>
XPFO carries a large performance overhead. In my tests, I saw >40%
overhead for compiling a Linux kernel with XPFO enabled. The
frequent TLB flushes that XPFO performs are the root cause of much
of this overhead.
TLB flushing is required for full paranoia mode where we don't want
TLB entries of physmap pages to stick around potentially
indefinitely. In reality, though, these TLB entries are going to be
evicted pretty rapidly even without explicit flushing. That means
omitting TLB flushes only marginally lowers the security benefits of
XPFO. For kernel compile, omitting TLB flushes pushes the overhead
below 3%.
Change the default in XPFO to not flush TLBs unless the user
explicitly requests to do so using a kernel parameter.
Signed-off-by: Julian Stecklina <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Vasileios P. Kemerlis <[email protected]>
Cc: Juerg Haefliger <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Marco Benatto <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
mm/xpfo.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 25fba05d01bd..e80374b0c78e 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -36,6 +36,7 @@ struct xpfo {
};
DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+DEFINE_STATIC_KEY_FALSE(xpfo_do_tlb_flush);
static bool xpfo_disabled __initdata;
@@ -46,7 +47,15 @@ static int __init noxpfo_param(char *str)
return 0;
}
+static int __init xpfotlbflush_param(char *str)
+{
+ static_branch_enable(&xpfo_do_tlb_flush);
+
+ return 0;
+}
+
early_param("noxpfo", noxpfo_param);
+early_param("xpfotlbflush", xpfotlbflush_param);
static bool __init need_xpfo(void)
{
@@ -76,6 +85,13 @@ bool __init xpfo_enabled(void)
}
EXPORT_SYMBOL(xpfo_enabled);
+
+static void xpfo_cond_flush_kernel_tlb(struct page *page, int order)
+{
+ if (static_branch_unlikely(&xpfo_do_tlb_flush))
+ xpfo_flush_kernel_tlb(page, order);
+}
+
static inline struct xpfo *lookup_xpfo(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
@@ -114,12 +130,17 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
"xpfo: already mapped page being allocated\n");
if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
- /*
- * Tag the page as a user page and flush the TLB if it
- * was previously allocated to the kernel.
- */
- if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
- flush_tlb = 1;
+ if (static_branch_unlikely(&xpfo_do_tlb_flush)) {
+ /*
+ * Tag the page as a user page and flush the TLB if it
+ * was previously allocated to the kernel.
+ */
+ if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ flush_tlb = 1;
+ } else {
+ set_bit(XPFO_PAGE_USER, &xpfo->flags);
+ }
+
} else {
/* Tag the page as a non-user (kernel) page */
clear_bit(XPFO_PAGE_USER, &xpfo->flags);
@@ -127,7 +148,7 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
}
if (flush_tlb)
- xpfo_flush_kernel_tlb(page, order);
+ xpfo_cond_flush_kernel_tlb(page, order);
}
void xpfo_free_pages(struct page *page, int order)
@@ -221,7 +242,7 @@ void xpfo_kunmap(void *kaddr, struct page *page)
"xpfo: unmapping already unmapped page\n");
set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
set_kpte(kaddr, page, __pgprot(0));
- xpfo_flush_kernel_tlb(page, 0);
+ xpfo_cond_flush_kernel_tlb(page, 0);
}
spin_unlock(&xpfo->maplock);
--
2.17.1
From: Juerg Haefliger <[email protected]>
This test simply reads from userspace memory via the kernel's linear
map.
v6: * drop an #ifdef, just let the test fail if XPFO is not supported
    * add an XPFO_SMP test to check that when one CPU does an xpfo
      unmap of an address, the address can't accidentally be used by
      other CPUs.
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
[[email protected]: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
drivers/misc/lkdtm/Makefile | 1 +
drivers/misc/lkdtm/core.c | 3 +
drivers/misc/lkdtm/lkdtm.h | 5 +
drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++++++++++++++++++++++++
4 files changed, 203 insertions(+)
create mode 100644 drivers/misc/lkdtm/xpfo.c
diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM) += refcount.o
lkdtm-$(CONFIG_LKDTM) += rodata_objcopy.o
lkdtm-$(CONFIG_LKDTM) += usercopy.o
lkdtm-$(CONFIG_LKDTM) += stackleak.o
+lkdtm-$(CONFIG_LKDTM) += xpfo.o
KASAN_SANITIZE_stackleak.o := n
KCOV_INSTRUMENT_rodata.o := n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
+ CRASHTYPE(XPFO_READ_USER),
+ CRASHTYPE(XPFO_READ_USER_HUGE),
+ CRASHTYPE(XPFO_SMP),
};
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
/* lkdtm_stackleak.c */
void lkdtm_STACKLEAK_ERASING(void);
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
#endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index 000000000000..d903063bdd0b
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,194 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include <linux/cpumask.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/xpfo.h>
+#include <linux/kthread.h>
+
+#include <linux/delay.h>
+#include <linux/sched/task.h>
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+ unsigned long user_addr, user_data = XPFO_DATA;
+
+ user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+ PROT_READ | PROT_WRITE | PROT_EXEC,
+ flags, 0);
+ if (user_addr >= TASK_SIZE) {
+ pr_warn("Failed to allocate user memory\n");
+ return 0;
+ }
+
+ if (copy_to_user((void __user *)user_addr, &user_data,
+ sizeof(user_data))) {
+ pr_warn("copy_to_user failed\n");
+ goto free_user;
+ }
+
+ return user_addr;
+
+free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+ return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+ phys_addr_t phys_addr;
+ void *virt_addr;
+
+ phys_addr = user_virt_to_phys(user_addr);
+ if (!phys_addr) {
+ pr_warn("Failed to get physical address of user memory\n");
+ return NULL;
+ }
+
+ virt_addr = phys_to_virt(phys_addr);
+ if (phys_addr != virt_to_phys(virt_addr)) {
+ pr_warn("Physical address of user memory seems incorrect\n");
+ return NULL;
+ }
+
+ return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+ pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+ if (*(unsigned long *)virt_addr == XPFO_DATA)
+ pr_err("FAIL: Bad read succeeded?!\n");
+ else
+ pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+ unsigned long user_addr, *kernel;
+
+ user_addr = do_map(flags);
+ if (!user_addr) {
+ pr_err("FAIL: map failed\n");
+ return;
+ }
+
+ kernel = user_to_kernel(user_addr);
+ if (!kernel) {
+ pr_err("FAIL: user to kernel conversion failed\n");
+ goto free_user;
+ }
+
+ read_map(kernel);
+
+free_user:
+ vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+ read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB);
+}
+
+struct smp_arg {
+ unsigned long *virt_addr;
+ unsigned int cpu;
+};
+
+static int smp_reader(void *parg)
+{
+ struct smp_arg *arg = parg;
+ unsigned long *virt_addr;
+
+ if (arg->cpu != smp_processor_id()) {
+ pr_err("FAIL: scheduled on wrong CPU?\n");
+ return 0;
+ }
+
+ virt_addr = smp_cond_load_acquire(&arg->virt_addr, VAL != NULL);
+ read_map(virt_addr);
+
+ return 0;
+}
+
+#ifdef CONFIG_X86
+#define XPFO_SMP_KILLED SIGKILL
+#elif CONFIG_ARM64
+#define XPFO_SMP_KILLED SIGSEGV
+#else
+#error unsupported arch
+#endif
+
+/* The idea here is to read from the kernel's map on a different thread than
+ * did the mapping (and thus the TLB flushing), to make sure that the page
+ * faults on other cores too.
+ */
+void lkdtm_XPFO_SMP(void)
+{
+ unsigned long user_addr, *virt_addr;
+ struct task_struct *thread;
+ int ret;
+ struct smp_arg arg;
+
+ if (num_online_cpus() < 2) {
+ pr_err("not enough to do a multi cpu test\n");
+ return;
+ }
+
+ arg.virt_addr = NULL;
+ arg.cpu = (smp_processor_id() + 1) % num_online_cpus();
+ thread = kthread_create(smp_reader, &arg, "lkdtm_xpfo_test");
+ if (IS_ERR(thread)) {
+ pr_err("couldn't create kthread? %ld\n", PTR_ERR(thread));
+ return;
+ }
+
+ kthread_bind(thread, arg.cpu);
+ get_task_struct(thread);
+ wake_up_process(thread);
+
+ user_addr = do_map(MAP_PRIVATE | MAP_ANONYMOUS);
+ if (!user_addr)
+ goto kill_thread;
+
+ virt_addr = user_to_kernel(user_addr);
+ if (!virt_addr) {
+ /*
+ * let's store something that will fail, so we can unblock the
+ * thread
+ */
+ smp_store_release(&arg.virt_addr, &arg);
+ goto free_user;
+ }
+
+ smp_store_release(&arg.virt_addr, virt_addr);
+
+ /* there must be a better way to do this. */
+ while (1) {
+ if (thread->exit_state)
+ break;
+ msleep_interruptible(100);
+ }
+
+free_user:
+ if (user_addr)
+ vm_munmap(user_addr, PAGE_SIZE);
+
+kill_thread:
+ ret = kthread_stop(thread);
+ if (ret != XPFO_SMP_KILLED)
+ pr_err("FAIL: thread wasn't killed: %d\n", ret);
+ put_task_struct(thread);
+}
--
2.17.1
From: Tycho Andersen <[email protected]>
vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.
Signed-off-by: Tycho Andersen <[email protected]>
Tested-by: Marco Benatto <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
include/linux/mm.h | 2 ++
mm/mmap.c | 19 +------------------
mm/util.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..30bddc7b3c75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2361,6 +2361,8 @@ struct vm_unmapped_area_info {
extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
/*
* Search for an unmapped address range.
*
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c04292e16a7..c668d7d27c2b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
- struct hstate *hs;
-
- hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
- if (!hs)
- return -EINVAL;
-
- len = ALIGN(len, huge_page_size(hs));
- /*
- * VM_NORESERVE is used because the reservations will be
- * taken when vm_ops->mmap() is called
- * A dummy user value is used because we are not locking
- * memory so no accounting is necessary
- */
- file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
- VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
- (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 8bf08b5b5760..536c14cf88ba 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
return ret;
}
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+ struct user_struct *user = NULL;
+ struct hstate *hs;
+
+ hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+ if (!hs)
+ return ERR_PTR(-EINVAL);
+
+ *len = ALIGN(*len, huge_page_size(hs));
+
+ /*
+ * VM_NORESERVE is used because the reservations will be
+ * taken when vm_ops->mmap() is called
+ * A dummy user value is used because we are not locking
+ * memory so no accounting is necessary
+ */
+ return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+ VM_NORESERVE,
+ &user, HUGETLB_ANONHUGE_INODE,
+ (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
+ if (flag & MAP_HUGETLB) {
+ if (file)
+ return -EINVAL;
+
+ file = map_hugetlb_setup(&len, flag);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
}
EXPORT_SYMBOL(vm_mmap);
--
2.17.1
Kernel hangs when booted up with the "xpfotlbflush" option. This is caused
by xpfo_kunmap() flushing the TLB while holding the xpfo lock, starving
other tasks waiting for the lock. This patch moves the TLB flush outside of
the code that holds the xpfo lock.
Signed-off-by: Khalid Aziz <[email protected]>
---
mm/xpfo.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 85079377c91d..79ffdba6af69 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -148,6 +148,8 @@ EXPORT_SYMBOL(xpfo_kmap);
void xpfo_kunmap(void *kaddr, struct page *page)
{
+ bool flush_tlb = false;
+
if (!static_branch_unlikely(&xpfo_inited))
return;
@@ -168,10 +170,13 @@ void xpfo_kunmap(void *kaddr, struct page *page)
if (atomic_read(&page->xpfo_mapcount) == 0) {
SetPageXpfoUnmapped(page);
set_kpte(kaddr, page, __pgprot(0));
- xpfo_cond_flush_kernel_tlb(page, 0);
+ flush_tlb = true;
}
spin_unlock(&page->xpfo_lock);
}
+
+ if (flush_tlb)
+ xpfo_cond_flush_kernel_tlb(page, 0);
}
EXPORT_SYMBOL(xpfo_kunmap);
--
2.17.1
From: Tycho Andersen <[email protected]>
XPFO doesn't support section/contiguous mappings yet, so let's disable them
if XPFO is turned on.
Thanks to Laura Abbott for the simplification from v5, and Mark Rutland for
pointing out we need NO_CONT_MAPPINGS too.
CC: [email protected]
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/arm64/mm/mmu.c | 2 +-
include/linux/xpfo.h | 4 ++++
mm/xpfo.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d1d6601b385d..f4dd27073006 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
struct memblock_region *reg;
int flags = 0;
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
/*
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 2682a00ebbcb..0c26836a24e1 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -46,6 +46,8 @@ void xpfo_temp_map(const void *addr, size_t size, void **mapping,
void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
size_t mapping_len);
+bool xpfo_enabled(void);
+
#else /* !CONFIG_XPFO */
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -68,6 +70,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t size,
}
+static inline bool xpfo_enabled(void) { return false; }
+
#endif /* CONFIG_XPFO */
#endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index f79075bf7d65..25fba05d01bd 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -70,6 +70,12 @@ struct page_ext_operations page_xpfo_ops = {
.init = init_xpfo,
};
+bool __init xpfo_enabled(void)
+{
+ return !xpfo_disabled;
+}
+EXPORT_SYMBOL(xpfo_enabled);
+
static inline struct xpfo *lookup_xpfo(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
--
2.17.1
From: Juerg Haefliger <[email protected]>
Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).
v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
CC: [email protected]
Signed-off-by: Juerg Haefliger <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/Makefile | 2 ++
arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+)
create mode 100644 arch/arm64/mm/xpfo.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ea2ab0330e3a..f0a9c0007d23 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -171,6 +171,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_XPFO
help
ARM 64-bit (AArch64) Linux support.
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
obj-$(CONFIG_KASAN) += kasan_init.o
KASAN_SANITIZE_kasan_init.o := n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index 000000000000..678e2be848eb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,58 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ * Juerg Haefliger <[email protected]>
+ * Vasileios P. Kemerlis <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/tlbflush.h>
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+ pte_t *pte = lookup_address((unsigned long)kaddr);
+
+ set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+ unsigned long kaddr = (unsigned long)page_address(page);
+ unsigned long size = PAGE_SIZE;
+
+ flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
--
2.17.1
From: Tycho Andersen <[email protected]>
Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:
Aug 23 19:30:27 xpfo kernel: [ 38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [ 38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [ 38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [ 38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [ 38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [ 38.307243] dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [ 38.307665] ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [ 38.308139] __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [ 38.308593] exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [ 38.309046] ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [ 38.309677] do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [ 38.310078] ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [ 38.310604] ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [ 38.311045] ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [ 38.311680] rewind_stack_do_exit+0x17/0x20
To be safe, let's just always enable irqs.
The particular case I'm hitting is:
Aug 23 19:30:27 xpfo kernel: [ 38.278615] __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [ 38.278617] bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [ 38.278618] __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [ 38.278620] ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [ 38.278623] ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [ 38.278624] trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [ 38.278625] do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [ 38.278627] async_page_fault+0x28/0x30
That is, when a fault in kernel space has been triggered by XPFO.
Signed-off-by: Tycho Andersen <[email protected]>
CC: [email protected]
Signed-off-by: Khalid Aziz <[email protected]>
---
arch/x86/mm/fault.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..ba51652fbd33 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -748,6 +748,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
+ /*
+ * We're about to oops, which might kill the task. Make sure we're
+ * allowed to sleep.
+ */
+ flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
}
--
2.17.1
From: Julian Stecklina <[email protected]>
Instead of using the page extension debug feature, encode all the
information we need for XPFO in struct page. This allows us to get rid of
some checks in the hot paths, and there are also no longer any pages that
are allocated before XPFO is enabled.
Also make the debugging aids configurable for maximum performance.
Signed-off-by: Julian Stecklina <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Vasileios P. Kemerlis <[email protected]>
Cc: Juerg Haefliger <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Marco Benatto <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Khalid Aziz <[email protected]>
---
include/linux/mm_types.h | 8 ++
include/linux/page-flags.h | 13 +++
include/linux/xpfo.h | 3 +-
include/trace/events/mmflags.h | 10 +-
mm/page_alloc.c | 3 +-
mm/page_ext.c | 4 -
mm/xpfo.c | 162 ++++++++-------------------------
security/Kconfig | 12 ++-
8 files changed, 81 insertions(+), 134 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@ struct page {
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
+
+#ifdef CONFIG_XPFO
+ /* Counts the number of times this page has been kmapped. */
+ atomic_t xpfo_mapcount;
+
+ /* Serialize kmap/kunmap of this page */
+ spinlock_t xpfo_lock;
+#endif
} _struct_page_alignment;
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..a532063f27b5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@ enum pageflags {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+ PG_xpfo_user, /* Page is allocated to user-space */
+ PG_xpfo_unmapped, /* Page is unmapped from the linear map */
#endif
__NR_PAGEFLAGS,
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index d4b38ab8a633..ea5188882f49 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -27,7 +27,7 @@ struct page;
#include <linux/types.h>
-extern struct page_ext_operations page_xpfo_ops;
+void xpfo_init_single_page(struct page *page);
void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
@@ -56,6 +56,7 @@ phys_addr_t user_virt_to_phys(unsigned long addr);
#else /* !CONFIG_XPFO */
+static inline void xpfo_init_single_page(struct page *page) { }
static inline void xpfo_kmap(void *kaddr, struct page *page) { }
static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
#define IF_HAVE_PG_IDLE(flag,string)
#endif
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
{1UL << PG_waiters, "waiters" }, \
@@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \
IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \
IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
IF_HAVE_PG_IDLE(PG_young, "young" ) \
-IF_HAVE_PG_IDLE(PG_idle, "idle" )
+IF_HAVE_PG_IDLE(PG_idle, "idle" ) \
+IF_HAVE_PG_XPFO(PG_xpfo_user, "xpfo_user" ) \
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped, "xpfo_unmapped" ) \
#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08e277790b5f..d00382b20001 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1024,6 +1024,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
if (bad)
return false;
+ xpfo_free_pages(page, order);
page_cpupid_reset_last(page);
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
@@ -1038,7 +1039,6 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order, 0);
kernel_map_pages(page, 1 << order, 0);
kasan_free_pages(page, order);
- xpfo_free_pages(page, order);
return true;
}
@@ -1191,6 +1191,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
if (!is_highmem_idx(zone))
set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
+ xpfo_init_single_page(page);
}
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 38e5013dcb9a..ae44f7adbe07 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -8,7 +8,6 @@
#include <linux/kmemleak.h>
#include <linux/page_owner.h>
#include <linux/page_idle.h>
-#include <linux/xpfo.h>
/*
* struct page extension
@@ -69,9 +68,6 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
-#ifdef CONFIG_XPFO
- &page_xpfo_ops,
-#endif
};
static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index e80374b0c78e..cbfeafc2f10f 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -16,33 +16,16 @@
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/module.h>
-#include <linux/page_ext.h>
#include <linux/xpfo.h>
#include <asm/tlbflush.h>
-/* XPFO page state flags */
-enum xpfo_flags {
- XPFO_PAGE_USER, /* Page is allocated to user-space */
- XPFO_PAGE_UNMAPPED, /* Page is unmapped from the linear map */
-};
-
-/* Per-page XPFO house-keeping data */
-struct xpfo {
- unsigned long flags; /* Page state */
- bool inited; /* Map counter and lock initialized */
- atomic_t mapcount; /* Counter for balancing map/unmap requests */
- spinlock_t maplock; /* Lock to serialize map/unmap requests */
-};
-
-DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+DEFINE_STATIC_KEY_TRUE(xpfo_inited);
DEFINE_STATIC_KEY_FALSE(xpfo_do_tlb_flush);
-static bool xpfo_disabled __initdata;
-
static int __init noxpfo_param(char *str)
{
- xpfo_disabled = true;
+ static_branch_disable(&xpfo_inited);
return 0;
}
@@ -57,34 +40,13 @@ static int __init xpfotlbflush_param(char *str)
early_param("noxpfo", noxpfo_param);
early_param("xpfotlbflush", xpfotlbflush_param);
-static bool __init need_xpfo(void)
-{
- if (xpfo_disabled) {
- printk(KERN_INFO "XPFO disabled\n");
- return false;
- }
-
- return true;
-}
-
-static void init_xpfo(void)
-{
- printk(KERN_INFO "XPFO enabled\n");
- static_branch_enable(&xpfo_inited);
-}
-
-struct page_ext_operations page_xpfo_ops = {
- .size = sizeof(struct xpfo),
- .need = need_xpfo,
- .init = init_xpfo,
-};
-
bool __init xpfo_enabled(void)
{
- return !xpfo_disabled;
+ if (!static_branch_unlikely(&xpfo_inited))
+ return false;
+ else
+ return true;
}
-EXPORT_SYMBOL(xpfo_enabled);
-
static void xpfo_cond_flush_kernel_tlb(struct page *page, int order)
{
@@ -92,58 +54,40 @@ static void xpfo_cond_flush_kernel_tlb(struct page *page, int order)
xpfo_flush_kernel_tlb(page, order);
}
-static inline struct xpfo *lookup_xpfo(struct page *page)
+void __meminit xpfo_init_single_page(struct page *page)
{
- struct page_ext *page_ext = lookup_page_ext(page);
-
- if (unlikely(!page_ext)) {
- WARN(1, "xpfo: failed to get page ext");
- return NULL;
- }
-
- return (void *)page_ext + page_xpfo_ops.offset;
+ spin_lock_init(&page->xpfo_lock);
}
void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
{
int i, flush_tlb = 0;
- struct xpfo *xpfo;
if (!static_branch_unlikely(&xpfo_inited))
return;
for (i = 0; i < (1 << order); i++) {
- xpfo = lookup_xpfo(page + i);
- if (!xpfo)
- continue;
-
- WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
- "xpfo: unmapped page being allocated\n");
-
- /* Initialize the map lock and map counter */
- if (unlikely(!xpfo->inited)) {
- spin_lock_init(&xpfo->maplock);
- atomic_set(&xpfo->mapcount, 0);
- xpfo->inited = true;
- }
- WARN(atomic_read(&xpfo->mapcount),
- "xpfo: already mapped page being allocated\n");
-
+#ifdef CONFIG_XPFO_DEBUG
+ BUG_ON(PageXpfoUser(page + i));
+ BUG_ON(PageXpfoUnmapped(page + i));
+ BUG_ON(spin_is_locked(&(page + i)->xpfo_lock));
+ BUG_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
if (static_branch_unlikely(&xpfo_do_tlb_flush)) {
/*
* Tag the page as a user page and flush the TLB if it
* was previously allocated to the kernel.
*/
- if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+ if (!TestSetPageXpfoUser(page + i))
flush_tlb = 1;
} else {
- set_bit(XPFO_PAGE_USER, &xpfo->flags);
+ SetPageXpfoUser(page + i);
}
} else {
/* Tag the page as a non-user (kernel) page */
- clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+ ClearPageXpfoUser(page + i);
}
}
@@ -154,27 +98,21 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
void xpfo_free_pages(struct page *page, int order)
{
int i;
- struct xpfo *xpfo;
if (!static_branch_unlikely(&xpfo_inited))
return;
for (i = 0; i < (1 << order); i++) {
- xpfo = lookup_xpfo(page + i);
- if (!xpfo || unlikely(!xpfo->inited)) {
- /*
- * The page was allocated before page_ext was
- * initialized, so it is a kernel page.
- */
- continue;
- }
+#ifdef CONFIG_XPFO_DEBUG
+ BUG_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
/*
* Map the page back into the kernel if it was previously
* allocated to user space.
*/
- if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
- clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ if (TestClearPageXpfoUser(page + i)) {
+ ClearPageXpfoUnmapped(page + i);
set_kpte(page_address(page + i), page + i,
PAGE_KERNEL);
}
@@ -183,84 +121,56 @@ void xpfo_free_pages(struct page *page, int order)
void xpfo_kmap(void *kaddr, struct page *page)
{
- struct xpfo *xpfo;
-
if (!static_branch_unlikely(&xpfo_inited))
return;
- xpfo = lookup_xpfo(page);
-
- /*
- * The page was allocated before page_ext was initialized (which means
- * it's a kernel page) or it's allocated to the kernel, so nothing to
- * do.
- */
- if (!xpfo || unlikely(!xpfo->inited) ||
- !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ if (!PageXpfoUser(page))
return;
- spin_lock(&xpfo->maplock);
+ spin_lock(&page->xpfo_lock);
/*
* The page was previously allocated to user space, so map it back
* into the kernel. No TLB flush required.
*/
- if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
- test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+ if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
+ TestClearPageXpfoUnmapped(page))
set_kpte(kaddr, page, PAGE_KERNEL);
- spin_unlock(&xpfo->maplock);
+ spin_unlock(&page->xpfo_lock);
}
EXPORT_SYMBOL(xpfo_kmap);
void xpfo_kunmap(void *kaddr, struct page *page)
{
- struct xpfo *xpfo;
-
if (!static_branch_unlikely(&xpfo_inited))
return;
- xpfo = lookup_xpfo(page);
-
- /*
- * The page was allocated before page_ext was initialized (which means
- * it's a kernel page) or it's allocated to the kernel, so nothing to
- * do.
- */
- if (!xpfo || unlikely(!xpfo->inited) ||
- !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+ if (!PageXpfoUser(page))
return;
- spin_lock(&xpfo->maplock);
+ spin_lock(&page->xpfo_lock);
/*
* The page is to be allocated back to user space, so unmap it from the
* kernel, flush the TLB and tag it as a user page.
*/
- if (atomic_dec_return(&xpfo->mapcount) == 0) {
- WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
- "xpfo: unmapping already unmapped page\n");
- set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
+#ifdef CONFIG_XPFO_DEBUG
+ BUG_ON(PageXpfoUnmapped(page));
+#endif
+ SetPageXpfoUnmapped(page);
set_kpte(kaddr, page, __pgprot(0));
xpfo_cond_flush_kernel_tlb(page, 0);
}
- spin_unlock(&xpfo->maplock);
+ spin_unlock(&page->xpfo_lock);
}
EXPORT_SYMBOL(xpfo_kunmap);
bool xpfo_page_is_unmapped(struct page *page)
{
- struct xpfo *xpfo;
-
- if (!static_branch_unlikely(&xpfo_inited))
- return false;
-
- xpfo = lookup_xpfo(page);
- if (unlikely(!xpfo) && !xpfo->inited)
- return false;
-
- return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+ return PageXpfoUnmapped(page);
}
EXPORT_SYMBOL(xpfo_page_is_unmapped);
diff --git a/security/Kconfig b/security/Kconfig
index 8d0e4e303551..c7c581bac963 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -13,7 +13,6 @@ config XPFO
bool "Enable eXclusive Page Frame Ownership (XPFO)"
default n
depends on ARCH_SUPPORTS_XPFO
- select PAGE_EXTENSION
help
This option offers protection against 'ret2dir' kernel attacks.
When enabled, every time a page frame is allocated to user space, it
@@ -25,6 +24,17 @@ config XPFO
If in doubt, say "N".
+config XPFO_DEBUG
+ bool "Enable debugging of XPFO"
+ default n
+ depends on XPFO
+ help
+ Enables additional checking of XPFO data structures that help find
+ bugs in the XPFO implementation. This option comes with a slight
+ performance cost.
+
+ If in doubt, say "N".
+
config SECURITY_DMESG_RESTRICT
bool "Restrict unprivileged access to the kernel syslog"
default n
--
2.17.1
On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <[email protected]> wrote:
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
>
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
>
> 4.20 915.183s
> 4.20+XPFO 24129.354s 26.366x
> 4.20+XPFO+Deferred flush 1216.987s 1.330xx
>
>
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
>
> 4.20 607.671s
> 4.20+XPFO 1588.646s 2.614x
> 4.20+XPFO+Deferred flush 794.473s 1.307xx
Well that's an impressive improvement! Nice work. :)
(Are the cpumask improvements possible to be extended to other TLB
flushing needs? i.e. could there be other performance gains with that
code even for a non-XPFO system?)
> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implemented
> delayed limited TLB flush on other CPUs.
>
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.
Dave, Andy, Ingo, Thomas, does anyone have time to look this over?
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
>
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").
At this point, yes, that does seem to make sense.
> What remains to be done beyond this patch series:
>
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
> from Juerg. I dropped it for now since swiotlb code for ARM has
> changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
> CPUs" to other architectures besides x86.
This seems like a good plan.
I've put this series in one of my tree so that 0day will find it and
grind tests...
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/xpfo/v7
Thanks!
--
Kees Cook
First of all, thanks for picking this back up. It looks to be going in
a very positive direction!
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush.
...
> A rogue process can launch a ret2dir attack only from a CPU that has
> dual mapping for its pages in physmap in its TLB. We can hence defer
> TLB flush on a CPU until a process that would have caused a TLB
> flush is scheduled on that CPU.
This logic is a bit suspect to me. Imagine a situation where we have
two attacker processes: one which is causing page to go from
kernel->user (and be unmapped from the kernel) and a second process that
*was* accessing that page.
The second process could easily have the page's old TLB entry. It could
abuse that entry as long as that CPU doesn't context switch
(switch_mm_irqs_off()) or otherwise flush the TLB entry.
As for where to flush the TLB... As you know, using synchronous IPIs is
obviously the most bulletproof from a mitigation perspective. If you
can batch the IPIs, you can get the overhead down, but you need to do
the flushes for a bunch of pages at once, which I think is what you were
exploring but haven't gotten working yet.
Anything else you do will have *some* reduced mitigation value, which
isn't a deal-breaker (to me at least). Some ideas:
Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S. Every time that
gets called, we've (potentially) just done a user->kernel transition and
might benefit from flushing the TLB. We're always doing a CR3 write (on
Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
X86_CR3_PCID_NOFLUSH_BIT is set. So, when you need a TLB flush, you
would set a bit that ADJUST_KERNEL_CR3 would see on the next
user->kernel transition on *each* CPU. Potentially, multiple TLB
flushes could be coalesced this way. The downside of this is that
you're exposed to the old TLB entries if a flush is needed while you are
already *in* the kernel.
You could also potentially do this from C code, like in the syscall
entry code, or in sensitive places, like when you're returning from a
guest after a VMEXIT in the kvm code.
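Very roughly, and purely as a sketch (none of these names exist, and the
memory-ordering details are hand-waved away), the C-code variant might look
something like this:

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/tlbflush.h>

/* One "flush before doing anything else" flag per CPU. */
static DEFINE_PER_CPU(bool, xpfo_flush_pending);

/* Instead of IPI'ing every CPU right away, just mark the flush pending. */
static void xpfo_post_pending_flush(void)
{
        int cpu;

        for_each_online_cpu(cpu)
                per_cpu(xpfo_flush_pending, cpu) = true;
}

/* Called on (or near) each user->kernel transition on this CPU. */
static void xpfo_do_pending_flush(void)
{
        if (this_cpu_read(xpfo_flush_pending)) {
                this_cpu_write(xpfo_flush_pending, false);
                __flush_tlb_all();
        }
}

The asm path would be the moral equivalent of xpfo_do_pending_flush(),
keyed off a bit that ADJUST_KERNEL_CR3 can see.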
Thanks for looking this over.
On 1/10/19 4:07 PM, Kees Cook wrote:
> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <[email protected]> wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush. Problem is made worse by
>> the fact that if number of entries being flushed exceeds
>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>> every CPU. A rogue process can launch a ret2dir attack only from a
>> CPU that has dual mapping for its pages in physmap in its TLB. We
>> can hence defer TLB flush on a CPU until a process that would have
>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>> to task_struct which is then used to post pending TLB flush on CPUs
>> other than the one a process is running on. This cpumask is checked
>> when a process migrates to a new CPU and TLB is flushed at that
>> time. I measured system time for parallel make with unmodified 4.20
>> kernel, 4.20 with XPFO patches before this optimization and then
>> again after applying this optimization. Here are the results:
>>
>> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
>> make -j60 all
>>
>> 4.20 915.183s
>> 4.20+XPFO 24129.354s 26.366x
>> 4.20+XPFO+Deferred flush 1216.987s 1.330xx
>>
>>
>> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
>> make -j4 all
>>
>> 4.20 607.671s
>> 4.20+XPFO 1588.646s 2.614x
>> 4.20+XPFO+Deferred flush 794.473s 1.307xx
>
> Well that's an impressive improvement! Nice work. :)
>
> (Are the cpumask improvements possible to be extended to other TLB
> flushing needs? i.e. could there be other performance gains with that
> code even for a non-XPFO system?)
It may be usable for other situations as well but I have not given it
any thought yet. I will take a look.
>
>> 30+% overhead is still very high and there is room for improvement.
>> Dave Hansen had suggested batch updating TLB entries and Tycho had
>> created an initial implementation but I have not been able to get
>> that to work correctly. I am still working on it and I suspect we
>> will see a noticeable improvement in performance with that. In the
>> code I added, I post a pending full TLB flush to all other CPUs even
>> when number of TLB entries being flushed on current CPU does not
>> exceed tlb_single_page_flush_ceiling. There has to be a better way
>> to do this. I just haven't found an efficient way to implemented
>> delayed limited TLB flush on other CPUs.
>>
>> I am not entirely sure if switch_mm_irqs_off() is indeed the right
>> place to perform the pending TLB flush for a CPU. Any feedback on
>> that will be very helpful. Delaying full TLB flushes on other CPUs
>> seems to help tremendously, so if there is a better way to implement
>> the same thing than what I have done in patch 16, I am open to
>> ideas.
>
> Dave, Andy, Ingo, Thomas, does anyone have time to look this over?
>
>> Performance with this patch set is good enough to use these as
>> starting point for further refinement before we merge it into main
>> kernel, hence RFC.
>>
>> Since not flushing stale TLB entries creates a false sense of
>> security, I would recommend making TLB flush mandatory and eliminate
>> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
>> flushing by default for XPFO page table modifications").
>
> At this point, yes, that does seem to make sense.
>
>> What remains to be done beyond this patch series:
>>
>> 1. Performance improvements
>> 2. Remove xpfotlbflush parameter
>> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>> from Juerg. I dropped it for now since swiotlb code for ARM has
>> changed a lot in 4.20.
>> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>> CPUs" to other architectures besides x86.
>
> This seems like a good plan.
>
> I've put this series in one of my tree so that 0day will find it and
> grind tests...
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/xpfo/v7
Thanks for doing that!
--
Khalid
On Thu, Jan 10, 2019 at 3:07 PM Kees Cook <[email protected]> wrote:
>
> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <[email protected]> wrote:
> > I implemented a solution to reduce performance penalty and
> > that has had large impact. When XPFO code flushes stale TLB entries,
> > it does so for all CPUs on the system which may include CPUs that
> > may not have any matching TLB entries or may never be scheduled to
> > run the userspace task causing TLB flush. Problem is made worse by
> > the fact that if number of entries being flushed exceeds
> > tlb_single_page_flush_ceiling, it results in a full TLB flush on
> > every CPU. A rogue process can launch a ret2dir attack only from a
> > CPU that has dual mapping for its pages in physmap in its TLB. We
> > can hence defer TLB flush on a CPU until a process that would have
> > caused a TLB flush is scheduled on that CPU. I have added a cpumask
> > to task_struct which is then used to post pending TLB flush on CPUs
> > other than the one a process is running on. This cpumask is checked
> > when a process migrates to a new CPU and TLB is flushed at that
> > time. I measured system time for parallel make with unmodified 4.20
> > kernel, 4.20 with XPFO patches before this optimization and then
> > again after applying this optimization. Here are the results:
I wasn't cc'd on the patch, so I don't know the exact details.
I'm assuming that "ret2dir" means that you corrupt the kernel into
using a direct-map page as its stack. If so, then I don't see why the
task in whose context the attack is launched needs to be the same
process as the one that has the page mapped for user access.
My advice would be to attempt an entirely different optimization: try
to avoid putting pages *back* into the direct map when they're freed
until there is an actual need to use them for kernel purposes.
How are you handling page cache? Presumably MAP_SHARED PROT_WRITE
pages are still in the direct map so that IO works.
On Thu, Jan 10, 2019 at 03:40:04PM -0800, Dave Hansen wrote:
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least). Some ideas:
>
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S. Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB. We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set. So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU. Potentially, multiple TLB
> flushes could be coalesced this way. The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.
I would really prefer not to depend on the PTI crud for new stuff. We
really want to get rid of that code on unaffected CPUs.
On Thu, Jan 10, 2019 at 02:09:39PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <[email protected]>
>
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault, so let's temporarily map the region, flush the cache, and then
> unmap it.
>
> v6: actually flush in the face of xpfo, and temporarily map the underlying
> memory so it can be flushed correctly
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> arch/arm64/mm/flush.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 30695a868107..f12f26b60319 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -20,6 +20,7 @@
> #include <linux/export.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> +#include <linux/xpfo.h>
>
> #include <asm/cacheflush.h>
> #include <asm/cache.h>
> @@ -28,9 +29,15 @@
> void sync_icache_aliases(void *kaddr, unsigned long len)
> {
> unsigned long addr = (unsigned long)kaddr;
> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> + void *mapping[num_pages];
Does this still compile with -Wvla? It was a bad hack on my part, and
we should probably just drop it and come up with something else :)
Tycho
> if (icache_is_aliasing()) {
> + xpfo_temp_map(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
> __clean_dcache_area_pou(kaddr, len);
> + xpfo_temp_unmap(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
> __flush_icache_all();
> } else {
> flush_icache_range(addr, addr + len);
> --
> 2.17.1
>
On 1/11/19 7:54 AM, Tycho Andersen wrote:
> On Thu, Jan 10, 2019 at 02:09:39PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger <[email protected]>
>>
>> If the page is unmapped by XPFO, a data cache flush results in a fatal
>> page fault, so let's temporarily map the region, flush the cache, and then
>> unmap it.
>>
>> v6: actually flush in the face of xpfo, and temporarily map the underlying
>> memory so it can be flushed correctly
>>
>> CC: [email protected]
>> Signed-off-by: Juerg Haefliger <[email protected]>
>> Signed-off-by: Tycho Andersen <[email protected]>
>> Signed-off-by: Khalid Aziz <[email protected]>
>> ---
>> arch/arm64/mm/flush.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>> index 30695a868107..f12f26b60319 100644
>> --- a/arch/arm64/mm/flush.c
>> +++ b/arch/arm64/mm/flush.c
>> @@ -20,6 +20,7 @@
>> #include <linux/export.h>
>> #include <linux/mm.h>
>> #include <linux/pagemap.h>
>> +#include <linux/xpfo.h>
>>
>> #include <asm/cacheflush.h>
>> #include <asm/cache.h>
>> @@ -28,9 +29,15 @@
>> void sync_icache_aliases(void *kaddr, unsigned long len)
>> {
>> unsigned long addr = (unsigned long)kaddr;
>> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
>> + void *mapping[num_pages];
>
> Does this still compile with -Wvla? It was a bad hack on my part, and
> we should probably just drop it and come up with something else :)
I will make a note of it. I hope someone with better knowledge of arm64
than me can come up with a better solution ;)
--
Khalid
>
> Tycho
>
>> if (icache_is_aliasing()) {
>> + xpfo_temp_map(kaddr, len, mapping,
>> + sizeof(mapping[0]) * num_pages);
>> __clean_dcache_area_pou(kaddr, len);
>> + xpfo_temp_unmap(kaddr, len, mapping,
>> + sizeof(mapping[0]) * num_pages);
>> __flush_icache_all();
>> } else {
>> flush_icache_range(addr, addr + len);
>> --
>> 2.17.1
>>
Hi Dave,
Thanks for looking at this and providing feedback.
On 1/10/19 4:40 PM, Dave Hansen wrote:
> First of all, thanks for picking this back up. It looks to be going in
> a very positive direction!
>
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush.
> ...
>> A rogue process can launch a ret2dir attack only from a CPU that has
>> dual mapping for its pages in physmap in its TLB. We can hence defer
>> TLB flush on a CPU until a process that would have caused a TLB
>> flush is scheduled on that CPU.
>
> This logic is a bit suspect to me. Imagine a situation where we have
> two attacker processes: one which is causing page to go from
> kernel->user (and be unmapped from the kernel) and a second process that
> *was* accessing that page.
>
> The second process could easily have the page's old TLB entry. It could
> abuse that entry as long as that CPU doesn't context switch
> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
That is an interesting scenario. Working through this scenario, physmap
TLB entry for a page is flushed on the local processor when the page is
allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
page back into kernel, that page is mapped into kernel space using a va
from kmap pool in xpfo_kmap() which can be different for each new
mapping of the same page. The physical page is unmapped from kernel on
the way back from kernel to userspace by xpfo_kunmap(). So two processes
on different CPUs sharing same physical page might not be seeing the
same virtual address for that page while they are in the kernel, as long
as it is an address from kmap pool. ret2dir attack relies upon being
able to craft a predictable virtual address in the kernel physmap for a
physical page and redirect execution to that address. Does that sound right?
Now what happens if only one of these cooperating processes allocates
the page, places a malicious payload on that page, and passes the address
of this page to the other process, which can deduce the physmap address
for the page through /proc and exploit the physmap entry for the page on
its CPU? That must be the scenario you are referring to.
>
> As for where to flush the TLB... As you know, using synchronous IPIs is
> obviously the most bulletproof from a mitigation perspective. If you
> can batch the IPIs, you can get the overhead down, but you need to do
> the flushes for a bunch of pages at once, which I think is what you were
> exploring but haven't gotten working yet.
>
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least). Some ideas:
Even without batched IPIs working reliably, I was able to measure the
performance impact of this partially working solution. With just batched
IPIs and no delayed TLB flushes, performance improved by a factor of 2.
The 26x system time went down to 12x-13x but it was still too high and a
non-starter. Combining batched IPI with delayed TLB flushes improved
performance to about 1.1x as opposed to 1.33x with delayed TLB flush
alone. Those numbers are very rough since the batching implementation is
incomplete.
>
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S. Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB. We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set. So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU. Potentially, multiple TLB
> flushes could be coalesced this way. The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.
>
> You could also potentially do this from C code, like in the syscall
> entry code, or in sensitive places, like when you're returning from a
> guest after a VMEXIT in the kvm code.
>
Good suggestions. Thanks.
I think the benefit will be highest from batching TLB flushes. I see a lot
of time consumed by full TLB flushes on other processors when the local
processor did only a limited TLB flush. I will continue to debug the
batched TLB updates.
--
Khalid
On Fri, Jan 11, 2019 at 11:28:19AM -0700, Khalid Aziz wrote:
> On 1/11/19 7:54 AM, Tycho Andersen wrote:
> > On Thu, Jan 10, 2019 at 02:09:39PM -0700, Khalid Aziz wrote:
> >> From: Juerg Haefliger <[email protected]>
> >>
> >> If the page is unmapped by XPFO, a data cache flush results in a fatal
> >> page fault, so let's temporarily map the region, flush the cache, and then
> >> unmap it.
> >>
> >> v6: actually flush in the face of xpfo, and temporarily map the underlying
> >> memory so it can be flushed correctly
> >>
> >> CC: [email protected]
> >> Signed-off-by: Juerg Haefliger <[email protected]>
> >> Signed-off-by: Tycho Andersen <[email protected]>
> >> Signed-off-by: Khalid Aziz <[email protected]>
> >> ---
> >> arch/arm64/mm/flush.c | 7 +++++++
> >> 1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> >> index 30695a868107..f12f26b60319 100644
> >> --- a/arch/arm64/mm/flush.c
> >> +++ b/arch/arm64/mm/flush.c
> >> @@ -20,6 +20,7 @@
> >> #include <linux/export.h>
> >> #include <linux/mm.h>
> >> #include <linux/pagemap.h>
> >> +#include <linux/xpfo.h>
> >>
> >> #include <asm/cacheflush.h>
> >> #include <asm/cache.h>
> >> @@ -28,9 +29,15 @@
> >> void sync_icache_aliases(void *kaddr, unsigned long len)
> >> {
> >> unsigned long addr = (unsigned long)kaddr;
> >> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> >> + void *mapping[num_pages];
> >
> > Does this still compile with -Wvla? It was a bad hack on my part, and
> > we should probably just drop it and come up with something else :)
>
> I will make a note of it. I hope someone with better knowledge of arm64
> than me can come up with a better solution ;)
It's not just arm64; IIRC, every place I used xpfo_temp_map() has a VLA.
I think this is in part because some of these paths don't allow
allocation failures, so we can't do a dynamic allocation. Perhaps we
need to reserve some memory for each call site?
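Or maybe we could walk the region a page at a time so the scratch array
has a fixed size; something vaguely like this (just a sketch against the
existing xpfo_temp_map() signature, the function name is made up):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/xpfo.h>
#include <asm/cacheflush.h>

/*
 * Clean the D-cache for a region that may be XPFO-unmapped, one page at
 * a time, so the temporary mapping array can be a single fixed slot.
 */
static void xpfo_clean_dcache_area_pou(void *kaddr, unsigned long len)
{
        unsigned long addr = (unsigned long)kaddr;
        unsigned long end = addr + len;

        while (addr < end) {
                unsigned long chunk = min(end - addr,
                                          PAGE_SIZE - (addr & ~PAGE_MASK));
                void *mapping[1];

                xpfo_temp_map((void *)addr, chunk, mapping, sizeof(mapping));
                __clean_dcache_area_pou((void *)addr, chunk);
                xpfo_temp_unmap((void *)addr, chunk, mapping, sizeof(mapping));

                addr += chunk;
        }
}

That trades one map/unmap per call for one per page, so it may be slower,
but it avoids both the VLA and a reserved buffer.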
Tycho
>> The second process could easily have the page's old TLB entry. It could
>> abuse that entry as long as that CPU doesn't context switch
>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>
> That is an interesting scenario. Working through this scenario, physmap
> TLB entry for a page is flushed on the local processor when the page is
> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
> page back into kernel, that page is mapped into kernel space using a va
> from kmap pool in xpfo_kmap() which can be different for each new
> mapping of the same page. The physical page is unmapped from kernel on
> the way back from kernel to userspace by xpfo_kunmap(). So two processes
> on different CPUs sharing same physical page might not be seeing the
> same virtual address for that page while they are in the kernel, as long
> as it is an address from kmap pool. ret2dir attack relies upon being
> able to craft a predictable virtual address in the kernel physmap for a
> physical page and redirect execution to that address. Does that sound right?
All processes share one set of kernel page tables. Or, did your patches
change that somehow that I missed?
Since they share the page tables, they implicitly share kmap*()
mappings. kmap_atomic() is not *used* by more than one CPU, but the
mapping is accessible and at least exists for all processors.
I'm basically assuming that any entry mapped in a shared page table is
exploitable on any CPU regardless of where we logically *want* it to be
used.
On Fri, Jan 11, 2019 at 12:42 PM Dave Hansen <[email protected]> wrote:
>
> >> The second process could easily have the page's old TLB entry. It could
> >> abuse that entry as long as that CPU doesn't context switch
> >> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
> >
> > That is an interesting scenario. Working through this scenario, physmap
> > TLB entry for a page is flushed on the local processor when the page is
> > allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
> > page back into kernel, that page is mapped into kernel space using a va
> > from kmap pool in xpfo_kmap() which can be different for each new
> > mapping of the same page. The physical page is unmapped from kernel on
> > the way back from kernel to userspace by xpfo_kunmap(). So two processes
> > on different CPUs sharing same physical page might not be seeing the
> > same virtual address for that page while they are in the kernel, as long
> > as it is an address from kmap pool. ret2dir attack relies upon being
> > able to craft a predictable virtual address in the kernel physmap for a
> > physical page and redirect execution to that address. Does that sound right?
>
> All processes share one set of kernel page tables. Or, did your patches
> change that somehow that I missed?
>
> Since they share the page tables, they implicitly share kmap*()
> mappings. kmap_atomic() is not *used* by more than one CPU, but the
> mapping is accessible and at least exists for all processors.
>
> I'm basically assuming that any entry mapped in a shared page table is
> exploitable on any CPU regardless of where we logically *want* it to be
> used.
>
>
We can, very easily, have kernel mappings that are private to a given
mm. Maybe this is useful here.
On 1/10/19 5:44 PM, Andy Lutomirski wrote:
> On Thu, Jan 10, 2019 at 3:07 PM Kees Cook <[email protected]> wrote:
>>
>> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <[email protected]> wrote:
>>> I implemented a solution to reduce performance penalty and
>>> that has had large impact. When XPFO code flushes stale TLB entries,
>>> it does so for all CPUs on the system which may include CPUs that
>>> may not have any matching TLB entries or may never be scheduled to
>>> run the userspace task causing TLB flush. Problem is made worse by
>>> the fact that if number of entries being flushed exceeds
>>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>>> every CPU. A rogue process can launch a ret2dir attack only from a
>>> CPU that has dual mapping for its pages in physmap in its TLB. We
>>> can hence defer TLB flush on a CPU until a process that would have
>>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>>> to task_struct which is then used to post pending TLB flush on CPUs
>>> other than the one a process is running on. This cpumask is checked
>>> when a process migrates to a new CPU and TLB is flushed at that
>>> time. I measured system time for parallel make with unmodified 4.20
>>> kernel, 4.20 with XPFO patches before this optimization and then
>>> again after applying this optimization. Here are the results:
>
> I wasn't cc'd on the patch, so I don't know the exact details.
>
> I'm assuming that "ret2dir" means that you corrupt the kernel into
> using a direct-map page as its stack. If so, then I don't see why the
> task in whose context the attack is launched needs to be the same
> process as the one that has the page mapped for user access.
You are right. More work is needed to refine delayed TLB flush to close
this gap.
>
> My advice would be to attempt an entirely different optimization: try
> to avoid putting pages *back* into the direct map when they're freed
> until there is an actual need to use them for kernel purposes.
I had thought about that, but it turns out the performance impact comes
from the initial allocation of the page and the resulting TLB flushes, not
from putting the pages back into the direct map. The way we could benefit
from not adding pages back to the direct map is to change page allocation
to prefer pages that are not in the direct map. That way we incur the cost
of the TLB flushes initially, but then satisfy multiple allocation requests
after that from those "xpfo cost"-free pages. More changes will be needed
to decide which of these pages can be added back to the direct map without
degenerating into the worst-case scenario of a page bouncing constantly
between this list of preferred pages and the direct-mapped pages. It
started to get complex enough that I decided to put this in my back pocket
and attempt simpler approaches first :)
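Just to illustrate the shape of that idea (purely hypothetical, nothing
like this exists in the current patches and all the names below are made
up), the allocator would first try a side list of pages that are already
unmapped from the physmap:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Pages already unmapped from the physmap, i.e. "xpfo cost"-free. */
static LIST_HEAD(xpfo_unmapped_list);
static DEFINE_SPINLOCK(xpfo_unmapped_list_lock);

/*
 * Try to satisfy a userspace allocation from the already-unmapped list
 * so we do not pay for another set_kpte() plus TLB flush.
 */
static struct page *xpfo_try_unmapped_page(gfp_t gfp)
{
        struct page *page = NULL;

        if ((gfp & GFP_HIGHUSER) != GFP_HIGHUSER)
                return NULL;

        spin_lock(&xpfo_unmapped_list_lock);
        if (!list_empty(&xpfo_unmapped_list)) {
                page = list_first_entry(&xpfo_unmapped_list,
                                        struct page, lru);
                list_del(&page->lru);
        }
        spin_unlock(&xpfo_unmapped_list_lock);

        return page;
}

The hard part, as I said, is deciding when pages on this list should go
back into the direct map without them just bouncing back and forth.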
>
> How are you handling page cache? Presumably MAP_SHARED PROT_WRITE
> pages are still in the direct map so that IO works.
>
Since Juerg wrote the actual implementation of XPFO, he probably
understands it better. XPFO tackles only the page allocation requests
from userspace and does not touch page cache pages.
--
Khalid
On 1/11/19 1:42 PM, Dave Hansen wrote:
>>> The second process could easily have the page's old TLB entry. It could
>>> abuse that entry as long as that CPU doesn't context switch
>>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>
>> That is an interesting scenario. Working through this scenario, physmap
>> TLB entry for a page is flushed on the local processor when the page is
>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>> page back into kernel, that page is mapped into kernel space using a va
>> from kmap pool in xpfo_kmap() which can be different for each new
>> mapping of the same page. The physical page is unmapped from kernel on
>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>> on different CPUs sharing same physical page might not be seeing the
>> same virtual address for that page while they are in the kernel, as long
>> as it is an address from kmap pool. ret2dir attack relies upon being
>> able to craft a predictable virtual address in the kernel physmap for a
>> physical page and redirect execution to that address. Does that sound right?
>
> All processes share one set of kernel page tables. Or, did your patches
> change that somehow that I missed?
>
> Since they share the page tables, they implicitly share kmap*()
> mappings. kmap_atomic() is not *used* by more than one CPU, but the
> mapping is accessible and at least exists for all processors.
>
> I'm basically assuming that any entry mapped in a shared page table is
> exploitable on any CPU regardless of where we logically *want* it to be
> used.
>
>
Ah, I see what you are saying. A virtual address mapped on one
processor is visible on the other processors as well, and one process
could communicate that va to another process in some way so it could be
exploited by that other process. The va is exploitable only between the
kmap and the matching kunmap, but the window exists. I am trying to
understand your scenario so I can address it correctly.
--
Khalid
On 1/11/19 2:06 PM, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:42 PM Dave Hansen <[email protected]> wrote:
>>
>>>> The second process could easily have the page's old TLB entry. It could
>>>> abuse that entry as long as that CPU doesn't context switch
>>>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>>
>>> That is an interesting scenario. Working through this scenario, physmap
>>> TLB entry for a page is flushed on the local processor when the page is
>>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>>> page back into kernel, that page is mapped into kernel space using a va
>>> from kmap pool in xpfo_kmap() which can be different for each new
>>> mapping of the same page. The physical page is unmapped from kernel on
>>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>>> on different CPUs sharing same physical page might not be seeing the
>>> same virtual address for that page while they are in the kernel, as long
>>> as it is an address from kmap pool. ret2dir attack relies upon being
>>> able to craft a predictable virtual address in the kernel physmap for a
>>> physical page and redirect execution to that address. Does that sound right?
>>
>> All processes share one set of kernel page tables. Or, did your patches
>> change that somehow that I missed?
>>
>> Since they share the page tables, they implicitly share kmap*()
>> mappings. kmap_atomic() is not *used* by more than one CPU, but the
>> mapping is accessible and at least exists for all processors.
>>
>> I'm basically assuming that any entry mapped in a shared page table is
>> exploitable on any CPU regardless of where we logically *want* it to be
>> used.
>>
>>
>
> We can, very easily, have kernel mappings that are private to a given
> mm. Maybe this is useful here.
>
That sounds like an interesting idea. kmap mappings would be a good
candidate for that. Those are temporary mappings and should only be
valid for one process.
--
Khalid
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO. After the last round of updates, we were seeing very
> significant performance penalties when stale TLB entries were flushed
> actively after an XPFO TLB update. Benchmark for measuring performance
> is kernel build using parallel make. To get full protection from
> ret2dir attackes, we must flush stale TLB entries. Performance
> penalty from flushing stale TLB entries goes up as the number of
> cores goes up. On a desktop class machine with only 4 cores,
> enabling TLB flush for stale entries causes system time for "make
> -j4" to go up by a factor of 2.614x but on a larger machine with 96
> cores, system time with "make -j60" goes up by a factor of 26.366x!
> I have been working on reducing this performance penalty.
>
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
>
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
>
> 4.20 915.183s
> 4.20+XPFO 24129.354s 26.366x
> 4.20+XPFO+Deferred flush 1216.987s 1.330xx
>
>
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
>
> 4.20 607.671s
> 4.20+XPFO 1588.646s 2.614x
> 4.20+XPFO+Deferred flush 794.473s 1.307xx
>
> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implemented
> delayed limited TLB flush on other CPUs.
>
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.
>
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
>
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").
>
> What remains to be done beyond this patch series:
>
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
> from Juerg. I dropped it for now since swiotlb code for ARM has
> changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
> CPUs" to other architectures besides x86.
>
>
> ---------------------------------------------------------
>
> Juerg Haefliger (5):
> mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
> swiotlb: Map the buffer if it was unmapped by XPFO
> arm64/mm: Add support for XPFO
> arm64/mm, xpfo: temporarily map dcache regions
> lkdtm: Add test for XPFO
>
> Julian Stecklina (4):
> mm, x86: omit TLB flushing by default for XPFO page table
> modifications
> xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
> xpfo, mm: optimize spinlock usage in xpfo_kunmap
> EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap
>
> Khalid Aziz (2):
> xpfo, mm: Fix hang when booting with "xpfotlbflush"
> xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
>
> Tycho Andersen (5):
> mm: add MAP_HUGETLB support to vm_mmap
> x86: always set IF before oopsing from page fault
> xpfo: add primitives for mapping underlying memory
> arm64/mm: disable section/contiguous mappings if XPFO is enabled
> mm: add a user_virt_to_phys symbol
>
> .../admin-guide/kernel-parameters.txt | 2 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 +
> arch/arm64/mm/flush.c | 7 +
> arch/arm64/mm/mmu.c | 2 +-
> arch/arm64/mm/xpfo.c | 58 ++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 26 ++
> arch/x86/include/asm/tlbflush.h | 1 +
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/fault.c | 10 +
> arch/x86/mm/pageattr.c | 23 +-
> arch/x86/mm/tlb.c | 27 ++
> arch/x86/mm/xpfo.c | 171 ++++++++++++
> drivers/misc/lkdtm/Makefile | 1 +
> drivers/misc/lkdtm/core.c | 3 +
> drivers/misc/lkdtm/lkdtm.h | 5 +
> drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++
> include/linux/highmem.h | 15 +-
> include/linux/mm.h | 2 +
> include/linux/mm_types.h | 8 +
> include/linux/page-flags.h | 13 +
> include/linux/sched.h | 9 +
> include/linux/xpfo.h | 90 +++++++
> include/trace/events/mmflags.h | 10 +-
> kernel/dma/swiotlb.c | 3 +-
> mm/Makefile | 1 +
> mm/mmap.c | 19 +-
> mm/page_alloc.c | 3 +
> mm/util.c | 32 +++
> mm/xpfo.c | 247 ++++++++++++++++++
> security/Kconfig | 29 ++
> 32 files changed, 974 insertions(+), 43 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm/xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
>
So this seems to blow up immediately on my arm64 box with a config
based on Fedora:
[ 11.008243] Unable to handle kernel paging request at virtual address ffff8003f8602f9b
[ 11.016133] Mem abort info:
[ 11.018926] ESR = 0x96000007
[ 11.021967] Exception class = DABT (current EL), IL = 32 bits
[ 11.027858] SET = 0, FnV = 0
[ 11.030904] EA = 0, S1PTW = 0
[ 11.034030] Data abort info:
[ 11.036896] ISV = 0, ISS = 0x00000007
[ 11.040718] CM = 0, WnR = 0
[ 11.043672] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
[ 11.050523] [ffff8003f8602f9b] pgd=00000043ffff7803, pud=00000043fe113803, pmd=00000043fc376803, pte=00e80043f8602f13
[ 11.061094] Internal error: Oops: 96000007 [#3] SMP
[ 11.065948] Modules linked in: xfs libcrc32c sdhci_of_arasan sdhci_pltfm sdhci i2c_xgene_slimpro cqhci gpio_dwapb xhci_plat_hcd gpio_xgene_sb gpio_keys
[ 11.079454] CPU: 3 PID: 577 Comm: systemd-getty-g Tainted: G D 4.20.0-xpfo+ #9
[ 11.087936] Hardware name: http://www.apm.com American Megatrends/American Megatrends, BIOS 3.07.06 20/03/2015
[ 11.097285] pstate: 00400005 (nzcv daif +PAN -UAO)
[ 11.102057] pc : __memcpy+0x20/0x180
[ 11.105616] lr : __access_remote_vm+0x7c/0x1f0
[ 11.110036] sp : ffff000011cb3c20
[ 11.113333] x29: ffff000011cb3c20 x28: ffff8003f8602000
[ 11.118619] x27: 0000000000000f9b x26: 0000000000001000
[ 11.123904] x25: 000083ffffffffff x24: cccccccccccccccd
[ 11.129189] x23: ffff8003d7c53000 x22: 0000000000000044
[ 11.134474] x21: 0000fffff0591f9b x20: 0000000000000044
[ 11.139759] x19: 0000000000000044 x18: 0000000000000000
[ 11.145044] x17: 0000000000000002 x16: 0000000000000000
[ 11.150329] x15: 0000000000000000 x14: 0000000000000000
[ 11.155614] x13: 0000000000000000 x12: 0000000000000000
[ 11.160899] x11: 0000000000000000 x10: 0000000000000000
[ 11.166184] x9 : 0000000000000000 x8 : 0000000000000000
[ 11.171469] x7 : 0000000000000000 x6 : ffff8003d7c53000
[ 11.176754] x5 : 00e00043f8602fd3 x4 : 0000000000000005
[ 11.182038] x3 : 00000003f8602000 x2 : 000000000000003f
[ 11.187323] x1 : ffff8003f8602f9b x0 : ffff8003d7c53000
[ 11.192609] Process systemd-getty-g (pid: 577, stack limit = 0x(____ptrval____))
[ 11.199967] Call trace:
[ 11.202400] __memcpy+0x20/0x180
[ 11.205611] access_remote_vm+0x4c/0x60
[ 11.209428] environ_read+0x12c/0x260
[ 11.213071] __vfs_read+0x48/0x158
[ 11.216454] vfs_read+0x94/0x150
[ 11.219665] ksys_read+0x54/0xb0
[ 11.222875] __arm64_sys_read+0x24/0x30
[ 11.226691] el0_svc_handler+0x94/0x110
[ 11.230508] el0_svc+0x8/0xc
[ 11.233375] Code: f2400c84 540001c0 cb040042 36000064 (38401423)
[ 11.239439] ---[ end trace 4132d3416fb70591 ]---
I'll see if I get some time tomorrow to dig into this unless
someone spots a problem sooner.
Thanks,
Laura
On 1/16/19 7:56 AM, Julian Stecklina wrote:
> Khalid Aziz <[email protected]> writes:
>
>> I am continuing to build on the work Juerg, Tycho and Julian have done
>> on XPFO.
>
> Awesome!
>
>> A rogue process can launch a ret2dir attack only from a CPU that has
>> dual mapping for its pages in physmap in its TLB. We can hence defer
>> TLB flush on a CPU until a process that would have caused a TLB flush
>> is scheduled on that CPU.
>
> Assuming the attacker already has the ability to execute arbitrary code
> in userspace, they can just create a second process and thus avoid the
> TLB flush. Am I getting this wrong?
No, you got it right. The patch I wrote closes the security hole when
the attack is launched from the same process, but it still leaves a
window open when the attack is launched from another process. I am
working on figuring out how to close that hole while keeping the
performance where it is now. A synchronous TLB flush across all cores
is the most secure, but the performance impact is horrendous.
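
For reference, the deferred flush in patch 16 works roughly like the
sketch below; the field and helper names here are illustrative and may
not match the patch exactly:

/*
 * Assume task_struct gains a cpumask of CPUs that still owe a full
 * TLB flush on behalf of this task:
 *
 *      cpumask_t pending_xpfo_flush;
 */

static void xpfo_flush_tlb_deferred(unsigned long kaddr)
{
        /* Flush the stale physmap entry on the local CPU right away. */
        flush_tlb_kernel_range(kaddr, kaddr + PAGE_SIZE);

        /* Post a pending full flush for every other CPU. */
        cpumask_copy(&current->pending_xpfo_flush, cpu_online_mask);
        cpumask_clear_cpu(smp_processor_id(),
                          &current->pending_xpfo_flush);
}

/* Called from switch_mm_irqs_off() when the task lands on this CPU. */
static void xpfo_check_pending_flush(struct task_struct *tsk)
{
        int cpu = smp_processor_id();

        if (cpumask_test_cpu(cpu, &tsk->pending_xpfo_flush)) {
                cpumask_clear_cpu(cpu, &tsk->pending_xpfo_flush);
                __flush_tlb_all();
        }
}

The gap you point out is that the pending mask travels with the task
that triggered the unmap; a different task already running on a CPU
with the stale entry never goes through this path.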
--
Khalid
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> From: Julian Stecklina <[email protected]>
>
> We can reduce spin lock usage in xpfo_kmap to the 0->1 transition of
> the mapcount. This means that xpfo_kmap() can now race and that we
> get spurious page faults.
>
> The page fault handler helps the system make forward progress by
> fixing the page table instead of allowing repeated page faults until
> the right xpfo_kmap went through.
>
> Model-checked with up to 4 concurrent callers with Spin.
>
This needs the spurious check for arm64 as well. This at
least gets me booting but could probably use more review:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7d9571f4ae3d..8f425848cbb9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -32,6 +32,7 @@
#include <linux/perf_event.h>
#include <linux/preempt.h>
#include <linux/hugetlb.h>
+#include <linux/xpfo.h>
#include <asm/bug.h>
#include <asm/cmpxchg.h>
@@ -289,6 +290,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
return;
+ if (xpfo_spurious_fault(addr))
+ return;
+
if (is_el1_permission_fault(addr, esr, regs)) {
if (esr & ESR_ELx_WNR)
msg = "write to read-only memory";
> Signed-off-by: Julian Stecklina <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Vasileios P. Kemerlis <[email protected]>
> Cc: Juerg Haefliger <[email protected]>
> Cc: Tycho Andersen <[email protected]>
> Cc: Marco Benatto <[email protected]>
> Cc: David Woodhouse <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> arch/x86/mm/fault.c | 4 ++++
> include/linux/xpfo.h | 4 ++++
> mm/xpfo.c | 50 +++++++++++++++++++++++++++++++++++++-------
> 3 files changed, 51 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index ba51652fbd33..207081dcd572 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -18,6 +18,7 @@
> #include <linux/uaccess.h> /* faulthandler_disabled() */
> #include <linux/efi.h> /* efi_recover_from_page_fault()*/
> #include <linux/mm_types.h>
> +#include <linux/xpfo.h>
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/traps.h> /* dotraplinkage, ... */
> @@ -1218,6 +1219,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> if (kprobes_fault(regs))
> return;
>
> + if (xpfo_spurious_fault(address))
> + return;
> +
> /*
> * Note, despite being a "bad area", there are quite a few
> * acceptable reasons to get here, such as erratum fixups
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index ea5188882f49..58dd243637d2 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -54,6 +54,8 @@ bool xpfo_enabled(void);
>
> phys_addr_t user_virt_to_phys(unsigned long addr);
>
> +bool xpfo_spurious_fault(unsigned long addr);
> +
> #else /* !CONFIG_XPFO */
>
> static inline void xpfo_init_single_page(struct page *page) { }
> @@ -81,6 +83,8 @@ static inline bool xpfo_enabled(void) { return false; }
>
> static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
>
> +static inline bool xpfo_spurious_fault(unsigned long addr) { return false; }
> +
> #endif /* CONFIG_XPFO */
>
> #endif /* _LINUX_XPFO_H */
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index dbf20efb0499..85079377c91d 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -119,6 +119,16 @@ void xpfo_free_pages(struct page *page, int order)
> }
> }
>
> +static void xpfo_do_map(void *kaddr, struct page *page)
> +{
> + spin_lock(&page->xpfo_lock);
> + if (PageXpfoUnmapped(page)) {
> + set_kpte(kaddr, page, PAGE_KERNEL);
> + ClearPageXpfoUnmapped(page);
> + }
> + spin_unlock(&page->xpfo_lock);
> +}
> +
> void xpfo_kmap(void *kaddr, struct page *page)
> {
> if (!static_branch_unlikely(&xpfo_inited))
> @@ -127,17 +137,12 @@ void xpfo_kmap(void *kaddr, struct page *page)
> if (!PageXpfoUser(page))
> return;
>
> - spin_lock(&page->xpfo_lock);
> -
> /*
> * The page was previously allocated to user space, so map it back
> * into the kernel. No TLB flush required.
> */
> - if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> - TestClearPageXpfoUnmapped(page))
> - set_kpte(kaddr, page, PAGE_KERNEL);
> -
> - spin_unlock(&page->xpfo_lock);
> + if (atomic_inc_return(&page->xpfo_mapcount) == 1)
> + xpfo_do_map(kaddr, page);
> }
> EXPORT_SYMBOL(xpfo_kmap);
>
> @@ -204,3 +209,34 @@ void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
> kunmap_atomic(mapping[i]);
> }
> EXPORT_SYMBOL(xpfo_temp_unmap);
> +
> +bool xpfo_spurious_fault(unsigned long addr)
> +{
> + struct page *page;
> + bool spurious;
> + int mapcount;
> +
> + if (!static_branch_unlikely(&xpfo_inited))
> + return false;
> +
> + /* XXX Is this sufficient to guard against calling virt_to_page() on a
> + * virtual address that has no corresponding struct page? */
> + if (!virt_addr_valid(addr))
> + return false;
> +
> + page = virt_to_page(addr);
> + mapcount = atomic_read(&page->xpfo_mapcount);
> + spurious = PageXpfoUser(page) && mapcount;
> +
> + /* Guarantee forward progress in case xpfo_kmap() raced. */
> + if (spurious && PageXpfoUnmapped(page)) {
> + xpfo_do_map((void *)(addr & PAGE_MASK), page);
> + }
> +
> + if (unlikely(!spurious))
> + printk("XPFO non-spurious fault %lx user=%d unmapped=%d mapcount=%d\n",
> + addr, PageXpfoUser(page), PageXpfoUnmapped(page),
> + mapcount);
> +
> + return spurious;
> +}
>
Khalid Aziz <[email protected]> writes:
> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO.
Awesome!
> A rogue process can launch a ret2dir attack only from a CPU that has
> dual mapping for its pages in physmap in its TLB. We can hence defer
> TLB flush on a CPU until a process that would have caused a TLB flush
> is scheduled on that CPU.
Assuming the attacker already has the ability to execute arbitrary code
in userspace, they can just create a second process and thus avoid the
TLB flush. Am I getting this wrong?
Julian
Khalid Aziz <[email protected]> writes:
> From: Julian Stecklina <[email protected]>
>
> Instead of using the page extension debug feature, encode all
> information, we need for XPFO in struct page. This allows to get rid of
> some checks in the hot paths and there are also no pages anymore that
> are allocated before XPFO is enabled.
I have another patch lying around that turns the XPFO per-page locks
into bit spinlocks and thus gets the size of struct page back to
<= 64 bytes. In case someone's interested, ping me.
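
The idea is simply to use a lock bit in page->flags through the bit
spinlock API instead of a full spinlock_t; a minimal sketch, with
PG_xpfo_maplock as a placeholder for whichever free page flag bit gets
used:

#include <linux/bit_spinlock.h>
#include <linux/page-flags.h>

static inline void xpfo_page_lock(struct page *page)
{
        bit_spin_lock(PG_xpfo_maplock, &page->flags);
}

static inline void xpfo_page_unlock(struct page *page)
{
        bit_spin_unlock(PG_xpfo_maplock, &page->flags);
}

xpfo_kmap()/xpfo_kunmap() then take xpfo_page_lock(page) instead of
spin_lock(&page->xpfo_lock), and the spinlock_t disappears from
struct page.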
Julian
On 1/16/19 5:18 PM, Laura Abbott wrote:
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> From: Julian Stecklina <[email protected]>
>>
>> We can reduce spin lock usage in xpfo_kmap to the 0->1 transition of
>> the mapcount. This means that xpfo_kmap() can now race and that we
>> get spurious page faults.
>>
>> The page fault handler helps the system make forward progress by
>> fixing the page table instead of allowing repeated page faults until
>> the right xpfo_kmap went through.
>>
>> Model-checked with up to 4 concurrent callers with Spin.
>>
>
> This needs the spurious check for arm64 as well. This at
> least gets me booting but could probably use more review:
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 7d9571f4ae3d..8f425848cbb9 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -32,6 +32,7 @@
> #include <linux/perf_event.h>
> #include <linux/preempt.h>
> #include <linux/hugetlb.h>
> +#include <linux/xpfo.h>
>
> #include <asm/bug.h>
> #include <asm/cmpxchg.h>
> @@ -289,6 +290,9 @@ static void __do_kernel_fault(unsigned long addr,
> unsigned int esr,
> if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
> return;
>
> + if (xpfo_spurious_fault(addr))
> + return;
> +
> if (is_el1_permission_fault(addr, esr, regs)) {
> if (esr & ESR_ELx_WNR)
> msg = "write to read-only memory";
>
>
That makes sense. Thanks for debugging this. I will add this to patch 14
("EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap").
Thanks,
Khalid
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO. After the last round of updates, we were seeing very
> significant performance penalties when stale TLB entries were flushed
> actively after an XPFO TLB update. Benchmark for measuring performance
> is kernel build using parallel make. To get full protection from
> ret2dir attackes, we must flush stale TLB entries. Performance
> penalty from flushing stale TLB entries goes up as the number of
> cores goes up. On a desktop class machine with only 4 cores,
> enabling TLB flush for stale entries causes system time for "make
> -j4" to go up by a factor of 2.614x but on a larger machine with 96
> cores, system time with "make -j60" goes up by a factor of 26.366x!
> I have been working on reducing this performance penalty.
>
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
>
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
>
> 4.20 915.183s
> 4.20+XPFO 24129.354s 26.366x
> 4.20+XPFO+Deferred flush 1216.987s 1.330xx
>
>
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
>
> 4.20 607.671s
> 4.20+XPFO 1588.646s 2.614x
> 4.20+XPFO+Deferred flush 794.473s 1.307xx
>
> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implemented
> delayed limited TLB flush on other CPUs.
>
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.
>
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
>
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").
>
> What remains to be done beyond this patch series:
>
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
> from Juerg. I dropped it for now since swiotlb code for ARM has
> changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
> CPUs" to other architectures besides x86.
>
>
> ---------------------------------------------------------
>
> Juerg Haefliger (5):
> mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
> swiotlb: Map the buffer if it was unmapped by XPFO
> arm64/mm: Add support for XPFO
> arm64/mm, xpfo: temporarily map dcache regions
> lkdtm: Add test for XPFO
>
> Julian Stecklina (4):
> mm, x86: omit TLB flushing by default for XPFO page table
> modifications
> xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
> xpfo, mm: optimize spinlock usage in xpfo_kunmap
> EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap
>
> Khalid Aziz (2):
> xpfo, mm: Fix hang when booting with "xpfotlbflush"
> xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
>
> Tycho Andersen (5):
> mm: add MAP_HUGETLB support to vm_mmap
> x86: always set IF before oopsing from page fault
> xpfo: add primitives for mapping underlying memory
> arm64/mm: disable section/contiguous mappings if XPFO is enabled
> mm: add a user_virt_to_phys symbol
>
> .../admin-guide/kernel-parameters.txt | 2 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 +
> arch/arm64/mm/flush.c | 7 +
> arch/arm64/mm/mmu.c | 2 +-
> arch/arm64/mm/xpfo.c | 58 ++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 26 ++
> arch/x86/include/asm/tlbflush.h | 1 +
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/fault.c | 10 +
> arch/x86/mm/pageattr.c | 23 +-
> arch/x86/mm/tlb.c | 27 ++
> arch/x86/mm/xpfo.c | 171 ++++++++++++
> drivers/misc/lkdtm/Makefile | 1 +
> drivers/misc/lkdtm/core.c | 3 +
> drivers/misc/lkdtm/lkdtm.h | 5 +
> drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++
> include/linux/highmem.h | 15 +-
> include/linux/mm.h | 2 +
> include/linux/mm_types.h | 8 +
> include/linux/page-flags.h | 13 +
> include/linux/sched.h | 9 +
> include/linux/xpfo.h | 90 +++++++
> include/trace/events/mmflags.h | 10 +-
> kernel/dma/swiotlb.c | 3 +-
> mm/Makefile | 1 +
> mm/mmap.c | 19 +-
> mm/page_alloc.c | 3 +
> mm/util.c | 32 +++
> mm/xpfo.c | 247 ++++++++++++++++++
> security/Kconfig | 29 ++
> 32 files changed, 974 insertions(+), 43 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm/xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
>
Also gave this a boot on my X1 Carbon and I got a lockdep splat:
[ 16.863110] ================================
[ 16.863119] WARNING: inconsistent lock state
[ 16.863128] 4.20.0-xpfo+ #6 Not tainted
[ 16.863136] --------------------------------
[ 16.863145] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ 16.863157] swapper/5/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
[ 16.863168] 00000000301e129a (&(&page->xpfo_lock)->rlock){?.+.}, at: xpfo_do_map+0x1b/0x90
[ 16.863188] {HARDIRQ-ON-W} state was registered at:
[ 16.863200] _raw_spin_lock+0x30/0x70
[ 16.863208] xpfo_do_map+0x1b/0x90
[ 16.863217] simple_write_begin+0xc7/0x240
[ 16.863227] generic_perform_write+0xf7/0x1c0
[ 16.863237] __generic_file_write_iter+0xfa/0x1c0
[ 16.863247] generic_file_write_iter+0xab/0x150
[ 16.863257] __vfs_write+0x139/0x1a0
[ 16.863264] vfs_write+0xba/0x1c0
[ 16.863272] ksys_write+0x52/0xc0
[ 16.863281] xwrite+0x29/0x5a
[ 16.863288] do_copy+0x2b/0xc8
[ 16.863296] write_buffer+0x2a/0x3a
[ 16.863304] unpack_to_rootfs+0x107/0x2c8
[ 16.863312] populate_rootfs+0x5d/0x10a
[ 16.863322] do_one_initcall+0x5d/0x2be
[ 16.863541] kernel_init_freeable+0x21b/0x2c9
[ 16.863764] kernel_init+0xa/0x109
[ 16.863988] ret_from_fork+0x3a/0x50
[ 16.864220] irq event stamp: 337503
[ 16.864456] hardirqs last enabled at (337502): [<ffffffff8ce000a7>] __do_softirq+0xa7/0x47c
[ 16.864715] hardirqs last disabled at (337503): [<ffffffff8c0037e8>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 16.864985] softirqs last enabled at (337500): [<ffffffff8c0c6d88>] irq_enter+0x68/0x70
[ 16.865263] softirqs last disabled at (337501): [<ffffffff8c0c6ea9>] irq_exit+0x119/0x120
[ 16.865546]
other info that might help us debug this:
[ 16.866128] Possible unsafe locking scenario:
[ 16.866733] CPU0
[ 16.867039] ----
[ 16.867370] lock(&(&page->xpfo_lock)->rlock);
[ 16.867693] <Interrupt>
[ 16.868019] lock(&(&page->xpfo_lock)->rlock);
[ 16.868354]
*** DEADLOCK ***
[ 16.869373] 1 lock held by swapper/5/0:
[ 16.869727] #0: 00000000800b2c51 (&(&ctx->completion_lock)->rlock){-.-.}, at: aio_complete+0x3c/0x460
[ 16.870106]
stack backtrace:
[ 16.870868] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-xpfo+ #6
[ 16.871270] Hardware name: LENOVO 20KGS23S00/20KGS23S00, BIOS N23ET40W (1.15 ) 04/13/2018
[ 16.871686] Call Trace:
[ 16.872106] <IRQ>
[ 16.872531] dump_stack+0x85/0xc0
[ 16.872962] print_usage_bug.cold.60+0x1a8/0x1e2
[ 16.873407] ? print_shortest_lock_dependencies+0x40/0x40
[ 16.873856] mark_lock+0x502/0x600
[ 16.874308] ? check_usage_backwards+0x120/0x120
[ 16.874769] __lock_acquire+0x6e2/0x1650
[ 16.875236] ? find_held_lock+0x34/0xa0
[ 16.875710] ? sched_clock_cpu+0xc/0xb0
[ 16.876185] lock_acquire+0x9e/0x180
[ 16.876668] ? xpfo_do_map+0x1b/0x90
[ 16.877154] _raw_spin_lock+0x30/0x70
[ 16.877649] ? xpfo_do_map+0x1b/0x90
[ 16.878144] xpfo_do_map+0x1b/0x90
[ 16.878647] aio_complete+0xb2/0x460
[ 16.879154] blkdev_bio_end_io+0x71/0x150
[ 16.879665] blk_update_request+0xd7/0x2e0
[ 16.880170] blk_mq_end_request+0x1a/0x100
[ 16.880669] blk_mq_complete_request+0x98/0x120
[ 16.881175] nvme_irq+0x192/0x210 [nvme]
[ 16.881675] __handle_irq_event_percpu+0x46/0x2a0
[ 16.882174] handle_irq_event_percpu+0x30/0x80
[ 16.882670] handle_irq_event+0x34/0x51
[ 16.883252] handle_edge_irq+0x7b/0x190
[ 16.883772] handle_irq+0xbf/0x100
[ 16.883774] do_IRQ+0x5f/0x120
[ 16.883776] common_interrupt+0xf/0xf
[ 16.885469] RIP: 0010:__do_softirq+0xae/0x47c
[ 16.885470] Code: 0c 00 00 01 c7 44 24 24 0a 00 00 00 44 89 7c 24 04 48 c7 c0 c0 1e 1e 00 65 66 c7 00 00 00 e8 69 3d 3e ff fb 66 0f 1f 44 00 00 <48> c7 44 24 08 80 51 60 8d b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
[ 16.885471] RSP: 0018:ffff8bde5e003f68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdd
[ 16.887291] RAX: ffff8bde5b303740 RBX: ffff8bde5b303740 RCX: 0000000000000000
[ 16.887291] RDX: ffff8bde5b303740 RSI: 0000000000000000 RDI: ffff8bde5b303740
[ 16.887292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 16.887293] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 16.887294] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000202
[ 16.887296] ? common_interrupt+0xa/0xf
[ 16.890885] ? __do_softirq+0xa7/0x47c
[ 16.890887] ? hrtimer_interrupt+0x12e/0x220
[ 16.890889] irq_exit+0x119/0x120
[ 16.890920] smp_apic_timer_interrupt+0xa2/0x230
[ 16.890921] apic_timer_interrupt+0xf/0x20
[ 16.890922] </IRQ>
[ 16.890955] RIP: 0010:cpuidle_enter_state+0xbe/0x350
[ 16.890956] Code: 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 6d 02 00 00 31 ff e8 8e 61 91 ff e8 19 77 98 ff fb 66 0f 1f 44 00 00 <85> ed 0f 88 36 02 00 00 48 b8 ff ff ff ff f3 01 00 00 48 2b 1c 24
[ 16.890957] RSP: 0018:ffffa91a41997ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[ 16.891025] RAX: ffff8bde5b303740 RBX: 00000003ed1dca4d RCX: 0000000000000000
[ 16.891026] RDX: ffff8bde5b303740 RSI: 0000000000000001 RDI: ffff8bde5b303740
[ 16.891027] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[ 16.891028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8d7f8898
[ 16.891028] R13: ffffc91a3f800a00 R14: 0000000000000004 R15: 0000000000000000
[ 16.891032] do_idle+0x23e/0x280
[ 16.891119] cpu_startup_entry+0x19/0x20
[ 16.891122] start_secondary+0x1b3/0x200
[ 16.891124] secondary_startup_64+0xa4/0xb0
This was 4.20 + this series. The config was based on what Fedora has.
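
From the splat, page->xpfo_lock is taken both in hard interrupt context
(the aio completion path calling into xpfo_do_map()) and with
interrupts enabled (the write path). If that is indeed the problem,
switching the XPFO map/unmap helpers to the irq-saving lock variants
would be the obvious candidate fix; a sketch against the xpfo_do_map()
from the xpfo_kmap spinlock patch:

static void xpfo_do_map(void *kaddr, struct page *page)
{
        unsigned long flags;

        spin_lock_irqsave(&page->xpfo_lock, flags);
        if (PageXpfoUnmapped(page)) {
                set_kpte(kaddr, page, PAGE_KERNEL);
                ClearPageXpfoUnmapped(page);
        }
        spin_unlock_irqrestore(&page->xpfo_lock, flags);
}

(and the same treatment for whatever else takes xpfo_lock.)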
Thanks,
Laura
On Thu, Jan 10, 2019 at 02:09:36PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <[email protected]>
>
> v6: * guard against lookup_xpfo() returning NULL
>
> CC: Konrad Rzeszutek Wilk <[email protected]>
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> include/linux/xpfo.h | 4 ++++
> kernel/dma/swiotlb.c | 3 ++-
> mm/xpfo.c | 15 +++++++++++++++
> 3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index a39259ce0174..e38b823f44e3 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -35,6 +35,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
> void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
> void xpfo_free_pages(struct page *page, int order);
>
> +bool xpfo_page_is_unmapped(struct page *page);
> +
> #else /* !CONFIG_XPFO */
>
> static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> @@ -42,6 +44,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
> static inline void xpfo_free_pages(struct page *page, int order) { }
>
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> #endif /* CONFIG_XPFO */
>
> #endif /* _LINUX_XPFO_H */
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 045930e32c0e..820a54b57491 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> {
> unsigned long pfn = PFN_DOWN(orig_addr);
> unsigned char *vaddr = phys_to_virt(tlb_addr);
> + struct page *page = pfn_to_page(pfn);
>
> - if (PageHighMem(pfn_to_page(pfn))) {
> + if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> /* The buffer does not have a mapping. Map it in and copy */
> unsigned int offset = orig_addr & ~PAGE_MASK;
> char *buffer;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> index bff24afcaa2e..cdbcbac582d5 100644
> --- a/mm/xpfo.c
> +++ b/mm/xpfo.c
> @@ -220,3 +220,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
> spin_unlock(&xpfo->maplock);
> }
> EXPORT_SYMBOL(xpfo_kunmap);
> +
> +bool xpfo_page_is_unmapped(struct page *page)
> +{
> + struct xpfo *xpfo;
> +
> + if (!static_branch_unlikely(&xpfo_inited))
> + return false;
> +
> + xpfo = lookup_xpfo(page);
> + if (unlikely(!xpfo) && !xpfo->inited)
> + return false;
> +
> + return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> --
> 2.17.1
>
On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <[email protected]>
>
> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> provide a hook for updating a single kernel page table entry (which is
> required by the generic XPFO code).
>
> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 ++
> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 61 insertions(+)
> create mode 100644 arch/arm64/mm/xpfo.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index ea2ab0330e3a..f0a9c0007d23 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -171,6 +171,7 @@ config ARM64
> select SWIOTLB
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_XPFO
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 849c1df3d214..cca3808d9776 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>
> obj-$(CONFIG_KASAN) += kasan_init.o
> KASAN_SANITIZE_kasan_init.o := n
> +
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> new file mode 100644
> index 000000000000..678e2be848eb
> --- /dev/null
> +++ b/arch/arm64/mm/xpfo.c
> @@ -0,0 +1,58 @@
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + * Juerg Haefliger <[email protected]>
> + * Vasileios P. Kemerlis <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/tlbflush.h>
> +
> +/*
> + * Lookup the page table entry for a virtual address and return a pointer to
> + * the entry. Based on x86 tree.
> + */
> +static pte_t *lookup_address(unsigned long addr)
The x86 also has level. Would it make sense to include that in here?
On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <[email protected]>
>
> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
> provide a hook for updating a single kernel page table entry (which is
> required by the generic XPFO code).
>
> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 ++
> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 61 insertions(+)
> create mode 100644 arch/arm64/mm/xpfo.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index ea2ab0330e3a..f0a9c0007d23 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -171,6 +171,7 @@ config ARM64
> select SWIOTLB
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_XPFO
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 849c1df3d214..cca3808d9776 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>
> obj-$(CONFIG_KASAN) += kasan_init.o
> KASAN_SANITIZE_kasan_init.o := n
> +
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
> new file mode 100644
> index 000000000000..678e2be848eb
> --- /dev/null
> +++ b/arch/arm64/mm/xpfo.c
> @@ -0,0 +1,58 @@
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + * Juerg Haefliger <[email protected]>
> + * Vasileios P. Kemerlis <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/tlbflush.h>
> +
> +/*
> + * Lookup the page table entry for a virtual address and return a pointer to
> + * the entry. Based on x86 tree.
> + */
> +static pte_t *lookup_address(unsigned long addr)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> +
> + pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd))
> + return NULL;
> +
> + pud = pud_offset(pgd, addr);
> + if (pud_none(*pud))
> + return NULL;
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd))
> + return NULL;
> +
> + return pte_offset_kernel(pmd, addr);
> +}
> +
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> + pte_t *pte = lookup_address((unsigned long)kaddr);
> +
> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
Thought on the other hand.. what if the page is PMD? Do you really want
to do this?
What if 'pte' is NULL?
> +}
> +
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> + unsigned long kaddr = (unsigned long)page_address(page);
> + unsigned long size = PAGE_SIZE;
> +
> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
Ditto here. You are assuming it is PTE, but it may be PMD or such.
Or worse - the lookup_address could be NULL.
> +}
> --
> 2.17.1
>
On Thu, Jan 10, 2019 at 02:09:39PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger <[email protected]>
>
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault, so let's temporarily map the region, flush the cache, and then
> unmap it.
>
> v6: actually flush in the face of xpfo, and temporarily map the underlying
> memory so it can be flushed correctly
>
> CC: [email protected]
> Signed-off-by: Juerg Haefliger <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> arch/arm64/mm/flush.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 30695a868107..f12f26b60319 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -20,6 +20,7 @@
> #include <linux/export.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> +#include <linux/xpfo.h>
>
> #include <asm/cacheflush.h>
> #include <asm/cache.h>
> @@ -28,9 +29,15 @@
> void sync_icache_aliases(void *kaddr, unsigned long len)
> {
> unsigned long addr = (unsigned long)kaddr;
> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
Is it possible that 'len' is more than 32 pages? Or say thousands of
pages? That would blow away your stack. (One possible alternative to
the on-stack array is sketched after the quoted patch.)
> + void *mapping[num_pages];
>
> if (icache_is_aliasing()) {
> + xpfo_temp_map(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
> __clean_dcache_area_pou(kaddr, len);
> + xpfo_temp_unmap(kaddr, len, mapping,
> + sizeof(mapping[0]) * num_pages);
> __flush_icache_all();
> } else {
> flush_icache_range(addr, addr + len);
> --
> 2.17.1
>
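To be concrete, I was thinking of something along these lines instead
of the on-stack array (the error handling here is only a sketch):

void sync_icache_aliases(void *kaddr, unsigned long len)
{
        unsigned long addr = (unsigned long)kaddr;
        unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
        void **mapping;

        if (!icache_is_aliasing()) {
                flush_icache_range(addr, addr + len);
                return;
        }

        /* needs <linux/slab.h>; GFP_ATOMIC as callers may not sleep */
        mapping = kmalloc_array(num_pages, sizeof(*mapping), GFP_ATOMIC);
        if (!mapping)
                return;         /* sketch only - needs a real fallback */

        xpfo_temp_map(kaddr, len, mapping, sizeof(mapping[0]) * num_pages);
        __clean_dcache_area_pou(kaddr, len);
        xpfo_temp_unmap(kaddr, len, mapping,
                        sizeof(mapping[0]) * num_pages);
        __flush_icache_all();

        kfree(mapping);
}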
> +EXPORT_SYMBOL(user_virt_to_phys);
Could it be _GPL? Otherwise looks OK to me.
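That is, simply:

EXPORT_SYMBOL_GPL(user_virt_to_phys);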
On 1/23/19 7:20 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger <[email protected]>
>>
>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>> provide a hook for updating a single kernel page table entry (which is
>> required by the generic XPFO code).
>>
>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>
>> CC: [email protected]
>> Signed-off-by: Juerg Haefliger <[email protected]>
>> Signed-off-by: Tycho Andersen <[email protected]>
>> Signed-off-by: Khalid Aziz <[email protected]>
>> ---
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/mm/Makefile | 2 ++
>> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 61 insertions(+)
>> create mode 100644 arch/arm64/mm/xpfo.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index ea2ab0330e3a..f0a9c0007d23 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -171,6 +171,7 @@ config ARM64
>> select SWIOTLB
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> + select ARCH_SUPPORTS_XPFO
>> help
>> ARM 64-bit (AArch64) Linux support.
>>
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index 849c1df3d214..cca3808d9776 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>>
>> obj-$(CONFIG_KASAN) += kasan_init.o
>> KASAN_SANITIZE_kasan_init.o := n
>> +
>> +obj-$(CONFIG_XPFO) += xpfo.o
>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>> new file mode 100644
>> index 000000000000..678e2be848eb
>> --- /dev/null
>> +++ b/arch/arm64/mm/xpfo.c
>> @@ -0,0 +1,58 @@
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + * Juerg Haefliger <[email protected]>
>> + * Vasileios P. Kemerlis <[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +
>> +#include <asm/tlbflush.h>
>> +
>> +/*
>> + * Lookup the page table entry for a virtual address and return a pointer to
>> + * the entry. Based on x86 tree.
>> + */
>> +static pte_t *lookup_address(unsigned long addr)
>
> The x86 also has level. Would it make sense to include that in here?
>
Possibly. ARM64 does not define page levels (as in the enum for page
levels) at this time, but that can be added easily. Adding level to
lookup_address() for arm64 would make it uniform with x86, but is there
any other rationale besides that? Do you see a future use for this
information? The only other architecture I could find that defines
lookup_address() is sh, but it uses it only for trapped io.
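
If we did add it, I imagine the arm64 lookup_address() would end up
looking roughly like this; the level enum below is made up for
illustration since arm64 does not define page levels today:

enum xpfo_pg_level {
        XPFO_PG_LEVEL_NONE,
        XPFO_PG_LEVEL_4K,
        XPFO_PG_LEVEL_2M,
        XPFO_PG_LEVEL_1G,
};

static pte_t *lookup_address_level(unsigned long addr,
                                   enum xpfo_pg_level *level)
{
        pgd_t *pgd = pgd_offset_k(addr);
        pud_t *pud;
        pmd_t *pmd;

        *level = XPFO_PG_LEVEL_NONE;
        if (pgd_none(*pgd))
                return NULL;

        pud = pud_offset(pgd, addr);
        if (pud_none(*pud))
                return NULL;
        if (pud_sect(*pud)) {           /* 1G section mapping */
                *level = XPFO_PG_LEVEL_1G;
                return (pte_t *)pud;
        }

        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return NULL;
        if (pmd_sect(*pmd)) {           /* 2M section mapping */
                *level = XPFO_PG_LEVEL_2M;
                return (pte_t *)pmd;
        }

        *level = XPFO_PG_LEVEL_4K;
        return pte_offset_kernel(pmd, addr);
}

set_kpte() could then warn and bail out on anything that is not a 4K
mapping instead of blindly writing a pte.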
Thanks,
Khalid
On 1/23/19 7:24 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger <[email protected]>
>>
>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>> provide a hook for updating a single kernel page table entry (which is
>> required by the generic XPFO code).
>>
>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>
>> CC: [email protected]
>> Signed-off-by: Juerg Haefliger <[email protected]>
>> Signed-off-by: Tycho Andersen <[email protected]>
>> Signed-off-by: Khalid Aziz <[email protected]>
>> ---
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/mm/Makefile | 2 ++
>> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 61 insertions(+)
>> create mode 100644 arch/arm64/mm/xpfo.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index ea2ab0330e3a..f0a9c0007d23 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -171,6 +171,7 @@ config ARM64
>> select SWIOTLB
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> + select ARCH_SUPPORTS_XPFO
>> help
>> ARM 64-bit (AArch64) Linux support.
>>
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index 849c1df3d214..cca3808d9776 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>>
>> obj-$(CONFIG_KASAN) += kasan_init.o
>> KASAN_SANITIZE_kasan_init.o := n
>> +
>> +obj-$(CONFIG_XPFO) += xpfo.o
>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>> new file mode 100644
>> index 000000000000..678e2be848eb
>> --- /dev/null
>> +++ b/arch/arm64/mm/xpfo.c
>> @@ -0,0 +1,58 @@
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + * Juerg Haefliger <[email protected]>
>> + * Vasileios P. Kemerlis <[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +
>> +#include <asm/tlbflush.h>
>> +
>> +/*
>> + * Lookup the page table entry for a virtual address and return a pointer to
>> + * the entry. Based on x86 tree.
>> + */
>> +static pte_t *lookup_address(unsigned long addr)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> +
>> + pgd = pgd_offset_k(addr);
>> + if (pgd_none(*pgd))
>> + return NULL;
>> +
>> + pud = pud_offset(pgd, addr);
>> + if (pud_none(*pud))
>> + return NULL;
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (pmd_none(*pmd))
>> + return NULL;
>> +
>> + return pte_offset_kernel(pmd, addr);
>> +}
>> +
>> +/* Update a single kernel page table entry */
>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>> +{
>> + pte_t *pte = lookup_address((unsigned long)kaddr);
>> +
>> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
>
> Thought on the other hand.. what if the page is PMD? Do you really want
> to do this?
>
> What if 'pte' is NULL?
>> +}
>> +
>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>> +{
>> + unsigned long kaddr = (unsigned long)page_address(page);
>> + unsigned long size = PAGE_SIZE;
>> +
>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>
> Ditto here. You are assuming it is PTE, but it may be PMD or such.
> Or worse - the lookup_address could be NULL.
>
>> +}
>> --
>> 2.17.1
>>
Hi Konrad,
This makes sense. The x86 version of set_kpte() checks the pte for NULL
and also checks if the page is a PMD. Now what you said about adding
level to lookup_address() for arm64 makes more sense.
Can someone with knowledge of arm64 mmu make recommendations here?
Thanks,
Khalid
On 2/12/19 7:52 AM, Khalid Aziz wrote:
> On 1/23/19 7:24 AM, Konrad Rzeszutek Wilk wrote:
>> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>>> From: Juerg Haefliger <[email protected]>
>>>
>>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>>> provide a hook for updating a single kernel page table entry (which is
>>> required by the generic XPFO code).
>>>
>>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>>
>>> CC: [email protected]
>>> Signed-off-by: Juerg Haefliger <[email protected]>
>>> Signed-off-by: Tycho Andersen <[email protected]>
>>> Signed-off-by: Khalid Aziz <[email protected]>
>>> ---
>>> arch/arm64/Kconfig | 1 +
>>> arch/arm64/mm/Makefile | 2 ++
>>> arch/arm64/mm/xpfo.c | 58 ++++++++++++++++++++++++++++++++++++++++++
>>> 3 files changed, 61 insertions(+)
>>> create mode 100644 arch/arm64/mm/xpfo.c
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index ea2ab0330e3a..f0a9c0007d23 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -171,6 +171,7 @@ config ARM64
>>> select SWIOTLB
>>> select SYSCTL_EXCEPTION_TRACE
>>> select THREAD_INFO_IN_TASK
>>> + select ARCH_SUPPORTS_XPFO
>>> help
>>> ARM 64-bit (AArch64) Linux support.
>>>
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index 849c1df3d214..cca3808d9776 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>>>
>>> obj-$(CONFIG_KASAN) += kasan_init.o
>>> KASAN_SANITIZE_kasan_init.o := n
>>> +
>>> +obj-$(CONFIG_XPFO) += xpfo.o
>>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>>> new file mode 100644
>>> index 000000000000..678e2be848eb
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/xpfo.c
>>> @@ -0,0 +1,58 @@
>>> +/*
>>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>>> + * Copyright (C) 2016 Brown University. All rights reserved.
>>> + *
>>> + * Authors:
>>> + * Juerg Haefliger <[email protected]>
>>> + * Vasileios P. Kemerlis <[email protected]>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify it
>>> + * under the terms of the GNU General Public License version 2 as published by
>>> + * the Free Software Foundation.
>>> + */
>>> +
>>> +#include <linux/mm.h>
>>> +#include <linux/module.h>
>>> +
>>> +#include <asm/tlbflush.h>
>>> +
>>> +/*
>>> + * Lookup the page table entry for a virtual address and return a pointer to
>>> + * the entry. Based on x86 tree.
>>> + */
>>> +static pte_t *lookup_address(unsigned long addr)
>>> +{
>>> + pgd_t *pgd;
>>> + pud_t *pud;
>>> + pmd_t *pmd;
>>> +
>>> + pgd = pgd_offset_k(addr);
>>> + if (pgd_none(*pgd))
>>> + return NULL;
>>> +
>>> + pud = pud_offset(pgd, addr);
>>> + if (pud_none(*pud))
>>> + return NULL;
>>> +
>>> + pmd = pmd_offset(pud, addr);
>>> + if (pmd_none(*pmd))
>>> + return NULL;
>>> +
>>> + return pte_offset_kernel(pmd, addr);
>>> +}
>>> +
>>> +/* Update a single kernel page table entry */
>>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>>> +{
>>> + pte_t *pte = lookup_address((unsigned long)kaddr);
>>> +
>>> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
>>
>> Thought on the other hand.. what if the page is PMD? Do you really want
>> to do this?
>>
>> What if 'pte' is NULL?
>>> +}
>>> +
>>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>>> +{
>>> + unsigned long kaddr = (unsigned long)page_address(page);
>>> + unsigned long size = PAGE_SIZE;
>>> +
>>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>>
>> Ditto here. You are assuming it is PTE, but it may be PMD or such.
>> Or worse - the lookup_address could be NULL.
>>
>>> +}
>>> --
>>> 2.17.1
>>>
>
> Hi Konrad,
>
> This makes sense. x86 version of set_kpte() checks pte for NULL and also
> checks if the page is PMD. Now what you said about adding level to
> lookup_address() for arm makes more sense.
>
> Can someone with knowledge of arm64 mmu make recommendations here?
>
> Thanks,
> Khalid
>
arm64 can't split larger pages and requires that everything be
mapped as pages (see [RFC PATCH v7 08/16] arm64/mm: disable
section/contiguous mappings if XPFO is enabled). Any
situation where we would get something other than a pte
would be a bug.
Thanks,
Laura
On 2/12/19 1:01 PM, Laura Abbott wrote:
> On 2/12/19 7:52 AM, Khalid Aziz wrote:
>> On 1/23/19 7:24 AM, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>>>> From: Juerg Haefliger <[email protected]>
>>>>
>>>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>>>> provide a hook for updating a single kernel page table entry (which is
>>>> required by the generic XPFO code).
>>>>
>>>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>>>
>>>> CC: [email protected]
>>>> Signed-off-by: Juerg Haefliger <[email protected]>
>>>> Signed-off-by: Tycho Andersen <[email protected]>
>>>> Signed-off-by: Khalid Aziz <[email protected]>
>>>> ---
>>>> arch/arm64/Kconfig | 1 +
>>>> arch/arm64/mm/Makefile | 2 ++
>>>> arch/arm64/mm/xpfo.c | 58
>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>> 3 files changed, 61 insertions(+)
>>>> create mode 100644 arch/arm64/mm/xpfo.c
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index ea2ab0330e3a..f0a9c0007d23 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -171,6 +171,7 @@ config ARM64
>>>> select SWIOTLB
>>>> select SYSCTL_EXCEPTION_TRACE
>>>> select THREAD_INFO_IN_TASK
>>>> + select ARCH_SUPPORTS_XPFO
>>>> help
>>>> ARM 64-bit (AArch64) Linux support.
>>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>>> index 849c1df3d214..cca3808d9776 100644
>>>> --- a/arch/arm64/mm/Makefile
>>>> +++ b/arch/arm64/mm/Makefile
>>>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
>>>> obj-$(CONFIG_KASAN) += kasan_init.o
>>>> KASAN_SANITIZE_kasan_init.o := n
>>>> +
>>>> +obj-$(CONFIG_XPFO) += xpfo.o
>>>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>>>> new file mode 100644
>>>> index 000000000000..678e2be848eb
>>>> --- /dev/null
>>>> +++ b/arch/arm64/mm/xpfo.c
>>>> @@ -0,0 +1,58 @@
>>>> +/*
>>>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>>>> + * Copyright (C) 2016 Brown University. All rights reserved.
>>>> + *
>>>> + * Authors:
>>>> + * Juerg Haefliger <[email protected]>
>>>> + * Vasileios P. Kemerlis <[email protected]>
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or
>>>> modify it
>>>> + * under the terms of the GNU General Public License version 2 as
>>>> published by
>>>> + * the Free Software Foundation.
>>>> + */
>>>> +
>>>> +#include <linux/mm.h>
>>>> +#include <linux/module.h>
>>>> +
>>>> +#include <asm/tlbflush.h>
>>>> +
>>>> +/*
>>>> + * Lookup the page table entry for a virtual address and return a
>>>> pointer to
>>>> + * the entry. Based on x86 tree.
>>>> + */
>>>> +static pte_t *lookup_address(unsigned long addr)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + pud_t *pud;
>>>> + pmd_t *pmd;
>>>> +
>>>> + pgd = pgd_offset_k(addr);
>>>> + if (pgd_none(*pgd))
>>>> + return NULL;
>>>> +
>>>> + pud = pud_offset(pgd, addr);
>>>> + if (pud_none(*pud))
>>>> + return NULL;
>>>> +
>>>> + pmd = pmd_offset(pud, addr);
>>>> + if (pmd_none(*pmd))
>>>> + return NULL;
>>>> +
>>>> + return pte_offset_kernel(pmd, addr);
>>>> +}
>>>> +
>>>> +/* Update a single kernel page table entry */
>>>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>>>> +{
>>>> + pte_t *pte = lookup_address((unsigned long)kaddr);
>>>> +
>>>> + set_pte(pte, pfn_pte(page_to_pfn(page), prot));
>>>
>>> Thought on the other hand.. what if the page is PMD? Do you really want
>>> to do this?
>>>
>>> What if 'pte' is NULL?
>>>> +}
>>>> +
>>>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>>>> +{
>>>> + unsigned long kaddr = (unsigned long)page_address(page);
>>>> + unsigned long size = PAGE_SIZE;
>>>> +
>>>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>>>
>>> Ditto here. You are assuming it is PTE, but it may be PMD or such.
>>> Or worse - the lookup_address could be NULL.
>>>
>>>> +}
>>>> --
>>>> 2.17.1
>>>>
>>
>> Hi Konrad,
>>
>> This makes sense. x86 version of set_kpte() checks pte for NULL and also
>> checks if the page is PMD. Now what you said about adding level to
>> lookup_address() for arm makes more sense.
>>
>> Can someone with knowledge of arm64 mmu make recommendations here?
>>
>> Thanks,
>> Khalid
>>
>
> arm64 can't split larger pages and requires that everything be
> mapped as pages (see [RFC PATCH v7 08/16] arm64/mm: disable
> section/contiguous mappings if XPFO is enabled). Any
> situation where we would get something other than a pte
> would be a bug.
Thanks, Laura! That helps a lot. I would think checking for NULL pte in
set_kpte() would still make sense since lookup_address() can return
NULL. Something like:
--- arch/arm64/mm/xpfo.c 2019-01-30 13:36:39.857185612 -0700
+++ arch/arm64/mm/xpfo.c.new 2019-02-12 13:26:47.471633031 -0700
@@ -46,6 +46,11 @@
{
pte_t *pte = lookup_address((unsigned long)kaddr);
+ if (unlikely(!pte)) {
+ WARN(1, "xpfo: invalid address %p\n", kaddr);
+ return;
+ }
+
set_pte(pte, pfn_pte(page_to_pfn(page), prot));
}
--
Khalid