This applies to 4.17 and 4.18.
Thanks to Hugh Dickins for initially finding the r/w kernel text
issue and coming up with an initial fix. I found the "unused
hole" part and came up with a different approach for fixing the
mess.
--
Background:
Process Context IDentifiers (PCIDs) are a hardware feature that
allows TLB entries to survive page table switches (CR3 writes).
As an optimization, the PTI code currently allows the kernel image
to be Global when running on hardware without PCIDs. This results
in fewer TLB misses, especially upon entry.
The downside is that these Global areas are theoretically
susceptible to Meltdown. The logic is that there are no secrets
in the kernel image, so why pay the cost of TLB misses.
Problem:
The current PTI code leaves the entire area of the kernel binary
between '_text' and '_end' as Global (on non-PCID hardware).
However, that range contains read-write kernel data and two
"unused" holes in addition to text. The areas which are not text
or read-only might contain secrets once they are freed back into
the allocator.
This issue affects systems which are susceptible to Meltdown, do not
have PCIDs and which are using the default PTI_AUTO mode (no
pti=on/off on the cmdline).
PCIDs became generally available for servers in ~2010 (Westmere)
and desktop (client) parts in roughly 2011 (Sandybridge). This
is not expected to affect anything newer than that.
Solution:
The solution for the read-write area is to clear the global bit
for the area (patch #1).
The "unused" holes need a bit more work since we free them in a
bit of an ad-hoc way, but we fix this up in patches 2-5.
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
From: Dave Hansen <[email protected]>
free_reserved_area() takes pointers as arguments to show which
addresses should be freed. However, it does this in a
somewhat ambiguous way. If it gets a kernel direct map address,
it always works. However, if it gets an address that is
part of the kernel image alias mapping, it can fail.
It fails if all of the following happen:
* The specified address is part of the kernel image alias
* Poisoning is requested (forcing a memset())
* The address is in a read-only portion of the kernel image
The memset() fails on the read-only mapping, of course.
free_reserved_area() *is* called both on the direct map and
on kernel image alias addresses. We've just lucked out thus
far that the kernel image alias areas it gets used on are
read-write. I'm fairly sure this has been just a happy
accident.
It is quite easy to make free_reserved_area() work for all
cases: just convert the address to a direct map address before
doing the memset(), and do this unconditionally. There is
little chance of a regression here because we previously
did a virt_to_page() on the address before freeing it, so we
know these are not highmem pages for which virt_to_page()
would fail.
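
For readers who want to see the aliasing idea outside the kernel, here
is a userspace analogy. It is only a sketch: it maps one page of a
temporary file twice with mmap(), once read-only (standing in for a
read-only kernel image alias) and once read-write (standing in for the
direct map). The function name poison_via_writable_alias() is made up
for this illustration and is not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Userspace analogy only, not kernel code.  Map the same page twice,
 * poison it through the writable alias, and verify the result through
 * the read-only alias.  Returns 0 on success, -1 on any failure.
 */
int poison_via_writable_alias(unsigned char poison)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *ro_alias, *rw_alias;
	FILE *backing;
	int fd, ret = -1;
	long i;

	if (page_size <= 0)
		return -1;
	backing = tmpfile();
	if (!backing)
		return -1;
	fd = fileno(backing);
	if (ftruncate(fd, page_size) != 0)
		goto out_close;

	/* Stand-in for the read-only kernel image alias. */
	ro_alias = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
	if (ro_alias == MAP_FAILED)
		goto out_close;

	/* Stand-in for the always-writable direct map alias. */
	rw_alias = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (rw_alias == MAP_FAILED)
		goto out_unmap_ro;

	/* A memset() on ro_alias would fault; use the writable alias. */
	memset(rw_alias, poison, page_size);

	/* Both aliases share one backing page, so the poison is visible. */
	ret = 0;
	for (i = 0; i < page_size; i++)
		if (ro_alias[i] != poison)
			ret = -1;

	munmap(rw_alias, page_size);
out_unmap_ro:
	munmap(ro_alias, page_size);
out_close:
	fclose(backing);
	return ret;
}
```

Calling poison_via_writable_alias(0xCC) should return 0 on a system
where temporary files can be mapped shared. The kernel case differs in
that page_address(virt_to_page(pos)) finds the writable alias, rather
than a second mmap() creating one.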
Signed-off-by: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
---
b/mm/page_alloc.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff -puN mm/page_alloc.c~x86-mm-init-handle-non-linear-map-ranges-free_init_pages mm/page_alloc.c
--- a/mm/page_alloc.c~x86-mm-init-handle-non-linear-map-ranges-free_init_pages 2018-07-30 09:53:13.259915693 -0700
+++ b/mm/page_alloc.c 2018-07-30 09:53:13.264915693 -0700
@@ -6934,9 +6934,21 @@ unsigned long free_reserved_area(void *s
start = (void *)PAGE_ALIGN((unsigned long)start);
end = (void *)((unsigned long)end & PAGE_MASK);
for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
+ struct page *page = virt_to_page(pos);
+ void *direct_map_addr;
+
+ /*
+ * 'direct_map_addr' might be different from 'pos'
+ * because some architectures' virt_to_page()
+ * work with aliases. Getting the direct map
+ * address ensures that we get a _writeable_
+ * alias for the memset().
+ */
+ direct_map_addr = page_address(page);
if ((unsigned int)poison <= 0xFF)
- memset(pos, poison, PAGE_SIZE);
- free_reserved_page(virt_to_page(pos));
+ memset(direct_map_addr, poison, PAGE_SIZE);
+
+ free_reserved_page(page);
}
if (pages && s)
_
From: Dave Hansen <[email protected]>
The kernel image is mapped into two places in the virtual address
space (addresses without KASLR, of course):
1. The kernel direct map (0xffff880000000000)
2. The "high kernel map" (0xffffffff81000000)
We actually execute out of #2. If we get the address of a kernel
symbol, it points to #2, but almost all physical-to-virtual
translations point to #1.
Parts of the "high kernel map" alias are mapped in the userspace
page tables with the Global bit for performance reasons. The
parts that we map to userspace do not (er, should not) have
secrets.
This is fine, except that some areas in the kernel image that
are adjacent to the non-secret-containing areas are unused holes.
We free these holes back into the normal page allocator and
reuse them as normal kernel memory. The memory will, of course,
get *used* via the normal map, but the alias mapping is kept.
This otherwise unused alias mapping of the holes will, by default,
keep the Global bit, be mapped out to userspace, and be
vulnerable to Meltdown.
Remove the alias mapping of these pages entirely. This is likely
to fracture the 2M page mapping the kernel image near these areas,
but this should affect a minority of the area.
This unmapping behavior is currently dependent on PTI being in
place. Going forward, we should at least consider doing this for
all configurations. Having an extra read-write alias for memory
is not exactly ideal for debugging things like random memory
corruption and this does undercut features like DEBUG_PAGEALLOC
or future work like eXclusive Page Frame Ownership (XPFO).
Before this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
After this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte
current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_user-0xffffffff82488000-0xffffffff82600000 1504K pte
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
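
As a cross-check on the dumps above, the sizes of the rows that lose
their attributes can be recomputed with the same shift the patch uses
for len_pages. This is a standalone sketch; range_len_pages() and
pages_to_kib() are illustrative names, and PAGE_SHIFT is assumed to be
12 (4K base pages).

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Whole pages in [begin, end); both ends are assumed page-aligned. */
static inline uint64_t range_len_pages(uint64_t begin, uint64_t end)
{
	return (end - begin) >> PAGE_SHIFT;
}

/* Convert a page count to KiB, the unit used in the dumps. */
static inline uint64_t pages_to_kib(uint64_t pages)
{
	return pages << (PAGE_SHIFT - 10);
}
```

For 0xffffffff81e11000-0xffffffff82000000 this gives 495 pages, which
is the 1980K shown in the dump, and for
0xffffffff82488000-0xffffffff82600000 it gives 376 pages, or 1504K.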
Signed-off-by: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
---
b/arch/x86/mm/init.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff -puN arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image 2018-07-30 09:53:14.862915689 -0700
+++ b/arch/x86/mm/init.c 2018-07-30 09:53:14.866915689 -0700
@@ -778,8 +778,26 @@ void free_init_pages(char *what, unsigne
*/
void free_kernel_image_pages(void *begin, void *end)
{
- free_init_pages("unused kernel image",
- (unsigned long)begin, (unsigned long)end);
+ unsigned long begin_ul = (unsigned long)begin;
+ unsigned long end_ul = (unsigned long)end;
+ unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;
+
+
+ free_init_pages("unused kernel image", begin_ul, end_ul);
+
+ /*
+ * PTI maps some of the kernel into userspace. For
+ * performance, this includes some kernel areas that
+ * do not contain secrets. Those areas might be
+ * adjacent to the parts of the kernel image being
+ * freed, which may contain secrets. Remove the
+ * "high kernel image mapping" for these freed areas,
+ * ensuring they are not even potentially vulnerable
+ * to Meltdown regardless of the specific optimizations
+ * PTI is currently using.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_PTI))
+ set_memory_np(begin_ul, len_pages);
}
void __ref free_initmem(void)
_
From: Dave Hansen <[email protected]>
When chunks of the kernel image are freed, free_init_pages()
is used directly. Consolidate the three sites that do this.
Also update the string to give an incrementally better
description of that memory versus what was there before.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
---
b/arch/x86/include/asm/processor.h | 1 +
b/arch/x86/mm/init.c | 15 ++++++++++++---
b/arch/x86/mm/init_64.c | 4 ++--
3 files changed, 15 insertions(+), 5 deletions(-)
diff -puN arch/x86/include/asm/processor.h~x86-mm-init-common-kernel-image-free-helper arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86-mm-init-common-kernel-image-free-helper 2018-07-30 09:53:14.295915691 -0700
+++ b/arch/x86/include/asm/processor.h 2018-07-30 09:53:14.303915690 -0700
@@ -975,6 +975,7 @@ static inline uint32_t hypervisor_cpuid_
extern unsigned long arch_align_stack(unsigned long sp);
extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
+extern void free_kernel_image_pages(void *begin, void *end);
void default_idle(void);
#ifdef CONFIG_XEN
diff -puN arch/x86/mm/init_64.c~x86-mm-init-common-kernel-image-free-helper arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c~x86-mm-init-common-kernel-image-free-helper 2018-07-30 09:53:14.297915691 -0700
+++ b/arch/x86/mm/init_64.c 2018-07-30 09:53:14.304915690 -0700
@@ -1283,8 +1283,8 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
#endif
- free_init_pages("unused kernel", text_end, rodata_start);
- free_init_pages("unused kernel", rodata_end, _sdata);
+ free_kernel_image_pages((void *)text_end, (void *)rodata_start);
+ free_kernel_image_pages((void *)rodata_end, (void *)_sdata);
debug_checkwx();
diff -puN arch/x86/mm/init.c~x86-mm-init-common-kernel-image-free-helper arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~x86-mm-init-common-kernel-image-free-helper 2018-07-30 09:53:14.298915691 -0700
+++ b/arch/x86/mm/init.c 2018-07-30 09:53:14.304915690 -0700
@@ -771,13 +771,22 @@ void free_init_pages(char *what, unsigne
}
}
+/*
+ * begin/end can be in the direct map or the "high kernel mapping"
+ * used for the kernel image only. free_init_pages() will do the
+ * right thing for either kind of address.
+ */
+void free_kernel_image_pages(void *begin, void *end)
+{
+ free_init_pages("unused kernel image",
+ (unsigned long)begin, (unsigned long)end);
+}
+
void __ref free_initmem(void)
{
e820__reallocate_tables();
- free_init_pages("unused kernel",
- (unsigned long)(&__init_begin),
- (unsigned long)(&__init_end));
+ free_kernel_image_pages(&__init_begin, &__init_end);
}
#ifdef CONFIG_BLK_DEV_INITRD
_
From: Dave Hansen <[email protected]>
The x86 code has several places where it frees parts of the kernel image:
1. Unused SMP alternative
2. __init code
3. The hole between text and rodata
4. The hole between rodata and data
We call free_init_pages() to do this. Strangely, we convert the
symbol addresses to kernel direct map addresses in some cases
(#3, #4) but not others (#1, #2).
The virt_to_page() and the other code in free_reserved_area() now
work fine for symbol addresses on x86, so don't bother
converting the addresses to direct map addresses before freeing
them.
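
To make the removed conversion concrete, here is a simplified model of
what __va(__pa_symbol(x)) computes. It is a sketch under loud
assumptions: no KASLR, a physical load offset (phys_base) of zero, the
high kernel map starting at 0xffffffff80000000, and the direct map
starting at 0xffff880000000000. The sketch_* helpers are hypothetical
stand-ins, not the kernel macros.

```c
#include <stdint.h>

/* Assumed non-KASLR layout; real kernels may differ. */
#define SKETCH_START_KERNEL_MAP	0xffffffff80000000ULL
#define SKETCH_DIRECT_MAP_BASE	0xffff880000000000ULL

/* Stand-in for __pa_symbol(), assuming phys_base == 0. */
static inline uint64_t sketch_pa_symbol(uint64_t sym)
{
	return sym - SKETCH_START_KERNEL_MAP;
}

/* Stand-in for __va(): physical address to direct map address. */
static inline uint64_t sketch_va(uint64_t phys)
{
	return phys + SKETCH_DIRECT_MAP_BASE;
}

/* The conversion this patch stops doing at the call sites. */
static inline uint64_t sketch_symbol_to_direct_map(uint64_t sym)
{
	return sketch_va(sketch_pa_symbol(sym));
}
```

Under these assumptions a symbol at 0xffffffff81000000 has physical
address 0x1000000 and direct map address 0xffff880001000000. Both
virtual addresses name the same page, which is why virt_to_page() can
accept either one.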
Signed-off-by: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
---
b/arch/x86/mm/init_64.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff -puN arch/x86/mm/init_64.c~x86-init-do-not-convert-symbol-addresses arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c~x86-init-do-not-convert-symbol-addresses 2018-07-30 09:53:13.778915692 -0700
+++ b/arch/x86/mm/init_64.c 2018-07-30 09:53:13.782915692 -0700
@@ -1283,12 +1283,8 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
#endif
- free_init_pages("unused kernel",
- (unsigned long) __va(__pa_symbol(text_end)),
- (unsigned long) __va(__pa_symbol(rodata_start)));
- free_init_pages("unused kernel",
- (unsigned long) __va(__pa_symbol(rodata_end)),
- (unsigned long) __va(__pa_symbol(_sdata)));
+ free_init_pages("unused kernel", text_end, rodata_start);
+ free_init_pages("unused kernel", rodata_end, _sdata);
debug_checkwx();
_
From: Dave Hansen <[email protected]>
The kernel image starts out with the Global bit set across the entire
kernel image. The bit is cleared with set_memory_nonglobal() in the
configurations with PCIDs where we do not need the performance
benefits of the Global bit.
However, this is fragile. It means that we are stuck opting *out* of
the less-secure (Global bit set) configuration, which seems backwards.
Let's start more secure (Global bit clear) and then let things opt
back in if they want performance, or are truly mapping common data
between kernel and userspace.
This fixes a bug. Before this patch, there are areas that are
unmapped from the user page tables (like everything above
0xffffffff82600000 in the example below). These have the hallmark of
being a wrong Global area: they are not identical in the
'current_kernel' and 'current_user' page table dumps. They are also
read-write, which means they're much more likely to contain secrets.
Before this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE GLB NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW GLB NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE GLB NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
After this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
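
The "clear everywhere, then opt back in" ordering can be modeled on a
plain array of page table entries. This is a toy sketch, not the
change_page_attr machinery: the sketch_* names are invented for
illustration, and the only architectural fact it relies on is that the
Global bit is bit 8 of an x86 PTE.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_GLOBAL (1ULL << 8)	/* x86 PTE Global bit */

/* Start secure: clear Global on every entry of the image. */
void sketch_set_nonglobal(uint64_t *ptes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		ptes[i] &= ~SKETCH_PAGE_GLOBAL;
}

/* Opt back in: set Global only on [first, first + count). */
void sketch_set_global(uint64_t *ptes, size_t first, size_t count)
{
	for (size_t i = first; i < first + count; i++)
		ptes[i] |= SKETCH_PAGE_GLOBAL;
}
```

Running sketch_set_nonglobal() over the whole array and then
sketch_set_global() over the leading text entries leaves every entry
outside that range non-Global, whatever state it started in. That is
the property the "After this patch" dump shows for the RW rows.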
Signed-off-by: Dave Hansen <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
---
b/arch/x86/mm/pageattr.c | 6 ++++++
b/arch/x86/mm/pti.c | 34 ++++++++++++++++++++++++----------
2 files changed, 30 insertions(+), 10 deletions(-)
diff -puN arch/x86/mm/pageattr.c~pti-non-pcid-clear-more-global-in-kernel-mapping arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~pti-non-pcid-clear-more-global-in-kernel-mapping 2018-07-30 09:53:12.720915694 -0700
+++ b/arch/x86/mm/pageattr.c 2018-07-30 09:53:12.726915694 -0700
@@ -1784,6 +1784,12 @@ int set_memory_nonglobal(unsigned long a
__pgprot(_PAGE_GLOBAL), 0);
}
+int set_memory_global(unsigned long addr, int numpages)
+{
+ return change_page_attr_set(&addr, numpages,
+ __pgprot(_PAGE_GLOBAL), 0);
+}
+
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
struct cpa_data cpa;
diff -puN arch/x86/mm/pti.c~pti-non-pcid-clear-more-global-in-kernel-mapping arch/x86/mm/pti.c
--- a/arch/x86/mm/pti.c~pti-non-pcid-clear-more-global-in-kernel-mapping 2018-07-30 09:53:12.722915694 -0700
+++ b/arch/x86/mm/pti.c 2018-07-30 09:53:12.726915694 -0700
@@ -435,6 +435,13 @@ static inline bool pti_kernel_image_glob
}
/*
+ * This is the only user for these and it is not arch-generic
+ * like the other set_memory.h functions. Just extern them.
+ */
+extern int set_memory_nonglobal(unsigned long addr, int numpages);
+extern int set_memory_global(unsigned long addr, int numpages);
+
+/*
* For some configurations, map all of kernel text into the user page
* tables. This reduces TLB misses, especially on non-PCID systems.
*/
@@ -446,7 +453,8 @@ void pti_clone_kernel_text(void)
* clone the areas past rodata, they might contain secrets.
*/
unsigned long start = PFN_ALIGN(_text);
- unsigned long end = (unsigned long)__end_rodata_hpage_align;
+ unsigned long end_clone = (unsigned long)__end_rodata_hpage_align;
+ unsigned long end_global = PFN_ALIGN((unsigned long)__stop___ex_table);
if (!pti_kernel_image_global_ok())
return;
@@ -458,14 +466,18 @@ void pti_clone_kernel_text(void)
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
- pti_clone_pmds(start, end, _PAGE_RW);
+ pti_clone_pmds(start, end_clone, _PAGE_RW);
+
+ /*
+ * pti_clone_pmds() will set the global bit in any PMDs
+ * that it clones, but we also need to get any PTEs in
+ * the last level for areas that are not huge-page-aligned.
+ */
+
+ /* Set the global bit for normal non-__init kernel text: */
+ set_memory_global(start, (end_global - start) >> PAGE_SHIFT);
}
-/*
- * This is the only user for it and it is not arch-generic like
- * the other set_memory.h functions. Just extern it.
- */
-extern int set_memory_nonglobal(unsigned long addr, int numpages);
void pti_set_kernel_image_nonglobal(void)
{
/*
@@ -477,9 +489,11 @@ void pti_set_kernel_image_nonglobal(void
unsigned long start = PFN_ALIGN(_text);
unsigned long end = ALIGN((unsigned long)_end, PMD_PAGE_SIZE);
- if (pti_kernel_image_global_ok())
- return;
-
+ /*
+ * This clears _PAGE_GLOBAL from the entire kernel image.
+ * pti_clone_kernel_text() may put _PAGE_GLOBAL back for
+ * areas that are mapped to userspace.
+ */
set_memory_nonglobal(start, (end - start) >> PAGE_SHIFT);
}
_
On Wed, 1 Aug 2018, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The kernel image is mapped into two places in the virtual address
> space (addresses without KASLR, of course):
>
> 1. The kernel direct map (0xffff880000000000)
> 2. The "high kernel map" (0xffffffff81000000)
>
> We actually execute out of #2. If we get the address of a kernel
> symbol, it points to #2, but almost all physical-to-virtual
> translations point to #1.
>
> Parts of the "high kernel map" alias are mapped in the userspace
> page tables with the Global bit for performance reasons. The
> parts that we map to userspace do not (er, should not) have
> secrets.
>
> This is fine, except that some areas in the kernel image that
> are adjacent to the non-secret-containing areas are unused holes.
> We free these holes back into the normal page allocator and
> reuse them as normal kernel memory. The memory will, of course,
> get *used* via the normal map, but the alias mapping is kept.
>
> This otherwise unused alias mapping of the holes will, by default
> keep the Global bit, be mapped out to userspace, and be
> vulnerable to Meltdown.
>
> Remove the alias mapping of these pages entirely. This is likely
> to fracture the 2M page mapping the kernel image near these areas,
> but this should affect a minority of the area.
>
> This unmapping behavior is currently dependent on PTI being in
> place. Going forward, we should at least consider doing this for
> all configurations. Having an extra read-write alias for memory
> is not exactly ideal for debugging things like random memory
> corruption and this does undercut features like DEBUG_PAGEALLOC
> or future work like eXclusive Page Frame Ownership (XPFO).
>
> Before this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
> current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
> current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
> current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
> current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
> current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
> current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
>
>
> After this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte
> current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
> current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
> current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte
> current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte
> current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte
> current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
> current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
> current_user-0xffffffff82488000-0xffffffff82600000 1504K pte
> current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Juergen Gross <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Andi Kleen <[email protected]>
> ---
>
> b/arch/x86/mm/init.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff -puN arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image arch/x86/mm/init.c
> --- a/arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image 2018-07-30 09:53:14.862915689 -0700
> +++ b/arch/x86/mm/init.c 2018-07-30 09:53:14.866915689 -0700
> @@ -778,8 +778,26 @@ void free_init_pages(char *what, unsigne
> */
> void free_kernel_image_pages(void *begin, void *end)
> {
> - free_init_pages("unused kernel image",
> - (unsigned long)begin, (unsigned long)end);
> + unsigned long begin_ul = (unsigned long)begin;
> + unsigned long end_ul = (unsigned long)end;
> + unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;
> +
> +
> + free_init_pages("unused kernel image", begin_ul, end_ul);
> +
> + /*
> + * PTI maps some of the kernel into userspace. For
> + * performance, this includes some kernel areas that
> + * do not contain secrets. Those areas might be
> + * adjacent to the parts of the kernel image being
> + * freed, which may contain secrets. Remove the
> + * "high kernel image mapping" for these freed areas,
> + * ensuring they are not even potentially vulnerable
> + * to Meltdown regardless of the specific optimizations
> + * PTI is currently using.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_PTI))
> + set_memory_np(begin_ul, len_pages);
> }
>
> void __ref free_initmem(void)
> _
Ironically, that set_memory_np() is giving me a problem.
I don't see it when booting the 8GB laptop normally, but when booting
with "mem=1G", I get a not-present fault when ext4_iget() is trying to
do its business in starting init. But boots fine with "mem=1G nopti".
I get the feeling that set_memory_np() is marking those freed pages
as not-present in the direct map, so they're no longer usable at all.
I can jot down some console messages if you need, but hope I've said
enough for you to see it immediately, and just say whoops, forget 5/5?
Hugh