Folks,
This patch series is a response to the report that the EFI region
mappings trigger warnings when booting with CONFIG_DEBUG_WX enabled.
The patches allocate a new page table structure and ensure that all the
mappings we require during EFI runtime calls are set up only there.
It turns out that it still makes sense to share some page table
entries with 'swapper_pg_dir', just not the entries that require
security-lax permissions. Sharing entries is useful for memory
hotplug, for example.
When writing this series I discovered a number of bugs in the existing
code that only became apparent when we stopped using 'trampoline_pgd'
which already mapped a bunch of things for us. I've put those bug
fixes at the start of the series.
Further testing would be very much appreciated as this is a
notoriously funky area of the EFI code.
The following changes since commit 2c66e24d75d424919c42288b418d2e593fa818b1:
x86/efi: Fix kernel panic when CONFIG_DEBUG_VIRTUAL is enabled (2015-10-25 10:22:25 +0000)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git tags/efi-next
for you to fetch changes up to 278a4c3c13f6b24f4e18aeabe4135fff7fc703d2:
Documentation/x86: Update EFI memory region description (2015-11-12 15:03:08 +0000)
----------------------------------------------------------------
* Use completely separate page tables for EFI runtime service calls
so that the security-lax mapping permissions (RWX) do not leak into
the standard kernel page tables and trigger warnings when
CONFIG_DEBUG_WX is enabled.
----------------------------------------------------------------
Matt Fleming (6):
x86/mm/pageattr: Ensure cpa->pfn only contains page frame numbers
x86/mm/pageattr: Do not strip pte flags from cpa->pfn
x86/efi: Map RAM into the identity page table for mixed mode
x86/efi: Hoist page table switching code into efi_call_virt()
x86/efi: Build our own page table structures
Documentation/x86: Update EFI memory region description
Documentation/x86/x86_64/mm.txt | 12 +--
arch/x86/include/asm/efi.h | 26 ++++++
arch/x86/mm/pageattr.c | 17 ++--
arch/x86/platform/efi/efi.c | 41 ++++-----
arch/x86/platform/efi/efi_32.c | 5 ++
arch/x86/platform/efi/efi_64.c | 174 ++++++++++++++++++++++++++++--------
arch/x86/platform/efi/efi_stub_64.S | 43 ---------
7 files changed, 193 insertions(+), 125 deletions(-)
The x86 pageattr code is confused about the data that is stored in
cpa->pfn; sometimes it's treated as a page frame number and sometimes
it's treated as an unshifted physical address.
The result of this is that the mapping functions do not map the
intended physical address.
This isn't a problem in practice because most of the addresses we're
mapping in the EFI code paths are already mapped in 'trampoline_pgd',
and so the pageattr mapping functions don't actually do anything in
this case. But when we move to using a separate page table for the EFI
runtime this will be an issue.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/mm/pageattr.c | 12 ++++++------
arch/x86/platform/efi/efi_64.c | 33 ++++++++++++++++++++++-----------
2 files changed, 28 insertions(+), 17 deletions(-)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 9abe0c9b1098..893921b12272 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -890,10 +890,10 @@ static void populate_pte(struct cpa_data *cpa,
if (!(pgprot_val(pgprot) & _PAGE_NX))
cpa->pfn &= ~_PAGE_NX;
- set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot));
+ set_pte(pte, pfn_pte(cpa->pfn, pgprot));
start += PAGE_SIZE;
- cpa->pfn += PAGE_SIZE;
+ cpa->pfn++;
pte++;
}
}
@@ -949,11 +949,11 @@ static int populate_pmd(struct cpa_data *cpa,
pmd = pmd_offset(pud, start);
- set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE |
+ set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
massage_pgprot(pmd_pgprot)));
start += PMD_SIZE;
- cpa->pfn += PMD_SIZE;
+ cpa->pfn += PMD_SIZE >> PAGE_SHIFT;
cur_pages += PMD_SIZE >> PAGE_SHIFT;
}
@@ -1022,11 +1022,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
* Map everything starting from the Gb boundary, possibly with 1G pages
*/
while (end - start >= PUD_SIZE) {
- set_pud(pud, __pud(cpa->pfn | _PAGE_PSE |
+ set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
massage_pgprot(pud_pgprot)));
start += PUD_SIZE;
- cpa->pfn += PUD_SIZE;
+ cpa->pfn += PUD_SIZE >> PAGE_SHIFT;
cur_pages += PUD_SIZE >> PAGE_SHIFT;
pud++;
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index a0ac0f9c307f..c8b58ac47b77 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void)
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
{
- unsigned long text;
+ unsigned long pfn, text;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
* and ident-map those pages containing the map before calling
* phys_efi_set_virtual_address_map().
*/
- if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) {
+ pfn = pa_memmap >> PAGE_SHIFT;
+ if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) {
pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap);
return 1;
}
@@ -176,21 +177,29 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
if (!IS_ENABLED(CONFIG_EFI_MIXED))
return 0;
+ npages = (_end - _text) >> PAGE_SHIFT;
+ text = __pa(_text);
+ pfn = text >> PAGE_SHIFT;
+
+ if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, 0)) {
+ pr_err("Failed to map kernel text 1:1\n");
+ return 1;
+ }
+
page = alloc_page(GFP_KERNEL|__GFP_DMA32);
if (!page)
panic("Unable to allocate EFI runtime stack < 4GB\n");
efi_scratch.phys_stack = virt_to_phys(page_address(page));
- efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
-
- npages = (_end - _text) >> PAGE_SHIFT;
- text = __pa(_text);
+ pfn = page_to_pfn(page);
- if (kernel_map_pages_in_pgd(pgd, text >> PAGE_SHIFT, text, npages, 0)) {
- pr_err("Failed to map kernel text 1:1\n");
+ if (kernel_map_pages_in_pgd(pgd, pfn, efi_scratch.phys_stack, 1, 0)) {
+ pr_err("Failed to map mixed mode stack\n");
return 1;
}
+ efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
+
return 0;
}
@@ -204,12 +213,14 @@ void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
static void __init __map_region(efi_memory_desc_t *md, u64 va)
{
pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
- unsigned long pf = 0;
+ unsigned long flags = 0;
+ unsigned long pfn;
if (!(md->attribute & EFI_MEMORY_WB))
- pf |= _PAGE_PCD;
+ flags |= _PAGE_PCD;
- if (kernel_map_pages_in_pgd(pgd, md->phys_addr, va, md->num_pages, pf))
+ pfn = md->phys_addr >> PAGE_SHIFT;
+ if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
md->phys_addr, va);
}
--
2.6.2
Clearing the _PAGE_NX bit in cpa->pfn corrupts the page frame number
rather than removing _PAGE_NX as the code intends. This is unlikely to
be a problem in practice because _PAGE_BIT_NX is bit 63 and most
machines do not have page frame numbers that reach that high. Still,
pte flags are never stored in cpa->pfn, so we can safely delete the
code.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/mm/pageattr.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 893921b12272..d5240be55915 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -885,11 +885,6 @@ static void populate_pte(struct cpa_data *cpa,
pte = pte_offset_kernel(pmd, start);
while (num_pages-- && start < end) {
-
- /* deal with the NX bit */
- if (!(pgprot_val(pgprot) & _PAGE_NX))
- cpa->pfn &= ~_PAGE_NX;
-
set_pte(pte, pfn_pte(cpa->pfn, pgprot));
start += PAGE_SIZE;
--
2.6.2
We are relying on the pre-existing mappings in 'trampoline_pgd' when
accessing function arguments in the EFI mixed mode thunking code.
Instead let's map memory explicitly so that things will continue to
work when we move to a separate page table in the future.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/platform/efi/efi_64.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index c8b58ac47b77..634536034e32 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
{
unsigned long pfn, text;
+ efi_memory_desc_t *md;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -200,6 +201,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
+ /*
+ * Map all of RAM so that we can access arguments in the 1:1
+ * mapping when making EFI runtime calls.
+ */
+ for_each_efi_memory_desc(&memmap, md) {
+ if (md->type != EFI_CONVENTIONAL_MEMORY &&
+ md->type != EFI_LOADER_DATA &&
+ md->type != EFI_LOADER_CODE)
+ continue;
+
+ pfn = md->phys_addr >> PAGE_SHIFT;
+ npages = md->num_pages;
+
+ if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) {
+ pr_err("Failed to map 1:1 memory\n");
+ return 1;
+ }
+ }
+
return 0;
}
--
2.6.2
This change is a prerequisite for pending patches that switch to a
dedicated EFI page table, instead of using 'trampoline_pgd' which
shares PGD entries with 'swapper_pg_dir'. The pending patches make it
impossible to dereference the runtime service function pointer without
first switching %cr3.
It's true that we now have duplicated switching code in
efi_call_virt() and efi_call_phys_{prolog,epilog}(), but we accept
that duplication in exchange for a little more clarity and the ease of
writing the page table switching code in C instead of asm.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/include/asm/efi.h | 25 +++++++++++++++++++++
arch/x86/platform/efi/efi_64.c | 24 ++++++++++-----------
arch/x86/platform/efi/efi_stub_64.S | 43 -------------------------------------
3 files changed, 36 insertions(+), 56 deletions(-)
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index cfee9d4b02af..f9d99d4e7b1a 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -3,6 +3,7 @@
#include <asm/fpu/api.h>
#include <asm/pgtable.h>
+#include <asm/tlb.h>
/*
* We map the EFI regions needed for runtime services non-contiguously,
@@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
#define efi_call_phys(f, args...) efi_call((f), args)
+/*
+ * Scratch space used for switching the pagetable in the EFI stub
+ */
+struct efi_scratch {
+ u64 r15;
+ u64 prev_cr3;
+ pgd_t *efi_pgt;
+ bool use_pgd;
+ u64 phys_stack;
+} __packed;
+
#define efi_call_virt(f, ...) \
({ \
efi_status_t __s; \
@@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
efi_sync_low_kernel_mappings(); \
preempt_disable(); \
__kernel_fpu_begin(); \
+ \
+ if (efi_scratch.use_pgd) { \
+ efi_scratch.prev_cr3 = read_cr3(); \
+ write_cr3((unsigned long)efi_scratch.efi_pgt); \
+ __flush_tlb_all(); \
+ } \
+ \
__s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__); \
+ \
+ if (efi_scratch.use_pgd) { \
+ write_cr3(efi_scratch.prev_cr3); \
+ __flush_tlb_all(); \
+ } \
+ \
__kernel_fpu_end(); \
preempt_enable(); \
__s; \
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 634536034e32..ab5f14a886cc 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,16 +47,7 @@
*/
static u64 efi_va = EFI_VA_START;
-/*
- * Scratch space used for switching the pagetable in the EFI stub
- */
-struct efi_scratch {
- u64 r15;
- u64 prev_cr3;
- pgd_t *efi_pgt;
- bool use_pgd;
- u64 phys_stack;
-} __packed;
+struct efi_scratch efi_scratch;
static void __init early_code_mapping_set_exec(int executable)
{
@@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
int pgd;
int n_pgds;
- if (!efi_enabled(EFI_OLD_MEMMAP))
- return NULL;
+ if (!efi_enabled(EFI_OLD_MEMMAP)) {
+ save_pgd = (pgd_t *)read_cr3();
+ write_cr3((unsigned long)efi_scratch.efi_pgt);
+ goto out;
+ }
early_code_mapping_set_exec(1);
@@ -96,6 +90,7 @@ pgd_t * __init efi_call_phys_prolog(void)
vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
}
+out:
__flush_tlb_all();
return save_pgd;
@@ -109,8 +104,11 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
int pgd_idx;
int nr_pgds;
- if (!save_pgd)
+ if (!efi_enabled(EFI_OLD_MEMMAP)) {
+ write_cr3((unsigned long)save_pgd);
+ __flush_tlb_all();
return;
+ }
nr_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT) , PGDIR_SIZE);
diff --git a/arch/x86/platform/efi/efi_stub_64.S b/arch/x86/platform/efi/efi_stub_64.S
index 86d0f9e08dd9..32020cb8bb08 100644
--- a/arch/x86/platform/efi/efi_stub_64.S
+++ b/arch/x86/platform/efi/efi_stub_64.S
@@ -38,41 +38,6 @@
mov %rsi, %cr0; \
mov (%rsp), %rsp
- /* stolen from gcc */
- .macro FLUSH_TLB_ALL
- movq %r15, efi_scratch(%rip)
- movq %r14, efi_scratch+8(%rip)
- movq %cr4, %r15
- movq %r15, %r14
- andb $0x7f, %r14b
- movq %r14, %cr4
- movq %r15, %cr4
- movq efi_scratch+8(%rip), %r14
- movq efi_scratch(%rip), %r15
- .endm
-
- .macro SWITCH_PGT
- cmpb $0, efi_scratch+24(%rip)
- je 1f
- movq %r15, efi_scratch(%rip) # r15
- # save previous CR3
- movq %cr3, %r15
- movq %r15, efi_scratch+8(%rip) # prev_cr3
- movq efi_scratch+16(%rip), %r15 # EFI pgt
- movq %r15, %cr3
- 1:
- .endm
-
- .macro RESTORE_PGT
- cmpb $0, efi_scratch+24(%rip)
- je 2f
- movq efi_scratch+8(%rip), %r15
- movq %r15, %cr3
- movq efi_scratch(%rip), %r15
- FLUSH_TLB_ALL
- 2:
- .endm
-
ENTRY(efi_call)
SAVE_XMM
mov (%rsp), %rax
@@ -83,16 +48,8 @@ ENTRY(efi_call)
mov %r8, %r9
mov %rcx, %r8
mov %rsi, %rcx
- SWITCH_PGT
call *%rdi
- RESTORE_PGT
addq $48, %rsp
RESTORE_XMM
ret
ENDPROC(efi_call)
-
- .data
-ENTRY(efi_scratch)
- .fill 3,8,0
- .byte 0
- .quad 0
--
2.6.2
With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all users
booting on 64-bit UEFI machines see the following warning,
------------[ cut here ]------------
WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780()
x86/mm: Found insecure W+X mapping at address ffff88000005f000/0xffff88000005f000
...
x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found.
...
This is caused by mapping EFI regions with RWX permissions. There
isn't much we can do to restrict the permissions for these regions due
to the way the firmware toolchains mix code and data, but we can at
least isolate these mappings so that they do not appear in the regular
kernel page tables.
In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping")
we started using 'trampoline_pgd' to map the EFI regions because there
was an existing identity mapping there which we use during the
SetVirtualAddressMap() call and for broken firmware that accesses
those addresses.
But 'trampoline_pgd' shares some PGD entries with 'swapper_pg_dir' and
does not provide the isolation we require. Notably, the virtual
addresses for __START_KERNEL_map and MODULES_START are mapped by the
same PGD entry, so we need to be more careful when copying changes
over in efi_sync_low_kernel_mappings().
This patch doesn't go the full distance; we still want to share some
PGD entries with 'swapper_pg_dir'. Having completely separate page
tables brings its own issues, such as synchronising new mappings after
memory hotplug and module loading. Sharing also keeps memory usage
down.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/include/asm/efi.h | 1 +
arch/x86/platform/efi/efi.c | 41 +++++++-----------
arch/x86/platform/efi/efi_32.c | 5 +++
arch/x86/platform/efi/efi_64.c | 97 +++++++++++++++++++++++++++++++++++-------
4 files changed, 103 insertions(+), 41 deletions(-)
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index f9d99d4e7b1a..896264dfdf80 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -124,6 +124,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size);
extern void __init efi_map_region(efi_memory_desc_t *md);
extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
extern void efi_sync_low_kernel_mappings(void);
+extern int __init efi_alloc_page_tables(void);
extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
extern void __init old_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index c69e58fb7f19..133d09b21dc4 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -804,7 +804,7 @@ static void __init kexec_enter_virtual_mode(void)
* This function will switch the EFI runtime services to virtual mode.
* Essentially, we look through the EFI memmap and map every region that
* has the runtime attribute bit set in its memory descriptor into the
- * ->trampoline_pgd page table using a top-down VA allocation scheme.
+ * efi_pgd page table.
*
* The old method which used to update that memory descriptor with the
* virtual address obtained from ioremap() is still supported when the
@@ -814,8 +814,8 @@ static void __init kexec_enter_virtual_mode(void)
*
* The new method does a pagetable switch in a preemption-safe manner
* so that we're in a different address space when calling a runtime
- * function. For function arguments passing we do copy the PGDs of the
- * kernel page table into ->trampoline_pgd prior to each call.
+ * function. For function arguments passing we do copy the PUDs of the
+ * kernel page table into efi_pgd prior to each call.
*
* Specially for kexec boot, efi runtime maps in previous kernel should
* be passed in via setup_data. In that case runtime ranges will be mapped
@@ -831,6 +831,12 @@ static void __init __efi_enter_virtual_mode(void)
efi.systab = NULL;
efi_merge_regions();
+ if (efi_alloc_page_tables()) {
+ pr_err("Failed to allocate EFI page tables\n");
+ clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+ return;
+ }
+
new_memmap = efi_map_regions(&count, &pg_shift);
if (!new_memmap) {
pr_err("Error reallocating memory, EFI runtime non-functional!\n");
@@ -888,29 +894,12 @@ static void __init __efi_enter_virtual_mode(void)
efi_runtime_mkexec();
- /*
- * We mapped the descriptor array into the EFI pagetable above but we're
- * not unmapping it here. Here's why:
- *
- * We're copying select PGDs from the kernel page table to the EFI page
- * table and when we do so and make changes to those PGDs like unmapping
- * stuff from them, those changes appear in the kernel page table and we
- * go boom.
- *
- * From setup_real_mode():
- *
- * ...
- * trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
- *
- * In this particular case, our allocation is in PGD 0 of the EFI page
- * table but we've copied that PGD from PGD[272] of the EFI page table:
- *
- * pgd_index(__PAGE_OFFSET = 0xffff880000000000) = 272
- *
- * where the direct memory mapping in kernel space is.
- *
- * new_memmap's VA comes from that direct mapping and thus clearing it,
- * it would get cleared in the kernel page table too.
+ /*
+ * We mapped the descriptor array into the EFI pagetable above
+ * but we're not unmapping it here because if we're running in
+ * EFI mixed mode we need all of memory to be accessible when
+ * we pass parameters to the EFI runtime services in the
+ * thunking code.
*
* efi_cleanup_page_tables(__pa(new_memmap), 1 << pg_shift);
*/
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index ed5b67338294..58d669bc8250 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -38,6 +38,11 @@
* say 0 - 3G.
*/
+int __init efi_alloc_page_tables(void)
+{
+ return 0;
+}
+
void efi_sync_low_kernel_mappings(void) {}
void __init efi_dump_pagetable(void) {}
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index ab5f14a886cc..b5e578a2d2d8 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -40,6 +40,7 @@
#include <asm/fixmap.h>
#include <asm/realmode.h>
#include <asm/time.h>
+#include <asm/pgalloc.h>
/*
* We allocate runtime services regions bottom-up, starting from -4G, i.e.
@@ -121,22 +122,92 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
early_code_mapping_set_exec(0);
}
+static pgd_t *efi_pgd;
+
+/*
+ * We need our own copy of the higher levels of the page tables
+ * because we want to avoid inserting EFI region mappings (EFI_VA_END
+ * to EFI_VA_START) into the standard kernel page tables. Everything
+ * else can be shared, see efi_sync_low_kernel_mappings().
+ */
+int __init efi_alloc_page_tables(void)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ gfp_t gfp_mask;
+
+ if (efi_enabled(EFI_OLD_MEMMAP))
+ return 0;
+
+ gfp_mask = GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO;
+ efi_pgd = (pgd_t *)__get_free_page(gfp_mask);
+ if (!efi_pgd)
+ return -ENOMEM;
+
+ pgd = efi_pgd + pgd_index(EFI_VA_END);
+
+ pud = pud_alloc_one(NULL, 0);
+ if (!pud) {
+ free_page((unsigned long)efi_pgd);
+ return -ENOMEM;
+ }
+
+ pgd_populate(NULL, pgd, pud);
+
+ return 0;
+}
+
/*
* Add low kernel mappings for passing arguments to EFI functions.
*/
void efi_sync_low_kernel_mappings(void)
{
- unsigned num_pgds;
- pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
+ unsigned num_entries;
+ pgd_t *pgd_k, *pgd_efi;
+ pud_t *pud_k, *pud_efi;
if (efi_enabled(EFI_OLD_MEMMAP))
return;
- num_pgds = pgd_index(MODULES_END - 1) - pgd_index(PAGE_OFFSET);
+ /*
+ * We can share all PGD entries apart from the one entry that
+ * covers the EFI runtime mapping space.
+ *
+ * Make sure the EFI runtime region mappings are guaranteed to
+ * only span a single PGD entry and that the entry also maps
+ * other important kernel regions.
+ */
+ BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
+ BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
+ (EFI_VA_END & PGDIR_MASK));
+
+ pgd_efi = efi_pgd + pgd_index(PAGE_OFFSET);
+ pgd_k = pgd_offset_k(PAGE_OFFSET);
+
+ num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET);
+ memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries);
- memcpy(pgd + pgd_index(PAGE_OFFSET),
- init_mm.pgd + pgd_index(PAGE_OFFSET),
- sizeof(pgd_t) * num_pgds);
+ /*
+ * We share all the PUD entries apart from those that map the
+ * EFI regions. Copy around them.
+ */
+ BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0);
+ BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
+
+ pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
+ pud_efi = pud_offset(pgd_efi, 0);
+
+ pgd_k = pgd_offset_k(EFI_VA_END);
+ pud_k = pud_offset(pgd_k, 0);
+
+ num_entries = pud_index(EFI_VA_END);
+ memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
+
+ pud_efi = pud_offset(pgd_efi, EFI_VA_START);
+ pud_k = pud_offset(pgd_k, EFI_VA_START);
+
+ num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
+ memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
}
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
@@ -150,8 +221,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
if (efi_enabled(EFI_OLD_MEMMAP))
return 0;
- efi_scratch.efi_pgt = (pgd_t *)(unsigned long)real_mode_header->trampoline_pgd;
- pgd = __va(efi_scratch.efi_pgt);
+ efi_scratch.efi_pgt = (pgd_t *)__pa(efi_pgd);
+ pgd = efi_pgd;
/*
* It can happen that the physical address of new_memmap lands in memory
@@ -223,16 +294,14 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
{
- pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
-
- kernel_unmap_pages_in_pgd(pgd, pa_memmap, num_pages);
+ kernel_unmap_pages_in_pgd(efi_pgd, pa_memmap, num_pages);
}
static void __init __map_region(efi_memory_desc_t *md, u64 va)
{
- pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
unsigned long flags = 0;
unsigned long pfn;
+ pgd_t *pgd = efi_pgd;
if (!(md->attribute & EFI_MEMORY_WB))
flags |= _PAGE_PCD;
@@ -341,9 +410,7 @@ void __init efi_runtime_mkexec(void)
void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
- pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
-
- ptdump_walk_pgd_level(NULL, pgd);
+ ptdump_walk_pgd_level(NULL, efi_pgd);
#endif
}
--
2.6.2
Make it clear that the EFI page tables are only available during EFI
runtime calls, since that subject has come up a fair number of times
in the past.
Additionally, add the EFI region start and end addresses to the table
so that it's possible to see at a glance where they fall in relation
to other regions.
Cc: Borislav Petkov <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
Documentation/x86/x86_64/mm.txt | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 05712ac83e38..a9885bb1ac22 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -16,6 +16,8 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
+ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
+... unused hole ...
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
@@ -32,11 +34,9 @@ reference.
Current X86-64 implementations only support 40 bits of address space,
but we support up to 46 bits. This expands into MBZ space in the page tables.
-->trampoline_pgd:
-
-We map EFI runtime services in the aforementioned PGD in the virtual
-range of 64Gb (arbitrarily set, can be raised if needed)
-
-0xffffffef00000000 - 0xffffffff00000000
+We map EFI runtime services in the efi_pgd PGD in the virtual range of
+64Gb (arbitrarily set, can be raised if needed). The mappings are not
+part of any other kernel PGD and are only available during EFI runtime
+calls.
-Andi Kleen, Jul 2004
--
2.6.2
On Thu, Nov 12, 2015 at 03:40:20PM +0000, Matt Fleming wrote:
> We are relying on the pre-existing mappings in 'trampoline_pgd' when
> accessing function arguments in the EFI mixed mode thunking code.
>
> Instead let's map memory explicitly so that things will continue to
> work when we move to a separate page table in the future.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/platform/efi/efi_64.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index c8b58ac47b77..634536034e32 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
> int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> {
> unsigned long pfn, text;
> + efi_memory_desc_t *md;
> struct page *page;
> unsigned npages;
> pgd_t *pgd;
> @@ -200,6 +201,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>
> efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
>
> + /*
> + * Map all of RAM so that we can access arguments in the 1:1
Why "all of RAM"?
> + * mapping when making EFI runtime calls.
> + */
> + for_each_efi_memory_desc(&memmap, md) {
> + if (md->type != EFI_CONVENTIONAL_MEMORY &&
> + md->type != EFI_LOADER_DATA &&
> + md->type != EFI_LOADER_CODE)
That's mapping all those EFI_* types...
> + continue;
> +
> + pfn = md->phys_addr >> PAGE_SHIFT;
> + npages = md->num_pages;
> +
> + if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) {
> + pr_err("Failed to map 1:1 memory\n");
> + return 1;
> + }
> + }
> +
> return 0;
> }
>
> --
> 2.6.2
>
>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Thu, Nov 12, 2015 at 03:40:23PM +0000, Matt Fleming wrote:
> Make it clear that the EFI page tables are only available during EFI
> runtime calls since that subject has come up a fair numbers of times
> in the past.
>
> Additionally, add the EFI region start and end addresses to the table
> so that it's possible to see at a glance where they fall in relation
> to other regions.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Dave Jones <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Denys Vlasenko <[email protected]>,
> Cc: Stephen Smalley <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> Documentation/x86/x86_64/mm.txt | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
Reviewed-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Thu, Nov 12, 2015 at 03:40:22PM +0000, Matt Fleming wrote:
> With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all users
> booting on 64-bit UEFI machines see the following warning,
>
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780()
> x86/mm: Found insecure W+X mapping at address ffff88000005f000/0xffff88000005f000
> ...
> x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found.
> ...
>
> This is caused by mapping EFI regions with RWX permissions. There
> isn't much we can do to restrict the permissions for these regions due
> to the way the firmware toolchains mix code and data, but we can at
> least isolate these mappings so that they do not appear in the regular
> kernel page tables.
>
> In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping")
> we started using 'trampoline_pgd' to map the EFI regions because there
> was an existing identity mapping there which we use during the
> SetVirtualAddressMap() call and for broken firmware that accesses
> those addresses.
>
> But 'trampoline_pgd' shares some PGD entries with 'swapper_pg_dir' and
> does not provide the isolation we require. Notaby the virtual address
Notably
> for __START_KERNEL_map and MODULES_START are mapped by the same PGD
> entry so we need to be more careful when copying changes over in
> efi_sync_low_kernel_mappings().
>
> This patch doesn't go the full mile, we still want to share some PGD
> entries with 'swapper_pg_dir'. Having completely separate page tables
> brings its own issues such as synchronising new mappings after memory
> hotplug and module loading. Sharing also keeps memory usage down.
Cool idea!
...
> @@ -831,6 +831,12 @@ static void __init __efi_enter_virtual_mode(void)
> efi.systab = NULL;
>
> efi_merge_regions();
> + if (efi_alloc_page_tables()) {
> + pr_err("Failed to allocate EFI page tables\n");
> + clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
> + return;
> + }
This should happen before efi_merge_regions() - no need to merge if we
can't alloc PGT.
> +
> new_memmap = efi_map_regions(&count, &pg_shift);
> if (!new_memmap) {
> pr_err("Error reallocating memory, EFI runtime non-functional!\n");
> @@ -888,29 +894,12 @@ static void __init __efi_enter_virtual_mode(void)
>
> efi_runtime_mkexec();
>
> - /*
> - * We mapped the descriptor array into the EFI pagetable above but we're
> - * not unmapping it here. Here's why:
> - *
> - * We're copying select PGDs from the kernel page table to the EFI page
> - * table and when we do so and make changes to those PGDs like unmapping
> - * stuff from them, those changes appear in the kernel page table and we
> - * go boom.
> - *
> - * From setup_real_mode():
> - *
> - * ...
> - * trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
> - *
> - * In this particular case, our allocation is in PGD 0 of the EFI page
> - * table but we've copied that PGD from PGD[272] of the EFI page table:
> - *
> - * pgd_index(__PAGE_OFFSET = 0xffff880000000000) = 272
> - *
> - * where the direct memory mapping in kernel space is.
> - *
> - * new_memmap's VA comes from that direct mapping and thus clearing it,
> - * it would get cleared in the kernel page table too.
> + /*
ERROR: code indent should use tabs where possible
#149: FILE: arch/x86/platform/efi/efi.c:897:
+ /*$
...
> void efi_sync_low_kernel_mappings(void)
> {
> - unsigned num_pgds;
> - pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
> + unsigned num_entries;
> + pgd_t *pgd_k, *pgd_efi;
> + pud_t *pud_k, *pud_efi;
>
> if (efi_enabled(EFI_OLD_MEMMAP))
> return;
>
> - num_pgds = pgd_index(MODULES_END - 1) - pgd_index(PAGE_OFFSET);
> + /*
> + * We can share all PGD entries apart from the one entry that
> + * covers the EFI runtime mapping space.
> + *
> + * Make sure the EFI runtime region mappings are guaranteed to
> + * only span a single PGD entry and that the entry also maps
> + * other important kernel regions.
> + */
> + BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
> + BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
> + (EFI_VA_END & PGDIR_MASK));
You can align them in a more readable way:
BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
(EFI_VA_END & PGDIR_MASK));
> +
> + pgd_efi = efi_pgd + pgd_index(PAGE_OFFSET);
> + pgd_k = pgd_offset_k(PAGE_OFFSET);
> +
> + num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET);
> + memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries);
>
> - memcpy(pgd + pgd_index(PAGE_OFFSET),
> - init_mm.pgd + pgd_index(PAGE_OFFSET),
> - sizeof(pgd_t) * num_pgds);
> + /*
> + * We share all the PUD entries apart from those that map the
> + * EFI regions. Copy around them.
> + */
> + BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0);
> + BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
> +
> + pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
> + pud_efi = pud_offset(pgd_efi, 0);
> +
> + pgd_k = pgd_offset_k(EFI_VA_END);
> + pud_k = pud_offset(pgd_k, 0);
> +
> + num_entries = pud_index(EFI_VA_END);
> + memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
> +
> + pud_efi = pud_offset(pgd_efi, EFI_VA_START);
> + pud_k = pud_offset(pgd_k, EFI_VA_START);
> +
> + num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
> + memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
> }
On Thu, Nov 12, 2015 at 03:40:21PM +0000, Matt Fleming wrote:
> This change is a prerequisite for pending patches that switch to a
> dedicated EFI page table, instead of using 'trampoline_pgd' which
> shares PGD entries with 'swapper_pg_dir'. The pending patches make it
> impossible to dereference the runtime service function pointer without
> first switching %cr3.
>
> It's true that we now have duplicated switching code in
> efi_call_virt() and efi_call_phys_{prolog,epilog}() but we are
> sacrificing code duplication for a little more clarity and the ease of
> writing the page table switching code in C instead of asm.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Dave Jones <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Denys Vlasenko <[email protected]>,
> Cc: Stephen Smalley <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/include/asm/efi.h | 25 +++++++++++++++++++++
> arch/x86/platform/efi/efi_64.c | 24 ++++++++++-----------
> arch/x86/platform/efi/efi_stub_64.S | 43 -------------------------------------
> 3 files changed, 36 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index cfee9d4b02af..f9d99d4e7b1a 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -3,6 +3,7 @@
>
> #include <asm/fpu/api.h>
> #include <asm/pgtable.h>
> +#include <asm/tlb.h>
>
> /*
> * We map the EFI regions needed for runtime services non-contiguously,
> @@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
>
> #define efi_call_phys(f, args...) efi_call((f), args)
>
> +/*
> + * Scratch space used for switching the pagetable in the EFI stub
> + */
> +struct efi_scratch {
> + u64 r15;
> + u64 prev_cr3;
> + pgd_t *efi_pgt;
> + bool use_pgd;
> + u64 phys_stack;
> +} __packed;
> +
> #define efi_call_virt(f, ...) \
> ({ \
> efi_status_t __s; \
> @@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
> efi_sync_low_kernel_mappings(); \
> preempt_disable(); \
> __kernel_fpu_begin(); \
> + \
> + if (efi_scratch.use_pgd) { \
> + efi_scratch.prev_cr3 = read_cr3(); \
> + write_cr3((unsigned long)efi_scratch.efi_pgt); \
> + __flush_tlb_all(); \
> + } \
> + \
> __s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__); \
> + \
> + if (efi_scratch.use_pgd) { \
> + write_cr3(efi_scratch.prev_cr3); \
> + __flush_tlb_all(); \
> + } \
> + \
> __kernel_fpu_end(); \
> preempt_enable(); \
> __s; \
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 634536034e32..ab5f14a886cc 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -47,16 +47,7 @@
> */
> static u64 efi_va = EFI_VA_START;
>
> -/*
> - * Scratch space used for switching the pagetable in the EFI stub
> - */
> -struct efi_scratch {
> - u64 r15;
> - u64 prev_cr3;
> - pgd_t *efi_pgt;
> - bool use_pgd;
> - u64 phys_stack;
> -} __packed;
> +struct efi_scratch efi_scratch;
>
> static void __init early_code_mapping_set_exec(int executable)
> {
> @@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
> int pgd;
> int n_pgds;
>
> - if (!efi_enabled(EFI_OLD_MEMMAP))
> - return NULL;
> + if (!efi_enabled(EFI_OLD_MEMMAP)) {
> + save_pgd = (pgd_t *)read_cr3();
> + write_cr3((unsigned long)efi_scratch.efi_pgt);
> + goto out;
> + }
>
> early_code_mapping_set_exec(1);
>
So this one is called in phys_efi_set_virtual_address_map() like this:
----
save_pgd = efi_call_phys_prolog();
/* Disable interrupts around EFI calls: */
local_irq_save(flags);
<--- MARKER
status = efi_call_phys(efi_phys.set_virtual_address_map,
memory_map_size, descriptor_size,
descriptor_version, virtual_map);
local_irq_restore(flags);
efi_call_phys_epilog(save_pgd);
---
Now, if you look at MARKER, the asm looks like this here:
.loc 1 91 0
call efi_call_phys_prolog #
movq %rax, %r15 #, save_pgd
.file 6 "./arch/x86/include/asm/irqflags.h"
.loc 6 20 0
#APP
# 20 "./arch/x86/include/asm/irqflags.h" 1
# __raw_save_flags
pushf ; pop %r14 # flags
That PUSHF implicitly pushes on the stack pointed by %rsp. But(!) we
have switched the pagetable (i.e., %cr3 has efi_scratch.efi_pgt) and
we're pushing to the VA where the stack *was* but is not anymore.
Or maybe it is because you're copying all the PUDs. It is still not 100%
clean, IMHO.
Can you do the prolog/epilog calls inside the IRQs-off section?
Btw, it was crap like that why I wanted to do SWITCH_PGT in asm...
On Thu, Nov 12, 2015 at 03:40:21PM +0000, Matt Fleming wrote:
> This change is a prerequisite for pending patches that switch to a
> dedicated EFI page table, instead of using 'trampoline_pgd' which
> shares PGD entries with 'swapper_pg_dir'. The pending patches make it
> impossible to dereference the runtime service function pointer without
> first switching %cr3.
>
> It's true that we now have duplicated switching code in
> efi_call_virt() and efi_call_phys_{prolog,epilog}() but we are
> sacrificing code duplication for a little more clarity and the ease of
> writing the page table switching code in C instead of asm.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Dave Jones <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Denys Vlasenko <[email protected]>,
> Cc: Stephen Smalley <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/include/asm/efi.h | 25 +++++++++++++++++++++
> arch/x86/platform/efi/efi_64.c | 24 ++++++++++-----------
> arch/x86/platform/efi/efi_stub_64.S | 43 -------------------------------------
> 3 files changed, 36 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index cfee9d4b02af..f9d99d4e7b1a 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -3,6 +3,7 @@
>
> #include <asm/fpu/api.h>
> #include <asm/pgtable.h>
> +#include <asm/tlb.h>
>
> /*
> * We map the EFI regions needed for runtime services non-contiguously,
> @@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
>
> #define efi_call_phys(f, args...) efi_call((f), args)
>
> +/*
> + * Scratch space used for switching the pagetable in the EFI stub
> + */
> +struct efi_scratch {
> + u64 r15;
> + u64 prev_cr3;
> + pgd_t *efi_pgt;
> + bool use_pgd;
> + u64 phys_stack;
> +} __packed;
> +
> #define efi_call_virt(f, ...) \
> ({ \
> efi_status_t __s; \
> @@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
> efi_sync_low_kernel_mappings(); \
> preempt_disable(); \
> __kernel_fpu_begin(); \
> + \
> + if (efi_scratch.use_pgd) { \
> + efi_scratch.prev_cr3 = read_cr3(); \
> + write_cr3((unsigned long)efi_scratch.efi_pgt); \
> + __flush_tlb_all(); \
> + } \
> + \
> __s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__); \
> + \
> + if (efi_scratch.use_pgd) { \
> + write_cr3(efi_scratch.prev_cr3); \
> + __flush_tlb_all(); \
> + } \
> + \
> __kernel_fpu_end(); \
> preempt_enable(); \
> __s; \
checkpatch is bitching here - not that I agree with it:
WARNING: please, no space before tabs
#87: FILE: arch/x86/include/asm/efi.h:88:
+^I^Iefi_scratch.prev_cr3 = read_cr3(); ^I^I^I\$
WARNING: please, no space before tabs
#89: FILE: arch/x86/include/asm/efi.h:90:
+^I^I__flush_tlb_all(); ^I^I^I^I^I\$
WARNING: please, no space before tabs
#94: FILE: arch/x86/include/asm/efi.h:95:
+^Iif (efi_scratch.use_pgd) { ^I^I^I^I^I\$
WARNING: please, no space before tabs
#96: FILE: arch/x86/include/asm/efi.h:97:
+^I^I__flush_tlb_all(); ^I^I^I^I^I\$
WARNING: please, no space before tabs
#97: FILE: arch/x86/include/asm/efi.h:98:
+^I} ^I^I^I^I^I^I^I^I\$
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 634536034e32..ab5f14a886cc 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -47,16 +47,7 @@
> */
> static u64 efi_va = EFI_VA_START;
>
> -/*
> - * Scratch space used for switching the pagetable in the EFI stub
> - */
> -struct efi_scratch {
> - u64 r15;
> - u64 prev_cr3;
> - pgd_t *efi_pgt;
> - bool use_pgd;
> - u64 phys_stack;
> -} __packed;
> +struct efi_scratch efi_scratch;
>
> static void __init early_code_mapping_set_exec(int executable)
> {
> @@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
> int pgd;
> int n_pgds;
>
> - if (!efi_enabled(EFI_OLD_MEMMAP))
> - return NULL;
> + if (!efi_enabled(EFI_OLD_MEMMAP)) {
> + save_pgd = (pgd_t *)read_cr3();
> + write_cr3((unsigned long)efi_scratch.efi_pgt);
> + goto out;
> + }
>
> early_code_mapping_set_exec(1);
>
> @@ -96,6 +90,7 @@ pgd_t * __init efi_call_phys_prolog(void)
> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> }
> +out:
> __flush_tlb_all();
>
> return save_pgd;
> @@ -109,8 +104,11 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
There's a comment here:
/*
* After the lock is released, the original page table is restored.
*/
Which lock are we talking about?
> int pgd_idx;
> int nr_pgds;
>
> - if (!save_pgd)
> + if (!efi_enabled(EFI_OLD_MEMMAP)) {
> + write_cr3((unsigned long)save_pgd);
> + __flush_tlb_all();
> return;
> + }
>
> nr_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT) , PGDIR_SIZE);
>
On Thu, Nov 12, 2015 at 03:40:19PM +0000, Matt Fleming wrote:
> Removing the PAGE_NX bit from cpa->pfn will corrupt the page frame
> number rather than removing PAGE_NX as the code intends. This
> is unlikely to be a problem in practice because _PAGE_BIT_NX is bit 63
> and most machines do not have page frame numbers that reach that high.
>
> Still, pte flags are never stored in cpa->pfn so we can safely delete
> the code.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/mm/pageattr.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 893921b12272..d5240be55915 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -885,11 +885,6 @@ static void populate_pte(struct cpa_data *cpa,
> pte = pte_offset_kernel(pmd, start);
>
> while (num_pages-- && start < end) {
> -
> - /* deal with the NX bit */
> - if (!(pgprot_val(pgprot) & _PAGE_NX))
> - cpa->pfn &= ~_PAGE_NX;
> -
> set_pte(pte, pfn_pte(cpa->pfn, pgprot));
>
> start += PAGE_SIZE;
> --
I think this should be part of the 1st patch because there you're
correcting ->pfn to actually be a pfn.
On Thu, Nov 12, 2015 at 03:40:18PM +0000, Matt Fleming wrote:
> The x86 pageattr code is confused about the data that is stored in
> cpa->pfn; sometimes it's treated as a page frame number and sometimes
> it's treated as an unshifted physical address.
>
> The result of this is that the mapping functions do not map the
> intended physical address.
>
> This isn't a problem in practice because most of the addresses we're
> mapping in the EFI code paths are already mapped in 'trampoline_pgd'
> and so the pageattr mappings functions don't actually do anything in
> this case. But when we move to using a separate page table for the EFI
> runtime this will be an issue.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Sai Praneeth Prakhya <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/mm/pageattr.c | 12 ++++++------
> arch/x86/platform/efi/efi_64.c | 33 ++++++++++++++++++++++-----------
> 2 files changed, 28 insertions(+), 17 deletions(-)
Reviewed-by: Borislav Petkov <[email protected]>
On Thu, 12 Nov, at 07:47:30PM, Borislav Petkov wrote:
> On Thu, Nov 12, 2015 at 03:40:19PM +0000, Matt Fleming wrote:
> > Removing the PAGE_NX bit from cpa->pfn will corrupt the page frame
> > number rather than removing PAGE_NX as the code intends. This
> > is unlikely to be a problem in practice because _PAGE_BIT_NX is bit 63
> > and most machines do not have page frame numbers that reach that high.
> >
> > Still, pte flags are never stored in cpa->pfn so we can safely delete
> > the code.
> >
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Sai Praneeth Prakhya <[email protected]>
> > Signed-off-by: Matt Fleming <[email protected]>
> > ---
> > arch/x86/mm/pageattr.c | 5 -----
> > 1 file changed, 5 deletions(-)
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index 893921b12272..d5240be55915 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -885,11 +885,6 @@ static void populate_pte(struct cpa_data *cpa,
> > pte = pte_offset_kernel(pmd, start);
> >
> > while (num_pages-- && start < end) {
> > -
> > - /* deal with the NX bit */
> > - if (!(pgprot_val(pgprot) & _PAGE_NX))
> > - cpa->pfn &= ~_PAGE_NX;
> > -
> > set_pte(pte, pfn_pte(cpa->pfn, pgprot));
> >
> > start += PAGE_SIZE;
> > --
>
> I think this should be part of the 1st patch because there you're
> correcting ->pfn to actually be a pfn.
OK, that's fine by me. I split this out into a separate patch so that
this single logical change could be debated independently of the other
->pfn change, and because I was less certain on this one.
I'll fold it into the first patch.
On Thu, 12 Nov, at 07:01:26PM, Borislav Petkov wrote:
> On Thu, Nov 12, 2015 at 03:40:20PM +0000, Matt Fleming wrote:
> > We are relying on the pre-existing mappings in 'trampoline_pgd' when
> > accessing function arguments in the EFI mixed mode thunking code.
> >
> > Instead let's map memory explicitly so that things will continue to
> > work when we move to a separate page table in the future.
> >
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Sai Praneeth Prakhya <[email protected]>
> > Signed-off-by: Matt Fleming <[email protected]>
> > ---
> > arch/x86/platform/efi/efi_64.c | 20 ++++++++++++++++++++
> > 1 file changed, 20 insertions(+)
> >
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index c8b58ac47b77..634536034e32 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
> > int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> > {
> > unsigned long pfn, text;
> > + efi_memory_desc_t *md;
> > struct page *page;
> > unsigned npages;
> > pgd_t *pgd;
> > @@ -200,6 +201,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> >
> > efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
> >
> > + /*
> > + * Map all of RAM so that we can access arguments in the 1:1
>
> Why "all of RAM"?
Because the pointers to arguments we pass to the thunk code can point
anywhere in RAM.
> > + * mapping when making EFI runtime calls.
> > + */
> > + for_each_efi_memory_desc(&memmap, md) {
> > + if (md->type != EFI_CONVENTIONAL_MEMORY &&
> > + md->type != EFI_LOADER_DATA &&
> > + md->type != EFI_LOADER_CODE)
>
> That's mapping all those EFI_* types...
Hmm? This is mapping only those regions that are usable by the kernel
for general allocations. See the table in setup_e820() in the EFI boot
stub code - these types map to E820_RAM.
On Thu, 12 Nov, at 07:44:32PM, Borislav Petkov wrote:
>
> So this one is called in phys_efi_set_virtual_address_map() like this:
>
> ----
> save_pgd = efi_call_phys_prolog();
>
> /* Disable interrupts around EFI calls: */
> local_irq_save(flags);
>
> <--- MARKER
>
> status = efi_call_phys(efi_phys.set_virtual_address_map,
> memory_map_size, descriptor_size,
> descriptor_version, virtual_map);
> local_irq_restore(flags);
>
> efi_call_phys_epilog(save_pgd);
> ---
>
>
> Now, if you look at MARKER, the asm looks like this here:
>
> .loc 1 91 0
> call efi_call_phys_prolog #
> movq %rax, %r15 #, save_pgd
>
> .file 6 "./arch/x86/include/asm/irqflags.h"
> .loc 6 20 0
> #APP
> # 20 "./arch/x86/include/asm/irqflags.h" 1
> # __raw_save_flags
> pushf ; pop %r14 # flags
>
>
> That PUSHF implicitly pushes on the stack pointed by %rsp. But(!) we
> have switched the pagetable (i.e., %cr3 has efi_scratch.efi_pgt) and
> we're pushing to the VA where the stack *was* but is not anymore.
All the kernel mappings will still exist in the page table we switch
to, so pushing to the stack should be fine.
The mappings have to exist so that the firmware can dereference
pointer arguments, e.g. when writing variable data to a kernel buffer.
Or have I misunderstood your point?
> Or maybe it is because you're copying all the PUDs. It is still not 100%
> clean, IMHO.
>
> Can you do the prolog/epilog calls inside the IRQs-off section?
Not really because in the efi_enabled(EFI_OLD_MEMMAP) case we perform
kmalloc(), see commit 23a0d4e8fa6d ("efi: Disable interrupts around
EFI calls, not in the epilog/prolog calls").
On Thu, 12 Nov, at 07:47:14PM, Borislav Petkov wrote:
>
> checkpatch is bitching here - not that I agree with it:
>
> WARNING: please, no space before tabs
> #87: FILE: arch/x86/include/asm/efi.h:88:
> +^I^Iefi_scratch.prev_cr3 = read_cr3(); ^I^I^I\$
>
> WARNING: please, no space before tabs
> #89: FILE: arch/x86/include/asm/efi.h:90:
> +^I^I__flush_tlb_all(); ^I^I^I^I^I\$
>
> WARNING: please, no space before tabs
> #94: FILE: arch/x86/include/asm/efi.h:95:
> +^Iif (efi_scratch.use_pgd) { ^I^I^I^I^I\$
>
> WARNING: please, no space before tabs
> #96: FILE: arch/x86/include/asm/efi.h:97:
> +^I^I__flush_tlb_all(); ^I^I^I^I^I\$
>
> WARNING: please, no space before tabs
> #97: FILE: arch/x86/include/asm/efi.h:98:
> +^I} ^I^I^I^I^I^I^I^I\$
Crap. These look legit, I'll fix this up in v2.
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index 634536034e32..ab5f14a886cc 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -47,16 +47,7 @@
> > */
> > static u64 efi_va = EFI_VA_START;
> >
> > -/*
> > - * Scratch space used for switching the pagetable in the EFI stub
> > - */
> > -struct efi_scratch {
> > - u64 r15;
> > - u64 prev_cr3;
> > - pgd_t *efi_pgt;
> > - bool use_pgd;
> > - u64 phys_stack;
> > -} __packed;
> > +struct efi_scratch efi_scratch;
> >
> > static void __init early_code_mapping_set_exec(int executable)
> > {
> > @@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
> > int pgd;
> > int n_pgds;
> >
> > - if (!efi_enabled(EFI_OLD_MEMMAP))
> > - return NULL;
> > + if (!efi_enabled(EFI_OLD_MEMMAP)) {
> > + save_pgd = (pgd_t *)read_cr3();
> > + write_cr3((unsigned long)efi_scratch.efi_pgt);
> > + goto out;
> > + }
> >
> > early_code_mapping_set_exec(1);
> >
> > @@ -96,6 +90,7 @@ pgd_t * __init efi_call_phys_prolog(void)
> > vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> > set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> > }
> > +out:
> > __flush_tlb_all();
> >
> > return save_pgd;
> > @@ -109,8 +104,11 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
>
> There's a comment here:
>
> /*
> * After the lock is released, the original page table is restored.
> */
>
> Which lock are we talking about?
No idea, we don't take any locks. Looks like a stale comment.
On Thu, 12 Nov, at 07:38:13PM, Borislav Petkov wrote:
> > @@ -831,6 +831,12 @@ static void __init __efi_enter_virtual_mode(void)
> > efi.systab = NULL;
> >
> > efi_merge_regions();
> > + if (efi_alloc_page_tables()) {
> > + pr_err("Failed to allocate EFI page tables\n");
> > + clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
> > + return;
> > + }
>
> This should happen before efi_merge_regions() - no need to merge if we
> can't alloc PGT.
Fair point.
> > +
> > new_memmap = efi_map_regions(&count, &pg_shift);
> > if (!new_memmap) {
> > pr_err("Error reallocating memory, EFI runtime non-functional!\n");
> > @@ -888,29 +894,12 @@ static void __init __efi_enter_virtual_mode(void)
> >
> > efi_runtime_mkexec();
> >
> > - /*
> > - * We mapped the descriptor array into the EFI pagetable above but we're
> > - * not unmapping it here. Here's why:
> > - *
> > - * We're copying select PGDs from the kernel page table to the EFI page
> > - * table and when we do so and make changes to those PGDs like unmapping
> > - * stuff from them, those changes appear in the kernel page table and we
> > - * go boom.
> > - *
> > - * From setup_real_mode():
> > - *
> > - * ...
> > - * trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
> > - *
> > - * In this particular case, our allocation is in PGD 0 of the EFI page
> > - * table but we've copied that PGD from PGD[272] of the EFI page table:
> > - *
> > - * pgd_index(__PAGE_OFFSET = 0xffff880000000000) = 272
> > - *
> > - * where the direct memory mapping in kernel space is.
> > - *
> > - * new_memmap's VA comes from that direct mapping and thus clearing it,
> > - * it would get cleared in the kernel page table too.
> > + /*
>
> ERROR: code indent should use tabs where possible
> #149: FILE: arch/x86/platform/efi/efi.c:897:
> + /*$
>
> ...
Dammit vim. I'll fix this.
> > void efi_sync_low_kernel_mappings(void)
> > {
> > - unsigned num_pgds;
> > - pgd_t *pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
> > + unsigned num_entries;
> > + pgd_t *pgd_k, *pgd_efi;
> > + pud_t *pud_k, *pud_efi;
> >
> > if (efi_enabled(EFI_OLD_MEMMAP))
> > return;
> >
> > - num_pgds = pgd_index(MODULES_END - 1) - pgd_index(PAGE_OFFSET);
> > + /*
> > + * We can share all PGD entries apart from the one entry that
> > + * covers the EFI runtime mapping space.
> > + *
> > + * Make sure the EFI runtime region mappings are guaranteed to
> > + * only span a single PGD entry and that the entry also maps
> > + * other important kernel regions.
> > + */
> > + BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
> > + BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
> > + (EFI_VA_END & PGDIR_MASK));
>
> You can align them in a more readable way:
>
> BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
> (EFI_VA_END & PGDIR_MASK));
Sure.
On Thu, Nov 12, 2015 at 08:01:08PM +0000, Matt Fleming wrote:
> > That PUSHF implicitly pushes on the stack pointed by %rsp. But(!) we
> > have switched the pagetable (i.e., %cr3 has efi_scratch.efi_pgt) and
> > we're pushing to the VA where the stack *was* but is not anymore.
>
> All the kernel mappings will still exist in the page table we switch
> to, so pushing to the stack should be fine.
>
> The mappings have to exist so that the firmware can dereference
> pointer arguments, e.g. when writing variable data to a kernel buffer.
>
> Or have I misunderstood your point?
>
> > Or maybe it is because you're copying all the PUDs. It is still not 100%
> > clean, IMHO.
^^^^^^^^^^^^^^^^^
I think we're on the same page - you're copying the PUDs in
efi_sync_low_kernel_mappings() so the stack should be there.
> > Can you do the prolog/epilog calls inside the IRQs-off section?
>
> Not really because in the efi_enabled(EFI_OLD_MEMMAP) case we perform
> kmalloc(), see commit 23a0d4e8fa6d ("efi: Disable interrupts around
> EFI calls, not in the epilog/prolog calls").
n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
save_pgd = kmalloc(n_pgds * sizeof(pgd_t), GFP_KERNEL);
This?
That n_pgds thing is of static size so you can pre-alloc it maybe even
once during boot and reuse it all the time when EFI_OLD_MEMMAP is
enabled.
* Matt Fleming <[email protected]> wrote:
> +We map EFI runtime services in the efi_pgd PGD in the virtual range of
> +64Gb (arbitrarily set, can be raised if needed). The mappings are not
> +part of any other kernel PGD and are only available during EFI runtime
> +calls.
Is that virtual address range 0-64Gb, i.e.:
0x00000000.00000000 - 0x00000010.00000000
or is it somewhere else?
Thanks,
Ingo
On Fri, 13 Nov, at 10:22:10AM, Ingo Molnar wrote:
>
> * Matt Fleming <[email protected]> wrote:
>
> > +We map EFI runtime services in the efi_pgd PGD in the virtual range of
> > +64Gb (arbitrarily set, can be raised if needed). The mappings are not
> > +part of any other kernel PGD and are only available during EFI runtime
> > +calls.
>
> Is that virtual address range 0-64Gb, i.e.:
>
> 0x00000000.00000000 - 0x00000010.00000000
>
> or is it somewhere else?
You've snipped the patch hunk that gives the address range used,
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 05712ac83e38..a9885bb1ac22 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -16,6 +16,8 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
+ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
+... unused hole ...
 ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
 ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
 ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
On Fri, 13 Nov, at 08:59:43AM, Borislav Petkov wrote:
> On Thu, Nov 12, 2015 at 08:01:08PM +0000, Matt Fleming wrote:
> > > That PUSHF implicitly pushes on the stack pointed by %rsp. But(!) we
> > > have switched the pagetable (i.e., %cr3 has efi_scratch.efi_pgt) and
> > > we're pushing to the VA where the stack *was* but is not anymore.
> >
> > All the kernel mappings will still exist in the page table we switch
> > to, so pushing to the stack should be fine.
> >
> > The mappings have to exist so that the firmware can dereference
> > pointer arguments, e.g. when writing variable data to a kernel buffer.
> >
> > Or have I misunderstood your point?
> >
> > > Or maybe it is because you're copying all the PUDs. It is still not 100%
> > > clean, IMHO.
> ^^^^^^^^^^^^^^^^^
>
> I think we're on the same page - you're copying the PUDs in
> efi_sync_low_kernel_mappings() so the stack should be there.
Correct.
> > > Can you do the prolog/epilog calls inside the IRQs-off section?
> >
> > Not really because in the efi_enabled(EFI_OLD_MEMMAP) case we perform
> > kmalloc(), see commit 23a0d4e8fa6d ("efi: Disable interrupts around
> > EFI calls, not in the epilog/prolog calls").
>
> n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
> save_pgd = kmalloc(n_pgds * sizeof(pgd_t), GFP_KERNEL);
>
> This?
>
> That n_pgds thing is of static size so you can pre-alloc it maybe even
> once during boot and reuse it all the time when EFI_OLD_MEMMAP is
> enabled.
Well, this code is only executed once during boot anyway, for
phys_efi_set_virtual_address_map().
FYI, I'm still planning on ripping out all the EFI_OLD_MEMMAP code, as
it's getting particularly crufty.
On Fri, Nov 13, 2015 at 1:29 AM, Matt Fleming <[email protected]> wrote:
> On Fri, 13 Nov, at 10:22:10AM, Ingo Molnar wrote:
>
> You've snipped the patch hunk that gives the address range used,
I'm actually wondering if we should strive to make the UEFI stuff more
like a user process, and just map the UEFI mappings in low memory in
that magic UEFI address space.
We won't be able to run those things *as* user space, since I assume
the code will want to do a lot of kernely things, but shouldn't we aim
to make it look as much like that as possible? Maybe some day we could
even strive to run it in some controlled environment (ie user space
with fixups, virtual machine, whatever), but even if we never get
there it sounds like a potentially good idea to try to set up the
mappings to move in that direction..
No big hurry, and maybe there are good reasons not to go that way. The
first step is indeed just to get rid of the WX mappings in the normal
kernel page tables.
Linus
On Fri, 13 Nov, at 08:42:54AM, Linus Torvalds wrote:
> On Fri, Nov 13, 2015 at 1:29 AM, Matt Fleming <[email protected]> wrote:
> > On Fri, 13 Nov, at 10:22:10AM, Ingo Molnar wrote:
> >
> > You've snipped the patch hunk that gives the address range used,
>
> I'm actually wondering if we should strive to make the UEFI stuff more
> like a user process, and just map the UEFI mappings in low memory in
> that magic UEFI address space.
We map things in the user address space now, but only for the purposes
of having an identity mapping, for the reasons I mentioned previously:
coping with buggy firmware accesses and making the
SetVirtualAddressMap() call [1]. Importantly, the kernel does not
access the identity mapping directly.
So if we were to repurpose the user address space it would make sense
to just have the identity mapping be the one and only mapping.
However, going through the identity addresses to invoke EFI runtime
services is known to break some Apple Macs. It's probably worth
revisiting this issue, though I don't have any further details.
Having a separate mapping in the user address space that isn't the
identity mapping is also possible of course.
> We won't be able to run those things *as* user space, since I assume
> the code will want to do a lot of kernely things, but shouldn't we aim
> to make it look as much like that as possible? Maybe some day we could
> even strive to run it in some controlled environment (ie user space
> with fixups, virtual machine, whatever), but even if we never get
> there it sounds like a potentially good idea to try to set up the
> mappings to move in that direction..
It would be interesting to see how far we could push this, say, using
SMAP/SMEP to further isolate what kernel pieces the firmware can
touch. It's not about security guarantees since most of the firmware
functionality is implemented in SMM today for x86, but it does go some
way towards providing protection from unintended accesses.
> No big hurry, and maybe there are good reasons not to go that way. The
> first step is indeed just to get rid of the WX mappings in the normal
> kernel page tables.
I think it's worth exploring.
[1] Oh, and also for the EFI mixed mode code (running 64-bit kernels
on 32-bit EFI), but fewer people tend to care about that ;-)
* Matt Fleming <[email protected]> wrote:
> On Fri, 13 Nov, at 10:22:10AM, Ingo Molnar wrote:
> >
> > * Matt Fleming <[email protected]> wrote:
> >
> > > +We map EFI runtime services in the efi_pgd PGD in the virtual range of
> > > +64Gb (arbitrarily set, can be raised if needed). The mappings are not
> > > +part of any other kernel PGD and are only available during EFI runtime
> > > +calls.
> >
> > Is that virtual address range 0-64Gb, i.e.:
> >
> > 0x00000000.00000000 - 0x00000010.00000000
> >
> > or is it somewhere else?
>
> You've snipped the patch hunk that gives the address range used,
>
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 05712ac83e38..a9885bb1ac22 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -16,6 +16,8 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
>  ... unused hole ...
>  ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
>  ... unused hole ...
> +ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
> +... unused hole ...
>  ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
>  ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
>  ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
Ah yes - but then the text is misleading, what does 'in the virtual range of 64Gb'
mean? A virtual memory range is a specific range of addresses - like in the table
you extended.
A better phrasing would be something like:
We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual memory
window (this size is arbitrary, it can be raised later if needed).
Agreed?
Thanks,
Ingo
On Wed, 18 Nov, at 09:18:59AM, Ingo Molnar wrote:
>
> Ah yes - but then the text is misleading, what does 'in the virtual range of 64Gb'
> mean? A virtual memory range is a specific range of addresses - like in the table
> you extended.
>
> A better phrasing would be something like:
>
> We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual memory
> window (this size is arbitrary, it can be raised later if needed).
>
> Agreed?
Sounds OK to me.