2018-01-05 04:16:35

by Yisheng Xie

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

Hi Dava,

On 2017/11/23 8:34, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> These patches are based on work from a team at Graz University of
> Technology: https://github.com/IAIK/KAISER . This work would not have
> been possible without their work as a starting point.
>
> KAISER is a countermeasure against side channel attacks against kernel
> virtual memory. It leaves the existing page tables largely alone and
> refers to them as the "kernel page tables. It adds a "shadow" pgd for
> every process which is intended for use when running userspace. The
> shadow pgd maps all the same user memory as the "kernel" copy, but
> only maps a minimal set of kernel memory.
>
> Whenever entering the kernel (syscalls, interrupts, exceptions), the
> pgd is switched to the "kernel" copy. When switching back to user
> mode, the shadow pgd is used.
>
> The minimalistic kernel page tables try to map only what is needed to
> enter/exit the kernel such as the entry/exit functions themselves and
> the interrupt descriptors (IDT).
>
> === Page Table Poisoning ===
>
> KAISER has two copies of the page tables: one for the kernel and
> one for when running in userspace.

So, we have 2 page table, thinking about this case:
If _ONE_ process includes _TWO_ threads, one run in user space, the other
run in kernel, they can run in one core with Hyper-Threading, right? So both
userspace and kernel space is valid, right? And for one core with
Hyper-Threading, they may share TLB, so the timing problem described in
the paper may still exist?

Can this case still be protected by KAISER?

Thanks
Yisheng

> There is also a kernel
> portion of each of the page tables: the part that *maps* the
> kernel.
>
> The kernel portion is relatively static and uses pre-populated
> PGDs. Nobody ever calls set_pgd() on the kernel portion during
> normal operation.
>
> The userspace portion of the page tables is updated frequently as
> userspace pages are mapped and page table pages are allocated.
> These updates of the userspace *portion* of the tables need to be
> reflected into both the kernel and user/shadow copies.
>
> The original KAISER patches did this by effectively looking at the
> address that is being updated. If it is <PAGE_OFFSET, it is
> considered to be doing an update for the userspace portion of the page
> tables and must make an entry in the shadow.
>
> However, this has a wrinkle: there are a few places where low
> addresses are used in supervisor (kernel) mode. When EFI calls
> are made, they use what are traditionally user addresses in
> supervisor mode and trip over these checks. The trampoline code
> that used for booting secondary CPUs has a similar issue.
>
> Remember, there are two things that KAISER needs performed on a
> userspace PGD:
>
> 1. Populate the shadow itself
> 2. Poison the kernel PGD so it can not be used by userspace.
>
> Only perform these actions when dealing with a user address *and* the
> PGD has _PAGE_USER set. That way, in-kernel users of low addresses
> typically used by userspace are not accidentally poisoned.
>



2018-01-05 05:18:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>> === Page Table Poisoning ===
>>
>> KAISER has two copies of the page tables: one for the kernel and
>> one for when running in userspace.
>
> So, we have 2 page table, thinking about this case:
> If _ONE_ process includes _TWO_ threads, one run in user space, the other
> run in kernel, they can run in one core with Hyper-Threading, right?

Yes.

> So both userspace and kernel space is valid, right? And for one core
> with Hyper-Threading, they may share TLB, so the timing problem
> described in the paper may still exist?

No. The TLB is managed per logical CPU (hyperthread), as is the CR3
register that points to the page tables. Two threads running the same
process might use the same CR3 _value_, but that does not mean they
share TLB entries.

One thread *can* be in the kernel with the kernel page tables while the
other is in userspace with the user page tables active. They will even
use a different PCID/ASID for the same page tables normally.

> Can this case still be protected by KAISER?

Yes.

2018-01-05 06:17:00

by Yisheng Xie

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

Hi Dave,

On 2018/1/5 13:18, Dave Hansen wrote:
> On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>>> === Page Table Poisoning ===
>>>
>>> KAISER has two copies of the page tables: one for the kernel and
>>> one for when running in userspace.
>>
>> So, we have 2 page table, thinking about this case:
>> If _ONE_ process includes _TWO_ threads, one run in user space, the other
>> run in kernel, they can run in one core with Hyper-Threading, right?
>
> Yes.
>
>> So both userspace and kernel space is valid, right? And for one core
>> with Hyper-Threading, they may share TLB, so the timing problem
>> described in the paper may still exist?
>
> No. The TLB is managed per logical CPU (hyperthread), as is the CR3
> register that points to the page tables. Two threads running the same
> process might use the same CR3 _value_, but that does not mean they
> share TLB entries.

Get it, and thanks for your explain.

BTW, we have just reported a bug caused by kaiser[1], which looks like
caused by SMEP. Could you please help to have a look?

[1] https://lkml.org/lkml/2018/1/5/3

Thanks
Yisheng

>
> One thread *can* be in the kernel with the kernel page tables while the
> other is in userspace with the user page tables active. They will even
> use a different PCID/ASID for the same page tables normally.
>
>> Can this case still be protected by KAISER?
>
> Yes.
>
> .
>

2018-01-05 06:29:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/04/2018 10:16 PM, Yisheng Xie wrote:
> BTW, we have just reported a bug caused by kaiser[1], which looks like
> caused by SMEP. Could you please help to have a look?
>
> [1] https://lkml.org/lkml/2018/1/5/3

Please report that to your kernel vendor. Your EFI page tables have the
NX bit set on the low addresses. There have been a bunch of iterations
of this, but you need to make sure that the EFI kernel mappings don't
get _PAGE_NX set on them. Look at what __pti_set_user_pgd() does in
mainline.

2018-01-05 11:49:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

Hi Yisheng and Dave,

On Thu, Jan 04, 2018 at 10:29:53PM -0800, Dave Hansen wrote:
> On 01/04/2018 10:16 PM, Yisheng Xie wrote:
> > BTW, we have just reported a bug caused by kaiser[1], which looks like
> > caused by SMEP. Could you please help to have a look?
> >
> > [1] https://lkml.org/lkml/2018/1/5/3
>
> Please report that to your kernel vendor. Your EFI page tables have the
> NX bit set on the low addresses. There have been a bunch of iterations
> of this, but you need to make sure that the EFI kernel mappings don't
> get _PAGE_NX set on them. Look at what __pti_set_user_pgd() does in
> mainline.

Yisheng could you file a report on the vendor bz?

>From my part of course I'm fine to discuss it here, but it's not fair
to use lkml bandwidth for this, sorry for the noise.

The vast majority of the hardware boots fine and isn't running into
this. This is the first time I hear about this, sorry about that.

I fixed it with the upstream solution, greatly appreciated the pointer
Dave. I don't have hardware to verify it though so we've to follow up
on bz.

Thanks,
Andrea

>From 74e2d799b7c22f00a8d3158958e3d6d9fa45c1d2 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Fri, 5 Jan 2018 11:39:40 +0100
Subject: [RHEL7.5 PATCH 1/1] x86/pti/mm: don't set NX on EFI mapping without
_PAGE_USER

The kernel must be able to execute EFI code in userland (positive
virtual address space) without _PAGE_USER set, so don't set NX on
it. This only selectively disables the NX poisoning in kernel pgd so
there's no effect whatsoever on the page table isolation from userland
point of view.

Solves this crash at boot:

[ 0.039130] BUG: unable to handle kernel paging request at 000000005b835f90
[ 0.046101] IP: [<000000005b835f90>] 0x5b835f8f
[ 0.050637] PGD 8000000001f61067 PUD 190ffefff067 PMD 190ffeffd067 PTE 5b835063
[ 0.057989] Oops: 0011 [#1] SMP
[ 0.061241] Modules linked in:
[ 0.064304] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-327.59.59.46.h42.x86_64 #1
[ 0.072280] Hardware name: Huawei FusionServer9032/IT91SMUB, BIOS BLXSV316 11/14/2017
[ 0.080082] task: ffffffff8196e440 ti: ffffffff81958000 task.ti: ffffffff81958000
[ 0.087539] RIP: 0010:[<000000005b835f90>] [<000000005b835f90>] 0x5b835f8f
[ 0.094494] RSP: 0000:ffffffff8195be28 EFLAGS: 00010046
[ 0.099788] RAX: 0000000080050033 RBX: ffff910fbc802000 RCX: 00000000000002d0
[ 0.106897] RDX: 0000000000000030 RSI: 00000000000002d0 RDI: 000000005b835f90
[ 0.114006] RBP: ffffffff8195bf38 R08: 0000000000000001 R09: 0000090fbc802000
[ 0.121116] R10: ffff88ffbcc07340 R11: 0000000000000001 R12: 0000000000000001
[ 0.128225] R13: 0000090fbc802000 R14: 00000000000002d0 R15: 0000000000000001
[ 0.135336] FS: 0000000000000000(0000) GS:ffffc90000000000(0000) knlGS:0000000000000000
[ 0.143398] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.149124] CR2: 000000005b835f90 CR3: 0000000001966000 CR4: 00000000000606b0
[ 0.156234] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.163344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.170454] Call Trace:
[ 0.172899] [<ffffffff8107512c>] ? efi_call4+0x6c/0xf0
[ 0.178108] [<ffffffff8105b3fe>] ? native_flush_tlb_global+0x8e/0xc0
[ 0.184527] [<ffffffff810652b3>] ? set_memory_x+0x43/0x50
[ 0.189997] [<ffffffff81acf91f>] ? efi_enter_virtual_mode+0x3bc/0x538
[ 0.196505] [<ffffffff81ab104b>] start_kernel+0x39f/0x44f
[ 0.201972] [<ffffffff81ab0ab5>] ? repair_env_string+0x5c/0x5c
[ 0.207872] [<ffffffff81ab0120>] ? early_idt_handlers+0x120/0x120
[ 0.214030] [<ffffffff81ab066c>] x86_64_start_reservations+0x2a/0x2c
[ 0.220449] [<ffffffff81ab07c0>] x86_64_start_kernel+0x152/0x175
[ 0.226521] Code: Bad RIP value.
[ 0.229860] RIP [<000000005b835f90>] 0x5b835f8f
[ 0.234478] RSP <ffffffff8195be28>
[ 0.237955] CR2: 000000005b835f90
[ 0.241266] ---[ end trace 8178226af3e802ca ]---
[ 0.245869] Kernel panic - not syncing: Fatal exception

Reported-by: Yisheng Xie <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 7c8bc5c23664..132176fe45e2 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -189,28 +189,34 @@ static inline bool pgd_userspace_access(pgd_t pgd)
return pgd.pgd & _PAGE_USER;
}

+#define _PAGE_PTI_CAN_NX (_PAGE_PRESENT|_PAGE_USER)
+
static inline void kaiser_poison_pgd(pgd_t *pgd)
{
- if (pgd->pgd & _PAGE_PRESENT && __supported_pte_mask & _PAGE_NX)
+ if ((pgd->pgd & _PAGE_PTI_CAN_NX) == _PAGE_PTI_CAN_NX &&
+ __supported_pte_mask & _PAGE_NX)
pgd->pgd |= _PAGE_NX;
}

static inline void kaiser_unpoison_pgd(pgd_t *pgd)
{
- if (pgd->pgd & _PAGE_PRESENT && __supported_pte_mask & _PAGE_NX)
+ if ((pgd->pgd & _PAGE_PTI_CAN_NX) == _PAGE_PTI_CAN_NX &&
+ __supported_pte_mask & _PAGE_NX)
pgd->pgd &= ~_PAGE_NX;
}

static inline void kaiser_poison_pgd_atomic(pgd_t *pgd)
{
BUILD_BUG_ON(_PAGE_NX == 0);
- if (pgd->pgd & _PAGE_PRESENT && __supported_pte_mask & _PAGE_NX)
+ if ((pgd->pgd & _PAGE_PTI_CAN_NX) == _PAGE_PTI_CAN_NX &&
+ __supported_pte_mask & _PAGE_NX)
set_bit(_PAGE_BIT_NX, &pgd->pgd);
}

static inline void kaiser_unpoison_pgd_atomic(pgd_t *pgd)
{
- if (pgd->pgd & _PAGE_PRESENT && __supported_pte_mask & _PAGE_NX)
+ if ((pgd->pgd & _PAGE_PTI_CAN_NX) == _PAGE_PTI_CAN_NX &&
+ __supported_pte_mask & _PAGE_NX)
clear_bit(_PAGE_BIT_NX, &pgd->pgd);
}


2018-01-05 18:20:39

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)


[ adding Hugh ]

On Thu, 4 Jan 2018, Dave Hansen wrote:

> > BTW, we have just reported a bug caused by kaiser[1], which looks like
> > caused by SMEP. Could you please help to have a look?
> >
> > [1] https://lkml.org/lkml/2018/1/5/3
>
> Please report that to your kernel vendor. Your EFI page tables have the
> NX bit set on the low addresses. There have been a bunch of iterations
> of this, but you need to make sure that the EFI kernel mappings don't
> get _PAGE_NX set on them. Look at what __pti_set_user_pgd() does in
> mainline.

Unfortunately this is more complicated.

The thing is -- efi=old_memmap is broken even upstream. We will probably
not receive too many reports about this against upstream PTI, as most of
the machines are using classic high-mapping of EFI regions; but older
kernels force on certain machines stil old_memmap (or it can be specified
manually on kernel cmdline), where EFI has all its mapping in the
userspace range.

And that explodes, as those get marked NX in the kernel pagetables.

I've spent most of today tracking this down (the legacy EFI mmap is
horrid); the patch below is confirmed to fix it both on current upstream
kernel, as well as on original-KAISER based kernels (Hugh's backport) in
cases old_memmap is used by EFI.

I am not super happy about this, but I din't really want to extend the
_set_pgd() code to always figure out whether it's dealing wih low EFI
mapping or not, as that would be way too much overhead just for this
one-off call during boot.



From: Jiri Kosina <[email protected]>
Subject: [PATCH] PTI: unbreak EFI old_memmap

old_memmap's efi_call_phys_prolog() calls set_pgd() with swapper PGD that
has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute it's code.

Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).

_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.

Signed-off-by: Jiri Kosina <[email protected]>
---
arch/x86/platform/efi/efi_64.c | 6 ++++++
1 file changed, 6 insertions(+)

--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
+ /*
+ * pgprot API doesn't clear it for PGD
+ *
+ * Will be brought back automatically in _epilog()
+ */
+ pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
}
__flush_tlb_all();


--
Jiri Kosina
SUSE Labs

2018-01-05 19:01:01

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)


The previous patch was for slightly older kernel, and the logic in
_prologue() is a bit different in 4.15, but the (cofirmed) fix for
mainline is basically the same:


From: Jiri Kosina <[email protected]>
Subject: [PATCH] PTI: unbreak EFI old_memmap

old_memmap's efi_call_phys_prolog() calls set_pgd() with swapper PGD that
has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute it's code.

Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).

_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.

Signed-off-by: Jiri Kosina <[email protected]>

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index d87ac96e37ed..2dd15e967c3f 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -135,7 +135,9 @@ pgd_t * __init efi_call_phys_prolog(void)
pud[j] = *pud_offset(p4d_k, vaddr);
}
}
+ pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
}
+
out:
__flush_tlb_all();


--
Jiri Kosina
SUSE Labs

2018-01-05 19:03:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/05/2018 10:19 AM, Jiri Kosina wrote:
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> + /*
> + * pgprot API doesn't clear it for PGD
> + *
> + * Will be brought back automatically in _epilog()
> + */
> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
> }
> __flush_tlb_all();

Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
the &init_mm in there and *not* set _PAGE_USER?

2018-01-05 19:17:22

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, 5 Jan 2018, Dave Hansen wrote:

> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
> > save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> > vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> > set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> > + /*
> > + * pgprot API doesn't clear it for PGD
> > + *
> > + * Will be brought back automatically in _epilog()
> > + */
> > + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
> > }
> > __flush_tlb_all();
>
> Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
> the &init_mm in there and *not* set _PAGE_USER?

That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
reasons that are behind me.

I did put this on my TODO list, but for later.

(and yes, I tried clearing _PAGE_USER from init_mm's PGD, and no obvious
breakages appeared, but I wanted to give it more thought later).

--
Jiri Kosina
SUSE Labs

2018-01-05 19:18:31

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, 5 Jan 2018, Jiri Kosina wrote:

> That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
> reasons that are behind me.

[ oh and BTW I find the fact that we have populate_pgd() and
pgd_populate(), which do something *completely* different quite
entertaining ]

--
Jiri Kosina
SUSE Labs

2018-01-05 19:55:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, Jan 05, 2018 at 08:17:17PM +0100, Jiri Kosina wrote:
> On Fri, 5 Jan 2018, Dave Hansen wrote:
>
> > > --- a/arch/x86/platform/efi/efi_64.c
> > > +++ b/arch/x86/platform/efi/efi_64.c
> > > @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
> > > save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> > > vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> > > set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> > > + /*
> > > + * pgprot API doesn't clear it for PGD
> > > + *
> > > + * Will be brought back automatically in _epilog()
> > > + */
> > > + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
> > > }
> > > __flush_tlb_all();

Upstream & downstream looks different, how the above looks completely
different I don't know, but I got it and updating is easy. Great
catch.

> >
> > Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
> > the &init_mm in there and *not* set _PAGE_USER?
>
> That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
> reasons that are behind me.

For vsyscalls? I also had to single out warnings out of init_mm.pgd
for the same reasons.

How does the below (untested) look?

>From ab949b80124588c4791568429cf8a234dda16340 Mon Sep 17 00:00:00 2001
From: Jiri Kosina <[email protected]>
Date: Fri, 5 Jan 2018 20:00:25 +0100
Subject: [RHEL7.5 PATCH 1/1] x86/kaiser/efi: unbreak EFI old_memmap

old_memmap's efi_call_phys_prolog() calls set_pgd() with swapper PGD that
has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute it's code.

Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).

_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.

Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/platform/efi/efi_64.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index f951026ea2d2..395079128d98 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -110,6 +110,7 @@ void __init efi_call_phys_prelog(void)
vaddr = (unsigned long)__va(pgd * PGDIR_SIZE);
pgd_efi = pgd_offset_k(addr_pgd);
save_pgd[pgd] = *pgd_efi;
+ pgd_efi->pgd &= ~_PAGE_NX;

pud = pud_alloc(&init_mm, pgd_efi, addr_pgd);
if (!pud) {

2018-01-05 21:07:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/05/2018 11:17 AM, Jiri Kosina wrote:
> On Fri, 5 Jan 2018, Dave Hansen wrote:
>
>>> --- a/arch/x86/platform/efi/efi_64.c
>>> +++ b/arch/x86/platform/efi/efi_64.c
>>> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
>>> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
>>> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
>>> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
>>> + /*
>>> + * pgprot API doesn't clear it for PGD
>>> + *
>>> + * Will be brought back automatically in _epilog()
>>> + */
>>> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
>>> }
>>> __flush_tlb_all();
>>
>> Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
>> the &init_mm in there and *not* set _PAGE_USER?
>
> That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
> reasons that are behind me.
>
> I did put this on my TODO list, but for later.
>
> (and yes, I tried clearing _PAGE_USER from init_mm's PGD, and no obvious
> breakages appeared, but I wanted to give it more thought later).

Feel free to add my Ack on this. I'd personally much rather muck with
random relatively unused bits of the efi code than touch the core PGD code.

We need to go look at it again in the 4.16 timeframe, probably.

2018-01-05 21:14:11

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, 5 Jan 2018, Dave Hansen wrote:

> >>> --- a/arch/x86/platform/efi/efi_64.c
> >>> +++ b/arch/x86/platform/efi/efi_64.c
> >>> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
> >>> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> >>> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> >>> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> >>> + /*
> >>> + * pgprot API doesn't clear it for PGD
> >>> + *
> >>> + * Will be brought back automatically in _epilog()
> >>> + */
> >>> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
> >>> }
> >>> __flush_tlb_all();
> >>
> >> Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
> >> the &init_mm in there and *not* set _PAGE_USER?
> >
> > That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
> > reasons that are behind me.
> >
> > I did put this on my TODO list, but for later.
> >
> > (and yes, I tried clearing _PAGE_USER from init_mm's PGD, and no obvious
> > breakages appeared, but I wanted to give it more thought later).
>
> Feel free to add my Ack on this.

Thanks. I'll extract the patch out of this thread and submit it
separately, so that it doesn't get lost buried here.

> I'd personally much rather muck with random relatively unused bits of
> the efi code than touch the core PGD code.

Exactly. Especially at this point.

> We need to go look at it again in the 4.16 timeframe, probably.

Agreed. On my TODO list already.

Thanks,

--
Jiri Kosina
SUSE Labs

2018-01-05 21:29:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)



> On Jan 5, 2018, at 1:14 PM, Jiri Kosina <[email protected]> wrote:
>
> On Fri, 5 Jan 2018, Dave Hansen wrote:
>
>>>>> --- a/arch/x86/platform/efi/efi_64.c
>>>>> +++ b/arch/x86/platform/efi/efi_64.c
>>>>> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
>>>>> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
>>>>> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
>>>>> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
>>>>> + /*
>>>>> + * pgprot API doesn't clear it for PGD
>>>>> + *
>>>>> + * Will be brought back automatically in _epilog()
>>>>> + */
>>>>> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
>>>>> }
>>>>> __flush_tlb_all();
>>>>
>>>> Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
>>>> the &init_mm in there and *not* set _PAGE_USER?
>>>
>>> That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
>>> reasons that are behind me.
>>>
>>> I did put this on my TODO list, but for later.
>>>
>>> (and yes, I tried clearing _PAGE_USER from init_mm's PGD, and no obvious
>>> breakages appeared, but I wanted to give it more thought later).
>>
>> Feel free to add my Ack on this.
>
> Thanks. I'll extract the patch out of this thread and submit it
> separately, so that it doesn't get lost buried here.
>
>> I'd personally much rather muck with random relatively unused bits of
>> the efi code than touch the core PGD code.
>
> Exactly. Especially at this point.
>
>> We need to go look at it again in the 4.16 timeframe, probably.
>
> Agreed. On my TODO list already.

Can we just delete the old memmap code instead?

--Andy

>

2018-01-05 22:48:20

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, Jan 5, 2018 at 1:14 PM, Jiri Kosina <[email protected]> wrote:
> On Fri, 5 Jan 2018, Dave Hansen wrote:
>
>> >>> --- a/arch/x86/platform/efi/efi_64.c
>> >>> +++ b/arch/x86/platform/efi/efi_64.c
>> >>> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
>> >>> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
>> >>> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
>> >>> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
>> >>> + /*
>> >>> + * pgprot API doesn't clear it for PGD
>> >>> + *
>> >>> + * Will be brought back automatically in _epilog()
>> >>> + */
>> >>> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
>> >>> }
>> >>> __flush_tlb_all();
>> >>
>> >> Wait a sec... Where does the _PAGE_USER come from? Shouldn't we see
>> >> the &init_mm in there and *not* set _PAGE_USER?
>> >
>> > That's because pgd_populate() uses _PAGE_TABLE and not _KERNPG_TABLE for
>> > reasons that are behind me.

Oh, I completely missed that; and then the issue would have got hidden
by one of my later per-process-kaiser patches.

>> >
>> > I did put this on my TODO list, but for later.
>> >
>> > (and yes, I tried clearing _PAGE_USER from init_mm's PGD, and no obvious
>> > breakages appeared, but I wanted to give it more thought later).
>>
>> Feel free to add my Ack on this.

And mine - thanks a lot for dealing with this Jiri.

>
> Thanks. I'll extract the patch out of this thread and submit it
> separately, so that it doesn't get lost buried here.
>
>> I'd personally much rather muck with random relatively unused bits of
>> the efi code than touch the core PGD code.
>
> Exactly. Especially at this point.

Indeed.

>
>> We need to go look at it again in the 4.16 timeframe, probably.
>
> Agreed. On my TODO list already.
>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>

2018-01-06 04:55:49

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

Hi Jiri,

Thanks for the fix, comments inline.

On 2018/1/6 2:19, Jiri Kosina wrote:
>
> [ adding Hugh ]
>
> On Thu, 4 Jan 2018, Dave Hansen wrote:
>
>>> BTW, we have just reported a bug caused by kaiser[1], which looks like
>>> caused by SMEP. Could you please help to have a look?
>>>
>>> [1] https://lkml.org/lkml/2018/1/5/3
>>
>> Please report that to your kernel vendor. Your EFI page tables have the
>> NX bit set on the low addresses. There have been a bunch of iterations
>> of this, but you need to make sure that the EFI kernel mappings don't
>> get _PAGE_NX set on them. Look at what __pti_set_user_pgd() does in
>> mainline.
>
> Unfortunately this is more complicated.
>
> The thing is -- efi=old_memmap is broken even upstream. We will probably
> not receive too many reports about this against upstream PTI, as most of
> the machines are using classic high-mapping of EFI regions; but older
> kernels force on certain machines stil old_memmap (or it can be specified
> manually on kernel cmdline), where EFI has all its mapping in the
> userspace range.
>
> And that explodes, as those get marked NX in the kernel pagetables.
>
> I've spent most of today tracking this down (the legacy EFI mmap is
> horrid); the patch below is confirmed to fix it both on current upstream
> kernel, as well as on original-KAISER based kernels (Hugh's backport) in
> cases old_memmap is used by EFI.
>
> I am not super happy about this, but I din't really want to extend the
> _set_pgd() code to always figure out whether it's dealing wih low EFI
> mapping or not, as that would be way too much overhead just for this
> one-off call during boot.
>
>
>
> From: Jiri Kosina <[email protected]>
> Subject: [PATCH] PTI: unbreak EFI old_memmap
>
> old_memmap's efi_call_phys_prolog() calls set_pgd() with swapper PGD that
> has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
> execute it's code.
>
> Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
> by the pgprot API).
>
> _PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
> _set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
> it.
>
> Signed-off-by: Jiri Kosina <[email protected]>
> ---
> arch/x86/platform/efi/efi_64.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -95,6 +95,12 @@ pgd_t * __init efi_call_phys_prolog(void
> save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress));
> + /*
> + * pgprot API doesn't clear it for PGD
> + *
> + * Will be brought back automatically in _epilog()
> + */
> + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;

Do you mean NX bit will be brought back later? I'm asking this because
I tested this patch which it fixed the boot panic issue but the system
will hang when rebooting the system, because rebooting will also call efi
then panic as NS bit is set.

[ 1911.622675] BUG: unable to handle kernel paging request at 00000000008041c0
[ 1911.629880] IP: [<00000000008041c0>] 0x8041bf
[ 1911.634389] PGD 80000010272cb067 PUD 2025178067 PMD 10272d8067 PTE 804063
[ 1911.641472] Oops: 0011 [#1] SMP
[ 1911.711748] Modules linked in: bum(O) ip_set nfnetlink prio(O) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre kboxdriver(O) kbox(O) signo_catch(O) vfat fat tg3 intel_powerclamp coretemp intel_rapl crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel i2c_i801 kvm_intel(O) ptp lrw gf128mul i2c_core glue_helper ablk_helper pps_core kvm(O) cryptd iTCO_wdt iTCO_vendor_support sg pcspkr lpc_ich mfd_core sb_edac mei_me edac_core mei shpchp acpi_power_meter acpi_pad remote_trigger(O) nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd sr_mod sd_mod cdrom lpfc crc_t10dif ahci crct10dif_generic crct10dif_pclmul libahci scsi_transport_fc scsi_tgt crct10dif_common libata usb_storage megaraid_sas dm_mod [last unloaded: dev_connlimit]
[ 1911.796711] CPU: 0 PID: 12033 Comm: reboot Tainted: G OE ---- ------- 3.10.0-327.61.59.66_22.x86_64 #1
[ 1911.807449] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.79 11/07/2017
[ 1911.814702] task: ffff881025a91700 ti: ffff8810267fc000 task.ti: ffff8810267fc000
[ 1911.822401] RIP: 0010:[<00000000008041c0>] [<00000000008041c0>] 0x8041bf
[ 1911.829407] RSP: 0018:ffff8810267ffd50 EFLAGS: 00010086
[ 1911.834877] RAX: 00000000008041c0 RBX: 0000000000000000 RCX: ffffffffff425000
[ 1911.842220] RDX: ffff8820a4e40000 RSI: 000000000000c000 RDI: 0000002024e40000
[ 1911.849563] RBP: ffff8810267ffd60 R08: ffff882024e40000 R09: 0000000000000000
[ 1911.856908] R10: ffffffff81a8f300 R11: ffff8810267ffaae R12: 0000000028121969
[ 1911.864250] R13: ffffffff819aa8a0 R14: 0000000000000cf9 R15: 0000000000000000
[ 1911.871596] FS: 00007f89d6143880(0000) GS:ffff881040400000(0000) knlGS:0000000000000000
[ 1911.879921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1911.885836] CR2: 00000000008041c0 CR3: 0000002024e40000 CR4: 00000000001607f0
[ 1911.893180] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1911.900522] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1911.907863] Call Trace:
[ 1911.910384] [<ffffffff810241ab>] ? tboot_shutdown+0x5b/0x140
[ 1911.916298] [<ffffffff8104723c>] native_machine_emergency_restart+0x4c/0x250
[ 1911.923641] [<ffffffff8104c102>] ? disconnect_bsp_APIC+0x82/0xc0
[ 1911.929913] [<ffffffff81046e17>] native_machine_restart+0x37/0x40
[ 1911.936273] [<ffffffff810470ef>] machine_restart+0xf/0x20
[ 1911.941923] [<ffffffff8109af95>] kernel_restart+0x45/0x60
[ 1911.947570] [<ffffffff8109b1d9>] SYSC_reboot+0x229/0x260
[ 1911.953132] [<ffffffff811ef665>] ? vfs_writev+0x35/0x60
[ 1911.958603] [<ffffffff8109b27e>] SyS_reboot+0xe/0x10
[ 1911.963806] [<ffffffff8165e43d>] system_call_fastpath+0x16/0x1b
[ 1911.969987] Code: Bad RIP value.
[ 1911.973448] RIP [<00000000008041c0>] 0x8041bf
[ 1911.978044] RSP <ffff8810267ffd50>
[ 1911.990106] CR2: 00000000008041c0
[ 1912.001889] ---[ end trace e8475aee26ff7d9f ]---
[ 1912.408111] Kernel panic - not syncing: Fatal exception

Thanks
Hanjun

2018-01-06 06:06:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/05/2018 08:54 PM, Hanjun Guo wrote:
> Do you mean NX bit will be brought back later? I'm asking this because
> I tested this patch which it fixed the boot panic issue but the system
> will hang when rebooting the system, because rebooting will also call efi
> then panic as NS bit is set.

Wow, you're running a lot of very lighly-used code paths! You actually
found a similar but totally separate issue from what I gather. Thank
you immensely for the quick testing and bug reports!

Could you test the attached fix?

For those playing along at home, I think this will end up being needed
for 4.15 and probably all the backports. I want to see if it works
before I submit it for real, though.


Attachments:
pti-tboot-fix.patch (1.32 kB)

2018-01-06 06:32:24

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

Hi Dave,

Thank you very much for the quick response! Minor comments inline.

On 2018/1/6 14:06, Dave Hansen wrote:
> On 01/05/2018 08:54 PM, Hanjun Guo wrote:
>> Do you mean NX bit will be brought back later? I'm asking this because
>> I tested this patch which it fixed the boot panic issue but the system
>> will hang when rebooting the system, because rebooting will also call efi
>> then panic as NS bit is set.
> Wow, you're running a lot of very lighly-used code paths! You actually
> found a similar but totally separate issue from what I gather. Thank
> you immensely for the quick testing and bug reports!
>
> Could you test the attached fix?
>
> For those playing along at home, I think this will end up being needed
> for 4.15 and probably all the backports. I want to see if it works
> before I submit it for real, though.
>
>
> pti-tboot-fix.patch
>
>
> From: Dave Hansen <[email protected]>
>
> This is another case similar to what EFI does: create a new set of
> page tables, map some code at a low address, and jump to it. PTI
> mistakes this low address for userspace and mistakenly marks it
> non-executable in an effort to make it unusable for userspace. Undo
> the poison to allow execution.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Ning Sun <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
>
> b/arch/x86/kernel/tboot.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff -puN arch/x86/kernel/tboot.c~pti-tboot-fix arch/x86/kernel/tboot.c
> --- a/arch/x86/kernel/tboot.c~pti-tboot-fix 2018-01-05 21:50:55.755554960 -0800
> +++ b/arch/x86/kernel/tboot.c 2018-01-05 22:01:51.393553325 -0800
> @@ -124,6 +124,13 @@ static int map_tboot_page(unsigned long
> pte_t *pte;
>
> pgd = pgd_offset(&tboot_mm, vaddr);
> + /*
> + * PTI poisons low addresses in the kernel page tables in the
> + * name of making them unusable for userspace. To execute
> + * code at such a low address, the poison must be cleared.
> + */
> + pgd->pgd &= ~_PAGE_NX;

...

> +
> p4d = p4d_alloc(&tboot_mm, pgd, vaddr);

Seems pgd will be re-set after p4d_alloc(), so should
we put the code behind (or after pud_alloc())?

> if (!p4d)
> return -1;

+ /*
+ * PTI poisons low addresses in the kernel page tables in the
+ * name of making them unusable for userspace. To execute
+ * code at such a low address, the poison must be cleared.
+ */
+ pgd->pgd &= ~_PAGE_NX;

We will have a try in a minute, and report back later.

Thanks
Hanjun

2018-01-06 06:54:43

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 2018/1/6 14:28, Hanjun Guo wrote:
> Hi Dave,
>
> Thank you very much for the quick response! Minor comments inline.
>
> On 2018/1/6 14:06, Dave Hansen wrote:
>> On 01/05/2018 08:54 PM, Hanjun Guo wrote:
>>> Do you mean NX bit will be brought back later? I'm asking this because
>>> I tested this patch which it fixed the boot panic issue but the system
>>> will hang when rebooting the system, because rebooting will also call efi
>>> then panic as NS bit is set.
>> Wow, you're running a lot of very lighly-used code paths! You actually
>> found a similar but totally separate issue from what I gather. Thank
>> you immensely for the quick testing and bug reports!
>>
>> Could you test the attached fix?
>>
>> For those playing along at home, I think this will end up being needed
>> for 4.15 and probably all the backports. I want to see if it works
>> before I submit it for real, though.
>>
>>
>> pti-tboot-fix.patch
>>
>>
>> From: Dave Hansen <[email protected]>
>>
>> This is another case similar to what EFI does: create a new set of
>> page tables, map some code at a low address, and jump to it. PTI
>> mistakes this low address for userspace and mistakenly marks it
>> non-executable in an effort to make it unusable for userspace. Undo
>> the poison to allow execution.
>>
>> Signed-off-by: Dave Hansen <[email protected]>
>> Cc: Ning Sun <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>>
>> b/arch/x86/kernel/tboot.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff -puN arch/x86/kernel/tboot.c~pti-tboot-fix arch/x86/kernel/tboot.c
>> --- a/arch/x86/kernel/tboot.c~pti-tboot-fix 2018-01-05 21:50:55.755554960 -0800
>> +++ b/arch/x86/kernel/tboot.c 2018-01-05 22:01:51.393553325 -0800
>> @@ -124,6 +124,13 @@ static int map_tboot_page(unsigned long
>> pte_t *pte;
>>
>> pgd = pgd_offset(&tboot_mm, vaddr);
>> + /*
>> + * PTI poisons low addresses in the kernel page tables in the
>> + * name of making them unusable for userspace. To execute
>> + * code at such a low address, the poison must be cleared.
>> + */
>> + pgd->pgd &= ~_PAGE_NX;
>
> ...
>
>> +
>> p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
>
> Seems pgd will be re-set after p4d_alloc(), so should
> we put the code behind (or after pud_alloc())?
>
>> if (!p4d)
>> return -1;
>
> + /*
> + * PTI poisons low addresses in the kernel page tables in the
> + * name of making them unusable for userspace. To execute
> + * code at such a low address, the poison must be cleared.
> + */
> + pgd->pgd &= ~_PAGE_NX;
>
> We will have a try in a minute, and report back later.

And it works,we can boot/reboot the system successfully, thank
you all the quick response and debug!

Thanks
Hanjun

2018-01-06 07:51:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/05/2018 10:28 PM, Hanjun Guo wrote:
>> +
>> p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
> Seems pgd will be re-set after p4d_alloc(), so should
> we put the code behind (or after pud_alloc())?

<sigh> Yes, it has to go below where the PGD actually gets set which is
after pud_alloc(). You can put it anywhere later in the function.

2018-01-06 07:55:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 01/05/2018 10:53 PM, Hanjun Guo wrote:
>> + /*
>> + * PTI poisons low addresses in the kernel page tables in the
>> + * name of making them unusable for userspace. To execute
>> + * code at such a low address, the poison must be cleared.
>> + */
>> + pgd->pgd &= ~_PAGE_NX;
>>
>> We will have a try in a minute, and report back later.
> And it works,we can boot/reboot the system successfully, thank
> you all the quick response and debug!

I think I'll just submit the attached patch if there are no objections
(and if it works, of course!).

I just stuck the NX clearing at the bottom.


Attachments:
pti-tboot-fix.patch (1.42 kB)

2018-01-06 08:42:51

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On 2018/1/6 15:55, Dave Hansen wrote:
> On 01/05/2018 10:53 PM, Hanjun Guo wrote:
>>> + /*
>>> + * PTI poisons low addresses in the kernel page tables in the
>>> + * name of making them unusable for userspace. To execute
>>> + * code at such a low address, the poison must be cleared.
>>> + */
>>> + pgd->pgd &= ~_PAGE_NX;
>>>
>>> We will have a try in a minute, and report back later.
>> And it works,we can boot/reboot the system successfully, thank
>> you all the quick response and debug!
> I think I'll just submit the attached patch if there are no objections
> (and if it works, of course!).

We tested that placing the NX clearing after pud_alloc(), and it works,
patch below should work as well.

>
> I just stuck the NX clearing at the bottom.
>
>
> pti-tboot-fix.patch
>
>
> From: Dave Hansen <[email protected]>
>
> This is another case similar to what EFI does: create a new set of
> page tables, map some code at a low address, and jump to it. PTI
> mistakes this low address for userspace and mistakenly marks it
> non-executable in an effort to make it unusable for userspace. Undo
> the poison to allow execution.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Ning Sun <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
>
> b/arch/x86/kernel/tboot.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff -puN arch/x86/kernel/tboot.c~pti-tboot-fix arch/x86/kernel/tboot.c
> --- a/arch/x86/kernel/tboot.c~pti-tboot-fix 2018-01-05 21:50:55.755554960 -0800
> +++ b/arch/x86/kernel/tboot.c 2018-01-05 23:51:41.368536890 -0800
> @@ -138,6 +138,17 @@ static int map_tboot_page(unsigned long
> return -1;
> set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
> pte_unmap(pte);
> +
> + /*
> + * PTI poisons low addresses in the kernel page tables in the
> + * name of making them unusable for userspace. To execute
> + * code at such a low address, the poison must be cleared.
> + *
> + * Note: 'pgd' actually gets set in p4d_alloc() _or_
> + * pud_alloc() depending on 4/5-level paging.
> + */
> + pgd->pgd &= ~_PAGE_NX;
> +
> return 0;
> }

Thanks
Hanjun

2018-01-06 17:22:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

On Fri, Jan 05, 2018 at 11:51:38PM -0800, Dave Hansen wrote:
> On 01/05/2018 10:28 PM, Hanjun Guo wrote:
> >> +
> >> p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
> > Seems pgd will be re-set after p4d_alloc(), so should
> > we put the code behind (or after pud_alloc())?

Thanks Dave and Jiri for these two tboot and efi_64 fixes.

>
> <sigh> Yes, it has to go below where the PGD actually gets set which is
> after pud_alloc(). You can put it anywhere later in the function.

Did the exact same oversight yesterday when porting Jiri's fix.

efi_64 booted fine verified yesterday in a respin of what I sent here
by just moving it after pud_alloc too:

pud = pud_alloc(&init_mm, pgd_efi, addr_pgd);
if (!pud) {
pr_err("Failed to allocate pud table!\n");
break;
}
+ pgd_efi->pgd &= ~_PAGE_NX;

Now I'm having this tested for tboot too (still untested). With tboot
I expect the first build pass the test. All followups on bz.

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 088681d4fc45..09cff5f4f9a4 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -131,6 +131,7 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
pud = pud_alloc(&tboot_mm, pgd, vaddr);
if (!pud)
return -1;
+ pgd->pgd &= ~_PAGE_NX;
pmd = pmd_alloc(&tboot_mm, pud, vaddr);
if (!pmd)
return -1;

Note your upstream submitted version is less theoretically correct than the
above. It won't make a difference in practice, but it is theoretically
wrong to clear the PAGE_NX only if pte_alloc_map succeeds like your
patch does.

If in the future pte_alloc_map fails and for whatever reason the pgd
will still be used and the whole thing will not abort, your fix will
still end up with NX set in the pgd.

Only the first pud allocation establishes itself in the pgd, follow
ups don't if in __pud_alloc pgd_present() will return true.

This is why I did the strictier backport of Jiri's fix yesterday but I
was a little too strict putting it just before pud_alloc and it had to
go just after it.

Thanks,
Andrea