Extend PTI user mappings so that more kernel entry code can be executed
with the user page-table. To do so, we need to map syscall and interrupt
entry code, per-cpu offsets (__per_cpu_offset, which is used in some
entry code), the stack canary, and the PTI stack (which is defined per
task).
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 2 --
arch/x86/mm/pti.c | 14 ++++++++++++++
kernel/fork.c | 22 ++++++++++++++++++++++
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6e0b5b010e0b..458af12ed9a1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm)
* rbx: kernel thread func (NULL for user thread)
* r12: kernel thread arg
*/
-.pushsection .text, "ax"
SYM_CODE_START(ret_from_fork)
UNWIND_HINT_REGS
movq %rsp, %rdi /* pt_regs */
@@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork)
call return_from_fork /* returns with IRQs disabled */
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork)
-.popsection
.macro DEBUG_ENTRY_ASSERT_IRQS_OFF
#ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 71ca245d7b38..f4f3d9ae4449 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -465,6 +465,11 @@ static void __init pti_clone_user_shared(void)
*/
pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));
+ /*
+ * Map fixed_percpu_data to get the stack canary.
+ */
+ if (IS_ENABLED(CONFIG_STACKPROTECTOR))
+ pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu));
}
}
@@ -505,6 +510,15 @@ static void pti_clone_entry_text(void)
pti_clone_init_pgtable((unsigned long) __entry_text_start,
(unsigned long) __entry_text_end,
PTI_CLONE_PMD);
+
+ /*
+ * Syscall and interrupt entry code (which is in the noinstr
+ * section) will be entered with the user page-table, so that
+ * code has to be mapped in.
+ */
+ pti_clone_init_pgtable((unsigned long) __noinstr_text_start,
+ (unsigned long) __noinstr_text_end,
+ PTI_CLONE_PMD);
}
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d266388d380..31cd77dbdba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ unsigned long addr;
+
+ if (!tsk || !static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ /*
+ * Map the task stack after the kernel stack into the user
+ * address space, so that this stack can be used when entering
+ * syscall or interrupt from user mode.
+ */
+ BUG_ON(!task_stack_page(tsk));
+ addr = (unsigned long)task_top_of_kernel_stack(tsk);
+ pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (init_new_context(p, mm))
goto fail_nocontext;
+ mm_map_task(mm, p);
+
mm->user_ns = get_user_ns(user_ns);
return mm;
@@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
vmacache_flush(tsk);
if (clone_flags & CLONE_VM) {
+ mm_map_task(oldmm, tsk);
mmget(oldmm);
mm = oldmm;
goto good_mm;
--
2.18.4
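[For reference, the fork.c hunk relies on task_top_of_kernel_stack() and
KERNEL_STACK_SIZE, which come from earlier patches in the series. Below is a
rough reconstruction of the layout it assumes, kept here only as an
illustration; the helper definition is a guess, not code from the series:

#include <linux/sched/task_stack.h>	/* task_stack_page() */

/*
 * Assumed layout (inferred from the mm_map_task() hunk above):
 *
 *   task_stack_page(tsk)
 *   |<-- kernel stack (KERNEL_STACK_SIZE) -->|<-- PTI stack (KERNEL_STACK_SIZE) -->|
 *                                            ^
 *                              task_top_of_kernel_stack(tsk)
 *
 * Only the PTI stack half is cloned into the user page-table, so entry
 * code can run on it before switching CR3.
 */
static inline void *task_top_of_kernel_stack(struct task_struct *tsk)
{
	return task_stack_page(tsk) + KERNEL_STACK_SIZE;
}
]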
[Copying the reply to Andy in the thread with the right email addresses]
On 11/9/20 6:28 PM, Andy Lutomirski wrote:
> On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> Extend PTI user mappings so that more kernel entry code can be executed
>> with the user page-table. To do so, we need to map syscall and interrupt
>> entry code,
>
> Probably fine.
>
>> per cpu offsets (__per_cpu_offset, which is used some in
>> entry code),
>
> This likely already leaks due to vulnerable CPUs leaking address space
> layout info.
I forgot to update the comment; I am not mapping __per_cpu_offset anymore.
However, if we do map __per_cpu_offset, then we don't need to enforce the
CR3-before-GS ordering in paranoid_entry.
>
>> the stack canary,
>
> That's going to be a very tough sell.
>
I can get rid of this, but that will require disabling the stack-protector for
any function that we can call while using the user page-table, as already
done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers).
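For illustration, it would look something like this (sketch only: the
attribute assumes a compiler with per-function no_stack_protector support,
and the wrapper and function names are made up here, this is not the code
from patch 21):

#include <asm/ptrace.h>		/* struct pt_regs */

/*
 * Exempt a function that can run on the user page-table from
 * stack-protector instrumentation, so it never references canary
 * storage that may not be mapped there.  Stripping -fstack-protector*
 * from the whole object file in the Makefile is the other option.
 */
#define __no_user_pgtable_canary __attribute__((no_stack_protector))

static __no_user_pgtable_canary void example_ist_entry_handler(struct pt_regs *regs)
{
	/* runs before the CR3 switch, without canary load/check */
}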
alex.
On Mon, Nov 9, 2020 at 11:54 AM Alexandre Chartre
<[email protected]> wrote:
>
>
> [Copying the reply to Andy in the thread with the right email addresses]
>
> On 11/9/20 6:28 PM, Andy Lutomirski wrote:
> > On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >> Extend PTI user mappings so that more kernel entry code can be executed
> >> with the user page-table. To do so, we need to map syscall and interrupt
> >> entry code,
> >
> > Probably fine.
> >
> >> per cpu offsets (__per_cpu_offset, which is used some in
> >> entry code),
> >
> > This likely already leaks due to vulnerable CPUs leaking address space
> > layout info.
>
> I forgot to update the comment, I am not mapping __per_cpu_offset anymore.
>
> However, if we do map __per_cpu_offset then we don't need to enforce the
> ordering in paranoid_entry to switch CR3 before GS.
I'm okay with mapping __per_cpu_offset.
>
> >
> >> the stack canary,
> >
> > That's going to be a very tough sell.
> >
>
> I can get rid of this, but this will require to disable stack-protector for
> any function that we can call while using the user page-table, like already
> done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers).
>
You could probably get away with using a different stack protector
canary before and after the CR3 switch as long as you are careful to
have the canary restored when you return from whatever function is
involved.
On 11/11/20 12:39 AM, Andy Lutomirski wrote:
>>
>> On 11/9/20 6:28 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> Extend PTI user mappings so that more kernel entry code can be executed
>>>> with the user page-table. To do so, we need to map syscall and interrupt
>>>> entry code,
>>>
>>> Probably fine.
>>>
>>>> per cpu offsets (__per_cpu_offset, which is used some in
>>>> entry code),
>>>
>>> This likely already leaks due to vulnerable CPUs leaking address space
>>> layout info.
>>
>> I forgot to update the comment, I am not mapping __per_cpu_offset anymore.
>>
>> However, if we do map __per_cpu_offset then we don't need to enforce the
>> ordering in paranoid_entry to switch CR3 before GS.
>
> I'm okay with mapping __per_cpu_offset.
>
Good. That way I can move the GS update back to assembly code (paranoid_entry/exit
will be mostly reduced to updating GS), and I probably won't need to disable
the stack-protector.
>>>
>>>> the stack canary,
>>>
>>> That's going to be a very tough sell.
>>>
>>
>> I can get rid of this, but this will require to disable stack-protector for
>> any function that we can call while using the user page-table, like already
>> done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers).
>>
>
> You could probably get away with using a different stack protector
> canary before and after the CR3 switch as long as you are careful to
> have the canary restored when you return from whatever function is
> involved.
>
I was thinking about doing that. I will give it a try.
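
Roughly what I have in mind (sketch only; the helper names and the
placeholder value are made up, and where the temporary canary storage has
to live so that checks still work on the user page-table is exactly the
detail I need to sort out):

#include <linux/compiler.h>
#include <linux/percpu.h>
#include <asm/processor.h>	/* fixed_percpu_data */

/* Non-secret value used while running on the user page-table. */
#define PTI_ENTRY_CANARY	0xbaadc0deUL

static __always_inline unsigned long pti_switch_to_entry_canary(void)
{
	unsigned long saved = this_cpu_read(fixed_percpu_data.stack_canary);

	this_cpu_write(fixed_percpu_data.stack_canary, PTI_ENTRY_CANARY);
	return saved;
}

static __always_inline void pti_restore_task_canary(unsigned long saved)
{
	this_cpu_write(fixed_percpu_data.stack_canary, saved);
}

As you said, the swap has to be placed so that every canary-checked
function sees the same value at its prologue and epilogue, so these would
be called from the entry/exit paths around the CR3 switch rather than from
arbitrary C code.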
Thanks,
alex.