I set out to clean up the spurious protection key checks.
But, I got distracted by some other warts in the code. I hope this
makes things more comprehensible with some improved structure,
commenting and naming.
We come out the other side of it with a new warning for pkey
faults in the kernel portion of the address space, and removal
of some dead pkeys code, but no other functional changes.
There is a potential impact from moving the vsyscall emulation,
but the series passes the x86 selftests without any fuss.
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
From: Dave Hansen <[email protected]>
The page fault handler (__do_page_fault()) basically has two sections:
one for handling faults in the kernel porttion of the address space
and another for faults in the user porttion of the address space.
But, these two parts don't stick out that well. Let's make that more
clear from code separation and naming. Pull kernel fault
handling into its own helper, and reflect that naming by renaming
spurious_fault() -> spurious_kernel_fault().
Also, rewrite the vmalloc handling comment a bit. It was a bit
stale and also glossed over the reserved bit handling.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 98 ++++++++++++++++++++++++++++++--------------------
1 file changed, 59 insertions(+), 39 deletions(-)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-00 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings-00 2018-09-07 11:21:46.145751902 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:23:37.643751624 -0700
@@ -1033,7 +1033,7 @@ mm_fault_error(struct pt_regs *regs, uns
}
}
-static int spurious_fault_check(unsigned long error_code, pte_t *pte)
+static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
{
if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
return 0;
@@ -1072,7 +1072,7 @@ static int spurious_fault_check(unsigned
* (Optional Invalidation).
*/
static noinline int
-spurious_fault(unsigned long error_code, unsigned long address)
+spurious_kernel_fault(unsigned long error_code, unsigned long address)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -1103,27 +1103,27 @@ spurious_fault(unsigned long error_code,
return 0;
if (p4d_large(*p4d))
- return spurious_fault_check(error_code, (pte_t *) p4d);
+ return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;
if (pud_large(*pud))
- return spurious_fault_check(error_code, (pte_t *) pud);
+ return spurious_kernel_fault_check(error_code, (pte_t *) pud);
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return 0;
if (pmd_large(*pmd))
- return spurious_fault_check(error_code, (pte_t *) pmd);
+ return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
pte = pte_offset_kernel(pmd, address);
if (!pte_present(*pte))
return 0;
- ret = spurious_fault_check(error_code, pte);
+ ret = spurious_kernel_fault_check(error_code, pte);
if (!ret)
return 0;
@@ -1131,12 +1131,12 @@ spurious_fault(unsigned long error_code,
* Make sure we have permissions in PMD.
* If not, then there's a bug in the page tables:
*/
- ret = spurious_fault_check(error_code, (pte_t *) pmd);
+ ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
return ret;
}
-NOKPROBE_SYMBOL(spurious_fault);
+NOKPROBE_SYMBOL(spurious_kernel_fault);
int show_unhandled_signals = 1;
@@ -1203,6 +1203,55 @@ static inline bool smap_violation(int er
return true;
}
+static void
+do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+ /*
+ * We can fault-in kernel-space virtual memory on-demand. The
+ * 'reference' page table is init_mm.pgd.
+ *
+ * NOTE! We MUST NOT take any locks for this case. We may
+ * be in an interrupt or a critical region, and should
+ * only copy the information from the master page table,
+ * nothing more.
+ *
+ * Before doing this on-demand faulting, ensure that the
+ * fault is not any of the following:
+ * 1. A fault on a PTE with a reserved bit set.
+ * 2. A fault caused by a user-mode access. (Do not demand-
+ * fault kernel memory due to user-mode accesses).
+ * 3. A fault caused by a page-level protection violation.
+ * (A demand fault would be on a non-present page which
+ * would have X86_PF_PROT==0).
+ */
+ if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
+ if (vmalloc_fault(address) >= 0)
+ return;
+ }
+
+ /* Was the fault spurious, caused by lazy TLB invalidation? */
+ if (spurious_kernel_fault(hw_error_code, address))
+ return;
+
+ /* kprobes don't want to hook the spurious faults: */
+ if (kprobes_fault(regs))
+ return;
+
+ /*
+ * This is a "bad" fault in the kernel address space. There
+ * is no reasonable explanation for it. We will either kill
+ * the process for making a bad access, or oops the kernel.
+ */
+
+ /*
+ * Don't take the mm semaphore here. If we fixup a prefetch
+ * fault we could otherwise deadlock:
+ */
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
+}
+NOKPROBE_SYMBOL(do_kern_addr_space_fault);
+
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
@@ -1228,38 +1277,9 @@ __do_page_fault(struct pt_regs *regs, un
if (unlikely(kmmio_fault(regs, address)))
return;
- /*
- * We fault-in kernel-space virtual memory on-demand. The
- * 'reference' page table is init_mm.pgd.
- *
- * NOTE! We MUST NOT take any locks for this case. We may
- * be in an interrupt or a critical region, and should
- * only copy the information from the master page table,
- * nothing more.
- *
- * This verifies that the fault happens in kernel space
- * (hw_error_code & 4) == 0, and that the fault was not a
- * protection error (hw_error_code & 9) == 0.
- */
+ /* Was the fault on kernel-controlled part of the address space? */
if (unlikely(fault_in_kernel_space(address))) {
- if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
- if (vmalloc_fault(address) >= 0)
- return;
- }
-
- /* Can handle a stale RO->RW TLB: */
- if (spurious_fault(hw_error_code, address))
- return;
-
- /* kprobes don't want to hook the spurious faults: */
- if (kprobes_fault(regs))
- return;
- /*
- * Don't take the mm semaphore here. If we fixup a prefetch
- * fault we could otherwise deadlock:
- */
- bad_area_nosemaphore(regs, hw_error_code, address, NULL);
-
+ do_kern_addr_space_fault(regs, hw_error_code, address);
return;
}
_
From: Dave Hansen <[email protected]>
The SMAP and reserved-bit checks do not have nice comments. Add
some to clarify them and make them match everything else.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-02 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings-02 2018-09-07 11:21:47.182751900 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:47.185751900 -0700
@@ -1274,9 +1274,17 @@ void do_user_addr_space_fault(struct pt_
if (unlikely(kprobes_fault(regs)))
return;
+ /*
+ * Reserved bits are never expected to be set on
+ * entries in the user portion of the page tables.
+ */
if (unlikely(hw_error_code & X86_PF_RSVD))
pgtable_bad(regs, hw_error_code, address);
+ /*
+ * Check for invalid kernel (supervisor) access to user
+ * pages in the user address space.
+ */
if (unlikely(smap_violation(hw_error_code, regs))) {
bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
_
From: Dave Hansen <[email protected]>
The comments here are wrong. They are too absolute about where
faults can occur when running in the kernel. The comments are
also a bit hard to match up with the code.
Trim down the comments, and make them more precise.
Also add a comment explaining why we are doing the
bad_area_nosemaphore() path here.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-03 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings-03 2018-09-07 11:21:47.696751898 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:47.700751898 -0700
@@ -1349,24 +1349,25 @@ void do_user_addr_space_fault(struct pt_
flags |= FAULT_FLAG_INSTRUCTION;
/*
- * When running in the kernel we expect faults to occur only to
- * addresses in user space. All other faults represent errors in
- * the kernel and should generate an OOPS. Unfortunately, in the
- * case of an erroneous fault occurring in a code path which already
- * holds mmap_sem we will deadlock attempting to validate the fault
- * against the address space. Luckily the kernel only validly
- * references user space from well defined areas of code, which are
- * listed in the exceptions table.
+ * Kernel-mode access to the user address space should only occur
+ * inside well-defined areas of code listed in the exception
+ * tables. But, an erroneous kernel fault occurring outside one of
+ * those areas which also holds mmap_sem might deadlock attempting
+ * to validate the fault against the address space.
*
- * As the vast majority of faults will be valid we will only perform
- * the source reference check when there is a possibility of a
- * deadlock. Attempt to lock the address space, if we cannot we then
- * validate the source. If this is invalid we can skip the address
- * space check, thus avoiding the deadlock:
+ * Only do the expensive exception table search when we might be at
+ * risk of a deadlock:
+ * 1. We failed to acquire mmap_sem, and
+ * 2. The access was an explicit kernel-mode access
+ * (X86_PF_USER=0).
*/
if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if (!(sw_error_code & X86_PF_USER) &&
!search_exception_tables(regs->ip)) {
+ /*
+ * Fault from code in kernel from
+ * which we do not expect faults.
+ */
bad_area_nosemaphore(regs, sw_error_code, address, NULL);
return;
}
_
From: Dave Hansen <[email protected]>
We pass around a variable called "error_code" all around the page
fault code. Sounds simple enough, especially since "error_code" looks
like it exactly matches the values that the hardware gives us on the
stack to report the page fault error code (PFEC in SDM parlance).
But, that's not how it works.
For part of the page fault handler, "error_code" does exactly match
PFEC. But, during later parts, it diverges and starts to mean
something a bit different.
Give it two names for its two jobs.
The place it diverges is also really screwy. It's only in a spot
where the hardware tells us we have kernel-mode access that occurred
while we were in usermode accessing user-controlled address space.
Add a warning in there.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 77 +++++++++++++++++++++++++++++++++-----------------
1 file changed, 52 insertions(+), 25 deletions(-)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-0 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings-0 2018-09-07 11:21:45.629751903 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:45.633751903 -0700
@@ -1209,9 +1209,10 @@ static inline bool smap_violation(int er
* routines.
*/
static noinline void
-__do_page_fault(struct pt_regs *regs, unsigned long error_code,
+__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
{
+ unsigned long sw_error_code;
struct vm_area_struct *vma;
struct task_struct *tsk;
struct mm_struct *mm;
@@ -1237,17 +1238,17 @@ __do_page_fault(struct pt_regs *regs, un
* nothing more.
*
* This verifies that the fault happens in kernel space
- * (error_code & 4) == 0, and that the fault was not a
- * protection error (error_code & 9) == 0.
+ * (hw_error_code & 4) == 0, and that the fault was not a
+ * protection error (hw_error_code & 9) == 0.
*/
if (unlikely(fault_in_kernel_space(address))) {
- if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
+ if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
if (vmalloc_fault(address) >= 0)
return;
}
/* Can handle a stale RO->RW TLB: */
- if (spurious_fault(error_code, address))
+ if (spurious_fault(hw_error_code, address))
return;
/* kprobes don't want to hook the spurious faults: */
@@ -1257,7 +1258,7 @@ __do_page_fault(struct pt_regs *regs, un
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
*/
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}
@@ -1266,11 +1267,11 @@ __do_page_fault(struct pt_regs *regs, un
if (unlikely(kprobes_fault(regs)))
return;
- if (unlikely(error_code & X86_PF_RSVD))
- pgtable_bad(regs, error_code, address);
+ if (unlikely(hw_error_code & X86_PF_RSVD))
+ pgtable_bad(regs, hw_error_code, address);
- if (unlikely(smap_violation(error_code, regs))) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ if (unlikely(smap_violation(hw_error_code, regs))) {
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}
@@ -1279,11 +1280,18 @@ __do_page_fault(struct pt_regs *regs, un
* in a region with pagefaults disabled then we must not take the fault
*/
if (unlikely(faulthandler_disabled() || !mm)) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}
/*
+ * hw_error_code is literally the "page fault error code" passed to
+ * the kernel directly from the hardware. But, we will shortly be
+ * modifying it in software, so give it a new name.
+ */
+ sw_error_code = hw_error_code;
+
+ /*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
*
@@ -1292,7 +1300,26 @@ __do_page_fault(struct pt_regs *regs, un
*/
if (user_mode(regs)) {
local_irq_enable();
- error_code |= X86_PF_USER;
+ /*
+ * Up to this point, X86_PF_USER set in hw_error_code
+ * indicated a user-mode access. But, after this,
+ * X86_PF_USER in sw_error_code will indicate either
+ * that, *or* an implicit kernel(supervisor)-mode access
+ * which originated from user mode.
+ */
+ if (!(hw_error_code & X86_PF_USER)) {
+ /*
+ * The CPU was in user mode, but the CPU says
+ * the fault was not a user-mode access.
+ * Must be an implicit kernel-mode access,
+ * which we do not expect to happen in the
+ * user address space.
+ */
+ WARN_ONCE(1, "kernel-mode error from user-mode: %lx\n",
+ hw_error_code);
+
+ sw_error_code |= X86_PF_USER;
+ }
flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
@@ -1301,9 +1328,9 @@ __do_page_fault(struct pt_regs *regs, un
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
- if (error_code & X86_PF_WRITE)
+ if (sw_error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
- if (error_code & X86_PF_INSTR)
+ if (sw_error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
/*
@@ -1323,9 +1350,9 @@ __do_page_fault(struct pt_regs *regs, un
* space check, thus avoiding the deadlock:
*/
if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
- if (!(error_code & X86_PF_USER) &&
+ if (!(sw_error_code & X86_PF_USER) &&
!search_exception_tables(regs->ip)) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, sw_error_code, address, NULL);
return;
}
retry:
@@ -1341,16 +1368,16 @@ retry:
vma = find_vma(mm, address);
if (unlikely(!vma)) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
return;
}
if (likely(vma->vm_start <= address))
goto good_area;
if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
return;
}
- if (error_code & X86_PF_USER) {
+ if (sw_error_code & X86_PF_USER) {
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
@@ -1358,12 +1385,12 @@ retry:
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
return;
}
}
if (unlikely(expand_stack(vma, address))) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
return;
}
@@ -1372,8 +1399,8 @@ retry:
* we can handle it..
*/
good_area:
- if (unlikely(access_error(error_code, vma))) {
- bad_area_access_error(regs, error_code, address, vma);
+ if (unlikely(access_error(sw_error_code, vma))) {
+ bad_area_access_error(regs, sw_error_code, address, vma);
return;
}
@@ -1415,13 +1442,13 @@ good_area:
return;
/* Not returning to user mode? Handle exceptions or die: */
- no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+ no_context(regs, sw_error_code, address, SIGBUS, BUS_ADRERR);
return;
}
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
- mm_fault_error(regs, error_code, address, &pkey, fault);
+ mm_fault_error(regs, sw_error_code, address, &pkey, fault);
return;
}
_
From: Dave Hansen <[email protected]>
The vsyscall page is weird. It is in what is traditionally part of the
kernel address space. But, it has user permissions and we handle faults
on it like we would on a user page: interrupts on.
Right now, we handle vsyscall emulation in the "bad_area" code, which
is used for both user-address-space and kernel-address-space faults. Move
the handling to the user-address-space code *only* and ensure we get there
by "excluding" the vsyscall page from the kernel address space via a check
in fault_in_kernel_space().
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)
diff -puN arch/x86/mm/fault.c~vsyscall-is-user-address-space arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~vsyscall-is-user-address-space 2018-09-07 11:21:48.720751896 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:48.724751896 -0700
@@ -873,18 +873,6 @@ __bad_area_nosemaphore(struct pt_regs *r
if (is_errata100(regs, address))
return;
-#ifdef CONFIG_X86_64
- /*
- * Instruction fetch faults in the vsyscall page might need
- * emulation.
- */
- if (unlikely((error_code & X86_PF_INSTR) &&
- is_vsyscall_vaddr(address))) {
- if (emulate_vsyscall(regs, address))
- return;
- }
-#endif
-
/*
* To avoid leaking information about the kernel page table
* layout, pretend that user-mode accesses to kernel addresses
@@ -1192,6 +1180,13 @@ access_error(unsigned long error_code, s
static int fault_in_kernel_space(unsigned long address)
{
+ /*
+ * The vsyscall page is at an address above TASK_SIZE_MAX,
+ * but is not considered part of the kernel address space.
+ */
+ if (is_vsyscall_vaddr(address))
+ return false;
+
return address >= TASK_SIZE_MAX;
}
@@ -1357,6 +1352,23 @@ void do_user_addr_space_fault(struct pt_
if (sw_error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
+#ifdef CONFIG_X86_64
+ /*
+ * Instruction fetch faults in the vsyscall page might need
+ * emulation. The vsyscall page is at a high address
+ * (>PAGE_OFFSET), but is considered to be part of the user
+ * address space.
+ *
+ * The vsyscall page does not have a "real" VMA, so do this
+ * emulation before we go searching for VMAse
+ */
+ if (unlikely((sw_error_code & X86_PF_INSTR) &&
+ is_vsyscall_vaddr(address))) {
+ if (emulate_vsyscall(regs, address))
+ return;
+ }
+#endif
+
/*
* Kernel-mode access to the user address space should only occur
* inside well-defined areas of code listed in the exception
_
From: Dave Hansen <[email protected]>
We will shortly be using this check in two locations. Put it in
a helper before we do so.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff -puN arch/x86/mm/fault.c~is-vsyscall-addr arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~is-vsyscall-addr 2018-09-07 11:21:48.208751897 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:48.211751897 -0700
@@ -841,6 +841,15 @@ show_signal_msg(struct pt_regs *regs, un
show_opcodes((u8 *)regs->ip, loglvl);
}
+/*
+ * The (legacy) vsyscall page is the long page in the kernel portion
+ * of the address space that has user-accessible permissions.
+ */
+static bool is_vsyscall_vaddr(unsigned long vaddr)
+{
+ return (vaddr & ~0xfff) == VSYSCALL_ADDR;
+}
+
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
unsigned long address, u32 *pkey, int si_code)
@@ -870,7 +879,7 @@ __bad_area_nosemaphore(struct pt_regs *r
* emulation.
*/
if (unlikely((error_code & X86_PF_INSTR) &&
- ((address & ~0xfff) == VSYSCALL_ADDR))) {
+ is_vsyscall_vaddr(address))) {
if (emulate_vsyscall(regs, address))
return;
}
_
From: Dave Hansen <[email protected]>
The last patch broke out kernel address space handling into its own
helper. Now, do the same for user address space handling.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-01 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings-01 2018-09-07 11:21:46.663751901 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 11:21:46.667751901 -0700
@@ -1203,6 +1203,7 @@ static inline bool smap_violation(int er
return true;
}
+/* Handle faults in the kernel portion of the address space */
static void
do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
@@ -1252,14 +1253,11 @@ do_kern_addr_space_fault(struct pt_regs
}
NOKPROBE_SYMBOL(do_kern_addr_space_fault);
-/*
- * This routine handles page faults. It determines the address,
- * and the problem, and then passes it off to one of the appropriate
- * routines.
- */
-static noinline void
-__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
- unsigned long address)
+/* Handle faults in the user portion of the address space */
+static inline
+void do_user_addr_space_fault(struct pt_regs *regs,
+ unsigned long hw_error_code,
+ unsigned long address)
{
unsigned long sw_error_code;
struct vm_area_struct *vma;
@@ -1272,17 +1270,6 @@ __do_page_fault(struct pt_regs *regs, un
tsk = current;
mm = tsk->mm;
- prefetchw(&mm->mmap_sem);
-
- if (unlikely(kmmio_fault(regs, address)))
- return;
-
- /* Was the fault on kernel-controlled part of the address space? */
- if (unlikely(fault_in_kernel_space(address))) {
- do_kern_addr_space_fault(regs, hw_error_code, address);
- return;
- }
-
/* kprobes don't want to hook the spurious faults: */
if (unlikely(kprobes_fault(regs)))
return;
@@ -1486,6 +1473,28 @@ good_area:
check_v8086_mode(regs, address, tsk);
}
+NOKPROBE_SYMBOL(do_user_addr_space_fault);
+
+/*
+ * This routine handles page faults. It determines the address,
+ * and the problem, and then passes it off to one of the appropriate
+ * routines.
+ */
+static noinline void
+__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+	prefetchw(&current->mm->mmap_sem);
+
+ if (unlikely(kmmio_fault(regs, address)))
+ return;
+
+ /* Was the fault on kernel-controlled part of the address space? */
+ if (unlikely(fault_in_kernel_space(address)))
+ do_kern_addr_space_fault(regs, hw_error_code, address);
+ else
+ do_user_addr_space_fault(regs, hw_error_code, address);
+}
NOKPROBE_SYMBOL(__do_page_fault);
static nokprobe_inline void
_
From: Dave Hansen <[email protected]>
Spurious faults only ever occur in the kernel's address space. They
are also constrained specifically to faults with one of these error codes:
X86_PF_WRITE | X86_PF_PROT
X86_PF_INSTR | X86_PF_PROT
So, it's never even possible to reach spurious_kernel_fault_check() with
X86_PF_PK set.
In addition, the kernel's address space never has pages with user-mode
protections. Protection Keys are only enforced on pages with user-mode
protection.
This gives us lots of reasons to not check for protection keys in our
spurious kernel fault handling.
But, let's also add some warnings to ensure that these assumptions about
protection keys hold true.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
---
b/arch/x86/mm/fault.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-fault-warnings 2018-09-07 12:32:23.190741335 -0700
+++ b/arch/x86/mm/fault.c 2018-09-07 12:32:23.194741335 -0700
@@ -1037,12 +1037,6 @@ static int spurious_kernel_fault_check(u
if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
return 0;
- /*
- * Note: We do not do lazy flushing on protection key
- * changes, so no spurious fault will ever set X86_PF_PK.
- */
- if ((error_code & X86_PF_PK))
- return 1;
return 1;
}
@@ -1213,6 +1207,13 @@ do_kern_addr_space_fault(struct pt_regs
unsigned long address)
{
/*
+ * Protection keys exceptions only happen on user pages. We
+ * have no user pages in the kernel portion of the address
+ * space, so do not expect them here.
+ */
+ WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+
+ /*
* We can fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
*
_
On Fri, 2018-09-07 at 12:49 -0700, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> The comments here are wrong. They are too absolute about where
> faults can occur when running in the kernel. The comments are
> also a bit hard to match up with the code.
>
> Trim down the comments, and make them more precise.
>
> Also add a comment explaining why we are doing the
> bad_area_nosemaphore() path here.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> Cc: "Peter Zijlstra (Intel)" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> ---
>
> b/arch/x86/mm/fault.c | 27 ++++++++++++++-------------
> 1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-03 arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-fault-warnings-03 2018-09-07 11:21:47.696751898 -0700
> +++ b/arch/x86/mm/fault.c 2018-09-07 11:21:47.700751898 -0700
> @@ -1349,24 +1349,25 @@ void do_user_addr_space_fault(struct pt_
> flags |= FAULT_FLAG_INSTRUCTION;
>
> /*
> - * When running in the kernel we expect faults to occur only to
> - * addresses in user space. All other faults represent errors in
> - * the kernel and should generate an OOPS. Unfortunately, in the
> - * case of an erroneous fault occurring in a code path which already
> - * holds mmap_sem we will deadlock attempting to validate the fault
> - * against the address space. Luckily the kernel only validly
> - * references user space from well defined areas of code, which are
> - * listed in the exceptions table.
> + * Kernel-mode access to the user address space should only occur
> + * inside well-defined areas of code listed in the exception
> + * tables. But, an erroneous kernel fault occurring outside one of
> + * those areas which also holds mmap_sem might deadlock attempting
> + * to validate the fault against the address space.
> *
> - * As the vast majority of faults will be valid we will only perform
> - * the source reference check when there is a possibility of a
> - * deadlock. Attempt to lock the address space, if we cannot we then
> - * validate the source. If this is invalid we can skip the address
> - * space check, thus avoiding the deadlock:
> + * Only do the expensive exception table search when we might be at
> + * risk of a deadlock:
> + * 1. We failed to acquire mmap_sem, and
> + * 2. The access was an explicit kernel-mode access
> + * (X86_PF_USER=0).
Might be worth reminding the reader that X86_PF_USER will be set in
sw_error_code for implicit accesses. I saw "explicit" and my mind
immediately jumped to hw_error_code for whatever reason. E.g.:
* 2. The access was an explicit kernel-mode access (we set X86_PF_USER
* in sw_error_code for implicit kernel-mode accesses).
> */
> if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> if (!(sw_error_code & X86_PF_USER) &&
> !search_exception_tables(regs->ip)) {
> + /*
> + * Fault from code in kernel from
> + * which we do not expect faults.
> + */
> bad_area_nosemaphore(regs, sw_error_code, address, NULL);
> return;
> }
> _
On Fri, 2018-09-07 at 12:48 -0700, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> The page fault handler (__do_page_fault()) basically has two sections:
> one for handling faults in the kernel porttion of the address space
> and another for faults in the user porttion of the address space.
%s/porttion/portion
> But, these two parts don't stick out that well. Let's make that more
> clear from code separation and naming. Pull kernel fault
> handling into its own helper, and reflect that naming by renaming
> spurious_fault() -> spurious_kernel_fault().
>
> Also, rewrite the vmalloc handling comment a bit. It was a bit
> stale and also glossed over the reserved bit handling.
>> + * Only do the expensive exception table search when we might be at
>> + * risk of a deadlock:
>> + * 1. We failed to acquire mmap_sem, and
>> + * 2. The access was an explicit kernel-mode access
>> + * (X86_PF_USER=0).
>
> Might be worth reminding the reader that X86_PF_USER will be set in
> sw_error_code for implicit accesses. I saw "explicit" and my mind
> immediately jumped to hw_error_code for whatever reason. E.g.:
>
> * 2. The access was an explicit kernel-mode access (we set X86_PF_USER
> * in sw_error_code for implicit kernel-mode accesses).
Yeah, that was not worded well. Is this better?
> * Only do the expensive exception table search when we might be at
> * risk of a deadlock:
> * 1. We failed to acquire mmap_sem, and
> * 2. The access was an explicit kernel-mode access. An access
> * from user-mode will have X86_PF_USER=1 set via hw_error_code or
> * set in sw_error_code if it were an implicit kernel-mode
> * access that originated in user mode.
On 09/07/2018 02:06 PM, Sean Christopherson wrote:
>> The page fault handler (__do_page_fault()) basically has two sections:
>> one for handling faults in the kernel porttion of the address space
>> and another for faults in the user porttion of the address space.
> %s/porttion/portion
Fixed, thanks.
> On Sep 7, 2018, at 12:48 PM, Dave Hansen <[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> The page fault handler (__do_page_fault()) basically has two sections:
> one for handling faults in the kernel porttion of the address space
> and another for faults in the user porttion of the address space.
>
> But, these two parts don't stick out that well. Let's make that more
> clear from code separation and naming. Pull kernel fault
> handling into its own helper, and reflect that naming by renaming
> spurious_fault() -> spurious_kernel_fault().
>
> Also, rewrite the vmalloc handling comment a bit. It was a bit
> stale and also glossed over the reserved bit handling.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> Cc: "Peter Zijlstra (Intel)" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> ---
>
> b/arch/x86/mm/fault.c | 98 ++++++++++++++++++++++++++++++--------------------
> 1 file changed, 59 insertions(+), 39 deletions(-)
>
> diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-00 arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-fault-warnings-00 2018-09-07 11:21:46.145751902 -0700
> +++ b/arch/x86/mm/fault.c 2018-09-07 11:23:37.643751624 -0700
> @@ -1033,7 +1033,7 @@ mm_fault_error(struct pt_regs *regs, uns
> }
> }
>
> -static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> +static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
> {
> if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
> return 0;
> @@ -1072,7 +1072,7 @@ static int spurious_fault_check(unsigned
> * (Optional Invalidation).
> */
> static noinline int
> -spurious_fault(unsigned long error_code, unsigned long address)
> +spurious_kernel_fault(unsigned long error_code, unsigned long address)
> {
> pgd_t *pgd;
> p4d_t *p4d;
> @@ -1103,27 +1103,27 @@ spurious_fault(unsigned long error_code,
> return 0;
>
> if (p4d_large(*p4d))
> - return spurious_fault_check(error_code, (pte_t *) p4d);
> + return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
>
> pud = pud_offset(p4d, address);
> if (!pud_present(*pud))
> return 0;
>
> if (pud_large(*pud))
> - return spurious_fault_check(error_code, (pte_t *) pud);
> + return spurious_kernel_fault_check(error_code, (pte_t *) pud);
>
> pmd = pmd_offset(pud, address);
> if (!pmd_present(*pmd))
> return 0;
>
> if (pmd_large(*pmd))
> - return spurious_fault_check(error_code, (pte_t *) pmd);
> + return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
>
> pte = pte_offset_kernel(pmd, address);
> if (!pte_present(*pte))
> return 0;
>
> - ret = spurious_fault_check(error_code, pte);
> + ret = spurious_kernel_fault_check(error_code, pte);
> if (!ret)
> return 0;
>
> @@ -1131,12 +1131,12 @@ spurious_fault(unsigned long error_code,
> * Make sure we have permissions in PMD.
> * If not, then there's a bug in the page tables:
> */
> - ret = spurious_fault_check(error_code, (pte_t *) pmd);
> + ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
> WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
>
> return ret;
> }
> -NOKPROBE_SYMBOL(spurious_fault);
> +NOKPROBE_SYMBOL(spurious_kernel_fault);
>
> int show_unhandled_signals = 1;
>
> @@ -1203,6 +1203,55 @@ static inline bool smap_violation(int er
> return true;
> }
>
> +static void
> +do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{
Can you add a comment above this documenting *when* it’s called? Is it all faults, !user_mode faults, or !PF_USER?
> + /*
> + * We can fault-in kernel-space virtual memory on-demand. The
> + * 'reference' page table is init_mm.pgd.
> + *
> + * NOTE! We MUST NOT take any locks for this case. We may
> + * be in an interrupt or a critical region, and should
> + * only copy the information from the master page table,
> + * nothing more.
> + *
> + * Before doing this on-demand faulting, ensure that the
> + * fault is not any of the following:
> + * 1. A fault on a PTE with a reserved bit set.
> + * 2. A fault caused by a user-mode access. (Do not demand-
> + * fault kernel memory due to user-mode accesses).
> + * 3. A fault caused by a page-level protection violation.
> + * (A demand fault would be on a non-present page which
> + * would have X86_PF_PROT==0).
> + */
> + if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
> + if (vmalloc_fault(address) >= 0)
> + return;
> + }
> +
> + /* Was the fault spurious, caused by lazy TLB invalidation? */
> + if (spurious_kernel_fault(hw_error_code, address))
> + return;
> +
> + /* kprobes don't want to hook the spurious faults: */
> + if (kprobes_fault(regs))
> + return;
> +
> + /*
> + * This is a "bad" fault in the kernel address space. There
> + * is no reasonable explanation for it. We will either kill
> + * the process for making a bad access, or oops the kernel.
> + */
Or call an extable handler?
Maybe the wording should be less scary, e.g. “this fault is a genuine error. Send a signal, call an exception handler, or oops, as appropriate.”
On 09/07/2018 03:21 PM, Andy Lutomirski wrote:
>> +static void
>> +do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
>> + unsigned long address)
>> +{
>
> Can you add a comment above this documenting *when* it’s called? Is
> it all faults, !user_mode faults, or !PF_USER?
Yep, can do.
>> + /*
>> + * This is a "bad" fault in the kernel address space. There
>> + * is no reasonable explanation for it. We will either kill
>> + * the process for making a bad access, or oops the kernel.
>> + */
>
> Or call an extable handler?
>
> Maybe the wording should be less scary, e.g. “this fault is a genuine
> error. Send a signal, call an exception handler, or oops, as
> appropriate.”
Yeah, the real behavior is quite a bit more subtle than I'm letting on.
I'll tone it down.
> On Sep 7, 2018, at 12:48 PM, Dave Hansen <[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> We pass around a variable called "error_code" all around the page
> fault code. Sounds simple enough, especially since "error_code" looks
> like it exactly matches the values that the hardware gives us on the
> stack to report the page fault error code (PFEC in SDM parlance).
>
> But, that's not how it works.
>
> For part of the page fault handler, "error_code" does exactly match
> PFEC. But, during later parts, it diverges and starts to mean
> something a bit different.
>
> Give it two names for its two jobs.
How hard would it be to just remove sw_error_code instead? It seems like it adds little value and much confusion.
I’m also unconvinced that the warning is terribly useful. We’re going to oops when this happens anyway.
I’ll rebase my diagnostic patch on top of this series after it settles.
> On Sep 7, 2018, at 12:49 PM, Dave Hansen <[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> The vsyscall page is weird. It is in what is traditionally part of the
> kernel address space. But, it has user permissions and we handle faults
> on it like we would on a user page: interrupts on.
>
> Right now, we handle vsyscall emulation in the "bad_area" code, which
> is used for both user-address-space and kernel-address-space faults. Move
> the handling to the user-address-space code *only* and ensure we get there
> by "excluding" the vsyscall page from the kernel address space via a check
> in fault_in_kernel_space().
I assume the motivation is that you want to simplify the kernel error path. If so, can you mention this?
The patch itself is Reviewed-by: Andy Lutomirski <[email protected]>, although adding an unlikely() somewhere might be nice.
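For example, and this is just a sketch of one spot rather than a
request for this exact change, the new vsyscall check in
fault_in_kernel_space() runs for every fault but is only rarely true:

static int fault_in_kernel_space(unsigned long address)
{
	/*
	 * The vsyscall page is at an address above TASK_SIZE_MAX,
	 * but is not considered part of the kernel address space.
	 */
	if (unlikely(is_vsyscall_vaddr(address)))
		return false;

	return address >= TASK_SIZE_MAX;
}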
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> Cc: "Peter Zijlstra (Intel)" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> ---
>
> b/arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++------------
> 1 file changed, 24 insertions(+), 12 deletions(-)
>
> diff -puN arch/x86/mm/fault.c~vsyscall-is-user-address-space arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~vsyscall-is-user-address-space 2018-09-07 11:21:48.720751896 -0700
> +++ b/arch/x86/mm/fault.c 2018-09-07 11:21:48.724751896 -0700
> @@ -873,18 +873,6 @@ __bad_area_nosemaphore(struct pt_regs *r
> if (is_errata100(regs, address))
> return;
>
> -#ifdef CONFIG_X86_64
> - /*
> - * Instruction fetch faults in the vsyscall page might need
> - * emulation.
> - */
> - if (unlikely((error_code & X86_PF_INSTR) &&
> - is_vsyscall_vaddr(address))) {
> - if (emulate_vsyscall(regs, address))
> - return;
> - }
> -#endif
> -
> /*
> * To avoid leaking information about the kernel page table
> * layout, pretend that user-mode accesses to kernel addresses
> @@ -1192,6 +1180,13 @@ access_error(unsigned long error_code, s
>
> static int fault_in_kernel_space(unsigned long address)
> {
> + /*
> + * The vsyscall page is at an address above TASK_SIZE_MAX,
> + * but is not considered part of the kernel address space.
> + */
> + if (is_vsyscall_vaddr(address))
> + return false;
> +
> return address >= TASK_SIZE_MAX;
> }
>
> @@ -1357,6 +1352,23 @@ void do_user_addr_space_fault(struct pt_
> if (sw_error_code & X86_PF_INSTR)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> +#ifdef CONFIG_X86_64
> + /*
> + * Instruction fetch faults in the vsyscall page might need
> + * emulation. The vsyscall page is at a high address
> + * (>PAGE_OFFSET), but is considered to be part of the user
> + * address space.
> + *
> + * The vsyscall page does not have a "real" VMA, so do this
> + * emulation before we go searching for VMAse
> + */
> + if (unlikely((sw_error_code & X86_PF_INSTR) &&
> + is_vsyscall_vaddr(address))) {
> + if (emulate_vsyscall(regs, address))
> + return;
> + }
> +#endif
> +
> /*
> * Kernel-mode access to the user address space should only occur
> * inside well-defined areas of code listed in the exception
> _
On Sat, Sep 8, 2018 at 2:22 AM Dave Hansen <[email protected]> wrote:
> + * Kernel-mode access to the user address space should only occur
> + * inside well-defined areas of code listed in the exception
Actually, not areas, but single whitelisted instructions. It would
probably be nice to say that more clearly.
From arch/x86/include/asm/extable.h:
/*
* The exception table consists of triples of addresses relative to the
* exception table entry itself. The first address is of an instruction
* that is allowed to fault, the second is the target at which the program
* should continue. The third is a handler function to deal with the fault
* caused by the instruction in the first field.
*
* All the routines below use bits of fixup code that are out of line
* with the main instruction path. This means when everything is well,
* we don't even have to jump over them. Further, they do not intrude
* on our cache or tlb entries.
*/
struct exception_table_entry {
int insn, fixup, handler;
};
> + * tables. But, an erroneous kernel fault occurring outside one of
> + * those areas which also holds mmap_sem might deadlock attempting
> + * to validate the fault against the address space.
> *
> - * As the vast majority of faults will be valid we will only perform
> - * the source reference check when there is a possibility of a
> - * deadlock. Attempt to lock the address space, if we cannot we then
> - * validate the source. If this is invalid we can skip the address
> - * space check, thus avoiding the deadlock:
> + * Only do the expensive exception table search when we might be at
> + * risk of a deadlock:
> + * 1. We failed to acquire mmap_sem, and
> + * 2. The access was an explicit kernel-mode access
> + * (X86_PF_USER=0).
> */
> if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> if (!(sw_error_code & X86_PF_USER) &&
> !search_exception_tables(regs->ip)) {
> + /*
> + * Fault from code in kernel from
> + * which we do not expect faults.
> + */
> bad_area_nosemaphore(regs, sw_error_code, address, NULL);
> return;
> }
> _
>
On Sat, Sep 8, 2018 at 2:25 AM Dave Hansen <[email protected]> wrote:
> We will shortly be using this check in two locations. Put it in
> a helper before we do so.
[...]
> +/*
> + * The (legacy) vsyscall page is the long page in the kernel portion
> + * of the address space that has user-accessible permissions.
> + */
> +static bool is_vsyscall_vaddr(unsigned long vaddr)
> +{
> + return (vaddr & ~0xfff) == VSYSCALL_ADDR;
> +}
<bikeshed>
Since you're touching this code anyway: Would it make sense to change
that constant to a more explicit "~0xfffUL" (or alternatively
PAGE_MASK)? I tend to end up staring at code like this forever, trying
to figure out whether the upper 32 bits of the constant end up being
set or clear. As a reader, looking at the current code, it's quite
annoying to see what actually happens - first there's a signed 32-bit
literal 0xfff, then a 32-bit negation happens, and then the number is
converted to 64 bits with sign extension.
</bikeshed>
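For what it's worth, a tiny stand-alone illustration of the promotion
described above (demo code only, nothing here is from the kernel):
~0xfff is the int value -4096, which becomes 0xfffffffffffff000 when it
meets an unsigned long, so the mask really does clear only the low 12
bits. The objection is purely about how hard that is to see.

#include <stdio.h>

int main(void)
{
	/* Demo-only value: an address inside the 64-bit vsyscall page. */
	unsigned long vaddr = 0xffffffffff600123UL;

	/* Both masks clear only the low 12 bits; the second spelling
	 * just says so without relying on sign extension. */
	printf("%lx\n", vaddr & ~0xfff);	/* ffffffffff600000 */
	printf("%lx\n", vaddr & ~0xfffUL);	/* ffffffffff600000 */
	return 0;
}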
On Sat, Sep 8, 2018 at 2:28 AM Dave Hansen <[email protected]> wrote:
> The vsyscall page is weird. It is in what is traditionally part of the
> kernel address space. But, it has user permissions and we handle faults
> on it like we would on a user page: interrupts on.
>
> Right now, we handle vsyscall emulation in the "bad_area" code, which
> is used for both user-address-space and kernel-address-space faults. Move
> the handling to the user-address-space code *only* and ensure we get there
> by "excluding" the vsyscall page from the kernel address space via a check
> in fault_in_kernel_space().
[...]
> static int fault_in_kernel_space(unsigned long address)
> {
> + /*
> + * The vsyscall page is at an address above TASK_SIZE_MAX,
> + * but is not considered part of the kernel address space.
> + */
> + if (is_vsyscall_vaddr(address))
> + return false;
I think something should check for "#ifdef CONFIG_X86_64"? 32-bit
doesn't have a vsyscall page, right? And this code probably shouldn't
veer off into the userspace-area fault handling code for addresses in
the range 0xff600000-0xff600fff... what is in that region on 32-bit?
Modules or something like that?
Maybe change is_vsyscall_vaddr() so that it always returns false on
32-bit, or put both the definition of is_vsyscall_vaddr() and this
code behind #ifdef guards.
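For illustration, that guard could look something like this (an
untested sketch, not a concrete patch):

#ifdef CONFIG_X86_64
static bool is_vsyscall_vaddr(unsigned long vaddr)
{
	return (vaddr & ~0xfff) == VSYSCALL_ADDR;
}
#else
static bool is_vsyscall_vaddr(unsigned long vaddr)
{
	/* 32-bit has no vsyscall page, so nothing up there is one. */
	return false;
}
#endif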
And, in a separate patch, maybe also #ifdef-guard the definition of
VSYSCALL_ADDR in vsyscall.h? Nothing good is going to result from
making a garbage VSYSCALL_ADDR available to 32-bit code.
> +#ifdef CONFIG_X86_64
> + /*
> + * Instruction fetch faults in the vsyscall page might need
> + * emulation. The vsyscall page is at a high address
> + * (>PAGE_OFFSET), but is considered to be part of the user
> + * address space.
> + *
> + * The vsyscall page does not have a "real" VMA, so do this
> + * emulation before we go searching for VMAse
"VMAse"? Is that a typo?
On Fri, Sep 07, 2018 at 12:48:57PM -0700, Dave Hansen wrote:
> +static noinline void
> +__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{
> +	prefetchw(&current->mm->mmap_sem);
> +
> + if (unlikely(kmmio_fault(regs, address)))
> + return;
> +
> + /* Was the fault on kernel-controlled part of the address space? */
> + if (unlikely(fault_in_kernel_space(address)))
> + do_kern_addr_space_fault(regs, hw_error_code, address);
> + else
> + do_user_addr_space_fault(regs, hw_error_code, address);
> +}
How about: do_{user,kernel}_fault() ? That _addr_space_ is just a lot of
typing for no real benefit imo.
On 09/07/2018 03:48 PM, Andy Lutomirski wrote:
>>
>> For part of the page fault handler, "error_code" does exactly
>> match PFEC. But, during later parts, it diverges and starts to
>> mean something a bit different.
>>
>> Give it two names for its two jobs.
> How hard would it be to just remove sw_error_code instead? It seems
> like it adds little value and much confusion.
I think it would be really nice to have hw_error_code stand by itself
and be limited in scope to just __do_page_fault() and then have
FAULT_FLAG_* for everything else.
But, I was a little scared off of that. For one, I think we fill in
signal info with error_code, which makes it nominally part of the ABI.
So, I wanted to muck with it as little as possible in this set.
But, if we just said that
1. hw_error_code goes out to userspace, always, and
2. We drive all kernel behavior off of FAULT_FLAG_*, not error_code,
I think we can get away with it.
> I’m also unconvinced that the warning is terribly useful. We’re going
> to oops when this happens anyway.
One thing I wanted to get out of the warning was the contents of
hw_error_code before we go screwing with it. I also don't mind a nice,
clarifying warning showing up just before an oops. Maybe it could be a
pr_warn/err() instead of a full warning?
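Roughly like this, just to be concrete (illustrative only; the exact
printk helper and message are placeholders, not the posted patch):

	if (!(hw_error_code & X86_PF_USER)) {
		/* Log the unmodified hardware error code before we change it. */
		pr_warn_once("kernel-mode page fault from user mode, hw error code: %lx\n",
			     hw_error_code);
		sw_error_code |= X86_PF_USER;
	}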
On 09/08/2018 02:38 AM, Peter Zijlstra wrote:
>> + /* Was the fault on kernel-controlled part of the address space? */
>> + if (unlikely(fault_in_kernel_space(address)))
>> + do_kern_addr_space_fault(regs, hw_error_code, address);
>> + else
>> + do_user_addr_space_fault(regs, hw_error_code, address);
>> +}
> How about: do_{user,kernel}_fault() ? That _addr_space_ is just a lot of
> typing for no real benefit imo.
Yeah, totally, the names are long.
I was just trying to make it obvious to differentiate address space from
*mode* when the fault occurred. I can probably do that in comments
instead, though.
On Fri, 2018-09-07 at 14:51 -0700, Dave Hansen wrote:
> >
> > >
> > > + * Only do the expensive exception table search when we might be at
> > > + * risk of a deadlock:
> > > + * 1. We failed to acquire mmap_sem, and
> > > + * 2. The access was an explicit kernel-mode access
> > > + * (X86_PF_USER=0).
> > Might be worth reminding the reader that X86_PF_USER will be set in
> > sw_error_code for implicit accesses. I saw "explicit" and my mind
> > immediately jumped to hw_error_code for whatever reason. E.g.:
> >
> > * 2. The access was an explicit kernel-mode access (we set X86_PF_USER
> > * in sw_error_code for implicit kernel-mode accesses).
> Yeah, that was not worded well. Is this better?
>
> >
> > * Only do the expensive exception table search when we might be at
> > * risk of a deadlock:
> > * 1. We failed to acquire mmap_sem, and
> > * 2. The access was an explicit kernel-mode access. An access
> > * from user-mode will have X86_PF_USER=1 set via hw_error_code or
> > * set in sw_error_code if it were an implicit kernel-mode
> > * access that originated in user mode.
For me, mentioning hw_error_code just muddies the waters, e.g. why is
hw_error_code mentioned when it's not checked in the code? Comments
alone won't help someone that's reading this code and doesn't understand
that hardware sets X86_PF_USER for user-mode accesses. Maybe this?
* 2. The access was an explicit kernel-mode access. X86_PF_USER
* is set in sw_error_code for both user-mode accesses and
* implicit kernel-mode accesses that originated in user mode.
> On Sep 10, 2018, at 1:07 PM, Dave Hansen <[email protected]> wrote:
>
> On 09/07/2018 03:48 PM, Andy Lutomirski wrote:
>>>
>>> For part of the page fault handler, "error_code" does exactly
>>> match PFEC. But, during later parts, it diverges and starts to
>>> mean something a bit different.
>>>
>>> Give it two names for its two jobs.
>> How hard would it be to just remove sw_error_code instead? It seems
>> like it adds little value and much confusion.
>
> I think it would be really nice to have hw_error_code stand by itself
> and be limited in scope to just __do_page_fault() and then have
> FAULT_FLAG_* for everything else.
>
> But, I was a little scared off of that. For one, I think we fill in
> signal info with error_code, which makes it nominally part of the ABI.
> So, I wanted to muck with it as little as possible in this set.
>
> But, if we just said that
> 1. hw_error_code goes out to userspace, always, and
Nope, it’s an info leak. If the address is in kernel land (and not vsyscall), we must (and do, I believe) fake it.
> 2. We drive all kernel behavior off of FAULT_FLAG_*, not error_code,
> I think we can get away with it.
>
>> I’m also unconvinced that the warning is terribly useful. We’re going
>> to oops when this happens anyway.
>
> One thing I wanted to get out of the warning was the contents of
> hw_error_code before we go screwing with it. I also don't mind a nice,
> clarifying warning showing up just before an oops. Maybe it could be a
> pr_warn/err() instead of a full warning?
Sure.