2014-10-10 23:25:25

by Andi Kleen

Subject: Updated perf backtrace improvement patchkit

Based on the feedback, I reworked the patchkit to speed up perf
backtracing. It now uses a new method to determine valid kernel
addresses via the page tables, and also avoids calling the generic
copy_*_user in the backtrace entirely.

The goal is to avoid perf exceeding the NMI handler CPU time quota.

In addition, this patchkit optimizes the copy_*_user fault handling
for one case (running on systems with enhanced copy string support).

Hopefully this version is more acceptable.

-Andi


2014-10-10 23:25:27

by Andi Kleen

Subject: [PATCH 3/4] x86: Optimize enhanced copy user fault handling

From: Andi Kleen <[email protected]>

For the enhanced copy string case we can trivially optimize the fault
handling. It is just a single subtraction, as there is only one
possible fault point. So get rid of the copy_user_handle_tail() call
for this case and just do the subtraction directly.

This patch is not strictly needed for the goal of making perf
backtraces faster, but it may help other workloads.
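
As an illustrative C-level sketch (the helper name and parameters below
are made up for illustration, not part of the patch), this is what the
fixup computes:

/* With a single rep movsb there is only one fault point, so the
 * uncopied-byte count that copy_*_user must return is one
 * subtraction away. */
static unsigned long enhanced_copy_fixup(unsigned long len,
					 unsigned long copied_before_fault)
{
	return len - copied_before_fault;	/* bytes left uncopied */
}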

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/lib/copy_user_64.S | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index dee945d..608caf45 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -285,8 +285,10 @@ ENTRY(copy_user_enhanced_fast_string)
ret

.section .fixup,"ax"
-12: movl %ecx,%edx /* ecx is zerorest also */
- jmp copy_user_handle_tail
+ /* edx: len: ecx: actually copied bytes */
+12: sub %ecx,%edx
+ mov %edx,%eax
+ ret
.previous

_ASM_EXTABLE(1b,12b)
--
1.9.3

2014-10-10 23:25:36

by Andi Kleen

Subject: [PATCH 2/4] x86: Move copy_from_user_nmi() inline

From: Andi Kleen <[email protected]>

Move copy_from_user_nmi() inline. This allows the compiler to directly
do the __builtin_constant_p() optimizations in __copy_from_user_nocheck.

This then allows optimizing an 8-byte (32-bit) or 16-byte (64-bit) copy
into two direct __get_user() calls instead of using the generic copy
function.

This covers the 8/16-byte copies dump_stack uses when called from
the performance-critical perf NMI (PMI) handler.

First, this is much faster by itself (direct memory accesses vs. a
complicated function). But it is also a lot faster for any page fault,
which is common in backtracing. Currently copy_from_user() takes every
page fault twice in order to generate an exact count of unread bytes,
which adds a lot of overhead. The inlined __get_user code avoids this;
it simply bails out on the first fault.

copy_from_user_nmi() is only called from a few places, so there
isn't any significant code size increase from inlining it.
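
As a rough sketch of the kind of call site this helps (modeled loosely
on the frame-pointer walk in perf's user callchain code; the names below
are illustrative, not the exact kernel ones):

#include <asm/uaccess.h>	/* copy_from_user_nmi() lives here after this series */

struct user_stack_frame {
	const void __user *next_fp;
	unsigned long ret_addr;
};

static unsigned long read_user_frame(const void __user *fp,
				     struct user_stack_frame *frame)
{
	/*
	 * sizeof(*frame) is a compile-time constant (16 bytes on 64-bit),
	 * so with copy_from_user_nmi() inlined the __builtin_constant_p()
	 * path in __copy_from_user_nocheck() can emit two direct
	 * __get_user-style accesses instead of calling the generic copy
	 * routine.
	 */
	return copy_from_user_nmi(frame, fp, sizeof(*frame));
}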

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess.h | 29 +++++++++++++++++++++++++++--
arch/x86/lib/usercopy.c | 36 ------------------------------------
2 files changed, 27 insertions(+), 38 deletions(-)
delete mode 100644 arch/x86/lib/usercopy.c

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index e50a84f..30c391c 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -523,8 +523,6 @@ struct __large_struct { unsigned long buf[100]; };
#define put_user_ex(x, ptr) \
__put_user_size_ex((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr)))

-extern unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
extern __must_check long
strncpy_from_user(char *dst, const char __user *src, long count);

@@ -741,5 +739,32 @@ copy_to_user(void __user *to, const void *from, unsigned long n)
#undef __copy_from_user_overflow
#undef __copy_to_user_overflow

+/*
+ * We rely on the nested NMI work to allow atomic faults from the NMI path; the
+ * nested NMI paths are careful to preserve CR2.
+ *
+ * Inline this function so that the caller gets the __builtin_constant_p
+ * optimizations in __copy_from_user_nocheck
+ */
+static __must_check __always_inline unsigned long
+copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
+{
+ unsigned long ret;
+
+ if (__range_not_ok(from, n, user_addr_max()))
+ return 0;
+
+ /*
+ * Even though this function is typically called from NMI/IRQ context
+ * disable pagefaults so that its behaviour is consistent even when
+ * called form other contexts.
+ */
+ pagefault_disable();
+ ret = __copy_from_user_inatomic(to, from, n);
+ pagefault_enable();
+
+ return ret;
+}
+
#endif /* _ASM_X86_UACCESS_H */

diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
deleted file mode 100644
index ddf9ecb..0000000
--- a/arch/x86/lib/usercopy.c
+++ /dev/null
@@ -1,36 +0,0 @@
-/*
- * User address space access functions.
- *
- * For licencing details see kernel-base/COPYING
- */
-
-#include <linux/highmem.h>
-#include <linux/module.h>
-
-#include <asm/word-at-a-time.h>
-#include <linux/sched.h>
-
-/*
- * We rely on the nested NMI work to allow atomic faults from the NMI path; the
- * nested NMI paths are careful to preserve CR2.
- */
-unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
-{
- unsigned long ret;
-
- if (__range_not_ok(from, n, TASK_SIZE))
- return 0;
-
- /*
- * Even though this function is typically called from NMI/IRQ context
- * disable pagefaults so that its behaviour is consistent even when
- * called form other contexts.
- */
- pagefault_disable();
- ret = __copy_from_user_inatomic(to, from, n);
- pagefault_enable();
-
- return ret;
-}
-EXPORT_SYMBOL_GPL(copy_from_user_nmi);
--
1.9.3

2014-10-10 23:25:26

by Andi Kleen

Subject: [PATCH 1/4] Move pagefault_enable/disable to own include file

From: Andi Kleen <[email protected]>

Move pagefault_enable/disable from linux/uaccess.h to
their own include file. This avoids an include loop, since
asm/uaccess.h needs these inlines in its own inline
functions.

linux/uaccess.h still includes the new header, so there
is no change for any existing users.
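
As a minimal sketch of the pattern this split enables in header-level
inlines (the helper name here is made up; patch 2/4 uses exactly this
shape for copy_from_user_nmi()):

#include <linux/pagefault.h>	/* now usable from asm headers too */

static inline unsigned long example_inatomic_copy(void *dst,
						  const void __user *src,
						  unsigned long n)
{
	unsigned long ret;

	pagefault_disable();	/* faults go straight to the fixup table */
	ret = __copy_from_user_inatomic(dst, src, n);
	pagefault_enable();

	return ret;		/* bytes not copied */
}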

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess.h | 1 +
include/linux/pagefault.h | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/uaccess.h | 34 +---------------------------------
3 files changed, 41 insertions(+), 33 deletions(-)
create mode 100644 include/linux/pagefault.h

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 0d592e0..e50a84f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -7,6 +7,7 @@
#include <linux/compiler.h>
#include <linux/thread_info.h>
#include <linux/string.h>
+#include <linux/pagefault.h>
#include <asm/asm.h>
#include <asm/page.h>
#include <asm/smap.h>
diff --git a/include/linux/pagefault.h b/include/linux/pagefault.h
new file mode 100644
index 0000000..f9520bd
--- /dev/null
+++ b/include/linux/pagefault.h
@@ -0,0 +1,39 @@
+#ifndef _LINUX_PAGEFAULT_H
+#define _LINUX_PAGEFAULT_H 1
+
+#include <linux/preempt.h>
+
+/*
+ * These routines enable/disable the pagefault handler in that
+ * it will not take any locks and go straight to the fixup table.
+ *
+ * They have great resemblance to the preempt_disable/enable calls
+ * and in fact they are identical; this is because currently there is
+ * no other way to make the pagefault handlers do this. So we do
+ * disable preemption but we don't necessarily care about that.
+ */
+static inline void pagefault_disable(void)
+{
+ preempt_count_inc();
+ /*
+ * make sure to have issued the store before a pagefault
+ * can hit.
+ */
+ barrier();
+}
+
+static inline void pagefault_enable(void)
+{
+#ifndef CONFIG_PREEMPT
+ /*
+ * make sure to issue those last loads/stores before enabling
+ * the pagefault handler again.
+ */
+ barrier();
+ preempt_count_dec();
+#else
+ preempt_enable();
+#endif
+}
+
+#endif
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319..9a5e894 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -2,41 +2,9 @@
#define __LINUX_UACCESS_H__

#include <linux/preempt.h>
+#include <linux/pagefault.h>
#include <asm/uaccess.h>

-/*
- * These routines enable/disable the pagefault handler in that
- * it will not take any locks and go straight to the fixup table.
- *
- * They have great resemblance to the preempt_disable/enable calls
- * and in fact they are identical; this is because currently there is
- * no other way to make the pagefault handlers do this. So we do
- * disable preemption but we don't necessarily care about that.
- */
-static inline void pagefault_disable(void)
-{
- preempt_count_inc();
- /*
- * make sure to have issued the store before a pagefault
- * can hit.
- */
- barrier();
-}
-
-static inline void pagefault_enable(void)
-{
-#ifndef CONFIG_PREEMPT
- /*
- * make sure to issue those last loads/stores before enabling
- * the pagefault handler again.
- */
- barrier();
- preempt_count_dec();
-#else
- preempt_enable();
-#endif
-}
-
#ifndef ARCH_HAS_NOCACHE_UACCESS

static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
--
1.9.3

2014-10-10 23:26:24

by Andi Kleen

Subject: [PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

From: Andi Kleen <[email protected]>

On my workstation which has a lot of modules loaded:

$ lsmod | wc -l
80

a backtrace from the NMI for perf record -g can take quite a long time.

This leads to frequent messages like:
perf interrupt took too long (7852 > 7812), lowering kernel.perf_event_max_sample_rate to 16000

One large part of the PMI cost is that each text address check during
the backtrace takes up to 3us, like this:

1) | print_context_stack_bp() {
1) | __kernel_text_address() {
1) | is_module_text_address() {
1) | __module_text_address() {
1) 1.611 us | __module_address();
1) 1.939 us | }
1) 2.296 us | }
1) 2.659 us | }
1) | __kernel_text_address() {
1) | is_module_text_address() {
1) | __module_text_address() {
1) 0.724 us | __module_address();
1) 1.064 us | }
1) 1.430 us | }
1) 1.798 us | }
1) | __kernel_text_address() {
1) | is_module_text_address() {
1) | __module_text_address() {
1) 0.656 us | __module_address();
1) 1.012 us | }
1) 1.356 us | }
1) 1.761 us | }

So with a reasonably sized backtrace, easily 10-20us can be spent
just checking the frame pointer IPs.

So essentially the current module lookup cost scales as (number of modules) * (length of backtrace).

This patch uses the NX bits in the page tables to check for
valid kernel addresses instead. This can be done in any context
because kernel page tables are not removed (if they were, it could
be handled by RCU like the user page tables).

The lookup here is bounded at 2-4 memory accesses.

Anything in kernel space that does not have the NX bit set is valid
kernel executable code. Unlike the previous scheme, this also
handles cases like the profiler hitting BIOS code or similar
(e.g. the PCI BIOS on 32-bit).

On systems without NX we fall back to the previous scheme.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/dumpstack.c | 38 +++++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index b74ebc7..9279549 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,6 +90,42 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
return p > t && p < t + THREAD_SIZE - size;
}

+/*
+ * Check if the address is in a executable page.
+ * This can be much faster than looking it up in the module
+ * table list when many modules are loaded.
+ *
+ * This is safe in any context because kernel page tables
+ * are never removed.
+ */
+static bool addr_is_executable(unsigned long addr)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ if (!(__supported_pte_mask & _PAGE_NX))
+ return __kernel_text_address(addr);
+ if (addr < __PAGE_OFFSET)
+ return false;
+ pgd = pgd_offset_k(addr);
+ if (!pgd_present(*pgd))
+ return false;
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ return false;
+ if (pud_large(*pud))
+ return pte_exec(*(pte_t *)pud);
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd))
+ return false;
+ if (pmd_large(*pmd))
+ return pte_exec(*(pte_t *)pmd);
+ pte = pte_offset_kernel(pmd, addr);
+ return pte_present(*pte) && pte_exec(*pte);
+}
+
unsigned long
print_context_stack(struct thread_info *tinfo,
unsigned long *stack, unsigned long bp,
@@ -102,7 +138,7 @@ print_context_stack(struct thread_info *tinfo,
unsigned long addr;

addr = *stack;
- if (__kernel_text_address(addr)) {
+ if (addr_is_executable(addr)) {
if ((unsigned long) stack == bp + sizeof(long)) {
ops->address(data, addr, 1);
frame = frame->next_frame;
--
1.9.3

2014-10-11 00:24:52

by Chuck Ebbert

Subject: Re: [PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

On Fri, 10 Oct 2014 16:25:17 -0700
Andi Kleen <[email protected]> wrote:

> From: Andi Kleen <[email protected]>
>
> On my workstation which has a lot of modules loaded:
>
> $ lsmod | wc -l
> 80
>
> a backtrace from the NMI for perf record -g can take quite a long time.
>
> This leads to frequent messages like:
> perf interrupt took too long (7852 > 7812), lowering kernel.perf_event_max_sample_rate to 16000
>
> One large part of the PMI cost is that each text address check during
> the backtrace takes up to 3us, like this:
>
> 1) | print_context_stack_bp() {
> 1) | __kernel_text_address() {
> 1) | is_module_text_address() {
> 1) | __module_text_address() {
> 1) 1.611 us | __module_address();
> 1) 1.939 us | }
> 1) 2.296 us | }
> 1) 2.659 us | }
> 1) | __kernel_text_address() {
> 1) | is_module_text_address() {
> 1) | __module_text_address() {
> 1) 0.724 us | __module_address();
> 1) 1.064 us | }
> 1) 1.430 us | }
> 1) 1.798 us | }
> 1) | __kernel_text_address() {
> 1) | is_module_text_address() {
> 1) | __module_text_address() {
> 1) 0.656 us | __module_address();
> 1) 1.012 us | }
> 1) 1.356 us | }
> 1) 1.761 us | }
>
> So with a reasonably sized backtrace, easily 10-20us can be spent
> just checking the frame pointer IPs.
>
> So essentially the current module lookup cost scales as (number of modules) * (length of backtrace).
>
> This patch uses the NX bits in the page tables to check for
> valid kernel addresses instead. This can be done in any context
> because kernel page tables are not removed (if they were, it could
> be handled by RCU like the user page tables).
>
> The lookup here is bounded at 2-4 memory accesses.
>
> Anything in kernel space that does not have the NX bit set is valid
> kernel executable code. Unlike the previous scheme, this also
> handles cases like the profiler hitting BIOS code or similar
> (e.g. the PCI BIOS on 32-bit).
>
> On systems without NX we fall back to the previous scheme.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> arch/x86/kernel/dumpstack.c | 38 +++++++++++++++++++++++++++++++++++++-
> 1 file changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index b74ebc7..9279549 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -90,6 +90,42 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
> return p > t && p < t + THREAD_SIZE - size;
> }
>
> +/*
> + * Check if the address is in a executable page.
> + * This can be much faster than looking it up in the module
> + * table list when many modules are loaded.
> + *
> + * This is safe in any context because kernel page tables
> + * are never removed.
> + */
> +static bool addr_is_executable(unsigned long addr)

That name is confusing. Maybe call it x86_kernel_text_address()?

> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + if (!(__supported_pte_mask & _PAGE_NX))
> + return __kernel_text_address(addr);
> + if (addr < __PAGE_OFFSET)
> + return false;

Can't you check __PAGE_OFFSET first? That would speed up the non-NX
case...
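
Something like this, perhaps (untested sketch, just the first two
checks swapped; the rest of the walk is unchanged from the patch):

static bool addr_is_executable(unsigned long addr)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	/* cheap range check first, so the non-NX case also benefits */
	if (addr < __PAGE_OFFSET)
		return false;
	if (!(__supported_pte_mask & _PAGE_NX))
		return __kernel_text_address(addr);
	pgd = pgd_offset_k(addr);
	if (!pgd_present(*pgd))
		return false;
	pud = pud_offset(pgd, addr);
	if (!pud_present(*pud))
		return false;
	if (pud_large(*pud))
		return pte_exec(*(pte_t *)pud);
	pmd = pmd_offset(pud, addr);
	if (!pmd_present(*pmd))
		return false;
	if (pmd_large(*pmd))
		return pte_exec(*(pte_t *)pmd);
	pte = pte_offset_kernel(pmd, addr);
	return pte_present(*pte) && pte_exec(*pte);
}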

> + pgd = pgd_offset_k(addr);
> + if (!pgd_present(*pgd))
> + return false;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + return false;
> + if (pud_large(*pud))
> + return pte_exec(*(pte_t *)pud);
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + return false;
> + if (pmd_large(*pmd))
> + return pte_exec(*(pte_t *)pmd);
> + pte = pte_offset_kernel(pmd, addr);
> + return pte_present(*pte) && pte_exec(*pte);
> +}
> +
> unsigned long
> print_context_stack(struct thread_info *tinfo,
> unsigned long *stack, unsigned long bp,
> @@ -102,7 +138,7 @@ print_context_stack(struct thread_info *tinfo,
> unsigned long addr;
>
> addr = *stack;
> - if (__kernel_text_address(addr)) {
> + if (addr_is_executable(addr)) {
> if ((unsigned long) stack == bp + sizeof(long)) {
> ops->address(data, addr, 1);
> frame = frame->next_frame;

2014-10-11 00:33:46

by Eric Dumazet

Subject: Re: [PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

On Fri, 2014-10-10 at 16:25 -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> On my workstation which has a lot of modules loaded:
>
> $ lsmod | wc -l
> 80

> This patch uses the NX bits in the page tables to check for
> valid kernel addresses instead. This can be done in any context
> because kernel page tables are not removed (if they were, it could
> be handled by RCU like the user page tables).
>
> The lookup here is bounded at 2-4 memory accesses.
>
> Anything in kernel space that does not have the NX bit set is valid
> kernel executable code. Unlike the previous scheme, this also
> handles cases like the profiler hitting BIOS code or similar
> (e.g. the PCI BIOS on 32-bit).
>
> On systems without NX we fall back to the previous scheme.

For such systems, we probably could reorder fields in struct module to
reduce the lookup to one cache line access per module instead of three...

offsetof(struct module, module_core)=0x138
offsetof(struct module, module_init)=0x130
offsetof(struct module, core_size)=0x144
offsetof(struct module, init_size)=0x140
offsetof(struct module, list)=0x8
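
Rough sketch of the idea (abbreviated, not a tested patch; the field
names match struct module, the offsets are the ones above, and the rest
of the struct is elided):

struct module_reordered_sketch {
	struct list_head list;		/* currently at offset 0x8 */
	/* pulled up next to 'list' so the within_module_core()/_init()
	 * checks done by __module_address() stay in one cache line: */
	void *module_init;		/* currently at 0x130 */
	void *module_core;		/* currently at 0x138 */
	unsigned int init_size;		/* currently at 0x140 */
	unsigned int core_size;		/* currently at 0x144 */
	/* ... remaining struct module members unchanged ... */
};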