2010-08-16 20:48:41

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 0/6 v3] callchain fixes and cleanups

Hi,


This set factorizes a lot of callchain code and fixes concurrent
accesses to the callchain buffers.

The previous set revealed that we can't decently use the per cpu
allocation API for memory that can be accessed from NMI.
Hence this set temporarily uses a manual per cpu allocation until that
gets fixed up.
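
Roughly, the manual allocation boils down to an array of per cpu
pointers filled with plain kmalloc(), as introduced in patch 5/6
(error handling trimmed in this sketch):

/* Sketch only: mirrors alloc_callchain_buffers() from patch 5/6 */
static struct perf_callchain_entry **callchain_cpu_entries;

static int alloc_callchain_buffers(void)
{
	int cpu;

	callchain_cpu_entries = kzalloc(sizeof(*callchain_cpu_entries) *
					num_possible_cpus(), GFP_KERNEL);
	if (!callchain_cpu_entries)
		return -ENOMEM;

	for_each_possible_cpu(cpu) {
		/* one buffer per nesting context (task, softirq, irq, nmi) */
		callchain_cpu_entries[cpu] =
			kmalloc(sizeof(struct perf_callchain_entry) * 4,
				GFP_KERNEL);
		if (!callchain_cpu_entries[cpu])
			return -ENOMEM;	/* caller releases the partial allocation */
	}

	return 0;
}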

Build-tested on PowerPC, ARM and SuperH (can someone point me to
a ready-to-use sparc cross compiler, like those that can be found
for other archs at http://www.codesourcery.com/sgpp/lite_edition.html ?)

And runtime tested on x86-64 without any trouble.

You can test it by pulling that branch:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perf/callchains

If you have no comments, I'll propose it to -tip in a few days.

Thanks,
Frederic
---

Frederic Weisbecker (6):
perf: Drop inappropriate tests on arch callchains
perf: Generalize callchain_store()
perf: Generalize some arch callchain code
perf: Factorize callchain context handling
perf: Fix race in callchains
perf: Humanize the number of contexts


arch/arm/kernel/perf_event.c | 62 +---------
arch/powerpc/kernel/perf_callchain.c | 86 ++++---------
arch/sh/kernel/perf_callchain.c | 50 +-------
arch/sparc/kernel/perf_event.c | 69 ++++-------
arch/x86/kernel/cpu/perf_event.c | 82 +++----------
include/linux/perf_event.h | 30 ++++-
kernel/perf_event.c | 226 +++++++++++++++++++++++++++++-----
7 files changed, 295 insertions(+), 310 deletions(-)


2010-08-16 20:48:45

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 1/6] perf: Drop inappropriate tests on arch callchains

Drop the TASK_RUNNING test on user tasks for callchains, as this
check doesn't seem to make any sense.

Also remove the tests for !current, which is not supposed to
happen, and for current->pid, as this should be handled at the
generic level with the exclude_idle attribute.
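
For reference, idle filtering at the generic level is requested by the
tool through the event attributes, along these lines (userspace sketch,
not part of this patch):

	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN,
		.exclude_idle	= 1,	/* skip samples while the idle task runs */
	};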

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: David Miller <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/arm/kernel/perf_event.c | 6 ------
arch/sh/kernel/perf_callchain.c | 3 ---
arch/x86/kernel/cpu/perf_event.c | 3 ---
3 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 417c392..fdcb0be 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -3107,12 +3107,6 @@ perf_do_callchain(struct pt_regs *regs,

is_user = user_mode(regs);

- if (!current || !current->pid)
- return;
-
- if (is_user && current->state != TASK_RUNNING)
- return;
-
if (!is_user)
perf_callchain_kernel(regs, entry);

diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index a9dd3ab..1d6dbce 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -68,9 +68,6 @@ perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)

is_user = user_mode(regs);

- if (is_user && current->state != TASK_RUNNING)
- return;
-
/*
* Only the kernel side is implemented for now.
*/
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f2da20f..4a4d191 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1703,9 +1703,6 @@ perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)

is_user = user_mode(regs);

- if (is_user && current->state != TASK_RUNNING)
- return;
-
if (!is_user)
perf_callchain_kernel(regs, entry);

--
1.6.2.3

2010-08-16 20:48:50

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 3/6] perf: Generalize some arch callchain code

- Most archs use one callchain buffer per cpu, except x86 which needs
to deal with NMIs. Provide a default perf_callchain_buffer()
implementation that x86 overrides.

- Centralize all the kernel/user regs handling and invoke the new arch
handlers from there: perf_callchain_user() / perf_callchain_kernel().
That avoids all the user_mode() and current->mm checks and so on.

- Invert some parameters in perf_callchain_*() helpers: entry to the
left, regs to the right, following the traditional (src, dst).
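
Condensed from the kernel/perf_event.c hunk below, the weak default
buffer and the generic dispatcher roughly look like this:

__weak struct perf_callchain_entry *perf_callchain_buffer(void)
{
	return &__get_cpu_var(perf_callchain_entry);
}

static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
{
	struct perf_callchain_entry *entry = perf_callchain_buffer();

	if (!entry)
		return NULL;

	entry->nr = 0;

	if (!user_mode(regs)) {
		/* note the new (entry, regs) argument order */
		perf_callchain_kernel(entry, regs);
		regs = current->mm ? task_pt_regs(current) : NULL;
	}

	if (regs)
		perf_callchain_user(entry, regs);

	return entry;
}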

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: David Miller <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/arm/kernel/perf_event.c | 43 +++---------------------------
arch/powerpc/kernel/perf_callchain.c | 49 +++++++++------------------------
arch/sh/kernel/perf_callchain.c | 37 +------------------------
arch/sparc/kernel/perf_event.c | 46 ++++++++++---------------------
arch/x86/kernel/cpu/perf_event.c | 45 +++++-------------------------
include/linux/perf_event.h | 10 ++++++-
kernel/perf_event.c | 40 ++++++++++++++++++++++++++-
7 files changed, 90 insertions(+), 180 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index a07c3b1..0e3bbdb 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -3044,17 +3044,13 @@ user_backtrace(struct frame_tail *tail,
return buftail.fp - 1;
}

-static void
-perf_callchain_user(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+void
+perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
struct frame_tail *tail;

perf_callchain_store(entry, PERF_CONTEXT_USER);

- if (!user_mode(regs))
- regs = task_pt_regs(current);
-
tail = (struct frame_tail *)regs->ARM_fp - 1;

while (tail && !((unsigned long)tail & 0x3))
@@ -3075,9 +3071,8 @@ callchain_trace(struct stackframe *fr,
return 0;
}

-static void
-perf_callchain_kernel(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+void
+perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
struct stackframe fr;

@@ -3088,33 +3083,3 @@ perf_callchain_kernel(struct pt_regs *regs,
fr.pc = regs->ARM_pc;
walk_stackframe(&fr, callchain_trace, entry);
}
-
-static void
-perf_do_callchain(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
-{
- int is_user;
-
- if (!regs)
- return;
-
- is_user = user_mode(regs);
-
- if (!is_user)
- perf_callchain_kernel(regs, entry);
-
- if (current->mm)
- perf_callchain_user(regs, entry);
-}
-
-static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_irq_entry);
-
-struct perf_callchain_entry *
-perf_callchain(struct pt_regs *regs)
-{
- struct perf_callchain_entry *entry = &__get_cpu_var(pmc_irq_entry);
-
- entry->nr = 0;
- perf_do_callchain(regs, entry);
- return entry;
-}
diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index a286c2e..f7a85ed 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -46,8 +46,8 @@ static int valid_next_sp(unsigned long sp, unsigned long prev_sp)
return 0;
}

-static void perf_callchain_kernel(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+void
+perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
unsigned long sp, next_sp;
unsigned long next_ip;
@@ -221,8 +221,8 @@ static int sane_signal_64_frame(unsigned long sp)
puc == (unsigned long) &sf->uc;
}

-static void perf_callchain_user_64(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+static void perf_callchain_user_64(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
unsigned long sp, next_sp;
unsigned long next_ip;
@@ -303,8 +303,8 @@ static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
return __get_user_inatomic(*ret, ptr);
}

-static inline void perf_callchain_user_64(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+static inline void perf_callchain_user_64(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
}

@@ -423,8 +423,8 @@ static unsigned int __user *signal_frame_32_regs(unsigned int sp,
return mctx->mc_gregs;
}

-static void perf_callchain_user_32(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+static void perf_callchain_user_32(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
unsigned int sp, next_sp;
unsigned int next_ip;
@@ -471,32 +471,11 @@ static void perf_callchain_user_32(struct pt_regs *regs,
}
}

-/*
- * Since we can't get PMU interrupts inside a PMU interrupt handler,
- * we don't need separate irq and nmi entries here.
- */
-static DEFINE_PER_CPU(struct perf_callchain_entry, cpu_perf_callchain);
-
-struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+void
+perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
- struct perf_callchain_entry *entry = &__get_cpu_var(cpu_perf_callchain);
-
- entry->nr = 0;
-
- if (!user_mode(regs)) {
- perf_callchain_kernel(regs, entry);
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
- }
-
- if (regs) {
- if (current_is_64bit())
- perf_callchain_user_64(regs, entry);
- else
- perf_callchain_user_32(regs, entry);
- }
-
- return entry;
+ if (current_is_64bit())
+ perf_callchain_user_64(entry, regs);
+ else
+ perf_callchain_user_32(entry, regs);
}
diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index 00143f3..ef076a9 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -44,44 +44,11 @@ static const struct stacktrace_ops callchain_ops = {
.address = callchain_address,
};

-static void
-perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
+void
+perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->pc);

unwind_stack(NULL, regs, NULL, &callchain_ops, entry);
}
-
-static void
-perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- int is_user;
-
- if (!regs)
- return;
-
- is_user = user_mode(regs);
-
- /*
- * Only the kernel side is implemented for now.
- */
- if (!is_user)
- perf_callchain_kernel(regs, entry);
-}
-
-/*
- * No need for separate IRQ and NMI entries.
- */
-static DEFINE_PER_CPU(struct perf_callchain_entry, callchain);
-
-struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
-{
- struct perf_callchain_entry *entry = &__get_cpu_var(callchain);
-
- entry->nr = 0;
-
- perf_do_callchain(regs, entry);
-
- return entry;
-}
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 2a95a90..460162d 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1283,14 +1283,16 @@ void __init init_hw_perf_events(void)
register_die_notifier(&perf_event_nmi_notifier);
}

-static void perf_callchain_kernel(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
unsigned long ksp, fp;
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
int graph = 0;
#endif

+ stack_trace_flush();
+
perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->tpc);

@@ -1330,8 +1332,8 @@ static void perf_callchain_kernel(struct pt_regs *regs,
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

-static void perf_callchain_user_64(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+static void perf_callchain_user_64(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
unsigned long ufp;

@@ -1353,8 +1355,8 @@ static void perf_callchain_user_64(struct pt_regs *regs,
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

-static void perf_callchain_user_32(struct pt_regs *regs,
- struct perf_callchain_entry *entry)
+static void perf_callchain_user_32(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
{
unsigned long ufp;

@@ -1376,30 +1378,12 @@ static void perf_callchain_user_32(struct pt_regs *regs,
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

-/* Like powerpc we can't get PMU interrupts within the PMU handler,
- * so no need for separate NMI and IRQ chains as on x86.
- */
-static DEFINE_PER_CPU(struct perf_callchain_entry, callchain);
-
-struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+void
+perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
- struct perf_callchain_entry *entry = &__get_cpu_var(callchain);
-
- entry->nr = 0;
- if (!user_mode(regs)) {
- stack_trace_flush();
- perf_callchain_kernel(regs, entry);
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
- }
- if (regs) {
- flushw_user();
- if (test_thread_flag(TIF_32BIT))
- perf_callchain_user_32(regs, entry);
- else
- perf_callchain_user_64(regs, entry);
- }
- return entry;
+ flushw_user();
+ if (test_thread_flag(TIF_32BIT))
+ perf_callchain_user_32(entry, regs);
+ else
+ perf_callchain_user_64(entry, regs);
}
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8af28ca..39f8421 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1571,9 +1571,7 @@ const struct pmu *hw_perf_event_init(struct perf_event *event)
* callchain support
*/

-
-static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_irq_entry);
-static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_nmi_entry);
+static DEFINE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry_nmi);


static void
@@ -1607,8 +1605,8 @@ static const struct stacktrace_ops backtrace_ops = {
.walk_stack = print_context_stack_bp,
};

-static void
-perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
+void
+perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->ip);
@@ -1653,14 +1651,12 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
}
#endif

-static void
-perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
+void
+perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
struct stack_frame frame;
const void __user *fp;

- if (!user_mode(regs))
- regs = task_pt_regs(current);

fp = (void __user *)regs->bp;

@@ -1687,42 +1683,17 @@ perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
}
}

-static void
-perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- int is_user;
-
- if (!regs)
- return;
-
- is_user = user_mode(regs);
-
- if (!is_user)
- perf_callchain_kernel(regs, entry);
-
- if (current->mm)
- perf_callchain_user(regs, entry);
-}
-
-struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+struct perf_callchain_entry *perf_callchain_buffer(void)
{
- struct perf_callchain_entry *entry;
-
if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
/* TODO: We don't support guest os callchain now */
return NULL;
}

if (in_nmi())
- entry = &__get_cpu_var(pmc_nmi_entry);
- else
- entry = &__get_cpu_var(pmc_irq_entry);
-
- entry->nr = 0;
-
- perf_do_callchain(regs, entry);
+ return &__get_cpu_var(perf_callchain_entry_nmi);

- return entry;
+ return &__get_cpu_var(perf_callchain_entry);
}

unsigned long perf_instruction_pointer(struct pt_regs *regs)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3588804..4db61dd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -976,7 +976,15 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
extern void perf_event_comm(struct task_struct *tsk);
extern void perf_event_fork(struct task_struct *tsk);

-extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
+/* Callchains */
+DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
+
+extern void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs);
+extern void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs);
+extern struct perf_callchain_entry *perf_callchain_buffer(void);
+

static inline void
perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c772a3d..02efde6 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2937,13 +2937,49 @@ void perf_event_do_pending(void)
__perf_pending_run();
}

+DEFINE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
+
/*
* Callchain support -- arch specific
*/

-__weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+__weak struct perf_callchain_entry *perf_callchain_buffer(void)
{
- return NULL;
+ return &__get_cpu_var(perf_callchain_entry);
+}
+
+__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+__weak void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+{
+ struct perf_callchain_entry *entry;
+
+ entry = perf_callchain_buffer();
+ if (!entry)
+ return NULL;
+
+ entry->nr = 0;
+
+ if (!user_mode(regs)) {
+ perf_callchain_kernel(entry, regs);
+ if (current->mm)
+ regs = task_pt_regs(current);
+ else
+ regs = NULL;
+ }
+
+ if (regs)
+ perf_callchain_user(entry, regs);
+
+ return entry;
}


--
1.6.2.3

2010-08-16 20:48:55

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 4/6] perf: Factorize callchain context handling

Store the kernel and user contexts from the generic layer instead
of in the arch code; this factorizes some repetitive code.
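
In effect the arch callbacks now only walk the stack, while the generic
perf_callchain() pushes the PERF_CONTEXT_* markers around them, along
these lines (a fragment condensed from the kernel/perf_event.c hunk below):

	if (!user_mode(regs)) {
		perf_callchain_store(entry, PERF_CONTEXT_KERNEL);	/* was duplicated in each arch */
		perf_callchain_kernel(entry, regs);
		regs = current->mm ? task_pt_regs(current) : NULL;
	}

	if (regs) {
		perf_callchain_store(entry, PERF_CONTEXT_USER);		/* ditto */
		perf_callchain_user(entry, regs);
	}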

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: David Miller <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/arm/kernel/perf_event.c | 2 --
arch/powerpc/kernel/perf_callchain.c | 3 ---
arch/sh/kernel/perf_callchain.c | 1 -
arch/sparc/kernel/perf_event.c | 3 ---
arch/x86/kernel/cpu/perf_event.c | 2 --
kernel/perf_event.c | 5 ++++-
6 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 0e3bbdb..64ca8c3 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -3049,7 +3049,6 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
struct frame_tail *tail;

- perf_callchain_store(entry, PERF_CONTEXT_USER);

tail = (struct frame_tail *)regs->ARM_fp - 1;

@@ -3076,7 +3075,6 @@ perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
struct stackframe fr;

- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
fr.fp = regs->ARM_fp;
fr.sp = regs->ARM_sp;
fr.lr = regs->ARM_lr;
diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index f7a85ed..d05ae42 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -57,7 +57,6 @@ perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)

lr = regs->link;
sp = regs->gpr[1];
- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->nip);

if (!validate_sp(sp, current, STACK_FRAME_OVERHEAD))
@@ -234,7 +233,6 @@ static void perf_callchain_user_64(struct perf_callchain_entry *entry,
next_ip = regs->nip;
lr = regs->link;
sp = regs->gpr[1];
- perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_store(entry, next_ip);

for (;;) {
@@ -435,7 +433,6 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
next_ip = regs->nip;
lr = regs->link;
sp = regs->gpr[1];
- perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_store(entry, next_ip);

while (entry->nr < PERF_MAX_STACK_DEPTH) {
diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index ef076a9..d5ca1ef 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -47,7 +47,6 @@ static const struct stacktrace_ops callchain_ops = {
void
perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->pc);

unwind_stack(NULL, regs, NULL, &callchain_ops, entry);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 460162d..4bc4029 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1293,7 +1293,6 @@ void perf_callchain_kernel(struct perf_callchain_entry *entry,

stack_trace_flush();

- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->tpc);

ksp = regs->u_regs[UREG_I6];
@@ -1337,7 +1336,6 @@ static void perf_callchain_user_64(struct perf_callchain_entry *entry,
{
unsigned long ufp;

- perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_store(entry, regs->tpc);

ufp = regs->u_regs[UREG_I6] + STACK_BIAS;
@@ -1360,7 +1358,6 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
{
unsigned long ufp;

- perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_store(entry, regs->tpc);

ufp = regs->u_regs[UREG_I6] & 0xffffffffUL;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 39f8421..a3c9222 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1608,7 +1608,6 @@ static const struct stacktrace_ops backtrace_ops = {
void
perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
@@ -1660,7 +1659,6 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)

fp = (void __user *)regs->bp;

- perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_store(entry, regs->ip);

if (perf_callchain_user32(regs, entry))
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 02efde6..615d024 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2969,6 +2969,7 @@ static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
entry->nr = 0;

if (!user_mode(regs)) {
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
perf_callchain_kernel(entry, regs);
if (current->mm)
regs = task_pt_regs(current);
@@ -2976,8 +2977,10 @@ static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
regs = NULL;
}

- if (regs)
+ if (regs) {
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
perf_callchain_user(entry, regs);
+ }

return entry;
}
--
1.6.2.3

2010-08-16 20:49:19

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 5/6] perf: Fix race in callchains

Now that software events don't have interrupts disabled anymore in
the event path, callchains can nest in any context. So separating
the nmi and other contexts into two buffers has become racy.

Fix this by providing one buffer per nesting level. Given the size
of the callchain entries (2040 bytes * 4), we now need to allocate
them dynamically.

v2: Fixed the put_callchain_entry() call after recursion.
Fixed the type of the recursion counter: it must be an array.

v3: Use a manual per cpu allocation (temporary solution until NMIs
can safely access vmalloc'ed memory).
Do a better separation between callchain reference tracking and
allocation. Make the "put" path lockless for non-release cases.
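
The nesting level both selects the buffer and acts as the recursion
barrier, roughly (condensed from the kernel/perf_event.c changes below):

/* One buffer per context: 0 = task, 1 = softirq, 2 = irq, 3 = nmi */
static int get_recursion_context(int *recursion)
{
	int rctx;

	if (in_nmi())
		rctx = 3;
	else if (in_irq())
		rctx = 2;
	else if (in_softirq())
		rctx = 1;
	else
		rctx = 0;

	if (recursion[rctx])
		return -1;	/* already sampling in this context, bail out */

	recursion[rctx]++;
	barrier();

	return rctx;
}

static struct perf_callchain_entry *get_callchain_entry(int *rctx)
{
	*rctx = get_recursion_context(__get_cpu_var(callchain_recursion));
	if (*rctx == -1)
		return NULL;

	return &callchain_cpu_entries[smp_processor_id()][*rctx];
}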

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: David Miller <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 22 ++--
include/linux/perf_event.h | 1 -
kernel/perf_event.c | 265 ++++++++++++++++++++++++++++----------
3 files changed, 205 insertions(+), 83 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a3c9222..8e91cf3 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1608,6 +1608,11 @@ static const struct stacktrace_ops backtrace_ops = {
void
perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ /* TODO: We don't support guest os callchain now */
+ return;
+ }
+
perf_callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
@@ -1656,6 +1661,10 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
struct stack_frame frame;
const void __user *fp;

+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ /* TODO: We don't support guest os callchain now */
+ return;
+ }

fp = (void __user *)regs->bp;

@@ -1681,19 +1690,6 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
}
}

-struct perf_callchain_entry *perf_callchain_buffer(void)
-{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- /* TODO: We don't support guest os callchain now */
- return NULL;
- }
-
- if (in_nmi())
- return &__get_cpu_var(perf_callchain_entry_nmi);
-
- return &__get_cpu_var(perf_callchain_entry);
-}
-
unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
unsigned long ip;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4db61dd..d7e8ea6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -983,7 +983,6 @@ extern void perf_callchain_user(struct perf_callchain_entry *entry,
struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry *entry,
struct pt_regs *regs);
-extern struct perf_callchain_entry *perf_callchain_buffer(void);


static inline void
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 615d024..6ad61fb 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1764,6 +1764,183 @@ static u64 perf_event_read(struct perf_event *event)
}

/*
+ * Callchain support
+ */
+
+static DEFINE_PER_CPU(int, callchain_recursion[4]);
+static atomic_t nr_callchain_events;
+static DEFINE_MUTEX(callchain_mutex);
+static struct perf_callchain_entry **callchain_cpu_entries;
+
+__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+__weak void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+static void release_callchain_buffers(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ kfree(callchain_cpu_entries[cpu]);
+ callchain_cpu_entries[cpu] = NULL;
+ }
+
+ kfree(callchain_cpu_entries);
+ callchain_cpu_entries = NULL;
+}
+
+static int alloc_callchain_buffers(void)
+{
+ int cpu;
+ struct perf_callchain_entry **entries;
+
+ /*
+ * We can't use the percpu allocation API for data that can be
+ * accessed from NMI. Use a temporary manual per cpu allocation
+ * until that gets sorted out.
+ */
+ entries = kzalloc(sizeof(*entries) * num_possible_cpus(), GFP_KERNEL);
+ if (!entries)
+ return -ENOMEM;
+
+ callchain_cpu_entries = entries;
+
+ for_each_possible_cpu(cpu) {
+ entries[cpu] = kmalloc(sizeof(struct perf_callchain_entry) * 4,
+ GFP_KERNEL);
+ if (!entries[cpu])
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int get_callchain_buffers(void)
+{
+ int err = 0;
+ int count;
+
+ mutex_lock(&callchain_mutex);
+
+ count = atomic_inc_return(&nr_callchain_events);
+ if (WARN_ON_ONCE(count < 1)) {
+ err = -EINVAL;
+ goto exit;
+ }
+
+ if (count > 1) {
+ /* If the allocation failed, give up */
+ if (!callchain_cpu_entries)
+ err = -ENOMEM;
+ goto exit;
+ }
+
+ err = alloc_callchain_buffers();
+ if (err)
+ release_callchain_buffers();
+exit:
+ mutex_unlock(&callchain_mutex);
+
+ return err;
+}
+
+static void put_callchain_buffers(void)
+{
+ if (atomic_dec_and_mutex_lock(&nr_callchain_events, &callchain_mutex)) {
+ release_callchain_buffers();
+ mutex_unlock(&callchain_mutex);
+ }
+}
+
+static int get_recursion_context(int *recursion)
+{
+ int rctx;
+
+ if (in_nmi())
+ rctx = 3;
+ else if (in_irq())
+ rctx = 2;
+ else if (in_softirq())
+ rctx = 1;
+ else
+ rctx = 0;
+
+ if (recursion[rctx])
+ return -1;
+
+ recursion[rctx]++;
+ barrier();
+
+ return rctx;
+}
+
+static inline void put_recursion_context(int *recursion, int rctx)
+{
+ barrier();
+ recursion[rctx]--;
+}
+
+static struct perf_callchain_entry *get_callchain_entry(int *rctx)
+{
+ int cpu;
+
+ *rctx = get_recursion_context(__get_cpu_var(callchain_recursion));
+ if (*rctx == -1)
+ return NULL;
+
+ cpu = smp_processor_id();
+
+ return &callchain_cpu_entries[cpu][*rctx];
+}
+
+static void
+put_callchain_entry(int rctx)
+{
+ put_recursion_context(__get_cpu_var(callchain_recursion), rctx);
+}
+
+static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+{
+ int rctx;
+ struct perf_callchain_entry *entry;
+
+
+ entry = get_callchain_entry(&rctx);
+ if (rctx == -1)
+ return NULL;
+
+ if (!entry)
+ goto exit_put;
+
+ entry->nr = 0;
+
+ if (!user_mode(regs)) {
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_kernel(entry, regs);
+ if (current->mm)
+ regs = task_pt_regs(current);
+ else
+ regs = NULL;
+ }
+
+ if (regs) {
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_user(entry, regs);
+ }
+
+exit_put:
+ put_callchain_entry(rctx);
+
+ return entry;
+}
+
+/*
* Initialize the perf_event context in a task_struct:
*/
static void
@@ -1895,6 +2072,8 @@ static void free_event(struct perf_event *event)
atomic_dec(&nr_comm_events);
if (event->attr.task)
atomic_dec(&nr_task_events);
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+ put_callchain_buffers();
}

if (event->buffer) {
@@ -2937,55 +3116,6 @@ void perf_event_do_pending(void)
__perf_pending_run();
}

-DEFINE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
-
-/*
- * Callchain support -- arch specific
- */
-
-__weak struct perf_callchain_entry *perf_callchain_buffer(void)
-{
- return &__get_cpu_var(perf_callchain_entry);
-}
-
-__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
- struct pt_regs *regs)
-{
-}
-
-__weak void perf_callchain_user(struct perf_callchain_entry *entry,
- struct pt_regs *regs)
-{
-}
-
-static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
-{
- struct perf_callchain_entry *entry;
-
- entry = perf_callchain_buffer();
- if (!entry)
- return NULL;
-
- entry->nr = 0;
-
- if (!user_mode(regs)) {
- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
- perf_callchain_kernel(entry, regs);
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
- }
-
- if (regs) {
- perf_callchain_store(entry, PERF_CONTEXT_USER);
- perf_callchain_user(entry, regs);
- }
-
- return entry;
-}
-
-
/*
* We assume there is only KVM supporting the callbacks.
* Later on, we might change it to a list if there is
@@ -3480,14 +3610,20 @@ static void perf_event_output(struct perf_event *event, int nmi,
struct perf_output_handle handle;
struct perf_event_header header;

+ /* protect the callchain buffers */
+ rcu_read_lock();
+
perf_prepare_sample(&header, data, event, regs);

if (perf_output_begin(&handle, event, header.size, nmi, 1))
- return;
+ goto exit;

perf_output_sample(&handle, &header, data, event);

perf_output_end(&handle);
+
+exit:
+ rcu_read_unlock();
}

/*
@@ -4243,32 +4379,16 @@ end:
int perf_swevent_get_recursion_context(void)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
- int rctx;

- if (in_nmi())
- rctx = 3;
- else if (in_irq())
- rctx = 2;
- else if (in_softirq())
- rctx = 1;
- else
- rctx = 0;
-
- if (cpuctx->recursion[rctx])
- return -1;
-
- cpuctx->recursion[rctx]++;
- barrier();
-
- return rctx;
+ return get_recursion_context(cpuctx->recursion);
}
EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);

void inline perf_swevent_put_recursion_context(int rctx)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
- barrier();
- cpuctx->recursion[rctx]--;
+
+ put_recursion_context(cpuctx->recursion, rctx);
}

void __perf_sw_event(u32 event_id, u64 nr, int nmi,
@@ -4968,6 +5088,13 @@ done:
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
+ err = get_callchain_buffers();
+ if (err) {
+ free_event(event);
+ return ERR_PTR(err);
+ }
+ }
}

return event;
--
1.6.2.3

2010-08-16 20:49:03

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 6/6] perf: Humanize the number of contexts

Instead of hardcoding the number of contexts for the recursion
barriers, define a cpp constant to make the code more
self-explanatory.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
include/linux/perf_event.h | 14 ++++++++------
kernel/perf_event.c | 6 +++---
2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d7e8ea6..ae6fa60 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -808,6 +808,12 @@ struct perf_event_context {
struct rcu_head rcu_head;
};

+/*
+ * Number of contexts where an event can trigger:
+ * task, softirq, hardirq, nmi.
+ */
+#define PERF_NR_CONTEXTS 4
+
/**
* struct perf_event_cpu_context - per cpu event context structure
*/
@@ -821,12 +827,8 @@ struct perf_cpu_context {
struct mutex hlist_mutex;
int hlist_refcount;

- /*
- * Recursion avoidance:
- *
- * task, softirq, irq, nmi context
- */
- int recursion[4];
+ /* Recursion avoidance in each context */
+ int recursion[PERF_NR_CONTEXTS];
};

struct perf_output_handle {
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 6ad61fb..e41e29b 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1767,7 +1767,7 @@ static u64 perf_event_read(struct perf_event *event)
* Callchain support
*/

-static DEFINE_PER_CPU(int, callchain_recursion[4]);
+static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
static atomic_t nr_callchain_events;
static DEFINE_MUTEX(callchain_mutex);
static struct perf_callchain_entry **callchain_cpu_entries;
@@ -1812,8 +1812,8 @@ static int alloc_callchain_buffers(void)
callchain_cpu_entries = entries;

for_each_possible_cpu(cpu) {
- entries[cpu] = kmalloc(sizeof(struct perf_callchain_entry) * 4,
- GFP_KERNEL);
+ entries[cpu] = kmalloc(sizeof(struct perf_callchain_entry) *
+ PERF_NR_CONTEXTS, GFP_KERNEL);
if (!entries[cpu])
return -ENOMEM;
}
--
1.6.2.3

2010-08-16 20:49:54

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 2/6] perf: Generalize callchain_store()

callchain_store() is the same on every arch, so inline it in
perf_event.h and rename it to perf_callchain_store() to avoid
any collision.

This removes repetitive code.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: David Miller <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/arm/kernel/perf_event.c | 15 +++---------
arch/powerpc/kernel/perf_callchain.c | 40 ++++++++++++----------------------
arch/sh/kernel/perf_callchain.c | 11 ++------
arch/sparc/kernel/perf_event.c | 26 ++++++++-------------
arch/x86/kernel/cpu/perf_event.c | 20 ++++++-----------
include/linux/perf_event.h | 7 ++++++
6 files changed, 45 insertions(+), 74 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index fdcb0be..a07c3b1 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -3001,13 +3001,6 @@ arch_initcall(init_hw_perf_events);
/*
* Callchain handling code.
*/
-static inline void
-callchain_store(struct perf_callchain_entry *entry,
- u64 ip)
-{
- if (entry->nr < PERF_MAX_STACK_DEPTH)
- entry->ip[entry->nr++] = ip;
-}

/*
* The registers we're interested in are at the end of the variable
@@ -3039,7 +3032,7 @@ user_backtrace(struct frame_tail *tail,
if (__copy_from_user_inatomic(&buftail, tail, sizeof(buftail)))
return NULL;

- callchain_store(entry, buftail.lr);
+ perf_callchain_store(entry, buftail.lr);

/*
* Frame pointers should strictly progress back up the stack
@@ -3057,7 +3050,7 @@ perf_callchain_user(struct pt_regs *regs,
{
struct frame_tail *tail;

- callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);

if (!user_mode(regs))
regs = task_pt_regs(current);
@@ -3078,7 +3071,7 @@ callchain_trace(struct stackframe *fr,
void *data)
{
struct perf_callchain_entry *entry = data;
- callchain_store(entry, fr->pc);
+ perf_callchain_store(entry, fr->pc);
return 0;
}

@@ -3088,7 +3081,7 @@ perf_callchain_kernel(struct pt_regs *regs,
{
struct stackframe fr;

- callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
fr.fp = regs->ARM_fp;
fr.sp = regs->ARM_sp;
fr.lr = regs->ARM_lr;
diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index 95ad9da..a286c2e 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -23,18 +23,6 @@
#include "ppc32.h"
#endif

-/*
- * Store another value in a callchain_entry.
- */
-static inline void callchain_store(struct perf_callchain_entry *entry, u64 ip)
-{
- unsigned int nr = entry->nr;
-
- if (nr < PERF_MAX_STACK_DEPTH) {
- entry->ip[nr] = ip;
- entry->nr = nr + 1;
- }
-}

/*
* Is sp valid as the address of the next kernel stack frame after prev_sp?
@@ -69,8 +57,8 @@ static void perf_callchain_kernel(struct pt_regs *regs,

lr = regs->link;
sp = regs->gpr[1];
- callchain_store(entry, PERF_CONTEXT_KERNEL);
- callchain_store(entry, regs->nip);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, regs->nip);

if (!validate_sp(sp, current, STACK_FRAME_OVERHEAD))
return;
@@ -89,7 +77,7 @@ static void perf_callchain_kernel(struct pt_regs *regs,
next_ip = regs->nip;
lr = regs->link;
level = 0;
- callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);

} else {
if (level == 0)
@@ -111,7 +99,7 @@ static void perf_callchain_kernel(struct pt_regs *regs,
++level;
}

- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, next_ip);
if (!valid_next_sp(next_sp, sp))
return;
sp = next_sp;
@@ -246,8 +234,8 @@ static void perf_callchain_user_64(struct pt_regs *regs,
next_ip = regs->nip;
lr = regs->link;
sp = regs->gpr[1];
- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, next_ip);

for (;;) {
fp = (unsigned long __user *) sp;
@@ -276,14 +264,14 @@ static void perf_callchain_user_64(struct pt_regs *regs,
read_user_stack_64(&uregs[PT_R1], &sp))
return;
level = 0;
- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, next_ip);
continue;
}

if (level == 0)
next_ip = lr;
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, next_ip);
++level;
sp = next_sp;
}
@@ -447,8 +435,8 @@ static void perf_callchain_user_32(struct pt_regs *regs,
next_ip = regs->nip;
lr = regs->link;
sp = regs->gpr[1];
- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, next_ip);

while (entry->nr < PERF_MAX_STACK_DEPTH) {
fp = (unsigned int __user *) (unsigned long) sp;
@@ -470,14 +458,14 @@ static void perf_callchain_user_32(struct pt_regs *regs,
read_user_stack_32(&uregs[PT_R1], &sp))
return;
level = 0;
- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, next_ip);
continue;
}

if (level == 0)
next_ip = lr;
- callchain_store(entry, next_ip);
+ perf_callchain_store(entry, next_ip);
++level;
sp = next_sp;
}
diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index 1d6dbce..00143f3 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -14,11 +14,6 @@
#include <asm/unwinder.h>
#include <asm/ptrace.h>

-static inline void callchain_store(struct perf_callchain_entry *entry, u64 ip)
-{
- if (entry->nr < PERF_MAX_STACK_DEPTH)
- entry->ip[entry->nr++] = ip;
-}

static void callchain_warning(void *data, char *msg)
{
@@ -39,7 +34,7 @@ static void callchain_address(void *data, unsigned long addr, int reliable)
struct perf_callchain_entry *entry = data;

if (reliable)
- callchain_store(entry, addr);
+ perf_callchain_store(entry, addr);
}

static const struct stacktrace_ops callchain_ops = {
@@ -52,8 +47,8 @@ static const struct stacktrace_ops callchain_ops = {
static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
- callchain_store(entry, PERF_CONTEXT_KERNEL);
- callchain_store(entry, regs->pc);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, regs->pc);

unwind_stack(NULL, regs, NULL, &callchain_ops, entry);
}
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 357ced3..2a95a90 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1283,12 +1283,6 @@ void __init init_hw_perf_events(void)
register_die_notifier(&perf_event_nmi_notifier);
}

-static inline void callchain_store(struct perf_callchain_entry *entry, u64 ip)
-{
- if (entry->nr < PERF_MAX_STACK_DEPTH)
- entry->ip[entry->nr++] = ip;
-}
-
static void perf_callchain_kernel(struct pt_regs *regs,
struct perf_callchain_entry *entry)
{
@@ -1297,8 +1291,8 @@ static void perf_callchain_kernel(struct pt_regs *regs,
int graph = 0;
#endif

- callchain_store(entry, PERF_CONTEXT_KERNEL);
- callchain_store(entry, regs->tpc);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, regs->tpc);

ksp = regs->u_regs[UREG_I6];
fp = ksp + STACK_BIAS;
@@ -1322,13 +1316,13 @@ static void perf_callchain_kernel(struct pt_regs *regs,
pc = sf->callers_pc;
fp = (unsigned long)sf->fp + STACK_BIAS;
}
- callchain_store(entry, pc);
+ perf_callchain_store(entry, pc);
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
if ((pc + 8UL) == (unsigned long) &return_to_handler) {
int index = current->curr_ret_stack;
if (current->ret_stack && index >= graph) {
pc = current->ret_stack[index - graph].ret;
- callchain_store(entry, pc);
+ perf_callchain_store(entry, pc);
graph++;
}
}
@@ -1341,8 +1335,8 @@ static void perf_callchain_user_64(struct pt_regs *regs,
{
unsigned long ufp;

- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, regs->tpc);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, regs->tpc);

ufp = regs->u_regs[UREG_I6] + STACK_BIAS;
do {
@@ -1355,7 +1349,7 @@ static void perf_callchain_user_64(struct pt_regs *regs,

pc = sf.callers_pc;
ufp = (unsigned long)sf.fp + STACK_BIAS;
- callchain_store(entry, pc);
+ perf_callchain_store(entry, pc);
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

@@ -1364,8 +1358,8 @@ static void perf_callchain_user_32(struct pt_regs *regs,
{
unsigned long ufp;

- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, regs->tpc);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, regs->tpc);

ufp = regs->u_regs[UREG_I6] & 0xffffffffUL;
do {
@@ -1378,7 +1372,7 @@ static void perf_callchain_user_32(struct pt_regs *regs,

pc = sf.callers_pc;
ufp = (unsigned long)sf.fp;
- callchain_store(entry, pc);
+ perf_callchain_store(entry, pc);
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4a4d191..8af28ca 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1571,12 +1571,6 @@ const struct pmu *hw_perf_event_init(struct perf_event *event)
* callchain support
*/

-static inline
-void callchain_store(struct perf_callchain_entry *entry, u64 ip)
-{
- if (entry->nr < PERF_MAX_STACK_DEPTH)
- entry->ip[entry->nr++] = ip;
-}

static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_irq_entry);
static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_nmi_entry);
@@ -1602,7 +1596,7 @@ static void backtrace_address(void *data, unsigned long addr, int reliable)
{
struct perf_callchain_entry *entry = data;

- callchain_store(entry, addr);
+ perf_callchain_store(entry, addr);
}

static const struct stacktrace_ops backtrace_ops = {
@@ -1616,8 +1610,8 @@ static const struct stacktrace_ops backtrace_ops = {
static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
- callchain_store(entry, PERF_CONTEXT_KERNEL);
- callchain_store(entry, regs->ip);
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}
@@ -1646,7 +1640,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
if (fp < compat_ptr(regs->sp))
break;

- callchain_store(entry, frame.return_address);
+ perf_callchain_store(entry, frame.return_address);
fp = compat_ptr(frame.next_frame);
}
return 1;
@@ -1670,8 +1664,8 @@ perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)

fp = (void __user *)regs->bp;

- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, regs->ip);
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_store(entry, regs->ip);

if (perf_callchain_user32(regs, entry))
return;
@@ -1688,7 +1682,7 @@ perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
if ((unsigned long)fp < regs->sp)
break;

- callchain_store(entry, frame.return_address);
+ perf_callchain_store(entry, frame.return_address);
fp = frame.next_frame;
}
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 937495c..3588804 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -978,6 +978,13 @@ extern void perf_event_fork(struct task_struct *tsk);

extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);

+static inline void
+perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
+{
+ if (entry->nr < PERF_MAX_STACK_DEPTH)
+ entry->ip[entry->nr++] = ip;
+}
+
extern int sysctl_perf_event_paranoid;
extern int sysctl_perf_event_mlock;
extern int sysctl_perf_event_sample_rate;
--
1.6.2.3

2010-08-17 01:34:30

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 5/6 v4] perf: Fix race in callchains

Now that software events don't have interrupts disabled anymore in
the event path, callchains can nest in any context. So separating
the nmi and other contexts into two buffers has become racy.

Fix this by providing one buffer per nesting level. Given the size
of the callchain entries (2040 bytes * 4), we now need to allocate
them dynamically.

v2: Fixed the put_callchain_entry() call after recursion.
Fixed the type of the recursion counter: it must be an array.

v3: Use a manual per cpu allocation (temporary solution until NMIs
can safely access vmalloc'ed memory).
Do a better separation between callchain reference tracking and
allocation. Make the "put" path lockless for non-release cases.

v4: Protect the callchain buffers with rcu.
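
The rcu scheme is the usual publish/unpublish pattern: lookups
dereference the buffer pointer under rcu_read_lock() (now held across
the whole output path), and the release side unpublishes the pointer
and frees it from an rcu callback. Condensed from the hunks below:

static void release_callchain_buffers(void)
{
	struct callchain_cpus_entries *entries = callchain_cpus_entries;

	rcu_assign_pointer(callchain_cpus_entries, NULL);
	call_rcu(&entries->rcu_head, release_callchain_buffers_rcu);
}

static struct perf_callchain_entry *get_callchain_entry(int *rctx)
{
	struct callchain_cpus_entries *entries;

	*rctx = get_recursion_context(__get_cpu_var(callchain_recursion));
	if (*rctx == -1)
		return NULL;

	/* only valid under rcu_read_lock(), taken in perf_event_output() */
	entries = rcu_dereference(callchain_cpus_entries);
	if (!entries)
		return NULL;

	return &entries->cpu_entries[smp_processor_id()][*rctx];
}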

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: David Miller <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 22 ++--
include/linux/perf_event.h | 1 -
kernel/perf_event.c | 297 +++++++++++++++++++++++++++++---------
3 files changed, 237 insertions(+), 83 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a3c9222..8e91cf3 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1608,6 +1608,11 @@ static const struct stacktrace_ops backtrace_ops = {
void
perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
{
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ /* TODO: We don't support guest os callchain now */
+ return;
+ }
+
perf_callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
@@ -1656,6 +1661,10 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
struct stack_frame frame;
const void __user *fp;

+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ /* TODO: We don't support guest os callchain now */
+ return;
+ }

fp = (void __user *)regs->bp;

@@ -1681,19 +1690,6 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
}
}

-struct perf_callchain_entry *perf_callchain_buffer(void)
-{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- /* TODO: We don't support guest os callchain now */
- return NULL;
- }
-
- if (in_nmi())
- return &__get_cpu_var(perf_callchain_entry_nmi);
-
- return &__get_cpu_var(perf_callchain_entry);
-}
-
unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
unsigned long ip;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4db61dd..d7e8ea6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -983,7 +983,6 @@ extern void perf_callchain_user(struct perf_callchain_entry *entry,
struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry *entry,
struct pt_regs *regs);
-extern struct perf_callchain_entry *perf_callchain_buffer(void);


static inline void
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 615d024..3a44091 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1764,6 +1764,215 @@ static u64 perf_event_read(struct perf_event *event)
}

/*
+ * Callchain support
+ */
+
+struct callchain_cpus_entries {
+ struct rcu_head rcu_head;
+ struct perf_callchain_entry *cpu_entries[0];
+};
+
+static DEFINE_PER_CPU(int, callchain_recursion[4]);
+static atomic_t nr_callchain_events;
+static DEFINE_MUTEX(callchain_mutex);
+struct callchain_cpus_entries *callchain_cpus_entries;
+
+
+__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+__weak void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+{
+}
+
+static void release_callchain_buffers_rcu(struct rcu_head *head)
+{
+ struct callchain_cpus_entries *entries;
+ int cpu;
+
+ entries = container_of(head, struct callchain_cpus_entries, rcu_head);
+
+ for_each_possible_cpu(cpu)
+ kfree(entries->cpu_entries[cpu]);
+
+ kfree(entries);
+}
+
+static void release_callchain_buffers(void)
+{
+ struct callchain_cpus_entries *entries;
+
+ entries = callchain_cpus_entries;
+ rcu_assign_pointer(callchain_cpus_entries, NULL);
+ call_rcu(&entries->rcu_head, release_callchain_buffers_rcu);
+}
+
+static int alloc_callchain_buffers(void)
+{
+ int cpu;
+ int size;
+ struct callchain_cpus_entries *entries;
+
+ /*
+ * We can't use the percpu allocation API for data that can be
+ * accessed from NMI. Use a temporary manual per cpu allocation
+ * until that gets sorted out.
+ */
+ size = sizeof(*entries) + sizeof(struct perf_callchain_entry *) *
+ num_possible_cpus();
+
+ entries = kzalloc(size, GFP_KERNEL);
+ if (!entries)
+ return -ENOMEM;
+
+ size = sizeof(struct perf_callchain_entry) * 4;
+
+ for_each_possible_cpu(cpu) {
+ entries->cpu_entries[cpu] = kmalloc(size, GFP_KERNEL);
+ if (!entries->cpu_entries[cpu])
+ goto fail;
+ }
+
+ rcu_assign_pointer(callchain_cpus_entries, entries);
+
+ return 0;
+
+fail:
+ for_each_possible_cpu(cpu)
+ kfree(entries->cpu_entries[cpu]);
+ kfree(entries);
+
+ return -ENOMEM;
+}
+
+static int get_callchain_buffers(void)
+{
+ int err = 0;
+ int count;
+
+ mutex_lock(&callchain_mutex);
+
+ count = atomic_inc_return(&nr_callchain_events);
+ if (WARN_ON_ONCE(count < 1)) {
+ err = -EINVAL;
+ goto exit;
+ }
+
+ if (count > 1) {
+ /* If the allocation failed, give up */
+ if (!callchain_cpus_entries)
+ err = -ENOMEM;
+ goto exit;
+ }
+
+ err = alloc_callchain_buffers();
+ if (err)
+ release_callchain_buffers();
+exit:
+ mutex_unlock(&callchain_mutex);
+
+ return err;
+}
+
+static void put_callchain_buffers(void)
+{
+ if (atomic_dec_and_mutex_lock(&nr_callchain_events, &callchain_mutex)) {
+ release_callchain_buffers();
+ mutex_unlock(&callchain_mutex);
+ }
+}
+
+static int get_recursion_context(int *recursion)
+{
+ int rctx;
+
+ if (in_nmi())
+ rctx = 3;
+ else if (in_irq())
+ rctx = 2;
+ else if (in_softirq())
+ rctx = 1;
+ else
+ rctx = 0;
+
+ if (recursion[rctx])
+ return -1;
+
+ recursion[rctx]++;
+ barrier();
+
+ return rctx;
+}
+
+static inline void put_recursion_context(int *recursion, int rctx)
+{
+ barrier();
+ recursion[rctx]--;
+}
+
+static struct perf_callchain_entry *get_callchain_entry(int *rctx)
+{
+ int cpu;
+ struct callchain_cpus_entries *entries;
+
+ *rctx = get_recursion_context(__get_cpu_var(callchain_recursion));
+ if (*rctx == -1)
+ return NULL;
+
+ entries = rcu_dereference(callchain_cpus_entries);
+ if (!entries)
+ return NULL;
+
+ cpu = smp_processor_id();
+
+ return &entries->cpu_entries[cpu][*rctx];
+}
+
+static void
+put_callchain_entry(int rctx)
+{
+ put_recursion_context(__get_cpu_var(callchain_recursion), rctx);
+}
+
+static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+{
+ int rctx;
+ struct perf_callchain_entry *entry;
+
+
+ entry = get_callchain_entry(&rctx);
+ if (rctx == -1)
+ return NULL;
+
+ if (!entry)
+ goto exit_put;
+
+ entry->nr = 0;
+
+ if (!user_mode(regs)) {
+ perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+ perf_callchain_kernel(entry, regs);
+ if (current->mm)
+ regs = task_pt_regs(current);
+ else
+ regs = NULL;
+ }
+
+ if (regs) {
+ perf_callchain_store(entry, PERF_CONTEXT_USER);
+ perf_callchain_user(entry, regs);
+ }
+
+exit_put:
+ put_callchain_entry(rctx);
+
+ return entry;
+}
+
+/*
* Initialize the perf_event context in a task_struct:
*/
static void
@@ -1895,6 +2104,8 @@ static void free_event(struct perf_event *event)
atomic_dec(&nr_comm_events);
if (event->attr.task)
atomic_dec(&nr_task_events);
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+ put_callchain_buffers();
}

if (event->buffer) {
@@ -2937,55 +3148,6 @@ void perf_event_do_pending(void)
__perf_pending_run();
}

-DEFINE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
-
-/*
- * Callchain support -- arch specific
- */
-
-__weak struct perf_callchain_entry *perf_callchain_buffer(void)
-{
- return &__get_cpu_var(perf_callchain_entry);
-}
-
-__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
- struct pt_regs *regs)
-{
-}
-
-__weak void perf_callchain_user(struct perf_callchain_entry *entry,
- struct pt_regs *regs)
-{
-}
-
-static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
-{
- struct perf_callchain_entry *entry;
-
- entry = perf_callchain_buffer();
- if (!entry)
- return NULL;
-
- entry->nr = 0;
-
- if (!user_mode(regs)) {
- perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
- perf_callchain_kernel(entry, regs);
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
- }
-
- if (regs) {
- perf_callchain_store(entry, PERF_CONTEXT_USER);
- perf_callchain_user(entry, regs);
- }
-
- return entry;
-}
-
-
/*
* We assume there is only KVM supporting the callbacks.
* Later on, we might change it to a list if there is
@@ -3480,14 +3642,20 @@ static void perf_event_output(struct perf_event *event, int nmi,
struct perf_output_handle handle;
struct perf_event_header header;

+ /* protect the callchain buffers */
+ rcu_read_lock();
+
perf_prepare_sample(&header, data, event, regs);

if (perf_output_begin(&handle, event, header.size, nmi, 1))
- return;
+ goto exit;

perf_output_sample(&handle, &header, data, event);

perf_output_end(&handle);
+
+exit:
+ rcu_read_unlock();
}

/*
@@ -4243,32 +4411,16 @@ end:
int perf_swevent_get_recursion_context(void)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
- int rctx;

- if (in_nmi())
- rctx = 3;
- else if (in_irq())
- rctx = 2;
- else if (in_softirq())
- rctx = 1;
- else
- rctx = 0;
-
- if (cpuctx->recursion[rctx])
- return -1;
-
- cpuctx->recursion[rctx]++;
- barrier();
-
- return rctx;
+ return get_recursion_context(cpuctx->recursion);
}
EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);

void inline perf_swevent_put_recursion_context(int rctx)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
- barrier();
- cpuctx->recursion[rctx]--;
+
+ put_recursion_context(cpuctx->recursion, rctx);
}

void __perf_sw_event(u32 event_id, u64 nr, int nmi,
@@ -4968,6 +5120,13 @@ done:
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
+ err = get_callchain_buffers();
+ if (err) {
+ free_event(event);
+ return ERR_PTR(err);
+ }
+ }
}

return event;
--
1.6.2.3

2010-08-17 04:53:41

by Paul Mackerras

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] perf: Generalize some arch callchain code

On Mon, Aug 16, 2010 at 10:48:32PM +0200, Frederic Weisbecker wrote:

> - Most archs use one callchain buffer per cpu, except x86 that needs
> to deal with NMIs. Provide a default perf_callchain_buffer()
> implementation that x86 overrides.
>
> - Centralize all the kernel/user regs handling and invoke new arch
> handlers from there: perf_callchain_user() / perf_callchain_kernel()
> That avoids all the user_mode() and current->mm checks and so on.
>
> - Invert some parameters in perf_callchain_*() helpers: entry to the
> left, regs to the right, following the traditional (src, dst).

I think you mean (dst, src), don't you?

Apart from that, the patch looks OK, so

Acked-by: Paul Mackerras <[email protected]>

Paul.
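
For illustration, a rough sketch of the kind of x86 override the quoted changelog describes, before patch 5/6 replaces these static buffers with dynamically allocated ones. The per-cpu variable names are made up, not the ones from the tree; other archs would keep the single-buffer __weak default from the core:

#include <linux/perf_event.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>

/*
 * Sketch only: a separate per-cpu entry for NMI context, so an NMI
 * cannot clobber the buffer of the sampling path it interrupted.
 */
static DEFINE_PER_CPU(struct perf_callchain_entry, callchain_irq_entry);
static DEFINE_PER_CPU(struct perf_callchain_entry, callchain_nmi_entry);

struct perf_callchain_entry *perf_callchain_buffer(void)
{
	if (in_nmi())
		return &__get_cpu_var(callchain_nmi_entry);

	return &__get_cpu_var(callchain_irq_entry);
}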

2010-08-17 04:53:38

by Paul Mackerras

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6 v4] perf: Fix race in callchains

On Tue, Aug 17, 2010 at 03:34:06AM +0200, Frederic Weisbecker wrote:

> Now that software events don't have interrupts disabled anymore in
> the event path, callchains can nest in any context. So separating
> NMI and the other contexts into two buffers has become racy.
>
> Fix this by providing one buffer per nesting level. Given the size
> of the callchain entries (2040 bytes * 4), we now need to allocate
> them dynamically.
>
> v2: Fixed put_callchain_entry call after recursion.
> Fix the type of the recursion, it must be an array.
>
> v3: Use a manual per cpu allocation (temporary solution until NMIs
> can safely access vmalloc'ed memory).

It would be nice to make these allocations node-local.

Also, I see that we're allocating 4 buffers per cpu on powerpc when we
strictly only need 3, but I don't suppose that really matters.

Paul.

2010-08-17 04:53:58

by Paul Mackerras

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] perf: Factorize callchain context handling

On Mon, Aug 16, 2010 at 10:48:33PM +0200, Frederic Weisbecker wrote:

> Store the kernel and user contexts from the generic layer instead
> of archs, this gathers some repetitive code.

Acked-by: Paul Mackerras <[email protected]>

2010-08-17 04:53:40

by Paul Mackerras

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] perf: Generalize callchain_store()

On Mon, Aug 16, 2010 at 10:48:31PM +0200, Frederic Weisbecker wrote:

> callchain_store() is the same on every archs, inline it in
> perf_event.h and rename it to perf_callchain_store() to avoid
> any collision.
>
> This removes repetitive code.

Acked-by: Paul Mackerras <[email protected]>

2010-08-17 04:56:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups

From: Frederic Weisbecker <[email protected]>
Date: Mon, Aug 16, 2010 at 04:48:29PM -0400

Hey Frederic,

> Built-tested on PowerPc, Arm and SuperH (can someone point me to a
> ready to use sparc cross compiler, like those that can be found there
> for other archs: http://www.codesourcery.com/sgpp/lite_edition.html ?)

Have you by chance tried http://www.kernel.org/pub/tools/crosstool/
already?

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2010-08-17 10:34:17

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups

On Mon, 2010-08-16 at 21:48 +0100, Frederic Weisbecker wrote:
> Hi,
>
Hi Frederic,
>
> This set factorizes a lot of callchain code and fixes concurrent
> callchain buffers accesses.

> You can test it by pulling that branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
> perf/callchains
>

I've tested this on an ARM Cortex-A9 board and it all seems fine [plus
the code is a lot cleaner!].

Tested-by: Will Deacon <[email protected]>

Cheers,

Will

2010-08-18 03:49:56

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6 v4] perf: Fix race in callchains

On Tue, Aug 17, 2010 at 02:53:31PM +1000, Paul Mackerras wrote:
> On Tue, Aug 17, 2010 at 03:34:06AM +0200, Frederic Weisbecker wrote:
>
> > Now that software events don't have interrupts disabled anymore in
> > the event path, callchains can nest in any context. So separating
> > NMI and the other contexts into two buffers has become racy.
> >
> > Fix this by providing one buffer per nesting level. Given the size
> > of the callchain entries (2040 bytes * 4), we now need to allocate
> > them dynamically.
> >
> > v2: Fixed put_callchain_entry call after recursion.
> > Fix the type of the recursion, it must be an array.
> >
> > v3: Use a manual per cpu allocation (temporary solution until NMIs
> > can safely access vmalloc'ed memory).
>
> It would be nice to make these allocations node-local.


You're right, I'll use kmalloc_node then.
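
For illustration, a minimal sketch of what such a node-local, manual per-cpu allocation could look like. The 4-context layout and the callchain_cpus_entries structure follow patch 5/6, but this is not the actual v4 code and NR_CALLCHAIN_CONTEXTS is a placeholder constant:

#include <linux/perf_event.h>
#include <linux/slab.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/rcupdate.h>

#define NR_CALLCHAIN_CONTEXTS	4	/* task, softirq, hardirq, nmi */

struct callchain_cpus_entries {
	struct rcu_head			rcu_head;
	struct perf_callchain_entry	*cpu_entries[0];
};

static struct callchain_cpus_entries *callchain_cpus_entries;

static int alloc_callchain_buffers(void)
{
	int cpu;
	size_t size;
	struct callchain_cpus_entries *entries;

	/* Array of per-cpu buffer pointers, one slot per possible CPU. */
	size = sizeof(*entries) +
	       sizeof(struct perf_callchain_entry *) * num_possible_cpus();

	entries = kzalloc(size, GFP_KERNEL);
	if (!entries)
		return -ENOMEM;

	/* One callchain entry per recursion context on each CPU ... */
	size = sizeof(struct perf_callchain_entry) * NR_CALLCHAIN_CONTEXTS;

	for_each_possible_cpu(cpu) {
		/* ... allocated on that CPU's own node, as suggested above. */
		entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
							 cpu_to_node(cpu));
		if (!entries->cpu_entries[cpu])
			goto fail;
	}

	/* Publish for the RCU-protected readers in the sampling path. */
	rcu_assign_pointer(callchain_cpus_entries, entries);

	return 0;

fail:
	for_each_possible_cpu(cpu)
		kfree(entries->cpu_entries[cpu]);
	kfree(entries);

	return -ENOMEM;
}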


> Also, I see that we're allocating 4 buffers per cpu on powerpc when we
> strictly only need 3, but I don't suppose that really matters.


Yeah, we allocate a similar per-context set of buffers for trace
events and also for software events.

If we had a way to know whether an arch has NMIs, we could manage that.
Whether they are simulated or genuinely unmaskable doesn't matter;
we just need to know if events can nest inside traditional irq-disabled
sections.

How do you deal with that on PPC? Is the event delayed until irqs are
restored?
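
Purely as an illustration of that idea, a hypothetical knob could look like the sketch below; neither the helper nor the constant exists in this series:

#include <linux/types.h>

/* Hypothetical: an arch that never delivers events from NMI(-like)
 * context could override this to return false and save one buffer. */
#ifndef arch_perf_have_nmi
static inline bool arch_perf_have_nmi(void)
{
	return true;	/* conservative default: assume NMI nesting */
}
#endif

static inline int nr_callchain_contexts(void)
{
	/* task, softirq and hardirq always nest; NMI only where available */
	return arch_perf_have_nmi() ? 4 : 3;
}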

2010-08-18 03:51:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] perf: Generalize some arch callchain code

On Tue, Aug 17, 2010 at 01:46:34PM +1000, Paul Mackerras wrote:
> On Mon, Aug 16, 2010 at 10:48:32PM +0200, Frederic Weisbecker wrote:
>
> > - Most archs use one callchain buffer per cpu, except x86 that needs
> > to deal with NMIs. Provide a default perf_callchain_buffer()
> > implementation that x86 overrides.
> >
> > - Centralize all the kernel/user regs handling and invoke new arch
> > handlers from there: perf_callchain_user() / perf_callchain_kernel()
> > That avoids all the user_mode() and current->mm checks and so on.
> >
> > - Invert some parameters in perf_callchain_*() helpers: entry to the
> > left, regs to the right, following the traditional (src, dst).
>
> I think you mean (dst, src), don't you?


Indeed yeah :)




> Apart from that, the patch looks OK, so
>
> Acked-by: Paul Mackerras <[email protected]>


Thanks a lot for your reviews and acks!

2010-08-18 03:53:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups

On Tue, Aug 17, 2010 at 06:58:20AM +0200, Borislav Petkov wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, Aug 16, 2010 at 04:48:29PM -0400
>
> Hey Frederic,
>
> > Built-tested on PowerPc, Arm and SuperH (can someone point me to a
> > ready to use sparc cross compiler, like those that can be found there
> > for other archs: http://www.codesourcery.com/sgpp/lite_edition.html ?)
>
> Have you by chance tried http://www.kernel.org/pub/tools/crosstool/
> already?


Nice, I'll give it a try.

Thanks :)

2010-08-18 03:55:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups

On Tue, Aug 17, 2010 at 11:32:39AM +0100, Will Deacon wrote:
> On Mon, 2010-08-16 at 21:48 +0100, Frederic Weisbecker wrote:
> > Hi,
> >
> Hi Frederic,
> >
> > This set factorizes a lot of callchain code and fixes concurrent
> > callchain buffers accesses.
>
> > You can test it by pulling that branch:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
> > perf/callchains
> >
>
> I've tested this on an ARM Cortex-A9 board and it all seems fine [plus
> the code is a lot cleaner!].
>
> Tested-by: Will Deacon <[email protected]>


Thanks a lot!

BTW, out of curiosity, do you have NMIs on ARM and do the hardware events
make use of them? Or maybe you use FIQ to simulate NMIs?

2010-08-18 09:10:05

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups

On Wed, 2010-08-18 at 04:55 +0100, Frederic Weisbecker wrote:
> On Tue, Aug 17, 2010 at 11:32:39AM +0100, Will Deacon wrote:
> > I've tested this on an ARM Cortex-A9 board and it all seems fine [plus
> > the code is a lot cleaner!].
> >
> > Tested-by: Will Deacon <[email protected]>
>
> Thanks a lot!

> BTW, out of curiosity, do you have NMIs on ARM and do the hardware events
> make use of them? Or maybe you use FIQ to simulate NMIs?
>

We don't have NMIs on ARM [so obviously we can't use them!] but
you're right to point out the FIQ. I've actually been thinking about
this during the past week, but there are the following problems:

(1) The FIQ isn't always wired up in the hardware, so you can't
assume that it is available.

(2) The FIQ can only have a single handler at a given time. This
is because it is a separate exception mode, with its own banked
registers. Consequently, we might not be able to use it if it's
being used for something else.

(3) The TrustZone security extensions may reserve the FIQ for secure
use only or make it available only via the secure monitor [which
will increase latency].

Of course, the advantage is that we could then use sample-based
profiling techniques in sections of code where the interrupts are
disabled.

The only way I can think of adding this is as a Kconfig option,
which, when selected, tries to use the FIQ and then falls back to
normal IRQs if it fails.

Cheers,

Will

2010-08-18 16:15:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/0 v3] callchain fixes and cleanups


* Will Deacon <[email protected]> wrote:

> On Wed, 2010-08-18 at 04:55 +0100, Frederic Weisbecker wrote:
> > On Tue, Aug 17, 2010 at 11:32:39AM +0100, Will Deacon wrote:
> > > I've tested this on an ARM Cortex-A9 board and it all seems fine [plus
> > > the code is a lot cleaner!].
> > >
> > > Tested-by: Will Deacon <[email protected]>
> >
> > Thanks a lot!
>
> > BTW, out of curiosity, do you have NMIs on ARM and do the hardware events
> > make use of them? Or maybe you use FIQ to simulate NMIs?
> >
>
> We don't have NMIs on ARM [so obviously we can't use them!] but you're right
> to point out the FIQ. I've actually been thinking about this during the past
> week, but there are the following problems:
>
> (1) The FIQ isn't always wired up in the hardware, so you can't
> assume that it is available.

We don't always have NMIs on x86 either - we fall back to hrtimers in that
case.

> (2) The FIQ can only have a single handler at a given time. This
> is because it is a separate exception mode, with its own banked
> registers. Consequently, we might not be able to use it if it's
> being used for something else.

Technically the NMI is only a single exception source on x86 as well. We
multiplex from there - if there are multiple users we call them using a
notifier chain.
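
As an aside, that multiplexing went through the die notifier chain at the time; a minimal sketch of how a consumer hooks into it (the handler name and priority here are made up) might be:

#include <linux/init.h>
#include <linux/kdebug.h>
#include <linux/notifier.h>

/* Illustrative NMI consumer: every handler registered on the die chain
 * gets called for each NMI and decides whether the event was its own. */
static int example_nmi_handler(struct notifier_block *self,
			       unsigned long cmd, void *data)
{
	if (cmd != DIE_NMI)
		return NOTIFY_DONE;	/* not an NMI, not for us */

	/* data is a struct die_args *; ->regs is the interrupted context */

	return NOTIFY_STOP;		/* claim this NMI */
}

static struct notifier_block example_nmi_nb = {
	.notifier_call	= example_nmi_handler,
	.priority	= 0,
};

static int __init example_nmi_init(void)
{
	return register_die_notifier(&example_nmi_nb);
}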

> (3) The TrustZone security extensions may reserve the FIQ for secure
> use only or make it available only via the secure monitor [which
> will increase latency].

As long as it can still be detected during PMU init and set up safely, it
should be OK.

> Of course, the advantage is that we could then use sample-based profiling
> techniques in sections of code where the interrupts are disabled.

Once you've tried NMI profiling you won't be going back - the difference is day
and night ;-)

Here's the profile of a scheduling-intense workload using a timer based
fallback path:

# Events: 586 cycles
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. ...........................
#
21.33% pipe-test-1m [kernel.kallsyms] [k] finish_task_switch
14.33% pipe-test-1m [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
4.61% pipe-test-1m [kernel.kallsyms] [k] avc_has_perm_noaudit
3.75% pipe-test-1m [kernel.kallsyms] [k] pipe_read
3.58% pipe-test-1m libc-2.12.so [.] __write_nocancel
3.58% pipe-test-1m [kernel.kallsyms] [k] copy_user_generic_string
3.41% pipe-test-1m libc-2.12.so [.] __GI___libc_read
3.07% pipe-test-1m [kernel.kallsyms] [k] system_call_after_swapgs
3.07% pipe-test-1m [kernel.kallsyms] [k] pipe_write
3.07% pipe-test-1m [kernel.kallsyms] [k] file_has_perm
2.22% pipe-test-1m pipe-test-1m [.] main
2.22% pipe-test-1m [kernel.kallsyms] [k] selinux_file_permission
1.88% pipe-test-1m [kernel.kallsyms] [k] rw_verify_area
1.88% pipe-test-1m [kernel.kallsyms] [k] fsnotify


# Events: 23K cycles
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. ...................................
#
7.14% pipe-test-1m [kernel.kallsyms] [k] __default_send_IPI_dest_field
4.34% pipe-test-1m [kernel.kallsyms] [k] schedule
4.27% pipe-test-1m [kernel.kallsyms] [k] __switch_to
3.88% pipe-test-1m [kernel.kallsyms] [k] pipe_read
3.57% pipe-test-1m [kernel.kallsyms] [k] switch_mm
3.45% pipe-test-1m [kernel.kallsyms] [k] file_has_perm
3.37% pipe-test-1m [kernel.kallsyms] [k] copy_user_generic_string
3.37% pipe-test-1m pipe-test-1m [.] main
3.20% pipe-test-1m [kernel.kallsyms] [k] avc_has_perm_noaudit
2.62% pipe-test-1m libc-2.12.so [.] __GI___libc_read
2.09% pipe-test-1m [kernel.kallsyms] [k] fsnotify
1.96% pipe-test-1m [kernel.kallsyms] [k] system_call
1.94% pipe-test-1m [kernel.kallsyms] [k] pipe_write
1.90% pipe-test-1m libc-2.12.so [.] __write_nocancel
1.88% pipe-test-1m [kernel.kallsyms] [k] mutex_lock
1.88% pipe-test-1m [kernel.kallsyms] [k] selinux_file_permission
1.78% pipe-test-1m [kernel.kallsyms] [k] mutex_unlock
1.66% pipe-test-1m [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.50% pipe-test-1m [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.23% pipe-test-1m [kernel.kallsyms] [k] vfs_read
1.19% pipe-test-1m [kernel.kallsyms] [k] do_sync_read
1.18% pipe-test-1m [kernel.kallsyms] [k] update_curr

The NMI output is an order of magnitude richer in information.

> The only way I can think of adding this is as a Kconfig option, which, when
> selected, tries to use the FIQ and then falls back to normal IRQs if it
> fails.

Dynamic detection and a fallback path should be perfectly OK. Kconfig options
have the disadvantage of doubling the test space and halving the tester base
(or worse).

Thanks,

Ingo