Put in counts to tell which ips belong to what context.
-----
| | hv
| --
nr | | kernel
| --
| | user
-----
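A consumer-side sketch of how these counts partition the ip[] array
(hypothetical helper, not part of this patch; assumes hv entries come
first, then kernel, then user, as in the diagram above):

#include <stdio.h>

/* userspace mirror of the structure this patch changes */
#define MAX_STACK_DEPTH 254

struct perf_callchain_entry {
        unsigned int nr, hv, kernel, user;      /* nr == hv + kernel + user */
        unsigned long long ip[MAX_STACK_DEPTH];
};

static void walk_callchain(const struct perf_callchain_entry *entry)
{
        unsigned int i = 0;

        for (; i < entry->hv; i++)
                printf("hv:     %#llx\n", entry->ip[i]);
        for (; i < entry->hv + entry->kernel; i++)
                printf("kernel: %#llx\n", entry->ip[i]);
        for (; i < entry->nr; i++)
                printf("user:   %#llx\n", entry->ip[i]);
}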
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/cpu/perf_counter.c | 9 +++++++++
include/linux/perf_counter.h | 4 ++--
kernel/perf_counter.c | 2 +-
3 files changed, 12 insertions(+), 3 deletions(-)
Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -1088,6 +1088,7 @@ perf_callchain_kernel(struct pt_regs *re
{
unsigned long bp;
char *stack;
+ int nr = entry->nr;
callchain_store(entry, instruction_pointer(regs));
@@ -1099,6 +1100,8 @@ perf_callchain_kernel(struct pt_regs *re
#endif
dump_trace(NULL, regs, (void *)stack, bp, &backtrace_ops, entry);
+
+ entry->kernel = entry->nr - nr;
}
@@ -1128,6 +1131,7 @@ perf_callchain_user(struct pt_regs *regs
{
struct stack_frame frame;
const void __user *fp;
+ int nr = entry->nr;
regs = (struct pt_regs *)current->thread.sp0 - 1;
fp = (void __user *)regs->bp;
@@ -1147,6 +1151,8 @@ perf_callchain_user(struct pt_regs *regs
callchain_store(entry, frame.return_address);
fp = frame.next_fp;
}
+
+ entry->user = entry->nr - nr;
}
static void
@@ -1182,6 +1188,9 @@ struct perf_callchain_entry *perf_callch
entry = &__get_cpu_var(irq_entry);
entry->nr = 0;
+ entry->hv = 0;
+ entry->kernel = 0;
+ entry->user = 0;
perf_do_callchain(regs, entry);
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -520,10 +520,10 @@ extern void perf_counter_mmap(unsigned l
extern void perf_counter_munmap(unsigned long addr, unsigned long len,
unsigned long pgoff, struct file *file);
-#define MAX_STACK_DEPTH 255
+#define MAX_STACK_DEPTH 254
struct perf_callchain_entry {
- u64 nr;
+ u32 nr, hv, kernel, user;
u64 ip[MAX_STACK_DEPTH];
};
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1830,7 +1830,7 @@ void perf_counter_output(struct perf_cou
callchain = perf_callchain(regs);
if (callchain) {
- callchain_size = (1 + callchain->nr) * sizeof(u64);
+ callchain_size = (2 + callchain->nr) * sizeof(u64);
header.type |= __PERF_EVENT_CALLCHAIN;
header.size += callchain_size;
--
* Peter Zijlstra <[email protected]> wrote:
> Put in counts to tell which ips belong to what context.
>
> -----
> | | hv
> | --
> nr | | kernel
> | --
> | | user
> -----
btw., i have an observation about the format:
> -#define MAX_STACK_DEPTH 255
> +#define MAX_STACK_DEPTH 254
>
> struct perf_callchain_entry {
> - u64 nr;
> + u32 nr, hv, kernel, user;
> u64 ip[MAX_STACK_DEPTH];
> };
For the special case of signal notifications, if the signal is
delivered immediately to the same task that raised it (pid=0), the
call chain is actually a still meaningful one: it is the stack that
is below the currently executing signal handler context.
Wouldn't it make sense to record the full stack frame for that case,
to allow walking/unwinding of the stack? Or can user-space do that
just fine, based on its own signal context?
We are going to hard-code the "call-chain is a series of IPs,
nothing else" model, and i'd like to make sure it's future-proof :)
Ingo
On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > Put in counts to tell which ips belong to what context.
> >
> > -----
> > | | hv
> > | --
> > nr | | kernel
> > | --
> > | | user
> > -----
>
> btw., i have an observation about the format:
>
> > -#define MAX_STACK_DEPTH 255
> > +#define MAX_STACK_DEPTH 254
> >
> > struct perf_callchain_entry {
> > - u64 nr;
> > + u32 nr, hv, kernel, user;
> > u64 ip[MAX_STACK_DEPTH];
> > };
>
> For the special case of signal notifications, if the signal is
> delivered immediately to the same task that raised it (pid=0), the
> call chain is actually a still meaningful one: it is the stack that
> is below the currently executing signal handler context.
>
> Wouldn't it make sense to record the full stack frame for that case,
> to allow walking/unwinding of the stack? Or can user-space do that
> just fine, based on its own signal context?
I think it can do that just fine or even better than we can -- userspace
having access to a full dwarf2 unwinder and such.
> We are going to hard-code the "call-chain is a series of IPs,
> nothing else" model, and i'd like to make sure it's future-proof :)
I think it should be, function return addresses are the primary piece of
information here.
On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> > -#define MAX_STACK_DEPTH 255
> > +#define MAX_STACK_DEPTH 254
> >
> > struct perf_callchain_entry {
> > - u64 nr;
> > + u32 nr, hv, kernel, user;
> > u64 ip[MAX_STACK_DEPTH];
> > };
Oh, and Paul suggested using u16s right after I sent it out. So I'll
either send an update or send an incremental in case you already applied
it.
Commit-ID: 13c47e37b33167d9784ffbb2c921f65279665dd7
Gitweb: http://git.kernel.org/tip/13c47e37b33167d9784ffbb2c921f65279665dd7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 2 Apr 2009 11:12:03 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 2 Apr 2009 13:53:00 +0200
perf_counter: add more context information
Put in counts to tell which ips belong to what context.
-----
| | hv
| --
nr | | kernel
| --
| | user
-----
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Corey Ashford <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_counter.c | 9 +++++++++
include/linux/perf_counter.h | 4 ++--
kernel/perf_counter.c | 2 +-
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index 2a946a1..c74e20d 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -1088,6 +1088,7 @@ perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
unsigned long bp;
char *stack;
+ int nr = entry->nr;
callchain_store(entry, instruction_pointer(regs));
@@ -1099,6 +1100,8 @@ perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
#endif
dump_trace(NULL, regs, (void *)stack, bp, &backtrace_ops, entry);
+
+ entry->kernel = entry->nr - nr;
}
@@ -1128,6 +1131,7 @@ perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
struct stack_frame frame;
const void __user *fp;
+ int nr = entry->nr;
regs = (struct pt_regs *)current->thread.sp0 - 1;
fp = (void __user *)regs->bp;
@@ -1147,6 +1151,8 @@ perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
callchain_store(entry, frame.return_address);
fp = frame.next_fp;
}
+
+ entry->user = entry->nr - nr;
}
static void
@@ -1182,6 +1188,9 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
entry = &__get_cpu_var(irq_entry);
entry->nr = 0;
+ entry->hv = 0;
+ entry->kernel = 0;
+ entry->user = 0;
perf_do_callchain(regs, entry);
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 5428ba1..90cce0c 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -513,10 +513,10 @@ extern void perf_counter_mmap(unsigned long addr, unsigned long len,
extern void perf_counter_munmap(unsigned long addr, unsigned long len,
unsigned long pgoff, struct file *file);
-#define MAX_STACK_DEPTH 255
+#define MAX_STACK_DEPTH 254
struct perf_callchain_entry {
- u64 nr;
+ u32 nr, hv, kernel, user;
u64 ip[MAX_STACK_DEPTH];
};
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 9bcab10..f105a6e 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1819,7 +1819,7 @@ void perf_counter_output(struct perf_counter *counter,
callchain = perf_callchain(regs);
if (callchain) {
- callchain_size = (1 + callchain->nr) * sizeof(u64);
+ callchain_size = (2 + callchain->nr) * sizeof(u64);
header.type |= __PERF_EVENT_CALLCHAIN;
header.size += callchain_size;
* Peter Zijlstra <[email protected]> wrote:
> On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Put in counts to tell which ips belong to what context.
> > >
> > > -----
> > > | | hv
> > > | --
> > > nr | | kernel
> > > | --
> > > | | user
> > > -----
> >
> > btw., i have an observation about the format:
> >
> > > -#define MAX_STACK_DEPTH 255
> > > +#define MAX_STACK_DEPTH 254
> > >
> > > struct perf_callchain_entry {
> > > - u64 nr;
> > > + u32 nr, hv, kernel, user;
> > > u64 ip[MAX_STACK_DEPTH];
> > > };
> >
> > For the special case of signal notifications, if the signal is
> > delivered immediately to the same task that raised it (pid=0), the
> > call chain is actually a still meaningful one: it is the stack that
> > is below the currently executing signal handler context.
> >
> > Wouldn't it make sense to record the full stack frame for that
> > case, to allow walking/unwinding of the stack? Or can user-space
> > do that just fine, based on its own signal context?
>
> I think it can do that just fine or even better than we can --
> userspace having access to a full dwarf2 unwinder and such.
eventually we'll have one in the kernel too, but yeah, user-space
can do this better. It will have precise details about the runtime
environment.
And any async mechanism has no chance to do anything useful with
stack frame info anyway - that stack frame might be long gone.
> > We are going to hard-code the "call-chain is a series of IPs,
> > nothing else" model, and i'd like to make sure it's future-proof
> > :)
>
> I think it should be, function return addresses are the primary
> piece of information here.
ok - good - just wanted to make sure :)
Ingo
* Peter Zijlstra <[email protected]> wrote:
> On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
>
> > > -#define MAX_STACK_DEPTH 255
> > > +#define MAX_STACK_DEPTH 254
> > >
> > > struct perf_callchain_entry {
> > > - u64 nr;
> > > + u32 nr, hv, kernel, user;
> > > u64 ip[MAX_STACK_DEPTH];
> > > };
>
> Oh, and Paul suggested using u16s right after I sent it out. So
> I'll either send an update or send an incremental in case you
> already applied it.
yes, that's probably a good idea. Although u8 might be even better -
do we ever want to do more than 256 deep stack vectors? Even those
would take quite some time to construct and pass down.
Ingo
On Thu, 2009-04-02 at 20:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> >
> > > > -#define MAX_STACK_DEPTH 255
> > > > +#define MAX_STACK_DEPTH 254
> > > >
> > > > struct perf_callchain_entry {
> > > > - u64 nr;
> > > > + u32 nr, hv, kernel, user;
> > > > u64 ip[MAX_STACK_DEPTH];
> > > > };
> >
> > Oh, and Paul suggested using u16s right after I sent it out. So
> > I'll either send an update or send an incremental in case you
> > already applied it.
>
> yes, that's probably a good idea. Although u8 might be even better -
> do we ever want to do more than 256 deep stack vectors? Even those
> would take quite some time to construct and pass down.
We'd have to pad it with 4 more bytes to remain u64 aligned; also, why
restrict ourselves? That MAX_STACK_DEPTH limit is trivially fixable if
indeed someone finds it insufficient.
* Peter Zijlstra <[email protected]> wrote:
> On Thu, 2009-04-02 at 20:18 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> > >
> > > > > -#define MAX_STACK_DEPTH 255
> > > > > +#define MAX_STACK_DEPTH 254
> > > > >
> > > > > struct perf_callchain_entry {
> > > > > - u64 nr;
> > > > > + u32 nr, hv, kernel, user;
> > > > > u64 ip[MAX_STACK_DEPTH];
> > > > > };
> > >
> > > Oh, and Paul suggested using u16s right after I sent it out. So
> > > I'll either send an update or send an incremental in case you
> > > already applied it.
> >
> > yes, that's probably a good idea. Although u8 might be even better -
> > do we ever want to do more than 256 deep stack vectors? Even those
> > would take quite some time to construct and pass down.
>
> We'd have to pad it with 4 more bytes to remain u64 aligned,
ok, indeed.
> [...] also, why restrict ourselves. That MAX_STACK_DEPTH limit is
> trivially fixable if indeed someone finds it insufficient.
well .. think about it: walking more than 256 stack frames for every
IRQ event? Getting backtraces like:
<func_0+0x123>
<func_1+0x123>
<func_2+0x123>
...
<func_268+0x123>
<func_269+0x123>
does that make much sense _per event_? How do you visualize it?
But yeah ... i could imagine some user-space craziness and since we
want to align to u64 i guess that pretty much settles it to u16.
Ingo
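For reference, the u16 variant settled on above packs all four counts
into a single u64, so ip[] stays naturally aligned with no padding -- a
sketch, not an applied patch:

struct perf_callchain_entry {
        u16 nr, hv, kernel, user;       /* 4 x u16 == one u64 */
        u64 ip[MAX_STACK_DEPTH];
};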
On Thu, 2009-04-02 at 20:34 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, 2009-04-02 at 20:18 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> > > >
> > > > > > -#define MAX_STACK_DEPTH 255
> > > > > > +#define MAX_STACK_DEPTH 254
> > > > > >
> > > > > > struct perf_callchain_entry {
> > > > > > - u64 nr;
> > > > > > + u32 nr, hv, kernel, user;
> > > > > > u64 ip[MAX_STACK_DEPTH];
> > > > > > };
> > > >
> > > > Oh, and Paul suggested using u16s right after I sent it out. So
> > > > I'll either send an update or send an incremental in case you
> > > > already applied it.
> > >
> > > yes, that's probably a good idea. Although u8 might be even better -
> > > do we ever want to do more than 256 deep stack vectors? Even those
> > > would take quite some time to construct and pass down.
> >
> > We'd have to pad it with 4 more bytes to remain u64 aligned,
>
> ok, indeed.
>
> > [...] also, why restrict ourselves. That MAX_STACK_DEPTH limit is
> > trivially fixable if indeed someone finds it insufficient.
>
> well .. think about it: walking more than 256 stack frames for every
> IRQ event? Getting backtraces like:
>
> <func_0+0x123>
...
> <func_269+0x123>
>
> does that make much sense _per event_? How do you visualize it?
You can use it to calculate aggregate times. E.g. attribute the time
spent in func_0 to func_1 to func_2, etc. And use a tree view based on
these call-chains, allowing you to drill-down -- which is basically what
the sysprof GUI does.
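A minimal sketch of that kind of aggregation (hypothetical types and
names, not sysprof's actual code; assumes chains are fed in
outermost-frame-first order):

#include <stdlib.h>

struct chain_node {
        unsigned long long ip;
        unsigned long hits;             /* samples through this frame */
        struct chain_node *child, *sibling;
};

static struct chain_node *find_or_add(struct chain_node *parent,
                                      unsigned long long ip)
{
        struct chain_node *n;

        for (n = parent->child; n; n = n->sibling)
                if (n->ip == ip)
                        return n;

        n = calloc(1, sizeof(*n));
        if (!n)
                abort();
        n->ip = ip;
        n->sibling = parent->child;
        parent->child = n;
        return n;
}

/* account one sampled chain; walking the tree gives the drill-down view */
static void account_chain(struct chain_node *root,
                          const unsigned long long *ip, unsigned int nr)
{
        unsigned int i;

        for (i = 0; i < nr; i++) {
                root = find_or_add(root, ip[i]);
                root->hits++;
        }
}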
* Peter Zijlstra <[email protected]> wrote:
> On Thu, 2009-04-02 at 20:34 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Thu, 2009-04-02 at 20:18 +0200, Ingo Molnar wrote:
> > > > * Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > > On Thu, 2009-04-02 at 13:36 +0200, Ingo Molnar wrote:
> > > > >
> > > > > > > -#define MAX_STACK_DEPTH 255
> > > > > > > +#define MAX_STACK_DEPTH 254
> > > > > > >
> > > > > > > struct perf_callchain_entry {
> > > > > > > - u64 nr;
> > > > > > > + u32 nr, hv, kernel, user;
> > > > > > > u64 ip[MAX_STACK_DEPTH];
> > > > > > > };
> > > > >
> > > > > Oh, and Paul suggested using u16s right after I sent it out. So
> > > > > I'll either send an update or send an incremental in case you
> > > > > already applied it.
> > > >
> > > > yes, that's probably a good idea. Although u8 might be even better -
> > > > do we ever want to do more than 256 deep stack vectors? Even those
> > > > would take quite some time to construct and pass down.
> > >
> > > We'd have to pad it with 4 more bytes to remain u64 aligned,
> >
> > ok, indeed.
> >
> > > [...] also, why restrict ourselves. That MAX_STACK_DEPTH limit is
> > > trivially fixable if indeed someone finds it insufficient.
> >
> > well .. think about it: walking more than 256 stack frames for every
> > IRQ event? Getting backtraces like:
> >
> > <func_0+0x123>
> ...
> > <func_269+0x123>
> >
> > does that make much sense _per event_? How do you visualize it?
>
> You can use it to calculate aggregate times. E.g. attribute the
> time spent in func_0 to func_1 to func_2, etc. And use a tree view
> based on these call-chains, allowing you to drill-down -- which is
> basically what the sysprof GUI does.
yeah - but at a depth of more than 256?
(and who'd ever use more than 640K RAM anyway ;-)
Ingo
On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
> plain text document attachment (perf_counter_callchain_context.patch)
> Put in counts to tell which ips belong to what context.
>
> -----
> | | hv
> | --
> nr | | kernel
> | --
> | | user
> -----
Right, just realized that PERF_RECORD_IP needs something similar if one
is not able to derive the context from the IP itself..
Peter Zijlstra wrote:
> On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
>> plain text document attachment (perf_counter_callchain_context.patch)
>> Put in counts to tell which ips belong to what context.
>>
>> -----
>> | | hv
>> | --
>> nr | | kernel
>> | --
>> | | user
>> -----
>
> Right, just realized that PERF_RECORD_IP needs something similar if one
> is not able to derive the context from the IP itself..
>
Three individual bits would suffice, or you could use a two-bit code -
00 = user
01 = kernel
10 = hypervisor
11 = reserved (or perhaps unknown)
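In C, that two-bit code might read (a sketch with hypothetical names):

enum perf_ip_level {
        PERF_IP_LEVEL_USER      = 0,
        PERF_IP_LEVEL_KERNEL    = 1,
        PERF_IP_LEVEL_HV        = 2,
        PERF_IP_LEVEL_RESERVED  = 3,    /* or perhaps "unknown" */
};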
Unfortunately, because of alignment, it would need to take up another 64
bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
a machine independent way.
And since you probably need a separate word, that effectively doubles
the amount of space taken up by IP samples (if we add a "no event
header" option). Should we add another bit in the record_type field -
PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
get this if they don't need it?
Regards,
- Corey
On Fri, 2009-04-03 at 11:25 -0700, Corey Ashford wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
> >> plain text document attachment (perf_counter_callchain_context.patch)
> >> Put in counts to tell which ips belong to what context.
> >>
> >> -----
> >> | | hv
> >> | --
> >> nr | | kernel
> >> | --
> >> | | user
> >> -----
> >
> > Right, just realized that PERF_RECORD_IP needs something similar if one
> > is not able to derive the context from the IP itself..
> >
> Three individual bits would suffice, or you could use a two-bit code -
> 00 = user
> 01 = kernel
> 10 = hypervisor
> 11 = reserved (or perhaps unknown)
>
> Unfortunately, because of alignment, it would need to take up another 64
> bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
> a machine independent way.
>
> And since you probably need a separate word, that effectively doubles
> the amount of space taken up by IP samples (if we add a "no event
> header" option). Should we add another bit in the record_type field -
> PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
> get this if they don't need it?
If we limit the event size to 64k (surely enough, right? :-), then we
have 16 more bits to play with in the header, and we could do something
like the below.
A further possibility would also be to add an overflow bit in there,
making the full 32bit PERF_RECORD space available to output events as
well.
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -201,9 +201,17 @@ struct perf_counter_mmap_page {
__u32 data_head; /* head in the data section */
};
+enum {
+ PERF_EVENT_LEVEL_HV = 0,
+ PERF_EVENT_LEVEL_KERNEL = 1,
+ PERF_EVENT_LEVEL_USER = 2,
+};
+
struct perf_event_header {
__u32 type;
- __u32 size;
+ __u16 level : 2,
+ __reserved : 14;
+ __u16 size;
};
enum perf_event_type {
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1832,6 +1832,8 @@ static void perf_counter_output(struct p
header.type = PERF_EVENT_COUNTER_OVERFLOW;
header.size = sizeof(header);
+ header.level = user_mode(regs) ?
+ PERF_EVENT_LEVEL_USER : PERF_EVENT_LEVEL_KERNEL;
if (record_type & PERF_RECORD_IP) {
ip = instruction_pointer(regs);
On Mon, 2009-04-06 at 13:01 +0200, Peter Zijlstra wrote:
> On Fri, 2009-04-03 at 11:25 -0700, Corey Ashford wrote:
> > Peter Zijlstra wrote:
> > > On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
> > >> plain text document attachment (perf_counter_callchain_context.patch)
> > >> Put in counts to tell which ips belong to what context.
> > >>
> > >> -----
> > >> | | hv
> > >> | --
> > >> nr | | kernel
> > >> | --
> > >> | | user
> > >> -----
> > >
> > > Right, just realized that PERF_RECORD_IP needs something similar if one
> > > is not able to derive the context from the IP itself..
> > >
> > Three individual bits would suffice, or you could use a two-bit code -
> > 00 = user
> > 01 = kernel
> > 10 = hypervisor
> > 11 = reserved (or perhaps unknown)
> >
> > Unfortunately, because of alignment, it would need to take up another 64
> > bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
> > a machine independent way.
> >
> > And since you probably need a separate word, that effectively doubles
> > the amount of space taken up by IP samples (if we add a "no event
> > header" option). Should we add another bit in the record_type field -
> > PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
> > get this if they don't need it?
>
> If we limit the event size to 64k (surely enough, right? :-), then we
> have 16 more bits to play with in the header, and we could do something
> like the below.
>
> A further possibility would also be to add an overflow bit in there,
> making the full 32bit PERF_RECORD space available to output events as
> well.
>
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -201,9 +201,17 @@ struct perf_counter_mmap_page {
> __u32 data_head; /* head in the data section */
> };
>
> +enum {
> + PERF_EVENT_LEVEL_HV = 0,
> + PERF_EVENT_LEVEL_KERNEL = 1,
> + PERF_EVENT_LEVEL_USER = 2,
> +};
> +
> struct perf_event_header {
> __u32 type;
> - __u32 size;
> + __u16 level : 2,
> + __reserved : 14;
> + __u16 size;
> };
Except we should probably use masks again instead of bitfields so that
the thing is portable when streamed to disk, such as would be common
with splice().
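A mask-based variant of that header might look like this (a sketch, not
an applied patch; the names are illustrative):

#define PERF_EVENT_MISC_LEVEL_SHIFT     0
#define PERF_EVENT_MISC_LEVEL_MASK      (3 << PERF_EVENT_MISC_LEVEL_SHIFT)

struct perf_event_header {
        __u32   type;
        __u16   misc;           /* level in bits 0-1, rest reserved */
        __u16   size;
};

/* writer */
header.misc |= (level << PERF_EVENT_MISC_LEVEL_SHIFT) &
                PERF_EVENT_MISC_LEVEL_MASK;

/* reader: layout is well-defined regardless of compiler bitfield order */
level = (header.misc & PERF_EVENT_MISC_LEVEL_MASK) >>
                PERF_EVENT_MISC_LEVEL_SHIFT;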
Peter Zijlstra wrote:
> On Mon, 2009-04-06 at 13:01 +0200, Peter Zijlstra wrote:
>> On Fri, 2009-04-03 at 11:25 -0700, Corey Ashford wrote:
>>> Peter Zijlstra wrote:
>>>> On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
>>>>> plain text document attachment (perf_counter_callchain_context.patch)
>>>>> Put in counts to tell which ips belong to what context.
>>>>>
>>>>> -----
>>>>> | | hv
>>>>> | --
>>>>> nr | | kernel
>>>>> | --
>>>>> | | user
>>>>> -----
>>>> Right, just realized that PERF_RECORD_IP needs something similar if one
>>>> is not able to derive the context from the IP itself..
>>>>
>>> Three individual bits would suffice, or you could use a two-bit code -
>>> 00 = user
>>> 01 = kernel
>>> 10 = hypervisor
>>> 11 = reserved (or perhaps unknown)
>>>
>>> Unfortunately, because of alignment, it would need to take up another 64
>>> bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
>>> a machine independent way.
>>>
>>> And since you probably need a separate word, that effectively doubles
>>> the amount of space taken up by IP samples (if we add a "no event
>>> header" option). Should we add another bit in the record_type field -
>>> PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
>>> get this if they don't need it?
>> If we limit the event size to 64k (surely enough, right? :-), then we
>> have 16 more bits to play with in the header, and we could do something
>> like the below.
>>
>> A further possibility would also be to add an overflow bit in there,
>> making the full 32bit PERF_RECORD space available to output events as
>> well.
>>
>> Index: linux-2.6/include/linux/perf_counter.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/perf_counter.h
>> +++ linux-2.6/include/linux/perf_counter.h
>> @@ -201,9 +201,17 @@ struct perf_counter_mmap_page {
>> __u32 data_head; /* head in the data section */
>> };
>>
>> +enum {
>> + PERF_EVENT_LEVEL_HV = 0,
>> + PERF_EVENT_LEVEL_KERNEL = 1,
>> + PERF_EVENT_LEVEL_USER = 2,
>> +};
>> +
>> struct perf_event_header {
>> __u32 type;
>> - __u32 size;
>> + __u16 level : 2,
>> + __reserved : 14;
>> + __u16 size;
>> };
>
> Except we should probably use masks again instead of bitfields so that
> the thing is portable when streamed to disk, such as would be common
> with splice().
One downside of this approach is that if you specify "no header"
(currently not possible, but maybe later?), you will not be able to get
the level bits.
How about adding an optional, 64-bit "miscellaneous" word to the event
record which could contain a number of small bit fields, any or all of
which could be enabled with a PERF_RECORD_* bit. If one or more of the
miscellaneous PERF_RECORD_* bits are set, this assembled word
would be added to the record. So the space cost of the level field goes
down as we add more small fields that need to be recorded.
Something like:
PERF_RECORD_LEVEL = 1U << 4,
PERF_RECORD_INTR_DEPTH = 1U << 5,
PERF_RECORD_STUFF = 1U << 6,
...
#define __PERF_MISC_MASK(name) \
(((1ULL << PERF_MISC_##name##_BITS) - 1) << \
PERF_MISC_##name##_SHIFT)
#define PERF_MISC_LEVEL_BITS 2
#define PERF_MISC_LEVEL_SHIFT 0
#define PERF_MISC_LEVEL_MASK __PERF_MISC_MASK(LEVEL)
#define PERF_MISC_INTR_DEPTH_BITS 8
#define PERF_MISC_INTR_DEPTH_SHIFT 2
#define PERF_MISC_INTR_DEPTH_MASK __PERF_MISC_MASK(INTR_DEPTH)
etc.
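Decoding such a word would then be straightforward (a sketch, using the
hypothetical names above):

static unsigned int misc_level(__u64 misc)
{
        return (misc & PERF_MISC_LEVEL_MASK) >> PERF_MISC_LEVEL_SHIFT;
}

static unsigned int misc_intr_depth(__u64 misc)
{
        return (misc & PERF_MISC_INTR_DEPTH_MASK) >>
                PERF_MISC_INTR_DEPTH_SHIFT;
}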
Regards,
- Corey
Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]
On Mon, 2009-04-06 at 11:53 -0700, Corey Ashford wrote:
>
> Peter Zijlstra wrote:
> > On Mon, 2009-04-06 at 13:01 +0200, Peter Zijlstra wrote:
> >> On Fri, 2009-04-03 at 11:25 -0700, Corey Ashford wrote:
> >>> Peter Zijlstra wrote:
> >>>> On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
> >>>>> plain text document attachment (perf_counter_callchain_context.patch)
> >>>>> Put in counts to tell which ips belong to what context.
> >>>>>
> >>>>> -----
> >>>>> | | hv
> >>>>> | --
> >>>>> nr | | kernel
> >>>>> | --
> >>>>> | | user
> >>>>> -----
> >>>> Right, just realized that PERF_RECORD_IP needs something similar if one
> >>>> is not able to derive the context from the IP itself..
> >>>>
> >>> Three individual bits would suffice, or you could use a two-bit code -
> >>> 00 = user
> >>> 01 = kernel
> >>> 10 = hypervisor
> >>> 11 = reserved (or perhaps unknown)
> >>>
> >>> Unfortunately, because of alignment, it would need to take up another 64
> >>> bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
> >>> a machine independent way.
> >>>
> >>> And since you probably need a separate word, that effectively doubles
> >>> the amount of space taken up by IP samples (if we add a "no event
> >>> header" option). Should we add another bit in the record_type field -
> >>> PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
> >>> get this if they don't need it?
> >> If we limit the event size to 64k (surely enough, right? :-), then we
> >> have 16 more bits to play with in the header, and we could do something
> >> like the below.
> >>
> >> A further possibility would also be to add an overflow bit in there,
> >> making the full 32bit PERF_RECORD space available to output events as
> >> well.
> >>
> >> Index: linux-2.6/include/linux/perf_counter.h
> >> ===================================================================
> >> --- linux-2.6.orig/include/linux/perf_counter.h
> >> +++ linux-2.6/include/linux/perf_counter.h
> >> @@ -201,9 +201,17 @@ struct perf_counter_mmap_page {
> >> __u32 data_head; /* head in the data section */
> >> };
> >>
> >> +enum {
> >> + PERF_EVENT_LEVEL_HV = 0,
> >> + PERF_EVENT_LEVEL_KERNEL = 1,
> >> + PERF_EVENT_LEVEL_USER = 2,
> >> +};
> >> +
> >> struct perf_event_header {
> >> __u32 type;
> >> - __u32 size;
> >> + __u16 level : 2,
> >> + __reserved : 14;
> >> + __u16 size;
> >> };
> >
> > Except we should probably use masks again instead of bitfields so that
> > the thing is portable when streamed to disk, such as would be common
> > with splice().
>
>> One downside of this approach is that if you specify "no header"
> (currently not possible, but maybe later?), you will not be able to get
> the level bits.
Would this be desirable? I know we've mentioned it before, but it would
mean one cannot mix various event types (currently that means !mmap and
callchain with difficulty).
As long as we mandate this header, we can have 16 misc bits.
> How about adding an optional, 64-bit "miscellaneous" word to the event
> record which could contain a number of small bit fields, any or all of
> which could be enabled with a PERF_RECORD_* bit. If one or more of the
> miscellaneous PERF_RECORD_* bits are set to enable, this assembled word
> would be added to the record. So the space cost of the level field goes
> down as we add more small fields that need to be recorded.
>
> Something like:
>
> PERF_RECORD_LEVEL = 1U << 4,
> PERF_RECORD_INTR_DEPTH = 1U << 5,
> PERF_RECORD_STUFF = 1U << 6,
> ...
>
> #define __PERF_MISC_MASK(name) \
> (((1ULL << PERF_MISC_##name##_BITS) - 1) << \
> PERF_MISC_##name##_SHIFT)
>
> #define PERF_MISC_LEVEL_BITS 2
> #define PERF_MISC_LEVEL_SHIFT 0
> #define PERF_MISC_LEVEL_MASK __PERF_MISC_MASK(LEVEL)
>
> #define PERF_MISC_INTR_DEPTH_BITS 8
> #define PERF_MISC_INTR_DEPTH_SHIFT 2
> #define PERF_MISC_INTR_DEPTH_MASK __PERF_MISC_MASK(INTR_DEPTH)
Yeah, that's the alternative.
Peter Zijlstra wrote:
> On Mon, 2009-04-06 at 11:53 -0700, Corey Ashford wrote:
>> Peter Zijlstra wrote:
>>> On Mon, 2009-04-06 at 13:01 +0200, Peter Zijlstra wrote:
>>>> On Fri, 2009-04-03 at 11:25 -0700, Corey Ashford wrote:
>>>>> Peter Zijlstra wrote:
>>>>>> On Thu, 2009-04-02 at 11:12 +0200, Peter Zijlstra wrote:
>>>>>>> plain text document attachment (perf_counter_callchain_context.patch)
>>>>>>> Put in counts to tell which ips belong to what context.
>>>>>>>
>>>>>>> -----
>>>>>>> | | hv
>>>>>>> | --
>>>>>>> nr | | kernel
>>>>>>> | --
>>>>>>> | | user
>>>>>>> -----
>>>>>> Right, just realized that PERF_RECORD_IP needs something similar if one
>>>>>> is not able to derive the context from the IP itself..
>>>>>>
>>>>> Three individual bits would suffice, or you could use a two-bit code -
>>>>> 00 = user
>>>>> 01 = kernel
>>>>> 10 = hypervisor
>>>>> 11 = reserved (or perhaps unknown)
>>>>>
>>>>> Unfortunately, because of alignment, it would need to take up another 64
>>>>> bit word, wouldn't it? Too bad you cannot sneak the bits into the IP in
>>>>> a machine independent way.
>>>>>
>>>>> And since you probably need a separate word, that effectively doubles
>>>>> the amount of space taken up by IP samples (if we add a "no event
>>>>> header" option). Should we add another bit in the record_type field -
>>>>> PERF_RECORD_IP_LEVEL (or similar) so that user-space apps don't have to
>>>>> get this if they don't need it?
>>>> If we limit the event size to 64k (surely enough, right? :-), then we
>>>> have 16 more bits to play with in the header, and we could do something
>>>> like the below.
>>>>
>>>> A further possibility would also be to add an overflow bit in there,
>>>> making the full 32bit PERF_RECORD space available to output events as
>>>> well.
>>>>
>>>> Index: linux-2.6/include/linux/perf_counter.h
>>>> ===================================================================
>>>> --- linux-2.6.orig/include/linux/perf_counter.h
>>>> +++ linux-2.6/include/linux/perf_counter.h
>>>> @@ -201,9 +201,17 @@ struct perf_counter_mmap_page {
>>>> __u32 data_head; /* head in the data section */
>>>> };
>>>>
>>>> +enum {
>>>> + PERF_EVENT_LEVEL_HV = 0,
>>>> + PERF_EVENT_LEVEL_KERNEL = 1,
>>>> + PERF_EVENT_LEVEL_USER = 2,
>>>> +};
>>>> +
>>>> struct perf_event_header {
>>>> __u32 type;
>>>> - __u32 size;
>>>> + __u16 level : 2,
>>>> + __reserved : 14;
>>>> + __u16 size;
>>>> };
>>> Except we should probably use masks again instead of bitfields so that
>>> the thing is portable when streamed to disk, such as would be common
>>> with splice().
>> One downside of this approach is that you if you specify "no header"
>> (currently not possible, but maybe later?), you will not be able to get
>> the level bits.
>
> Would this be desirable? I know we've mentioned it before, but it would
> mean one cannot mix various event types (currently that means !mmap and
> callchain with difficulty).
I think it would. For one use case I'm working on right now, simple
profiling, all I need are ip's. If I could omit the header, that would
reduce the frequency of sigio's by a factor of three, and make it faster
to read up the ip's when the SIGIO's occur.
I realize that it makes it impossible to mix record types with the
header removed, and makes skipping over the call chain data a bit more
difficult (but not rocket science).
It could be made an error for the caller to specify both "no header" and
perf_counter_hw_event.mmap|munmap.
>
> As long as we mandate this header, we can have 16 misc bits.
>
True.
- Corey
On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
> >> One downside of this approach is that if you specify "no header"
> >> (currently not possible, but maybe later?), you will not be able to get
> >> the level bits.
> >
> > Would this be desirable?
> I think it would. For one use case I'm working on right now, simple
> profiling, all I need are ip's. If I could omit the header, that would
> reduce the frequency of sigio's by a factor of three, and make it faster
> to read up the ip's when the SIGIO's occur.
Self-profiling?
So you're interested in getting the smallest possible record size, that
would still be 2 u64, right? Otherwise you don't get the IP context that
started this.
Peter Zijlstra wrote:
> On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
>
>>>> One downside of this approach is that if you specify "no header"
>>>> (currently not possible, but maybe later?), you will not be able to get
>>>> the level bits.
>>> Would this be desirable?
>
>> I think it would. For one use case I'm working on right now, simple
>> profiling, all I need are ip's. If I could omit the header, that would
>> reduce the frequency of sigio's by a factor of three, and make it faster
>> to read up the ip's when the SIGIO's occur.
>
> Self-profiling?
>
> So you're interested in getting the smallest possible record size, that
> would still be 2 u64, right? Otherwise you don't get the IP context that
> started this.
>
>
Self-profiling mainly, yes. PAPI specs an ability for remote monitoring
of processes and threads, but I think it's only partially implemented.
So when you are talking about IP context, you mean pid/tid?
Regards,
- Corey
Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]
On Mon, 2009-04-06 at 14:15 -0700, Corey Ashford wrote:
>
> Peter Zijlstra wrote:
> > On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
> >
> >>>> One downside of this approach is that if you specify "no header"
> >>>> (currently not possible, but maybe later?), you will not be able to get
> >>>> the level bits.
> >>> Would this be desirable?
> >
> >> I think it would. For one use case I'm working on right now, simple
> >> profiling, all I need are ip's. If I could omit the header, that would
> >> reduce the frequency of sigio's by a factor of three, and make it faster
> >> to read up the ip's when the SIGIO's occur.
> >
> > Self-profiling?
> >
> > So you're interested in getting the smallest possible record size, that
> > would still be 2 u64, right? Otherwise you don't get the IP context that
> > started this.
> >
> >
>
> Self-profiling mainly, yes. PAPI specs an ability for remote monitoring
> of processes and threads, but I think it's only partially implemented.
>
> So when you are talking about IP context, you mean pid/tid?
Ah, we called it level before, the hv/kernel/user thing. For remote
profiling you'd want to have the mmap thing too.
Peter Zijlstra wrote:
> On Mon, 2009-04-06 at 14:15 -0700, Corey Ashford wrote:
>> Peter Zijlstra wrote:
>>> On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
>>>
>>>>>> One downside of this approach is that if you specify "no header"
>>>>>> (currently not possible, but maybe later?), you will not be able to get
>>>>>> the level bits.
>>>>> Would this be desirable?
>>>> I think it would. For one use case I'm working on right now, simple
>>>> profiling, all I need are ip's. If I could omit the header, that would
>>>> reduce the frequency of sigio's by a factor of three, and make it faster
>>>> to read up the ip's when the SIGIO's occur.
>>> Self-profiling?
>>>
>>> So you're interested in getting the smallest possible record size, that
>>> would still be 2 u64, right? Otherwise you don't get the IP context that
>>> started this.
>>>
>>>
>> Self-profiling mainly, yes. PAPI specs an ability for remote monitoring
>> of processes and threads, but I think it's only partially implemented.
>>
>> So when you are talking about IP context, you mean pid/tid?
>
> Ah, we called it level before, the hv/kernel/user thing. For remote
> profiling you'd want to have the mmap thing too.
Oh I see. In PAPI, the user specifies the range(s) of addresses he's
interested in profiling (any sampled IP's outside the requested ranges
are discarded), and so as long as the kernel space IP's don't overlap
with user space IP's, we should be fine.
- Corey
On Mon, 2009-04-06 at 14:33 -0700, Corey Ashford wrote:
>
> Peter Zijlstra wrote:
> > On Mon, 2009-04-06 at 14:15 -0700, Corey Ashford wrote:
> >> Peter Zijlstra wrote:
> >>> On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
> >>>
> >>>>>> One downside of this approach is that if you specify "no header"
> >>>>>> (currently not possible, but maybe later?), you will not be able to get
> >>>>>> the level bits.
> >>>>> Would this be desirable?
> >>>> I think it would. For one use case I'm working on right now, simple
> >>>> profiling, all I need are ip's. If I could omit the header, that would
> >>>> reduce the frequency of sigio's by a factor of three, and make it faster
> >>>> to read up the ip's when the SIGIO's occur.
> >>> Self-profiling?
> >>>
> >>> So you're interested in getting the smallest possible record size, that
> >>> would still be 2 u64, right? Otherwise you don't get the IP context that
> >>> started this.
> >>>
> >>>
> >> Self-profiling mainly, yes. PAPI specs an ability for remote monitoring
> >> of processes and threads, but I think it's only partially implemented.
> >>
> >> So when you are talking about IP context, you mean pid/tid?
> >
> > Ah, we called it level before, the hv/kernel/user thing. For remote
> > profiling you'd want to have the mmap thing too.
>
> Oh I see. In PAPI, the user specifies the range(s) of addresses he's
> interested in profiling (any sampled IP's outside the requested ranges
> are discarded), and so as long as the kernel space IP's don't overlap
> with user space IP's, we should be fine.
Ah, while this would be true for most 'sane' architectures, Paul was
right in pointing out that this is not true for all architectures -- and
we should therefore not rely on address range alone.
You could of course use: hw_event.exclude_{hv,kernel} = 1 to ensure you
only get userspace thingies I suppose (but then you have no way of
telling how many you missed I guess).
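In code, that suggestion amounts to something like the following sketch
(the exclude_* bits are the ones named above):

struct perf_counter_hw_event hw_event = { 0 };

hw_event.exclude_hv     = 1;    /* drop hypervisor samples */
hw_event.exclude_kernel = 1;    /* drop kernel samples */
/* every recorded ip is now a userspace address by construction */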
Peter Zijlstra wrote:
> On Mon, 2009-04-06 at 14:33 -0700, Corey Ashford wrote:
>> Peter Zijlstra wrote:
>>> On Mon, 2009-04-06 at 14:15 -0700, Corey Ashford wrote:
>>>> Peter Zijlstra wrote:
>>>>> On Mon, 2009-04-06 at 13:16 -0700, Corey Ashford wrote:
>>>>>
>>>>>>>> One downside of this approach is that if you specify "no header"
>>>>>>>> (currently not possible, but maybe later?), you will not be able to get
>>>>>>>> the level bits.
>>>>>>> Would this be desirable?
>>>>>> I think it would. For one use case I'm working on right now, simple
>>>>>> profiling, all I need are ip's. If I could omit the header, that would
>>>>>> reduce the frequency of sigio's by a factor of three, and make it faster
>>>>>> to read up the ip's when the SIGIO's occur.
>>>>> Self-profiling?
>>>>>
>>>>> So you're interested in getting the smallest possible record size, that
>>>>> would still be 2 u64, right? Otherwise you don't get the IP context that
>>>>> started this.
>>>>>
>>>>>
>>>> Self-profiling mainly, yes. PAPI specs an ability for remote monitoring
>>>> of processes and threads, but I think it's only partially implemented.
>>>>
>>>> So when you are talking about IP context, you mean pid/tid?
>>> Ah, we called it level before, the hv/kernel/user thing. For remote
>>> profiling you'd want to have the mmap thing too.
>> Oh I see. In PAPI, the user specifies the range(s) of addresses he's
>> interested in profiling (any sampled IP's outside the requested ranges
>> are discarded), and so as long as the kernel space IP's don't overlap
>> with user space IP's, we should be fine.
>
> Ah, while this would be true for most 'sane' architectures, Paul was
> right in pointing out that this is not true for all architectures -- and
> we should therefore not rely on address range alone.
>
> You could of course use: hw_event.exclude_{hv,kernel} = 1 to ensure you
> only get userspace thingies I suppose (but then you have no way of
> telling how many you missed I guess).
That's a good point. PAPI's profiling API doesn't have a way for the
caller to distinguish which address spaces (user/kernel/hv) he wants to
profile. It does have a way to designate which levels to ignore, but if
you enable them all, you cannot specify the profiling address ranges
pertaining to each. That may be something I could propose adding to
PAPI. I suspect it would be pretty rarely used, though.
- Corey