Roland, Andi,
I would like to discuss the ptrace user interface for the BTS extension.
In previous emails,
Andi suggested a stream-like interface, but is also OK with an
array-like interface (as far as I understood).
Roland is dubious about the ptrace API additions.
I would like to settle the discussion and find an interface that
everybody can agree to, so I can implement that interface and we can
move forward with the patch.
Here's the link to the original patch:
http://lkml.org/lkml/2007/12/5/234.
Here are the facts:
- we need to provide access to an array (cyclic buffer) of BTS records
- the array can be quite big
- the most interesting part is the tail
- a BTS record can either describe a branch (from, to address)
or a scheduling event (task arrives/departs at timestamp)
Let's look at the entire array, first. I see the following alternatives:
1. get the entire array in one command
+ simple interface, like GET<arch-specific>REGS
- a lot of (redundant) copying
2. array-like commands (get size, read element at index)
+ allows precise reads; minimizes copying
3. stream-like commands (read, maybe seek) [read from back to front]
+ favors most expected use cases
- makes other uses much harder (e.g. read from front to back)
- harder to get the semantics right and intuitive
(when to reset read pointer? e.g. when stepping between two reads)
Alternatives 1 and 3 require a reordering to turn the cyclic buffer into
a sequential array or stream.
Alternative 2 would benefit from that, as well.
When we reorder the array, the best order would be from back to front,
so users can start reading the most interesting part first, and stop
when they read enough.
I would recommend alternative 2. Number 1 may result in too much
copying, and number 3 is better done in user space; the kernel API
should be more flexible and not favor a single use case.
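To make alternative 2 concrete, a tracer could use it roughly as
follows. This is only a sketch: the request names and the consume()
helper are placeholders, not the actual API of the patch.

    /* Hypothetical array-like commands; index 0 would be the most
     * recent record, so the loop reads from back to front. */
    long size = ptrace(PTRACE_BTS_GET_SIZE, pid, NULL, NULL);
    long i;

    for (i = 0; i < size; i++) {
            struct bts_struct record;

            if (ptrace(PTRACE_BTS_READ_RECORD, pid,
                       (void *)i, &record) < 0)
                    break;
            if (consume(&record))
                    break;  /* the user has read enough; stop early */
    }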
Let's look at the array size, next.
1. pre-defined array size
+ simplest, no extra command
- one size will not fit all users
2. user-defined array size
+ most flexible for the user
(need to set a system limit to restrain greedy users)
I would recommend alternative 2. A good citizen will only ask for the
space he needs.
In the ideal case, the system limit would be variable (as Andi
suggested).
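As a sketch of how such a limit check could look in the kernel, if we
were to borrow the locked-memory rlimit as the bound (illustrative
only; the actual limit policy is exactly what is under discussion):

    static int bts_check_buffer_size(struct task_struct *task, size_t size)
    {
            unsigned long limit =
                    task->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;

            return size > limit ? -EPERM : 0;
    }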
Let's look at the array contents. Currently, we have 3 different record
types.
1. self-describing union
+ most extensible
+ allows single bts array
- may waste (user-space) memory
2. separate fixed-type arrays
+ get command defines interpretation
- need additional effort to describe relative order between array
elements
- extension requires new set of access commands
I would recommend alternative 1. It is most flexible and most easily
extensible. And it is easier to use.
What extensions do we expect in the future?
1. more architectures
2. additional data
Regarding 2, a union would easily allow us to add additional data, at
the cost of a few wasted bytes if the data is not evenly sized. A user
may look at the qualifier and either ignore records he does not
understand, or bail out.
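As an illustration, the self-describing union could look roughly like
this; the qualifier values, field names, and the handle_*() helpers are
made up for the example:

    enum bts_qualifier {
            BTS_BRANCH,             /* a from/to branch */
            BTS_TASK_ARRIVES,       /* scheduling event at a timestamp */
            BTS_TASK_DEPARTS
    };

    struct bts_struct {
            enum bts_qualifier qualifier;
            union {
                    struct { unsigned long from, to; } lbr;
                    unsigned long long timestamp;
            } variant;
    };

    static int consume(const struct bts_struct *rec)
    {
            switch (rec->qualifier) {
            case BTS_BRANCH:
                    handle_branch(rec->variant.lbr.from,
                                  rec->variant.lbr.to);
                    break;
            case BTS_TASK_ARRIVES:
            case BTS_TASK_DEPARTS:
                    handle_sched(rec->qualifier, rec->variant.timestamp);
                    break;
            default:
                    break;  /* record type added later: ignore or bail out */
            }
            return 0;
    }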
Regarding 1, we currently provide scheduling timestamps, which are arch
independent, and from-to branch information, which should be available
on all architectures for a similar feature. I could think of basic block
from-to information as an alternative representation on some
architectures. I could also imagine that other architectures provide
additional information (like the predicted bit on Netburst that was
dropped for later architectures). Both could be modelled using
additional record types.
Additional architectures may want to (re)use and extend the x86 bts
record, or they may want to invent their own format. In the former case,
we may move the bts union and the bts commands to the generic ptrace
header, and provide a default implementation for architectures that do
not support it (basically pretend that the array is empty or return an
error). In the latter case, they may copy parts of the x86 header.
I would postpone the decision until there are more arches that wish to
support this feature.
Thank you for reading until here.
regards,
markus.
On Friday 07 December 2007 10:11:04 Metzger, Markus T wrote:
> Roland, Andi,
>
> I would like to discuss the ptrace user interface for the BTS extension.
> In previous emails,
> Andi suggested a stream-like interface, but is also OK with an
> array-like interface (as far as I understood).
> Roland is dubious about the ptrace API additions.
>
> I would like to settle the discussion and find an interface that
> everybody can agree to, so I can implement that interface and we can
> move forward with the patch.
The most efficient interface would be zero copy: the tracing user
process supplies memory that is pinned (get_user_pages()) subject to the
mlock rlimit, and the kernel tells the CPU to log directly into that.
Kernel buffers would only be needed for the per-CPU kernel logging.
Then the only information that would need to be passed with
system calls would be wakeup, tail position and perhaps a wrapping
counter.
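Roughly, the pinning step could look like this (sketch only; buf and
len are the tracer-supplied buffer, and all error handling is omitted):

    int nr_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    struct page **pages = kmalloc(nr_pages * sizeof(*pages), GFP_KERNEL);
    int pinned;

    down_read(&current->mm->mmap_sem);
    pinned = get_user_pages(current, current->mm, (unsigned long)buf,
                            nr_pages, 1 /* write */, 0 /* force */,
                            pages, NULL);
    up_read(&current->mm->mmap_sem);
    /* ... then point the hardware's buffer base at these pages ... */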
> Regarding 1, we currently provide scheduling timestamps, which are arch
That's actually broken because you don't log the CPU number.
sched_clock() without the associated CPU number is meaningless
on systems without a synchronized, p-state-invariant TSC
[that is, older Intel systems or some larger current systems].
And even if you log the CPU number it is unclear how user space
would make sense of that. It can't generally, even the kernel
can't. Perhaps better to just not supply any time stamps for this.
Even on systems that don't have the unsynchronized-TSC problem above,
it can be tricky to convert the TSC into real time. For one thing, we
don't currently report the TSC frequency. Usually it tends to be the
highest p-state, but finding that out is also difficult and unreliable
(rounding errors) and might not always be true in the future. Anyway,
that could be solved by reporting it separately in /proc/cpuinfo, but
given all the other problems I have my doubts it is really worth it. I
would suggest dropping the time stamp.
> Additional architectures may want to (re)use and extend the x86 bts
> record, or they may want to invent their own format. In the former case,
I think that's actually not a good goal. If the code is so complicated
that sharing it makes sense, then you did something wrong :)
-Andi
>From: Andi Kleen [mailto:[email protected]]
>Sent: Friday, 7 December 2007 12:18
>> I would like to settle the discussion and find an interface that
>> everybody can agree to, so I can implement that interface and we can
>> move forward with the patch.
>
>The most efficient interface would be zero copy: the tracing user
>process supplies memory that is pinned (get_user_pages()) subject to
>the mlock rlimit, and the kernel tells the CPU to log directly into
>that.
That would require users to understand all kinds of BTS formats
and to detect the hardware they are running on in order to interpret
the data.
So far, there are two different formats. But one of them is wasting
an entire word of memory per record. I could imagine that this would
change some day.
Other architectures would likely use an entirely different format.
Users who want to support several architectures would benefit from
a common format for this from-to branch information.
>> Regarding 1, we currently provide scheduling timestamps, which are
>> arch
>
>That's actually broken because you don't log the CPU number.
>sched_clock() without the associated CPU number is meaningless
>on systems without a synchronized, p-state-invariant TSC
>[that is, older Intel systems or some larger current systems].
I see.
The intention was not to provide exact timestamps, but rather a
relative order of BTS chunks that would allow debuggers to
show which parts were (actually, "might have been" is the best we
can say) executed in parallel, and which parts were definitely
executed sequentially.
Without a global time, though, this becomes rather meaningless.
Is there some other metric that would allow me to order BTS
chunks for different threads?
>> Additional architectures may want to (re)use and extend the x86 bts
>> record, or they may want to invent their own format. In the former
>> case,
>
>I think that's actually not a good goal. If the code is so complicated
>that sharing it makes sense, then you did something wrong :)
Agreed;-)
Users would benefit if they wanted to support multiple architectures.
Otherwise, they would need to invent such a more general interface
themselves, or duplicate code, which is never a good thing.
regards,
markus.
On Friday 07 December 2007 13:01:28 Metzger, Markus T wrote:
> >From: Andi Kleen [mailto:[email protected]]
> >Sent: Friday, 7 December 2007 12:18
>
> >> I would like to settle the discussion and find an interface that
> >> everybody can agree to, so I can implement that interface and we can
> >> move forward with the patch.
> >
> >The most efficient interface would be zero copy: the tracing user
> >process supplies memory that is pinned (get_user_pages()) subject to
> >the mlock rlimit, and the kernel tells the CPU to log directly into
> >that.
>
> That would require users to understand all kinds of BTS formats
> and to detect the hardware they are running on in order to interpret
> the data.
That's true. I guess it could be abstracted in a library, but doing
it all in kernel is indeed nicer.
OK, in theory you could go fancy and put the library into the vDSO
which runs in ring 3. Then it would be tied to the kernel again.
> So far, there are two different formats. But one of them is wasting
> an entire word of memory per record. I could imagine that this would
> change some day.
>
> Other architectures would likely use an entirely different format.
> Users who want to support several architectures would benefit from
> a common format for this from-to branch information.
I guess some other users would prefer higher performance, but yes
there are probably both types. I don't know what is more important.
> Is there some other metric that would allow me to order BTS
> chunks for different threads?
With out-of-order CPUs, exact global metrics are pretty difficult.
At which point of the instruction execution would you measure?
Anyway, if RDTSC doesn't work, the only global alternatives are much
slower (like southbridge timers) or very inaccurate (jiffies).
I would just drop it since it'll likely always be somewhat misleading.
-Andi
>From: Andi Kleen [mailto:[email protected]]
>Sent: Friday, 7 December 2007 14:04
>With out-of-order CPUs, exact global metrics are pretty difficult.
>At which point of the instruction execution would you measure?
All I want to do is order the execution chunks of different
threads. Taking two snapshots somewhere near the beginning and
the end of context switching should be good enough.
There's all the scheduler code in between (or at least the context
switch code). I don't think I need to worry about the exact point
during instruction execution.
I don't think it makes sense to try to correlate instructions
from different threads. It would be a wonderful feature to show
a synchronous trace across multiple threads. But that would require
you to measure time for each instruction. I don't think that's
feasible without reducing performance to single stepping;-)
>Anyway, if RDTSC doesn't work, the only global alternatives are much
>slower (like southbridge timers) or very inaccurate (jiffies).
Would jiffies be a metric that works across CPUs?
At the granularity that I want to measure, I guess that accuracy
is not important at all.
>I would just drop it since it'll likely always be somewhat misleading.
I guess I will (have to) drop it if it cannot be used for what I
intended.
thanks and regards,
markus.
Sorry I did not get more into this discussion earlier. I still have not
read through all of the email threads. But I have looked over the current
version of your code now in -mm.
I think this work has a great deal of overlap with the perfmon2 project.
There are two facets that overlap, which together are the whole BTS feature.
The same x86 "debug store" hardware is programmed for both the BTS and the
performance monitoring features. The implementations clearly have to
cooperate on managing that hardware. Your ds.c is a start in the right
direction, to abstract the hardware configuration from the BTS feature and
its interface. I'm not familiar with the perfmon2 code, but it may have
something similar already.
The rest of the BTS feature is the buffer management and the interface.
It has to deal with the hardware buffer setup, context switching, and
overflow interrupts, and delivering data from the hardware buffers to the
interface in appropriate formats. We'd also like it to be able to trace
kernel-mode as well as user-mode, and either deliver combined data or
segregate the data between the two for user-space and kernel-space users
who need not know about each other's tracing. (On some of the hardware
you can program it to record only one or the other (X86_FEATURE_DSCPL).
On older hardware, or when separately tracing both, you can trace both
and then distinguish each sample by its from_ip.) perfmon2 also wants to
address all of that.
I don't much like the way you've shoe-horned the context-switch timestamp
logging into the BTS feature. It's a nice feature to have in some form,
and I sympathize with your seeing it as easy pickings once you had the
BTS buffer machinery handy. But really it is not part of the BTS feature
and there is nothing arch-dependent about it. Given some other general
thing providing the buffer management et al, that could just be done in
schedule(), i.e.:
departs(prev);
context_switch(rq, prev, next); /* unlocks the rq */
arrives(prev);
If there is a general thing for event-reporting from perfmon2 or
whatever, then it might be natural to have the context-switch event
reports configurable to different record formats you might be using
for other things. For a BTS-style record, I would use:
departs: { .from = task_pt_regs(prev)->ip, .to = jiffies, .flag = MAGIC1 }
arrives: { .from = jiffies, .to = task_pt_regs(next)->ip, .flag = MAGIC2 }
MAGIC1 = 0xffff0001
MAGIC2 = 0xffff0002
or something like that, i.e. as if it were a "branch to block-time" and a
"branch from wake-time". (Actually you might want MAGIC3 and MAGIC4 too,
for whether it was a voluntary or involuntary context switch.) For
different use that is doing mostly other event sampling rather than BTS,
it might use a different format that gives more register info, a la PEBS.
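Spelled out as a sketch (emit_bts_record() is a stand-in for whatever
the generic buffer layer would provide):

    static void departs(struct task_struct *task)
    {
            struct bts_struct rec = {
                    .from = task_pt_regs(task)->ip,
                    .to   = jiffies,
                    .flag = MAGIC1,    /* "branch to block-time" */
            };
            emit_bts_record(task, &rec);
    }

    static void arrives(struct task_struct *task)
    {
            struct bts_struct rec = {
                    .from = jiffies,
                    .to   = task_pt_regs(task)->ip,
                    .flag = MAGIC2,    /* "branch from wake-time" */
            };
            emit_bts_record(task, &rec);
    }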
I'm no expert on perfmon2 and I understand there are many issues to be
resolved to get it into the kernel. But if you are not desperate to have
the BTS feature in the kernel ASAP, it would be ideal IMHO if you can work
with Stephane et al on putting this work together. I'd like to see the
work go into the kernel in much smaller pieces even than your BTS patch
set that's in -mm. The first thing is just the DS hardware management,
context switch and hardware-facing parts of the buffer management (one or
three or four small bisect-friendly patches just for that much). If
you and Stephane can hash out a fresh patch that provides what you both
need for that, that would be a great start. Personally, I'd prefer to
abandon the ptrace extensions altogether in favor of some generalized
event buffer interface that comes from merging perfmon2. But if you
still want to do the ptrace interface, it can be built on the shared
DS-management code. What do you think?
Thanks,
Roland
>From: Roland McGrath [mailto:[email protected]]
>Sent: Wednesday, 30 January 2008 08:26
>To: Metzger, Markus T
>I think this work has a great deal of overlap with the
>perfmon2 project.
>There are two facets that overlap, which together are the
>whole BTS feature.
>
>The same x86 "debug store" hardware is programmed for both the
>BTS and the
>performance monitoring features. The implementations clearly have to
>cooperate on managing that hardware. Your ds.c is a start in the right
>direction, to abstract the hardware configuration from the BTS
>feature and
>its interface. I'm not familiar with the perfmon2 code, but
>it may have
>something similar already.
I am not familiar with the perfmon project. I looked briefly at
arch/x86/kernel/cpu/perfctr-watchdog.c and before I started this, I
grep'ed for DS_AREA to find out who else is using it, but did not find
anything.
As far as I know, performance monitoring can be done in two ways:
- collect performance event counts
- collect machine state samples on (the n-th) performance event(s)
The former is done entirely in registers. The latter (spelled PEBS) uses
a buffer to store the machine state samples.
The configuration of this buffer shares the DS SAVE_AREA MSR with the
configuration of BTS. That is, DS_AREA points to a struct that
configures both BTS and PEBS.
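For reference, the 32-bit DS save area that DS_AREA points to looks
roughly like this (the 64-bit formats widen the fields to 64 bits;
names are mine, not the patch's definition):

    struct ds_save_area32 {
            u32 bts_buffer_base;
            u32 bts_index;
            u32 bts_absolute_maximum;
            u32 bts_interrupt_threshold;
            u32 pebs_buffer_base;
            u32 pebs_index;
            u32 pebs_absolute_maximum;
            u32 pebs_interrupt_threshold;
            u32 pebs_counter_reset;
    };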
My first impression of perfctr-watchdog is that it collects events and
does not use PEBS at all. There is neither a conflict nor an overlap.
I do agree that they share a common interest. Both collect information
about a task's execution. The information they collect is disjoint,
though.
>The rest of the BTS feature is the buffer management and the interface.
>It has to deal with the hardware buffer setup, context switching, and
>overflow interrupts, and delivering data from the hardware
>buffers to the
>interface in appropriate formats. We'd also like it to be
>able to trace
>kernel-mode as well as user-mode, and either deliver combined data or
>segregate the data between the two for user-space and
>kernel-space users
>who need not know about each other's tracing.
Kernel mode users would want a per-cpu (system) trace, whereas user mode
users would want a per-task trace.
So far, I thought that you could have either one or the other. I'm
beginning to think
that you could serve both at the same time if you are willing to accept
a few hiccups in the system trace.
For a system trace, we would stick with the per-task trace but trace all
tasks. We would need to extend the trace format to encode the cpuid.
This would allow us to serve all per-task trace requests. On a system
trace request, we would combine the per-task traces to form a (mostly
complete) per-CPU system trace.
I'm afraid that this would require us to move the buffers to pageable
memory; something I was very much against, so far.
This logic would be shared between performance monitoring and last
branch recording.
>I don't much like the way you've shoe-horned the
>context-switch timestamp
>logging into the BTS feature. It's a nice feature to have in
>some form,
>and I sympathize with your seeing it as easy pickings once you had the
>BTS buffer machinery handy. But really it is not part of the
>BTS feature
>and there is nothing arch-dependent about it. Given some other general
>thing providing the buffer management et al, that could just be done in
>schedule(), i.e.:
>
> departs(prev);
> context_switch(rq, prev, next); /* unlocks the rq */
> arrives(prev);
>
I agree that this is not arch dependent.
I need the context-switch timestamps in the BTS trace, though, and the
BTS buffer is arch dependent. Since there is only one arch that supports
this, at the moment, I just put it there.
Once we get more archs, it would begin to make sense to introduce
another layer and have the context-switch timestamps taken in
schedule(). The functionality still needs to be provided by the arch
(which knows about the BTS format), so I wonder whether it would not
cause more work and confusion than it helps.
>If there is a general thing for event-reporting from perfmon2 or
>whatever, then it might be natural to have the context-switch event
>reports configurable to different record formats you might be using
>for other things. For a BTS-style record, I would use:
>
> departs: { .from = task_pt_regs(prev)->ip, .to = jiffies, .flag = MAGIC1 }
> arrives: { .from = jiffies, .to = task_pt_regs(next)->ip, .flag = MAGIC2 }
> MAGIC1 = 0xffff0001
> MAGIC2 = 0xffff0002
>
You put the magic into the flags field; I put it into an escape
from-address.
The flags variant seems more natural to me and I like your way of
reading it; the escape-address variant could be implemented on arches
that do not support an extra field.
I don't have a strong preference, either way. If there is going to be a
new abstraction layer, I would leave it to the arch.
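For illustration, the escape variant on hardware whose raw record has
only a from and a to field could look like this (the marker value is
made up):

    #define BTS_ESCAPE_FROM ((unsigned long)-1)  /* impossible source */

    raw.from = BTS_ESCAPE_FROM;  /* marks an artificial record */
    raw.to   = event_data;       /* e.g. a qualifier or a timestamp */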
>I'm no expert on perfmon2 and I understand there are many issues to be
>resolved to get it into the kernel. But if you are not
>desperate to have
>the BTS feature in the kernel ASAP, it would be ideal IMHO if you can work
>with Stephane et al on putting this work together.
I would like to see the per-task BTS feature in the kernel and
accessible from user space.
This would allow people to program against the API and start using the
hardware's feature for application debugging.
Do you have concerns regarding the user API?
On the other hand, I see the possibility and the necessity to provide a
common kernel interface for all kinds of task and system profiling - and
maybe a user interface, as well.
I do see the two parts independent from each other, though.
The BTS patch series establishes the user API for branch recording and
implements the technique most useful for debuggers. It allows one aspect
of it to be used right away.
We are discussing extensions that happen behind the scenes to make this
more useful from inside the kernel and to structure it in a better way.
I would be happy to work with Stephane to get this additional work done.
>I'd like to see the
>work go into the kernel in much smaller pieces even than your BTS patch
>set that's in -mm. The first thing is just the DS hardware management,
>context switch and hardware-facing parts of the buffer
>management (one or
>three or four small bisect-friendly patches just for that much). If
>you and Stephane can hash out a fresh patch that provides what you both
>need for that, that would be a great start.
I think that the low-level parts are pretty much disjoint - unless I
completely missed the PEBS aspect of perfmon. That would leave the patch
pretty much as it is.
>Personally, I'd prefer to
>abandon the ptrace extensions altogether in favor of some generalized
>event buffer interface that comes from merging perfmon2. But if you
>still want to do the ptrace interface, it can be built on the shared
>DS-management code. What do you think?
I think that the ptrace interface is useful for the debugging aspect of
it.
The way it was written, it is intended to be extended to cover other
events, as well. Whether we want to actually extend it or use some other
mechanism is another question.
I would prefer the two features to live independently in the kernel
while Stephane, I, and whoever is interested work on identifying
commonalities and merging the two features.
I would like to start with a better understanding of perfmon. If I grep
for perfmon, I find arch/x86/kernel/cpu/perfctr-watchdog.c and
include/asm-x86/intel_arch_perfmon.h. Could someone point me to more
information and more source files?
I see a lot of hits in arch/ia64, though. Is there a project to extend
this to arch/x86?
thanks and regards,
markus.
Hello Markus,
On Jan 30, 2008 11:32 AM, Metzger, Markus T <[email protected]> wrote:
> >From: Roland McGrath [mailto:[email protected]]
> >Sent: Wednesday, 30 January 2008 08:26
> >To: Metzger, Markus T
>
> >I think this work has a great deal of overlap with the
> >perfmon2 project.
> >There are two facets that overlap, which together are the
> >whole BTS feature.
> >
> >The same x86 "debug store" hardware is programmed for both the
> >BTS and the
> >performance monitoring features. The implementations clearly have to
> >cooperate on managing that hardware. Your ds.c is a start in the right
> >direction, to abstract the hardware configuration from the BTS
> >feature and
> >its interface. I'm not familiar with the perfmon2 code, but
> >it may have
> >something similar already.
>
> I am not familiar with the perfmon project. I looked briefly at
> arch/x86/kernel/cpu/perfctr-watchdog.c and before I started this, I
> grep'ed for DS_AREA to find out who else is using it, but did not find
> anything.
>
You can get information about the perfmon2 project at: http://perfmon2.sf.net
On processors that do support PEBS, perfmon2 does provide support. As
such, it needs access to the DS_AREA.
You can browse our git tree on kernel.org; follow the pointer from our
project's page.
I would like to take a look at your patches. Where can I get them?
My understanding at this point is that we would need to coordinate
access to the DS_AREA. It would likely have to be mutually exclusive
access. I am planning on adding support for LBRs in the next few months.
But BTS, the way it is currently defined, is useless for performance
monitoring. I think this is a great debug feature, though.
Now, there is some preliminary perf MSR allocator in the kernel.
However, it does not know of all available MSRs out there. It focuses
mostly on the counters. Perfmon, oprofile, and the NMI watchdog use it.
I think it could be generalized to other (non-contiguous) MSRs.
I would be happy to work with you in refining this MSR allocator.
Thanks.
>From: stephane eranian [mailto:[email protected]]
>Sent: Wednesday, 30 January 2008 12:01
>You can get information about the perfmon2 project at:
>http://perfmon2.sf.net
I downloaded your patch for 2.6.23 and cloned your git repository.
From a first glance, it looks like there is indeed a lot of overlap.
>I would like to take a look at your patches. Where can I get them?
You can find the changes in the mm branch of the x86 git. Do they keep
the patches there, somewhere, as well?
You should get the best overview if you look at arch/x86/kernel/ds.c,
include/asm-x86/ds.h, include/asm-x86/ptrace-abi.h, and
arch/x86/kernel/ptrace.c or simply grep for BTS.
I have all the patches that I submitted, but they document the
development process rather than describe the final result.
>My understanding at this point is that we would need to coordinate
>access to the DS_AREA. It would likely have to be mutually exclusive
>access.
From a brief look at your patch, I would say that we should rather
combine the configuration and collection part. They are very similar.
We might still want to provide different interfaces on top of it.
In particular, I think the ptrace interface is most appropriate for
debuggers.
>I am planning on adding support for LBRs in the next few months. But
>BTS, the way it is currently defined, is useless for performance
>monitoring. I think this is a great debug feature, though.
Debugging was my target;-)
>Now, there is some preliminary perf MSR allocator in the kernel.
>However, it does not know of all available MSRs out there. It focuses
>mostly on the counters. Perfmon, oprofile, and the NMI watchdog use it.
>I think it could be generalized to other (non-contiguous) MSRs.
>
>I would be happy to work with you in refining this MSR allocator.
Are you planning to get the perfmon patch into the kernel? Or do you
want it to remain a separate patch?
In the first case, we should try to merge the features. In the second
case, refining the MSR allocator would probably be best.
thanks and regards,
markus.
Markus,
On Jan 30, 2008 1:52 PM, Metzger, Markus T <[email protected]> wrote:
> >From: stephane eranian [mailto:[email protected]]
> >Sent: Wednesday, 30 January 2008 12:01
>
> >You can get information about the perfmon2 project at:
> >http://perfmon2.sf.net
>
> I downloaded your patch for 2.6.23 and cloned your git repository.
>
> From a first glance, it looks like there is indeed a lot of overlap.
>
>
> >I would like to take a look at your patches. Where can I get them?
>
> You can find the changes in the mm branch of the x86 git. Do they keep
> the patches there, somewhere, as well?
I will take a look at what you have in -mm git.
>
> You should get the best overview if you look at arch/x86/kernel/ds.c,
> include/asm-x86/ds.h, include/asm-x86/ptrace-abi.h, and
> arch/x86/kernel/ptrace.c or simply grep for BTS.
>
OK.
>
> >My understanding at this point is that we would need to coordinate
> >access to the DS_AREA. It would likely have to be mutually exclusive
> >access.
>
> From a brief look at your patch, I would say that we should rather
> combine the configuration and collection part. They are very similar.
>
That's probably fine too. I'll let you know once I know more about your code.
> We might still want to provide different interfaces on top of it.
> In particular, I think the ptrace interface is most appropriate for
> debuggers.
I agree.
> >Now, there is some preliminary perf MSR allocator in the kernel.
> >However, it does not know of all available MSRs out there. It focuses
> >mostly on the counters. Perfmon, oprofile, and the NMI watchdog use
> >it. I think it could be generalized to other (non-contiguous) MSRs.
> >
> >I would be happy to work with you in refining this MSR allocator.
>
>
> Are you planning to get the perfmon patch into the kernel? Or do you
> want it to remain a separate patch?
The former, i.e., trying to merge with mainline.
>
> In the first case, we should try to merge the features. In the second
> case, refining the MSR allocator would probably be best.
>
>
Markus,
I looked at your code. My understanding is that it abstracts the BTS so
that higher-level code does not have to know the intricacies of the
implementation, especially given that they changed the DS_AREA structure
a lot between P4 and Core 2.
The code does not cover PEBS, though. In the case of perfmon, the raw
PEBS data is exposed to user level. This way we ensure zero copy all the
way to the application. Of course, that means applications need to know
whether they are using P4 PEBS or Core PEBS, but that's fine.
I noticed that your code does not support the 64-bit Netburst DS_AREA
configuration. I have used that myself and it works. Also, Penryn
support is missing (family 6, model 23).
The code does not actually touch the HW resource. This needs to be added
somehow. It must be part of a register allocator outside of ds.c. We
could extend the simple allocator currently used for the PMU registers
in perfctr-watchdog.c. In fact, that one would have to be generalized.
Perfmon does read/write the DS_AREA pointer and content. PEBS is
supported in per-thread and system-wide mode. This means that we
save/restore the DS_AREA pointer on context switch. Perfmon does not
touch the BTS-specific fields, just the PEBS-relevant fields. PEBS does
generate interrupts, and they are routed via the PMU interrupt.
Internally, perfmon2 needs to access the internals of the ds_area to
fiddle with pebs_index/pebs_intr_thres. Some of that code is in the
perfmon core and not in the model-specific (P4 vs. Core) kernel module.
On Core 2, I think I could get rid of this dependency, but not on P4. I
could use your code to abstract the DS area and give me access to the
PEBS fields in a more generic way. That assumes you would add PEBS
support to ds.c.
What do you think?
Thanks.
>-----Original Message-----
>From: stephane eranian [mailto:[email protected]]
>Sent: Thursday, 31 January 2008 12:16
>I looked at your code. My understanding is that it abstracts the BTS
>so that higher-level code does not have to know the intricacies of the
>implementation, especially given that they changed the DS_AREA
>structure a lot between P4 and Core 2.
Correct.
>I noticed that your code does not support the 64-bit Netburst DS_AREA
>configuration. I have used that myself and it works. Also, Penryn
>support is missing (family 6, model 23).
Correct. I started with the hardware underneath my desk. I plan to
support more hardware once the feature is perceived as being useful.
>The code does not actually touch the HW resource. This needs to be
>added somehow. It must be part of a register allocator outside of ds.c.
>We could extend the simple allocator currently used for the PMU
>registers in perfctr-watchdog.c. In fact, that one would have to be
>generalized.
>
>Perfmon does read/write the DS_AREA pointer and content. PEBS is
>supported in per-thread and system-wide mode. This means that we
>save/restore the DS_AREA pointer on context switch. Perfmon does not
>touch the BTS-specific fields, just the PEBS-relevant fields. PEBS does
>generate interrupts, and they are routed via the PMU interrupt.
I added ds_area to the thread_struct; someone added debugctl before.
On context-switch, the DS_AREA MSR is written if and only if the next
thread has a different ds_area entry than the prev thread and either one
has the corresponding TIF flag set.
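In code, the check is roughly the following (the thread_struct field
and TIF flag names are approximations):

    if (prev->thread.ds_area_msr != next->thread.ds_area_msr &&
        (test_tsk_thread_flag(prev, TIF_DS_AREA_MSR) ||
         test_tsk_thread_flag(next, TIF_DS_AREA_MSR)))
            wrmsrl(MSR_IA32_DS_AREA, next->thread.ds_area_msr);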
>Internally, perfmon2 needs to access the internals of the ds_area to
>fiddle with pebs_index/pebs_intr_thres. Some of that code is in the
>perfmon core and not in the model-specific (P4 vs. Core) kernel module.
>On Core 2, I think I could get rid of this dependency, but not on P4. I
>could use your code to abstract the DS area and give me access to the
>PEBS fields in a more generic way. That assumes you would add PEBS
>support to ds.c.
That's what I was thinking as well.
It could also do the buffer memory management. I currently do the rlimit
check in ptrace.c and the allocation in ds.c. We should probably move
the rlimit check to ds.c, as well.
Currently, it also knows how to read and write a single BTS entry. And
it even understands the BTS format including artificial records.
I would leave at least the knowledge of how to read and write a BTS and
PEBS record inside ds.c. This way, we get all the overflow handling on
write in one place. I would definitely move the interpretation of
artificial records to its users.
It was very convenient for me to put the entire hardware abstraction
into ds.c. And it does make sense for BTS, since the general format is
identical for all processors.
This would not work for your zero-copy approach, though. And it makes
less sense for PEBS due to the different formats.
I need to think some more about this.
So far, ds.c would:
- do resource allocation of BTS and PEBS buffers per task (or
  system-wide)
- do memory allocation for those buffers, including rlimit checks
  (currently, I use non-pageable memory only; there were ideas to use a
  pageable shadow buffer, which would go here)
- provide a PMI interrupt handler; users would register a callback for
  overflow notification, with wrap-around if the callback is NULL
- allow read and write of BTS and PEBS records
- give (read-only) access to BTS and PEBS pointers
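A rough sketch of what such an interface could look like (prototypes
are illustrative only, not a final proposal):

    typedef void (*ds_ovfl_callback_t)(struct task_struct *task);

    /* a NULL callback would mean: wrap around on overflow */
    int  ds_allocate(struct task_struct *task, size_t size,
                     ds_ovfl_callback_t ovfl);
    void ds_release(struct task_struct *task);

    int  ds_read_bts(struct task_struct *task, size_t index,
                     struct bts_struct *record);
    int  ds_write_bts(struct task_struct *task,
                      const struct bts_struct *record);

    int  ds_get_bts_index(struct task_struct *task, size_t *pos);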
What do you think?
regards,
markus.