Remove a comment that is no longer correct. The reconfiguration is done directly in __switch_to_xtra.
Signed-off-by: Markus Metzger <[email protected]>
---
Index: linux-2.6-x86/arch/x86/kernel/process_64.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/process_64.c 2007-12-10 09:35:26.%N +0100
+++ linux-2.6-x86/arch/x86/kernel/process_64.c 2007-12-10 09:36:38.%N +0100
@@ -598,10 +598,6 @@
memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
}
- /*
- * Last branch recording recofiguration of trace hardware and
- * disentangling of trace data per task.
- */
if (test_tsk_thread_flag(prev_p, TIF_BTS_TRACE_TS))
ptrace_bts_take_timestamp(prev_p, BTS_TASK_DEPARTS);
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
[ Cc:-ed Alan Stern a'ka kwatch fame - Alan might be interested in this
too. ]
hi Markus,
finally had time to take a closer look at the design of your
Branch-Trace-Support-via-ptrace-on-x86 patchset. I think we'll need to
work some more on general API issues.
here's the current proposed API:
+ case PTRACE_BTS_MAX_BUFFER_SIZE:
+ case PTRACE_BTS_ALLOCATE_BUFFER:
+ case PTRACE_BTS_GET_BUFFER_SIZE:
+ case PTRACE_BTS_READ_RECORD:
+ case PTRACE_BTS_CONFIG:
+ case PTRACE_BTS_STATUS:
i can see a couple of open questions:
1) PTRACE_BTS_MAX_BUFFER_SIZE
why is a trace buffer size limit visible to user-space? It's 4000
entries right now:
#define PTRACE_BTS_BUFFER_MAX 4000
it would be more flexible if user-space could offer arbitrary sized
buffers, which the kernel would attempt to mlock(). Then those pages
could be fed to the BTS MSR. I.e. there should be no hard limit in the
API, other than a natural resource limit.
2) struct bts_struct
the structure of it is hardwired:
the basic unit is an array of bts_struct:
+struct bts_struct {
+ enum bts_qualifier qualifier;
+ union {
+ /* BTS_BRANCH */
+ struct {
+ long from_ip;
+ long to_ip;
+ } lbr;
+ /* BTS_TASK_ARRIVES or
+ BTS_TASK_DEPARTS */
+ unsigned long long timestamp;
+ } variant;
+};
while other CPUs (on other architectures, etc.) might have a different
raw format for the trace entries. So it would be better to implement BTS
support in kernel/ptrace.c with a general trace format, and to provide a
cross-arch API (only implemented by arch/x86 in the beginning) to
translate the 'raw' trace entries into the general format.
3)
it would certainly be useful for some applications to have a large
"virtual" trace buffer and a small "real" trace buffer. The kernel would
use BTS high-watermark interrupts to feed the real trace buffer into the
larger "virtual" trace buffer. This way we wouldnt even have to use
mlock() to pin down the trace buffer.
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Montag, 10. Dezember 2007 21:21
>here's the current proposed API:
>
>+ case PTRACE_BTS_MAX_BUFFER_SIZE:
>+ case PTRACE_BTS_ALLOCATE_BUFFER:
>+ case PTRACE_BTS_GET_BUFFER_SIZE:
>+ case PTRACE_BTS_READ_RECORD:
>+ case PTRACE_BTS_CONFIG:
>+ case PTRACE_BTS_STATUS:
>
>i can see a couple of open questions:
>
>1) PTRACE_BTS_MAX_BUFFER_SIZE
>
>why is a trace buffer size limit visible to user-space? It's 4000
>entries right now:
>
> #define PTRACE_BTS_BUFFER_MAX 4000
>
>it would be more flexible if user-space could offer arbitrary sized
>buffers, which the kernel would attempt to mlock(). Then those pages
>could be fed to the BTS MSR. I.e. there should be no hard limit in the
>API, other than a natural resource limit.
That would be a variation on Andi's zero-copy proposal, wouldn't it?
The user supplies the BTS buffer and the kernel manages DS.
Andi further suggested a vDSO to interpret the data and translate the
hardware format into a higher level user format.
I take it that you would leave that inside ptrace.
I need to look more into mlock. So far, I found a system call in
/usr/include/sys/mman.h and two functions sys_mlock() and
user_shm_lock() in the kernel. Is there a memory expert around who could
point me to some interesting places to look at?
Can we distinguish kernel-locked memory from user-locked memory?
I could imagine a malicious user to munlock() the buffer he provided to
ptrace.
Is there a real difference between mlock()ing user memory and allocating
kernel memory? There would be if we could page out mlock()ed memory when
the user thread is not running. We would need to disable DS before
paging out, and page in before enabling it. If we cannot, then kernel
allocated memory would require less space in physical memory.
>2) struct bts_struct
>
>the structure of it is hardwired:
>
>the basic unit is an array of bts_struct:
>
> +struct bts_struct {
> + enum bts_qualifier qualifier;
> + union {
> + /* BTS_BRANCH */
> + struct {
> + long from_ip;
> + long to_ip;
> + } lbr;
> + /* BTS_TASK_ARRIVES or
> + BTS_TASK_DEPARTS */
> + unsigned long long timestamp;
> + } variant;
> +};
>
>while other CPUs (on other architectures, etc.) might have a different
>raw format for the trace entries. So it would be better to
>implement BTS
>support in kernel/ptrace.c with a general trace format, and to
>provide a
>cross-arch API (only implemented by arch/x86 in the beginning) to
>translate the 'raw' trace entries into the general format.
The bts_struct is already an abstraction from the raw hardware format.
The hardware uses:
struct bts_netburst {
void* from;
void* to;
long :4;
long predicted :1;
};
struct bts_core2 {
u64 from;
u64 to;
u64 :64;
};
I encode timestamps in the raw hardware format. This puts the timestamps
at the correct places and frees me from any efforts to correlate
separate timestamp data and branch record data (which is getting ugly
when the BTS buffer overflows).
Now, arch/x86/kernel/ds.c already needs to abstract from the raw
hardware format. I chose to add timestamps to that abstraction. This
makes the DS interface general enough to be directly used by ptrace
(causing some misunderstanding, I'm afraid).
If new architectures would eliminate the extra word, I would have to
merge the qualifier and the escape address, i.e. have some (probably the
least significant) bits of the escape address encode the qualifier. The
interface would remain stable.
Other architectures would at least need to encode the source and
destination address. Thus, we should have enough room to support
timestamps for other architectures, as well. If they choose a more
compact encoding, the interface might no longer be usable.
I could instead use
struct ds_bts {
void* from;
void* to;
};
for the DS interface and move timestamps and their encoding in the DS
BTS format to ptrace.
This would move the DS interface closer to the hardware (yet still
abstract from details) and move anything extra to its users - currently
only ptrace.
On the down side, other users of DS (kernel tracing, for example) would
have to invent their own method to encode additional information into
the BTS buffer.
I saw benefit in making the DS interface more powerful.
>3)
>
>it would certainly be useful for some applications to have a large
>"virtual" trace buffer and a small "real" trace buffer. The
>kernel would
>use BTS high-watermark interrupts to feed the real trace
>buffer into the
>larger "virtual" trace buffer. This way we wouldnt even have to use
>mlock() to pin down the trace buffer.
I see two ways this could be done wrt disentangling trace:
1. keep a small per-thread buffer
2. use a per-cpu buffer
The first alternative would allow us to disentangle the per-thread trace
easily by simply reconfiguring the DS on a context switch (similar to
how it is currently done).
The second alternative would require us to memcpy() the BTS buffer at
each context switch. This would be a huge overhead at a performance
critical place.
I would therefore keep the per-thread buffer. This costs one mlock()ed
or kalloc()ed page per traced thread.
In addition to that, we need the (pageable) memory to hold the virtual
buffer.
Thus, it only pays off if the virtual buffer is significantly bigger
than the real buffer (we're paying one page extra).
Let's look at the interrupt routine.
It would need to disable tracing, page in the memory, memcpy() the BTS
buffer, and reenable tracing. This is again a big overhead and might
take some time.
AFAIK, the interrupt routine would schedule the work and return
immediately. A kernel thread would later pick up the work and do the
copying.
The traced thread would need to be suspended until the copying is done
(unless we have a table of buffers that we cycle through and thus can
provide a fresh buffer to the thread immediately).
Thus, its timeslice would effectively be reduced to the amount of work
it gets done in a single 'real' BTS buffer.
I am not familiar with the new scheduler and how this would impact
(system) performance. I could imagine that either the traced task
virtually starves or that it receives an unfairly big amount of system
resources. But then, I'm only guessing. I'm sure you can give a precise
answer.
Overall, this looks like a lot more effort and a lot more things that
could go wrong and bring down the entire system. Compared to an almost
trivial and reasonably safe implementation.
And it only starts paying off when the virtual buffer is getting
reasonably big; up to a few pages, we should be better off simply
mlock()ing the entire buffer.
For use in debuggers to show the debuggee's execution trace to the user,
I think that this is unnecessarily complicated. For all the use cases
that I can think of today, a single page of memory should suffice.
>From my own experience (debugging compilers on XScale), by far the most
interesting information is the last handful of branch records. I cannot
say how far users would want to look back in the trace history or how
far they can look back and still make some sense out of what they see.
Besides, all this would be hidden behind the ptrace interface. We might
add this functionality when there is demand.
thanks and regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
* Metzger, Markus T <[email protected]> wrote:
> That would be a variation on Andi's zero-copy proposal, wouldn't it?
>
> The user supplies the BTS buffer and the kernel manages DS.
>
> Andi further suggested a vDSO to interpret the data and translate the
> hardware format into a higher level user format.
>
> I take it that you would leave that inside ptrace.
yeah - i think both zero-copy and vdso are probably overkill for this.
On the highest level, there are two main usecases of BTS that i can
think of: debugging [a user-space task crashes and developer would like
to see the last few branches taken - possibly extended to kernel space
crashes as well], and instrumentation.
In the first use-case (debugging) zero-copy is just an unnecessary
complication.
In the second use-case (tracing, profiling, call coverage metrics), we
could live without zero-copy, as long as the buffer could be made "large
enough". The current 4000 records limit seems rather low (and arbitrary)
and probably makes the mechanism unsuitable for say call coverage
profiling purposes. There's also no real mechanism that i can see to
create a guaranteed flow of this information between the debugger and
debuggee (unless i missed something), the code appears to overflow the
array, and destroy earlier entries, right? That's "by design" for
debugging, but quite a limitation for instrumentation which might want
to have a reliable stream of the data (and would like the originating
task to block until the debugger had an opportunity to siphoon out the
data).
> I need to look more into mlock. So far, I found a system call in
> /usr/include/sys/mman.h and two functions sys_mlock() and
> user_shm_lock() in the kernel. Is there a memory expert around who
> could point me to some interesting places to look at?
sys_mlock() is what i meant - you could just call it internally from
ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
the "there's too much memory pinned down" details out of the ptrace
code.
> Can we distinguish kernel-locked memory from user-locked memory? I
> could imagine a malicious user to munlock() the buffer he provided to
> ptrace.
yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
gives a permanent reference count to those pages.
> Is there a real difference between mlock()ing user memory and
> allocating kernel memory? There would be if we could page out
> mlock()ed memory when the user thread is not running. We would need to
> disable DS before paging out, and page in before enabling it. If we
> cannot, then kernel allocated memory would require less space in
> physical memory.
mlock() would in essence just give you an easy "does this user have
enough privilege to lock this many pages" API. The real pinning would be
done by get_user_pages(). Once you have those pages, they wont be
swapped out.
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Dienstag, 11. Dezember 2007 15:53
>In the second use-case (tracing, profiling, call coverage metrics), we
>could live without zero-copy, as long as the buffer could be
>made "large
>enough". The current 4000 records limit seems rather low (and
>arbitrary)
>and probably makes the mechanism unsuitable for say call coverage
>profiling purposes.
The limit is indeed arbitrary. It was requested by one of my internal
reviewers. The rationale was that we do not want to allow ptrace users
to request that an unlimited amount of memory be allocated in the
kernel.
Andi suggested to make this a sysctl.
Would it be safe to drop the artificial limit and let the limit be the
available memory?
>There's also no real mechanism that i can see to
>create a guaranteed flow of this information between the debugger and
>debuggee (unless i missed something), the code appears to overflow the
>array, and destroy earlier entries, right? That's "by design" for
>debugging, but quite a limitation for instrumentation which might want
>to have a reliable stream of the data (and would like the originating
>task to block until the debugger had an opportunity to siphoon out the
>data).
That's correct as well. My focus is on debugging. And that's actually
the most useful behavior in that case. I'm not sure what you mean with
'instrumentation'.
Regarding streaming data (the only real use case I can think of would be
path profiling), I think I would rather allocate a bigger buffer and
send a signal when the buffer gets full instead of having the kernel
provide an interrupt service routine and copy the data from a smaller
buffer into a bigger one.
This would leave it to the tracing task which likely wants to write the
data into a file if it really got too big to hold it in physical memory.
We would need another command to reset the DS index, and probably one to
provide the entire buffer at once, and add some configuration bits to
say whether we want the signal or a cyclic buffer.
For debugging, I think this extension is not relevant. Since it could be
implemented as an extension of the current interface, I would leave it,
for now.
>> I need to look more into mlock. So far, I found a system call in
>> /usr/include/sys/mman.h and two functions sys_mlock() and
>> user_shm_lock() in the kernel. Is there a memory expert around who
>> could point me to some interesting places to look at?
>
>sys_mlock() is what i meant - you could just call it internally from
>ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
>the "there's too much memory pinned down" details out of the ptrace
>code.
>
>> Can we distinguish kernel-locked memory from user-locked memory? I
>> could imagine a malicious user to munlock() the buffer he
>provided to
>> ptrace.
>
>yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
>gives a permanent reference count to those pages.
sys_mlock() eventually results in a call to get_user_pages().
I have not found the corresponding put_page() call for munlock(),
though.
So you're suggesting to do an additional get_user_pages to prevent the
user from munlock()ing the memory?
So, if the pages are unlocked, they will still be pinned down and cannot
further be unlocked?
Something like:
ptrace_allocate(struct task_struct *child, void *base, size_t size) {
if (sys_mlock(base, size) < 0)
return -EPERM;
get_user_pages(child, ...., base, size, ....);
}
>> Is there a real difference between mlock()ing user memory and
>> allocating kernel memory? There would be if we could page out
>> mlock()ed memory when the user thread is not running. We
>would need to
>> disable DS before paging out, and page in before enabling it. If we
>> cannot, then kernel allocated memory would require less space in
>> physical memory.
>
>mlock() would in essence just give you an easy "does this user have
>enough privilege to lock this many pages" API. The real
>pinning would be
>done by get_user_pages(). Once you have those pages, they wont be
>swapped out.
The actual physical memory consumption will be worse (or at best equal)
compared to kalloc()ed memory, since we need to pin down entire pages,
whereas kalloc() would allocate just the memory that is actually needed.
Unless we would put_page() and get_user_pages() the pinned down pages on
a context switch. But then, get_user_pages() calls cond_resched(), which
might not be a good think to do during a context switch.
Overall, I think that kalloc() is actually more memory efficient than
mlock(). The only benefit of mlock() would be that we save some
copy_to_user() in the debugger task. This should not be a big deal for
debuggers, but it might become interesting for profilers.
On the other hand, profilers would typically expect a stream of data and
they would either want a terribly big buffer or store the data into a
file. In both cases, they need to do a lot of copying, already.
I would stay with kalloc() for both use cases.
Regarding your other open, are you OK with the DS interface?
Andi suggested in one of his first reviews that the configuration
commands should be combined.
Something like:
struct bts_config {
size_t buffer_size;
unsigned int flags;
}
The buffer would be reallocated, if its size changed. The flags would be
treated like the current CONFIG command to control last branch recording
and timestamp recording. They would eventually be extended to control
overflow behavior (signal vs. cyclic buffer).
I would expect that the buffer_size does not change too often and that
the flags might change more frequently. This interface would further be
harder to extend in case the struct needs to be extended.
On the up side, it rids us of two commands.
thanks and regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
* Metzger, Markus T <[email protected]> wrote:
> Andi suggested to make this a sysctl.
that's just as arbitrary ...
> Would it be safe to drop the artificial limit and let the limit be the
> available memory?
no, that would be a DoS :-/
mlock() is rlimit controlled and is available to unprivileged users - up
to a small amount of memory can be locked down. But i agree that mlock()
can be problematic - see below.
> > There's also no real mechanism that i can see to create a guaranteed
> > flow of this information between the debugger and debuggee (unless i
> > missed something), the code appears to overflow the array, and
> > destroy earlier entries, right? That's "by design" for debugging,
> > but quite a limitation for instrumentation which might want to have
> > a reliable stream of the data (and would like the originating task
> > to block until the debugger had an opportunity to siphoon out the
> > data).
>
> That's correct as well. My focus is on debugging. And that's actually
> the most useful behavior in that case. I'm not sure what you mean with
> 'instrumentation'.
the branch trace can be used to generate a very finegrained
profile/histogram of code execution - even of rarely executed codepaths
which cannot be captured via timer/perf-counter based profiling.
another potential use would be for call graph coverage testing. (which
currently requires compiler-inserted calls - would be much nicer if we
could do this via the hardware.)
etc. Branch tracing isnt just for debugging i think - as long as the
framework is flexible enough.
> The actual physical memory consumption will be worse (or at best
> equal) compared to kalloc()ed memory, since we need to pin down entire
> pages, whereas kalloc() would allocate just the memory that is
> actually needed.
i agree that mlock() has problems. A different model would be: no mlock
and no get_user_pages() - something quite close to what you have
already. Data is streamed out of the internal (hardware-filled,
kmalloc()-ed, user-inaccessible) buffer, we stop task execution until it
is copied to the larger, user-provided buffer. The debugging feature you
are interested in could be enabled as a special-case of this mechanism:
if the user-space buffer is not larger than the hardware buffer then no
streaming is needed, you can just capture into the kernel buffer.
User-space would have to do a PTRACE_BTS_DRAIN_BUFFER call (or something
like that) to get the "final portion" of the branch trace out into the
user-space buffer. [which, in your debugging special-case, would the
full internal buffer.]
that way the kmalloc()-ed buffer becomes an internal detail of buffering
that you rarely have to be aware of. (it could still be queried - like
your patch does it now.)
or something else that is intelligent. Basically, what we'd like to have
is a future-proof, extensible approach that does not necessarily stop at
debugging and integrates this hardware capability into Linux
intelligently.
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Mittwoch, 12. Dezember 2007 12:04
>> Would it be safe to drop the artificial limit and let the
>limit be the
>> available memory?
>
>no, that would be a DoS :-/
How about:
for (;;) {
int pid = fork();
if (pid)
ptrace_allocate(pid, <small_buffer_size>);
}
I guess what we really want is a per-task kalloc() limit. That limit can
be quite big.
Should this rather be within kalloc() itself (with the data stored in
task_struct) or should this be local to ptrace_bts_alloc?
Is there already a per-task memory limit? We could deduct the kalloc()ed
memory from that.
>> the most useful behavior in that case. I'm not sure what you
>mean with
>> 'instrumentation'.
>
>the branch trace can be used to generate a very finegrained
>profile/histogram of code execution - even of rarely executed
>codepaths
>which cannot be captured via timer/perf-counter based profiling.
>
>another potential use would be for call graph coverage testing. (which
>currently requires compiler-inserted calls - would be much nicer if we
>could do this via the hardware.)
>
>etc. Branch tracing isnt just for debugging i think - as long as the
>framework is flexible enough.
Agreed.
>> The actual physical memory consumption will be worse (or at best
>> equal) compared to kalloc()ed memory, since we need to pin
>down entire
>> pages, whereas kalloc() would allocate just the memory that is
>> actually needed.
>
>i agree that mlock() has problems. A different model would be:
I have my doubts regarding any form of two-buffer approach.
Let's assume we find a way to get a reasonably big limit.
Users who want to process that huge amount of data would be better off
using a file-based approach (well, if it cannot be held in physical
memory, they will spend most of their time swapping, anyway). Those
users would typically wait for the 'buffer full' event and drain the
buffer into a file - whether this is the real buffer or a bigger virtual
buffer.
The two-buffer approach would only benefit users who want to hold the
full profile in memory - or who want to stall the debuggee until they
processed or somehow compressed the data collected so far.
Those approaches would not scale for very big profiles.
The small profile cases would already be covered with a reasonably big
real buffer.
The two-buffer approach is moving out the limit of what can be profiled
in (virtual) memory.
Independent of how many buffers we have, we need to offer overflow
handling. This could be:
1. cyclic buffer
2. buffer full event
Robust users would either:
a. select 2 and ignore the overflow signal to collect an initial profile
b. select 1 to collect a profile tail
c. select 2 and drain the real or virtual buffer into a file to collect
the full profile
d. select 2 and bail out on overflow (robust but limited usefulness)
Debuggers would typically choose b. Path profilers would typically
choose c. Student projects would typically choose d.
I see benefit in providing a second overflow handling method.
I only see benefit in a bigger second buffer if we cannot get the limit
to a reasonable size.
Am I missing something?
thanks and regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
* Metzger, Markus T <[email protected]> wrote:
> Users who want to process that huge amount of data would be better off
> using a file-based approach (well, if it cannot be held in physical
> memory, they will spend most of their time swapping, anyway). Those
> users would typically wait for the 'buffer full' event and drain the
> buffer into a file - whether this is the real buffer or a bigger
> virtual buffer.
>
> The two-buffer approach would only benefit users who want to hold the
> full profile in memory - or who want to stall the debuggee until they
> processed or somehow compressed the data collected so far. Those
> approaches would not scale for very big profiles. The small profile
> cases would already be covered with a reasonably big real buffer.
well, the two-buffer approach would just be a general API with no
limitations. It would make the internal buffer mostly a pure performance
detail.
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Donnerstag, 13. Dezember 2007 11:30
>> Users who want to process that huge amount of data would be
>better off
>> using a file-based approach (well, if it cannot be held in physical
>> memory, they will spend most of their time swapping, anyway). Those
>> users would typically wait for the 'buffer full' event and drain the
>> buffer into a file - whether this is the real buffer or a bigger
>> virtual buffer.
>>
>> The two-buffer approach would only benefit users who want to
>hold the
>> full profile in memory - or who want to stall the debuggee
>until they
>> processed or somehow compressed the data collected so far. Those
>> approaches would not scale for very big profiles. The small profile
>> cases would already be covered with a reasonably big real buffer.
>
>well, the two-buffer approach would just be a general API with no
>limitations. It would make the internal buffer mostly a pure
>performance
>detail.
Agreed.
Somewhat.
A user-provided second buffer would need to be up-to-date when we switch
to the user's task.
We would either need to drain the real buffer when switching from the
traced task;
or we would need to drain the real buffers of all traced tasks when
switching to the tracing task.
Both would require a get_user_pages() during context switching.
Alternatively, we could schedule a kernel task to drain the real buffer
when switching from a traced task.
The tracing task would then need to wait for all those kernel tasks. I'm
not sure how that affects scheduling fairness. And it's getting quite
complicated.
A kernel-provided second buffer could be entirely hidden behind the
ptrace (or, rather, ds) interface. It would not even have to be drained
before switching to the tracing task, since ds would just look into the
real buffer and then move on to the second buffer - transparent to the
user. Its size could be deducted from the user's memory limit and it
could be in pageable memory.
We would not be able to give precise overflow signals, that way (the
not-yet-drained real buffer might actually cause an overflow of the
second buffer, when drained). By allowing the user to query for the
number of BTS records to drain, we would not need to. A user drain would
drain both bufers.
The second buffer would be a pure performance/convenience detail of ds,
just like you suggested.
The ptrace API would allow the user to:
- define (and query) the overflow mechanism
(wrap-around or event)
- define (and query) the size of the buffer within certain limits
(we could either give an error or cut off)
- define (and query) events to be monitored
(last branch trace, scheduling timestamps)
- get a single BTS record
- query the number of BTS records
(to find out how big your drain buffer needs to be; it may be bigger
than you requested)
- drain all BTS records (copy, then clear)
- clear all BTS records
Draining would require the user to allocate a buffer to hold the data,
which might not be feasible when he is near his memory limit. He could
fall back to looping over the single-entry get. It is questionable, how
useful the drain ptrace command would actually be; we might want to
replace it with a get range command.
Are you OK with this?
thanks and regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
* Metzger, Markus T <[email protected]> wrote:
> The ptrace API would allow the user to:
> - define (and query) the overflow mechanism
> (wrap-around or event)
> - define (and query) the size of the buffer within certain limits
> (we could either give an error or cut off)
> - define (and query) events to be monitored
> (last branch trace, scheduling timestamps)
> - get a single BTS record
> - query the number of BTS records
> (to find out how big your drain buffer needs to be; it may be bigger
> than you requested)
> - drain all BTS records (copy, then clear)
> - clear all BTS records
>
> Draining would require the user to allocate a buffer to hold the data,
> which might not be feasible when he is near his memory limit. He could
> fall back to looping over the single-entry get. It is questionable,
> how useful the drain ptrace command would actually be; we might want
> to replace it with a get range command.
>
> Are you OK with this?
this sounds a lot more flexible to me. Please, once it looks good to all
of us also extend LTP's ptrace bits with unit tests for these API
additions. (Cc: such LTP bits to [email protected])
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Donnerstag, 13. Dezember 2007 14:08
>> Are you OK with this?
>
>this sounds a lot more flexible to me. Please, once it looks
>good to all
I will implement the ptrace ABI and the man pages and send them out for
review (I'll return -EOPNOTSUPP a few times) tomorrow.
I'll add support for the overflow signal in a later patch, once I
figured out how to do it.
>of us also extend LTP's ptrace bits with unit tests for these API
>additions. (Cc: such LTP bits to [email protected])
I need to look into this. This will take some time, as well.
regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.