> >
> > If I misunderstand something, please let me know.
> >
>
> Quite.
>
These functions are being invoked from the IDT, which is an indirect pointer structure. When not being traced, there is absolutely no
reason why they should go through a thunk with tracepoints.
I agree that the cost can be absolutely zero by switching each interrupt handler when turning the tracepoint on/off.
But I would like to talk more about the time penalty of a tracepoint.
When not being traced, a tracepoint is just a nop, as described in the documentation below.
So tracepoints seem to be designed to be added even to performance-critical paths.
Documentation/trace/tracepoints.txt
<snip>
* Purpose of tracepoints
A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adds a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).

You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, which
prototypes are described in a tracepoint declaration placed in a header
file.
<snip>
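For reference, a disabled tracepoint compiles down to a single
test-and-branch, and with jump labels to a nop that only gets patched
into a jump when a probe is registered. The trace_*() inline expands to
roughly the following (a simplified sketch, not the exact kernel macro;
the slow-path helper name is made up):

static inline void trace_local_timer_entry(int vector)
{
        /* with CONFIG_JUMP_LABEL this test compiles to a nop that is
         * live-patched into a jump when a probe gets registered */
        if (static_key_false(&__tracepoint_local_timer_entry.key))
                __do_trace_local_timer_entry(vector); /* call probes */
}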
Also, as the actual latency measurements I submitted show, the time penalty is almost zero.
<snip>
(1-1) local_timer_entry
- 3.6-rc6 original
<...>-2788 2dNh. 894834337us : exit_idle <-smp_apic_timer_interrupt
<...>-2788 2dNh. 894834337us : hrtimer_interrupt <-smp_apic_timer_interrupt
- 3.6-rc6 + this patch + trace off
<...>-1981 0d.h. 210538us : exit_idle <-smp_apic_timer_interrupt
<...>-1981 0d.h. 210538us : hrtimer_interrupt <-smp_apic_timer_interrupt
<snip>
When switching each interrupt handler, the switch has to run on every
CPU with interrupts disabled, via smp_call_function() or something like
that.
IMO, rather than doing such a complex thing, just adding a tracepoint is
reasonable.
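For example (a rough sketch, using on_each_cpu(), which is
smp_call_function() plus the local CPU; trace_idt_descr is a
hypothetical descriptor for a prebuilt trace IDT):

static void switch_idt(void *idt_descr)
{
        /* runs on each CPU with interrupts disabled */
        load_idt((struct desc_ptr *)idt_descr);
}

/* broadcast the IDT switch to all CPUs and wait for completion */
on_each_cpu(switch_idt, &trace_idt_descr, 1);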
What do you think?
Seiji
On Tue, 2012-10-02 at 19:10 +0000, Seiji Aguchi wrote:
> > >
> > > If I misunderstand something, please let me know.
> > >
> >
> > Quite.
> >
> > These functions are being invoked from the IDT, which is an indirect pointer structure. When not being traced, there is absolutely no
> > reason why they should go through a thunk with tracepoints.
>
> I agree that the cost can be absolutely zero by switching each interrupt handler when turning the tracepoint on/off.
>
Peter,
I agree that the IDT version is zero cost in performance, whereas the
tracepoint version is a negligible cost in performance. But my worry is:
is the complexity (read: error prone, and possibly opening security
exploits) worth it?
Switching the IDT is not that trivial, and making it something done for
common activities, such as reading tracepoints with tools like ftrace and
perf that do it often, even on production machines, may lead to issues
if it's not done right.
You are the maintainer and are responsible for the outcome of changes to
the x86 arch, thus you do have final say. And if you think there's
nothing to worry about with an IDT change then Seiji should implement
it.
I just want to point out some possible repercussions of doing it in a
more complex way. As tracepoints use nops, and I may push to out-of-line
the unlikely tracepoint path even more, I'm not sure the complexity is
worth the amount of savings compared to just adding the tracepoint in
the code.
-- Steve
On 10/05/2012 07:13 AM, Steven Rostedt wrote:
>
> Peter,
>
> I agree that the IDT version is zero cost in performance, whereas the
> tracepoint version is a negligible cost in performance. But my worry is:
> is the complexity (read: error prone, and possibly opening security
> exploits) worth it?
>
> Switching the IDT is not that trivial, and making it something done for
> common activities, such as reading tracepoints with tools like ftrace and
> perf that do it often, even on production machines, may lead to issues
> if it's not done right.
>
It's a table of pointers... there really isn't anything magic about it
(except perhaps the slightly weird format.)
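For the curious, the "slightly weird" part on x86-64 is that each
16-byte gate splits the handler address across three fields. A sketch
of the layout, with illustrative field names (using kernel u16/u32
types):

struct idt_gate64 {
        u16 offset_low;         /* handler address bits 0..15 */
        u16 segment;            /* code segment selector */
        unsigned ist : 3, zero : 5, type : 5, dpl : 2, p : 1;
        u16 offset_middle;      /* handler address bits 16..31 */
        u32 offset_high;        /* handler address bits 32..63 */
        u32 reserved;
} __attribute__((packed));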
> You are the maintainer and are responsible for the outcome of changes to
> the x86 arch, thus you do have final say. And if you think there's
> nothing to worry about with an IDT change then Seiji should implement
> it.
>
> I just want to point out some possible repercussions of doing it in a
> more complex way. As tracepoints use nops, and I may push to out-of-line
> the unlikely tracepoint path even more, I'm not sure the complexity is
> worth the amount of savings compared to just adding the tracepoint in
> the code.
The problem I'm seeing is the constant "oh, just a little bit more." My
experience over the years is that there is always demand for "just one
more debug feature", each of which has negligible cost, because they always
use the previous thing as a baseline... no one ever looks at the grand
total cost of the package (and by the time that happens, it is too late.)
tracepoints in particular are something I'm getting concerned about,
because they are one of those things that turn kernel internals into an
ABI, which means misdesigned tracepoints can actually make kernel
internal changes very hard to do. The cost of those kinds of issues goes
up over time, as the strain between where we'd like the code to be and
where the code is increases.
-hpa
On Fri, 2012-10-05 at 17:16 -0700, H. Peter Anvin wrote:
> On 10/05/2012 07:13 AM, Steven Rostedt wrote:
> >
> > Peter,
> >
> > I agree that the IDT version is zero cost in performance, whereas the
> > tracepoint version is a negligible cost in performance. But my worry is:
> > is the complexity (read: error prone, and possibly opening security
> > exploits) worth it?
> >
> > Switching the IDT is not that trivial, and making it something done for
> > common activities, such as reading tracepoints with tools like ftrace and
> > perf that do it often, even on production machines, may lead to issues
> > if it's not done right.
> >
>
> It's a table of pointers... there really isn't anything magic about it
> (except perhaps the slightly weird format.)
I didn't say it was magic, but it is a table of pointers that is very
critical to the running system. Should we implement it with a single
switch, like we discussed doing in San Diego with the system call table?
That is, have a "normal" table, and a "trace" table. The trace table
points to functions that have tracepoints. The first enabler of tracing
switches the table to use the tracepoints, and the last disabler
switches it back?
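Something like this, I imagine (a sketch only; reusing a switch_idt()
helper like the one sketched earlier in the thread, with hypothetical
descriptors for the two tables, and ignoring the serialization a real
implementation would need around enable/disable):

static atomic_t trace_idt_users;

void trace_irq_vectors_start(void)
{
        /* first enabler installs the tracing IDT on all CPUs */
        if (atomic_inc_return(&trace_idt_users) == 1)
                on_each_cpu(switch_idt, &trace_idt_descr, 1);
}

void trace_irq_vectors_stop(void)
{
        /* last disabler restores the default IDT */
        if (atomic_dec_and_test(&trace_idt_users))
                on_each_cpu(switch_idt, &default_idt_descr, 1);
}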
>
> > You are the maintainer and are responsible for the outcome of changes to
> > the x86 arch, thus you do have final say. And if you think there's
> > nothing to worry about with an IDT change then Seiji should implement
> > it.
> >
> > I just want to point out some possible repercussions of doing it in a
> > more complex way. As tracepoints use nops, and I may push to out-of-line
> > the unlikely tracepoint path even more, I'm not sure the complexity is
> > worth the amount of savings compared to just adding the tracepoint in
> > the code.
>
> The problem I'm seeing is the constant "oh, just a little bit more." My
> experience over the years is that there is always demand for "just one
> more debug feature", each of which has negligible cost, because they always
> use the previous thing as a baseline... no one ever looks at the grand
> total cost of the package (and by the time that happens, it is too late.)
Now I can turn this back at you ;-) We can implement the simple "just
add the tracepoints in the code" first, and then later implement the
table swap version and we can say "hey! we just made the code faster!".
>
> tracepoints in particular are something I'm getting concerned about,
> because they are one of those things that turn kernel internals into an
> ABI, which means misdesigned tracepoints can actually make kernel
> internal changes very hard to do. The cost of those kinds of issues goes
> up over time, as the strain between where we'd like the code to be and
> where the code is increases.
Honestly, I'm extremely concerned about this too. In fact, I've bitched
about this so many times in the past, but it just fell on deaf ears:
http://lwn.net/Articles/412685/
http://lwn.net/Articles/415591/
http://lwn.net/Articles/416665/
http://lwn.net/Articles/416684/
-- Steve
On Fri, Oct 05, 2012 at 10:57:41PM -0400, Steven Rostedt wrote:
> > The problem I'm seeing is the constant "oh, just a little bit more." My
> > experience over the years is that there is always demand for "just one
> > more debug feature", each of which has negligible cost, because they always
> > use the previous thing as a baseline... no one ever looks at the grand
> > total cost of the package (and by the time that happens, it is too late.)
>
> Now I can turn this back at you ;-) We can implement the simple "just
> add the tracepoints in the code" first, and then later implement the
> table swap version and we can say "hey! we just made the code faster!".
I absolutely agree with hpa here - it's like he's reading my mind. The
sum of the total cost of all those features slowly but surely slows down
the kernel over time, and if we don't pay attention, we might get bogged
down with fat no matter the IPC improvements the hw guys give us. Which
are not endless, btw, in case someone wonders.
What I'm missing with all those patches on LKML is: here's a patch that
doesn't add a new feature but gives us n% improvement with this and that
workload. I wish we had more of those instead of the gazillion new
features every 3 months.
> > tracepoints in particular are something I'm getting concerned about,
> > because they are one of those things that turn kernel internals into an
> > ABI, which means misdesigned tracepoints can actually make kernel
> > internal changes very hard to do. The cost of those kinds of issues goes
> > up over time, as the strain between where we'd like the code to be and
> > where the code is increases.
>
> Honestly, I'm extremely concerned about this too. In fact, I've bitched
> about this so many times in the past, but it just fell on deaf ears:
>
> http://lwn.net/Articles/412685/
> http://lwn.net/Articles/415591/
> http://lwn.net/Articles/416665/
> http://lwn.net/Articles/416684/
Absolutely agreed too. This is why we had such a long discussion about
the RAS tracepoint format recently, for example.
Thanks.
--
Regards/Gruss,
Boris.
On Sat, 2012-10-06 at 15:05 +0200, Borislav Petkov wrote:
> What I'm missing with all those patches on LKML is: here's a patch that
> doesn't add a new feature but gives us n% improvement with this and that
> workload. I wish we had more of those instead of the gazillion new
> features every 3 months.
That would be nice too. But we can also add a patch that gives us a
negligible improvement while making things a little more complex and
opening the possibility of a security hole.
Thus my question is: is the IDT swap really worth it? I'm fine if
someone goes ahead and implements it. Heck, I'd love to implement it
myself when I have time, as it would refresh my knowledge of how Intel
archs do interrupt processing.
I'm just worried that we are adding more complexity to code for very
little gain.
I think we need to take another look at this.
1) Are the tracepoints that Seiji proposed worth adding? If not then we
can stop the discussion here.

2) Are the tracepoints done in a way that they're not going to cause
"ABI" issues? If not then we need to redesign the tracepoints.

3) If we are here, then we have tracepoints that are worth adding and
are done in a way that they should be stable for the long term. OK, how
to implement them? The question really is: should we keep it at 0% impact
when not active, via the IDT switch, or allow the negligible impact of
adding the tracepoints directly into the code and not worry about it.
a) The tracepoints are going in the code regardless. Even with a
switch we need to have a duplicate of the calls, one with and one
without the tracepoints.
b) It can be done with one big change: add the tracepoints and create
   the duplicate with and without versions for the IDT switch. Or we can
   break it into two parts: first add the tracepoints, then add the
   switch with the duplicates. I prefer the latter if we are doing the
   switch.
I expect that if we do the switch we would have something like this:
void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
{
        struct pt_regs *old_regs = set_irq_regs(regs);

        /*
         * NOTE! We'd better ACK the irq immediately,
         * because timer handling can be slow.
         */
        ack_APIC_irq();
        /*
         * update_process_times() expects us to have done irq_enter().
         * Besides, if we don't timer interrupts ignore the global
         * interrupt lock, which is the WrongThing (tm) to do.
         */
        irq_enter();
        exit_idle();
        local_apic_timer_interrupt();
        irq_exit();

        set_irq_regs(old_regs);
}
void __irq_entry trace_smp_apic_timer_interrupt(struct pt_regs *regs)
{
        struct pt_regs *old_regs = set_irq_regs(regs);

        /*
         * NOTE! We'd better ACK the irq immediately,
         * because timer handling can be slow.
         */
        ack_APIC_irq();
        /*
         * update_process_times() expects us to have done irq_enter().
         * Besides, if we don't timer interrupts ignore the global
         * interrupt lock, which is the WrongThing (tm) to do.
         */
        irq_enter();
        exit_idle();
        trace_local_timer_entry(LOCAL_TIMER_VECTOR);
        local_apic_timer_interrupt();
        trace_local_timer_exit(LOCAL_TIMER_VECTOR);
        irq_exit();

        set_irq_regs(old_regs);
}
Now we have two functions accomplishing the same task, and any change to
one must also be made to the other. Due to RCU idle, the tracepoints need
to be after exit_idle() and before irq_exit().
We could force the two to be in step by using ugly macro magic:
/* note: the "trace" argument is a full name prefix (e.g. "trace_"),
 * so the untraced variant keeps its original name */
#define APIC_TIMER_INTERRUPT(trace, trace_enter, trace_exit)           \
void __irq_entry trace##smp_apic_timer_interrupt(struct pt_regs *regs) \
{                                                                       \
        struct pt_regs *old_regs = set_irq_regs(regs);                  \
                                                                        \
        /*                                                              \
         * NOTE! We'd better ACK the irq immediately,                   \
         * because timer handling can be slow.                          \
         */                                                             \
        ack_APIC_irq();                                                 \
        /*                                                              \
         * update_process_times() expects us to have done irq_enter(). \
         * Besides, if we don't timer interrupts ignore the global      \
         * interrupt lock, which is the WrongThing (tm) to do.          \
         */                                                             \
        irq_enter();                                                    \
        exit_idle();                                                    \
        trace_enter;                                                    \
        local_apic_timer_interrupt();                                   \
        trace_exit;                                                     \
        irq_exit();                                                     \
                                                                        \
        set_irq_regs(old_regs);                                         \
}

APIC_TIMER_INTERRUPT(, , )
APIC_TIMER_INTERRUPT(trace_,
                     trace_local_timer_entry(LOCAL_TIMER_VECTOR),
                     trace_local_timer_exit(LOCAL_TIMER_VECTOR))
But I'm not sure we want to go there.
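A less ugly alternative might be to factor the shared prologue and
epilogue into helpers, so the two variants stay in step without the
token pasting (a sketch; the helper names here are made up):

static inline void entering_ack_irq(void)
{
        ack_APIC_irq();
        irq_enter();
        exit_idle();
}

static inline void exiting_irq(void)
{
        irq_exit();
}

void __irq_entry trace_smp_apic_timer_interrupt(struct pt_regs *regs)
{
        struct pt_regs *old_regs = set_irq_regs(regs);

        entering_ack_irq();
        trace_local_timer_entry(LOCAL_TIMER_VECTOR);
        local_apic_timer_interrupt();
        trace_local_timer_exit(LOCAL_TIMER_VECTOR);
        exiting_irq();
        set_irq_regs(old_regs);
}

The untraced smp_apic_timer_interrupt() would call the same helpers,
just without the two trace_*() lines.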
-- Steve
On Sat, Oct 06, 2012 at 10:51:45AM -0400, Steven Rostedt wrote:
> On Sat, 2012-10-06 at 15:05 +0200, Borislav Petkov wrote:
> > What I'm missing with all those patches on LKML is: here's a patch that
> > doesn't add a new feature but gives us n% improvement with this and that
> > workload. I wish we had more of those instead of the gazillion new
> > features every 3 months.
>
> That would be nice too. But we can also add a patch that gives us a
> negligible improvement while making things a little more complex and
> opening the possibility of a security hole.
>
> Thus my question is: is the IDT swap really worth it? I'm fine if
> someone goes ahead and implements it. Heck, I'd love to implement it
> myself when I have time, as it would refresh my knowledge of how Intel
> archs do interrupt processing.
>
> I'm just worried that we are adding more complexity to code for very
> little gain.
>
> I think we need to take another look at this.
>
> 1) Are the tracepoints that Seiji proposed worth adding? If not then we
> can stop the discussion here.
I like straight talk - it saves everybody a lot of time :-)
But seriously, I was addressing the general observation of how a lot of
new features get added because "it would be cool if we could do X and Y"
and how we're progressively getting fatter and slowing down over time.
And I like how you're giving that feature a hard look - something we
should be doing always, btw.
So http://marc.info/?l=linux-kernel&m=134827445716419 talks about how it
is good to know which cores handle IRQs and how this affects the system,
and IRQ interaction and yadda yadda... But frankly speaking, that still
doesn't give me a hard-on, I gotta say - it sounds more like a debugging
feature which people can enable, with a certain overhead, like most of
those in "Kernel hacking", but which the general public doesn't need.
So, do we really really need this or are we better off with a debugging
design where we don't care about overhead?
Hmm, I'd say make it off by default and let people who want it enable it
and go crazy.
> 2) Are the tracepoints done in a way that they're not going to cause
> "ABI" issues? If not then we need to redesign the tracepoints.
Btw, this we should be asking ourselves about *all* TPs, especially if
they're in generic code.
[ … ]
Thanks.
--
Regards/Gruss,
Boris.
On Sat, 2012-10-06 at 19:32 +0200, Borislav Petkov wrote:
> > 2) Are the tracepoints done in a way that they're not going to cause
> > "ABI" issues? If not then we need to redesign the tracepoints.
>
> Btw, this we should be asking ourselves about *all* TPs, especially if
> they're in generic code.
I agree, and I'm starting to think I shouldn't have given free rein
over the TPs to subsystem maintainers. That is, I should have pushed
harder to understand all tracepoints added to code, to make sure the
maintainers know that they can become an ABI.
Some maintainers don't worry about it. But I can see it coming back to
haunt them. In the end, it will hurt the maintainer of the code, which
is why I gave ownership of tracepoints to the maintainers of the code
where they live (instead of "joint" ownership). But I probably should
have been a TP cop for a while to let them understand the consequences
first.
-- Steve
On Sat, Oct 06, 2012 at 02:26:17PM -0400, Steven Rostedt wrote:
> On Sat, 2012-10-06 at 19:32 +0200, Borislav Petkov wrote:
>
> > > 2) Are the tracepoints done in a way that they're not going to cause
> > > "ABI" issues? If not then we need to redesign the tracepoints.
> >
> > Btw, this we should be asking ourselves about *all* TPs, especially if
> > they're in generic code.
>
> I agree, and I'm starting to think I shouldn't have given free rein
> over the TPs to subsystem maintainers. That is, I should have pushed
> harder to understand all tracepoints added to code, to make sure the
> maintainers know that they can become an ABI.
>
> Some maintainers don't worry about it. But I can see it coming back to
> haunt them. In the end, it will hurt the maintainer of the code, which
> is why I gave ownership of tracepoints to the maintainers of the code
> where they live (instead of "joint" ownership). But I probably should
> have been a TP cop for a while to let them understand the consequences
> first.
Yeah, even if you were the TP cop and had a shiny uniform with a badge
8-), do you think you'd have the time to review all the code adding TPs?
I think maybe it would've been better to add some text to Documentation
explaining the care with which TPs should be designed, have checkpatch warn
on all new tracepoints, hope for the best and prepare for the worst. In
addition, maybe review all TPs added to generic or arch-you-care-about
code. Maybe...
--
Regards/Gruss,
Boris.
On 10/06/2012 10:57 AM, Steven Rostedt wrote:
>
> I didn't say it was magic, but it is a table of pointers that is very
> critical to the running system. Should we implement it with a single
> switch, like we discussed doing in San Diego with the system call table?
>
> That is, have a "normal" table, and a "trace" table. The trace table
> points to functions that have tracepoints. The first enabler of tracing
> switches the table to use the tracepoints, and the last disabler
> switches it back?
>
That is certainly a reasonable implementation option. It is slightly
less usable than it is for system calls, though, because the vectors in
the IDTs are somewhat scrambled and so you can't just do an indirect
jump to the original vector content. This does get messy because you
also want to preserve registers...
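Concretely, even recovering a vector's original handler address means
reassembling it from the gate's split offset fields, along these lines
(field names as in the layout sketched earlier):

static unsigned long idt_gate_addr(const struct idt_gate64 *g)
{
        /* stitch the 64-bit handler address back together */
        return (unsigned long)g->offset_low |
               ((unsigned long)g->offset_middle << 16) |
               ((unsigned long)g->offset_high << 32);
}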
>>
>>> You are the maintainer and are responsible for the outcome of changes to
>>> the x86 arch, thus you do have final say. And if you think there's
>>> nothing to worry about with an IDT change then Seiji should implement
>>> it.
>>>
>>> I just want to point out some possible repercussions of doing it in a
>>> more complex way. As tracepoints use nops, and I may push to out-of-line
>>> the unlikely tracepoint path even more, I'm not sure the complexity is
>>> worth the amount of savings compared to just adding the tracepoint in
>>> the code.
>>
>> The problem I'm seeing is the constant "oh, just a little bit more." My
>> experience over the years is that there is always demand for "just one
>> more debug feature", each of which has negligible cost, because they always
>> use the previous thing as a baseline... no one ever looks at the grand
>> total cost of the package (and by the time that happens, it is too late.)
>
> Now I can turn this back at you ;-) We can implement the simple "just
> add the tracepoints in the code" first, and then later implement the
> table swap version and we can say "hey! we just made the code faster!".
Can we? My understanding of how tracepoints work is that they export a
bunch of data structures via debugfs (a.k.a. sploitfs -- any system with
debugfs mounted really should be considered compromised from the start)
and that they are intimately tied to how they are implemented.
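To illustrate the exported surface: every event gets a "format" file
under debugfs that user tools parse and bind to. An illustrative
excerpt only; field names, offsets and sizes vary by kernel and event:

# cat /sys/kernel/debug/tracing/events/irq_vectors/local_timer_entry/format
name: local_timer_entry
format:
        field:unsigned short common_type;  offset:0;  size:2;  signed:0;
        field:int common_pid;              offset:4;  size:4;  signed:1;

        field:int vector;                  offset:8;  size:4;  signed:0;

print fmt: "vector=%d", REC->vector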
>>
>> tracepoints in particular are something I'm getting concerned about,
>> because they are one of those things that turn kernel internals into an
>> ABI, which means misdesigned tracepoints can actually make kernel
>> internal changes very hard to do. The cost of those kinds of issues goes
>> up over time, as the strain between where we'd like the code to be and
>> where the code is increases.
>
> Honestly, I'm extremely concerned about this too. In fact, I've bitched
> about this so many times in the past, but it just fell on deaf ears:
>
> http://lwn.net/Articles/412685/
> http://lwn.net/Articles/415591/
> http://lwn.net/Articles/416665/
> http://lwn.net/Articles/416684/
>
Slightly different complaint, but in the same general vein, yes.
-hpa
> >
> > I didn't say it was magic, but it is a table of pointers that is very
> > critical to the running system. Should we implement it with a single
> > switch, like we discussed doing in San Diego with the system call table?
> >
> > That is, have a "normal" table, and a "trace" table. The trace table
> > points to functions that have tracepoints. The first enabler of
> > tracing switches the table to use the tracepoints, and the last
> > disabler switches it back?
> >
>
> That is certainly a reasonable implementation option. It is slightly less usable than it is for system calls, though, because the vectors in
> the IDTs are somewhat scrambled and so you can't just do an indirect jump to the original vector content. This does get messy
> because you also want to preserve registers...
>
Peter, Steven,
Thank you for explaining why you think the time penalty should be zero,
and for discussing the implementation.
I will update my patch so that the time penalty is zero, and submit it shortly.
Seiji