From: Steven Rostedt <[email protected]>
When the function tracer starts modifying the code via breakpoints
it sets a variable (modifying_ftrace_code) to inform the breakpoint
handler to call the ftrace int3 code.
But there's no synchronization between setting this code and the
handler, thus it is possible for the handler to be called on another
CPU before it sees the variable. This will cause a kernel crash as
the int3 handler will not know what to do with it.
I originally added smp_mb()'s to force the visibility of the variable
but H. Peter Anvin suggested that I just make it atomic.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/include/asm/ftrace.h | 2 +-
arch/x86/kernel/ftrace.c | 6 +++---
arch/x86/kernel/traps.c | 3 ++-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 18d9005..b0767bc 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -34,7 +34,7 @@
#ifndef __ASSEMBLY__
extern void mcount(void);
-extern int modifying_ftrace_code;
+extern atomic_t modifying_ftrace_code;
static inline unsigned long ftrace_call_adjust(unsigned long addr)
{
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 32ff365..ea9904f 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -168,7 +168,7 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return ret;
}
-int modifying_ftrace_code __read_mostly;
+atomic_t modifying_ftrace_code __read_mostly;
/*
* A breakpoint was added to the code address we are about to
@@ -491,11 +491,11 @@ void ftrace_replace_code(int enable)
void arch_ftrace_update_code(int command)
{
- modifying_ftrace_code++;
+ atomic_inc(&modifying_ftrace_code);
ftrace_modify_all_code(command);
- modifying_ftrace_code--;
+ atomic_dec(&modifying_ftrace_code);
}
int __init ftrace_dyn_arch_init(void *data)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ff08457..32443ae 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -304,7 +304,8 @@ dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_co
{
#ifdef CONFIG_DYNAMIC_FTRACE
/* ftrace must be first, everything else may cause a recursive crash */
- if (unlikely(modifying_ftrace_code) && ftrace_int3_handler(regs))
+ if (unlikely(atomic_read(&modifying_ftrace_code)) &&
+ ftrace_int3_handler(regs))
return;
#endif
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
--
1.7.10
On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> From: Steven Rostedt <[email protected]>
>
> When the function tracer starts modifying the code via breakpoints
> it sets a variable (modifying_ftrace_code) to inform the breakpoint
> handler to call the ftrace int3 code.
>
> But there's no synchronization between setting this code and the
> handler, thus it is possible for the handler to be called on another
> CPU before it sees the variable. This will cause a kernel crash as
> the int3 handler will not know what to do with it.
>
> I originally added smp_mb()'s to force the visibility of the variable
> but H. Peter Anvin suggested that I just make it atomic.
Uhm,. maybe. atomic_{inc,dec}() implies a full memory barrier on x86,
but atomic_read() never has the smp_rmb() required.
Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.
So this should mostly work, but yuck.
Also, why does this stuff live in ftrace? I always thought you were
going to replace text_poke() so everybody that uses cross-modifying code
could profit?
On Thu, 2012-05-31 at 13:06 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <[email protected]>
> >
> > When the function tracer starts modifying the code via breakpoints
> > it sets a variable (modifying_ftrace_code) to inform the breakpoint
> > handler to call the ftrace int3 code.
> >
> > But there's no synchronization between setting this code and the
> > handler, thus it is possible for the handler to be called on another
> > CPU before it sees the variable. This will cause a kernel crash as
> > the int3 handler will not know what to do with it.
> >
> > I originally added smp_mb()'s to force the visibility of the variable
> > but H. Peter Anvin suggested that I just make it atomic.
>
> Uhm,. maybe. atomic_{inc,dec}() implies a full memory barrier on x86,
Yeah, I believe (and H. Peter can correct me) that this is all that's
required for x86.
> but atomic_read() never has the smp_rmb() required.
>
> Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.
No rmb() is required, as that's supplied by the breakpoint itself.
Basically, rmb() is used for ordering:
load(A);
rmb();
loab(B);
To keep the machine from actually doing:
load(B);
load(A);
But what this is:
<breakpoint>
|
+---------> <handler>
|
load(A);
We need the load(A) to be after the breakpoint. Is it possible for the
machine to do it before?:
load(A)
|
|
<breakpoint>
+----------> test(A)
??
If another breakpoint is hit (one other than one put in by ftrace) then
we don't care. It wont crash the system whether or not A is 1 or 0. We
just need to make sure that a ftrace breakpoint that is hit knows that
it was a ftrace breakpoint (calls the ftrace handler). No other
breakpoint should be on a ftrace nop anyway.
>
> So this should mostly work, but yuck.
It should work, not just mostly.
>
>
> Also, why does this stuff live in ftrace? I always thought you were
> going to replace text_poke() so everybody that uses cross-modifying code
> could profit?
I discussed this with Masami at Collaboration Summit. The two are
similar but also very different. But we want to start merging the two
together where it makes sense.
Currently text_poke maps 2 pages into a FIXMAP area to make the
modification so that it does not need to change the protection of the
kernel (from ro to rw). Ftrace changes the text protection while it
makes the update then reverts it back when done. The reason for the
difference is that text_poke only handles 1 change at a time. There is a
batch mode, but it does the mapping 1 at a time even for that.
Now ftrace must change 30,000 locations at once!
# cat /debug/tracing/available_filter_functions | wc -l
29467
Could you imagine ftrace mapping 30,000 pages one at a time to do the
update?
Also, the text_poke() batch mode requires allocation of an array. This
could cause failures to start function tracing to allocate 30,000
structures. Ftrace has its own tight structure that's done in blocks
(the struct dyn_ftrace), that is (on x86_64) 16 byte items. This is
allocated at boot up, and module load, and is persistent throughout the
life of the system uptime (or module loaded). You can see it in the
dmesg:
ftrace: allocating 20503 entries in 81 pages
- note this is just core code, before modules are added
The dyn_ftrace has a pointer to the mcount location, and a flags field.
The flags tells the update code what to change.
Now Masami said he thinks the text_poke should also do the change of
kernel protection as well, and not do the FIXMAP mapping. At least for
bulk changes. The added breakpoint code that we have can be shared
between ftrace and text_poke. That is something we want to do. After we
get ftrace working, we'll go ahead and make it work for text_poke as
well ;-)
-- Steve
(2012/05/31 23:08), Steven Rostedt wrote:
> Now Masami said he thinks the text_poke should also do the change of
> kernel protection as well, and not do the FIXMAP mapping. At least for
> bulk changes. The added breakpoint code that we have can be shared
> between ftrace and text_poke. That is something we want to do. After we
> get ftrace working, we'll go ahead and make it work for text_poke as
> well ;-)
Indeed, ftrace can be a better test case for breakpoint-based self-
modifying. So after we can make sure that is safely work on x86
widely, it is better to rewrite text_poke with that method.
Perhaps, kprobes-*jump*-optimization may be better to handle it
because the target probe is simultaneously working while
optimizing (modifying code). This means that if someone hits
breakpoint of such kprobe, it must be handled by kprobes, not
only just tweaking IP address.
Thank you,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, 2012-06-01 at 00:16 +0900, Masami Hiramatsu wrote:
> Perhaps, kprobes-*jump*-optimization may be better to handle it
> because the target probe is simultaneously working while
> optimizing (modifying code). This means that if someone hits
> breakpoint of such kprobe, it must be handled by kprobes, not
> only just tweaking IP address.
Yeah, ftrace is easier on what the breakpoint does. For kprobes, it's a
bit more invasive. Ftrace only deals with a nop, or a "trace-me" call.
As the modifications are being done, we always treat it as a nop until
the modification is complete.
With kprobes, the text_poke() is a bit more difficult, because it can't
treat the location as a nop, as what is changing actually performs some
task for the kernel. It requires that the breakpoint know what to do
with the text (out-of-line execution, and such). ftrace has the benefit
of just "skip this instruction" and return.
-- Steve
On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 13:06 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> > > From: Steven Rostedt <[email protected]>
> > >
> > > When the function tracer starts modifying the code via breakpoints
> > > it sets a variable (modifying_ftrace_code) to inform the breakpoint
> > > handler to call the ftrace int3 code.
> > >
> > > But there's no synchronization between setting this code and the
> > > handler, thus it is possible for the handler to be called on another
> > > CPU before it sees the variable. This will cause a kernel crash as
> > > the int3 handler will not know what to do with it.
> > >
> > > I originally added smp_mb()'s to force the visibility of the variable
> > > but H. Peter Anvin suggested that I just make it atomic.
> >
> > Uhm,. maybe. atomic_{inc,dec}() implies a full memory barrier on x86,
>
> Yeah, I believe (and H. Peter can correct me) that this is all that's
> required for x86.
>
> > but atomic_read() never has the smp_rmb() required.
> >
> > Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.
>
> No rmb() is required, as that's supplied by the breakpoint itself.
> Basically, rmb() is used for ordering:
>
> load(A);
> rmb();
> loab(B);
>
> To keep the machine from actually doing:
>
> load(B);
> load(A);
I know what rmb is for.. I also know you need to pair barriers. Hiding
them in atomic doesn't make the ordering any more obvious.
> But what this is:
>
> <breakpoint>
> |
> +---------> <handler>
> |
> load(A);
>
> We need the load(A) to be after the breakpoint. Is it possible for the
> machine to do it before?:
>
> load(A)
> |
> |
> <breakpoint>
> +----------> test(A)
I don't know, nor did you explain the implicit ordering there. Also in
such diagrams you need the other side as well.
> If another breakpoint is hit (one other than one put in by ftrace) then
> we don't care. It wont crash the system whether or not A is 1 or 0. We
> just need to make sure that a ftrace breakpoint that is hit knows that
> it was a ftrace breakpoint (calls the ftrace handler). No other
> breakpoint should be on a ftrace nop anyway.
So the ordering is like:
---
CPU-0 CPU-1
lock inc mod-count /* implicit (w)mb */
write int3
<trap-int3> /* implicit (r)mb */
load mod-count
sync-ipi-broadcast
write rest-of-instruction
sync-ipi-broadcast
write head-of-instruction
sync-ipi-broadcast
lock dec mod-count /* implicit (w)mb */
Such that when we observe the int3 on CPU-1 we also must see the
increment on mod-count.
---
A simple something like the above makes it very clear what we're doing
and what we're expecting. I think a (local) trap should imply a barrier
of sorts but will have to defer to others (hpa?) to confirm. But at the
very least write it down someplace that you are assuming that.
fwiw run_sync() could do with a much bigger comment on why its sane to
enable interrupts.. That simply reeks, enabling interrupts too early can
wreck stuff properly.
On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > Also, why does this stuff live in ftrace? I always thought you were
> > going to replace text_poke() so everybody that uses cross-modifying code
> > could profit?
>
> I discussed this with Masami at Collaboration Summit. The two are
> similar but also very different. But we want to start merging the two
> together where it makes sense.
Argh,. I so disagree. You're doing it backwards.
First you merge whatever is there, regardless of who came first.
Then, when everybody doing text modification is using the same
interface, do a second implementation using a Kconfig knob. If the scary
new one breaks, no sweat, flip the config. If its proven stable, kill
off the old one.
I really don't see why ftrace would be special here.. if you have all
text_poke() users use the magic new way you'll have more coverage and
better chances of hitting any snags if there are any.
On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > Also, why does this stuff live in ftrace? I always thought you were
> > > going to replace text_poke() so everybody that uses cross-modifying code
> > > could profit?
> >
> > I discussed this with Masami at Collaboration Summit. The two are
> > similar but also very different. But we want to start merging the two
> > together where it makes sense.
>
> Argh,. I so disagree. You're doing it backwards.
>
> First you merge whatever is there, regardless of who came first.
The thing is, that's useful even if the new stuff never works. Having
two code-bases doing cross-modifying code in different ways just doesn't
sound right. That stuff is tricky enough as it is.
On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > Also, why does this stuff live in ftrace? I always thought you were
> > > going to replace text_poke() so everybody that uses cross-modifying code
> > > could profit?
> >
> > I discussed this with Masami at Collaboration Summit. The two are
> > similar but also very different. But we want to start merging the two
> > together where it makes sense.
>
> Argh,. I so disagree. You're doing it backwards.
>
> First you merge whatever is there, regardless of who came first.
The comment about coming first was more about 're-inventing' then about
merging. You can't reinvent something that didn't exist.
That said, I didn't even think about text poke while doing this. I was
just simply thinking about removing stop_machine from ftrace, that
required this. It was only a after thought that text_poke() could do the
same. And this came up at Collab, where I thought, oh yeah! we can
incorporate this with text poke.
> Then, when everybody doing text modification is using the same
> interface, do a second implementation using a Kconfig knob. If the scary
> new one breaks, no sweat, flip the config. If its proven stable, kill
> off the old one.
What do you suggest then? To revert the code and rewrite it so that
text_poke() does a similar thing?
-- Steve
On Thu, 2012-05-31 at 13:53 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> > On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > > Also, why does this stuff live in ftrace? I always thought you were
> > > > going to replace text_poke() so everybody that uses cross-modifying code
> > > > could profit?
> > >
> > > I discussed this with Masami at Collaboration Summit. The two are
> > > similar but also very different. But we want to start merging the two
> > > together where it makes sense.
> >
> > Argh,. I so disagree. You're doing it backwards.
> >
> > First you merge whatever is there, regardless of who came first.
>
> The comment about coming first was more about 're-inventing' then about
> merging. You can't reinvent something that didn't exist.
>
> That said, I didn't even think about text poke while doing this.
Well, the fail is before that, how could we grow two pieces of code
doing similar things in the first place?
> I was
> just simply thinking about removing stop_machine from ftrace, that
> required this. It was only a after thought that text_poke() could do the
> same. And this came up at Collab, where I thought, oh yeah! we can
> incorporate this with text poke.
But but but but.. the thing far back when Mathieu proposed the int3
scheme it was text_poke().. how.. did you not think of it this time!?
>
> > Then, when everybody doing text modification is using the same
> > interface, do a second implementation using a Kconfig knob. If the scary
> > new one breaks, no sweat, flip the config. If its proven stable, kill
> > off the old one.
>
> What do you suggest then? To revert the code and rewrite it so that
> text_poke() does a similar thing?
Too late for that now I guess.. I just wonder why you all thought it was
a good idea to have two pieces of code doing cross-modifying-code. I
always assumed ftrace used text_poke().
I hardly ever use dyn-ftrace but I do use some text_poke() through
jump_labels.
I would still like to end up with one code base doing CMC with two
implementations depending on a Kconfig knob.
On Thu, 2012-05-31 at 19:28 +0200, Peter Zijlstra wrote:
> I don't know, nor did you explain the implicit ordering there. Also in
> such diagrams you need the other side as well.
Right, I forgot to add that.
>
> > If another breakpoint is hit (one other than one put in by ftrace) then
> > we don't care. It wont crash the system whether or not A is 1 or 0. We
> > just need to make sure that a ftrace breakpoint that is hit knows that
> > it was a ftrace breakpoint (calls the ftrace handler). No other
> > breakpoint should be on a ftrace nop anyway.
>
> So the ordering is like:
>
> ---
>
> CPU-0 CPU-1
>
>
> lock inc mod-count /* implicit (w)mb */
> write int3
> <trap-int3> /* implicit (r)mb */
> load mod-count
>
> sync-ipi-broadcast
> write rest-of-instruction
> sync-ipi-broadcast
> write head-of-instruction
> sync-ipi-broadcast
> lock dec mod-count /* implicit (w)mb */
>
>
> Such that when we observe the int3 on CPU-1 we also must see the
> increment on mod-count.
>
> ---
>
> A simple something like the above makes it very clear what we're doing
> and what we're expecting.
You are obviously better at drawing such diagrams than I am. Which also
explains why I needed to do the lockdep annotations to understand them
better. Because I don't have your ability to visualize it in such a nice
simple diagram. My vision is more of a bayesian tree than an oak. I
guess that explains why I like complex code so much :-)
> I think a (local) trap should imply a barrier
> of sorts but will have to defer to others (hpa?) to confirm.
I would think it does, but lets have hpa or others confirm.
> But at the
> very least write it down someplace that you are assuming that.
I'll cut and paste this into the comments.
>
>
>
> fwiw run_sync() could do with a much bigger comment on why its sane to
> enable interrupts.. That simply reeks, enabling interrupts too early can
> wreck stuff properly.
Hmm, actually, I think these patches avoid the need for run_sync() at
boot up now anyway. Thus it shouldn't need to enable interrupts.
The ftrace.c code was what disabled interrupts on start up (and
re-enables them) before calling into this, this is because the old way
needed interrupts disabled, and other archs still use the old way. But,
your point stands, it should be commented better.
-- Steve
On Thu, 2012-05-31 at 20:03 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 13:53 -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> > > On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > > > Also, why does this stuff live in ftrace? I always thought you were
> > > > > going to replace text_poke() so everybody that uses cross-modifying code
> > > > > could profit?
> > > >
> > > > I discussed this with Masami at Collaboration Summit. The two are
> > > > similar but also very different. But we want to start merging the two
> > > > together where it makes sense.
> > >
> > > Argh,. I so disagree. You're doing it backwards.
> > >
> > > First you merge whatever is there, regardless of who came first.
> >
> > The comment about coming first was more about 're-inventing' then about
> > merging. You can't reinvent something that didn't exist.
> >
> > That said, I didn't even think about text poke while doing this.
>
> Well, the fail is before that, how could we grow two pieces of code
> doing similar things in the first place?
Again, ftrace is slightly different as it does 30,000 changes at once,
on top of known nops. This was done through stop_machine(), thus any
slowdown was a large hit to system performance. text_poke() took the way
of mapping a page to do the change, and Mathieu didn't want to change
that (IIRC). But now we want the two to be similar.
>
> > I was
> > just simply thinking about removing stop_machine from ftrace, that
> > required this. It was only a after thought that text_poke() could do the
> > same. And this came up at Collab, where I thought, oh yeah! we can
> > incorporate this with text poke.
>
> But but but but.. the thing far back when Mathieu proposed the int3
> scheme it was text_poke().. how.. did you not think of it this time!?
Right, but we disagreed on its implementation, and yes, the idea came
from that, but I thought text_poke() was so different that it wouldn't
apply. It wasn't until Masami suggested changing text_poke() to be more
like what ftrace does, that I really took thought into it.
>
> >
> > > Then, when everybody doing text modification is using the same
> > > interface, do a second implementation using a Kconfig knob. If the scary
> > > new one breaks, no sweat, flip the config. If its proven stable, kill
> > > off the old one.
> >
> > What do you suggest then? To revert the code and rewrite it so that
> > text_poke() does a similar thing?
>
> Too late for that now I guess.. I just wonder why you all thought it was
> a good idea to have two pieces of code doing cross-modifying-code. I
> always assumed ftrace used text_poke().
Well, it was more of two different people working on two different
things. We really didn't look too closely at each others work. It was
more a social failure than a technical one.
>
> I hardly ever use dyn-ftrace but I do use some text_poke() through
> jump_labels.
You don't use function tracer? That's dyn-ftrace.
But still, we need to keep the record as small as possible because it is
persistent throughout the life of the system running. Every location
must be recorded, and maintain a state (flags).
Text_poke() mostly grew out of the jump-label work. But yes, there's
still a lot that can be shared. The actual code modification may be.
>
> I would still like to end up with one code base doing CMC with two
> implementations depending on a Kconfig knob.
You mean keep stop_machine around?
-- Steve
On Thu, 2012-05-31 at 14:50 -0400, Steven Rostedt wrote:
> > Well, the fail is before that, how could we grow two pieces of code
> > doing similar things in the first place?
>
> Again, ftrace is slightly different as it does 30,000 changes at once,
> on top of known nops. This was done through stop_machine(), thus any
> slowdown was a large hit to system performance. text_poke() took the way
> of mapping a page to do the change, and Mathieu didn't want to change
> that (IIRC). But now we want the two to be similar.
We could give text_poke a function argument to do the actual
modification, leaving all the magic centralized.
Also, why did Mathieu insist on keeping that kmap()?
> > I hardly ever use dyn-ftrace but I do use some text_poke() through
> > jump_labels.
>
> You don't use function tracer? That's dyn-ftrace.
Not much no.. I do use trace_printk() and ftrace_dump_on_oops a lot
though.
> But still, we need to keep the record as small as possible because it is
> persistent throughout the life of the system running. Every location
> must be recorded, and maintain a state (flags).
>
> Text_poke() mostly grew out of the jump-label work. But yes, there's
> still a lot that can be shared. The actual code modification may be.
Afaicr we didn't change text_poke() for the jump-label stuff, except in
trivial ways (added a #ifdef and exposed a function etc..).
> > I would still like to end up with one code base doing CMC with two
> > implementations depending on a Kconfig knob.
>
> You mean keep stop_machine around?
Yeah, like have CONFIG_CMC_STOPMACHINE and CONFIG_CMC_FANCY for a little
while.
If we find a problem with the fancy approach going back is easy, once
its proven stable we could remove the stop-machine one.
* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2012-05-31 at 14:50 -0400, Steven Rostedt wrote:
>
> > > Well, the fail is before that, how could we grow two pieces of code
> > > doing similar things in the first place?
> >
> > Again, ftrace is slightly different as it does 30,000 changes at once,
> > on top of known nops. This was done through stop_machine(), thus any
> > slowdown was a large hit to system performance. text_poke() took the way
> > of mapping a page to do the change, and Mathieu didn't want to change
> > that (IIRC). But now we want the two to be similar.
>
> We could give text_poke a function argument to do the actual
> modification, leaving all the magic centralized.
>
> Also, why did Mathieu insist on keeping that kmap()?
Not sure about the entire context here, but the goal of using kmap() is
to allow modification of text in configurations where the kernel text
is read-only: the kmap does a temporary shadow RW mapping that allows
modification of the text. Presumably that Ftrace's 30k changes are done
before the kernel text mapping is set to read-only ? If this is the
case, then it is similar to text_poke_early, which don't use the kmap
since it happens before kernel text gets write-protected. But text_poke
has to deal with RO pages.
Hopefully my answer makes sense in the context of your discussion.
Thanks,
Mathieu
>
> > > I hardly ever use dyn-ftrace but I do use some text_poke() through
> > > jump_labels.
> >
> > You don't use function tracer? That's dyn-ftrace.
>
> Not much no.. I do use trace_printk() and ftrace_dump_on_oops a lot
> though.
>
> > But still, we need to keep the record as small as possible because it is
> > persistent throughout the life of the system running. Every location
> > must be recorded, and maintain a state (flags).
> >
> > Text_poke() mostly grew out of the jump-label work. But yes, there's
> > still a lot that can be shared. The actual code modification may be.
>
> Afaicr we didn't change text_poke() for the jump-label stuff, except in
> trivial ways (added a #ifdef and exposed a function etc..).
>
> > > I would still like to end up with one code base doing CMC with two
> > > implementations depending on a Kconfig knob.
> >
> > You mean keep stop_machine around?
>
> Yeah, like have CONFIG_CMC_STOPMACHINE and CONFIG_CMC_FANCY for a little
> while.
>
> If we find a problem with the fancy approach going back is easy, once
> its proven stable we could remove the stop-machine one.
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On Thu, 2012-05-31 at 15:55 -0400, Mathieu Desnoyers wrote:
> > Also, why did Mathieu insist on keeping that kmap()?
>
> Not sure about the entire context here, but the goal of using kmap() is
> to allow modification of text in configurations where the kernel text
> is read-only: the kmap does a temporary shadow RW mapping that allows
> modification of the text. Presumably that Ftrace's 30k changes are done
> before the kernel text mapping is set to read-only ? If this is the
> case, then it is similar to text_poke_early, which don't use the kmap
> since it happens before kernel text gets write-protected. But text_poke
> has to deal with RO pages.
No this is also when ftrace is enabled at runtime. The trick that ftrace
does is to temporarially convert the kernel text from ro to rw, and
then back to ro when done. You can argue that this degrades the security
of the system, but tracing every function in the kernel does too ;-)
That's why it's a root only utility.
Hmm, this brings up another question. By default, perf does not allow
users to profile trace_events correct? IOW, perf does not let
unprivileged users call text_poke()? I just tried it and got the:
$ ~/bin/perf record -e sched:sched_switch sleep 1
Permission error - are you root?
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
-1 - Not paranoid at all
0 - Disallow raw tracepoint access for unpriv
1 - Disallow cpu events for unpriv
2 - Disallow kernel profiling for unpriv
>
> Hopefully my answer makes sense in the context of your discussion.
Almost,
-- Steve
On Thu, 2012-05-31 at 16:10 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 15:55 -0400, Mathieu Desnoyers wrote:
>
> > > Also, why did Mathieu insist on keeping that kmap()?
> >
> > Not sure about the entire context here, but the goal of using kmap() is
> > to allow modification of text in configurations where the kernel text
> > is read-only: the kmap does a temporary shadow RW mapping that allows
> > modification of the text. Presumably that Ftrace's 30k changes are done
> > before the kernel text mapping is set to read-only ? If this is the
> > case, then it is similar to text_poke_early, which don't use the kmap
> > since it happens before kernel text gets write-protected. But text_poke
> > has to deal with RO pages.
>
> No this is also when ftrace is enabled at runtime. The trick that ftrace
> does is to temporarially convert the kernel text from ro to rw, and
> then back to ro when done. You can argue that this degrades the security
> of the system, but tracing every function in the kernel does too ;-)
> That's why it's a root only utility.
Right, but when you loose stop-machine you could simply do 30k
kmap_atomic/kunmap_atomic's consecutively since you're not holding
anybody up.
> Hmm, this brings up another question. By default, perf does not allow
> users to profile trace_events correct? IOW, perf does not let
> unprivileged users call text_poke()? I just tried it and got the:
>
> $ ~/bin/perf record -e sched:sched_switch sleep 1
> Permission error - are you root?
> Consider tweaking /proc/sys/kernel/perf_event_paranoid:
> -1 - Not paranoid at all
> 0 - Disallow raw tracepoint access for unpriv
> 1 - Disallow cpu events for unpriv
> 2 - Disallow kernel profiling for unpriv
It would, except tools/perf does stupid, its unconditionally adding
PERF_SAMPLE_RAW (even for non-sampling events), which is the bit that
requires privs.
On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
> Right, but when you loose stop-machine you could simply do 30k
> kmap_atomic/kunmap_atomic's consecutively since you're not holding
> anybody up.
It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
blasting^Wsending to all CPUs.
-- Steve
On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
>
> > Right, but when you loose stop-machine you could simply do 30k
> > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > anybody up.
>
> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> blasting^Wsending to all CPUs.
>
Note, currently ftrace adds breakpoints to all locations, sends out the
IPIs, modifies the code, sends out the IPIs and then removes the
breakpoint, sends out the IPIs.
Just three IPIs for those 30,000 updates.
-- Steve
On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
>
> > Right, but when you loose stop-machine you could simply do 30k
> > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > anybody up.
>
> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> blasting^Wsending to all CPUs.
Uhm, no.
for_each() {
kmap_atomic()
frob int3
kunmap_atomic();
}
sync-ipi-broadcast();
for_each() {
kmap_atomic();
frob tail
kunmap_atomic();
}
sync-ipi-broadcast();
for_each() {
kmap_atomic();
frob head
kunmap_atomic();
};
sync-ipi-broadcast();
How is that sending 90k ipis?
On Thu, 2012-05-31 at 22:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
> >
> > > Right, but when you loose stop-machine you could simply do 30k
> > > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > > anybody up.
> >
> > It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> > blasting^Wsending to all CPUs.
>
> Uhm, no.
>
----------------------------------+
> for_each() { |
> kmap_atomic() |
> frob int3 |
> kunmap_atomic(); |
> } |
> |
> sync-ipi-broadcast(); +--- Break points applied
> |
> for_each() { |
> kmap_atomic(); |
> frob tail |
> kunmap_atomic(); |
> } |
----------------------------------+
Note, for the above time, the entire kernel has breakpoints added, and
every function is taking a hit due to it. By slowing down this process,
the rest of the system *will* be impacted. Ideally, we want to finish it
as quick as possible. Not to mention, the kmap_atomics will be slowed
down by the breakpoints that are attached to them.
-- Steve
>
> sync-ipi-broadcast();
>
> for_each() {
> kmap_atomic();
> frob head
> kunmap_atomic();
> };
>
> sync-ipi-broadcast();
>
> How is that sending 90k ipis?
Em Thu, May 31, 2012 at 10:26:09PM +0200, Peter Zijlstra escreveu:
> > Hmm, this brings up another question. By default, perf does not allow
> > users to profile trace_events correct? IOW, perf does not let
> > unprivileged users call text_poke()? I just tried it and got the:
> >
> > $ ~/bin/perf record -e sched:sched_switch sleep 1
> > Permission error - are you root?
> > Consider tweaking /proc/sys/kernel/perf_event_paranoid:
> > -1 - Not paranoid at all
> > 0 - Disallow raw tracepoint access for unpriv
> > 1 - Disallow cpu events for unpriv
> > 2 - Disallow kernel profiling for unpriv
>
> It would, except tools/perf does stupid, its unconditionally adding
> PERF_SAMPLE_RAW (even for non-sampling events), which is the bit that
> requires privs.
Looserland, bah, I'll check that, thanks.
- Arnaldo
(2012/06/01 5:49), Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:40 +0200, Peter Zijlstra wrote:
>> On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
>>> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
>>>
>>>> Right, but when you loose stop-machine you could simply do 30k
>>>> kmap_atomic/kunmap_atomic's consecutively since you're not holding
>>>> anybody up.
>>>
>>> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
>>> blasting^Wsending to all CPUs.
>>
>> Uhm, no.
>>
>
> ----------------------------------+
>> for_each() { |
>> kmap_atomic() |
>> frob int3 |
>> kunmap_atomic(); |
>> } |
>> |
>> sync-ipi-broadcast(); +--- Break points applied
>> |
>> for_each() { |
>> kmap_atomic(); |
>> frob tail |
>> kunmap_atomic(); |
>> } |
> ----------------------------------+
>
> Note, for the above time, the entire kernel has breakpoints added, and
> every function is taking a hit due to it. By slowing down this process,
> the rest of the system *will* be impacted. Ideally, we want to finish it
> as quick as possible. Not to mention, the kmap_atomics will be slowed
> down by the breakpoints that are attached to them.
Hmm, why don't we have two text_poke interfaces for single and
batch? As like dyn_ftrace, since modifying massive points takes
a lot time, so we may have additional kconfig something like
CONFIG_QUICK_BATCH_TEXT_POKE which switches text area to rw while
batch text_poke.
Thank you,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, 2012-06-01 at 13:53 +0900, Masami Hiramatsu wrote:
> Hmm, why don't we have two text_poke interfaces for single and
> batch? As like dyn_ftrace, since modifying massive points takes
> a lot time, so we may have additional kconfig something like
> CONFIG_QUICK_BATCH_TEXT_POKE which switches text area to rw while
> batch text_poke.
>
I'll be working on patches to consolidate the two after I get everything
else working :-) I still need to work on the ftrace kprobe stuff too.
I hate having a config option to switch between the two, except for
something that can be there if we find the new approach is buggy (like
we have with lockdep). That may be a solution for this if we don't agree
on one now. That is, bring back stop_machine() when LOCKDEP is enabled.
But that should only be a temporary work around not a true fix.
I have no problem with having most of the modifying code be shared
between text_poke and ftrace. I guess the question is, do we want to do
it only with the FIXMAP or do we want text_poke and ftrace to use the rw
method for large batches. Heck, we can set a limit. If we are going to
update more that 100 locations, we switch the kernel to rw, otherwise we
do the FIXMAP update.
-- Steve
(2012/06/01 20:37), Steven Rostedt wrote:
> On Fri, 2012-06-01 at 13:53 +0900, Masami Hiramatsu wrote:
>
>> Hmm, why don't we have two text_poke interfaces for single and
>> batch? As like dyn_ftrace, since modifying massive points takes
>> a lot time, so we may have additional kconfig something like
>> CONFIG_QUICK_BATCH_TEXT_POKE which switches text area to rw while
>> batch text_poke.
>>
>
> I'll be working on patches to consolidate the two after I get everything
> else working :-) I still need to work on the ftrace kprobe stuff too.
>
> I hate having a config option to switch between the two, except for
> something that can be there if we find the new approach is buggy (like
> we have with lockdep). That may be a solution for this if we don't agree
> on one now. That is, bring back stop_machine() when LOCKDEP is enabled.
> But that should only be a temporary work around not a true fix.
>
> I have no problem with having most of the modifying code be shared
> between text_poke and ftrace. I guess the question is, do we want to do
> it only with the FIXMAP or do we want text_poke and ftrace to use the rw
> method for large batches. Heck, we can set a limit. If we are going to
> update more that 100 locations, we switch the kernel to rw, otherwise we
> do the FIXMAP update.
That's reasonable for me :). You can feel free to change/remove
text_poke_smp_batch which is already old-style interface.
Thanks!
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]