Hi Frank,
On Tue, 19 Jan 2010 16:16:46 -0500 "Frank Ch. Eigler" <[email protected]> wrote:
>
> Having been reviewed a couple of times, and we hope being a good
> candidate for merging next time, please start pulling
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-utrace.git branch master
I have added this from today with you and utrace-devel as the contacts.
I have cc'd the wider community on this email so that people are aware
that this has been included.
> This repo contains frequent merges from Linus' tree. If you'd prefer
> a cleaner rebase-based branch to pull from, we can make one of those too.
For now it is OK, but you might like to ask Linus if he would like it
cleaned up before submission since it seems to have history right back to
2.6.29 and (as you say) lots of merges with his tree.
You should also add a commit with an entry in MAINTAINERS.
[Standard boilerplate]
Thanks for adding your subsystem tree as a participant of linux-next. As
you may know, this is not a judgment of your code. The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window.
You will need to ensure that the patches/commits in your tree/series have
been:
* submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
* posted to the relevant mailing list,
* reviewed by you (or another maintainer of your subsystem tree),
* successfully unit tested, and
* destined for the current or next Linux merge window.
Basically, this should be just what you would send to Linus (or ask him
to fetch). It is allowed to be rebased if you deem it necessary.
--
Cheers,
Stephen Rothwell
[email protected]
Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees. You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next. These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc. The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc. If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.
* Stephen Rothwell <[email protected]> wrote:
> Hi Frank,
>
> On Tue, 19 Jan 2010 16:16:46 -0500 "Frank Ch. Eigler" <[email protected]> wrote:
> >
> > Having been reviewed a couple of times, and we hope being a good
> > candidate for merging next time, please start pulling
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-utrace.git branch master
>
> I have added this from today with you and utrace-devel as the contacts.
> I have cc'd the wider community on this email so that people are aware
> that this has been included.
>
> > This repo contains frequent merges from Linus' tree. If you'd prefer
> > a cleaner rebase-based branch to pull from, we can make one of those too.
>
> For now it is OK, but you might like to ask Linus if he would like it
> cleaned up before submission since it seems to have history right back to
> 2.6.29 and (as you say) lots of merges with his tree.
>
> You should also add a commit with an entry in MAINTAINERS.
Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
etc.) can go upstream in its present form.
IMHO the far more important thing to address beyond formalities and workflow
cleanliness are the (many) technical observations and objections offered by
Peter Zijlstra on lkml. Not just the git history but also the abstractions and
concepts are messy and should be reworked IMO, and also good and working perf
events integration should be achieved, etc.
The fact that there's a well established upstream workflow for instrumentation
patches, which is being routed around by the utrace/uprobes/systemtap code
here is not a good sign in terms of reaching a good upstream solution. Let's
hope it works out well though.
Thanks,
Ingo
On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
Ingo,
> Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
> etc.) can go upstream in its present form.
Agreed, uprobes is still not upstream ready -- it was an RFC. We are
working through the comments there to get it ready for merger.
> IMHO the far more important thing to address beyond formalities and workflow
> cleanliness are the (many) technical observations and objections offered by
> Peter Zijlstra on lkml. Not just the git history but also the abstractions and
> concepts are messy and should be reworked IMO, and also good and working perf
> events integration should be achieved, etc.
I think Oleg addressed most of Peter's concerns on utrace when the
ptrace/utrace patchset was reposted.
Perf integration with uprobes will be done and discussions have started
with Masami and Frederic. There are a couple of fundamental technical
aspects (XOL vma vs. emulation; breakpoint insertion through CoW and not
through quiesce) that need resolution.
> The fact that there's a well established upstream workflow for instrumentation
> patches, which is being routed around by the utrace/uprobes/systemtap code
> here is not a good sign in terms of reaching a good upstream solution. Let's
> hope it works out well though.
Agreed.
On the other hand, having ptrace/utrace in the -next tree will give it a
lot more testing, while any outstanding technical issues are being addressed.
Stephen,
To exercise ptrace/utrace, it would be very useful if you pulled in
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-utrace.git branch utrace-ptrace
instead of 'master'.
Thanks,
Ananth
* Ananth N Mavinakayanahalli <[email protected]> wrote:
> On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
>
> Ingo,
>
> > Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
> > etc.) can go upstream in its present form.
>
> Agreed, uprobes is still not upstream ready -- it was an RFC. We are
> working through the comments there to get it ready for merger.
>
> > IMHO the far more important thing to address beyond formalities and workflow
> > cleanliness are the (many) technical observations and objections offered by
> > Peter Zijlstra on lkml. Not just the git history but also the abstractions and
> > concepts are messy and should be reworked IMO, and also good and working perf
> > events integration should be achieved, etc.
>
> I think Oleg addressed most of Peter's concerns on utrace when the
> ptrace/utrace patchset was reposted.
Peter is Cc:-ed and he might want to chime in.
> Perf integration with uprobes will be done and discussions have started with
> Masami and Frederic. There are a couple of fundamental technical aspects
> (XOL vma vs. emulation; breakpoint insertion through CoW and not through
> quiesce) that need resolution.
>
> > The fact that there's a well established upstream workflow for instrumentation
> > patches, which is being routed around by the utrace/uprobes/systemtap code
> > here is not a good sign in terms of reaching a good upstream solution. Let's
> > hope it works out well though.
>
> Agreed.
>
> On the other hand, having ptrace/utrace in the -next tree will give it a
> lot more testing, while any outstanding technical issues are being addressed.
Including experimental code that is RFC and which is not certain to go
upstream is certainly not the purpose of linux-next though.
It will cause conflicts with various other trees and increases the overhead
all around. It also causes us to trust linux-next bugreports less - as it's
not the 'next Linux' anymore. Also, there's virtually no high-level technical
review done in linux-next: the trees are implicitly trusted (because they are
pushed by maintainers), bugs and conflicts are reported but otherwise it's a
neutral tree that includes pretty much any commit indiscriminately.
If you need review and testing there's a number of trees you can get inclusion
into.
Ingo
On Wed, Jan 20, 2010 at 07:28:34AM +0100, Ingo Molnar wrote:
>
> * Ananth N Mavinakayanahalli <[email protected]> wrote:
>
> > On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
...
> > On the other hand, having ptrace/utrace in the -next tree will give it a
> > lot more testing, while any outstanding technical issues are being addressed.
>
> Including experimental code that is RFC and which is not certain to go
> upstream is certainly not the purpose of linux-next though.
OK.
> It will cause conflicts with various other trees and increases the overhead
> all around. It also causes us to trust linux-next bugreports less - as it's
> not the 'next Linux' anymore. Also, there's virtually no high-level technical
> review done in linux-next: the trees are implicitly trusted (because they are
> pushed by maintainers), bugs and conflicts are reported but otherwise it's a
> neutral tree that includes pretty much any commit indiscriminately.
>
> If you need review and testing there's a number of trees you can get inclusion
> into.
So would -tip be one of them? If so could you pull the utrace-ptrace
branch in?
Or did you intend some other tree (random-tracing)? (Though I think a
ptrace reimplementation isn't 'random'-tracing :-))
Ananth
Hi Frank,
On Wed, 20 Jan 2010 07:28:34 +0100 Ingo Molnar <[email protected]> wrote:
>
> Including experimental code that is RFC and which is not certain to go
> upstream is certainly not the purpose of linux-next though.
Ingo is correct in what he says here. See the boilerplate:
" * destined for the current or next Linux merge window.
Basically, this should be just what you would send to Linus (or ask him
to fetch)."
I will remove this tree from linux-next tomorrow and wait until it is
more ready for mainline inclusion.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/
* Ingo Molnar <[email protected]> wrote:
>
> * Ananth N Mavinakayanahalli <[email protected]> wrote:
>
> > On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
> >
> > Ingo,
> >
> > > Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
> > > etc.) can go upstream in its present form.
> >
> > Agreed, uprobes is still not upstream ready -- it was an RFC. We are
> > working through the comments there to get it ready for merger.
> >
> > > IMHO the far more important thing to address beyond formalities and workflow
> > > cleanliness are the (many) technical observations and objections offered by
> > > Peter Zijlstra on lkml. Not just the git history but also the abstractions and
> > > concepts are messy and should be reworked IMO, and also good and working perf
> > > events integration should be achieved, etc.
> >
> > I think Oleg addressed most of Peter's concerns on utrace when the
> > ptrace/utrace patchset was reposted.
>
> Peter is Cc:-ed and he might want to chime in.
>
> > Perf integration with uprobes will be done and discussions have started with
> > Masami and Frederic. There are a couple of fundamental technical aspects
> > (XOL vma vs. emulation; breakpoint insertion through CoW and not through
> > quiesce) that need resolution.
> >
> > > The fact that there's a well established upstream workflow for instrumentation
> > > patches, which is being routed around by the utrace/uprobes/systemtap code
> > > here is not a good sign in terms of reaching a good upstream solution. Let's
> > > hope it works out well though.
> >
> > Agreed.
> >
> > On the other hand, having ptrace/utrace in the -next tree will give it a
> > lot more testing, while any outstanding technical issues are being addressed.
>
> Including experimental code that is RFC and which is not certain to go
> upstream is certainly not the purpose of linux-next though.
>
> It will cause conflicts with various other trees and increases the overhead
> all around. It also causes us to trust linux-next bugreports less - as it's
> not the 'next Linux' anymore. Also, there's virtually no high-level
> technical review done in linux-next: the trees are implicitly trusted
> (because they are pushed by maintainers), bugs and conflicts are reported
> but otherwise it's a neutral tree that includes pretty much any commit
> indiscriminately.
>
> If you need review and testing there's a number of trees you can get
> inclusion into.
Btw., the utrace code has lived in -mm for quite some time - that's an
excellent route as Andrew does thorough review and testing.
If Andrew agrees with this particular tree as-is and wants these bits to live
in linux-next and have it in -mm that way then that's a fair approach
obviously and i have no objections ...
The point is to have at least one relevant maintainer request and track it and
then supervise the completion of it (which includes the resolution of all
outstanding objections) and then push it to Linus.
Ingo
On Wed, 2010-01-20 at 07:28 +0100, Ingo Molnar wrote:
> > I think Oleg addressed most of Peter's concerns on utrace when the
> > ptrace/utrace patchset was reposted.
>
> Peter is Cc:-ed and he might want to chime in.
Yeah, I'll make some time to go through the latest code again.. if only
there was a clone() for humans ;-)
On Wed, Jan 20, 2010 at 12:10:26PM +0530, Ananth N Mavinakayanahalli wrote:
> > It will cause conflicts with various other trees and increases the overhead
> > all around. It also causes us to trust linux-next bugreports less - as it's
> > not the 'next Linux' anymore. Also, there's virtually no high-level technical
> > review done in linux-next: the trees are implicitly trusted (because they are
> > pushed by maintainers), bugs and conflicts are reported but otherwise it's a
> > neutral tree that includes pretty much any commit indiscriminately.
> >
> > If you need review and testing there's a number of trees you can get inclusion
> > into.
>
> So would -tip be one of them? If so could you pull the utrace-ptrace
> branch in?
>
> Or did you intend some other tree (random-tracing)? (Though I think a
> ptrace reimplementation isn't 'random'-tracing :-))
Heh. No, this is a tree I use for, well, random tracing patches indeed,
which has extended to random tracing/perf/* patches over time.
I sometimes relay others' patches to Ingo through this tree, but this is
usually about small volumes and short-term storage: patches that
have been reviewed/acked already.
utrace/uprobe is about high volume and longer-term debate/review/maintenance
and I won't have the time to carry this.
> Ananth
Hi -
On Wed, Jan 20, 2010 at 07:28:34AM +0100, Ingo Molnar wrote:
> [...]
> > On the other hand, having ptrace/utrace in the -next tree will give it a
> > lot more testing, while any outstanding technical issues are being addressed.
>
> Including experimental code that is RFC and which is not certain to go
> upstream is certainly not the purpose of linux-next though.
> [...]
Ingo, you are mistaken. The utrace core and utrace/ptrace code were
submitted and reviewed together on lkml, and are not considered
experimental.
I know the names may be confusing, but it is unnecessary to bring up
uprobes and other RFC/experimental items. None of these are included
in the branch I pointed sfr to.
- FChE
Hi -
On Wed, Jan 20, 2010 at 05:59:59PM +1100, Stephen Rothwell wrote:
> [...]
> > Including experimental code that is RFC and which is not certain to go
> > upstream is certainly not the purpose of linux-next though.
>
> Ingo is correct in what he says here. See the boilerplate:
> [...]
> Basically, this should be just what you would send to Linus (or ask him
> to fetch)."
> I will remove this tree from linux-next tomorrow and wait until it is
> more ready for mainline inclusion.
Please reconsider. Ingo mistook what was being proposed. We request
merge/integration testing for just the set of patches posted
<http://lkml.org/lkml/2009/12/17/466>, which was in response to
peterz's earlier review comments, and none of which is labeled or
considered RFC or experimental.
Ananth was right that the utrace-ptrace git branch represents this
rather than master.
- FChE
Hi Ingo, Andrew,
On Wed, 20 Jan 2010 08:29:25 +0100 Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ananth N Mavinakayanahalli <[email protected]> wrote:
> >
> > > On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
> > >
> > > Ingo,
> > >
> > > > Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
> > > > etc.) can go upstream in its present form.
> > >
> > > Agreed, uprobes is still not upstream ready -- it was an RFC. We are
> > > working through the comments there to get it ready for merger.
> > >
> > > > IMHO the far more important thing to address beyond formalities and workflow
> > > > cleanliness are the (many) technical observations and objections offered by
> > > > Peter Zijlstra on lkml. Not just the git history but also the abstractions and
> > > > concepts are messy and should be reworked IMO, and also good and working perf
> > > > events integration should be achieved, etc.
> > >
> > > I think Oleg addressed most of Peter's concerns on utrace when the
> > > ptrace/utrace patchset was reposted.
> >
> > Peter is Cc:-ed and he might want to chime in.
> >
> > > Perf integration with uprobes will be done and discussions have started with
> > > Masami and Frederic. There are a couple of fundamental technical aspects
> > > (XOL vma vs. emulation; breakpoint insertion through CoW and not through
> > > quiesce) that need resolution.
> > >
> > > > The fact that there's a well established upstream workflow for instrumentation
> > > > patches, which is being routed around by the utrace/uprobes/systemtap code
> > > > here is not a good sign in terms of reaching a good upstream solution. Let's
> > > > hope it works out well though.
> > >
> > > Agreed.
> > >
> > > On the other hand, having ptrace/utrace in the -next tree will give it a
> > > lot more testing, while any outstanding technical issues are being addressed.
> >
> > Including experimental code that is RFC and which is not certain to go
> > upstream is certainly not the purpose of linux-next though.
> >
> > It will cause conflicts with various other trees and increases the overhead
> > all around. It also causes us to trust linux-next bugreports less - as it's
> > not the 'next Linux' anymore. Also, there's virtually no high-level
> > technical review done in linux-next: the trees are implicitly trusted
> > (because they are pushed by maintainers), bugs and conflicts are reported
> > but otherwise it's a neutral tree that includes pretty much any commit
> > indiscriminately.
> >
> > If you need review and testing there's a number of trees you can get
> > inclusion into.
>
> Btw., the utrace code has lived in -mm for quite some time - that's an
> excellent route as Andrew does thorough review and testing.
>
> If Andrew agrees with this particular tree as-is and wants these bits to live
> in linux-next and have it in -mm that way then that's a fair approach
> obviously and i have no objections ...
So, what is it to be? In or out?
Frank, please be clear as to which branch you want included (master or
utrace-ptrace). Also note that neither of those branches matches what
was posted in the sense that they both have lots of history and merges
not represented in the patches. (I assume that they do produce the same
final source tree, though).
> The point is to have at least one relevant maintainer request and track it and
> then supervise the completion of it (which includes the resolution of all
> outstanding objections) and then push it to Linus.
If we do include it, it is still possible for people to decide (when the
next merge window opens) that it is still not ready. It adds a bit of
maybe unneeded complication to linux-next, but we had the same problem in
this merge window and we have all survived. :-)
In the end, Linus is the final arbitrator of course.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/
> Frank, please be clear as to which branch you want included (master or
> utrace-ptrace). Also note that neither of those branches matches what
> was posted in the sense that they both have lots of history and merges
> not represented in the patches. (I assume that they do produce the same
> final source tree, though).
Yes, the trees do match. I certainly never expected our ancient git
history to get merged in directly upstream. I've made a new branch on:
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-utrace.git
called:
next/master
(Actually it's on master.kernel.org and the public mirror is being a little
slow as I write this.)
This starts from v2.6.33-rc4 and then has commits for the 7 patches that
Oleg posted in December. Beyond that, we've added one follow-on patch to
fix a bug Oleg just tracked down (Oleg will post that patch soon). And
I've added one more commit with a MAINTAINERS update, shown below.
You can also find the same stuff from the series file and patch files in:
http://people.redhat.com/utrace/2.6-next/
If it makes things easier for linux-next to have this git branch either
rebased or merged from a different fork point, please let me know.
Thanks,
Roland
---
[PATCH] MAINTAINERS: add utrace
This updates the ptrace entry to cover utrace too.
They are part of the same maintenance effort.
Also add the utrace mailing list.
Signed-off-by: Roland McGrath <[email protected]>
---
MAINTAINERS | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index c8f47bf..8da2a0a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4375,15 +4375,18 @@ M: Jim Paris <[email protected]>
L: [email protected]
S: Maintained
-PTRACE SUPPORT
+PTRACE AND UTRACE SUPPORT
M: Roland McGrath <[email protected]>
M: Oleg Nesterov <[email protected]>
+L: [email protected]
S: Maintained
F: include/asm-generic/syscall.h
F: include/linux/ptrace.h
F: include/linux/regset.h
F: include/linux/tracehook.h
-F: kernel/ptrace.c
+F: include/linux/utrace.h
+F: kernel/ptrace*
+F: kernel/utrace*
PVRUSB2 VIDEO4LINUX DRIVER
M: Mike Isely <[email protected]>
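
As a rough illustration of what the widened F: patterns in the MAINTAINERS
patch above would cover, here is a minimal Python sketch. It is not how
scripts/get_maintainer.pl is implemented. It assumes shell-style wildcards,
and it assumes that a pattern without a trailing slash only matches paths at
the same directory depth. The function name entry_covers and every path in
the usage loop other than kernel/ptrace.c are hypothetical, chosen only to
exercise the patterns.

```python
from fnmatch import fnmatchcase

# F: patterns of the "PTRACE AND UTRACE SUPPORT" entry after the patch.
PATTERNS = [
    "include/asm-generic/syscall.h",
    "include/linux/ptrace.h",
    "include/linux/regset.h",
    "include/linux/tracehook.h",
    "include/linux/utrace.h",
    "kernel/ptrace*",
    "kernel/utrace*",
]


def entry_covers(path, patterns=PATTERNS):
    """Return True if any F: pattern matches path.

    Assumption: a pattern without a trailing slash only matches files at
    the same directory depth, so '*' is not allowed to cross a '/'.
    """
    for pattern in patterns:
        same_depth = path.count("/") == pattern.count("/")
        if same_depth and fnmatchcase(path, pattern):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical file names (apart from kernel/ptrace.c).
    for path in ["kernel/ptrace.c", "kernel/utrace.c",
                 "kernel/ptrace-utrace.c", "kernel/sched.c"]:
        print(path, entry_covers(path))
```

Under these assumptions, replacing the single file kernel/ptrace.c with the
wildcard kernel/ptrace* means any later kernel/ptrace-something file is
covered by the same entry without another MAINTAINERS change.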
Hi Ingo, Andrew,
Any thoughts?
On Thu, 21 Jan 2010 01:38:22 +1100 Stephen Rothwell <[email protected]> wrote:
>
> On Wed, 20 Jan 2010 08:29:25 +0100 Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Ananth N Mavinakayanahalli <[email protected]> wrote:
> > >
> > > > On Wed, Jan 20, 2010 at 06:49:50AM +0100, Ingo Molnar wrote:
> > > >
> > > > Ingo,
> > > >
> > > > > Note, i'm not yet convinced that this (and the rest: uprobes and systemtap,
> > > > > etc.) can go upstream in its present form.
> > > >
> > > > Agreed, uprobes is still not upstream ready -- it was an RFC. We are
> > > > working through the comments there to get it ready for merger.
> > > >
> > > > > IMHO the far more important thing to address beyond formalities and workflow
> > > > > cleanliness are the (many) technical observations and objections offered by
> > > > > Peter Zijlstra on lkml. Not just the git history but also the abstractions and
> > > > > concepts are messy and should be reworked IMO, and also good and working perf
> > > > > events integration should be achieved, etc.
> > > >
> > > > I think Oleg addressed most of Peter's concerns on utrace when the
> > > > ptrace/utrace patchset was reposted.
> > >
> > > Peter is Cc:-ed and he might want to chime in.
> > >
> > > > Perf integration with uprobes will be done and discussions have started with
> > > > Masami and Frederic. There are a couple of fundamental technical aspects
> > > > (XOL vma vs. emulation; breakpoint insertion through CoW and not through
> > > > quiesce) that need resolution.
> > > >
> > > > > The fact that there's a well established upstream workflow for instrumentation
> > > > > patches, which is being routed around by the utrace/uprobes/systemtap code
> > > > > here is not a good sign in terms of reaching a good upstream solution. Let's
> > > > > hope it works out well though.
> > > >
> > > > Agreed.
> > > >
> > > > On the other hand, having ptrace/utrace in the -next tree will give it a
> > > > lot more testing, while any outstanding technical issues are being addressed.
> > >
> > > Including experimental code that is RFC and which is not certain to go
> > > upstream is certainly not the purpose of linux-next though.
> > >
> > > It will cause conflicts with various other trees and increases the overhead
> > > all around. It also causes us to trust linux-next bugreports less - as it's
> > > not the 'next Linux' anymore. Also, there's virtually no high-level
> > > technical review done in linux-next: the trees are implicitly trusted
> > > (because they are pushed by maintainers), bugs and conflicts are reported
> > > but otherwise it's a neutral tree that includes pretty much any commit
> > > indiscriminately.
> > >
> > > If you need review and testing there's a number of trees you can get
> > > inclusion into.
> >
> > Btw., the utrace code has lived in -mm for quite some time - that's an
> > excellent route as Andrew does thorough review and testing.
> >
> > If Andrew agrees with this particular tree as-is and wants these bits to live
> > in linux-next and have it in -mm that way then that's a fair approach
> > obviously and i have no objections ...
>
> So, what is it to be? In or out?
>
> Frank, please be clear as to which branch you want included (master or
> utrace-ptrace). Also note that neither of those branches matches what
> was posted in the sense that they both have lots of history and merges
> not represented in the patches. (I assume that they do produce the same
> final source tree, though).
>
> > The point is to have at least one relevant maintainer request and track it and
> > then supervise the completion of it (which includes the resolution of all
> > outstanding objections) and then push it to Linus.
>
> If we do include it, it is still possible for people to decide (when the
> next merge window opens) that it is still not ready. It adds a bit of
> maybe unneeded complication to linux-next, but we had the same problem in
> this merge window and we have all survived. :-)
>
> In the end, Linus is the final arbitrator of course.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/
On Fri, 22 Jan 2010 11:17:47 +1100
Stephen Rothwell <[email protected]> wrote:
> Any thoughts?
I'm nearly a week behind again and am trying to avoid thinking.
I've had an (old) version of utrace in -mm for ages and it didn't
break anything.
I still don't think I've seen a really compelling reason for merging
it. At least, I wouldn't be able to explain why we did it. But
presumably there _are_ such reasons, because it was a lot of development work.
Someone please sell this to us.
On Thu, 21 Jan 2010 16:30:04 -0800
Andrew Morton <[email protected]> wrote:
> Someone please sell this to us.
Here's what Oleg said last time I asked this:
First of all, utrace makes other things possible. gdbstub,
nondestructive core dump, uprobes, kmview, hopefully more. I didn't
look at these projects closely, perhaps other people can tell more. As
for their merge status, until utrace itself is merged it is very hard to
develop them out of tree.
To me, even seccomp is a good example of why utrace is useful. seccomp
is simple, but it needs hooks in arch/ hot paths. In contrast, a
utrace-based implementation is more flexible, simple, and it is
completely "hidden" behind utrace.
In my opinion, ptrace-utrace is another example. Once CONFIG_UTRACE
goes away, we can remove almost all ptrace-related code from core kernel
(and kill task_struct->ptrace/etc members).
ftrace/etc is excellent in many ways, but even if we only need simple
"passive" tracing it is sometimes not enough. And we have nothing else
except ptrace currently. But ptrace is so horrible and unfixable, and
it has so many limitations. In fact, even simple things like stopping/
continuing a thread/process are not trivial using ptrace; gdb/strace
have to do a lot of hacks to overcome ptrace's limitations, and some of
these hacks fall into the "mostly works, but that is all" category.
Of course, I can't promise we will have the new gdb which explores
utrace facilities soon, but I think at least utrace gives a chance.
Hi -
On Thu, Jan 21, 2010 at 04:31:45PM -0800, Andrew Morton wrote:
> [...]
> > Someone please sell this to us.
> Here's what Oleg said last time I asked this: [...]
I wonder if Roland/Oleg are being too modest in their current role as
ptrace maintainers. Considering that *they* think of utrace as a
means toward proper refactoring of ptrace, how much further burden of
proof should they shoulder? To what extent are other subsystem
maintainers required to "sell" reworkings of their areas, when there
appear to be no drawbacks and at least arguable benefits?
- FChE
On Thu, 21 Jan 2010 19:51:47 -0500 "Frank Ch. Eigler" <[email protected]> wrote:
> Hi -
>
> On Thu, Jan 21, 2010 at 04:31:45PM -0800, Andrew Morton wrote:
> > [...]
> > > Someone please sell this to us.
> > Here's what Oleg said last time I asked this: [...]
>
> I wonder if Roland/Oleg are being too modest in their current role as
> ptrace maintainers. Considering that *they* think of utrace as a
> means toward proper refactoring of ptrace, how much further burden of
> proof should they shoulder? To what extent are other subsystem
> maintainers required to "sell" reworkings of their areas, when there
> appear to be no drawbacks and at least arguable benefits?
>
ptrace is a nasty, complex part of the kernel which has a long history
of problems, but it's all been pretty quiet in there for the past few
years. This leads one to expect that a rip-out-n-rewrite is a
high-risk prospect. So, quite reasonably, one looks for a good reason
for taking such risk.
It's not really appropriate to generalise from other subsystem
maintainers' reworkings onto ptrace. It's very rare that we'd make a
change this radical to a tricky part of core kernel.
Hi -
On Thu, Jan 21, 2010 at 05:05:41PM -0800, Andrew Morton wrote:
> [...] ptrace is a nasty, complex part of the kernel which has a
> long history of problems, but it's all been pretty quiet in there
> for the past few years. This leads one to expect that a
> rip-out-n-rewrite is a high-risk prospect. So, quite reasonably,
> one looks for a good reason for taking such risk. [...]
To the extent the discussion is colored by risk avoidance, then the
answer to that would consist of code reviews, and of course a look at
the actual historical reliability of this code. While some might
enjoy reminding us about the brief kerneloops incident in 2008, let's
keep in mind that versions of this code have been deployed in Fedora
and RHEL for several *years*, with millions of users. It's not some
rickety experiment.
To the extent the discussion is colored by the new features enabled
from this refactoring, well, there is Oleg's list which may or may not
have mentioned enabling systemtap's user-space probing. More details
can be furnished on demand. Several of the use examples were
constructed in good faith upon request from the kernel community
asking for more and more. But what's enough? Who knows, really?
- FChE
On Thu, 21 Jan 2010, Andrew Morton wrote:
>
> ptrace is a nasty, complex part of the kernel which has a long history
> of problems, but it's all been pretty quiet in there for the past few
> years.
More importantly, we're not ever going to get rid of it.
Quite frankly, judging by all past history we have ever seen in kernel
interfaces, new and non-portable interfaces simply are never used. The
whole question whether they are nicer or not is entirely immaterial.
I'm personally very dubious that there are any merits to utrace that
outweigh the very clear disadvantages: just another layer that adds a new
level of abstraction to the only interface that people actually _use_,
namely ptrace.
But I haven't followed utrace. I doubt _anybody_ has, except for the
utrace people themselves.
Linus
On Thu, 21 Jan 2010, Frank Ch. Eigler wrote:
>
> To the extent the discussion is colored by the new features enabled
> from this refactoring, well, there is Oleg's list which may or may not
> have mentioned enabling systemtap's user-space probing.
Let's face it, system tap isn't going to be merged, so why even bring it
up? Every kernel developer I have _ever_ seen agrees that all the new
tracing is a million times superior. I'm sure there are system tap people
who disagree, but quite frankly, I don't see it being merged considering
how little the system tap people ever did for the kernel.
So if things like system tap and "security models that go behind the
kernel by tying into utrace" are the reasons for utrace, color me utterly
uninterested. In fact, color me actively hostile. I think that's the worst
possible situation that we'd ever be in as kernel people (namely exactly
the "do things in kernel space by hiding behind utrace without having
kernel people involved")
Linus
Hi -
On Thu, Jan 21, 2010 at 05:32:47PM -0800, Linus Torvalds wrote:
> [...]
> > To the extent the discussion is colored by the new features enabled
> > from this refactoring, well, there is Oleg's list which may or may not
> > have mentioned enabling systemtap's user-space probing.
>
> Let's face it, system tap isn't going to be merged, so why even bring it
> up?
It was certainly not meant to derail the discussion about the merits
of utrace as a useful cleanup API in its own right, but rather to be
an example of what kinds of things become straightforward in its
presence. You may be aware of nascent efforts to bring the same
uprobes infrastructure to perf.
> Every kernel developer I have _ever_ seen agrees that all the new
> tracing is a million times superior. [...]
And that is fine. We believe there is plenty of space in the problem
domain for different approaches.
> ... considering how little the system tap people ever did for the kernel.
Less passionate analysis would identify a long history of contribution
by the greater affiliated team, including via merged code and by
passing on requirements and experiences. We have been trying to
share as much as you have been willing to take. While systemtap's
current codebase may not (and need not) have a future inside the
kernel, chances are good that improvements in common infrastructure
will allow systemtap to shrink and change enough that the question
becomes moot.
- FChE
On Thu, 21 Jan 2010, Frank Ch. Eigler wrote:
>
> Less passionate analysis would identify a long history of contribution
> by the greater affiliated team, including via merged code and by
> passing on requirements and experiences.
The reason I'm so passionate is that I dislike the turn the discussion was
taking, as if "utrace" was somehow _good_ because it allowed various other
interfaces to hide behind it. And I'm not at all convinced that is true.
And I really didn't want to single out system tap, I very much feel the
same way about some seccomp-replacement "security model that the kernel
doesn't even need to know about" thing.
So don't take the systemtap part to be the important part, it's the bigger
issue of "I'd much rather have explicit interfaces than have generic hooks
that people can then use in any random way".
I realize that my argument is very anti-thetical to the normal CS teaching
of "general-purpose is good". I often feel that very specific code with
very clearly defined (and limited) applicability is a good thing - I'd
rather have just a very specific ptrace layer that does nothing but
ptrace, than a "generic plugin layer that can be layered under ptrace and
other things".
In one case, you know exactly what the users are, and what the semantics
are going to be. In the other, you don't.
So I really want to see a very big and immediate upside from utrace.
Because to me, the "it's a generic layer with any application you want to
throw at it" is a _downside_.
Linus
On Thu, Jan 21, 2010 at 05:28:42PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 21 Jan 2010, Andrew Morton wrote:
> >
> > ptrace is a nasty, complex part of the kernel which has a long history
> > of problems, but it's all been pretty quiet in there for the past few
> > years.
>
> More importantly, we're not ever going to get rid of it.
FWIW, Oleg's implementation of ptrace over utrace is 100% compatible
with legacy ptrace; gdb testsuite indicates that
(http://lkml.org/lkml/2009/12/21/98).
Ananth
On Fri, 22 Jan 2010 10:51:39 +0530, Ananth N Mavinakayanahalli said:
> FWIW, Oleg's implementation of ptrace over utrace is 100% compatible
> with legacy ptrace; gdb testsuite indicates that
> (http://lkml.org/lkml/2009/12/21/98).
No, that only proves it's compatible enough for gdb to not care. The problem
is all those *other* packages that abuse ptrace in totally crackhead ways.
(No, I can't name them - but ptrace is the sort of interface that almost
encourages its use for things somewhere between crackhead and mad-scientist,
so they're almost certainly out there.. WAY out there.. :)
On 01/21, Andrew Morton wrote:
>
> ptrace is a nasty, complex part of the kernel which has a long history
> of problems, but it's all been pretty quiet in there for the past few
> years.
Well, yes, I'd say ptrace is "frozen". Nobody adds new features/improvements,
only bugfixes.
> This leads one to expect that a rip-out-n-rewrite is a
> high-risk prospect. So, quite reasonably, one looks for a good reason
> for taking such risk.
As was already said, utrace was not created to just replace the current
ptrace.
However, speaking of ptrace, imho even ptrace-utrace is more flexible and
makes it easy to improve this API.
Just for example, even attach and detach are not trivial to use from
user-space when it comes to multithreaded tracees. A one-liner patch for
ptrace-utrace can implement a PTRACE_DETACH which doesn't need TASK_TRACED,
and it is easy to avoid the initial SIGSTOP from attach (which doesn't
always work, but strace/gdb rely on it).
Of course, I do not claim this is impossible with the current
implementation. But it would need more changes, and these changes would
touch code outside of ptrace.c. And in fact I think that any
enhancements in this area will lead to a rewrite of the current ptrace
code.
I must admit that personally I think the current ptrace API is unfixable;
we need a new one in the long term. It would be nice to just kill ptrace,
but this is not possible and that is why ptrace-utrace exists. And, if
nothing more, utrace allows us to have both the old and the new ones without
any changes outside of the ptrace/utrace code.
Oleg.
On 01/21, Linus Torvalds wrote:
>
> On Thu, 21 Jan 2010, Andrew Morton wrote:
> >
> > ptrace is a nasty, complex part of the kernel which has a long history
> > of problems, but it's all been pretty quiet in there for the past few
> > years.
>
> More importantly, we're not ever going to get rid of it.
Unfortunately, you are right. The current ptrace (as it is visible from
user-space) should stay forever.
> Quite frankly, judging by all past history we have ever seen in kernel
> interfaces, new and non-portable interfaces simply are never used. The
> whole question whether they are nicer or not is entirely immaterial.
I have to admit this point looks very reasonable to me. Except, can't
resist, ptrace itself is hardly portable.
> I'm personally very dubious that there are any merits to utrace that
> outweigh the very clear disadvantages: just another layer that adds a new
> level of abstraction to the only interface that people actually _use_,
> namely ptrace.
Of course they can't use other interfaces, we don't have them. And
without the new abstraction layer we will never have them, I think.
Oleg.
On 01/22, [email protected] wrote:
>
> No, that only proves it's compatible enough for gdb to not care. The problem
> is all those *other* packages that abuse ptrace in totally crackhead ways.
>
> (No, I can't name them - but ptrace is the sort of interface that almost
> encourages its use for things somewhere between crackhead and mad-scientist,
> so they're almost certainly out there.. WAY out there.. :)
Yes, this is true.
We are trying to test it as much as possible.
Oleg.
Hi -
oleg wrote:
> [...]
>> I'm personally very dubious that there are any merits to utrace that
>> outweigh the very clear disadvantages: just another layer that adds a new
>> level of abstraction to the only interface that people actually _use_,
>> namely ptrace.
>
> Of course they can't use other interfaces, we don't have them. And
> without the new abstraction layer we will never have them, I think.
This is one of the reasons we built, upon request of lkml people, the
utrace-gdbstub prototype (http://lkml.org/lkml/2009/11/30/173). It
presents a standard userspace debugging interface -- actually, more
standard than ptrace! It has the potential to be more powerful
feature-wise and perhaps even perform faster than ptrace. And yet
that RFC didn't receive any on-topic review, only wishes for
unspecified blue-sky integration with kernel debugging.
So then there's uprobes, which is another potential utrace "killer
app", if it weren't so tainted by some people's disdain for its
current user, when other users are already being seriously discussed.
So a working prototype, which demonstrates both the utility of utrace
itself and the end-user value of user-space probing, is disregarded.
And there are several smaller utrace clients in the works, each of
them merge candidates in the future. Yes, most of them may be
rewritten with special-purpose hook after hook as people reinvent the
utrace wheel piece by piece, but how long will that take? How is the
opportunity cost of missing features valued?
Finally, I don't know how to address the logic of "if a feature
requires utrace, that's a bad argument for utrace" and at the same
time "you need to show a killer app for utrace". What could possibly
satisfy both of those constraints? Please advise.
- FChE
On Fri, 2010-01-22 at 15:01 -0500, Frank Ch. Eigler wrote:
> So then there's uprobes, which is another potential utrace "killer
> app"
That's bollocks, uprobes is an utter and total mis-match for utrace.
Probing userspace is primarily about DSOs which is files and vma's, not
tasks.
You might maybe want a utrace interface to that, but that is largely
non-interesting.
IOW, we don't need utrace to make sensible use of uprobes.
(And when I speak of uprobes I mean the thing formerly called UBP)
On 01/21, Linus Torvalds wrote:
>
> I realize that my argument is very anti-thetical to the normal CS teaching
> of "general-purpose is good". I often feel that very specific code with
> very clearly defined (and limited) applicability is a good thing - I'd
> rather have just a very specific ptrace layer that does nothing but
> ptrace, than a "generic plugin layer that can be layered under ptrace and
> other things".
I am repeating the same (and probably poor) arguments, but we don't have a
clearly defined ptrace layer. The current code is just a set of precedents,
I mean, "this code does this because we always did this for unknown reasons".
And we can't fix it without breaking things. Even the obvious bugs which
could be fixed by a very simple patch should be preserved sometimes.
In fact, afaics the current state is: if it can't crash the kernel, it is
not a bug.
Otoh, ptrace is very limited, yes. Imho - too limited. And, as a user-space
api, it is just horrible.
However: "we're not ever going to get rid of it". Yes, sure.
But I am afraid this all is almost off-topic. Afaik, utrace was not created
to solve the problems with ptrace, at least I am sure this wasn't the only
goal.
Unfortunately, I didn't participate in other projects which use utrace.
Even if I did, I don't know how I could prove they are "important enough"
to have a generic layer to make other things possible.
Oleg.
Hi -
On Fri, Jan 22, 2010 at 09:16:16PM +0100, Peter Zijlstra wrote:
> [...]
> > So then there's uprobes, which is another potential utrace "killer
> > app"
> That's bollocks, uprobes is an utter and total mis-match for utrace.
> Probing userspace is primarily about DSOs which is files and vma's,
> not tasks. [...]
Your experience with user-space probing apparently differs from ours.
In fact there exists plenty of interest and utility in probing given
processes only, if for no other reason than to avoid disrupting others
running on the machine.
Nearly always, it is better to build a multiprocess probing widget
from multiply-applied single-process ones, rather than to build
single-process probing from grossly-filtered systemwide/VMA ones.
(If the lower level infrastructure provides both options, groovy.)
- FChE
On Fri, 22 Jan 2010, Frank Ch. Eigler wrote:
>
> Finally, I don't know how to address the logic of "if a feature
> requires utrace, that's a bad argument for utrace" and at the same
> time "you need to show a killer app for utrace". What could possibly
> satisfy both of those constraints? Please advise.
The point is, the feature needs to be a killer feature. And I have yet to
hear _any_ such killer feature, especially from a kernel maintenance
standpoint.
The "better ptrace than ptrace" is irrelevant. Sure, we all know ptrace
isn't a wonderful feature. But it's there, and a debugger is going to have
support for it anyway, so what's the _advantage_ of a "better ptrace
interface"? There is absolutely _zero_ advantage, there's just "yet
another interface". We can't get rid of the old one _anyway_.
And the seccomp replacement just sounds horrible. Using some tracing
interface to implement security models sounds like the worst idea ever.
And like it or not, over the last almost-decade, _not_ having to
work with system tap has been a feature, not a problem, for the kernel
community.
So what's the killer feature?
Linus
Hi -
On Fri, Jan 22, 2010 at 01:59:11PM -0800, Linus Torvalds wrote:
> [...]
> > Finally, I don't know how to address the logic of "if a feature
> > requires utrace, that's a bad argument for utrace" and at the same
> > time "you need to show a killer app for utrace". What could possibly
> > satisfy both of those constraints? Please advise.
>
> The point is, the feature needs to be a killer feature. And I have yet to
> hear _any_ such killer feature, especially from a kernel maintenance
> standpoint.
> The "better ptrace than ptrace" is irrelevant. Sure, we all know ptrace
> isn't a wonderful feature. But it's there, and a debugger is going to have
> support for it anyway, so what's the _advantage_ of a "better ptrace
> interface"? There is absolutely _zero_ advantage, there's just "yet
> another interface". We can't get rid of the old one _anyway_.
The point is that the intermediate api will allow (and, as the part
you clipped out about utrace-gdbstub said, *already has allowed*)
alternative plausible interfaces that coexist just fine.
> And the seccomp replacement just sounds horrible. Using some tracing
> interface to implement security models sounds like the worst idea ever.
So all this is about *naming* utrace? It was never built "for
tracing", but for (efficient/multiplexed) *control*. That wasn't even
its original name -- one of your lieutenants asked roland to change it
to utrace.
> And like it or not, over the last almost-decade, _not_ having to
> work with system tap has been a feature, not a problem, for
> the kernel community.
I don't have a problem with that. We have approximately never imposed
anything on developers who didn't want to use it. There are plenty
who have and will.
- FChE
On Fri, 22 Jan 2010, Frank Ch. Eigler wrote:
>
> The point is that the intermediate api will allow (and, as the part
> you clipped out about utrace-gdbstub said, *already has allowed*)
> alternative plausible interfaces that coexist just fine.
And my point is that multiple interfaces are BAD.
There is one interface we _have_ to have: the traditional ptrace one. That
one we can't get away from.
"Multiple interfaces" on its own is just confusion with no upside.
You need a _reason_ to have other interfaces. They need to have that
killer feature. Just being "different" is not a feature at all.
> So all this is about *naming* utrace? It was never built "for
> tracing", but for (efficient/multiplexed) *control*. That wasn't even
> its original name -- one of your lieutenants asked roland to change it
> to utrace.
No. It's not about naming. It's about the downside of having amorphous
interfaces that apparently don't even have rules, and are then used to
implement random crap.
Yes, the SNL skit about "It's a dessert topping _and_ a floor wax" was
funny, but it was funny exactly because it was crazy.
The fact that you can do crazy things is not a good thing. You need to
find the "goodness" somewhere else, and that's what I'm trying to tell
you.
You just seem to have trouble listening.
Linus
On Fri, 22 Jan 2010, Linus Torvalds wrote:
>
> No. It's not about naming. It's about the downside of having amorphous
> interfaces that apparently don't even have rules, and are then used to
> implement random crap.
>
> Yes, the SNL skit about "It's a dessert topping _and_ a floor wax" was
> funny, but it was funny exactly because it was crazy.
Put yet another way: I'd _much_ rather have two totally separate pieces
that don't depend on each other, and do different things.
So to take a very practical example: I'd much rather have 'seccomp' and
'ptrace' that have _nothing_ what-so-ever to do with each other, than have
some intermediate layer that then needs to make both of those happy, and
that both have to interact with.
There are cases where we really _want_ to have common code. We want to
have a common VFS interface because we want to show _one_ interface to
user space across a gazillion different filesystems. We want to have a
common driver layer (as far as possible) because - again - we expose a
metric shitload of drivers, and we want to have one unified interface to
them.
But going the other way: trying to share code when the interfaces are
fundamentally _different_ is generally not at all such a great idea. It
ends up tying two conceptually totally separate things together, and
suddenly people who work on feature X need to modify infrastructure that
affects feature Y, and it turns out that details A, B and C are all totally
different for the two features and the middle layer has two conflicting
things it needs to work with.
This is why when somebody brought up "you could do a seccomp-like thing on
top of utrace" that my reaction was and is just totally negative. It shows
all the wrong kinds of tying things together.
Linus
* Linus Torvalds <[email protected]> wrote:
> On Thu, 21 Jan 2010, Frank Ch. Eigler wrote:
>
> > Less passionate analysis would identify a long history of contribution by
> > the greater affiliated team, including via merged code and by
> > passing on requirements and experiences.
>
> The reason I'm so passionate is that I dislike the turn the discussion was
> taking, as if "utrace" was somehow _good_ because it allowed various other
> interfaces to hide behind it. And I'm not at all convinced that is true.
>
> And I really didn't want to single out system tap, I very much feel the same
> way about some seccomp-replacement "security model that the kernel doesn't
> even need to know about" thing.
>
> So don't take the systemtap part to be the important part, it's the bigger
> issue of "I'd much rather have explicit interfaces than have generic hooks
> that people can then use in any random way".
>
> I realize that my argument is very anti-thetical to the normal CS teaching
> of "general-purpose is good". I often feel that very specific code with very
> clearly defined (and limited) applicability is a good thing - I'd rather
> have just a very specific ptrace layer that does nothing but ptrace, than a
> "generic plugin layer that can be layered under ptrace and other things".
( I think to a certain degree it mirrors the STREAMS hooks situation from a
decade ago - and while there were big flamewars back then we never regretted
not taking the STREAMS opaque hooks upstream. )
> In one case, you know exactly what the users are, and what the semantics are
> going to be. In the other, you don't.
>
> So I really want to see a very big and immediate upside from utrace. Because
> to me, the "it's a generic layer with any application you want to throw at
> it" is a _downside_.
One component of the whole utrace/systemtap codebase that i think would make
sense upstreaming in the near term is the concept of user-space probes. We are
actively looking into it from a 'perf probe' angle, and PeterZ suggested a few
ideas already. Allowing apps to transparently improve the standard set of
events is a plus. (From a pure Linux point of view it's probably more
important than any kernel-only instrumentation.)
Also, if any systemtap person is interested in helping us create a more
generic filter engine out of the current ftrace filter engine (which is really
a precursor of a safe, sandboxed in-kernel script engine), that would be
excellent as well. Right now we support simple C-syntax expressions like:
perf record -R -f -e irq:irq_handler_entry --filter 'irq==18 || irq==19'
More could be done - a simple C-like set of functions perhaps - some minimal
per probe local variable state, etc. (perhaps even looping as well, with a
limit on the number of predicate executions per filter invocation.)
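( To make the bounded-evaluation idea concrete, here is a minimal userspace
sketch of a filter like the 'irq==18 || irq==19' one above, with a cap on
how many comparisons one invocation may execute. All types and function
names are hypothetical and chosen for illustration only; this is not the
ftrace filter engine's actual code or data layout. )

```c
#include <stddef.h>

/* Hypothetical types for illustration; not the ftrace filter code. */
enum pred_op { PRED_EQ, PRED_NE };

struct pred {
	enum pred_op op;	/* comparison to apply */
	long val;		/* constant to compare the field against */
};

/* Evaluate one comparison against an event field. */
static int pred_eval(const struct pred *p, long field)
{
	return p->op == PRED_EQ ? field == p->val : field != p->val;
}

/*
 * OR the comparisons together, left to right, stopping at the first match.
 * Returns 1 on match, 0 on no match, and -1 if deciding would take more
 * than max_evals comparisons (the per-invocation limit).
 */
static int filter_match_any(const struct pred *preds, size_t n,
			    long field, size_t max_evals)
{
	size_t evals = 0;

	for (size_t i = 0; i < n; i++) {
		if (evals++ >= max_evals)
			return -1;
		if (pred_eval(&preds[i], field))
			return 1;
	}
	return 0;
}
```

( With the two comparisons {PRED_EQ, 18} and {PRED_EQ, 19}, a field value of
18 or 19 matches and 20 does not; with max_evals set to 1, a value of 19
runs out of budget before the second comparison is reached. )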
( _Such_ a facility, could then perhaps be used to allow applications access
to safe syscall sandboxing techniques: i.e. a programmable seccomp concept
in essence, controlled via ASCII space filter expressions [broken down into
  predicates for fast execution], syscall driven and inherited by child
tasks so that security restrictions percolate down automatically.
IMHO that would be a superior concept for security modules too: there's no
  reason why all the current somewhat opaque security hooks couldn't be
  expressed in terms of more generic filter expressions, via a facility that
  can be used both for security and for instrumentation. That's all that
SELinux boils down to in the end: user-space injected policy rules. )
The opaque hookery all around the core kernel just to push everything outside
of mainline is one of the biggest downsides of utrace/systemtap - and neither
uprobes nor the concept of user-defined scripting around existing events is
affected much by that.
So lots of work is left and all that work is going to be rather utilitarian
with little downside: specific functionality with an immediately visible
upside, with no need for opaque hooks.
Ingo
On Fri, Jan 22, 2010 at 19:22, Linus Torvalds
<[email protected]> wrote:
> There are cases where we really _want_ to have common code. We want to
> have a common VFS interface because we want to show _one_ interface to
> user space across a gazillion different filesystems. We want to have a
> common driver layer (as far as possible) because - again - we expose a
> metric shitload of drivers, and we want to have one unified interface to
> them.
So... Everybody agrees that ptrace() is horrible and a royal pain to
use, let alone use correctly and without bugs. Everybody also agrees
that ptrace() needs to stay around for a long time to avoid breaking
all the existing users.
Now how do we get from here to a moderately portable API for
interrogating, controlling, and intercepting process state?
Essentially it would need to support all of the things that a powerful
debugger would want to do, including modifying registers and memory,
substituting syscall return values, etc. I believe that "utrace" is
the kernel side of that API.
The killer app for this will be the ability to delete thousands of
lines of code from GDB, strace, and all the various other tools that
have to painfully work around the major interface gotchas of ptrace(),
while at the same time making their handling of complex processes much
more robust.
The *second* killer app for this is to make it much easier for people
to write new userspace debugging tools. I love the various
crash-catching tools that different distributions or applications
provide, but they all basically have to trap the SIGSEGV and hope
they're still sensible enough to fork() and exec() a gdb process.
Furthermore, I would love to be able to write debugging tools for
scripting languages that allow me to step across Perl, C, PHP,
assembly code, etc, all within the same process. In theory that's all
possible today, but given how much of a *pain* ptrace() is to use
correctly, nobody bothers.
Now, with all that said, "utrace" does not provide any of the
userspace side APIs today... but I think it is a necessary refactoring
if we want to provide a new ideal process-introspection interface
without breaking all the ptrace() users.
Think of the "utrace" interface as very much like the LSM interface.
Just like with LSMs, there is a lot of active research in debugging
and tracing tools, and nobody can even remotely agree what the hell
they want out of the hooks. In theory you could add one hook for
every place each security module needs one... but then your fast-path
is littered with always-false test-and-jump statements. What "utrace"
provides is the one single test in each fast path that then searches
for and executes the appropriate slow path(s) for that process.
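(A minimal userspace sketch of that "one test in the fast path" shape,
purely to illustrate the idea. Every name below is hypothetical; this is
not the real utrace API, and the real code has locking and lifetime rules
that are ignored here.)

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not the real utrace API. */
enum { EV_SYSCALL_ENTRY = 1u << 0, EV_SIGNAL = 1u << 1 };

struct engine {
	unsigned int events;		/* events this engine asked for */
	void (*callback)(unsigned int event, void *data);
	void *data;
	struct engine *next;
};

struct task {
	unsigned int event_mask;	/* OR of all attached engines' events */
	struct engine *engines;
};

/* Attach an engine and fold its events into the task-wide mask. */
static void attach_engine(struct task *t, struct engine *e)
{
	e->next = t->engines;
	t->engines = e;
	t->event_mask |= e->events;
}

/* Slow path: walk the engines and call every interested one. */
static int report_slow(struct task *t, unsigned int event)
{
	int called = 0;

	for (struct engine *e = t->engines; e != NULL; e = e->next) {
		if (e->events & event) {
			e->callback(event, e->data);
			called++;
		}
	}
	return called;
}

/* Fast path: a single mask test; false for any untraced task. */
static inline int report_event(struct task *t, unsigned int event)
{
	if (!(t->event_mask & event))
		return 0;
	return report_slow(t, event);
}

/* Example callback: count how many times it was invoked. */
static void count_cb(unsigned int event, void *data)
{
	(void)event;
	++*(int *)data;
}
```

(Two engines attached to the same task both get their callbacks from one
report_event() call, while a task with nothing attached pays only for the
mask test.)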
I personally would be very happy to see "utrace" merged.
Cheers,
Kyle Moffett
On Sat, Jan 23, 2010 at 2:22 AM, Linus Torvalds
<[email protected]> wrote:
> This is why when somebody brought up "you could do a seccomp-like thing on
> top of utrace" that my reaction was and is just totally negative. It shows
> all the wrong kinds of tying things together.
seccomp-via-utrace should be just removed to be honest before its users.
It entered the tree because it was very small and simple.
If rewritten, it no longer is small and simple because of whole kernel/utrace.c.
> The killer app for this will be the ability to delete thousands of
> lines of code from GDB, strace, and all the various other tools that
> have to painfully work around the major interface gotchas of ptrace(),
> while at the same time making their handling of complex processes much
> more robust.
Years ago (and it really must be years ago because this was about the
time I started hacking on Linux stuff !) there was a proposal to extract
and sanitize the arch specific stuff in binutils and in gdb etc into
sensible libraries that could be used by other apps.
What I don't understand is why that doesn't solve 99% of your problem.
ptrace is not perfect but most of the real ptrace limitations actually
come about because either the CPU can't do something or because the
supporting logic would be too expensive - things like having extra
private debugger pages.
Yes ptrace needs a lot of icky support code, but it's already been
written...
Alan
* Kyle Moffett <[email protected]> wrote:
> On Fri, Jan 22, 2010 at 19:22, Linus Torvalds
> <[email protected]> wrote:
> > There are cases where we really _want_ to have common code. We want to
> > have a common VFS interface because we want to show _one_ interface to
> > user space across a gazillion different filesystems. We want to have a
> > common driver layer (as far as possible) because - again - we expose a
> > metric shitload of drivers, and we want to have one unified interface to
> > them.
>
> So... Everybody agrees that ptrace() is horrible and a royal pain to use,
> let alone use correctly and without bugs. Everybody also agrees that
> ptrace() needs to stay around for a long time to avoid breaking all the
> existing users.
>
> Now how do we get from here to a moderately portable API for interrogating,
> controlling, and intercepting process state? Essentially it would need to
> support all of the things that a powerful debugger would want to do,
> including modifying registers and memory, substituting syscall return
> values, etc. I believe that "utrace" is the kernel side of that API.
The problem is, utrace does not do that really.
What utrace does is that it provides an opaque set of APIs for unspecified and
out of tree _kernel_ modules (such as systemtap). It doesn't support any
'application' per se. It basically removes the kernel's freedom at shaping its
own interaction with debug applications.
If utrace was a 'better ptrace' syscall, where the syscall itself is the goal
of the hookery, it would all be rather different. People could argue about
_that_ interface (and the hooks would be a pure kernel internal
implementational detail - not an interface specification), and once people
agree about that ABI and there's enough application momentum behind it, the
hooks are really not that opaque anymore - they are for that ABI and not more.
Note that it's still a _big_ hurdle: it's hard to agree on a new syscall and
it's hard to get 'application momentum' behind it. Special Linux system calls
have a checkered past, they tend to not be used by much of anything, and thus
they tend to be a breeding ground for bugs, maintenance complexity and
security problems. Lack of attention is never good.
In that sense it might be better to fix/enhance ptrace, if there's interest.
I've written a handful of ptrace extensions in the past (none of them went
upstream tho), it can be done in a useful manner and the code is pretty
hackable. There are basic problems left to be solved: for example why is there
still no 'memory block copy' call, why are we _still_ limited to one word per
system call PTRACE_PEEK* memory copies? It's ridiculous. SparcLinux has
PTRACE_WRITE*/READ* support that implements this, but none of the other
architectures have it so it's essentially unused.
Or another possible direction would be to extend the perf events syscall with
interception capabilities. It's far more performant at extracting application
state without scheduling than any ptrace method - and interception/injection
would be a natural next step - if there's interest.
Thanks,
Ingo
Hi -
mingo wrote:
> [...]
> > Now how do we get from here to a moderately portable API for interrogating,
> > controlling, and intercepting process state? Essentially it would need to
> > support all of the things that a powerful debugger would want to do,
> > including modifying registers and memory, substituting syscall return
> > values, etc. I believe that "utrace" is the kernel side of that API.
>
> The problem is, utrace does not do that really.
In fact, it is exactly designed for that.
> What utrace does is that it provides an opaque set of APIs for
> unspecified and out of tree _kernel_ modules (such as systemtap). It
> doesn't support any 'application' per se. It basically removes the
> kernel's freedom at shaping its own interaction with debug
> applications.
This claim is hard to take any more seriously than emoting that the
blockio layer is "opaque" because device drivers "remove freedom" for
the kernel to "shape its interaction" with hardware. If you have any
*real evidence* about how any present user of utrace misuses that
capability, or interferes with the "kernel's freedom", show us please.
- FChE
Hi -
On Sat, Jan 23, 2010 at 11:01:21AM +0000, Alan Cox wrote:
> [...]
> What I don't understand is why [libgdb?] doesn't solve 99% of your problem.
> ptrace is not perfect but most of the real ptrace limitations actually
> come about because either the CPU can't do something or because the
> supporting logic would be too expensive - things like having extra
> private debugger pages.
At least one reason is that ptrace is single-usage-only, so for
example you cannot concurrently debug & strace the same program.
OTOH, utrace is designed to permit clean nesting/sharing semantics for
concurrent debugger-type tools operating on the same processes.
- FChE
Hi -
On Sat, Jan 23, 2010 at 07:04:01AM +0100, Ingo Molnar wrote:
> [...] Also, if any systemtap person is interested in helping us
> create a more generic filter engine out of the current ftrace filter
> engine (which is really a precursor of a safe, sandboxed in-kernel
> script engine), that would be excellent as well. [...]
Thank you for the invitation.
> More could be done - a simple C-like set of functions perhaps - some minimal
> per probe local variable state, etc. (perhaps even looping as well, with a
> limit on number of predicate executions per filter invocation.)
Yes, at some point when such a bytecode interpreter gets rich enough, one
may not need the translated-to-C means of running scripts.
> ( _Such_ a facility, could then perhaps be used to allow applications access
> to safe syscall sandboxing techniques: i.e. a programmable seccomp concept
> in essence, controlled via ASCII space filter expressions [...]
> IMHO that would be a superior concept for security modules too [...]
>
> [...] specific functionality with an immediately visible upside,
> with no need for opaque hooks.
This OTOH seems like rather a stretch. If one claims that "opaque
hooks" are bad, so instead have hooks that jump not to auditable C
code but to a bytecode interpreter? And have the bytecodes be uploaded
from userspace? How is this supposed to produce "transparency" from
the kernel/hook point of view?
- FChE
Em Sat, Jan 23, 2010 at 11:01:21AM +0000, Alan Cox escreveu:
> Years ago (and it really must be years ago because this was about the
> time I started hacking on Linux stuff !) there was a proposal to extract
> and sanitize the arch specific stuff in binutils and in gdb etc into
> sensible libraries that could be used by other apps.
Hallelujah if it had happened at that time, but sadly... :-(
- Arnaldo
On Sat, Jan 23, 2010 at 06:47:29AM -0500, Frank Ch. Eigler wrote:
> > What utrace does is that it provides an opaque set of APIs for
> > unspecified and out of tree _kernel_ modules (such as systemtap). It
> > doesn't support any 'application' per se. It basically removes the
> > kernel's freedom at shaping its own interaction with debug
> > application.
>
> This claim is hard to take any more seriously than emoting that the
> blockio layer is "opaque" because device drivers "remove freedom" for
> the kernel to "shape its interaction" with hardware. If you have any
> *real evidence* about how any present user of utrace misuses that
> capability, or interferes with the "kernel's freedom", show us please.
The fundamental issue which Ingo is trying to say (and which you
apparently don't seem to be understanding) is that utrace doesn't
export a syscall (which is an ABI that we are willing to promise will
be stable), but rather a set of kernel API's (which we never promise
to be stable), and the fact that there will be out-of-tree programs
that are going to be trying to depend on that interface (much like
Systemtap does today when it creates kernel modules) is something that
is considered on par with Nvidia trying to ship proprietary video
drivers.
(OK, maybe not *quite* as evil as Nvidia because at least SystemTap is
open source, but the bottom line is that enabling out-of-tree modules
isn't considered a good thing, and if we know in advance that there
are out-of-tree modules, there is a strong tendency to want to nip
those in the bud.)
The reason why I avoid Nvidia hardware like the plague is because I
work on bleeding-edge kernels, and even though companies like Nvidia
and Broadcom try very hard to keep up with released upstream kernels,
#1, there is always the concern of what happens if they decide to
change that policy, and #2, invariably something will break during the
-rc1 or -rc2 stage, and then my laptop is useless for running bleeding
edge kernels. It's one of the reasons why many kernel developers gave
up on SystemTap, because it's not something that can be trusted to be
there, and the fault is not on our changing the API's, it's on
SystemTap depending on API's that were never guaranteed to be stable
in the first place.
If you want to try to slide utrace in, such that we're able to ignore
the fact that there will be this external house that will be built on
quicksand, pointing at how nice the external house will be isn't going
to be helpful. Nor is pointing at the ability that other people will
be able to build other really nice houses on the aforementioned
quicksand (i.e., out-of-tree kernel modules that depend on kernel
API's).
A simple "code cleanup" argument is not carrying the day (Look! We
can clean up the ptrace code!). It's going to have to be a **really**
cool in-tree kernel functionality that provides a killer feature (in
Linus's words), enough so that people are willing to overlook the fact
that there's this monster external out-of-tree project that wants to
depend on API's that may not be stable, and which, even if the
developers don't grump at us, users will grump at us when we change
API's that we had never guaranteed will be stable, and then Systemtap
breaks.
This is probably why Ingo invited you to think about ways of doing
some kind of safe in-kernel bytecode approach. That has the advantage
of doing away with external kernel modules, with all of their many
downsides: its dependency on unstable kernel API's, the fact that many
financial customers have security policies that prohibit C compilers
on production machines, the inherent security risk of allowing
external random kernel modules to be delivered and loaded into a
system, etc.
- Ted
On Sat, 23 Jan 2010, Kyle Moffett wrote:
>
> Now how do we get from here to a moderately portable API for
> interrogating, controlling, and intercepting process state?
Umm? ptrace?
It's not _pretty_, but it's a hell of a lot more portable than utrace is
ever going to be. Yes, the details differ between OS's (and between
architectures), but let's face it, things like register state probing is
_never_ going to be portable across different architectures simply because
the register state isn't the same.
> The killer app for this will be the ability to delete thousands of
> lines of code from GDB, strace, and all the various other tools that
> have to painfully work around the major interface gotchas of ptrace(),
> while at the same time making their handling of complex processes much
> more robust.
No. There is absolutely _no_ reason to believe that gdb et al would ever
delete the ptrace interfaces anyway.
That really is my point. Adding a new interface, when an old and crufty
(but working) interface is inevitably going to be around anyway - and is
inevitably always going to have portability issues - is STUPID.
Let's take strace, for example.
Yes, ptrace() is crufty, but have you actually looked at strace source
code? The problem isn't really a crufty interface to read registers etc,
the bigger problem for strace is that different architectures and OS's
have different system call argument rules, different ways to read/write
system call numbers yadda yadda yadda.
Take a look at strace sources some day. Moving away from ptrace on Linux
(even if you decided that you don't care about old versions of the kernel
that don't know anything else) would simplify ABSOLUTELY NOTHING.
Really. Quite the reverse, I suspect. The Solaris and FreeBSD support uses
ptrace too, afaik, so you'd just be confusing the issue.
And the fact is, strace would still end up supporting ptrace anyway, just
so that you could run it on old kernels.
So the whole "making a new utrace interface would simplify things" is
simply a total lie. The fact that ptrace is a bit of an odd interface IN
NO WAY means that any other interface would end up being appreciably
simpler.
It would just result in _more_ code in strace, and more confusion.
Linus
On Sat, Jan 23, 2010 at 09:04:56PM -0800, Linus Torvalds wrote:
> > The killer app for this will be the ability to delete thousands of
> > lines of code from GDB, strace, and all the various other tools that
> > have to painfully work around the major interface gotchas of ptrace(),
> > while at the same time making their handling of complex processes much
> > more robust.
>
> No. There is absolutely _no_ reason to believe that gdb et al would ever
> delete the ptrace interfaces anyway.
More to the point, gdb *couldn't* use utrace, because utrace only
exports a kernel API; not a syscall interface. And if the Red Hat
Toolchain folks are thinking about encouraging gdb to start creating
out-of-tree kernel modules, so that (a) gdb requires root privs, and
(b) gdb is as (un)stable as SystemTap with respect to development
kernels by making it dependent on internal kernel API's, the Red Hat
Toolchain group needs to be smacked upside the head...
- Ted
Hi -
On Sun, Jan 24, 2010 at 05:25:13AM -0500, [email protected] wrote:
> [...]
> > > The killer app for this will be the ability to delete thousands of
> > > lines of code from GDB, strace, and all the various other tools that
> > > have to painfully work around the major interface gotchas of ptrace(),
> > > while at the same time making their handling of complex processes much
> > > more robust.
> >
> > No. There is absolutely _no_ reason to believe that gdb et al would ever
> > delete the ptrace interfaces anyway.
>
> More to the point, gdb *couldn't* use utrace, because utrace only
> exports a kernel API; not a syscall interface.
Yes, this might explain why Kyle wrote:
> > > [...] I believe that "utrace" is the kernel side of that
> > > API. [...]
> And if the Red Hat Toolchain folks are thinking about encouraging
> gdb to start creating out-of-tree kernel modules [...] the Red Hat
> Toolchain group needs to be smacked upside the head...
Those keeping up will note that an ordinary in-tree, non-modular,
non-root-only, already-works-with-standard-gdb,
potentially-better-than-ptrace debugger interface has already been
prototyped & posted on lkml as an RFC.
- FChE
On Sat, 23 Jan 2010, Frank Ch. Eigler wrote:
> On Sat, Jan 23, 2010 at 07:04:01AM +0100, Ingo Molnar wrote:
>
> > [...] Also, if any systemtap person is interested in helping us
> > create a more generic filter engine out of the current ftrace filter
> > engine (which is really a precursor of a safe, sandboxed in-kernel
> > script engine), that would be excellent as well. [...]
>
> Thank you for the invitation.
>
> > More could be done - a simple C-like set of functions perhaps - some minimal
> > per probe local variable state, etc. (perhaps even looping as well, with a
> > limit on number of predicate executions per filter invocation.)
>
> Yes, at some point when such a bytecode interpreter gets rich enough, one
> may not need the translated-to-C means of running scripts.
>
>
> > ( _Such_ a facility, could then perhaps be used to allow applications access
> > to safe syscall sandboxing techniques: i.e. a programmable seccomp concept
> > in essence, controlled via ASCII space filter expressions [...]
> > IMHO that would be a superior concept for security modules too [...]
> >
> > [...] specific functionality with an immediately visible upside,
> > with no need for opaque hooks.
>
> This OTOH seems like rather a stretch. If one claims that "opaque
> hooks" are bad, so instead have hooks that jump not to auditable C
> code but to a bytecode interpreter? And have the bytecodes be uploaded
> from userspace? How is this supposed to produce "transparency" from
> the kernel/hook point of view?
Simply because the kernel controls which byte code is executed and has
control over the functionality behind it. That makes the hooks well
defined and transparent.
Thanks,
tglx
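The argument that the kernel "controls which byte code is executed" can be illustrated with a toy filter interpreter. This is a minimal sketch under stated assumptions, not the ftrace filter engine: the opcode names, the MAX_INSNS limit, and the event fields ("pid", "nr") are all hypothetical. The point it shows is that the uploaded program is checked against a fixed opcode set and a size bound before it ever runs.

```python
MAX_INSNS = 64  # hypothetical bound on program length

# The only two-operand operations the "kernel side" chooses to offer.
BINARY_OPS = {
    "eq": lambda a, b: a == b,
    "gt": lambda a, b: a > b,
    "and": lambda a, b: bool(a) and bool(b),
    "or": lambda a, b: bool(a) or bool(b),
}


class FilterError(Exception):
    """Raised when an uploaded program is rejected by the verifier."""


def verify(program, known_fields):
    """Reject anything outside the fixed opcode set before it is run."""
    if len(program) > MAX_INSNS:
        raise FilterError("program too long")
    depth = 0  # simulated stack depth
    for op, arg in program:
        if op == "push_const":
            if not isinstance(arg, int):
                raise FilterError("constants must be integers")
            depth += 1
        elif op == "push_field":
            if arg not in known_fields:
                raise FilterError(f"unknown field {arg!r}")
            depth += 1
        elif op in BINARY_OPS:
            if depth < 2:
                raise FilterError("stack underflow")
            depth -= 1
        elif op == "not":
            if depth < 1:
                raise FilterError("stack underflow")
        else:
            raise FilterError(f"unknown opcode {op!r}")
    if depth != 1:
        raise FilterError("program must leave exactly one result")


def run(program, event):
    """Evaluate a verified straight-line program against one event."""
    stack = []
    for op, arg in program:
        if op == "push_const":
            stack.append(arg)
        elif op == "push_field":
            stack.append(event[arg])
        elif op == "not":
            stack.append(not stack.pop())
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY_OPS[op](a, b))
    return bool(stack.pop())


# Filter equivalent to: pid == 42 and nr > 100
PROGRAM = [
    ("push_field", "pid"), ("push_const", 42), ("eq", None),
    ("push_field", "nr"), ("push_const", 100), ("gt", None),
    ("and", None),
]

if __name__ == "__main__":
    verify(PROGRAM, {"pid", "nr"})
    print(run(PROGRAM, {"pid": 42, "nr": 120}))
```

Because the program is straight-line and bounded in length, its running time is bounded too; a version with loops would need the per-invocation execution limit mentioned earlier in the thread.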
Hi -
tytso wrote:
> [...]
Let me see if I can paraphrase those of your concerns that were substantive:
1) That if utrace is merged, and systemtap keeps on using it, there may be
some sort of chilling effect on kernel developers that would impede
utrace's future development.
This might sound plausible to an outsider, but luckily we're not stuck
with having to speculate: one can examine history. Systemtap has been
around, working roughly the same way, for about *five years*.
Systemtap modules use more than a handful of mainstream
module-accessible kernel services. During all this time, how many
examples have there been when systemtap developers have pleaded
with lkml to avoid changing some prior interface? How many of those
successfully? (That last one is a trick question, since both numbers
are really close to *zero*.) How much real impediment to change has
our mere existence caused?
2) That systemtap is not portable to all kernel versions.
Problems do periodically occur. However, one can again refer to
historical facts to assess whether in fact they warrant long term
grudges. In every release note, we list the range of kernel versions
we test against. We may have one of the broadest ranges of support,
2.6.9 through to many current -rc*s and non-linus trees. We have
several mechanisms which let us easily adapt to most changes. It may
interest readers to find out that the number of systemtap changes we
have had to add on account of kernel changes is on the order of a *few
per year*. The usual turnaround, once reported, is on the order of a
*few days*.
3) That systemtap users will complain to kernel developers if
systemtap becomes incompatible.
Let's go to the historical record again. How many such complaints
have actually been seen in inappropriate fora such as lkml? How
difficult were they to diagnose / redirect to the proper venue? Have
they constituted a "loss of face" for kernel developers?
4) That systemtap is almost but not quite as evil as nvidia.
It seems factors like ...
- always being a completely open source project
- keeping in regular contact with lkml and other constituencies
- not being related to essential hardware enablement, so users
not wanting it don't have to touch it
- the compile-to-C approach being technologically necessary since
there was no alternative plausible way at the time (and still now)
- repeatedly offering infrastructure code with non-stap uses
... all add up to a mere nudge away from entirely "evil". If so, I
wonder if your sort of grossly bimodal view of ethical virtue is going
to foster the right sorts of change in the linux kernel community.
- FChE
On Sat, Jan 23, 2010 at 14:48, <[email protected]> wrote:
> The fundamental issue which Ingo is trying to say (and which you
> apparently don't seem to be understanding) is that utrace doesn't
> export a syscall (which is an ABI that we are willing to promise will
> be stable), but rather a set of kernel API's (which we never promise
> to be stable),
The point that's being missed is that there is a chicken-and-egg
problem here. The "chicken" is a replacement or extension to the
debugger interface that would make it possible for me to do things
like GDB a process while it's being strace'd or vice versa. The "egg"
is the "utrace" bits, an unstable but somewhat arch-generic ABI that
abstracts out ptrace() to make it possible to stack both in-kernel and
userspace debuggers/tracers/etc and have multiple simultaneous users.
> and the fact that there will be out-of-tree programs
> that are going to be trying to depend on that interface (much like
> Systemtap does today when it creates kernel modules) is something that
> is considered on par with Nvidia trying to ship proprietary video
> drivers.
Ugh... perhaps we should derive a variation of Godwin's law for this:
"As an LKML discussion grows longer, the probability of an unfavorable
comparison involving nVidia or Microsoft approaches 1."
> If you want to try to slide utrace in, such that we're able to ignore
> the fact that there will be this external house that will be built on
> quicksand, pointing at how nice the external house will be isn't going
> to be helpful. Nor is pointing at the ability that other people will
> be able to build other really nice houses on the aforementioned
> quicksand (i.e., out-of-tree kernel modules that depend on kernel
> API's).
Personally I don't give a flying **** about SystemTap; I'm interested
in things like the ability to stack gdb with strace, the RFC gdb-stub
posted a week ago, etc. None of those abilities would be out-of-tree
modules at all, and therefore the "quicksand" analogy is specious.
> A simple "code cleanup" argument is not carrying the day (Look! We
> can clean up the ptrace code!). It's going to have to be a **really**
> cool in-tree kernel functionality that provides a killer feature (in
> Linus's words), enough so that people are willing to overlook the fact
> that there's this monster external out-of-tree project that wants to
> depend on API's that may not be stable, and which, even if the
> developers don't grump at us, users will grump at us when we change
> API's that we had never guaranteed will be stable, and then Systemtap
> breaks.
I would be willing to guess that something like 95% of the people
using SystemTap or other tools are doing so on Red Hat Enterprise
Linux or other enterprise supported platforms, and so when something
breaks they go whinge at Red Hat, etc. If I recall correctly Red Hat
and many of the other vendors already heavily fiddle with kernel
patches they apply to provide some amount of binary module
compatibility.
> This is probably why Ingo invited you to think about ways of doing
> some kind of safe in-kernel bytecode approach. That has the advantage
> of doing away with external kernel modules, with all of their many
> downsides: its dependency on unstable kernel API's, the fact that many
> financial customers have security policies that prohibit C compilers
> on production machines, the inherent security risk of allowing
> external random kernel modules to be delivered and loaded into a
> system, etc.
There are substantial non-SystemTap uses for utrace that would *not*
be satisfied by an "in-kernel bytecode approach", starting with
stacking debuggers and tracers. Furthermore, let's say they did go
off and build the in-kernel bytecode interpreter. I can pretty much
guarantee that people would say the hooks into the rest of the kernel
are too invasive and they should be abstracted out into an API. *This
is that API!*
Cheers,
Kyle Moffett
On Sat, Jan 23, 2010 at 12:23:33PM +0100, Ingo Molnar wrote:
>
> * Kyle Moffett <[email protected]> wrote:
>
> > On Fri, Jan 22, 2010 at 19:22, Linus Torvalds
> > <[email protected]> wrote:
...
> In that sense it might be better to fix/enhance ptrace, if there's interest.
> I've written a handful of ptrace extensions in the past (none of them went
> upstream tho), it can be done in a useful manner and the code is pretty
> hackable. There are basic problems left to be solved: for example why is there
> still no 'memory block copy' call, why are we _still_ limited to one word per
> system call PTRACE_PEEK* memory copies? It's ridiculous. SparcLinux has
> PTRACE_WRITE*/READ* support that implements this, but none of the other
> architectures have it so it's essentially unused.
>
> Or another possible direction would be to extend the perf events syscall with
> interception capabilities. It's far more performant at extracting application
> state without scheduling than any ptrace method - and interception/injection
> would be a natural next step - if there's interest.
This certainly is now a chicken and egg problem. Everybody agrees that
Linux needs something better than ptrace; legacy ptrace will continue to
live, so will utilities written to it (strace, etc).
But should that limit what Linux can offer? What's the way out?
- Enhance ptrace: At least one ptrace maintainer (Roland) had publicly
stated he doesn't prefer enhancing legacy ptrace -- that it's already a
beast to maintain, and adding more complexity to it does it no good.
- Extend perf; would perf then use utrace underneath? Or would one have
to redo some of what utrace already does for thread level control?
- Give utrace a syscall and make it the primary way for users to
interact with the layer. There are benefits to this if there is
agreement on the utrace layer itself, maybe with less flexibility than
what it currently offers? If yes, what should it look like?
Any new debug facility will have to incorporate some or most learnings
from what utrace tried to address. It would be sad to just dump utrace
and redo everything from scratch or band-aid existing interfaces.
Ananth
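The cost of the one-word-per-call PTRACE_PEEK* limitation quoted above is easy to see in a small model. This is a sketch only: FakeTracee, peek_word and read_block are hypothetical stand-ins that count simulated calls and do not invoke ptrace, and the 8-byte word size is an assumption (a 64-bit machine).

```python
WORD = 8  # assumed word size in bytes (64-bit machine)


class FakeTracee:
    """Stand-in for a traced process's memory; counts simulated calls."""

    def __init__(self, memory):
        self.memory = memory
        self.calls = 0

    def peek_word(self, addr):
        # Models a PTRACE_PEEKDATA-style request: one call, one word.
        self.calls += 1
        return self.memory[addr:addr + WORD].ljust(WORD, b"\0")

    def read_block(self, addr, length):
        # Models a hypothetical block-copy request: one call, any length.
        self.calls += 1
        return self.memory[addr:addr + length]


def read_with_peeks(tracee, addr, length):
    """Read `length` bytes using only word-sized peeks."""
    out = bytearray()
    for offset in range(0, length, WORD):
        out += tracee.peek_word(addr + offset)
    return bytes(out[:length])


if __name__ == "__main__":
    page = bytes(range(256)) * 16  # one 4096-byte page of sample data

    slow = FakeTracee(page)
    read_with_peeks(slow, 0, len(page))
    print("word-at-a-time calls:", slow.calls)

    fast = FakeTracee(page)
    fast.read_block(0, len(page))
    print("block-copy calls:", fast.calls)
```

Under these assumptions, copying a single 4096-byte page takes 4096 / 8 = 512 word-sized calls, against one call for a block copy.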
On Sun, Jan 24, 2010 at 08:42:13PM -0500, Kyle Moffett wrote:
>
> Personally I don't give a flying **** about SystemTap; I'm interested
> in things like the ability to stack gdb with strace, the RFC gdb-stub
> posted a week ago, etc. None of those abilities would be out-of-tree
> modules at all, and therefore the "quicksand" analogy is specious.
Great. So what should be reviewed is utrace *plus* these other
userland interfaces, which may get critiqued and improved, and utrace
patches can be reviewed in light of these new features. But be
warned.... if it turns out that only 30% of utrace is needed to
support gdb stacking with strace, etc., the other 70% will likely get
ejected and the utrace patches streamlined to support these in-tree
users. But since you don't give a flying **** about SystemTap,
presumably you won't mind, right?
> I would be willing to guess that something like 95% of the people
> using SystemTap or other tools are doing so on Red Hat Enterprise
> Linux or other enterprise supported platforms, and so when something
> breaks they go whinge at Red Hat, etc. If I recall correctly Red Hat
> and many of the other vendors already heavily fiddle with kernel
> patches they apply to provide some amount of binary module
> compatibility.
Sure, but as out-of-tree modules, the best they can expect is that
most kernel developers will pretend that they don't exist. Which is
OK, when I tried using SystemTap most of the concerns which I
expressed as being critical for kernel developers were largely ignored
(as near as I could tell) because the target market was RHEL corporate
customers, and they prioritized their resourcing accordingly --- so
they shouldn't mind if kernel developers return the favor.
But that means that we should only merge those portions of utrace that
are needed for these alleged "killer new features", and only if these
new features are cool enough that they justify the new code on their
own merits. At least, IMNSHO.
- Ted
On Mon, 2010-01-25 at 10:29 +0530, Ananth N Mavinakayanahalli wrote:
> - Extend perf; would perf then use utrace underneath? Or would one have
> to redo some of what utrace already does for thread level control?
No, perf is about monitoring/tracing not modifying. It's about minimal
interference, the very opposite of what ptrace/utrace is about.
From a perf POV if you need to stop a task (changing its scheduling
state) you've lost.
Furthermore, despite the name utrace isn't about tracing at all, it's a
full blown debugging infrastructure which completely multiplexes the
task state, not something perf is interested in at all.
On Sun, 24 Jan 2010, Kyle Moffett wrote:
>
> The point that's being missed is that there is a chicken-and-egg
> problem here. The "chicken" is a replacement or extension to the
> debugger interface that would make it possible for me to do things
> like GDB a process while it's being strace'd or vice versa. The "egg"
> is the "utrace" bits, an unstable but somewhat arch-generic ABI that
> abstracts out ptrace() to make it possible to stack both in-kernel and
> userspace debuggers/tracers/etc and have multiple simultaneous users.
Quite frankly, as far as I'm concerned, I'd be a whole lot more interested
in utrace if its _only_ stated (and implied) goal was to do exactly this.
The thing I object to is the whole "dessert topping _and_ floor wax"
thing, with kernel interfaces for random other users.
If somebody extended ptrace in good ways, that's a totally different
thing. But I think utrace has been over-designed, possibly as a result of
others coming in and saying "hey, I'd like to use that too for xyz".
"Do one thing, and do it well". I'd not mind somebody improving ptrace
(including extending its semantics - I do agree that the whole SIGSTOP
thing makes it hard to have multiple debuggers).
That said, I also suspect that people should still look seriously at
simply just improving ptrace. For example, I suspect that the biggest
problem with ptrace is really just the signalling, and that creating a new
extension for JUST THAT, and then having a model where you can choose - at
PTRACE_ATTACH time - how to wait for events would be a good thing.
But as long as it is "I want to solve all problems", I'm not very
impressed.
Maybe somebody would be interested in trying to take the utrace
improvements, and scaling down what they promise, and ignoring all input
except for "I want to strace and gdb at the same time".
So stop the crazy "new kernel interfaces" crap. Stop the crazy "maybe we
can use it for ftrace and generic user event tracing too". Stop the crazy.
Linus
Hi -
On Mon, Jan 25, 2010 at 08:52:41AM -0800, Linus Torvalds wrote:
> [...] If somebody extended ptrace in good ways, that's a totally
> different thing. But I think utrace has been over-designed, possibly
> as a result of others coming in and saying "hey, I'd like to use
> that too for xyz". [...]
Earlier, you said that you haven't followed utrace "at all". Upon
what real information do you infer that it has been over-designed?
- FChE
On Mon, 25 Jan 2010, Frank Ch. Eigler wrote:
>
> Earlier, you said that you haven't followed utrace "at all". Upon
> what real information do you infer that it has been over-designed?
Upon the information that people are talking about magic new kernel
interfaces to do fancy things. And talking about doing things with it that
are simply not relevant for ptrace/strace.
In fact, in this very thread I've been informed that there are no user
interfaces to utrace at all, which to me says that it's been TOTALLY
MISDESIGNED FROM THE VERY START, and has nothing to do with making ptrace
work for strace/gdb at the same time.
In other words, I may not have followed utrace development, but I sure as
hell can read. And everything I read about it just makes me less inclined
to want to merge it. The people who argue "for" it are actually screwing
themselves by arguing for all the wrong things, and making me convinced I
don't want to touch it with a ten-foot pole.
If somebody were to argue that "this is a simple series of patches to
clean up ptrace and make it possible to strace a debugged process", then
that would have been different. That's not what you or others have been
doing. You've been pushing exactly the _reverse_ of that, namely how great
it is for some random totally new features that I'm convinced aren't even
used by a lot of people.
So give me a populist argument that makes sense for tons of actual users,
not some f*cking "here's a cool infrastructure that developers can do
random crazy out-of-tree crap with". Because I'm not interested in crazy
developers.
Linus
On Mon, 25 Jan 2010, Linus Torvalds wrote:
>
> So give me a populist argument that makes sense for tons of actual users,
> not some f*cking "here's a cool infrastructure that developers can do
> random crazy out-of-tree crap with". Because I'm not interested in crazy
> developers.
In other words, give me the "killer feature". The thing I've asked for all
the time. The thing that you seem to continually NOT EVEN UNDERSTAND.
Linus
On Mon, 2010-01-25 at 09:36 -0800, Linus Torvalds wrote:
> Because I'm not interested in crazy
> developers.
>
> Linus
Uh oh, that's not good for us real-time folks.
http://lwn.net/Articles/357800/
"And, according to Linus, the realtime people are crazy, so they can be
left to deal with the weird stuff."
-- Steve
(Sorry, I just couldn't resist)
> Uh oh, that's not good for us real-time folks.
>
> http://lwn.net/Articles/357800/
>
> "And, according to Linus, the realtime people are crazy, so they can be
> left to deal with the weird stuff."
"I'd prefer the trees to be separate for testing purposes: it
doesn't make much sense to have SMP support as a normal
kernel feature when most people won't have SMP anyway"
-- Linus Torvalds
Use cases got that into the tree pretty easily, I am sure RT ones will do
the same.
On Mon, 25 Jan 2010, Steven Rostedt wrote:
>
> Uh oh, that's not good for us real-time folks.
>
> http://lwn.net/Articles/357800/
>
> "And, according to Linus, the realtime people are crazy, so they can be
> left to deal with the weird stuff."
The RT people have actually been pretty good at slipping their stuff in,
in small increments, and always with good reasons for why they aren't
crazy.
Yeah, it's taken them years, and they still have out-of-tree stuff. And
yeah, they had to change some things to make them more palatable to the
mainline kernel - the whole fundamental raw spinlock change is just the
most recent example of that.
But on the whole, I think it's actually worked out pretty well for them. I
think the mainline kernel has improved in the process, but I also suspect
that _their_ RT patches have also improved thanks to having to make the
work more palatable to people like me who don't care all that deeply about
their particular flavor of crazy.
And yeah, I still think the hard-RT people are mostly crazy.
So I can work with crazy people, that's not the problem. They just need to
_sell_ their crazy stuff to me using non-crazy arguments, and in small and
well-defined pieces. When I ask for killer features, I want them to lull
me into a safe and cozy world where the stuff they are pushing is actually
useful to mainline people _first_.
In other words, every new crazy feature should be hidden in a nice solid
"Trojan Horse" gift: something that looks _obviously_ good at first sight.
The fact that it may contain the germs for future features should be
hidden so well that not only is it not used as an argument ("Hey, look at
all those soldiers in that horse, imagine what you could do with them"),
it should also not be obvious from the source code ("Look at all those
hooks I sprinkled around, which aren't actually used by anything, but just
imagine what you could do with them").
Linus
On Mon, 2010-01-25 at 10:12 -0800, Linus Torvalds wrote:
> But on the whole, I think it's actually worked out pretty well for them. I
> think the mainline kernel has improved in the process, but I also suspect
> that _their_ RT patches have also improved thanks to having to make the
> work more palatable to people like me who don't care all that deeply about
> their particular flavor of crazy.
Actually this is an understatement. Every feature (and I do mean
_every_) that went from -rt into mainline, underwent 3 or more rewrites
before it was acceptable for mainline. And every time, the end result
made the -rt patch set better as a whole.
Not to mention, that a lot of the early stuff also cleaned up mainline.
You can't have Real-Time without having a clean kernel. And as you
stated, a lot of those patches to clean up the kernel, no one even knew
that the real reason was to help the -rt patch set. They were well
disguised Trojan horses.
Darn, it looks like you are onto our scheme.
-- Steve
On Mon, 25 Jan 2010, Steven Rostedt wrote:
> On Mon, 2010-01-25 at 10:12 -0800, Linus Torvalds wrote:
>
> > But on the whole, I think it's actually worked out pretty well for them. I
> > think the mainline kernel has improved in the process, but I also suspect
> > that _their_ RT patches have also improved thanks to having to make the
> > work more palatable to people like me who don't care all that deeply about
> > their particular flavor of crazy.
>
> Actually this is an understatement. Every feature (and I do mean
> _every_) that went from -rt into mainline, underwent 3 or more rewrites
> before it was acceptable for mainline. And every time, the end result
> made the -rt patch set better as a whole.
>
> Not to mention, that a lot of the early stuff also cleaned up mainline.
> You can't have Real-Time without having a clean kernel. And as you
> stated, a lot of those patches to clean up the kernel, no one even knew
> that the real reason was to help the -rt patch set. They were well
> disguised Trojan horses.
Tsss. Never admit such things.
> Darn, it looks like you are onto our scheme.
Which scheme ? The only Trojan horses in the kernel tree are in
drivers/char/tty_io.c which put Linus himself into
Linux-0.98.2 :)
tglx
On Mon, 2010-01-25 at 09:36 -0800, Linus Torvalds wrote:
> Upon the information that people are talking about magic new kernel
> interfaces to do fancy things. And talking about doing things with it that
> are simply not relevant for ptrace/strace.
Unfortunately ptrace does all that magic already (badly). People don't
just use it for (s)tracing syscalls, but also for tracing signals, for
single step debugging and poking at memory, register state, for process
jailing and virtualization (uml) through syscall emulation.
So when they are talking about these fancy things that is because that
is what ptrace gives them currently. And they hate it, because the
ptrace interface is such a pain to work with. And all these things don't
really work together. You cannot trace, emulate, debug, jail at the same
time.
And all these users have wishes to extend the current ptrace interface
mess. But nobody dares to extend ptrace in any direction because
fixing/cleaning up one of these use cases might break the others in
subtle and not so subtle ways. Which is why the utrace series of patches
is cleaning up all this stuff first.
Cheers,
Mark
* Thomas Gleixner <[email protected]> wrote:
> On Mon, 25 Jan 2010, Steven Rostedt wrote:
>
> > On Mon, 2010-01-25 at 10:12 -0800, Linus Torvalds wrote:
> >
> > > But on the whole, I think it's actually worked out pretty well for them.
> > > I think the mainline kernel has improved in the process, but I also
> > > suspect that _their_ RT patches have also improved thanks to having to
> > > make the work more palatable to people like me who don't care all that
> > > deeply about their particular flavor of crazy.
> >
> > Actually this is an understatement. Every feature (and I do mean _every_)
> > that went from -rt into mainline underwent 3 or more rewrites before it
> > was acceptable for mainline. And every time, the end result made the -rt
> > patch set better as a whole.
> >
> > Not to mention, that a lot of the early stuff also cleaned up mainline.
> > You can't have Real-Time without having a clean kernel. And as you stated,
> > a lot of those patches to clean up the kernel, no one even knew that the
> > real reason was to help the -rt patch set. They were well disguised Trojan
> > horses.
>
> Tsss. Never admit such things.
Here are four examples of recent kernel features:
- lockdep [1]
- ftrace [2]
- new-style generic mutexes and spin-mutexes [3]
- the new arch/x86 tree [4]
I suspect few would guess that all of these features were motivated by the -rt
kernel originally:
[1] lockdep started out as the 'track irqs-off sections' patches in -rt
[2] ftrace started out as -rt's latency tracer and logdev
[3] mutex.c was motivated by rtmutex.c
[4] arch-x86 was motivated by annoyance with needless porting of -rt
features from 32-bit to 64-bit x86 and back.
[ Nor would you normally guess that Linux itself was motivated by a guy
wanting to toy around with 32-bit x86 assembly ;-) ]
Various forms of craziness that motivate us don't really hurt, as long as the
process is rooted in reality. We can 'wish' for the crazier future stuff and
can help it indirectly, and sometimes it might even happen down the road - but
reality and common-sense utility is what controls.
And note that there's nothing dishonest about doing multi-purpose patches, as
long as the mainstream purpose isn't really just a decoy. When we decouple a
feature from -rt we usually forget its -rt purpose and the intermediate
for-mainstream forms aren't even useful for -rt - back-integration into -rt
comes at a later stage. This makes it doubly sure that it's all formed by
mainstream's need, not -rt's needs.
In the few cases where the -rt role is prominent for some weird reason we
declare it as such. It's the exception to the rule really - few useful kernel
features are single purpose. ( When they are then we are likely doing
something wrong. -rt _is_ a special case. )
Ingo
On Mon, 25 Jan 2010, Mark Wielaard wrote:
>
> And all these users have wishes to extend the current ptrace interface
> mess. But nobody dares to extend ptrace in any direction because
> fixing/cleaning up one of these use cases might break the others in
> subtle and not so subtle ways. Which is why the utrace series of patches
> is cleaning up all this stuff first.
I call bullshit.
You can clean up ptrace without introducing odd new interfaces and trying
to sell it as some revolutionary new kernel interface that can do
anything.
I also call bullshit on the "ptrace() is so horribly nasty" argument. Yes,
I've seen the code that uses ptrace in user space, and yes, it's nasty,
but it's invariably _not_ nasty so much because ptrace itself is nasty,
but because it's full of #ifdef so-and-so-os/so-and-so-arch, and the code
is never cleaned up.
There are a couple of obvious cases of ptrace being uglier-than-it-needs-
to-be. Like the traditional ptrace read/write interface being purely "word
at a time", and that clearly is not pretty. Several architectures already
do "copy range" kind of versions on it, though, so that's just a detail,
and if anybody wanted to clean it up, they could have.
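
To make the "word at a time" cost concrete, here is a toy model in Python.
It does not call ptrace at all: ToyTracee, peek_word and copy_range are
hypothetical names, and the 8-byte word size is an assumption. It only
counts how many round trips each style needs to read the same buffer.

```python
WORD = 8  # assumed word size in bytes (hypothetical, for illustration)


class ToyTracee:
    """Stand-in for a traced process: just a byte buffer plus a call counter."""

    def __init__(self, memory):
        self.memory = bytes(memory)
        self.calls = 0

    def peek_word(self, addr):
        # Models a classic "read one word" request: one call per word.
        self.calls += 1
        return self.memory[addr:addr + WORD]

    def copy_range(self, addr, length):
        # Models a "copy range" style request: one call for the whole span.
        self.calls += 1
        return self.memory[addr:addr + length]


def read_by_words(tracee, addr, length):
    """Read `length` bytes using only word-sized requests."""
    out = bytearray()
    for offset in range(0, length, WORD):
        out += tracee.peek_word(addr + offset)
    return bytes(out[:length])
```

Reading a 4096-byte page this way takes 512 calls to peek_word against a
single call to copy_range, which is the whole of the ugliness being
described: a cost and convenience issue, not a design flaw.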
The more fundamental problem is the use of signals (while at the same time
wanting to _trap_ non-ptrace signals), without any model for a "connection
state", which is why you can have only one tracer. But again, that's
largely a user interface issue, and apparently utrace does _nothing_ for
that problem at all.
So I do agree that ptrace is not a great interface. However: repeating
that statement over and over in _no_ way excuses some totally unrelated
code that doesn't have anything what-so-ever to do with the actual
problems of ptrace.
Linus
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Linus> No. There is absolutely _no_ reason to believe that gdb et al would ever
Linus> delete the ptrace interfaces anyway.
Yes, in GDB we approximately never delete anything.
Nevertheless, if the Linux kernel were to present a new user-space API,
and if it had an advantage over ptrace, then we would port GDB to use
it. There are other platforms where, IIRC, we now use some /proc thing
instead of ptrace.
There are definitely things we would like from such an API. Here's a
few I can think of immediately, there are probably others.
* Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
Internally we're already using a self-pipe to integrate this into
gdb's main loop. Relatedly, don't mess with the inferior's parentage.
* Support "displaced stepping" in the kernel; I think this would improve
performance when debugging in non-stop mode.
* Support some kind of breakpoint expression in the kernel; this would
improve performance of conditional breakpoints. Perhaps the existing
gdb agent expressions could be used.
Tom
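
For readers unfamiliar with the self-pipe trick mentioned in the first
item, a minimal sketch in Python follows. It is not gdb's code; the
function name and the exit code are made up for illustration. A SIGCHLD
handler writes one byte into a pipe, so the main loop can wait on a file
descriptor with select() rather than block in wait().

```python
import os
import select
import signal


def wait_for_child_via_self_pipe(exit_code=7, timeout=5.0):
    """Fork a child and learn of its exit by polling a pipe, not by blocking in wait()."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # a signal handler must never block

    def on_sigchld(signum, frame):
        # Turn the asynchronous signal into a readable byte on the pipe.
        try:
            os.write(write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe already full, so a wakeup is pending anyway

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately
        # An event loop could watch read_fd alongside its other descriptors.
        ready, _, _ = select.select([read_fd], [], [], timeout)
        if not ready:
            raise TimeoutError("no SIGCHLD notification arrived")
        os.read(read_fd, 1)  # drain the wakeup byte
        _, status = os.waitpid(pid, 0)  # reap; returns at once, child is dead
        return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    finally:
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGCHLD, restore)
        os.close(read_fd)
        os.close(write_fd)
```

Note that this still depends on a process-wide SIGCHLD handler and on the
child being a child of the caller, which is exactly what an fd handed out
by the kernel would avoid.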
On Mon, 25 Jan 2010, Tom Tromey wrote:
>
> There are definitely things we would like from such an API. Here's a
> few I can think of immediately, there are probably others.
>
> * Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
> Internally we're already using a self-pipe to integrate this into
> gdb's main loop. Relatedly, don't mess with the inferior's parentage.
As I kind of alluded to elsewhere, I heartily agree with this. The really
major design mistake of ptrace (as opposed to just various ugly corners)
is how it has no connection information, and that ends up being one of the
main reasons why you can't have two ptracers working on the same thing.
(There are other things that complicate that too, of course, like simply
just trying to manage various per-thread state like debug registers etc,
but that's a separate class of complications).
> * Support "displaced stepping" in the kernel; I think this would improve
> performance when debugging in non-stop mode.
Don't we already do that at least on x86? Just doing a single-step should
work on an instruction even if it has a breakpoint on it, because we set
the TF bit.
Or maybe I'm not understanding what displaced stepping means to you.
> * Support some kind of breakpoint expression in the kernel; this would
> improve performance of conditional breakpoints. Perhaps the existing
> gdb agent expressions could be used.
I suspect it might be reasonable to do simple expressions on breakpoints,
but not the kind of things gdb exports to users. IOW, maybe you could have
a single conditional on a single value (register or memory) associated
with an expression.
Regardless, internally to the kernel your two later issues are "details".
The "how to connect to the debuggee" is a much more fundamental issue, and
has the biggest design/interface impact. The other would likely just be
new ptrace command extensions that somebody would have to just implement
the grotty details on.
Linus
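
As an illustration of how small "a single conditional on a single value"
could be, here is a sketch in Python. Nothing like this exists in the
kernel or in ptrace; SimpleCondition, should_report and the register
names are hypothetical, and the tracee state is modelled as a plain dict.

```python
import operator
from dataclasses import dataclass

# The only operators this toy condition language supports.
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


@dataclass(frozen=True)
class SimpleCondition:
    """One comparison of one named value (register or memory cell) against a constant."""
    location: str
    op: str
    constant: int

    def holds(self, state):
        return COMPARISONS[self.op](state[self.location], self.constant)


def should_report(condition, state):
    """Report the breakpoint hit to the debugger only if the condition holds."""
    return condition is None or condition.holds(state)
```

With a condition such as SimpleCondition("rax", "==", 0), a hit where rax
is non-zero would be filtered without ever waking the debugger, which is
where the performance gain for conditional breakpoints would come from.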
On Tue, 26 Jan 2010, Renzo Davoli wrote:
>
> The solution is that everybody can code his/her optimized kernel/user
> interface for tracing in his/her kernel module, i.e. utrace.
I don't think people understand. That is simply not a "solution". That is
a PROBLEM. The thing you describe is an absolute disaster. Which is
exactly why I rant against it.
The last thing we want to have is "here, take this, and make your own
kernel module mess around it optimized for your particular crazy
scenario".
But every SINGLE post in this thread that has argued for utrace has argued
exactly this way.
Linus
Let me add my two euro-cents to this discussion.
Mark Wielaard <[email protected]>:
> Unfortunately ptrace does all that magic already (badly). People don't
> just use it for (s)tracing syscalls, but also for tracing signals, for
> single step debugging and poking at memory, register state, for process
> jailing and virtualization (uml) through syscall emulation.
> So when they are talking about these fancy things that is because that
> is what ptrace gives them currently. And they hate it, because the
> ptrace interface is such a pain to work with. And all these things don't
> really work together. You cannot trace, emulate, debug, jail at the same
> time.
I support Mark's words. I don't use ptrace for debugging/tracing and I
have experienced severe limitations of the ptrace interface.
(I have tried to post some extensions for ptrace to overcome some
constraints.... see my posts on ptrace_vm or ptrace_multi on LKML).
Oleg Nesterov, writing to Andrew Morton said:
> First of all, utrace makes other things possible. gdbstub,
> nondestructive core dump, uprobes, kmview, hopefully more. I didn't
> look at these projects closely, perhaps other people can tell more. As
> for their merge status, until utrace itself is merged it is very hard to
> develop them out of tree.
The list above also includes kmview, which is a creation of mine.
umview and kmview are partial virtual machines, processes running
in a [uk]mview machine can have their own view for the file system,
networking support, user-id, system-name, etc.
A [uk]mview machine virtualizes just what the user needs: the filesystem
or just a subtree/some subtrees or networking or define one/some
virtual devices, etc. The "view" provided by a [uk]mview machine can be
a composition of real resources (provided by the Linux kernel) and
virtual resources.
Each system call request gets hijacked to a module of [uk]mview when
it refers to a virtual resource. The request is forwarded to the kernel
otherwise.
umview is based on ptrace, kmview uses a kernel module based on utrace.
(umview is included in debian lenny (to sid), tutorial and manuals in
wiki.virtualsquare.org)
IMHO utrace is better than ptrace (or an optimized version of it):
1 - "Frank Ch. Eigler" wrote:
> At least one reason is that ptrace is single-usage-only, so for
> example you cannot concurrently debug & strace the same program.
- exactly. utrace allows multiple tracing engines, this means that kmview
machines can be nested (in a natural way, no extra code is needed for
this feature). In the same way strace/gdb can run on virtualized processes, too.
2 - kmview kernel module implements several optimizations
to minimize the number of requests forwarded to the kmview process
(the virtual machine monitor). kmview is just a module using the
utrace interface; prior attempts at an optimized umview required kernel patches.
Like kmview any other service requiring process tracing can include
specific optimizations in its own kernel module.
On the other hand, all these services could use the standardized utrace
interface for their optimizations, instead of asking for messy patches
to change code all around the kernel source.
3 - ptrace takes SIGSTOP/SIGCONT for its own management. Strace/gdb and
umview cannot be transparent for programs using these signals.
Oleg Nesterov talking about Ptrace said:
> Of course they can't use other interfaces, we don't have them. And
> without the new abstraction layer we will never have, I think.
I agree.
The following list includes the execution times I got in a recent test
(make vde-2, see http://www.cs.unibo.it/~renzo/view-os-lk2009.pdf)
plain kernel 22.7s,
kmview (no modules) 23.9s (+5.5%),
full kmview (modules loaded, all syscalls virtualized) 38.5s (+70%),
optimized umview 51.0s (+124%),
umview on vanilla kernel 75.7s (+233%).
utrace can be used to speedup virtualization (at least in my case
it worked in this way).
Performance can be useful for debugging but it is a main issue for
virtualization.
Kmview module provides optimizations to select the system call requests
depending on the syscall number, the pathnames or the file descriptors.
http://wiki.virtualsquare.org/index.php/KMview_module_interface_specifications
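
The kind of selection described above can be pictured with a toy router
in Python. This is not the kmview module interface (see the link above
for that); the prefix, the syscall set and the function name are all
hypothetical. It only shows the idea: most requests are decided cheaply
and never reach the user-space monitor.

```python
VIRTUAL_PREFIXES = ("/virtual/",)     # hypothetical virtualized subtree
WATCHED_SYSCALLS = {"open", "stat"}   # hypothetical syscalls of interest


def route(syscall, path):
    """Return 'monitor' if the request must reach the user-space monitor, else 'kernel'."""
    if syscall not in WATCHED_SYSCALLS:
        return "kernel"    # filtered by syscall: handled natively
    if not path.startswith(VIRTUAL_PREFIXES):
        return "kernel"    # real resource: forwarded unchanged
    return "monitor"       # virtual resource: handed to the monitor
```

In this sketch only a request such as route("open", "/virtual/disk")
pays for a context switch to the monitor; everything else stays on the
normal kernel path.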
Trying to add all the optimizations needed by different projects to ptrace is a
never-ending nightmare: the LKML will continue to receive patch proposals
for ptrace...
The solution is that everybody can code his/her optimized kernel/user
interface for tracing in his/her kernel module, i.e. utrace.
renzo
On Fri 2010-01-22 08:43:18, [email protected] wrote:
> On Fri, 22 Jan 2010 10:51:39 +0530, Ananth N Mavinakayanahalli said:
>
> > FWIW, Oleg's implementation of ptrace over utrace is 100% compatible
> > with legacy ptrace; gdb testsuite indicates that
> > (http://lkml.org/lkml/2009/12/21/98).
>
> No, that only proves it's compatible enough for gdb to not care. The problem
> is all those *other* packages that abuse ptrace in totally crackhead ways.
>
> (No, I can't name them - but ptrace is the sort of interface that almost
> encourages its use for things somewhere between crackhead and mad-scientist,
> so they're almost certainly out there.. WAY out there.. :)
strace, subterfugue, ltrace, ...? Plus various homegrown sandboxing tools...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, Jan 25, 2010 at 01:41:57PM -0800, Linus Torvalds wrote:
>
>
> On Mon, 25 Jan 2010, Tom Tromey wrote:
...
> > * Support "displaced stepping" in the kernel; I think this would improve
> > performance when debugging in non-stop mode.
>
> Don't we already do that at least on x86? Just doing a single-step should
> work on an instruction even if it has a breakpoint on it, because we set
> the TF bit.
>
> Or maybe I'm not understanding what displaced stepping means to you.
If Tom is referring to supporting single-stepping out of line, ie., not
putting back the original instruction at the bp location, yes, we
already support it on various architectures for kernel breakpoints,
through the kprobes infrastructure.
For userspace, there are more complications to take care of. We are
reworking a prototype based on community comments (see the long UBP/XOL
thread on lkml from a few days ago). Hopefully the userspace breakpoint
assistance layer will be generic enough for gdb to also take advantage
of, though the interface details need to be hashed out.
Ananth
Hi -
On Mon, Jan 25, 2010 at 02:05:54PM -0700, Tom Tromey wrote:
> [...]
> Nevertheless, if the Linux kernel were to present a new user-space API,
> and if it had an advantage over ptrace, then we would port GDB to use
> it. There are other platforms where, IIRC, we now use some /proc thing
> instead of ptrace.
>
> There are definitely things we would like from such an API. Here's a
> few I can think of immediately, there are probably others.
>
> * Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
> [...] Relatedly, don't mess with the inferior's parentage.
This is satisfied by the gdbstub prototype.
> * Support "displaced stepping" in the kernel [...]
I believe this is tantamount to hardware breakpoint support, which is
already present (via optional uprobes).
> * Support some kind of breakpoint expression in the kernel; this would
> improve performance of conditional breakpoints. Perhaps the existing
> gdb agent expressions could be used.
This is in the todo list.
And that "KILLER FEATURE" of running strace plus gdb on the same
process? It *already works* with the gdbstub, and unmodified strace +
gdb, thanks to utrace multiplexing process control. It is still
artificially restricted in many ways, but this sort of thing is ready
for testing:
% process &
[1] 9999
% strace -o FILE -p 9999 &
% gdb process
(gdb) target remote /proc/9999/gdb
(gdb) backtrace
(gdb) cont
(gdb) ^D
%
[process continues]
% cat FILE
[...]
% kill 9999
- FChE
On Mon, Jan 25, 2010 at 04:07:21PM -0800, Linus Torvalds wrote:
> On Tue, 26 Jan 2010, Renzo Davoli wrote:
> >
> > The solution is that everybody can code his/her optimized kernel/user
> > interface for tracing in his/her kernel module, i.e. utrace.
>
> I don't think people understand. That is simply not a "solution". That is
> a PROBLEM. The thing you describe is an absolute disaster. Which is
> exactly why I rant against it.
>
> The last thing we want to have is "here, take this, and make your own
> kernel module mess around it optimized for your particular crazy
> scenario".
>
> But every SINGLE post in this thread that has argued for utrace has argued
> exactly this way.
I haven't followed much of the utrace discussions, but my impression was
that utrace primarily is a cleanup effort, replacing "don't change it,
you might break it" code with a clean, well defined (and even documented)
implementation. To make it easier for people not familiar
with the low-level architecture details to experiment with
debugging stuff.
Two points to consider:
1. If you'd merge utrace + ptrace-on-utrace, but never anything else
which uses the utrace API, wouldn't it still be an improvement?
2. A well defined utrace API makes debugging code more hackable, thus more
likely that someone might come up with a brilliant killer debug
feature in the future. (This might sound lame, but there are already
a few people doing crazy things with utrace while I'm not aware
that people have done such experiments based on the current ptrace impl.)
BTW, the ptrace improvements discussed elsewhere in this thread
(like using an fd instead of signals/wait) are orthogonal
to utrace, no? IMHO it's a separate discussion.
Johannes
On Tue, 26 Jan 2010, Johannes Stezenbach wrote:
>
> 1. If you'd merge utrace + ptrace-on-utrace, but never anything else
> which uses the utrace API, wouldn't it still be an improvement?
I already said earlier that I'd be perfectly happy to merge utrace code,
as long as it was clear that I'm not merging a platform for crazy work.
IOW, the end result might be merging 99% of the code, but I want to set
peoples _expectations_ right. I'm not at all interested in merging stuff
that has various exported helper functions for people doing random things,
but I could happily merge stuff that cleans up internal implementation.
> 2. A well defined utrace API makes debugging code more hackable, thus more
> likely that someone might come up with a brilliant killer debug
> feature in the future.
I don't really agree.
Clean code makes things easier to improve, and maybe utrace cleans things
up. But defining new APIs makes me very worried, and quite frankly, the
last thing I ever want to see is a new interface that out-of-tree modules
start using for random hacking.
So I'd be much happier without the whole utrace kernel interface and
callbacks, and very much would want to avoid the whole issue of plugins.
I'd like to see ptrace improvements - not something else.
In other words, I'd much much rather keep the utrace thing _internal_ to
ptrace. If people have performance complaints about ptrace, let's look at
fixing those _as_such_, rather than look at new modules etc.
> BTW, the ptrace improvements discussed elsewhere in this thread
> (like using an fd instead of signals/wait) are orthogonal
> to utrace, no? IMHO it's a separate discussion.
Largely, yes. Tied together to some degree of course, but the whole issue
of code cleanup can be seen as a reasonably independent first step (while
moving to a fd-based interface should probably not be done without some
cleanup first, so they _are_ somewhat tied together).
Linus
On Tue, Jan 26, 2010 at 08:28:15AM -0800, Linus Torvalds wrote:
> I already said earlier that I'd be perfectly happy to merge utrace code,
> as long as it was clear that I'm not merging a platform for crazy work.
> IOW, the end result might be merging 99% of the code, but I want to set
> peoples _expectations_ right. I'm not at all interested in merging stuff
> that has various exported helper functions for people doing random things,
> but I could happily merge stuff that cleans up internal implementation.
> Clean code makes things easier to improve, and maybe utrace cleans things
> up. But defining new APIs makes me very worried, and quite frankly, the
> last thing I ever want to see is a new interface that out-of-tree modules
> start using for random hacking.
To be fair, Roland and Oleg did a lot of work on improving ptrace support
that was an offshoot of utrace. It would be great if the remaining
architectures would catch up on being converted to it and getting rid
of the existing hairy arch ptrace code as much as possible.
I'm still not really set on utrace either, but the in-kernel gdbstub
Frank has started could be a real killer if it ever gets done up to
a fully usable state. Whether it really requires all the utrace abstractions,
which seem a bit overdone, I'm not sure. Might be a better idea to try to
get uprobes and the gdbstub in without it and see how much of the
abstraction will be needed anyway as a fallout, just without exporting
them to modules and thus actually making them published APIs.
Tom Tromey <[email protected]> writes:
> * Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
> Internally we're already using a self-pipe to integrate this into
> gdb's main loop. Relatedly, don't mess with the inferior's parentage.
How would having a kernel based solution be better over your
user space simulation?
BTW there's the new signalfd() system call that might do it
(haven't checked if it works for SIGCHLD)
> * Support "displaced stepping" in the kernel; I think this would improve
> performance when debugging in non-stop mode.
Not sure what "displaced stepping" is exactly, but it
sounds like the branch tracing extensions that got added a
few releases ago? On modern Intel chips they give you a branch
buffer in memory.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, 26 Jan 2010, Andi Kleen wrote:
> Tom Tromey <[email protected]> writes:
>
> > * Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
> > Internally we're already using a self-pipe to integrate this into
> > gdb's main loop. Relatedly, don't mess with the inferior's parentage.
>
> How would having a kernel based solution be better over your
> user space simulation?
Oh, the reason we should do something in the kernel is that you really
can't do certain things with the ptrace() interface.
For example, think about how Wine and UML use ptrace - and then realize
that that makes it impossible to attach a debugger from the outside.
That's a real deficiency in ptrace - much more so than the fact that there
are some odd details (ie the whole "read/write a word at a time" is just a
quirky detail in comparison - not a fundamental problem).
> BTW there's the new signalfd() system call that might do it
> (haven't checked if it works for SIGCHLD)
No, you miss the point.
The problem isn't that you want to turn signals into a file descriptor
just because you like file descriptors.
The problem is that anything that is based on reparenting and signals is
fundamentally a "one parent only" kind of interface. See?
So the reason I think using an fd is a good idea is _not_ because gdb
already uses an fd internally, but because it gives you a "connection"
between the debugger and debuggee that is not fundamentally limited to a
single controller.
(It doesn't have to be a file descriptor, of course, but could be any kind
of other model that allows multiple connections. It's just that in unix
terms, using a file descriptor as the "cookie" for the connection is a
very natural model. So the important part isn't the file descriptor
itself, it's the model you could build).
Linus
> The problem is that anything that is based on reparenting and signals is
> fundamentally a "one parent only" kind of interface. See?
I was actually thinking about that before I wrote the email.
But when I did that I couldn't come up with a good scenario
where multiple debuggers actually make sense. In a sense
being a debugger is really a very "intimate" thing for a process. Do you
really want to have multiple of them messing with each other?
If yes how would they know what to touch and what not?
The only thing I could think of was "user space virtualization"
(like old UML) together with a real debugger, but frankly
these solutions all seemed like big race conditions to me
anyways and should be better done in the kernel or below it,
so I have a hard time taking them seriously.
Can you think of any scenario where multiple debuggers
on a process make sense?
-Andi
On 01/26, Linus Torvalds wrote:
>
> The problem is that anything that is based on reparenting and signals is
> fundamentally a "one parent only" kind of interface. See?
Indeed. signals + do_wait() is a horrible model.
> So the reason I think using an fd is a good idea is _not_ because gdb
> already uses an fd internally, but because it gives you a "connection"
> between the debugger and debuggee that is not fundamentally limited to a
> single controller.
>
> (It doesn't have to be a file descriptor, of course, but could be any kind
> of other model that allows multiple connections.
Yes.
But then we need something which represents this connection in kernel:
utrace_engine. Then we need something which allows multiple tracers to
cooperate. Just for example, one tracer wants to resume the tracee,
another tracer wants the tracee to be stopped. Utrace does this. And,
since we should preserve the current ptrace, the tracers should cooperate
with ptrace too.
IOW, this quickly leads to the new abstraction layer, I think. And of
course it is possible to implement this new model on top of utrace.
Yes, utrace itself comes with utrace_engine_ops vector to implement
"whatever you like", perhaps you dislike this part.
Oleg.
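
The cooperation problem in this message (one tracer wants to resume the
tracee while another wants it stopped) can be modelled in a few lines of
Python. This is a toy, not the utrace API: ToyTracee and ToyEngine are
hypothetical names, and the policy shown, that the tracee runs only while
no attached engine wants it stopped, is an assumption about how such
conflicts would be resolved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracee:
    """Toy tracee: it runs only while no attached engine wants it stopped."""
    stop_requests: set = field(default_factory=set)

    def attach(self, name):
        # Each attach creates an independent "connection" to the tracee.
        return ToyEngine(name, self)

    @property
    def running(self):
        return not self.stop_requests


@dataclass
class ToyEngine:
    """One tracer's connection; it can only add or drop its own stop request."""
    name: str
    tracee: ToyTracee

    def stop(self):
        self.tracee.stop_requests.add(self.name)

    def resume(self):
        self.tracee.stop_requests.discard(self.name)
```

With engines for "gdb" and "strace" attached, a resume from strace has no
effect while gdb still holds its stop; the tracee runs again only after
gdb resumes as well. Per-connection state of this kind is what a single
parent plus signals cannot express.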
On 01/26, Andi Kleen wrote:
>
> But when I did that I couldn't come up with a good scenario
> where multiple debuggers actually make sense. In a sense
> being a debugger is really a very "intimate" thing for a process. Do you
> really want to have multiple of them messing with each other?
>
> If yes how would they know what to touch and what not?
Yes, multiple debuggers can confuse each other if they change
the state of debuggee simultaneously. The user should do this ;)
> Can you think of any scenario where multiple debuggers
> on a process make sense?
Simple example. Try to debug/strace strace or gdb itself. Not trivial,
you can't attach to strace's tracees. Recently I spent 2 days trying to
understand why strace -f hangs. I was able to attach to strace, but
I wasn't able to see what its tracees do.
And, it was not possible to even trace strace until it hangs, with
ptrace the tracee (strace) must stop to report the event and this
shadowed the race.
Oleg.
> Simple example. Try to debug/strace strace or gdb itself. Not trivial,
> you can't attach to strace's tracees. Recently I spent 2 days trying to
> understand why strace -f hangs. I was able to attach to strace, but
> I wasn't able to see what its tracees do.
But what would the semantics be inside the tracees even if you could?
> And, it was not possible to even trace strace until it hangs, with
> ptrace the tracee (strace) must stop to report the event and this
> shadowed the race.
"Shadowing the race" was the second surname of strace I thought anyways @)
Basically if you care about races never use strace in the first place.
-Andi
--
[email protected] -- Speaking for myself only.
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Tom> * Support "displaced stepping" in the kernel; I think this would improve
Tom> performance when debugging in non-stop mode.
Linus> Don't we already do that at least on x86?
I don't know. If it does, and gdb does not yet use that, then that
would be worth changing.
Linus> Or maybe I'm not understanding what displaced stepping means to you.
In non-stop mode (where you can stop one thread but leave the others
running), gdb wants to have the breakpoints always inserted. So,
something must emulate the displaced instruction.
Tom
Tom> * Use an fd, not SIGCHLD+wait, to report inferior state changes to gdb.
Tom> Internally we're already using a self-pipe to integrate this into
Tom> gdb's main loop. Relatedly, don't mess with the inferior's parentage.
Andi> How would having a kernel based solution be better over your
Andi> user space simulation?
Signals and wait are a pain because if we want to use some random
library in gdb, there might be conflicts. This is true even if we use
signalfd. An fd-for-debugging does not have this problem. This matters
more now that we're letting people script gdb in python.
Tom
On 01/26, Andi Kleen wrote:
>
> > Simple example. Try to debug/strace strace or gdb itself. Not trivial,
> > you can't attach to strace's tracees. Recently I spent 2 days trying to
> > understand why strace -f hangs. I was able to attach to strace, but
> > I wasn't able to see what its tracees do.
>
> But what would the semantics be inside the tracees even if you could?
In this particular case, all I needed was something like "gdb -p" to
attach to the tracee, see the backtrace and detach.
> > And, it was not possible to even trace strace until it hangs, with
> > ptrace the tracee (strace) must stop to report the event and this
> > shadowed the race.
>
> "Shadowing the race" was the second surname of strace I thought anyways @)
> Basically if you care about races never use strace in the first place.
Yes. And utrace doesn't require the tracee to be stopped to report the
event ;) Yes, yes, utrace can't "fix" strace in this sense automatically,
but still.
Oleg.
On Tue, 26 Jan 2010, Tom Tromey wrote:
>
> In non-stop mode (where you can stop one thread but leave the others
> running), gdb wants to have the breakpoints always inserted. So,
> something must emulate the displaced instruction.
I'm almost totally uninterested in breakpoints that actually re-write
instructions. It's impossible to do that efficiently and well, especially
in threaded environments.
So if you do instruction rewriting, I can only say "that's your problem".
But using the hardware breakpoints should automatically DTRT, both wrt
threads _and_ wrt restarting. Sure, there's only a limited number of them,
so if somebody wants more than that they are kind of screwed, but that's
just how life is.
Linus
tromey wrote:
> [...]
> In non-stop mode (where you can stop one thread but leave the others
> running), gdb wants to have the breakpoints always inserted. So,
> something must emulate the displaced instruction.
This sounds like the sort of thing that kernel kprobes do, which the
uprobes patch does for userspace. The gdbstub prototype can use
uprobes for such "displaced" breakpoints, and single-step-out-of-line
to execute them on a few platforms like x86-*. This is already
prototyped / working. (gdbstub currently restricts itself to
single-threaded programs only, but that's another todo.)
- FChE
On Tue, 2010-01-26 at 15:37 -0800, Linus Torvalds wrote:
>
> On Tue, 26 Jan 2010, Tom Tromey wrote:
> >
> > In non-stop mode (where you can stop one thread but leave the others
> > running), gdb wants to have the breakpoints always inserted. So,
> > something must emulate the displaced instruction.
>
> I'm almost totally uninterested in breakpoints that actually re-write
> instructions. It's impossible to do that efficiently and well, especially
> in threaded environments.
>
> So if you do instruction rewriting, I can only say "that's your problem".
Right, so you're going to love uprobes, which does exactly that. The
current proposal is overwriting the target instruction with an INT3 and
injecting an extra vma into the target process's address space
containing the original instruction(s) and possible jumps back to the
old code stream.
I'm all in favor of not doing that extra vma and instead use stack or
TLS space, but then people complain about having to make that executable
(which is something I don't really mind, x86 had executable everything
for very long, and also, it's only so when debugging the thing anyway).
* Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-01-26 at 15:37 -0800, Linus Torvalds wrote:
> >
> > On Tue, 26 Jan 2010, Tom Tromey wrote:
> > >
> > > In non-stop mode (where you can stop one thread but leave the others
> > > running), gdb wants to have the breakpoints always inserted. So,
> > > something must emulate the displaced instruction.
> >
> > I'm almost totally uninterested in breakpoints that actually re-write
> > instructions. It's impossible to do that efficiently and well, especially
> > in threaded environments.
> >
> > So if you do instruction rewriting, I can only say "that's your problem".
>
> Right, so you're going to love uprobes, which does exactly that. The current
> proposal is overwriting the target instruction with an INT3 and injecting an
> extra vma into the target process's address space containing the original
> instruction(s) and possible jumps back to the old code stream.
>
> I'm all in favor of not doing that extra vma and instead use stack or TLS
> space, but then people complain about having to make that executable (which
> is something I don't really mind, x86 had executable everything for very
> long, and also, it's only so when debugging the thing anyway).
I think the best solution for user probes (by far) is to use a simplified
in-kernel instruction emulator for the few commonly probed instructions. (Kprobes
already partially decodes x86 instructions to make it safe to apply
accelerated probes and there's other decoding logic in the kernel too.)
The design and practical advantages are numerous:
- People want to probe their function prologues most of the time ...
a single INT3 there will in most cases just hit the initial stack
allocation and that's it. We could get quite good coverage (and very fast
emulation) for the common case in not too much code - and much of that code
we already have available. No re-trapping, no extra instruction patching
and complex maintenance of trampolines.
- It's as transparent as it gets - no user-space trampoline or other visible
state that modifies behavior or can be stomped upon by user-space bugs.
- Lightweight and simple probe insertion: no weird setup sequence needing the
stopping of all tasks to install the trampoline. We just add the INT3 and
off you go.
- Emulation is evidently thread-safe, SMP-safe, etc. as it only acts on
task local state.
- The points we can probe are never truly limited as it's all freely
upscalable: if you cannot probe an instruction you want to probe today,
extend the emulator. Deny the rest. _All_ versions of uprobes code i've
seen so far already restrict the probe-compatible instruction set:
RIP-relative instructions are excluded on 64-bit for example.
- Emulation has the _least_ semantical side effects as we really execute
'that' instruction - not some other instruction put elsewhere into a
special vma or into the process/thread stack, or some special in-kernel
trampoline, etc.
- Emulation can be very fast for the common case as well. Nobody will probe
weird, complex instructions. They will use 'perf probe' to insert probes
into their functions 90% of the time ...
- FPU and complex ops and pagefault emulation is not really what i'd expect
to be necessary for simple probing - but it _can_ be added by people who
care about it, if they so wish.
Such a scheme would be _far_ more preferable from a maintenance POV as well,
as the initial code will be small, and we can extend it gradually. All the
other proposals are complex 'all or nothing' schemes with no flexibility for
complexity at all.
Thanks,
Ingo
On Wed, 27 Jan 2010, Peter Zijlstra wrote:
>
> Right, so you're going to love uprobes, which does exactly that. The
> current proposal is overwriting the target instruction with an INT3 and
> injecting an extra vma into the target process's address space
> containing the original instruction(s) and possible jumps back to the
> old code stream.
Just out of interest, how does it handle the threading issue?
Last I saw, at least some CPU people were _very_ nervous about overwriting
instructions if another CPU might be just about to execute them.
Even the "overwrite only the first byte with 'int3'" made them go "umm, I
need to talk to some core CPU people to see if that's ok". They mumble
about possible CPU errata, I$ coherency, instruction retry etc.
I realize kprobes does this very thing, but kprobes is esoteric stuff and
doesn't have much choice. In user space, you _could_ do the modification
on a different physical page and then just switch the page table entry
instead, and not get into the whole D$/I$ coherency thing at all.
Linus
On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote:
>
> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
> >
> > Right, so you're going to love uprobes, which does exactly that. The
> > current proposal is overwriting the target instruction with an INT3 and
> > injecting an extra vma into the target process's address space
> > containing the original instruction(s) and possible jumps back to the
> > old code stream.
>
> Just out of interest, how does it handle the threading issue?
>
> Last I saw, at least some CPU people were _very_ nervous about overwriting
> instructions if another CPU might be just about to execute them.
>
> Even the "overwrite only the first byte with 'int3'" made them go "umm, I
> need to talk to some core CPU people to see if that's ok". They mumble
> about possible CPU errata, I$ coherency, instruction retry etc.
>
> I realize kprobes does this very thing, but kprobes is esoteric stuff and
> doesn't have much choice. In user space, you _could_ do the modification
> on a different physical page and then just switch the page table entry
> instead, and not get into the whole D$/I$ coherency thing at all.
Right, so there's two aspects:
1) concurrency when inserting the probe
2) concurrency when hitting the probe
1) used to be dealt with by using utrace to stop all threads in the
process and then writing the instruction. I suggested to CoW the page,
modify the instruction, set the pagetable and flush tlbs at full speed
-- the very thing you suggest here.
2) the traditional approach (and the Intel arch manual describes this) is to
replace the instruction, single step it, and write the probe back. This
is racy for multi-threading. The current uprobes stuff solves this by
doing single-step-out-of-line (XOL).
XOL injects a new vma into the target process and puts the old
instruction there, then it single steps on the new location, leaving the
original site with INT3.
This doesn't work for things like RIP relative instructions, so uprobes
considers them un-probable.
Also, I myself really object to inserting a vma in a running process,
it's like a landlord: sure, he has the key, but he won't come in and poke
through your things.
The alternative is to place the instruction in TLS or stack space, since
each thread can only have a single trap at a time, you only need space
for 1 instruction (plus a possible jump out to the original site). There
is the 'problem' of marking the TLS/stack executable when being probed.
Then there is the whole emulation angle, the uprobes people basically
say it's too much effort to write an x86 emulator.
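To make the insertion scheme in 1) concrete, here is a minimal user-space sketch in C of "copy the page, patch the copy, then switch the mapping". It is only a simulation under stated assumptions: struct fake_pte, SIM_PAGE_SIZE and insert_probe_by_page_switch() are hypothetical names, and an atomic pointer stands in for the real page table entry. The actual kernel work (PTE update, TLB flush) is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096   /* hypothetical page size for the simulation */
#define INT3_OPCODE   0xCC   /* x86 one-byte breakpoint instruction */

/* Hypothetical stand-in for a page table entry: it only records which
 * "physical page" currently backs one virtual page of text. */
struct fake_pte {
    _Atomic(uint8_t *) page;
};

/* Insert a breakpoint at 'offset' without writing to the page that other
 * threads may be executing from: copy the page, patch the copy, then switch
 * the mapping. On success returns 0, stores the displaced byte in *saved and
 * the untouched old page in *old_page. Returns -1 on failure. */
static int insert_probe_by_page_switch(struct fake_pte *pte, size_t offset,
                                       uint8_t *saved, uint8_t **old_page)
{
    assert(pte && saved && old_page);
    if (offset >= SIM_PAGE_SIZE)
        return -1;

    uint8_t *cur = atomic_load(&pte->page);
    uint8_t *copy = malloc(SIM_PAGE_SIZE);
    if (!copy)
        return -1;

    memcpy(copy, cur, SIM_PAGE_SIZE);   /* the CoW step */
    *saved = copy[offset];
    copy[offset] = INT3_OPCODE;         /* patch only the private copy */

    /* In the kernel this would be a PTE update followed by a TLB flush. */
    atomic_store(&pte->page, copy);
    *old_page = cur;
    return 0;
}
```

The point of the sketch is that a reader of the old page never observes a half-written instruction: it sees either the old page or the fully patched copy.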
On Wed, 2010-01-27 at 11:55 +0100, Peter Zijlstra wrote:
> Right, so there's two aspects:
>
> 1) concurrency when inserting the probe
> 2) concurrency when hitting the probe
>
> 1) used to be dealt with by using utrace to stop all threads in the
> process and then writing the instruction. I suggested to CoW the page,
> modify the instruction, set the pagetable and flush tlbs at full speed
> -- the very thing you suggest here.
Also, since executable maps are typically MAP_PRIVATE, you have to CoW
anyway in order to modify it and I would exclude MAP_SHARED from being
probable because then the modification could seep through into whatever
was backing that thing.
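The MAP_PRIVATE versus MAP_SHARED difference can be seen from plain user space. The following is a small sketch, assuming a Linux/POSIX system with a writable temporary directory; file_byte_after_patch() is a hypothetical helper, and the byte 0x55 merely stands in for a probed opcode.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a one-byte scratch file with the given flags (MAP_PRIVATE or
 * MAP_SHARED), overwrite the byte through the mapping, and return the byte
 * the backing file holds afterwards, or -1 on error. */
static int file_byte_after_patch(int map_flags, unsigned char patch)
{
    int result = -1;
    unsigned char original = 0x55;   /* stands in for the probed opcode */
    unsigned char on_disk = 0;
    unsigned char *p;
    FILE *f;
    int fd;

    assert(map_flags == MAP_PRIVATE || map_flags == MAP_SHARED);

    f = tmpfile();
    if (!f)
        return -1;
    fd = fileno(f);
    if (write(fd, &original, 1) != 1) {
        fclose(f);
        return -1;
    }

    p = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }

    p[0] = patch;                    /* private: CoW; shared: hits the file */
    if (map_flags == MAP_SHARED)
        msync(p, 1, MS_SYNC);        /* make sure the write reached the file */

    if (pread(fd, &on_disk, 1, 0) == 1)
        result = on_disk;

    munmap(p, 1);
    fclose(f);
    return result;
}
```

With MAP_PRIVATE the file keeps its original byte, because the write lands on a private copy of the page. With MAP_SHARED the patched byte shows up in the file, which is the "seeping through" concern.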
On Wed, 27 Jan 2010, Peter Zijlstra wrote:
>
> Right, so there's two aspects:
>
> 1) concurrency when inserting the probe
That's the one I worried about. Stopping all threads will fix it,
obviously at a disastrous performance cost, but what do I care? As noted,
there are ways to do it safely with TLB switching, so it's fixable.
> 2) concurrency when hitting the probe
Yeah, I didn't worry about this part, since the only solution is the
out-of-line one, and I don't much care how the memory gets allocated for
it. Inserting a whole new vma seems pretty drastic, but compared to
stopping all threads, it's a small thing.
Linus
On Wed, Jan 27, 2010 at 11:55:16AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote:
> >
> > On Wed, 27 Jan 2010, Peter Zijlstra wrote:
> > >
> > > Right, so you're going to love uprobes, which does exactly that. The
> > > current proposal is overwriting the target instruction with an INT3 and
> > > injecting an extra vma into the target process's address space
> > > containing the original instruction(s) and possible jumps back to the
> > > old code stream.
> >
> > Just out of interest, how does it handle the threading issue?
> >
> > Last I saw, at least some CPU people were _very_ nervous about overwriting
> > instructions if another CPU might be just about to execute them.
> >
> > Even the "overwrite only the first byte with 'int3'" made them go "umm, I
> > need to talk to some core CPU people to see if that's ok". They mumble
> > about possible CPU errata, I$ coherency, instruction retry etc.
> >
> > I realize kprobes does this very thing, but kprobes is esoteric stuff and
> > doesn't have much choice. In user space, you _could_ do the modification
> > on a different physical page and then just switch the page table entry
> > instead, and not get into the whole D$/I$ coherency thing at all.
>
> Right, so there's two aspects:
>
> 1) concurrency when inserting the probe
> 2) concurrency when hitting the probe
>
> 1) used to be dealt with by using utrace to stop all threads in the
> process and then writing the instruction. I suggested to CoW the page,
> modify the instruction, set the pagetable and flush tlbs at full speed
> -- the very thing you suggest here.
>
> 2) so traditionally (and the intel arch manual describes this) is to
> replace the instruction, single step it, and write the probe back. This
> is racy for multi-threading. The current uprobes stuff solves this by
> doing single-step-out-of-line (XOL).
>
> XOL injects a new vma into the target process and puts the old
> instruction there, then it single steps on the new location, leaving the
> original site with INT3.
>
> This doesn't work for things like RIP relative instructions, so uprobes
> considers them un-probable.
Probing RIP-relative instructions works just fine; there are fixups that
take care of it.
> Also, I myself really object to inserting a vma in a running process,
> it's like a landlord: sure, he has the key, but he won't come in and poke
> through your things.
>
> The alternative is to place the instruction in TLS or stack space, since
> each thread can only have a single trap at a time, you only need space
> for 1 instruction (plus a possible jump out to the original site). There
> is the 'problem' of marking the TLS/stack executable when being probed.
>
> Then there is the whole emulation angle, the uprobes people basically
> say it's too much effort to write an x86 emulator.
We don't need to write one. I don't know how easy it is to make the kvm
emulator less kvm-centric (vcpus, kvm_context, etc). Avi?
Ananth
* Linus Torvalds <[email protected]> [2010-01-27 02:43:39]:
>
>
> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
> >
> > Right, so you're going to love uprobes, which does exactly that. The
> > current proposal is overwriting the target instruction with an INT3 and
> > injecting an extra vma into the target process's address space
> > containing the original instruction(s) and possible jumps back to the
> > old code stream.
>
> Just out of interest, how does it handle the threading issue?
I am not sure why threading would be an issue with XOL, since all
threads of a process have access to the XOL VMA. That is, the XOL VMA
is a per-process VMA that gets attached to the process address space
only when we hit the first breakpoint.
We reserve a slot for each breakpoint in the XOL VMA, whenever the trap
is hit, we jump to the corresponding slot, single step and jump back
after necessary fix-ups.
We have been able to use this approach in multithreaded applications.
However if you see any issues, can you please let us know?
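For readers unfamiliar with the slot scheme, here is a minimal user-space sketch in C of one slot per breakpoint. It is not the uprobes code: struct xol_area, struct xol_slot and the xol_*() helpers are hypothetical names, "addresses" are offsets into a byte array, and only the simple non-branching case is handled when computing where to resume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XOL_NSLOTS     8
#define XOL_SLOT_BYTES 16     /* an x86 instruction is at most 15 bytes */
#define INT3_OPCODE    0xCC

struct xol_slot {
    int     used;
    size_t  probe_offset;           /* where the INT3 was written */
    size_t  insn_len;
    uint8_t insn[XOL_SLOT_BYTES];   /* copy of the displaced instruction */
};

struct xol_area {
    struct xol_slot slots[XOL_NSLOTS];
};

/* Return the slot index for a trap at 'offset', or -1 if none. */
static int xol_find_slot(const struct xol_area *area, size_t offset)
{
    for (int i = 0; i < XOL_NSLOTS; i++)
        if (area->slots[i].used && area->slots[i].probe_offset == offset)
            return i;
    return -1;
}

/* Reserve a slot, copy the original instruction into it and leave an INT3
 * at the probed site. Returns the slot index, or -1 if nothing is free. */
static int xol_add_breakpoint(struct xol_area *area, uint8_t *text,
                              size_t text_len, size_t offset, size_t insn_len)
{
    assert(area && text);
    if (insn_len == 0 || insn_len > XOL_SLOT_BYTES ||
        offset >= text_len || insn_len > text_len - offset)
        return -1;

    int existing = xol_find_slot(area, offset);
    if (existing >= 0)
        return existing;            /* already probed: reuse the slot */

    for (int i = 0; i < XOL_NSLOTS; i++) {
        struct xol_slot *s = &area->slots[i];
        if (s->used)
            continue;
        s->used = 1;
        s->probe_offset = offset;
        s->insn_len = insn_len;
        memcpy(s->insn, text + offset, insn_len);
        text[offset] = INT3_OPCODE; /* original site keeps the INT3 */
        return i;
    }
    return -1;
}

/* After single-stepping a non-branching instruction in its slot, execution
 * resumes right behind the probed instruction in the original text. */
static size_t xol_resume_offset(const struct xol_slot *s)
{
    return s->probe_offset + s->insn_len;
}
```

Because the original site always holds the INT3 and each thread steps a private copy in the slot, no thread can slip past the probe while another one is handling it.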
>
> Last I saw, at least some CPU people were _very_ nervous about overwriting
> instructions if another CPU might be just about to execute them.
>
> Even the "overwrite only the first byte with 'int3'" made them go "umm, I
> need to talk to some core CPU people to see if that's ok". They mumble
> about possible CPU errata, I$ coherency, instruction retry etc.
That's exactly why we waited for threads to quiesce before inserting and
deleting the breakpoints. However we were advised on lkml that there are
better ways to insert/delete breakpoints without quiescing by adjusting
the page table entries similar to what you said just below. And we are
working on switching to the page table entry solution.
>
> I realize kprobes does this very thing, but kprobes is esoteric stuff and
> doesn't have much choice. In user space, you _could_ do the modification
> on a different physical page and then just switch the page table entry
> instead, and not get into the whole D$/I$ coherency thing at all.
>
> Linus
>
--
Thanks and Regards
Srikar
On Wed, 2010-01-27 at 16:35 +0530, Ananth N Mavinakayanahalli wrote:
> Probing RIP-relative instructions work just fine; there are fixups that
> take care of it.
Ah my bad then, it was my understanding you simply bailed on those.
Just for my information, how large are the replacement sequences?
On Wed, Jan 27, 2010 at 12:08:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-27 at 16:35 +0530, Ananth N Mavinakayanahalli wrote:
> > Probing RIP-relative instructions work just fine; there are fixups that
> > take care of it.
>
> Ah my bad then, it was my understanding you simply bailed on those.
>
> Just for my information, how large are the replacement sequences?
The RIP relative instruction is transformed into indirect addressing
mode using a scratch register.
For details, see http://marc.info/?l=linux-kernel&m=126401936114639&w=2.
Ananth
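The arithmetic behind the fixup can be sketched in a few lines of C. This is only an illustration of why a copied RIP-relative instruction needs help, not the code from the patch referenced above; rip_relative_target() is a hypothetical helper and the addresses in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Effective address of a RIP-relative operand: the address of the *next*
 * instruction plus the signed 32-bit displacement. */
static uint64_t rip_relative_target(uint64_t insn_addr, size_t insn_len,
                                    int32_t disp32)
{
    assert(insn_len > 0);
    return insn_addr + insn_len + (uint64_t)(int64_t)disp32;
}
```

For example, a 7-byte instruction at 0x400500 with displacement 0x200 refers to 0x400707. Run unmodified from a slot at, say, 0x7f0000001000, the same bytes would refer to 0x7f0000001207 instead. Loading the original target (0x400707) into a scratch register and rewriting the operand to go through that register makes the copy touch the same memory as the original would have.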
[ Added Arjan ]
On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote:
>
> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
> >
> > Right, so you're going to love uprobes, which does exactly that. The
> > current proposal is overwriting the target instruction with an INT3 and
> > injecting an extra vma into the target process's address space
> > containing the original instruction(s) and possible jumps back to the
> > old code stream.
>
> Just out of interest, how does it handle the threading issue?
>
> Last I saw, at least some CPU people were _very_ nervous about overwriting
> instructions if another CPU might be just about to execute them.
I think the issue was that ring 0 was never meant to do that, whereas
ring 3 does it all the time. Doesn't the dynamic library modify its
text?
-- Steve
>
> Even the "overwrite only the first byte with 'int3'" made them go "umm, I
> need to talk to some core CPU people to see if that's ok". They mumble
> about possible CPU errata, I$ coherency, instruction retry etc.
>
> I realize kprobes does this very thing, but kprobes is esoteric stuff and
> doesn't have much choice. In user space, you _could_ do the modification
> on a different physical page and then just switch the page table entry
> instead, and not get into the whole D$/I$ coherency thing at all.
>
> Linus
On Wed, Jan 27, 2010 at 03:04:58AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
> >
> > Right, so there's two aspects:
> >
> > 1) concurrency when inserting the probe
>
> That's the one I worried about. Stopping all threads will fix it,
> obviously at a disastrous performance cost, but what do I care? As noted,
> there are ways to do it safely with TLB switching, so it's fixable.
That said, inserting a probe is supposed to be a pretty rare
operation, stopping all threads in a process shouldn't be
painful for this aspect.
On 01/27/2010 05:59 AM, Steven Rostedt wrote:
> [ Added Arjan ]
>
> On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote:
>>
>> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
>>>
>>> Right, so you're going to love uprobes, which does exactly that. The
>>> current proposal is overwriting the target instruction with an INT3 and
>>> injecting an extra vma into the target process's address space
>>> containing the original instruction(s) and possible jumps back to the
>>> old code stream.
>>
>> Just out of interest, how does it handle the threading issue?
>>
>> Last I saw, at least some CPU people were _very_ nervous about overwriting
>> instructions if another CPU might be just about to execute them.
>
> I think the issue was that ring 0 was never meant to do that, where as,
> ring 3 does it all the time. Doesn't the dynamic library modify its
> text?
>
No, it has nothing to do with ring. It has to do with modifying code
that another CPU could be executing at the same time, and with modifying
code on the same processor through another virtual alias (they are
different issues.) The same issues apply regardless of the CPL of the
processor.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Wed, 2010-01-27 at 09:42 -0800, H. Peter Anvin wrote:
> On 01/27/2010 05:59 AM, Steven Rostedt wrote:
> > I think the issue was that ring 0 was never meant to do that, where as,
> > ring 3 does it all the time. Doesn't the dynamic library modify its
> > text?
> >
>
> No, it has nothing to do with ring. It has to do with modifying code
> that another CPU could be executing at the same time, and with modifying
> code on the same processor through another virtual alias (they are
> different issues.) The same issues apply regardless of the CPL of the
> processor.
Thanks for clarifying.
-- Steve
On 01/27/2010 02:43 AM, Linus Torvalds wrote:
>
>
> On Wed, 27 Jan 2010, Peter Zijlstra wrote:
>>
>> Right, so you're going to love uprobes, which does exactly that. The
>> current proposal is overwriting the target instruction with an INT3 and
>> injecting an extra vma into the target process's address space
>> containing the original instruction(s) and possible jumps back to the
>> old code stream.
>
> Just out of interest, how does it handle the threading issue?
>
> Last I saw, at least some CPU people were _very_ nervous about overwriting
> instructions if another CPU might be just about to execute them.
>
> Even the "overwrite only the first byte with 'int3'" made them go "umm, I
> need to talk to some core CPU people to see if that's ok". They mumble
> about possible CPU errata, I$ coherency, instruction retry etc.
>
We actually went through a review of that here at Intel. We do not yet
have an *official* answer (in order for us to have that we have to have
it approved by the architecture committee and published in the SDM), but
to the best of our current knowledge (and I'm allowed to say this) the
int3 method followed by global IPIs should be safe for modifying *one
(atomic) instruction*. This is a specific case of a more general rule,
but I don't want to disclose the whole rule until it has been officially
approved.
> I realize kprobes does this very thing, but kprobes is esoteric stuff and
> doesn't have much choice. In user space, you _could_ do the modification
> on a different physical page and then just switch the page table entry
> instead, and not get into the whole D$/I$ coherency thing at all.
On the more general rule of interpretation: I'm really concerned about
having a bunch of partially-capable x86 interpreters all over the
kernel. x86 is *hard* to emulate, and it will only get harder as the
architecture evolves.
-hpa
On Wed, 2010-01-27 at 09:54 +0100, Ingo Molnar wrote:
...
> I think the best solution for user probes (by far) is to use a simplified
> in-kernel instruction emulator for the few commonly probed instructions. (Kprobes
> already partially decodes x86 instructions to make it safe to apply
> accelerated probes and there's other decoding logic in the kernel too.)
>
> The design and practical advantages are numerous:
>
> - People want to probe their function prologues most of the time ...
> a single INT3 there will in most cases just hit the initial stack
> allocation and that's it.
Yes, emulating "push %ebp" would buy us a lot of coverage for a lot of
apps on x86 (but see below**). Even there, though, we'd have to address
the page fault we'd occasionally get when extending the stack vma.
> We could get quite good coverage (and very fast
> emulation) for the common case in not too much code - and much of that code
> we already have available. No re-trapping,
As previously discussed, boosting would also get rid of the single-step
trap for most instructions.
> no extra instruction patching
x86_64 rip-relative instructions are the only ones we alter.
> and complex maintenance of trampolines.
>
> - It's as transparent as it gets - no user-space trampoline or other visible
> state that modifies behavior or can be stomped upon by user-space bugs.
The XOL vma isn't writable from user space, so I can't think of how it
could be clobbered merely by a stray memory reference. Yes, it's a vma
that the unprobed app would never have; and yes, a malicious app or
kernel module could remove it or alter the protection and scribble on
it. We don't try to defend the app against such malicious attacks, but
we do our best to ensure that the kernel side handles such attacks
gracefully.
>
> - Lightweight and simple probe insertion: no weird setup sequence needing the
> stopping of all tasks to install the trampoline. We just add the INT3 and
> off you go.
FWIW, we don't stop all threads to set up or extend the XOL vma, which
is typically a one-time event. We just grab a mutex, in case multiple
threads hit previously-unhit probepoints simultaneously, and
simultaneously decide that the XOL area needs to be created or extended.
>
> - Emulation is evidently thread-safe, SMP-safe, etc. as it only acts on
> task local state.
The posted uprobes implementation is, so far as we can tell through code
inspection and testing, also thread-safe and SMP-safe.
>
> - The points we can probe are never truly limited as it's all freely
> upscalable: if you cannot probe an instruction you want to probe today,
> extend the emulator.
I don't see how ripping out existing support for almost* the entire
instruction set, and then putting it back instruction by instruction,
patch by patch, is a win.
Even if we add emulation, it seems sensible to keep the XOL approach as
a backup to handle instructions that aren't yet emulated (and
architectures that don't yet have emulators). That way, if you don't
probe any unemulated instructions, the XOL vma is never created.
> Deny the rest. _All_ versions of uprobes code i've
> seen so far already restricts the probe-compatible instruction set:
*Yes, we currently decline to probe some instructions that look
troublesome and we haven't taken the time to test. These include things
like privileged instructions, int*, in*/out*, and instructions that fuss
with the segment registers. We've never actually seen such instructions
in user apps.
> RIP-relative instructions are excluded on 64-bit for example.
No. As discussed in previous posts, we handle rip-relative
instructions.
>
> - Emulation has the _least_ semantical side effects as we really execute
> 'that' instruction -
It seems to me that emulation is the only approach that DOESN'T execute
the probed instruction.
> not some other instruction put elsewhere into a
> special vma or into the process/thread stack, or some special in-kernel
> trampoline, etc.
>
> - Emulation can be very fast for the common case as well. Nobody will probe
> weird, complex instructions. They will use 'perf probe' to insert probes
> into their functions 90% of the time ...
>
> - FPU and complex ops and pagefault emulation is not really what i'd expect
> to be necessary for simple probing - but it _can_ be added by people who
> care about it, if they so wish.
**In practice, we've had to probe all sorts of instructions, including
FP instructions -- especially where you want to exploit the debug info
to get the names, types, and locations of variables and args. For some
compilers and architectures, the debug info isn't reliable until the end
of the function prologue, at which point you could find any old
instruction. Ditto if you want to probe statements within a function.
>
> Such a scheme would be _far_ more preferable from a maintenance POV as well,
> as the initial code will be small, and we can extend it gradually. All the
> other proposals are complex 'all or nothing' schemes with no flexibility for
> complexity at all.
>
> Thanks,
>
> Ingo
Thanks.
Jim
* Jim Keniston <[email protected]> wrote:
> On Wed, 2010-01-27 at 09:54 +0100, Ingo Molnar wrote:
> ...
> > I think the best solution for user probes (by far) is to use a simplified
> > in-kernel instruction emulator for the few commonly probed instructions. (Kprobes
> > already partially decodes x86 instructions to make it safe to apply
> > accelerated probes and there's other decoding logic in the kernel too.)
> >
> > The design and practical advantages are numerous:
> >
> > - People want to probe their function prologues most of the time ...
> > a single INT3 there will in most cases just hit the initial stack
> > allocation and that's it.
>
> Yes, emulating "push %ebp" would buy us a lot of coverage for a lot of apps
> on x86 (but see below**). [...]
Coverage in practice is all that matters.
Consider the fact that i get 1000 times more bugreports aided by strace, which
has 1000 times more overhead than even the slowest of uprobes approaches.
This simple fact tells us that while performance matters, it is of little use
if good utility and a clean design is not there. (in fact sane and clean
design will almost automatically result in good performance too down the line,
but i digress.) Faster crap is still crap.
> [...] Even there, though, we'd have to address the page fault we'd
> occasionally get when extending the stack vma.
Nope, in the simplest model not even page fault emulation is needed,
get_user()/put_user() would resolve it automatically. You either get the
value with the pagefault resolved, or you get a -EFAULT.
If you concentrate only on the common case then emulation can be _really_
simple.
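As a rough illustration of how small the common case could be, here is a user-space sketch in C of emulating a displaced "push %ebp". It is a simulation under stated assumptions, not kernel code: struct sim_regs, struct sim_user_mem, sim_put_user32() and emulate_push_ebp() are hypothetical names, and sim_put_user32() only mimics the "store or -EFAULT" behaviour of put_user().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simulated 32-bit register state of the probed task. */
struct sim_regs {
    uint32_t ip;   /* address of the probed instruction */
    uint32_t sp;
    uint32_t bp;
};

/* Simulated window of user memory, e.g. the top of the stack. */
struct sim_user_mem {
    uint32_t base;
    uint32_t size;
    uint8_t *bytes;
};

/* Hypothetical stand-in for put_user(): store the value or fail with
 * -EFAULT, never anything in between. */
static int sim_put_user32(struct sim_user_mem *m, uint32_t addr, uint32_t val)
{
    assert(m && m->bytes);
    if (m->size < 4 || addr < m->base || addr - m->base > m->size - 4)
        return -EFAULT;
    memcpy(m->bytes + (addr - m->base), &val, 4);
    return 0;
}

/* Emulate the one-byte "push %ebp" (opcode 0x55) that the INT3 displaced.
 * Registers are only updated once the store has succeeded. */
static int emulate_push_ebp(struct sim_regs *r, struct sim_user_mem *m)
{
    uint32_t new_sp = r->sp - 4;
    int err = sim_put_user32(m, new_sp, r->bp);

    if (err)
        return err;   /* task state untouched, caller decides what to do */
    r->sp = new_sp;
    r->ip += 1;       /* continue at the instruction after the push */
    return 0;
}
```

The emulation acts on task-local state only, and a failed store leaves the registers exactly as they were.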
Let's compare the two cases via a drawing. Your current uprobes submission
does:
 [kernel]      do probe thing     single-step trap
               ^            |     ^              |
               |            v     |              v
 [user]      INT3          XOL-ins             next ins-stream
( add the need for serialization to make sure the whole single-step thing
does not get out of sync with reality. )
An emulator approach would do:
 [kernel]      emul-demux-fastpath, do probe thing
               ^                                 |
               |                                 v
 [user]      INT3                              next ins-stream
far simpler conceptually, and faster as well, because it's one kernel entry.
Generally i get nervous if a piece of instrumentation cannot be expressed in
simple ways. _Especially_ if i consider it to concentrate on all the wrong
things and doesn't even break even with a far less complex scheme.
What would be the 'right things' to concentrate on? Make sure it's an
all-around end-to-end package that is _useful to people_. As of today i have yet
to get a _single_ bugreport or kernel improvement requested by an application
writer who found out about the inefficiencies in his app using uprobes. There
is a gaping hole of utility here, a whole cathedral of tools written that just
a handful of ordinary Linux people use. There's a big disconnect and i can say
one thing for sure: needless complexity in the wrong places can outright
stifle tools from becoming good.
> > We could get quite good coverage (and very fast
> > emulation) for the common case in not too much code - and much of that code
> > we already have available. No re-trapping,
>
> As previously discussed, boosting would also get rid of the single-step trap
> for most instructions.
Boosting is not in the uprobes patch-set you submitted. Even with it present
it won't get rid of the initial INT3. So basically _best-case_ (with boosting)
XOL-uprobes could roughly break even with a pure emulator approach ...
That's a big and fundamental difference.
> > no extra instruction patching
>
> x86_64 rip-relative instructions are the only ones we alter.
>
> > and complex maintenance of trampolines.
> >
> > - It's as transparent as it gets - no user-space trampoline or other visible
> > state that modifies behavior or can be stomped upon by user-space bugs.
>
> The XOL vma isn't writable from user space, so I can't think of how it could
> be clobbered merely by a stray memory reference. [...]
Well there must be some purpose to the instrumentation, there must be some way
to save data, right? If yes and it's in user-space, that data is clobberable.
If it's in kernel-space then we have to enter the kernel anyway (with similar
cost patterns to an INT3 entry) - so we just delayed the kernel entry.
So IMHO you have designed in considerable complexity for little immediate
benefit.
> [...] Yes, it's a vma that the unprobed app would never have; and yes, a
> malicious app or kernel module could remove it or alter the protection and
> scribble on it. We don't try to defend the app against such malicious
> attacks, but we do our best to ensure that the kernel side handles such
> attacks gracefully.
>
> > - Lightweight and simple probe insertion: no weird setup sequence needing the
> > stopping of all tasks to install the trampoline. We just add the INT3 and
> > off you go.
>
> FWIW, we don't stop all threads to set up or extend the XOL vma, which is
> typically a one-time event. We just grab a mutex, in case multiple threads
> hit previously-unhit probepoints simultaneously, and simultaneously decide
> that the XOL area needs to be created or extended.
Still it's more complex than purely local state. Plus slower than even a naive
emulator approach would be able to achieve, due to single-stepping.
> > - Emulation is evidently thread-safe, SMP-safe, etc. as it only acts on
> > task local state.
>
> The posted uprobes implementation is, so far as we can tell through code
> inspection and testing, also thread-safe and SMP-safe.
>
> >
> > - The points we can probe are never truly limited as it's all freely
> > upscalable: if you cannot probe an instruction you want to probe today,
> > extend the emulator.
>
> I don't see how ripping out existing support for almost* the entire
> instruction set, and then putting it back instruction by instruction, patch
> by patch, is a win.
IMO it's a win because it's more controlled in what we can and cannot do
safely, and because it's more transparent to the probed context.
But by far the most important aspect is that it should be far less code with
far less complexity, and hence much more graceful from an upstream POV.
Gradual concepts with easy ways forwards/backwards are good. All-or-nothing
frameworks are bad.
> Even if we add emulation, it seems sensible to keep the XOL approach as a
> backup to handle instructions that aren't yet emulated (and architectures
> that don't yet have emulators). That way, if you don't probe any unemulated
> instructions, the XOL vma is never created.
To turn the argument around: an in-kernel emulator is an all-around facility
to make sure we probe safely and securely, _and_ it is also more portable
because it's simpler (because more gradual) to implement on a new architecture
as you don't actually have to copy around instructions (and make sure they work
in that new place), but have to emulate a limited subset of the instruction
space, on purely local state.
There are far fewer things that can go wrong in such a model.
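A minimal sketch of what "emulate a limited subset, on purely local state"
could look like for the `push %ebp` case. This is a toy user-space model, not
kernel code: the Task class, the put_user() stand-in and the stack_low bound
are all hypothetical names introduced for illustration, and a real
put_user() would resolve a stack-growth page fault rather than fail on it.

```python
import errno

# Toy model of emulating `push %ebp` (opcode 0x55) on task-local state.
# Every name below is hypothetical; none of this is an existing kernel API.

PUSH_EBP = 0x55
WORD = 4  # size of a 32-bit stack slot


class Task:
    def __init__(self, esp, ebp, eip, stack_low):
        self.regs = {"esp": esp, "ebp": ebp, "eip": eip}
        self.stack_low = stack_low  # lowest writable stack address (assumed)
        self.mem = {}               # address -> 32-bit value


def put_user(task, addr, value):
    """Stand-in for put_user(): store the value, or return -EFAULT."""
    if addr < task.stack_low:
        return -errno.EFAULT
    task.mem[addr] = value & 0xFFFFFFFF
    return 0


def emulate(task, opcode):
    """Emulate one instruction.

    Returns 0 on success, a negative errno on a fault, or None when the
    opcode is outside the emulated subset (the probe would be denied).
    """
    if opcode != PUSH_EBP:
        return None
    new_esp = task.regs["esp"] - WORD
    err = put_user(task, new_esp, task.regs["ebp"])
    if err:
        return err                  # registers are left untouched on a fault
    task.regs["esp"] = new_esp
    task.regs["eip"] += 1           # push %ebp is a one-byte instruction
    return 0
```

Only the register set and the stack slot of the probed task are touched, and
anything outside the subset is refused rather than guessed at.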
> > Deny the rest. _All_ versions of uprobes code i've
> > seen so far already restricts the probe-compatible instruction set:
>
> *Yes, we currently decline to probe some instructions that look troublesome
> and we haven't taken the time to test. These include things like privileged
> instructions, int*, in*/out*, and instructions that fuss with the segment
> registers. We've never actually seen such instructions in user apps.
>
>
> > RIP-relative instructions are excluded on 64-bit for example.
>
> No. As discussed in previous posts, we handle rip-relative
> instructions.
>
> >
> > - Emulation has the _least_ semantical side effects as we really execute
> > 'that' instruction -
>
> It seems to me that emulation is the only approach that DOESN'T execute the
> probed instruction.
None of the approaches executes _that_ instruction in _that_ place - the
instruction is either replaced by an INT3 or by a jump-to-trampoline
instruction.
They may execute the same instruction but in another place.
With an emulator (assuming the emulator is correct) we can execute the precise
semantics of that instruction in that place - without any side-effects from
trampolining/replacement.
> > not some other instruction put elsewhere into a
> > special vma or into the process/thread stack, or some special in-kernel
> > trampoline, etc.
> >
> > - Emulation can be very fast for the common case as well. Nobody will probe
> > weird, complex instructions. They will use 'perf probe' to insert probes
> > into their functions 90% of the time ...
> >
> > - FPU and complex ops and pagefault emulation is not really what i'd expect
> > to be necessary for simple probing - but it _can_ be added by people who
> > care about it, if they so wish.
>
> **In practice, we've had to probe all sorts of instructions, including FP
> instructions -- especially where you want to exploit the debug info to get
> the names, types, and locations of variables and args. For some compilers
> and architectures, the debug info isn't reliable until the end of the
> function prologue, at which point you could find any old instruction. Ditto
> if you want to probe statements within a function.
For those cases, frankly, the right approach is to fix the debug info (or
introduce a new one) and forget the old crap.
You treat debuginfo as some god-given property, while it's one of the suckiest
aspects of all of Linux. But we've had that discussion months (and years) ago.
It has improved in gcc 4.5 so there's some hope.
> > Such a scheme would be _far_ more preferable from a maintenance POV as
> > well, as the initial code will be small, and we can extend it gradually.
> > All the other proposals are complex 'all or nothing' schemes with no
> > flexibility for complexity at all.
I repeat this point. To be able to scale in and out of a design is rather
important, and i dont see that with the current XOL proposal.
Ingo
On Mon, 2010-01-25 at 08:52 -0800, Linus Torvalds wrote:
>
> That said, I also suspect that people should still look seriously at
> simply just improving ptrace. For example, I suspect that the biggest
> problem with ptrace is really just the signalling, and that creating a
> new extension for JUST THAT, and then having a model where you can choose
> - at PTRACE_ATTACH time - how to wait for events would be a good thing.
like returning a fd to poll() on ? :-)
Cheers,
Ben.
On Fri, 29 Jan 2010, Benjamin Herrenschmidt wrote:
>
> like returning a fd to poll() on ? :-)
Well, there's the possibility of async polling (rather than the
synchronous wait that ptrace forces now), but there are other advantages
to having a "connection" model - like not having to look up the child
process every time like ptrace does now.
Although 'find_task_by_vpid()' is probably cheap enough that nobody really
cares. We do a fair job at those hash tables.
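As a rough illustration of the "connection" idea (one descriptor per tracee,
waited on asynchronously), here is a user-space sketch. It assumes nothing
about any real ptrace extension: a socketpair stands in for the hypothetical
event descriptor, and the wait_for_events() helper and the b"stopped" payload
are made up for the example.

```python
import selectors
import socket

# A socketpair plays the role of a hypothetical per-tracee event descriptor.
# kernel_end is the side that reports events, tracer_end is what a debugger
# would poll. No existing ptrace interface works this way.
kernel_end, tracer_end = socket.socketpair()


def wait_for_events(sel, timeout):
    """Return the payloads of all descriptors that became readable."""
    return [key.fileobj.recv(64) for key, _ in sel.select(timeout)]


sel = selectors.DefaultSelector()
sel.register(tracer_end, selectors.EVENT_READ)

idle = wait_for_events(sel, timeout=0)    # nothing pending: returns at once
kernel_end.send(b"stopped")               # the "tracee" reports an event
ready = wait_for_events(sel, timeout=1.0) # now the descriptor is readable

sel.unregister(tracer_end)
sel.close()
kernel_end.close()
tracer_end.close()
```

The descriptor is obtained once and then reused for every wait, which is the
part that avoids a per-call lookup of the child.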
Linus
On Thu, 2010-01-28 at 09:55 +0100, Ingo Molnar wrote:
> * Jim Keniston <[email protected]> wrote:
>
> > On Wed, 2010-01-27 at 09:54 +0100, Ingo Molnar wrote:
> > ...
> >
> > Yes, emulating "push %ebp" would buy us a lot of coverage for a lot of apps
> > on x86 (but see below**). [...]
>
...
>
> > [...] Even there, though, we'd have to address the page fault we'd
> > occasionally get when extending the stack vma.
>
> Nope, in the simplest model not even page fault emulation is needed,
> get_user()/put_user() would resolve it automatically. You either get the
> value with the pagefault resolved, or you get a -EFAULT.
get_user()/put_user() have to be done in a context where you can sleep,
right? Uprobes currently operates in such contexts, but there's some
talk of moving it all to a DIE_INT3 notifier context, where it can't
sleep.
...
>
> > > We could get quite good coverage (and very fast
> > > emulation) for the common case in not too much code - and much of that code
> > > we already have available. No re-trapping,
> >
> > As previously discussed, boosting would also get rid of the single-step trap
> > for most instructions.
>
> Boosting is not in the uprobes patch-set you submitted. Even with it present
> it wont get rid of the initial INT3. So basically _best-case_ (with boosting)
> XOL-uprobes could roughly break even with a pure emulator approach ...
>
> That's a big and fundamental difference.
To be fair, wrt uprobes, emulation and boosting are both in the same
state: pretty well understood, but not yet implemented.
...
> > >
> > > - It's as transparent as it gets - no user-space trampoline or other visible
> > > state that modifies behavior or can be stomped upon by user-space bugs.
> >
> > The XOL vma isn't writable from user space, so I can't think of how it could
> > be clobbered merely by a stray memory reference. [...]
>
> Well there must be some purpose to the instrumentation, there must be some way
> to save data, right? If yes and it's in user-space, that data is clobberable.
One or two others have advocated an approach (which eliminates the
breakpoint trap) where trace data is stored in the uprobe vma, but I
haven't. (In such a case, "XOL vma" would be a misnomer.) I agree that
in such a scenario, the uprobe vma would of necessity be writable by the
app.
>
> If it's in kernel-space then we have to enter the kernel anyway (with similar
> cost patterns to an INT3 entry) - so we just delayed the kernel entry.
This seems to presume that you have to extract trace data from the
kernel every time a probe is hit. In actual practice, you're often just
checking for unusual arg values, incrementing a counter, or some such.
>
...
> > Even if we add emulation, it seems sensible to keep the XOL approach as a
> > backup to handle instructions that aren't yet emulated (and architectures
> > that don't yet have emulators). That way, if you don't probe any unemulated
> > instructions, the XOL vma is never created.
>
> To turn the argument around: an in-kernel emulator is an all-around facility
> to make sure we probe safely and securely, _and_ it is also more portable
> because it's simpler (because more gradual) to implement on a new architecture
> as you dont actually have to copy around instructions (and make sure they work
> in that new place), but have to emulate a limited subset of the instruction
> space, on purely local state.
I understand the desire to start small and simple and grow gradually
from there. We thought we were doing that. Single-stepping out of line
has been in use for close to a decade, maybe more; and boosting (in
kprobes) has been around for a few years as well. To the *probes folks,
it feels pretty solid.
>
...
>
> With an emulator (assuming the emulator is correct) we can execute the precise
> semantics of that instruction in that place - without any side-effects from
> trampolining/replacement.
And of course, our view has been that the best way to achieve the effect
of the instruction, including all desired side-effects, is to execute
the instruction on the CPU.
...
> >
> > **In practice, we've had to probe all sorts of instructions, including FP
> > instructions -- especially where you want to exploit the debug info to get
> > the names, types, and locations of variables and args. For some compilers
> > and architectures, the debug info isn't reliable until the end of the
> > function prologue, at which point you could find any old instruction. Ditto
> > if you want to probe statements within a function.
>
> For those cases, frankly, the right approach is to fix the debug info (or
> introduce a new one) and forget the old crap.
>
> You treat debuginfo as some god-given property, while it's one of the suckiest
> aspects of all of Linux. But we've had that discussion months (and years) ago.
> It has improved in gcc 4.5 so there's some hope.
Yes, there seems to be considerable movement toward better debug info --
which could make statement probing (and not just function-boundary
probing) more and more feasible.
>
...
> Ingo
Thanks.
Jim
On Thu, Jan 28, 2010 at 09:55:02AM +0100, Ingo Molnar wrote:
...
> Lets compare the two cases via a drawing. Your current uprobes submission
> does:
>
> [kernel] do probe thing single-step trap
> ^ | ^ |
> | v | v
> [user] INT3 XOL-ins next ins-stream
>
> ( add the need for serialization to make sure the whole single-step thing
> does not get out of sync with reality. )
>
> And emulator approach would do:
>
> [kernel] emul-demux-fastpath, do probe thing
> ^ |
> | v
> [user] INT3 next ins-stream
>
> far simpler conceptually, and faster as well, because it's one kernel entry.
Ingo,
Yes, conceptually, emulation is simpler. In fact, it may even be the
right thing to do from a housekeeping POV if gdb were enabled to use
breakpoint assistance in the kernel. However... emulation is not
easy. Just quoting Peter Anvin:
> On the more general rule of interpretation: I'm really concerned about
> having a bunch of partially-capable x86 interpreters all over the
> kernel. x86 is *hard* to emulate, and it will only get harder as the
> architecture evolves.
>
> -hpa
Yes, I know you suggested we start with a small subset.
We already have an implementation of instruction emulation in the kernel for
x86 and powerpc, but it's too KVM centric. If there is a generic
emulation layer, we would use it.
There are conflicting opinions for either case; complicated as it is,
the XOL scheme works and, to a large extent, it is easily extendable to
other architectures compared to the emulation approach. Uprobes can be
made to use emulation when possible/available, but I don't think this
should be a gating decision for the initial implementation of the feature.
Ananth
* Jim Keniston <[email protected]> wrote:
> > > As previously discussed, boosting would also get rid of the single-step
> > > trap for most instructions.
> >
> > Boosting is not in the uprobes patch-set you submitted. Even with it
> > present it wont get rid of the initial INT3. So basically _best-case_
> > (with boosting) XOL-uprobes could roughly break even with a pure emulator
> > approach ...
> >
> > That's a big and fundamental difference.
>
> To be fair, wrt uprobes, emulation and boosting are both in the same state:
> pretty well understood, but not yet implemented.
So, to sum it up: utrace XOL, which is rather complex already, needs even more
complexity (which is not yet implemented) than the much simpler common-case
emulator approach i outlined, just to break even with the performance of the
much simpler approach.
And you've been justifying the complexity of XOL with its performance
advantages.
See why i'm unimpressed by that argument?
[ Note, i'm not dismissing it entirely, the complexity of XOL _might_ be fine
in the future if it brings us real advantages: for example if it avoids
_ALL_ kernel entries.
That can be done too, by using the jump-probe technique in user-space. (the
closest anyone came to proposing this was Avi with the user-space INT3 hack
- but we can do better than that via the jprobes technique.) At that point
the advantage of having a pure user-space callback technique combined with
the advantages of having near full instruction coverage might tip the
balance. There are other complexities to handle in that case though, like
buffering and more. ]
But right now we are nowhere near that stage, and i don't see the path towards
that either. So i'd much rather see something simpler and move on from these
IMHO unimportant performance details to the IMO much more important high level
interface and high level tooling details.
When we merged kprobes ~10 years ago we made the (rather bad) mistake of
merging a raw, opaque facility and leaving 'the rest' up to some other entity.
IBM kprobes hackers vanished the day the original kprobes code went upstream
and the high level entity never truly materialized in-kernel, for nearly a
decade!
With uprobes we should learn from that painful lesson and bring in the high
level users of uprobes via 'perf probe' (or any other real user) straight
away.
Complexity is easy to increase when usage is increasing, it's near impossible
to reduce when usage is not there. (and it's rather hard to reduce even with
increasing usage - especially if aspects of the complexity leak out to
user-space ABIs - a danger that XOL has written all over it.)
So the request is simple to sum up: please reduce complexity of the initial
submission and increase all around utility.
Thanks,
Ingo
* Ananth N Mavinakayanahalli <[email protected]> wrote:
> On Thu, Jan 28, 2010 at 09:55:02AM +0100, Ingo Molnar wrote:
>
> ...
>
> > Lets compare the two cases via a drawing. Your current uprobes submission
> > does:
> >
> > [kernel] do probe thing single-step trap
> > ^ | ^ |
> > | v | v
> > [user] INT3 XOL-ins next ins-stream
> >
> > ( add the need for serialization to make sure the whole single-step thing
> > does not get out of sync with reality. )
> >
> > And emulator approach would do:
> >
> > [kernel] emul-demux-fastpath, do probe thing
> > ^ |
> > | v
> > [user] INT3 next ins-stream
> >
> > far simpler conceptually, and faster as well, because it's one kernel entry.
>
> Ingo,
>
> Yes, conceptually, emulation is simpler. In fact, it may even be the
> right thing to do from a housekeeping POV if gdb were enabled to use
> breakpoint assistance in the kernel. However... emulation is not
> easy. Just quoting Peter Anvin:
>
> > On the more general rule of interpretation: I'm really concerned about
> > having a bunch of partially-capable x86 interpreters all over the
> > kernel. x86 is *hard* to emulate, and it will only get harder as the
> > architecture evolves.
> >
> > -hpa
This is obviously true for a full emulator. Except for the fact that:
> Yes, I know you suggested we start with a small subset.
and for the fact that we already have emulators in the kernel.
Plus we _already_ need to decode instructions for safe kprobing and have the
code for that upstream. So it's not like we can avoid decoding the
instructions. (and emulating certain instruction patterns is really just a
natural next step of a good decoder.)
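One way to picture "emulation as the next step of a decoder" is a probe-time
policy table: the decoder result picks between emulating, falling back to
XOL, or refusing the probe. The sketch below is a hedged toy, not the
existing kernel decoder. It only looks at one-byte x86 opcodes, ignores
prefixes and multi-byte encodings, and the table contents and the
probe_strategy() name are assumptions made for the example.

```python
# Hypothetical probe-time policy driven by a (much simplified) decode step.
# Real decoding must handle prefixes, ModRM and multi-byte opcodes.

# Instructions a small emulator might cover first.
EMULATED = {
    0x55: "push %ebp",
    0x90: "nop",
}

# Instructions refused outright: int*, in*/out*, and a privileged one.
DENIED = {
    0xCC: "int3",
    0xCD: "int imm8",
    0xE4: "in", 0xE5: "in", 0xEC: "in", 0xED: "in",
    0xE6: "out", 0xE7: "out", 0xEE: "out", 0xEF: "out",
    0xF4: "hlt",
}


def probe_strategy(opcode, have_xol=True):
    """Return "emulate", "xol" or "deny" for a one-byte opcode."""
    if opcode in DENIED:
        return "deny"
    if opcode in EMULATED:
        return "emulate"
    # Not emulated yet: single-step out of line if that backend exists,
    # otherwise refuse the probe rather than guess.
    return "xol" if have_xol else "deny"
```

With have_xol=False this models an emulator-only design that grows by adding
table entries. With have_xol=True it models keeping XOL as the backup for
instructions that are not emulated yet.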
> We already have an implementation of instruction emulation in kernel for x86
> and powerpc, but its too KVM centric. If there is a generic emulation layer,
> we would use it.
So this approach, beyond being simpler, more robust and faster than the
current XOL code, would also trigger (much needed) cleanups in other parts of
the kernel and would share code with other kernel subsystems.
Don't you see the obvious advantages of that?
Ingo
On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
...
> When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> IBM kprobes hackers vanished the day the original kprobes code went upstream
> and the high level entity never truly materialized in-kernel, for nearly a
> decade!
I don't know what you are referring to here... Kprobes was merged in
2.6.9 (~August 2004 -- less than 6 years ago). Since then, we did work
on ports to powerpc and s390. We implemented kretprobes. We made it much
more scalable using RCU; we did the powerpc booster to skip single-step when
possible, not to mention various bug fixes over the years.
Yes, we did not do the perf integration, but perf did not exist then, either.
It's simply wrong to say people 'vanished'.
Thanks,
Ananth
On Fri, Jan 29, 2010 at 01:22:40PM +0530, Ananth N Mavinakayanahalli wrote:
> On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
>
> ...
>
> > When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> > merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> > IBM kprobes hackers vanished the day the original kprobes code went upstream
> > and the high level entity never truly materialized in-kernel, for nearly a
> > decade!
>
> I don't know what you are referring to here... Kprobes was merged in
> 2.6.9 (~August 2004 -- less than 6 years ago). Since then, we did work
> on ports to powerpc and s390. We implemented kretprobes. We made it much
> scalable using RCU; we did the powerpc booster to skip single-step when
> possible, not to mention various bug fixes over the years.
>
> Yes, we did not do the perf integration, but perf did not exist then, either.
>
> Its simply wrong to say people 'vanished'.
Oh, and the x86 instruction decoder was initially implemented by us.
Masami has done a great job making it more complete.
Ananth
* Ananth N Mavinakayanahalli <[email protected]> wrote:
> On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
>
> ...
>
> > When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> > merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> > IBM kprobes hackers vanished the day the original kprobes code went upstream
> > and the high level entity never truly materialized in-kernel, for nearly a
> > decade!
>
> I don't know what you are referring to here... Kprobes was merged in 2.6.9
> (~August 2004 -- less than 6 years ago). [...]
Ok, 6 years then :-)
> [...] Since then, we did work on ports to powerpc and s390. We implemented
> kretprobes. We made it much scalable using RCU; we did the powerpc booster
> to skip single-step when possible, not to mention various bug fixes over the
> years.
Except it had no real in-kernel user.
> Yes, we did not do the perf integration, but perf did not exist then,
> either.
>
> Its simply wrong to say people 'vanished'.
It certainly was a bit stale for years - and with no real users that's
certainly not a surprise. That has changed recently so i'm not complaining. We
just don't want to repeat the same mistake with uprobes.
Ingo
* Ananth N Mavinakayanahalli <[email protected]> wrote:
> On Fri, Jan 29, 2010 at 01:22:40PM +0530, Ananth N Mavinakayanahalli wrote:
> > On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
> >
> > ...
> >
> > > When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> > > merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> > > IBM kprobes hackers vanished the day the original kprobes code went upstream
> > > and the high level entity never truly materialized in-kernel, for nearly a
> > > decade!
> >
> > I don't know what you are referring to here... Kprobes was merged in
> > 2.6.9 (~August 2004 -- less than 6 years ago). Since then, we did work
> > on ports to powerpc and s390. We implemented kretprobes. We made it much
> > scalable using RCU; we did the powerpc booster to skip single-step when
> > possible, not to mention various bug fixes over the years.
> >
> > Yes, we did not do the perf integration, but perf did not exist then, either.
> >
> > Its simply wrong to say people 'vanished'.
>
> Oh, and the x86 instruction decoder was initially implemented by us.
Which he implemented at my suggestion, to make it safe and robust for 'perf
probe' even if debuginfo for some reason gives us the wrong address and we try
to insert a probe where we shouldn't.
> Masami has done a great job making it more complete.
Absolutely and certainly so! I'm not talking about the present - i'm happy
about where kprobes is going currently, and the new jump-probes optimizations
look promising too.
I just see uprobes repeating some of the mistakes of early kprobes, and i want
us to learn from that experience. In my experience real usage and good
integration is the key to that, and we can skip those lost 5 years.
Ingo
On Fri, Jan 29, 2010 at 10:11:16AM +0100, Ingo Molnar wrote:
>
> * Ananth N Mavinakayanahalli <[email protected]> wrote:
>
> > On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
> >
> > ...
> >
> > > When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> > > merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> > > IBM kprobes hackers vanished the day the original kprobes code went upstream
> > > and the high level entity never truly materialized in-kernel, for nearly a
> > > decade!
> >
> > I don't know what you are referring to here... Kprobes was merged in 2.6.9
> > (~August 2004 -- less than 6 years ago). [...]
>
> Ok, 6 years then :-)
>
> > [...] Since then, we did work on ports to powerpc and s390. We implemented
> > kretprobes. We made it much scalable using RCU; we did the powerpc booster
> > to skip single-step when possible, not to mention various bug fixes over the
> > years.
>
> Except it had no real in-kernel user.
Not that I want to rebut you Ingo, but there were in-kernel users since 2006
(net/ipv4/tcp_probe.c) :-)
Aside, I am also glad that we have more flexibility with the perf
integration.
Ananth
* Ananth N Mavinakayanahalli <[email protected]> wrote:
> On Fri, Jan 29, 2010 at 10:11:16AM +0100, Ingo Molnar wrote:
> >
> > * Ananth N Mavinakayanahalli <[email protected]> wrote:
> >
> > > On Fri, Jan 29, 2010 at 08:39:07AM +0100, Ingo Molnar wrote:
> > >
> > > ...
> > >
> > > > When we merged kprobes ~10 years ago we made the (rather bad) mistake of
> > > > merging a raw, opaque facility and leaving 'the rest' up to some other entity.
> > > > IBM kprobes hackers vanished the day the original kprobes code went upstream
> > > > and the high level entity never truly materialized in-kernel, for nearly a
> > > > decade!
> > >
> > > I don't know what you are referring to here... Kprobes was merged in 2.6.9
> > > (~August 2004 -- less than 6 years ago). [...]
> >
> > Ok, 6 years then :-)
> >
> > > [...] Since then, we did work on ports to powerpc and s390. We implemented
> > > kretprobes. We made it much scalable using RCU; we did the powerpc booster
> > > to skip single-step when possible, not to mention various bug fixes over the
> > > years.
> >
> > Except it had no real in-kernel user.
>
> Not that I want to rebut you Ingo, but there were in-kernel users since 2006
> (net/ipv4/tcp_probe.c) :-)
i said 'real' users. That usage in tcp_probe.c was (and is) really minimal and
never expanded really.
> Aside, I am also glad that we have more flexibility with the perf
> integration.
ok, good :)
Ingo
Ingo Molnar <[email protected]> writes:
> [...] So, to sum it up: utrace XOL, which is rather complex already,
> needs even more complexity (which is not yet implemented) than the
> much simpler common-case emulator approach i outlined, just to break
> even with the performance of the much simpler approach. [...]
Is it an uncontroversial claim that emulation of CISC instructions
should perform better than their native execution, followed by an int3
(as in the simplest working scheme) or boosting (as done by kprobes)?
From my experience with simulators, "simple" software emulation of
cpus can be hundreds of times slower or worse than native execution.
- FChE
On Fri, 2010-01-29 at 08:42 +0100, Ingo Molnar wrote:
> * Ananth N Mavinakayanahalli <[email protected]> wrote:
>
> > On Thu, Jan 28, 2010 at 09:55:02AM +0100, Ingo Molnar wrote:
> >
> > ...
> >
> > > Lets compare the two cases via a drawing. Your current uprobes submission
> > > does:
> > >
> > > [kernel] do probe thing single-step trap
> > > ^ | ^ |
> > > | v | v
> > > [user] INT3 XOL-ins next ins-stream
> > >
> > > ( add the need for serialization to make sure the whole single-step thing
> > > does not get out of sync with reality. )
> > >
> > > And emulator approach would do:
> > >
> > > [kernel] emul-demux-fastpath, do probe thing
> > > ^ |
> > > | v
> > > [user] INT3 next ins-stream
> > >
> > > far simpler conceptually, and faster as well, because it's one kernel entry.
> >
> > Ingo,
> >
> > Yes, conceptually, emulation is simpler. In fact, it may even be the
> > right thing to do from a housekeeping POV if gdb were enabled to use
> > breakpoint assistance in the kernel. However... emulation is not
> > easy. Just quoting Peter Anvin:
> >
> > > On the more general rule of interpretation: I'm really concerned about
> > > having a bunch of partially-capable x86 interpreters all over the
> > > kernel. x86 is *hard* to emulate, and it will only get harder as the
> > > architecture evolves.
> > >
> > > -hpa
>
> This is obviously true for a full emulator. Except for the fact that:
>
> > Yes, I know you suggested we start with a small subset.
>
> and for the fact that we already have emulators in the kernel.
But this would be emulating userspace instructions, correct?
The kernel is limited in what instructions it can perform, no floating
point for example (of course there are some exceptions). But generally,
the instructions in the kernel should be easier to emulate than in
userspace. Userspace is free to do any wacky thing it wants. Will this
limit the ability to probe apps that take advantage of some strange op
code that the user knows is available on their platform?
-- Steve
>
> Plus we _already_ need to decode instructions for safe kprobing and have the
> code for that upstream. So it's not like we can avoid decoding the
> instructions. (and emulating certain instruction patterns is really just a
> natural next step of a good decoder.)
On Sat, 30 Jan 2010, Steven Rostedt wrote:
>
> The kernel is limited to what instructions it can perform, no floating
> point for example (of course there are some exceptions).
Actually, the reason the kernel is limited to not performing floating
point instructions is that the kernel doesn't own the floating point
register set - it's too big to save/restore, so the kernel leaves it
alone.
But for emulating an instruction from user space, it would be perfectly
fine to do an FP instruction in kernel space, since we're explicitly doing
it on behalf of user space, and with user space owning it.
Of course, that would require that we _only_ touch the registers that user
space wants us to touch, which is likely impossible in practice for
anything but an execute-out-of-line model.
> But generally, the instructions in the kernel should be easier to
> emulate than in userspace.
Yeah, we control the kernel instructions better, and we know what the
environment is. For example, we never have to worry about vm86 mode or
segments when we fix up kernel instructions, but user space can do
anything, of course.
Linus
Ingo Molnar wrote:
>
> * Ananth N Mavinakayanahalli <[email protected]> wrote:
>
>> On Thu, Jan 28, 2010 at 09:55:02AM +0100, Ingo Molnar wrote:
>>
>> ...
>>
>>> Lets compare the two cases via a drawing. Your current uprobes submission
>>> does:
>>>
>>> [kernel] do probe thing single-step trap
>>> ^ | ^ |
>>> | v | v
>>> [user] INT3 XOL-ins next ins-stream
>>>
>>> ( add the need for serialization to make sure the whole single-step thing
>>> does not get out of sync with reality. )
>>>
>>> And emulator approach would do:
>>>
>>> [kernel] emul-demux-fastpath, do probe thing
>>> ^ |
>>> | v
>>> [user] INT3 next ins-stream
>>>
>>> far simpler conceptually, and faster as well, because it's one kernel entry.
>>
>> Ingo,
>>
>> Yes, conceptually, emulation is simpler. In fact, it may even be the
>> right thing to do from a housekeeping POV if gdb were enabled to use
>> breakpoint assistance in the kernel. However... emulation is not
>> easy. Just quoting Peter Anvin:
>>
>>> On the more general rule of interpretation: I'm really concerned about
>>> having a bunch of partially-capable x86 interpreters all over the
>>> kernel. x86 is *hard* to emulate, and it will only get harder as the
>>> architecture evolves.
>>>
>>> -hpa
>
> This is obviously true for a full emulator. Except for the fact that:
>
>> Yes, I know you suggested we start with a small subset.
>
> and for the fact that we already have emulators in the kernel.
>
> Plus we _already_ need to decode instructions for safe kprobing and have the
> code for that upstream. So it's not like we can avoid decoding the
> instructions. (and emulating certain instruction patterns is really just a
> natural next step of a good decoder.)
>
>> We already have an implementation of instruction emulation in kernel for x86
>> and powerpc, but its too KVM centric. If there is a generic emulation layer,
>> we would use it.
>
> So this approach, beyond being simpler, more robust and faster than the
> current XOL code, would also trigger (much needed) cleanups in other parts of
> the kernel and would share code with other kernel subsystems.
Hm, ok. Indeed, we have some x86 emulator-like code in the kernel (see
arch/x86/mm/pf_in.*). I think it is basically a good thing to re-implement
a much better emulator for all of them. But I think it'll be a long road,
because when I tried to reuse the kvm emulator as a decoder, I felt it was too
specialized for kvm: vcpu, guest virtual memory access, and so on.
Even if we could make an emulator/evaluator/decoder which provides
functions for those consumers, I'm not so sure it would be fast enough,
because I don't think XOL code is that much slower than emulating... based on
my experience with kprobe benchmarks, it will need ~500 cycles.
If the emulator can be faster than that, I agree.
(BTW, apart from the uprobes need, I think that code should be reworked
around well-maintainable instruction maps, like x86-opcode-map.txt :))
> Dont you see the obvious advantages of that?
Hmm, my other concern is that if we have to write emulators for each arch,
an XOL implementation could be much simpler than the total code of those.
So, to summarize my thoughts: in the short term (and only for uprobes), XOL
is the better way to go. It can be reused on other archs, is generic, and is
not so slow (and we can boost some opcodes). However, it will not be
transparent to user space (users can see which instruction is probed),
will reduce the address space available to the user, and might have
security issues(?).
In the long term, a generic x86 emulator is another way. If we
can make it generic enough, we don't need XOL code. However, it is
hard and takes time to make it that generic, and it can be slower than
XOL on some complex instructions (and also, how many instructions
need to be supported for it to be enough?). Indeed, I must admit that
implementing an emulator should be exciting for kernel hackers :)
Anyway, if you think we can't avoid generalizing the x86 emulators
(even without uprobes), maybe it is a good way to go.
Thank you,
--
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division
e-mail: [email protected]
Hi!
> >>> Right, so you're going to love uprobes, which does exactly that. The
> >>> current proposal is overwriting the target instruction with an INT3 and
> >>> injecting an extra vma into the target process's address space
> >>> containing the original instruction(s) and possible jumps back to the
> >>> old code stream.
> >>
> >> Just out of interest, how does it handle the threading issue?
> >>
> >> Last I saw, at least some CPU people were _very_ nervous about overwriting
> >> instructions if another CPU might be just about to execute them.
> >
> > I think the issue was that ring 0 was never meant to do that, where as,
> > ring 3 does it all the time. Doesn't the dynamic library modify its
> > text?
>
> No, it has nothing to do with ring. It has to do with modifying code
> that another CPU could be executing at the same time, and with modifying
> code on the same processor through another virtual alias (they are
> different issues.) The same issues apply regardless of the CPL of the
> processor.
...but these are always 'there could be cpu bugs around' issues,
right? Like amd k6. AFAICT x86 always supported self-modifying code
without any extra barriers needed...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 02/07/2010 10:54 PM, Pavel Machek wrote:
>>
>> No, it has nothing to do with ring. It has to do with modifying code
>> that another CPU could be executing at the same time, and with modifying
>> code on the same processor through another virtual alias (they are
>> different issues.) The same issues apply regardless of the CPL of the
>> processor.
>
> ...but these are always 'there could be cpu bugs around' issues,
> right? Like amd k6. AFAICT x86 always supported self-modifying code
> without any extra barriers needed...
>
*Self*-modifying code, yes. *Cross*-modifying code, no.
-hpa
On Mon, 8 Feb 2010 07:54:25 +0100
Pavel Machek <[email protected]> wrote:
> > No, it has nothing to do with ring. It has to do with modifying
> > code that another CPU could be executing at the same time, and with
> > modifying code on the same processor through another virtual alias
> > (they are different issues.) The same issues apply regardless of
> > the CPL of the processor.
>
> ...but these are always 'there could be cpu bugs around' issues,
> right? Like amd k6. AFAICT x86 always supported self-modifying code
> without any extra barriers needed...
self modifying code yes, cross modifying code no.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On 01/27/2010 01:05 PM, Ananth N Mavinakayanahalli wrote:
> We don't need to write one. I don't know how easy it is to make the kvm
> emulator less kvm-centric (vcpus, kvm_context, etc). Avi?
>
It's a lot of mindless work but not too difficult; replacing hardcoded
accessors with function pointers.
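
As a rough illustration of what "replacing hardcoded accessors with function pointers" could look like, here is a minimal sketch in C. All names (emu_ops, emu_ctxt, emu_step, flat_mem) are hypothetical and are not the kvm emulator's actual interface. The emulator core only touches memory through the ops table, so a kvm backend could translate guest addresses while a uprobes backend could access the probed task's user memory. Only `nop` (0x90) and 64-bit `push %rbp` (0x55) are handled; anything else returns an error so a caller could fall back to another mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical accessor table; each consumer supplies its own backend. */
struct emu_ops {
    int (*read_mem)(void *priv, uint64_t addr, void *buf, size_t len);
    int (*write_mem)(void *priv, uint64_t addr, const void *buf, size_t len);
};

enum { REG_RSP = 4, REG_RBP = 5 };  /* x86 register encoding numbers */

struct emu_ctxt {
    const struct emu_ops *ops;
    void *priv;          /* backend-private state (vcpu, task, buffer, ...) */
    uint64_t ip;
    uint64_t regs[16];
};

/* Emulate one instruction. Returns 0 on success, -1 on memory fault,
 * -2 if the opcode is not supported by this sketch. */
static int emu_step(struct emu_ctxt *c)
{
    uint8_t op;

    assert(c != NULL && c->ops != NULL);
    if (c->ops->read_mem(c->priv, c->ip, &op, 1))
        return -1;

    switch (op) {
    case 0x90:                       /* nop */
        c->ip += 1;
        return 0;
    case 0x55: {                     /* push %rbp (64-bit mode) */
        uint64_t sp = c->regs[REG_RSP] - 8;
        if (c->ops->write_mem(c->priv, sp, &c->regs[REG_RBP], 8))
            return -1;
        c->regs[REG_RSP] = sp;
        c->ip += 1;
        return 0;
    }
    default:
        return -2;
    }
}

/* Example backend: a flat byte buffer mapped at a fixed start address. */
struct flat_mem {
    uint8_t *base;
    uint64_t start;
    size_t size;
};

static int flat_check(const struct flat_mem *m, uint64_t addr, size_t len)
{
    if (addr < m->start || len > m->size)
        return -1;
    return (addr - m->start > m->size - len) ? -1 : 0;
}

static int flat_read(void *priv, uint64_t addr, void *buf, size_t len)
{
    struct flat_mem *m = priv;
    if (flat_check(m, addr, len))
        return -1;
    memcpy(buf, m->base + (addr - m->start), len);
    return 0;
}

static int flat_write(void *priv, uint64_t addr, const void *buf, size_t len)
{
    struct flat_mem *m = priv;
    if (flat_check(m, addr, len))
        return -1;
    memcpy(m->base + (addr - m->start), buf, len);
    return 0;
}

static const struct emu_ops flat_ops = { flat_read, flat_write };
```

The indirect calls add some per-access overhead, which matters for the cycle-count comparison with XOL raised earlier in the thread; this sketch makes no claim about how large that cost would be in practice.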
--
error compiling committee.c: too many arguments to function