2008-07-15 19:43:15

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

James Bottomley <[email protected]> writes:

> One of the big nasties of systemtap is the way it tries to embed
> virtually the entirety of the kernel symbol table in the probe
> modules it constructs.

It is a compromise of conflicting requirements.

> This is highly undesirable because it represents a subversion of the
> kernel API to gain access to unexported symbols.

Please elaborate. Does the translator or its runtime use unexported
symbols? (That would arouse the question about why.)

Or are you talking about being able to *probe* unexported functions or
access unexported data? That would be a deliberate feature.

> At least for kprobes, the correct way to do this is to specify the
> probe point by symbol and offset.

But there won't be just kprobes. Much of this code was built with
anticipation of user-space probing, and there the kernel won't have a
similar mechanism. Similarly, the code is written to work with old
kernels too - ones that predate the symbol+offset kprobe API.

Unless someone is about to rip out pure address-based kprobes, I see
no reason to complicate the code.


- FChE


2008-07-15 19:52:25

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Tue, 2008-07-15 at 15:41 -0400, Frank Ch. Eigler wrote:
> James Bottomley <[email protected]> writes:
>
> > One of the big nasties of systemtap is the way it tries to embed
> > virtually the entirety of the kernel symbol table in the probe
> > modules it constructs.
>
> It is a compromise of conflicting requirements.

Well ... in order to make forward progress, since the systemtap people
expressed a desire to be better integrated with the kernel, the first
order of business is to use the correct APIs ... that has to happen even
before an evaluation can be made of which pieces of the systemtap
runtime should move into the kernel.

> > This is highly undesirable because it represents a subversion of the
> > kernel API to gain access to unexported symbols.
>
> Please elaborate. Does the translator or its runtime use unexported
> symbols? (That would arouse the question about why.)
>
> Or are you talking about being able to *probe* unexported functions or
> access unexported data? That would be a deliberate feature.

No ... I'm talking about _stp_module_relocate() at this point. It's an
unnecessary function, since the kprobes API provides a way to attach to
a symbol and an offset. The API allows access to unexported functions.

> > At least for kprobes, the correct way to do this is to specify the
> > probe point by symbol and offset.
>
> But there won't be just kprobes. Much of this code was built with
> anticipation of user-space probing, and there the kernel won't have a
> similar mechanism. Similarly, the code is written to work with old
> kernels too - ones that predate the symbol+offset kprobe API.

OK ... you've got me there ... why would user space probing necessitate
resolution of kernel space symbols? Surely you plan to use an exported
module API of utrace or whatever the agreed framework is?

> Unless someone is about to rip out pure address-based kprobes, I see
> no reason to complicate the code.

If you actually look, you'll see that pure addressed based kprobes still
work.

Also, I think you'll find it simplifies the code, since tons of the
runtime junk that duplicate the in-kernel symbol resolution can be
thrown out, plus the corresponding pieces of systemtap that have to
worry about this.

There's also the architectural worry: this scheme you currently use is
very fragile. For instance, I don't see it surviving a move to
-ffunction-sections (which patches are already going over linux-arch).

James

2008-07-15 20:09:00

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Tue, Jul 15, 2008 at 02:52:06PM -0500, James Bottomley wrote:

> [...]
> > > One of the big nasties of systemtap is the way it tries to embed
> > > virtually the entirety of the kernel symbol table in the probe
> > > modules it constructs.
> >
> > It is a compromise of conflicting requirements.
>
> Well ... in order to make forward progress, since the systemtap people
> expressed a desire to be better integrated with the kernel, the first
> order of business is to use the correct APIs [...]

Let's concentrate then on those areas where this is more clear-cut.


> > > This is highly undesirable because it represents a subversion of the
> > > kernel API to gain access to unexported symbols.
> >
> > Please elaborate. Does the translator or its runtime use unexported
> > symbols? (That would arouse the question about why.)
> >
> > Or are you talking about being able to *probe* unexported functions or
> > access unexported data? That would be a deliberate feature.
>
> No ... I'm talking about _stp_module_relocate() at this point. It's an
> unnecessary function

Maybe, but what "subversion" are you talking about?

> since the kprobes API provides a way to attach to a symbol and an
> offset. The API allows access to unexported functions.

... but not to e.g. data, which also uses this common mechanism.


> > > At least for kprobes, the correct way to do this is to specify the
> > > probe point by symbol and offset.
> >
> > But there won't be just kprobes. Much of this code was built with
> > anticipation of user-space probing, and there the kernel won't have a
> > similar mechanism. Similarly, the code is written to work with old
> > kernels too - ones that predate the symbol+offset kprobe API.
>
> OK ... you've got me there ... why would user space probing necessitate
> resolution of kernel space symbols? Surely you plan to use an exported
> module API of utrace or whatever the agreed framework is?

Of course, but for our purposes, the kernel will be just one amongst
many probing targets. We will be tracking multiple symbol tables and
unwind data for user-space.


> > Unless someone is about to rip out pure address-based kprobes, I see
> > no reason to complicate the code.
>
> If you actually look, you'll see that pure addressed based kprobes still
> work.

No need for the snark. I know they work; we've been using them for
years. I am simply happy to stay with them.


> Also, I think you'll find it simplifies the code, since tons of the
> runtime junk that duplicate the in-kernel symbol resolution can be
> thrown out, plus the corresponding pieces of systemtap that have to
> worry about this.

Again, please consider user-space. The runtime will need similar
symbol resolution code *for user space* anyway. Keeping it in there
for the kernel is no incremental complexity - if anything, the
opposite.


> There's also the architectural worry: this scheme you currently use
> is very fragile. For instance, I don't see it surviving a move to
> -ffunction-sections (which patches are already going over
> linux-arch).

Let's try it. Whatever actual problems that throws up, we'd also
encounter with userspace.


- FChE

2008-07-15 20:24:34

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Tue, 2008-07-15 at 16:07 -0400, Frank Ch. Eigler wrote:
> On Tue, Jul 15, 2008 at 02:52:06PM -0500, James Bottomley wrote:
>
> > [...]
> > > > One of the big nasties of systemtap is the way it tries to embed
> > > > virtually the entirety of the kernel symbol table in the probe
> > > > modules it constructs.
> > >
> > > It is a compromise of conflicting requirements.
> >
> > Well ... in order to make forward progress, since the systemtap people
> > expressed a desire to be better integrated with the kernel, the first
> > order of business is to use the correct APIs [...]
>
> Let's concentrate then on those areas where this is more clear-cut.

This is the most clear cut of the areas. Until the systemtap modules
get moved to the proper APIs, it's very difficult to tell what else
needs to be cleaned up or changed inside the kernel to support them.

> > > > This is highly undesirable because it represents a subversion of the
> > > > kernel API to gain access to unexported symbols.
> > >
> > > Please elaborate. Does the translator or its runtime use unexported
> > > symbols? (That would arouse the question about why.)
> > >
> > > Or are you talking about being able to *probe* unexported functions or
> > > access unexported data? That would be a deliberate feature.
> >
> > No ... I'm talking about _stp_module_relocate() at this point. It's an
> > unnecessary function
>
> Maybe, but what "subversion" are you talking about?

using a hand crafted relocation function to gain access to kernel
symbols instead of the provided API. Even if it's not used as a
template for every producer of binary modules, it's just incredibly
fragile.

> > since the kprobes API provides a way to attach to a symbol and an
> > offset. The API allows access to unexported functions.
>
> ... but not to e.g. data, which also uses this common mechanism.

That was number three in the original list of problems on my email. I
think it's solvable reasonably easily (as I outlined).

> > > > At least for kprobes, the correct way to do this is to specify the
> > > > probe point by symbol and offset.
> > >
> > > But there won't be just kprobes. Much of this code was built with
> > > anticipation of user-space probing, and there the kernel won't have a
> > > similar mechanism. Similarly, the code is written to work with old
> > > kernels too - ones that predate the symbol+offset kprobe API.
> >
> > OK ... you've got me there ... why would user space probing necessitate
> > resolution of kernel space symbols? Surely you plan to use an exported
> > module API of utrace or whatever the agreed framework is?
>
> Of course, but for our purposes, the kernel will be just one amongst
> many probing targets. We will be tracking multiple symbol tables and
> unwind data for user-space.

You're going to hand roll your own symbol resolution for user space too?
Isn't it pretty easy simply to get ld.so to do that for you?

> > > Unless someone is about to rip out pure address-based kprobes, I see
> > > no reason to complicate the code.
> >
> > If you actually look, you'll see that pure addressed based kprobes still
> > work.
>
> No need for the snark. I know they work; we've been using them for
> years. I am simply happy to stay with them.

I meant if you read the patch I posted, I made sure that pure addressed
based kprobes still work even when they have to use the
symbol_name/offset resolution method ... the new code just works out the
closest symbol and applies an offset.

> > Also, I think you'll find it simplifies the code, since tons of the
> > runtime junk that duplicate the in-kernel symbol resolution can be
> > thrown out, plus the corresponding pieces of systemtap that have to
> > worry about this.
>
> Again, please consider user-space. The runtime will need similar
> symbol resolution code *for user space* anyway. Keeping it in there
> for the kernel is no incremental complexity - if anything, the
> opposite.

I really think there might have to be separate runtime pieces for user
space and for the kernel. Trying to build a single scheme that works in
both places looks cumbersome. In the separate case, the kernel piece,
which is potentially movable inside the kernel, becomes a lot simpler.

> > There's also the architectural worry: this scheme you currently use
> > is very fragile. For instance, I don't see it surviving a move to
> > -ffunction-sections (which patches are already going over
> > linux-arch).
>
> Let's try it. Whatever actual problems that throws up, we'd also
> encounter with userspace.

OK, with -ffunction-sections you can't offset from _stext which seems to
be what _stp_module_relocate() uses. The reason this gets used on
something like parisc is so that we can place jump stubs in between the
function sections if necessary to widen the PCREL17 relocations. That
means that each function address can potentially move depending on the
number of relocation stubs embedded between it and the next function.
The exact same problem occurs between DSOs in user space. Your problem
is that you don't know the separations until someone attempts linking
(and even then, there's nothing except common sense that requires the
linker to use the minimum number of stubs).

Now, you could still offset from the start address of the section, but
that's simply the address of the function, so resolution by
symbol_name/offset is the effective solution.

James

2008-07-15 22:20:15

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Tue, Jul 15, 2008 at 03:24:22PM -0500, James Bottomley wrote:
> [...]
> > > > > This is highly undesirable because it represents a subversion of the
> > > > > kernel API to gain access to unexported symbols.
> > [...]
> > Maybe, but what "subversion" are you talking about?
>
> using a hand crafted relocation function to gain access to kernel
> symbols instead of the provided API.

Please choose your words more carefully. We don't "subvert" anything,
where one would mean sneaking around some sort of protection. We use
an established, existing facility (placing kprobes by address). We
compute addresses correctly (an error would be obvious), and is
conceptually not so different from an address being passed in at run
time from /proc/kallsyms.


> > Of course, but for our purposes, the kernel will be just one amongst
> > many probing targets. We will be tracking multiple symbol tables and
> > unwind data for user-space.
>
> You're going to hand roll your own symbol resolution for user space too?
> Isn't it pretty easy simply to get ld.so to do that for you?

I don't see how. We can't call into ld.so from kernel space. One may
need to probe ld.so itself.


> [...] I made sure that pure addressed based kprobes still work even
> when they have to use the symbol_name/offset resolution method
> ... the new code just works out the closest symbol and applies an
> offset.

(Not important, but did you consider the possibility that this chosen
reference symbol may, for whatever reason, not be listed in the
kernel's kprobe-assisting symbol table?)


> > Again, please consider user-space. The runtime will need similar
> > symbol resolution code *for user space* anyway. Keeping it in there
> > for the kernel is no incremental complexity - if anything, the
> > opposite.
>
> I really think there might have to be separate runtime pieces for
> user space and for the kernel. Trying to build a single scheme that
> works in both places looks cumbersome. In the separate case, the
> kernel piece, which is potentially movable inside the kernel,
> becomes a lot simpler. [...]

It may just be in the eye of the beholder. To me, a single scheme
that supports all the various address spaces (and kernel versions and
configurations!) we deal with is just as appealing to me as increasing
kernel specialization is to you.


> OK, with -ffunction-sections you can't offset from _stext which
> seems to be what _stp_module_relocate() uses. [...] That means
> that each function address can potentially move depending on the
> number of relocation stubs embedded between it and the next
> function.

I may be missing something, but doesn't all that happen during
linking? We process linked executables, not object files subject to
further run-time relaxation/shrinkage.


- FChE

2008-07-16 02:06:36

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Tue, 2008-07-15 at 18:18 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Jul 15, 2008 at 03:24:22PM -0500, James Bottomley wrote:
> > [...]
> > > > > > This is highly undesirable because it represents a subversion of the
> > > > > > kernel API to gain access to unexported symbols.
> > > [...]
> > > Maybe, but what "subversion" are you talking about?
> >
> > using a hand crafted relocation function to gain access to kernel
> > symbols instead of the provided API.
>
> Please choose your words more carefully. We don't "subvert" anything,
> where one would mean sneaking around some sort of protection.

Actually, I did and you do. One of the OED's definition of subvert is
"to undermine or overturn a condition or order of things, a principle or
a law etc." In this particular case, this:

commit 3a872d89baae821a0f6e2c1055d4b47650661137
Author: Ananth N Mavinakayanahalli <[email protected]>
Date: Mon Oct 2 02:17:30 2006 -0700

[PATCH] Kprobes: Make kprobe modules more portable

Which provided a portable input to kprobes (the symbol_name/offset one)
and revoked the global accessibility of the kallsyms_lookup_name().

The design was for kprobes users to stop using kallsyms_lookup_name()
and to use the symbol_name/offset instead. What systemtap did is code
its own _stp_module_lookup() as a fairly direct replacement for
kallsyms_lookup_name(). That's deliberately overturning the condition
or order of things, because you deliberately ignored the specific
replacement API in rolling your own, hence subversion.

It's actually worse than this, though. The kernel API isn't fixed in
stone, it evolves usually by trying to make problematic use cases
better. By refusing to consider using the replacement API, you lost the
opportunity to point out the shortcomings and negotiate for a better
one, so it's languished for two years with no real testing or update.
Worse still, you cut yourself off from the development flow of the
kernel and effectively forked a private API for you own use. Now,
because of this, most kernel developers will be far less inclined to
listen to your input because you've chosen not to listen to theirs. The
give and take of open source development that produces the virtuous
circle of innovation is broken. To redress this, you have to use the
correct API and begin engaging in the dialogue which stalled two years
ago.

But let's examine the consequences objectively. I have a simple single
probe file:

probe kernel.statement("*@block/bsg.c:144") {
print ("here\n");
}

It emits a single probe and produces this in the module build:

-rw-r--r-- 1 root root 17996 2008-07-15 20:45 stap_2154.c

About 600 lines.

However, it also needs this for the symbol table:

-rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h

About 12,500 lines just for the symbols.

Together these produce a module

-rw-r--r-- 1 root root 652509 2008-07-15 20:46 stap_2154.ko

That's well over half a megabyte largely because of the symbols.

By now the embedded guys are already having WTF attacks about your
module wanting a significant portion of their available ram ... and so
on ... Are you seriously arguing that a good 0.6MB of bloat just because
you refuse to use a provided API is a good thing?

I'm afraid this is your classic fish or cut bait issue: You can choose
either to engage in dialogue with the kernel community, try to use the
provided API and improve it based on demonstrated use cases (and if you
choose to do this, I can help you with it, since interaction will need
to go both ways) and thus benefit from the open source innovation
stream, or you can keep within your own community, eschewing the broader
kernel development community, ignoring their feedback and spending all
your effort constructing work arounds for what you consider to be kernel
problems leading the systemtap users to be unhappy and Sun Marketing
pulverising us over our lack of useful tracing tools.

Which is it to be?

James

2008-07-16 10:58:20

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Tue, Jul 15, 2008 at 09:06:23PM -0500, James Bottomley wrote:
> [...]
> > Please choose your words more carefully. We don't "subvert" anything,
> > where one would mean sneaking around some sort of protection.
>
> Actually, I did and you do. One of the OED's definition of subvert is
> "to undermine or overturn a condition or order of things, a principle or
> a law etc." In this particular case, this:
>
> commit 3a872d89baae821a0f6e2c1055d4b47650661137
> Author: Ananth N Mavinakayanahalli <[email protected]>
> Date: Mon Oct 2 02:17:30 2006 -0700
> [PATCH] Kprobes: Make kprobe modules more portable
>
> Which provided a portable input to kprobes (the symbol_name/offset one)
> and revoked the global accessibility of the kallsyms_lookup_name().

That patch served two purposes: a helpful utility for other kprobes
users, and it enabling what LKML deemed more important - unexporting
kallsyms*.


> It's actually worse than this, though. The kernel API isn't fixed in
> stone, it evolves usually by trying to make problematic use cases
> better. By refusing to consider using the replacement API [...]

Your lecture is based upon a misundertanding ...


> [...]
> It emits a single probe and produces this in the module build:
> -rw-r--r-- 1 root root 17996 2008-07-15 20:45 stap_2154.c
> About 600 lines.
> However, it also needs this for the symbol table:
> -rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h

... that this is somehow connected to the kprobe api issue.

IT IS NOT.

We do not use those symbol tables for kprobe placement purposes.
(This part is partially a prototype for user-space parts, and the
sizes will not stay large.)

The way we set up kprobes now could be trivially converted to
"_stext"+offset. Would that alone allay your concerns?


- FChE

2008-07-16 14:56:19

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Wed, 2008-07-16 at 06:56 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Jul 15, 2008 at 09:06:23PM -0500, James Bottomley wrote:
> > [...]
> > > Please choose your words more carefully. We don't "subvert" anything,
> > > where one would mean sneaking around some sort of protection.
> >
> > Actually, I did and you do. One of the OED's definition of subvert is
> > "to undermine or overturn a condition or order of things, a principle or
> > a law etc." In this particular case, this:
> >
> > commit 3a872d89baae821a0f6e2c1055d4b47650661137
> > Author: Ananth N Mavinakayanahalli <[email protected]>
> > Date: Mon Oct 2 02:17:30 2006 -0700
> > [PATCH] Kprobes: Make kprobe modules more portable
> >
> > Which provided a portable input to kprobes (the symbol_name/offset one)
> > and revoked the global accessibility of the kallsyms_lookup_name().
>
> That patch served two purposes: a helpful utility for other kprobes
> users, and it enabling what LKML deemed more important - unexporting
> kallsyms*.
>
>
> > It's actually worse than this, though. The kernel API isn't fixed in
> > stone, it evolves usually by trying to make problematic use cases
> > better. By refusing to consider using the replacement API [...]
>
> Your lecture is based upon a misundertanding ...
>
>
> > [...]
> > It emits a single probe and produces this in the module build:
> > -rw-r--r-- 1 root root 17996 2008-07-15 20:45 stap_2154.c
> > About 600 lines.
> > However, it also needs this for the symbol table:
> > -rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h
>
> ... that this is somehow connected to the kprobe api issue.
>
> IT IS NOT.
>
> We do not use those symbol tables for kprobe placement purposes.
> (This part is partially a prototype for user-space parts, and the
> sizes will not stay large.)

There's no misunderstanding that you're currently embedding the kernel
symbol table and have expanded the sizes of the modules enormously to do
so. The real truth is that coming up with convoluted ways to recompute
some information the kernel already knows is fragile and shouldn't be
done. What should be done is to engage in discussion about how to get
that information out of the kernel ... preferably with use cases and
patches.

> The way we set up kprobes now could be trivially converted to
> "_stext"+offset. Would that alone allay your concerns?

No ... because as I already said, that's broken for -ffunction-sections.
It needs to be the symbol of the actual function the probe is in and the
offset into that function.

James