2008-07-17 18:32:20

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

James Bottomley <[email protected]> writes:

> [...]
> Just by way of illustration, this is systemtap fixed up according to
> suggestion number 1. You can see now using your test case that we get:
>
> # probes
> kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> [...]

Can you explain in detail how you believe this is materially
different from offsetting from _stext?

- FChE

2008-07-17 20:12:42

by James Bottomley

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Thu, 2008-07-17 at 14:30 -0400, Frank Ch. Eigler wrote:
> James Bottomley <[email protected]> writes:
>
> > [...]
> > Just by way of illustration, this is systemtap fixed up according to
> > suggestion number 1. You can see now using your test case that we get:
> >
> > # probes
> > kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> > [...]
>
> Can you explain in detail how you believe this is materially
> different from offsetting from _stext?

Basically because _stext is an incredibly dangerous symbol; being linker
generated it doesn't actually get put in the right place if you look:

jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
ffffffff80209000 T _stext
ffffffff80200000 A _text

Since we can't do negative offsets, you've lost access to the symbols in
the sections that start before _stext. Assuming you meant _text (which
is dangerous because it's a define in the kernel linker script and could
change). Then you can't offset into other sections, like init sections
or modules.

James
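
The arithmetic behind the negative-offset point can be sketched as follows. This is only an illustration, not systemtap or kernel code: the two base addresses are the ones from the nm output above, while HEAD_ADDR and the helper names offset_from and symbol_plus_offset are hypothetical and chosen for this example.

```python
import bisect

# Addresses taken from the nm output quoted above.
TEXT = 0xFFFFFFFF80200000   # _text
STEXT = 0xFFFFFFFF80209000  # _stext

# Hypothetical address inside the head code, between _text and _stext.
HEAD_ADDR = TEXT + 0x100

# Tiny symbol table, sorted by address (a real one would hold every function).
SYMBOLS = [(TEXT, "_text"), (STEXT, "_stext")]


def offset_from(addr, base):
    """Return the offset of addr from base, refusing negative offsets."""
    off = addr - base
    if off < 0:
        raise ValueError("address lies below the base symbol")
    return off


def symbol_plus_offset(addr, symbols):
    """Return (name, offset) for the nearest symbol at or below addr."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        raise ValueError("address lies below every known symbol")
    base, name = symbols[i]
    return name, addr - base


print(hex(STEXT - TEXT))                       # size of the gap before _stext
print(symbol_plus_offset(HEAD_ADDR, SYMBOLS))  # nearest preceding symbol
```

With a single fixed base such as _stext, offset_from(HEAD_ADDR, STEXT) raises, whereas picking the nearest preceding symbol always yields a non-negative offset.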

2008-07-17 20:29:01

by Frank Ch. Eigler

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
> [...]
> > Can you explain in detail how you believe this is materially
> > different from offsetting from _stext?
>
> Basically because _stext is an incredibly dangerous symbol; being linker
> generated it doesn't actually get put in the right place if you look:

Thank you for your response.

> jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
> ffffffff80209000 T _stext
> ffffffff80200000 A _text
>
> Since we can't do negative offsets

Actually, "we" as in systemtap could do it just fine if that were
desired. And really _stext is therefore an arbitrary choice - it
could be any other reference.

My point is that the proposed effort to identify a nearby function
symbol to use as a base for each probe's symbol+offset calculation is
wasted.

> you've lost access to the symbols in the sections that start before _stext.

What's between _text and _stext appears to consist of kernel boot-time
functions that are unmapped by the time anything like systemtap could
run.

> Assuming you meant _text (which is dangerous because it's a define
> in the kernel linker script and could change).

By "dangerous" do you only mean that it may require a one-liner
catch-up patch in systemtap if the kernel linker scripts change?

> Then you can't offset into other sections, like init sections or
> modules.

Kernel init sections are unprobeable by definition, so that doesn't
matter. Modules are also irrelevant, since their addresses are
relative to their relocation bases / sections, not to a kernel
(vmlinux) symbol.

- FChE

2008-07-17 21:06:23

by James Bottomley

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Thu, 2008-07-17 at 16:26 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
> > [...]
> > > Can you explain in detail how you believe this is materially
> > > different from offsetting from _stext?
> >
> > Basically because _stext is an incredibly dangerous symbol; being linker
> > generated it doesn't actually get put in the right place if you look:
>
> Thank you for your response.
>
> > jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
> > ffffffff80209000 T _stext
> > ffffffff80200000 A _text
> >
> > Since we can't do negative offsets
>
> Actually, "we" as in systemtap could do it just fine if that were
> desired. And really _stext is therefore an arbitrary choice - it
> could be any other reference.
>
> My point is that the proposed effort to identify a nearby function
> symbol to use as a base for each probe's symbol+offset calculation is
> wasted.

It's not exactly wasted ... the calculations have to be done anyway for
modules.

> > you've lost access to the symbols in the sections that start before _stext.
>
> What's between _text and _stext appears to consist of kernel boot-time
> functions that are unmapped by the time anything like systemtap could
> run.

Well, no, they're the head code. It's actually used in CPU boot and
tear down, one of the things it's useful to probe, I think.

> > Assuming you meant _text (which is dangerous because it's a define
> > in the kernel linker script and could change).
>
> By "dangerous" do you only mean that it may require a one-liner
> catch-up patch in systemtap if the kernel linker scripts change?

Dangerous as in it's not necessarily part of the kernel linker scripts.
Some arches have it defined as a symbol, some have it as a linker script
definition ... that's why its location is strange.

The point, really, is to remove some of the fragile dependencies between
systemtap and the kernel.

> > Then you can't offset into other sections, like init sections or
> > modules.
>
> Kernel init sections are unprobeable by definition, so that doesn't
> matter. Modules are also irrelevant, since their addresses are
> relative to their relocation bases / sections, not to a kernel
> (vmlinux) symbol.

Then the definition needs altering. I can see that the industrial
customers aren't interested but kernel developers are ... a lot of
problems occur in the init sections.

I think you'll find that systemtap will run quite happily from a shell
in an initramfs before the init sections are discarded. Plus there's
always module init sections which can appear at any time.

James

2008-07-17 21:35:43

by Frank Ch. Eigler

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:

> [...]
> > My point is that the proposed effort to identify a nearby function
> > symbol to use as a base for each probe's symbol+offset calculation is
> > wasted.
>
> It's not exactly wasted ... the calculations have to be done anyway for
> modules.

Not really - we just anchor off a different (per-module) reference
symbol or address. At the moment, we use the .text* section bases.

> > > you've lost access to the symbols in the sections that start before _stext.
> >
> > What's between _text and _stext appears to consist of kernel boot-time
> > functions that are unmapped by the time anything like systemtap could
> > run.
>
> Well, no, they're the head code. It's actually used in CPU boot and
> tear down, one of the things it's useful to probe, I think.

Fair enough - conceivably probing that stuff is useful, as is module
initialization. We don't try to do it yet (and indeed kprobes blocks
it all).

In any case, the method of probe address calculation doesn't affect
that issue. We can calculate .init* addresses relative to any
convenient reference in exactly the same way as non-.init addresses.

> > > Assuming you meant _text (which is dangerous because it's a define
> > > in the kernel linker script and could change).
> >
> > By "dangerous" do you only mean that it may require a one-liner
> > catch-up patch in systemtap if the kernel linker scripts change?
>
> Dangerous as in it's not necessarily part of the kernel linker scripts.
> [...]
> The point, really, is to remove some of the fragile dependencies between
> systemtap and the kernel.

Yes, that is generally desirable - each case is usually a question of
cost/benefit. One significant requirement for us is to keep working
with older kernels.

- FChE
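
The idea of anchoring each probe to a convenient reference can be sketched roughly as below. This is not systemtap's actual data model: the ProbePoint type, the resolve helper, the "kernel" pseudo-module name and every load address are hypothetical and invented for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class ProbePoint(NamedTuple):
    """A probe location stored relative to a reference, not as an absolute address."""
    module: str   # "kernel" stands for vmlinux here (a naming assumption)
    section: str  # reference section, e.g. ".text" or ".init.text"
    offset: int   # byte offset from the start of that section


def resolve(probe: ProbePoint, bases: Dict[Tuple[str, str], int]) -> int:
    """Turn (module, section, offset) into a run-time address."""
    key = (probe.module, probe.section)
    if key not in bases:
        raise KeyError(f"no load address known for {key}")
    return bases[key] + probe.offset


# Invented load addresses; on a live system they would be discovered at run time.
BASES = {
    ("kernel", ".text"): 0xFFFFFFFF80200000,
    ("nfsd", ".text"): 0xFFFFFFFFA0010000,
    ("nfsd", ".init.text"): 0xFFFFFFFFA0030000,
}

for probe in (ProbePoint("nfsd", ".text", 0x6A),
              ProbePoint("nfsd", ".init.text", 0x10)):
    print(probe, hex(resolve(probe, BASES)))
```

In this sketch the same calculation handles .init and non-.init sections; only the reference changes.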

2008-07-17 22:05:40

by Masami Hiramatsu

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Frank Ch. Eigler wrote:
> Hi -
>
>
> On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:
>
>> [...]
>>> My point is that the proposed effort to identify a nearby function
>>> symbol to use as a base for each probe's symbol+offset calculation is
>>> wasted.
>> It's not exactly wasted ... the calculations have to be done anyway for
>> modules.
>
> Not really - we just anchor off a different (per-module) reference
> symbol or address. At the moment, we use the .text* section bases.
>
>
>>>> you've lost access to the symbols in the sections that start before _stext.
>>> What's between _text and _stext appears to consist of kernel boot-time
>>> functions that are unmapped by the time anything like systemtap could
>>> run.
>> Well, no, they're the head code. It's actually used in CPU boot and
>> tear down, one of the things it's useful to probe, I think.
>
> Fair enough - conceivably probing that stuff is useful, as is module
> initialization. We don't try to do it yet (and indeed kprobes blocks
> it all).
>
> In any case, the method of probe address calculation doesn't affect
> that issue. We can calculate .init* addresses relative to any
> convenient reference in exactly the same way as non-.init addresses.
>
>
>>>> Assuming you meant _text (which is dangerous because it's a define
>>>> in the kernel linker script and could change).
>>> By "dangerous" do you only mean that it may require a one-liner
>>> catch-up patch in systemtap if the kernel linker scripts change?
>> Dangerous as in it's not necessarily part of the kernel linker scripts.
>> [...]
>> The point, really, is to remove some of the fragile dependencies between
>> systemtap and the kernel.
>
> Yes, that is generally desirable - each case is usually a question of
> cost/benefit. One significant requirement for us is to keep working
> with older kernels.

Hi Frank,

I know we should meet that requirement. However, if we lose support
from developers because we focus too much on it, we'll also
lose the future of systemtap itself. We have to look at the cost/benefit
from a long-term point of view.

Could we separate the systemtap parser/elaborator from the code generator
to support both old and new kernels?

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-07-22 18:00:55

by Rik van Riel

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Thu, 17 Jul 2008 17:33:55 -0400
"Frank Ch. Eigler" <[email protected]> wrote:

> Yes, that is generally desirable - each case is usually a question of
> cost/benefit. One significant requirement for us is to keep working
> with older kernels.

You will have to weigh that against the benefits of making
systemtap generally useful for kernel developers, which
would result in a more active systemtap community and,
eventually, more available scripts and easier end user
functionality.

If a project is not aimed squarely at the developers who
could give the project critical mass, it is essentially
doomed.

--
All rights reversed.

2008-07-22 18:12:47

by Frank Ch. Eigler

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

Hi -

On Tue, Jul 22, 2008 at 02:00:15PM -0400, Rik van Riel wrote:
> [...]
> > Yes, that is generally desirable - each case is usually a question of
> > cost/benefit. One significant requirement for us is to keep working
> > with older kernels.
>
> You will have to weigh that against the benefits of making
> systemtap generally useful for kernel developers [...]

Understood & agreed, Rik. If an issue arises where there is genuine
conflict between kernel-developer-usability and something else, we'll
try to solve it favouring the former if at all possible.

(The kprobes addressing argument cannot reasonably be placed into this
category.)

- FChE

2008-07-22 18:31:25

by Peter Zijlstra

Subject: Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

On Tue, 2008-07-22 at 14:11 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Jul 22, 2008 at 02:00:15PM -0400, Rik van Riel wrote:
> > [...]
> > > Yes, that is generally desirable - each case is usually a question of
> > > cost/benefit. One significant requirement for us is to keep working
> > > with older kernels.
> >
> > You will have to weigh that against the benefits of making
> > systemtap generally useful for kernel developers [...]
>
> Understood & agreed, Rik. If an issue arises where there is genuine
> conflict between kernel-developer-usability and something else, we'll
> try to solve it favouring the former if at all possible.
>
> (The kprobes addressing argument cannot reasonably be placed into this
> category.)

You have your viewpoint inverted: if the kernel developers think you
have a problem, and you fail to address it, they will walk away.

If you want the kernel people to endorse your project, you'll have to
please them. It's that simple. If that means having to radically
re-structure your design, and/or break backwards compatibility then so
be it. Such are the costs for not collaborating from the start.

If you stubbornly refuse to co-operate you'll either break the project
or invite a fork/rewrite by someone else if the idea is deemed
worthwhile enough.

2008-07-23 15:06:22

by Frank Ch. Eigler

Subject: systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs

Hi -

I wrote:

> > [...] One significant requirement for us is to keep working with
> > older kernels. [...]

Maybe it's worth elaborating on why the need for backward
compatibility is different for systemtap than for typical kernel-side
code.

The bulk of systemtap is a user-space program, and it does very
user-spacey things like parsing DWARF, invoking compilers, and running
network servers. Soon it will include user-space libraries. It is so
different from the stuff normally found there that no one has AFAIK
seriously proposed that the entire software be made part of the kernel
git tree. So it is an ordinary separate user-space package, built by
users and distributors.

It does happen to *generate* kernel modules. The way that such a
module must interface with any particular kernel is naturally subject
to the whims & desires of the kernel du jour. This is why we have a
mass of mechanism to try to automatically speak to each kernel version
as appropriate.

It is desirable to minimize this mass for obvious reasons. When a new
upstream kernel comes out with a tasty new feature -- or a less tasty
API rewrite -- we need to extend systemtap to support that too. We
cannot easily take old support away, because then the same user-space
code base would no longer run against actually installed kernels.

To draw an analogy, systemtap is somewhat like low-level userspace
code like glibc or syslogd or udevd. I hope no one would seriously
propose casually committing code to those packages that would make
them unusable on prior kernel versions. Accepting such a patch would
require their maintainers to fork outright every time a kernel change
occurs.

Things are good however if the low-level userspace changes are
backward-compatible, so that the new kernel facility is used when
present, but the software does not regress if it is not. I believe
this is what we need to aim for, even though it puts the bulk of the
burden on systemtap (or glibc, or ...).

I hope this fills in some of the gaps in the background.

- FChE
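
The backward-compatible approach described above (use the new kernel facility when present, fall back otherwise) might look roughly like this sketch. The field names addr, symbol_name and offset are those of the kernel's struct kprobe; the function kprobe_fields, the has_symbol_name flag and the sample address are hypothetical, and how a real tool would detect the capability is deliberately left out.

```python
from typing import Dict, Union


def kprobe_fields(symbol: str, offset: int, address: int,
                  has_symbol_name: bool) -> Dict[str, Union[str, int]]:
    """Choose which struct kprobe fields a generated module would fill in.

    has_symbol_name is assumed to be supplied by some capability check that
    is outside the scope of this sketch.
    """
    if has_symbol_name:
        # Newer path: let the kernel resolve symbol + offset itself.
        return {"symbol_name": symbol, "offset": offset}
    # Fallback path: pass a precomputed absolute address, as before.
    return {"addr": address}


# lookup_bdev+0x90 comes from the probe listing earlier in the thread;
# the absolute address is made up for illustration.
print(kprobe_fields("lookup_bdev", 0x90, 0xFFFFFFFF802B4A90, True))
print(kprobe_fields("lookup_bdev", 0x90, 0xFFFFFFFF802B4A90, False))
```

Both branches describe the same probe, so in this sketch a kernel without the newer facility does not regress.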

2008-07-23 15:29:19

by Arjan van de Ven

Subject: Re: systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs

On Wed, 23 Jul 2008 11:04:34 -0400
"Frank Ch. Eigler" <[email protected]> wrote:

> Hi -
>
> I wrote:
>
> > > [...] One significant requirement for us is to keep working with
> > > older kernels. [...]
>
> Maybe it's worth elaborating on why the need for backward
> compatibility is different for systemtap than for typical kernel-side
> code.
>
> The bulk of systemtap is a user-space program, and it does very
> user-spacey things like parsing dwarf and invoking compilers, running
> network servers. Soon it will include user-space libraries. It is so
> different from the stuff normally found there that no one has AFAIK
> seriously proposed that the entire software be made part of the kernel
> git tree. So it is an ordinary separate user-space package, built by
> users and distributors.

so far so good...

>
> It does happen to *generate* kernel modules. The way that such a
> module must interface with any particular kernel is naturally subject
> to the whims & desires of the kernel du jour. This is why we have a
> mass of mechanism to try to automatically speak to each kernel version
> as appropriate.

and this is where I strongly disagree.
THIS part *has* to be in the kernel source, so that we can change it
WITH the kernel as we change it. If this means that there's some
userland .so code in the kernel source, so be it. If it means we provide
some template files that your userland fills in the blanks for, even
better. (paint-by-number kernel modules!)

But to have any chance at all of systemtap being sustainable, this part
of the stack has to be together with where the changes happen.

>
> It is desirable to minimize this mass for obvious reasons. When a new
> upstream kernel comes out with a tasty new feature -- or a less tasty
> API rewrite -- we need to extend systemtap to support that too.

At that point you are already 3 months too late for me, and probably
for most of my fellow kernel hackers. THIS is exactly what makes
systemtap not usable for kernel hackers, and this is exactly why you
see very few contributions from kernel hackers.
(and when they are seen they get a rather lukewarm reception, but that's a
different story).

It also means that unless I want to package and build systemtap myself,
I have to wait for my OS vendor to think about moving to the kernel I'm
on before I can use systemtap. For me as a kernel developer, that's the
second show stopper already.

> To draw an analogy, systemtap is somewhat like low-level userspace
> code like glibc or syslogd or udevd. I hope no one would seriously
> propose casually committing code to those packages that would make
> them unusable on prior kernel versions. Accepting such a patch would
> require their maintainers to fork outright every time a kernel change
> occurs.

we've discussed pulling udev into the kernel source several times, and
the jury is still out on it.
But systemtap is NOT like udev or glibc or ...
It's a kernel component, or at least the part that generates the kernel
code is. It needs to breathe and move together with the kernel.
>
> I hope this fills in some of the gaps in the background.

it explains where you're coming from, which is good. However I for one
really disagree with the assumption, and I just tried to point out that
the consequences of this are rather dreadful.

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-23 15:33:17

by Peter Zijlstra

Subject: Re: systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs

On Wed, 2008-07-23 at 11:04 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> I wrote:
>
> > > [...] One significant requirement for us is to keep working with
> > > older kernels. [...]
>
> Maybe it's worth elaborating on why the need for backward
> compatibility is different for systemtap than for typical kernel-side
> code.
>
> The bulk of systemtap is a user-space program, and it does very
> user-spacey things like parsing dwarf and invoking compilers, running
> network servers. Soon it will include user-space libraries. It is so
> different from the stuff normally found there that no one has AFAIK
> seriously proposed that the entire software be made part of the kernel
> git tree. So it is an ordinary separate user-space package, built by
> users and distributors.
>
> It does happen to *generate* kernel modules. The way that such a
> module must interface with any particular kernel is naturally subject
> to the whims & desires of the kernel du jour. This is why we have a
> mass of mechanism to try to automatically speak to each kernel version
> as appropriate.
>
> It is desirable to minimize this mass for obvious reasons. When a new
> upstream kernel comes out with a tasty new feature -- or a less tasty
> API rewrite -- we need to extend systemtap to support that too. We
> cannot easily take old support away, because then the same user-space
> code base would no longer run against actually installed kernels.
>
> To draw an analogy, systemtap is somewhat like low-level userspace
> code like glibc or syslogd or udevd. I hope no one would seriously
> propose casually committing code to those packages that would make
> them unusable on prior kernel versions. Accepting such a patch would
> require their maintainers to fork outright every time a kernel change
> occurs.
>
> Things are good however if the low-level userspace changes are
> backward-compatible, so that the new kernel facility is used when
> present, but the software does not regress if it is not. I believe
> this is what we need to aim for, even though it puts the bulk of the
> burden on systemtap (or glibc, or ...).
>
> I hope this fills in some of the gaps in the background.

Why does a new version of stap have to work on ancient kernels?

A new gnome version requires a new gtk version, a new kde version
requires a new qt etc.. so why does a new stap not require a new kernel?

Why isn't only supporting the last few kernels, say for example as far
back as there are -stable series at the moment of release, good enough?

People who insist on running stale kernels are usually the same people
who run stale userspace - we call those enterprise people - so why can't
they run matching stale versions of the kernel and stap?

2008-07-23 20:27:11

by Masami Hiramatsu

Subject: Re: systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs

Hi,

Peter Zijlstra wrote:
> On Wed, 2008-07-23 at 11:04 -0400, Frank Ch. Eigler wrote:
>> Hi -
>>
>> I wrote:
>>
>>>> [...] One significant requirement for us is to keep working with
>>>> older kernels. [...]
>> Maybe it's worth elaborating on why the need for backward
>> compatibility is different for systemtap than for typical kernel-side
>> code.
>>
>> The bulk of systemtap is a user-space program, and it does very
>> user-spacey things like parsing dwarf and invoking compilers, running
>> network servers. Soon it will include user-space libraries. It is so
>> different from the stuff normally found there that no one has AFAIK
>> seriously proposed that the entire software be made part of the kernel
>> git tree. So it is an ordinary separate user-space package, built by
>> users and distributors.
>>
>> It does happen to *generate* kernel modules. The way that such a
>> module must interface with any particular kernel is naturally subject
>> to the whims & desires of the kernel du jour. This is why we have a
>> mass of mechanism to try to automatically speak to each kernel version
>> as appropriate.
>>
>> It is desirable to minimize this mass for obvious reasons. When a new
>> upstream kernel comes out with a tasty new feature -- or a less tasty
>> API rewrite -- we need to extend systemtap to support that too. We
>> cannot easily take old support away, because then the same user-space
>> code base would no longer run against actually installed kernels.
>>
>> To draw an analogy, systemtap is somewhat like low-level userspace
>> code like glibc or syslogd or udevd. I hope no one would seriously
>> propose casually committing code to those packages that would make
>> them unusable on prior kernel versions. Accepting such a patch would
>> require their maintainers to fork outright every time a kernel change
>> occurs.
>>
>> Things are good however if the low-level userspace changes are
>> backward-compatible, so that the new kernel facility is used when
>> present, but the software does not regress if it is not. I believe
>> this is what we need to aim for, even though it puts the bulk of the
>> burden on systemtap (or glibc, or ...).
>>
>> I hope this fills in some of the gaps in the background.
>
> Why does a new version of stap have to work on ancient kernels?
>
> A new gnome version requires a new gtk version, a new kde version
> requires a new qt etc.. so why does a new stap not require a new kernel?
>
> Why isn't only supporting the last few kernels, say for example as far
> back as there are -stable series at the moment of release, good enough?
>
> People who insist on running stale kernels are usually the same people
> who run stale userspace - we call those enterprise people - so why can't
> they run matching stale versions of the kernel and stap?

I agree with you. Currently, systemtap is evolving on a single source
tree, but it is obvious that this development style can't keep up with
upstream development.
I'd like to suggest that we branch the tree: one stable branch for old
kernels, and another that aims to merge into upstream.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]