LinuxLists.cc - [RFC] fix kallsyms to allow discrimination of local symbols

2008-07-21 21:43:32

Subject: [RFC] fix kallsyms to allow discrimination of local symbols

The problem is that local symbols, being hidden from the linker, might
not be unique. Thus, they don't make good anchors for the symbol
relative addressing used by kprobes (it takes the first occurrence it
finds). Likewise, when they appear in stack traces, it's sometimes not
obvious which local symbol it is (although context usually allows an
easy guess).

Fix all of this by prefixing local symbols with the actual C file name
they occur in separated by '|' (I had to use '|' since ':' is already in
use for module prefixes in kallsyms lookups.

I also had to rewrite mksysmap in perl because the necessary text
formatting changes in shell are painfully slow.

Comments?

James

---

diff --git a/Makefile b/Makefile
index 6192922..a416b35 100644
--- a/Makefile
+++ b/Makefile
@@ -685,7 +685,7 @@ quiet_cmd_vmlinux_version = GEN .version

# Generate System.map
quiet_cmd_sysmap = SYSMAP
- cmd_sysmap = $(CONFIG_SHELL) $(srctree)/scripts/mksysmap
+ cmd_sysmap = $(PERL) $(srctree)/scripts/mksysmap

# Link of vmlinux
# If CONFIG_KALLSYMS is set .version is already updated
@@ -759,7 +759,7 @@ endef

# Generate .S file with all kernel symbols
quiet_cmd_kallsyms = KSYM $@
- cmd_kallsyms = $(NM) -n $< | $(KALLSYMS) \
+ cmd_kallsyms = $(NM) -n -l $< | sed "s|`pwd`/||" | $(KALLSYMS) \
$(if $(CONFIG_KALLSYMS_ALL),--all-symbols) > $@

.tmp_kallsyms1.o .tmp_kallsyms2.o .tmp_kallsyms3.o: %.o: %.S scripts FORCE
diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index ad2434b..0badae2 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -63,11 +63,12 @@ static inline int is_arm_mapping_symbol(const char *str)

static int read_symbol(FILE *in, struct sym_entry *s)
{
- char str[500];
+ char str[500], file[500];
char *sym, stype;
- int rc;
+ int rc, c, line;

- rc = fscanf(in, "%llx %c %499s\n", &s->addr, &stype, str);
+ file[0] = '\0';
+ rc = fscanf(in, "%llx %c %499s", &s->addr, &stype, str);
if (rc != 3) {
if (rc != EOF) {
/* skip line */
@@ -75,6 +76,12 @@ static int read_symbol(FILE *in, struct sym_entry *s)
}
return -1;
}
+ c = fgetc(in);
+ if (c != '\n') {
+ rc = fscanf(in, "%499[^:]:%d\n", file, &line);
+ if (rc != 2)
+ file[0] = '\0';
+ }

sym = str;
/* skip prefix char */
@@ -115,13 +122,22 @@ static int read_symbol(FILE *in, struct sym_entry *s)
/* include the type field in the symbol name, so that it gets
* compressed together */
s->len = strlen(str) + 1;
+ if (islower(stype))
+ s->len += strlen(file) + 1;
s->sym = malloc(s->len + 1);
if (!s->sym) {
fprintf(stderr, "kallsyms failure: "
"unable to allocate required amount of memory\n");
exit(EXIT_FAILURE);
}
- strcpy((char *)s->sym + 1, str);
+ if (islower(stype)) {
+ char *ss = (char *)s->sym + 1;
+
+ strcpy(ss, file);
+ strcat(ss, "|");
+ strcat(ss, str);
+ } else
+ strcpy((char *)s->sym + 1, str);
s->sym[0] = stype;

return 0;
diff --git a/scripts/mksysmap b/scripts/mksysmap
index 6e133a0..496cadd 100644
--- a/scripts/mksysmap
+++ b/scripts/mksysmap
@@ -1,4 +1,4 @@
-#!/bin/sh -x
+#!/usr/bin/perl
# Based on the vmlinux file create the System.map file
# System.map is used by module-init tools and some debugging
# tools to retrieve the actual addresses of symbols in the kernel.
@@ -41,5 +41,21 @@
# so we just ignore them to let readprofile continue to work.
# (At least sparc64 has __crc_ in the middle).

-$NM -n $1 | grep -v '$ [aNUw] $\|$__crc_$\|$ \$[adt]$' > $2
+chomp($cwd = `pwd`);
+open(I, "nm -n -l $ARGV[0]|") || die;
+open(O, ">$ARGV[1]") || die;
+foreach(<I>) {
+ chomp;
+ ($addr, $type, $symbol, $file_and_line) = split(/[ ]/, $_);
+ next if ($type =~ /[aNUw]/ || $type =~ /\$[adt]/);
+ next if ($symbol=~ /__crc_/);
+ if ($type =~ /[a-z]/ && $file_and_line) {
+ ($_) = split(/:/, $file_and_line);
+ (undef, $file) = split(/^$cwd\//, $_);
+ $symbol = $file."|".$symbol;
+ }
+ print O "$addr $type $symbol\n";
+}
+
+

2008-07-22 01:46:19

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

James Bottomley <[email protected]> writes:

> [...] Fix all of this by prefixing local symbols with the actual C
> file name they occur in separated by '|' (I had to use '|' since ':'
> is already in use for module prefixes in kallsyms lookups. [...]
> Comments?

Can we take some time to review how we got here?

- You disprefer systemtap's use of an established, non-deprecated API
for placing kernel probes. (We calculate addresses by a mixture of
elf-analysis and runtime user-space lookup means. That's partly
since kallsyms_lookup was unexported over our objections.) There is
nothing outright broken (e.g. incorrect numbers) with what systemtap
has been doing for years.

- You argue that symbols+offset kprobing is better. We can see that,
in some sense, but ...

- I explain that we are used to final address calculating, as we'll
have to do that regardless for user-space probes. Plus we need to
work with kernels that predate the symbol+offset kprobe api
extension. So this change would not simplify systemtap in any way.
You do not respond.

- I offer _stext+offset (for the kernel) and (.text*)+offset (for
modules) kprobes: basically to use the "better" symbol+offset
kprobes api, but use the same single reference addresses we already
do, and leaving just the final addition to the kernel. You do not
respond materially.

- You argue that it cannot only be any symbol+offset ... but the actual
nearest symbol+offset. But that doesn't work for local symbols. So
you fix that to the nearest globally visible symbol+offset. But this
requires:
- yet more extra work and code from systemtap
- extension to the kernel build system, and kallsyms runtime data to
fix the current local-symbol-ambiguity problem
- storage of all that new file name data in permanent unswappable
kernel data (>>100kB, if done simply prefixing local symbol names
file file names).
- possible further complications related to filename string matching

- You have yet to invent a scheme to allow offloading *data* address
calculations to the kernel. Without that (and perhaps more),
systemtap will *still* have to fetch same base _stext etc.
addresses at run time that it currently does -- even if it did not
use them to compute kprobes addresses.

In total, this path would end up with both systemtap and the kernel
more complex, larger and a bit slower too. Does that still seem an
acceptable cost, just to get systemtap to change its preferred kprobes
api?

- FChE

2008-07-22 03:53:22

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Mon, 2008-07-21 at 21:44 -0400, Frank Ch. Eigler wrote:
> James Bottomley <[email protected]> writes:
>
> > [...] Fix all of this by prefixing local symbols with the actual C
> > file name they occur in separated by '|' (I had to use '|' since ':'
> > is already in use for module prefixes in kallsyms lookups. [...]
> > Comments?
>
> Can we take some time to review how we got here?
>
>
> - You disprefer systemtap's use of an established, non-deprecated API
> for placing kernel probes. (We calculate addresses by a mixture of
> elf-analysis and runtime user-space lookup means. That's partly
> since kallsyms_lookup was unexported over our objections.) There is
> nothing outright broken (e.g. incorrect numbers) with what systemtap
> has been doing for years.

You mean embedding half a megabyte of symbols simply so you can avoid
the inconvenience of using a kernel API? yes, I think it's ...
suboptimal.

> - You argue that symbols+offset kprobing is better. We can see that,
> in some sense, but ...
>
> - I explain that we are used to final address calculating, as we'll
> have to do that regardless for user-space probes. Plus we need to
> work with kernels that predate the symbol+offset kprobe api
> extension. So this change would not simplify systemtap in any way.
> You do not respond.

There is no current userspace infrastructure, since utrace still isn't
in the kernel, so you're predicating this argument on an event which
hasn't happened.

Even assuming utrace is accepted, embedding the symbol table of every
user space process in the probes is still daft. It's this constant
assertion that "it must be done my way" that's causing such a drag on
the open source process. For instance, the obvious way to me of doing
this would be to map the user space stack into the systemtap runtime and
unwind it from there instead of vectoring it into the kernel.

> - I offer _stext+offset (for the kernel) and (.text*)+offset (for
> modules) kprobes: basically to use the "better" symbol+offset
> kprobes api, but use the same single reference addresses we already
> do, and leaving just the final addition to the kernel. You do not
> respond materially.

I thought this and subsequent emails addressed the points pretty well:

http://marc.info/?l=linux-kernel&m=121632572409118

> - You argue that it cannot only be any symbol+offset ... but the actual
> nearest symbol+offset. But that doesn't work for local symbols. So
> you fix that to the nearest globally visible symbol+offset. But this
> requires:
> - yet more extra work and code from systemtap

I'm afraid that's how open source development works ... you iterate to
find the best solution

> - extension to the kernel build system, and kallsyms runtime data to
> fix the current local-symbol-ambiguity problem

Finding weaknesses in APIs and fixing them is what it's all about.

> - storage of all that new file name data in permanent unswappable
> kernel data (>>100kB, if done simply prefixing local symbol names
> file file names).

I'd check my facts before making assertions. The kernel symbol table is
stored in a compressed form that actually eliminates most of these
repetitions.

> - possible further complications related to filename string matching

Any substantiation of that?

> - You have yet to invent a scheme to allow offloading *data* address
> calculations to the kernel. Without that (and perhaps more),
> systemtap will *still* have to fetch same base _stext etc.
> addresses at run time that it currently does -- even if it did not
> use them to compute kprobes addresses.

That would be because I haven't actually started looking at this one
yet. Of course, that would make it a great starting point for others
who wished to help.

> In total, this path would end up with both systemtap and the kernel
> more complex, larger and a bit slower too.

Really? I count the reduction of the probe modules from 500kb to 50kb a
worthwhile saving. I don't even see where anything became larger.

> Does that still seem an
> acceptable cost, just to get systemtap to change its preferred kprobes
> api?

James

2008-07-22 11:52:50

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

Hi -

On Mon, Jul 21, 2008 at 10:53:08PM -0500, James Bottomley wrote:
> [...]
> > - You disprefer systemtap's use of an established, non-deprecated API
> > for placing kernel probes. [...]
>
> You mean embedding half a megabyte of symbols simply so you can avoid
> the inconvenience of using a kernel API? yes, I think it's ...
> suboptimal.

It has been explained already that the symbol table you saw in
stap-symbols.h has nothing to do with the kprobe addressing issue.

http://www.gossamer-threads.com/lists/linux/kernel/947365#957365

> > - You argue that symbols+offset kprobing is better. We can see that,
> > in some sense, but ...
> >
> > - I explain that we are used to final address calculating, as we'll
> > have to do that regardless for user-space probes. Plus we need to
> > work with kernels that predate the symbol+offset kprobe api
> > extension. So this change would not simplify systemtap in any way.
> > You do not respond.
>
> There is no current userspace infrastructure, since utrace still isn't
> in the kernel, so you're predicating this argument on an event which
> hasn't happened.

We exercise professional foresight. And the backward compatibility
issue remains even without that.

> Even assuming utrace is accepted, embedding the symbol table of
> every user space process in the probes is still daft. [....]

It would take space, no question, though we're not talking about
"every" process, just designated ones.

> For instance, the obvious way to me of doing this would be to map
> the user space stack into the systemtap runtime and unwind it from
> there instead of vectoring it into the kernel.

Please elaborate. What does mapping a stack into the runtime mean?
Do you mean to suggest having the userspace program unwind itself? Or
relying on the userspace programs' possibly-paged-out unwind data?
That would be intrusive.

> > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
> > modules) kprobes: basically to use the "better" symbol+offset
> > kprobes api, but use the same single reference addresses we already
> > do, and leaving just the final addition to the kernel. You do not
> > respond materially.
>
> I thought this and subsequent emails addressed the points pretty well:
>
> http://marc.info/?l=linux-kernel&m=121632572409118

No, they didn't. Every time I explained about how it does work, you
just claimed "not", without even a single worked-out substantiating
example.

> [...]
> > - storage of all that new file name data in permanent unswappable
> > kernel data (>>100kB, if done simply prefixing local symbol names
> > file file names).
>
> I'd check my facts before making assertions. The kernel symbol table is
> stored in a compressed form that actually eliminates most of these
> repetitions.

A careful reader will notice the "if" in my sentence. Anyway, that
or a superior compression scheme could apply to systemtap's various
tables too.

> > - possible further complications related to filename string matching
>
> Any substantiation of that?

We have had reported problems with differences between kernels
hand-built with long absolute source path names versus the smallest
"kernel/foo.c" names. If such canonicalization takes place but
inconsistently by the different tools, we will have a problem.

> [...]
> > In total, this path would end up with both systemtap and the kernel
> > more complex, larger and a bit slower too.
>
> Really? I count the reduction of the probe modules from 500kb to 50kb a
> worthwhile saving.

The red clupea harengus again.

> I don't even see where anything became larger.

Even with ksymtab compression, there is still new data to be stored in
the kernel, and it is extra for each systemtap probe datum.

> > Does that still seem an acceptable cost, just to get systemtap to
> > change its preferred kprobes api?

> [no answer]

Indeed.

- FChE

2008-07-22 15:14:56

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Tue, 2008-07-22 at 07:51 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Mon, Jul 21, 2008 at 10:53:08PM -0500, James Bottomley wrote:
> > [...]
> > > - You disprefer systemtap's use of an established, non-deprecated API
> > > for placing kernel probes. [...]
> >
> > You mean embedding half a megabyte of symbols simply so you can avoid
> > the inconvenience of using a kernel API? yes, I think it's ...
> > suboptimal.
>
> It has been explained already that the symbol table you saw in
> stap-symbols.h has nothing to do with the kprobe addressing issue.
>
> http://www.gossamer-threads.com/lists/linux/kernel/947365#957365

You're confusing issues. I said embedding half a megabyte of symbol
table that the kernel already has is a bad idea full stop. The ultimate
think I'm looking to do is to evolve kernel APIs that makes this
practice unnecessary.

> > > - You argue that symbols+offset kprobing is better. We can see that,
> > > in some sense, but ...
> > >
> > > - I explain that we are used to final address calculating, as we'll
> > > have to do that regardless for user-space probes. Plus we need to
> > > work with kernels that predate the symbol+offset kprobe api
> > > extension. So this change would not simplify systemtap in any way.
> > > You do not respond.
> >
> > There is no current userspace infrastructure, since utrace still isn't
> > in the kernel, so you're predicating this argument on an event which
> > hasn't happened.
>
> We exercise professional foresight. And the backward compatibility
> issue remains even without that.

No ... you're trying to constrain the open source process to a
pre-conceived design which is unrealised by in-kernel code. This is
directly producing an impasse.

Backwards compatibility is the province of distros. The job of upstream
is to produce the best tool and environment we can. After this, the
backwards portability of desirable pieces can be discussed.

> > Even assuming utrace is accepted, embedding the symbol table of
> > every user space process in the probes is still daft. [....]
>
> It would take space, no question, though we're not talking about
> "every" process, just designated ones.
>
> > For instance, the obvious way to me of doing this would be to map
> > the user space stack into the systemtap runtime and unwind it from
> > there instead of vectoring it into the kernel.
>
> Please elaborate. What does mapping a stack into the runtime mean?

It means that the systemtap runtime and the process would share a
mapping for the process stacks Obviously the process would have to be
quiesced to poke about in it, but it obviates the need to vector
megabytes of stack information through the kernel.

> Do you mean to suggest having the userspace program unwind itself?

It's possible ... but more likely that the stap runtime would do the
unwinding. Which is more efficient won't really be known until someone
actually tries coding it.

> Or relying on the userspace programs' possibly-paged-out unwind data?
> That would be intrusive.

I think you'll find doing it in user space is an advantage for paged out
data. It's much more complex to get to it in the kernel because you
have to be careful of context while you're asking for it to be paged
back in.

> > > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
> > > modules) kprobes: basically to use the "better" symbol+offset
> > > kprobes api, but use the same single reference addresses we already
> > > do, and leaving just the final addition to the kernel. You do not
> > > respond materially.
> >
> > I thought this and subsequent emails addressed the points pretty well:
> >
> > http://marc.info/?l=linux-kernel&m=121632572409118
>
> No, they didn't. Every time I explained about how it does work, you
> just claimed "not", without even a single worked-out substantiating
> example.

Really? The mutability of _stext vs _text; the problem probing init
sections I think they're real issues.

> > [...]
> > > - storage of all that new file name data in permanent unswappable
> > > kernel data (>>100kB, if done simply prefixing local symbol names
> > > file file names).
> >
> > I'd check my facts before making assertions. The kernel symbol table is
> > stored in a compressed form that actually eliminates most of these
> > repetitions.
>
> A careful reader will notice the "if" in my sentence. Anyway, that
> or a superior compression scheme could apply to systemtap's various
> tables too.
>
>
> > > - possible further complications related to filename string matching
> >
> > Any substantiation of that?
>
> We have had reported problems with differences between kernels
> hand-built with long absolute source path names versus the smallest
> "kernel/foo.c" names. If such canonicalization takes place but
> inconsistently by the different tools, we will have a problem.

What it currently does is add tree relative names, so, for example this
is a cut and paste from System.map:

c0100000 T _text
c0100000 T startup_32
c0100054 t arch/x86/kernel/head_32.S|default_entry
c01000a8 T startup_32_smp
c010012a t arch/x86/kernel/head_32.S|checkCPUtype
c01001ab t arch/x86/kernel/head_32.S|is486
c01001b2 t arch/x86/kernel/head_32.S|is386
c0100220 T initial_code
c0100224 t arch/x86/kernel/head_32.S|check_x87

There's no ambiguity about how the path is constructed.

> > [...]
> > > In total, this path would end up with both systemtap and the kernel
> > > more complex, larger and a bit slower too.
> >
> > Really? I count the reduction of the probe modules from 500kb to 50kb a
> > worthwhile saving.
>
> The red clupea harengus again.

only if alio populo leges tantum adhiberent

> > I don't even see where anything became larger.
>
> Even with ksymtab compression, there is still new data to be stored in
> the kernel, and it is extra for each systemtap probe datum.

I think the requirement that your huge data problem be solved at
absolutely no cost is possibly a bit ambitious. However, it can be done
at little cost to the kernel.

> > > Does that still seem an acceptable cost, just to get systemtap to
> > > change its preferred kprobes api?
>
> > [no answer]
>
> Indeed.

Perhaps because that's the wrong question. It's not about trying to
change a "preferred" API (and the concept of preferred means according
to your preset design). It's about trying to get systemtap actually to
engage with the kernel and iterate to an actual solution.

James

2008-07-22 16:07:13

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

James Bottomley <[email protected]> writes:

> [...]
>> > [...]
>> > > - You disprefer systemtap's use of an established, non-deprecated API
>> > > for placing kernel probes. [...]
>> >
>> > You mean embedding half a megabyte of symbols simply so you can avoid
>> > the inconvenience of using a kernel API? yes, I think it's ...
>> > suboptimal.
>>
>> It has been explained already that the symbol table you saw in
>> stap-symbols.h has nothing to do with the kprobe addressing issue.
>> [...]
>
> You're confusing issues. I said embedding half a megabyte of symbol
> table that the kernel already has is a bad idea full stop.

Be that as it may, but it is not appropriate to knowingly bring up a
topic that is irrelevant to the patch series you're actually proposing.

> The ultimate think I'm looking to do is to evolve kernel APIs that
> makes this practice unnecessary.

If by "this practice" you mean "stap-symbols.h tables", then you're
worrying on the wrong area of code (anything in/near kprobes). I am
happy to suggest more appropriate areas to help with that.

>> > There is no current userspace infrastructure, since utrace still isn't
>> > in the kernel, so you're predicating this argument on an event which
>> > hasn't happened.
>>
>> We exercise professional foresight. And the backward compatibility
>> issue remains even without that.
>
> No ... you're trying to constrain the open source process
> to a pre-conceived design which is unrealised by in-kernel code.

The "open source process" does not entail welcoming unnecessary changes.

> This is directly producing an impasse.

I don't recognize an impasse. This aspect of systemtap has been
working just fine with the kernel as has been. We are not blocked on
a resolution to the current issue.

>> [...]
>> > For instance, the obvious way to me of doing this would be to map
>> > the user space stack into the systemtap runtime and unwind it from
>> > there instead of vectoring it into the kernel.
>>
>> Please elaborate. What does mapping a stack into the runtime mean?
>
> It means that the systemtap runtime and the process would share a
> mapping for the process stacks Obviously the process would have to be
> quiesced to poke about in it, but it obviates the need to vector
> megabytes of stack information through the kernel.

I still don't understand. No one has seriously proposed "vectoring
megabytes of stack information through the kernel", if by that you
mean copying it all to userspace. Even frame-pointer-based dtrace
doesn't do that.

Accessing the process stack from systemtap modules is not the problem
in the first place: rather, it's instant access to the unwind & symbol
data needed to decode it.

>> Do you mean to suggest having the userspace program unwind itself?
>
> It's possible ... but more likely that the stap runtime would do the
> unwinding. Which is more efficient won't really be known until someone
> actually tries coding it.

It's not just a matter of efficiency but a matter of non-disruptiveness.

>> Or relying on the userspace programs' possibly-paged-out unwind data?
>> That would be intrusive.
>
> I think you'll find doing it in user space is an advantage for paged out
> data. It's much more complex to get to it in the kernel because you
> have to be careful of context while you're asking for it to be paged
> back in.

We would not ask for anything to be paged back in. That is one kind
of disruption to system state that we strain to avoid.

>> > > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
>> > > modules) kprobes [...]
>> > I thought this and subsequent emails addressed the points pretty well:
>> > http://marc.info/?l=linux-kernel&m=121632572409118
>>
>> No, they didn't. Every time I explained about how it does work, you
>> just claimed "not", without even a single worked-out substantiating
>> example.
>
> Really? The mutability of _stext vs _text; the problem probing init
> sections I think they're real issues.

Then please do me the courtesy of replying to the points already made
on those items.

(Recap: we don't care whether the vmlinux reference symbol is _text or
_stext or something else. Whatever works, and so far _stext does
everywhere we've needed it.)

For init sections in general, they are irrelevant for systemtap at
this time as kprobes blocks them, and we'd need extra kernel/module.c
infrastructure to let us hook module-loaded init-not-yet-run events
(and the corresponding exit transitions). Once those are solved,
systemtap can trivially calculate .init addresses relative to _stext
or whatever -- and it does today!)

So, they do not appear to be real issues.

>[...]
>> We have had reported problems with differences between kernels
>> hand-built with long absolute source path names versus the smallest
>> "kernel/foo.c" names. If such canonicalization takes place but
>> inconsistently by the different tools, we will have a problem.
>
> What it currently does is add tree relative names, so, for example this
> is a cut and paste from System.map: [...]
> There's no ambiguity about how the path is constructed.

Perhaps, as long as the dwarf-parsing side canonicalizes them the same way.

>> Even with ksymtab compression, there is still new data to be stored in
>> the kernel, and it is extra for each systemtap probe datum.
>
> I think the requirement that your huge data problem be solved at
> absolutely no cost is possibly a bit ambitious.

What "huge data problem"? Forget the stap-symbols.h already.

>> > > Does that still seem an acceptable cost, just to get systemtap to
>> > > change its preferred kprobes api?
>> > [no answer]
>> Indeed.
>
> Perhaps because that's the wrong question. It's not about trying to
> change a "preferred" API (and the concept of preferred means according
> to your preset design).

Preferred, as in working, stable, not in any way deprecated, backward
compatible -- and more. It ain't broke and it don't need fixin'.

> It's about trying to get systemtap actually to engage with the
> kernel and iterate to an actual solution.

It is unfortunate that this technical trivium wants to turn into jabs
about "engaging with the kernel" or "constraining the open source
process" or something. Let the idea stand on its own merits.

- FChE

2008-07-23 02:43:00

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Tue, Jul 22, 2008 at 12:05:23PM -0400, Frank Ch. Eigler wrote:
> > You're confusing issues. I said embedding half a megabyte of symbol
> > table that the kernel already has is a bad idea full stop.
>
> Be that as it may, but it is not appropriate to knowingly bring up a
> topic that is irrelevant to the patch series you're actually proposing.
>
> > The ultimate think I'm looking to do is to evolve kernel APIs that
> > makes this practice unnecessary.
>
> If by "this practice" you mean "stap-symbols.h tables", then you're
> worrying on the wrong area of code (anything in/near kprobes). I am
> happy to suggest more appropriate areas to help with that.

I think there is a certain amount of talking past each other which is
going on here.

Let me see if I can summarize what is going on. My understanding of
the situation is as follows:

1) Right now, when trying to resolve the address to install a kprobe,
Systemtap is using the DWARF information from the vmlinux file and/or
the module's debuginfo files.

2) Kprobe can support setting probes based on raw addresses (which is
what Systemtap is doing now) as well as a text symbol looked up from
kallsyms plus an offset. The latter was introduced two years ago, but
Systemtap does not take advantage of it.

3) James is offended by the fact that Systemtap is not utilizing this
interface, and has offerred up patches which adds the capability of
using the symbol+offset feature of the kprobes interface. There seems
to be some question as to whether his patch has all of the
functionality of the existing code which determines addresses via
pawing through DWARF headers (especially where a function may have
been inlined one or more times) but presumably James' patches could be
enhanced to deal with this issues, if he hasn't done so already.

Am I missing anything?

So the main arguments against this seems to be that it by itself this
doesn't actually reduce any complexity or code in Systemtap because
there are other places (kernel data segment, user space tracing if it
ever manages to get merged into mainline, etc.) which needs to be able
to paw through DWARF headers.

James' argument that this Systemtap is cramming over half a megabyte
is regarded by is somewhat of a red herring with respect to this
specific patch, since systemtap does not calculate probe addresses at
runtime. This 600k or so of symbol tables is being used for something
else. (What, exactly?!?) James clearly in the long term wants to
make this go away, which seems reasonable. He views changing how
kprobes are placed is the first in a series of changes that he would
like to make. So perhaps the resistance in accepting his first patch
troubles him since it appears that Systemtap folks aren't willing to
take his suggestions about how things should be improved.

May I make a suggestion and not try to come to a conclusion about the
big picture question for a moment and focus on the very short-term of
whether it is better that when I implement a probe such as:

probe module("ext4dev").function("ext4_fill_super")
{
printk("here am I!\n");
}

This should be done via passing a hard-coded address, or via using the
kprobes function+offset facility? It would seem that there are
advantages to James' patch all by itself, in that it will will work
even if the debuginfo information for the ext4dev module can't be
found, since the kallsyms information would be used instead. It also
means that the resulting systemtap probe modules will be easier to
make more kernel independent, since it won't be using a hard-coded
address, but rather a symbolic name.

So what is the good reason *not* to do things the way James has
suggested? The kprobes kallsyms facility has been around for a while
now. Is it that we need to make changes?

Maybe if things are focused on somewhat more concrete questions, we
can make progress.

Regards,

- Ted

2008-07-23 04:18:08

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

Hi, Ted -

On Tue, Jul 22, 2008 at 09:48:04PM -0400, Theodore Tso wrote:

> [...]
> 1) Right now, when trying to resolve the address to install a kprobe,
> Systemtap is using the DWARF information from the vmlinux file and/or
> the module's debuginfo files.

Or as a fallback, it can use textual symbol tables, such as a
System.map or /proc/kallsyms dump.

> 2) Kprobe can support setting probes based on raw addresses (which is
> what Systemtap is doing now) as well as a text symbol looked up from
> kallsyms plus an offset. The latter was introduced two years ago, but
> Systemtap does not take advantage of it.

Right.

> 3) James is offended by the fact that Systemtap is not utilizing this
> interface, and has offerred up patches which adds the capability of
> using the symbol+offset feature of the kprobes interface. [...]
>
> Am I missing anything?

I also proposed a compromise where systemtap would use the
symbol+offset interface, but choose a single convenient symbol as base
for all probes in a particular elf file (/section).

> So the main arguments against this seems to be that it by itself
> this doesn't actually reduce any complexity or code in Systemtap
> because there are other places (kernel data segment, user space
> tracing if it ever manages to get merged into mainline, etc.) which
> needs to be able to paw through DWARF headers.

Right.

> James' argument that this Systemtap is cramming over half a megabyte
> is regarded by is somewhat of a red herring with respect to this
> specific patch, since systemtap does not calculate probe addresses
> at runtime.

Or more precisely, not from that table.

> This 600k or so of symbol tables is being used for something else.
> (What, exactly?!?)

They are there to enable scripts to perform address-to-symbol and
symbol-to-address mappings on demand. This comes up in several
contexts, one of which is symbolic stack unwinding, another is a set
of explicit utility functions that are/will be available.

It is possible that through analysis of any particular script's
contents, systemtap will be able to determine that this data is
unused, and thus compile it out.

> James clearly in the long term wants to make this go away, which
> seems reasonable.

Right, though this is an active development area whose ultimate
costs/benefits we do not yet know.

> He views changing how kprobes are placed is the first in a series of
> changes that he would like to make.

Right, keeping in mind that this change does not advance that prior
goal.

> So perhaps the resistance in accepting his first patch troubles him
> since it appears that Systemtap folks aren't willing to take his
> suggestions about how things should be improved.

Actually, this is not James' first patch; I am grateful for several
that are in systemtap already and others are in the queue. But point
taken.

> May I make a suggestion and not try to come to a conclusion about the
> big picture question for a moment and focus on the very short-term of
> whether it is better that when I implement a probe such as:
>
> probe module("ext4dev").function("ext4_fill_super")
> {
> printk("here am I!\n");
> }

Sure.

> This should be done via passing a hard-coded address, or via using
> the kprobes function+offset facility? It would seem that there are
> advantages to James' patch all by itself, in that it will will work
> even if the debuginfo information for the ext4dev module can't be
> found, since the kallsyms information would be used instead.

As a quality-of-implementation matter, systemtap checks at translation
time that such probes make sense -- that "ext4_fill_super" even
exists. (That is needed also to expand wildcards.) The only way it
can do that is if it has dwarf or separate textual symbol table data
(see above). Both of those carry addresses as well, so we might as
well use them.

> It also means that the resulting systemtap probe modules will be
> easier to make more kernel independent, since it won't be using a
> hard-coded address, but rather a symbolic name.

This is true, but I think it would apply to only a small class of
scripts. Such a script would have to access no $context variables
(since those require dwarf location/type data), and cannot include
statement probes at function interiors (since those would require
dwarf-derived non-zero symbol offsets), and cannot use any other
runtime facility that involves arbitrary symbol<->address lookups.

Even then, kernel-independence in the sense of being able to copy and
reuse a compiled probe module amongst separate kernel builds seems
like a faint hope anyway, considering modversions and our own internal
version-matching checks. But maybe there is an opportunity here.

> So what is the good reason *not* to do things the way James has
> suggested? The kprobes kallsyms facility has been around for a
> while now. Is it that we need to make changes?

One reason is that James' proposed code breaks systemtap for
pre-kallsyms-kprobes kernels, and those kernels too that have
kallsyms-kprobes but not the RFC'd new one that has the source file
names encoded within them. It could instead use e.g. our autoconf*
facility to generate code compatible with them all.

Another reason is that it likely breaks systemtap for impending code
for user-space by replacing rather than extending various internal
data structures that deal with this.

I have seen little sympathy expressed for either of these concerns,
which means that the new code would primarily allay some offense (but
not constitute a bug fix or "usability for kernel developers" matter),
and leave us to pick up the pieces. We can't make a habit out of that
sort of thing, but maybe as a one-off in the interest of mutual
goodwill, we should work out a way to get it done.

> Maybe if things are focused on somewhat more concrete questions, we
> can make progress.

Thank you.

- FChE

2008-07-23 16:25:26

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Wed, Jul 23, 2008 at 12:16:12AM -0400, Frank Ch. Eigler wrote:
> I also proposed a compromise where systemtap would use the
> symbol+offset interface, but choose a single convenient symbol as base
> for all probes in a particular elf file (/section).

I guess I don't see the value of that over just using the address
directly. James' point wasn't just to use the symbol+offset feature
just for the sake of using it, but rather as a better way of
specifying how to insert a probe into a kernel. For example, it may
be that by allowing the kernel to have a bit more semantic knowledge
of where a probe is going, it could more easily do various
safety-related checks that can't be done if all it is given is a a raw
address.

> > probe module("ext4dev").function("ext4_fill_super")
> > {
> > printk("here am I!\n");
> > }
>
> > This should be done via passing a hard-coded address, or via using
> > the kprobes function+offset facility? It would seem that there are
> > advantages to James' patch all by itself, in that it will will work
> > even if the debuginfo information for the ext4dev module can't be
> > found, since the kallsyms information would be used instead.
>
> As a quality-of-implementation matter, systemtap checks at translation
> time that such probes make sense -- that "ext4_fill_super" even
> exists. (That is needed also to expand wildcards.) The only way it
> can do that is if it has dwarf or separate textual symbol table data
> (see above). Both of those carry addresses as well, so we might as
> well use them.

True, though I'll note for modern kernels, with /proc/kallsyms, we
should hopefully be able to do this (offset=0 probes) without DWARF
headers. BTW, one of the things which I have wondered is whether
DWARF was really the right approach after all, given how bloated and
space-inefficient it seems to be. Needing to keep 600 megs of module
debuginfo for each kernel that I build (and yes, I build a fairly
complete module-rich kernel, but having 59 megs of modules expand into
606 megs of debuginfo files is more than a little bit obnoxious.)

Maybe the kernel could be one of the places where we experiment with
using something like CTF. Apparently CTF is compact enough that the
Solaris folks are willing to keep the CTF information for the
currently booted kernel in unswappable kernel memory --- which I know
is one of those things we wouldn't want to do with DWARF information!

> I have seen little sympathy expressed for either of these concerns,
> which means that the new code would primarily allay some offense (but
> not constitute a bug fix or "usability for kernel developers" matter),
> and leave us to pick up the pieces. We can't make a habit out of that
> sort of thing, but maybe as a one-off in the interest of mutual
> goodwill, we should work out a way to get it done.

This may have been a communication disconnect. I suspect if the
answer was, "please make the following changes and we will accept the
patch", that James may have been willing to make the changes ---
especially if it is for the sake of backwards compatibility with older
kernels. Most kernel developers do care very much about backwards
compatibility between userspace and older kernel versions. (Some of
the folks who weren't as careful have been flamed pretty heavily about
that; I'm still not a great fan of the sysfs interface from a
userspace usability perspective, but we haven't had a major breakage
there in quite a while.)

So there is a big difference between "please do X, Y, and Z first and
the patch would then be acceptable" versus "for reasons A, B and C
this patch is totally unacceptable and in fact what you are trying to
do is pointless".

Regards,

- Ted

2008-07-23 16:41:23

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Wed, 2008-07-23 at 12:25 -0400, Theodore Tso wrote:
> On Wed, Jul 23, 2008 at 12:16:12AM -0400, Frank Ch. Eigler wrote:
> > I also proposed a compromise where systemtap would use the
> > symbol+offset interface, but choose a single convenient symbol as base
> > for all probes in a particular elf file (/section).
>
> I guess I don't see the value of that over just using the address
> directly. James' point wasn't just to use the symbol+offset feature
> just for the sake of using it, but rather as a better way of
> specifying how to insert a probe into a kernel. For example, it may
> be that by allowing the kernel to have a bit more semantic knowledge
> of where a probe is going, it could more easily do various
> safety-related checks that can't be done if all it is given is a a raw
> address.
>
> > > probe module("ext4dev").function("ext4_fill_super")
> > > {
> > > printk("here am I!\n");
> > > }
> >
> > > This should be done via passing a hard-coded address, or via using
> > > the kprobes function+offset facility? It would seem that there are
> > > advantages to James' patch all by itself, in that it will will work
> > > even if the debuginfo information for the ext4dev module can't be
> > > found, since the kallsyms information would be used instead.
> >
> > As a quality-of-implementation matter, systemtap checks at translation
> > time that such probes make sense -- that "ext4_fill_super" even
> > exists. (That is needed also to expand wildcards.) The only way it
> > can do that is if it has dwarf or separate textual symbol table data
> > (see above). Both of those carry addresses as well, so we might as
> > well use them.
>
> True, though I'll note for modern kernels, with /proc/kallsyms, we
> should hopefully be able to do this (offset=0 probes) without DWARF
> headers. BTW, one of the things which I have wondered is whether
> DWARF was really the right approach after all, given how bloated and
> space-inefficient it seems to be. Needing to keep 600 megs of module
> debuginfo for each kernel that I build (and yes, I build a fairly
> complete module-rich kernel, but having 59 megs of modules expand into
> 606 megs of debuginfo files is more than a little bit obnoxious.)

Yes ... the Sun solution to this was something called CTF which is
basically compressed dwarf with duplicates eliminated (as you say
below). We could go a long way with simple duplicate elimination in
dwarf. For instance dwarf tends to deposit a definition of a structure
wherever it's used. For popular structures (like struct task) this can
lead to hundreds of duplicates in the debug info.

However, as has been rightly pointed out, most of what people want is
simple entry and exit notification on functions ... this we can do with
kprobes without any dwarf at all (so it's something systemtap should be
able to do). In theory, we can also get at the actual function call
arguments without resorting to dwarf, although that requires someone
actually knowing the order and type (without dwarf we can't cross check
unless we do something like parse the kernel headers).

Where it gets tricky is if you want access to variables in the middle of
the routine. This is very useful to kernel developers and for static
tracepoints. Note that the current markers project has a solution to
this which could be done dwarfless ... the problem is that it has to
perturb the optimiser at that point to spill the variables of interest
to locations we know, so the price we pay is zero impact when disabled.
The approach I was taking more or less relies on dwarf (or CTF) to be
zero impact because the variable could be aynwhere ... including in a
register or temporary stack location.

> Maybe the kernel could be one of the places where we experiment with
> using something like CTF. Apparently CTF is compact enough that the
> Solaris folks are willing to keep the CTF information for the
> currently booted kernel in unswappable kernel memory --- which I know
> is one of those things we wouldn't want to do with DWARF information!

I'm not sure we'd want to do it with CTF either ... the only reason to
have it in the kernel is if the kernel wants to use it. Right at the
moment with the dwarf, systemtap userspace is the only consumer, so it
doesn't need to be in the kernel; I don't really see that changing with
CTF. I think dtrace does it because the D runtime is in-kernel so some
of the run time typing needs the CTF.

James

2008-07-23 17:48:13

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

On Wed, Jul 23, 2008 at 12:40:26PM -0400, James Bottomley wrote:
> I'm not sure we'd want to do it with CTF either ... the only reason to
> have it in the kernel is if the kernel wants to use it. Right at the
> moment with the dwarf, systemtap userspace is the only consumer, so it
> doesn't need to be in the kernel; I don't really see that changing with
> CTF. I think dtrace does it because the D runtime is in-kernel so some
> of the run time typing needs the CTF.

As far as I understand things, all of that is done in userspace and
byte-compiled down to run in their simple DTrace VM. The reason for
CTF being in the Open Slaris is for their kernel debugger (kmdb) and
so that a crash dump can be guaranteed to be easily analyzed.[1]

- Ted

[1] http://blogs.sun.com/levon/entry/reducing_ctf_overhead

2008-07-24 16:04:48

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] fix kallsyms to allow discrimination of local symbols

Hi -

On Wed, Jul 23, 2008 at 12:25:05PM -0400, Theodore Tso wrote:
> > I also proposed a compromise where systemtap would use the
> > symbol+offset interface, but choose a single convenient symbol as base
> > for all probes in a particular elf file (/section).
>
> I guess I don't see the value of that over just using the address
> directly. James' point wasn't just to use the symbol+offset feature
> just for the sake of using it, but rather as a better way of
> specifying how to insert a probe into a kernel.

Right, I understand that this is the theory, but I believe the
difference between symbol+offset vs. _stext+offset
vs. absolute-address is almost entirely aesthetic rather than
functional.

> For example, it may be that by allowing the kernel to have a bit
> more semantic knowledge of where a probe is going, it could more
> easily do various safety-related checks that can't be done if all it
> is given is a raw address.

This is unlikely to be the case. The kernel can map from addresses to
symbols internally on demand, should such extra safety checks come
into existence. It can already check for __kprobes marked-ness,
regardless of the API.

> > As a quality-of-implementation matter, systemtap checks at translation
> > time that such probes make sense -- that "ext4_fill_super" even
> > exists. (That is needed also to expand wildcards.) The only way it
> > can do that is if it has dwarf or separate textual symbol table data
> > (see above). Both of those carry addresses as well, so we might as
> > well use them.
>
> True, though I'll note for modern kernels, with /proc/kallsyms, we
> should hopefully be able to do this (offset=0 probes) without DWARF
> headers. [...]

Yes, that's what I was referring to ("... or separate textual symbol
table"). Note that this table contains addresses too.

> BTW, one of the things which I have wondered is whether DWARF was
> really the right approach after all, given how bloated and
> space-inefficient it seems to be. [...]

Yeah, it is probably the main source of pain in using systemtap at its
fullest.

> [...] So there is a big difference between "please do X, Y, and Z
> first and the patch would then be acceptable" versus "for reasons A,
> B and C this patch is totally unacceptable and in fact what you are
> trying to do is pointless".

I am sorry for my part in lowering the tone of the discussion. We
will work out how to put James' patch in, with backward-compatiblity
extensions.

- FChE