2022-02-09 08:53:39

by Nick Alcock

Subject: [PATCH v8] kallsyms: new /proc/kallmodsyms with builtin modules

The kallmodsyms patch series was originally posted in Nov 2019, and the thread
(https://lore.kernel.org/linux-kbuild/[email protected]/t/#u)
shows review comments, questions, and feedback from interested parties.

All review comments have been satisfied, as far as I know: in particular
Yamada's note about translation units that are shared between built-in modules
is satisfied with a better representation which is also much, much smaller.

A kernel tree containing this series alone, atop -rc3:
https://github.com/oracle/dtrace-linux-kernel kallmodsyms/5.17-rc3

Trees for trying this out, if you want to try this series in conjunction
with its major current user:

userspace tree for the dtrace tool itself:
https://github.com/oracle/dtrace-utils.git, dev branch
kernel tree comprising this series and a few other patches needed by
dtrace:
https://github.com/oracle/dtrace-linux-kernel, v2/5.17-rc2 branch

(See the README.md in the latter for dtrace build instructions. Note the need for a
reasonably recent binutils, a trunk GCC, and a cross-bpf toolchain.)


/proc/kallsyms is very useful for tracers and other tools that need to
map kernel symbols to addresses.

It would be useful if there were a mapping between kernel symbol and module
name that only changed when the kernel source code is changed. This mapping
should not change simply because a module becomes built into the kernel, so
that it's not broken by changes in user configuration. (DTrace for Linux
already uses the approach in this patch for this purpose.)

In brief we do this by mapping from address ranges to object files (with
assistance from the linker map file), then mapping from object files to
potential kernel modules. Because the number of object files is much smaller
than the number of symbols, this is a fairly efficient representation, even with
a bit of extra complexity to allow object files to be in more than one module at
once.
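To make the two-step mapping concrete, here is a minimal C sketch. Its types and functions (`struct addr_range`, `find_range()`, `modules_for_addr()`) are hypothetical and only illustrate the idea. The real tables live in the `.kallsyms_module_*` sections introduced later in the series and are not laid out like this. An address is binary-searched into a sorted table of range starts, which yields an object-file index. That index then selects a NULL-terminated module list, so one object file can belong to several modules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical in-memory form of the two tables: address ranges sorted by
 * start address, each citing an object-file index, plus a per-object-file,
 * NULL-terminated list of module names (a NULL list means "no module").
 */
struct addr_range {
	uint64_t start;		/* first address in this range */
	size_t objfile;		/* index into the objfile_mods array */
};

/* Binary search for the last range starting at or below addr. */
static const struct addr_range *find_range(const struct addr_range *ranges,
					   size_t n, uint64_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ranges[mid].start <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo == 0 ? NULL : &ranges[lo - 1];
}

/* Address -> object file -> module list; NULL if not in any module. */
static const char *const *modules_for_addr(const struct addr_range *ranges,
					   size_t n,
					   const char *const *const *objfile_mods,
					   uint64_t addr)
{
	const struct addr_range *r = find_range(ranges, n, addr);

	return r ? objfile_mods[r->objfile] : NULL;
}
```

Because the search is over object-file ranges rather than symbols, the table grows with the number of object files, which is the source of the space efficiency described above.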

The size impact of all of this is minimal: in one of my tests, vmlinux grew by
0.17% (10824 bytes), and the compressed vmlinux grew by only 0.08% (7552 bytes).
Though this is very configuration-dependent, it seems likely to scale roughly
with the kernel as a whole.

This is all controlled by a new config parameter CONFIG_KALLMODSYMS, which when
set results in output in /proc/kallmodsyms that looks like this:

ffffffff8b013d20 409 t pt_buffer_setup_aux
ffffffff8b014130 11f T intel_pt_interrupt
ffffffff8b014250 2d T cpu_emergency_stop_pt
ffffffff8b014280 13a t rapl_pmu_event_init [intel_rapl_perf]
ffffffff8b0143c0 bb t rapl_event_update [intel_rapl_perf]
ffffffff8b014480 10 t rapl_pmu_event_read [intel_rapl_perf]
ffffffff8b014490 a3 t rapl_cpu_offline [intel_rapl_perf]
ffffffff8b014540 24 t __rapl_event_show [intel_rapl_perf]
ffffffff8b014570 f2 t rapl_pmu_event_stop [intel_rapl_perf]

These symbols are notated as [intel_rapl_perf] even if intel_rapl_perf is built
into the kernel.

Further down, we see what happens when object files are reused by
multiple modules, all of which are built into the kernel:

ffffffffa22b3aa0 ab t handle_timestamp [liquidio]
ffffffffa22b3b50 4a t free_netbuf [liquidio]
ffffffffa22b3ba0 8d t liquidio_ptp_settime [liquidio]
ffffffffa22b3c30 b3 t liquidio_ptp_adjfreq [liquidio]
[...]
ffffffffa22b9490 203 t lio_vf_rep_create [liquidio]
ffffffffa22b96a0 16b t lio_vf_rep_destroy [liquidio]
ffffffffa22b9810 1f t lio_vf_rep_modinit [liquidio]
ffffffffa22b9830 1f t lio_vf_rep_modexit [liquidio]
ffffffffa22b9850 d2 t lio_ethtool_get_channels [liquidio] [liquidio_vf]
ffffffffa22b9930 9c t lio_ethtool_get_ringparam [liquidio] [liquidio_vf]
ffffffffa22b99d0 11 t lio_get_msglevel [liquidio] [liquidio_vf]
ffffffffa22b99f0 11 t lio_vf_set_msglevel [liquidio] [liquidio_vf]
ffffffffa22b9a10 2b t lio_get_pauseparam [liquidio] [liquidio_vf]
ffffffffa22b9a40 738 t lio_get_ethtool_stats [liquidio] [liquidio_vf]
ffffffffa22ba180 368 t lio_vf_get_ethtool_stats [liquidio] [liquidio_vf]
ffffffffa22ba4f0 37 t lio_get_regs_len [liquidio] [liquidio_vf]
ffffffffa22ba530 18 t lio_get_priv_flags [liquidio] [liquidio_vf]
ffffffffa22ba550 2e t lio_set_priv_flags [liquidio] [liquidio_vf]
ffffffffa22ba580 69 t lio_set_fecparam [liquidio] [liquidio_vf]
ffffffffa22ba5f0 92 t lio_get_fecparam [liquidio] [liquidio_vf]
[...]
ffffffffa22cbd10 175 t liquidio_set_mac [liquidio_vf]
ffffffffa22cbe90 ab t handle_timestamp [liquidio_vf]
ffffffffa22cbf40 4a t free_netbuf [liquidio_vf]
ffffffffa22cbf90 2b t octnet_link_status_change [liquidio_vf]
ffffffffa22cbfc0 7e t liquidio_vxlan_port_command.constprop.0 [liquidio_vf]

Like /proc/kallsyms, the output is driven by address, so it keeps the
curious property of /proc/kallsyms that symbols (like free_netbuf above)
may appear repeatedly with different addresses. But now, unlike in
/proc/kallsyms, we can see that those symbols appear repeatedly because
they are *different symbols* that ultimately belong to different
modules, all of which are built into the kernel.

Those symbols that come from object files that are genuinely reused and that
appear only once in memory get a /proc/kallmodsyms line with [multiple]
[modules] on it: consumers will have to be ready to handle such lines.
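Handling such lines only requires accepting zero or more bracketed module names after the symbol name. Below is a minimal C parsing sketch. It assumes the v8 line layout `address type name [module]...`, with the same columns as /proc/kallsyms. The size column in the sample output above came from the since-dropped symbol-size patch. `struct kallmodsym`, `parse_kallmodsyms_line()` and the `MAX_MODS` limit are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MODS 8	/* hypothetical cap on modules per line */

struct kallmodsym {
	unsigned long long addr;
	char type;
	char name[128];
	size_t nmods;
	char mods[MAX_MODS][64];	/* module names without brackets */
};

/*
 * Parse "<addr> <type> <name> [mod1] [mod2] ...".
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_kallmodsyms_line(const char *line, struct kallmodsym *sym)
{
	int consumed = 0;
	const char *p;

	if (sscanf(line, "%llx %c %127s%n", &sym->addr, &sym->type,
		   sym->name, &consumed) != 3)
		return -1;

	sym->nmods = 0;
	p = line + consumed;
	for (;;) {
		const char *open = strchr(p, '[');
		const char *close;
		size_t len;

		if (!open)
			break;		/* no more [module] fields */
		close = strchr(open, ']');
		if (!close || sym->nmods == MAX_MODS)
			return -1;
		len = (size_t)(close - open - 1);
		if (len == 0 || len >= sizeof(sym->mods[0]))
			return -1;
		memcpy(sym->mods[sym->nmods], open + 1, len);
		sym->mods[sym->nmods][len] = '\0';
		sym->nmods++;
		p = close + 1;
	}
	return 0;
}
```

A line with no brackets yields `nmods == 0` (non-modular, always built-in). A line such as `... lio_get_msglevel [liquidio] [liquidio_vf]` yields two names, and the consumer must decide how to attribute the symbol.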

Also, kernel symbols for built-in modules will probably appear interspersed with
other symbols that are part of different modules and non-modular always-built-in
symbols, which, as usual, have no square-bracketed module denotation.

As with /proc/kallsyms, non-root usage produces addresses that are all zero.

I am open to changing the name and/or format of /proc/kallmodsyms, but felt it
best to split it out of /proc/kallsyms to avoid breaking existing kallsyms
parsers. Another possible syntax might be to use {curly brackets} or something
to denote built-in modules: it might be possible to drop /proc/kallmodsyms and
make /proc/kallsyms emit things in this format. (Equally, now that kallmodsyms data
uses very little space, the CONFIG_KALLMODSYMS config option might be something
people don't want to bother with.)



The commits in this series all have Reviewed-by tags: they're all from
internal reviews, so please ignore them.


Differences from v7, December:

- Adjust for changes in the v5.17 merge window. Adjust a few commit
messages and shrink the cover letter.
- Drop the symbol-size patch, probably better done from userspace.

Differences from v6, November:

- Adjust for rewrite of confdata machinery in v5.16 (tristate.conf
handling is now more of a rewrite than a reversion)

Differences from v5, October:

- Fix generation of mapfiles under UML

Differences from v4, September:

- Fix building of tristate.conf if missing (usually concealed by the
syncconfig being run for other reasons, but not always: the kernel
test robot spotted it).
- Forward-port atop v5.15-rc3.

Differences from v3, August:

- Fix a kernel test robot warning in get_ksymbol_core (possible
use of uninitialized variable if kallmodsyms was wanted but
kallsyms_module_offsets was not present, which is most unlikely).

Differences from v2, June:

- Split the series up. In particular, the size impact of the table
optimizer is now quantified, and the symbol-size patch is split out and
turned into an RFC patch, with the /proc/kallmodsyms format before that
patch lacking a size column. Some speculation on how to make the symbol
sizes less space-wasteful is added (but not yet implemented).

- Drop a couple of unnecessary #includes, one unnecessarily exported
symbol, and a needless de-staticing.

Differences from v1, a year or so back:

- Move from a straight symbol->module name mapping to a mapping from
address-range to TU to module name list, bringing major space savings
over the previous approach and support for object files used by many
built-in modules at the same time, at the cost of a slightly more complex
approach (unavoidably so, I think, given that we have to merge three data
sources together: the link map in .tmp_vmlinux.ranges, the nm output on
stdin, and the mapping from TU name to module names in
modules_thick.builtin).

We do opportunistic merging of TUs if they cite the same modules and
reuse module names where doing so is simple: see optimize_obj2mod
below. I considered more extensive searches for mergeable entries and
more intricate encodings of the module name list allowing TUs that are
used by overlapping sets of modules to share their names, but such
modules are rare enough (and such overlapping sharings are vanishingly
rare) that it seemed likely to save only a few bytes at the cost of much
more hard-to-test code. This is doubly true now that the tables needed
are only a few kilobytes in length.
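Merging TUs that cite the same modules depends on treating each module list as a set. Two TUs must compare equal even if their modules were recorded in a different order. Below is a minimal sketch of that canonicalize-by-sorting step, using the hypothetical helpers `same_modules()` and `cmp_names()`. qsort() passes its comparator pointers to the array elements, that is `const char **`, not the strings themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C-string pointers. */
static int cmp_names(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Return nonzero if two TUs cite the same set of modules, regardless of
 * recording order.  Sorts both arrays in place (canonicalization).
 */
static int same_modules(const char **a, size_t na, const char **b, size_t nb)
{
	size_t i;

	if (na != nb)
		return 0;
	qsort(a, na, sizeof(*a), cmp_names);
	qsort(b, nb, sizeof(*b), cmp_names);
	for (i = 0; i < na; i++)
		if (strcmp(a[i], b[i]) != 0)
			return 0;
	return 1;
}
```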

Signed-off-by: Nick Alcock <[email protected]>
Signed-off-by: Eugene Loh <[email protected]>
Reviewed-by: Kris Van Hees <[email protected]>

Nick Alcock (6):
kbuild: bring back tristate.conf
kbuild: add modules_thick.builtin
kbuild: generate an address ranges map at vmlinux link time
kallsyms: introduce sections needed to map symbols to built-in modules
kallsyms: optimize .kallsyms_modules*
kallsyms: add /proc/kallmodsyms


2022-02-09 09:55:29

by Nick Alcock

Subject: [PATCH v8 3/6] kbuild: generate an address ranges map at vmlinux link time

This emits a new file, .tmp_vmlinux.ranges, which maps address
range/size pairs in vmlinux to the object files which make them up,
e.g., in part:

0x0000000000000000 0x30 arch/x86/kernel/cpu/common.o
0x0000000000001000 0x1000 arch/x86/events/intel/ds.o
0x0000000000002000 0x4000 arch/x86/kernel/irq_64.o
0x0000000000006000 0x5000 arch/x86/kernel/process.o
0x000000000000b000 0x1000 arch/x86/kernel/cpu/common.o
0x000000000000c000 0x5000 arch/x86/mm/cpu_entry_area.o
0x0000000000011000 0x10 arch/x86/kernel/espfix_64.o
0x0000000000011010 0x2 arch/x86/kernel/cpu/common.o
[...]
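Each line describes a half-open interval `[address, address + size)` tagged with an object file. The same object file (cpu/common.o above) can own several disjoint ranges, so lookups must go by range rather than by file. Below is a minimal sketch of reading one such line and testing an address against it. It assumes paths shorter than 256 bytes. `struct vmlinux_range` and both functions are hypothetical helpers, not part of scripts/kallsyms.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One line of .tmp_vmlinux.ranges: start address, size, object file. */
struct vmlinux_range {
	unsigned long long addr;
	unsigned long long size;
	char objfile[256];
};

/* Parse "0x<addr> 0x<size> <path>"; returns 0 on success, -1 otherwise. */
static int parse_range_line(const char *line, struct vmlinux_range *r)
{
	/* %llx accepts an optional 0x prefix, as strtoull(..., 16) does. */
	if (sscanf(line, "%llx %llx %255s", &r->addr, &r->size,
		   r->objfile) != 3)
		return -1;
	return 0;
}

/* Nonzero if addr lies in [r->addr, r->addr + r->size). */
static int range_contains(const struct vmlinux_range *r,
			  unsigned long long addr)
{
	return addr >= r->addr && addr - r->addr < r->size;
}
```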

In my simple tests this seems to work with clang too, but I'm not
sure how stable the format of clang's linker mapfiles is: if it turns
out not to work in some versions, the mapfile-massaging awk script added
here might need some adjustment.

Signed-off-by: Nick Alcock <[email protected]>
Reviewed-by: Kris Van Hees <[email protected]>
---

Notes:
v6: use ${wl} where appropriate to avoid failure on UML

scripts/link-vmlinux.sh | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index 666f7bbc13eb..981cd441ca21 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -203,7 +203,7 @@ vmlinux_link()
${ld} ${ldflags} -o ${output} \
${wl}--whole-archive ${objs} ${wl}--no-whole-archive \
${wl}--start-group ${libs} ${wl}--end-group \
- $@ ${ldlibs}
+ ${wl}-Map=.tmp_vmlinux.map $@ ${ldlibs}
}

# generate .BTF typeinfo from DWARF debuginfo
@@ -246,6 +246,19 @@ kallsyms()
{
local kallsymopt;

+ # read the linker map to identify ranges of addresses:
+ # - for each *.o file, report address, size, pathname
+ # - most such lines will have four fields
+ # - but sometimes there is a line break after the first field
+ # - start reading at "Linker script and memory map"
+ # - stop reading at ".brk"
+ ${AWK} '
+ /\.o$/ && start==1 { print $(NF-2), $(NF-1), $NF }
+ /^Linker script and memory map/ { start = 1 }
+ /^\.brk/ { exit(0) }
+ ' .tmp_vmlinux.map | sort > .tmp_vmlinux.ranges
+
+ # get kallsyms options
if is_enabled CONFIG_KALLSYMS_ALL; then
kallsymopt="${kallsymopt} --all-symbols"
fi
--
2.35.0.260.gb82b153193.dirty
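The awk filter in this patch applies three rules, in order, to each line of the linker map. First, while inside the memory map, a line ending in `.o` yields its last three fields. Second, the "Linker script and memory map" header starts reading. Third, a line starting with `.brk` stops it. Keeping only the last three fields is what handles lines where the section name was split onto its own line. Below is a C rendering of the same per-line logic, for illustration only. `filter_map_line()` and `enum map_state` are hypothetical, and the map lines in the test are made up rather than copied from a real GNU ld map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum map_state { MAP_BEFORE, MAP_IN_MEMORY_MAP, MAP_DONE };

static int starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Apply the three awk rules, in order, to one linker-map line (without its
 * trailing newline).  Returns 1 and writes "addr size path" to out when the
 * line is an object-file line inside the memory map, otherwise 0.
 */
static int filter_map_line(enum map_state *st, const char *line,
			   char *out, size_t outlen)
{
	char buf[512];
	size_t len = strlen(line);
	int emitted = 0;

	if (*st == MAP_DONE)		/* awk has already exited */
		return 0;

	/* Rule 1: lines ending in ".o" once start==1; print last 3 fields. */
	if (*st == MAP_IN_MEMORY_MAP && len >= 2 && len < sizeof(buf) &&
	    strcmp(line + len - 2, ".o") == 0) {
		char *f1 = NULL, *f2 = NULL, *f3 = NULL, *tok;

		memcpy(buf, line, len + 1);
		/* Keep a sliding window of the last three fields. */
		for (tok = strtok(buf, " \t"); tok; tok = strtok(NULL, " \t")) {
			f1 = f2;
			f2 = f3;
			f3 = tok;
		}
		if (f1) {
			snprintf(out, outlen, "%s %s %s", f1, f2, f3);
			emitted = 1;
		}
	}

	/* Rule 2: the memory-map header sets start = 1. */
	if (starts_with(line, "Linker script and memory map"))
		*st = MAP_IN_MEMORY_MAP;

	/* Rule 3: a line starting with ".brk" makes awk exit. */
	if (starts_with(line, ".brk"))
		*st = MAP_DONE;

	return emitted;
}
```

Sorting the emitted lines, as the script does with `sort`, gives the address-ordered .tmp_vmlinux.ranges.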


2022-02-09 13:21:40

by Nick Alcock

Subject: [PATCH v8 5/6] kallsyms: optimize .kallsyms_modules*

These symbols are terribly inefficiently stored at the moment. Add a
simple optimizer which fuses obj2mod_elem entries and uses this to
implement three cheap optimizations:

- duplicate names are eliminated from .kallsyms_module_names.

- entries in .kallsyms_modules which point at single-file modules which
also appear in a multi-module list are redirected to point inside
that list, and the single-file entry is dropped from
.kallsyms_module_names. Thus, modules which contain some object
files shared with other modules and some object files exclusive to
them do not double up the module name. (There might still be some
duplication between multiple multi-module lists, but this is an
extremely marginal size effect, and resolving it would require an
extra layer of lookup tables which would be even more complex, and
incompressible to boot).

- Entries in .kallsyms_modules that would contain the same value after
the above optimizations are fused together, along with their
corresponding .kallsyms_module_addresses/offsets entries. Due to
this fusion process, and because object files can be split apart into
multiple parts by the linker for hot/cold partitioning and the like,
entries in .kallsyms_module_addresses/offsets no longer correspond
1:1 to object files, but more to some contiguous range of addresses
which are guaranteed to belong to a single built-in module, but which
may well stretch over multiple object files.
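The third optimization amounts to run-length compaction of the range table. Below is a minimal sketch. It assumes a hypothetical `struct range_entry` holding a range start and its final module-name offset, after the xref redirections above, and fuses adjacent entries that would emit the same value. In the patch itself, the fusion happens at emission time in output_kallmodsyms_objfiles(), by comparing the objfile pointers of adjacent `addrmap[]` entries.

```c
#include <assert.h>
#include <stddef.h>

struct range_entry {
	unsigned long start;		/* start address of the range */
	unsigned int mod_offset;	/* offset into the module-name table */
};

/*
 * Drop every entry whose mod_offset equals that of the previous kept
 * entry, so each kept entry starts a maximal run.  Returns the new count.
 */
static size_t fuse_ranges(struct range_entry *e, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 && e[out - 1].mod_offset == e[i].mod_offset)
			continue;	/* extends the previous run */
		e[out++] = e[i];
	}
	return out;
}
```

Each surviving entry now covers everything up to the next entry's start, which is why entries stop corresponding 1:1 to object files.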

The optimizer's running time is dominated by sorting and is roughly
O(n log n) in the number of objfiles, so, given the relatively low number
of objfiles, its runtime overhead is in the noise.

Optimization reduces the overhead of the kallmodsyms tables by about
7500 items, dropping the .tmp_kallsyms2.o object file size by about
33KiB and leaving it 8672 bytes larger than before: an increase of 0.4%.

The vmlinux size is not yet affected, because the variables are not yet
used and are eliminated by the linker. Once they are used (after the next
commit), the size impact of all of this on the final kernel is minimal:
in my testing, vmlinux grew by 0.17% (10824 bytes), and the compressed
vmlinux grew by only 0.08% (7552 bytes). Though this is very
configuration-dependent, it seems likely to scale roughly with the
kernel as a whole.

Signed-off-by: Nick Alcock <[email protected]>
Reviewed-by: Kris Van Hees <[email protected]>
---
scripts/kallsyms.c | 267 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 258 insertions(+), 9 deletions(-)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 8f87b724d0fa..93fdf0dcf587 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -85,6 +85,17 @@ static unsigned int strhash(const char *s)
return hash;
}

+static unsigned int memhash(char *s, size_t len)
+{
+ /* fnv32 hash */
+ unsigned int hash = 2166136261U;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ hash = (hash ^ *(s + i)) * 0x01000193;
+ return hash;
+}
+
#define OBJ2MOD_BITS 10
#define OBJ2MOD_N (1 << OBJ2MOD_BITS)
#define OBJ2MOD_MASK (OBJ2MOD_N - 1)
@@ -94,14 +105,24 @@ struct obj2mod_elem {
size_t nmods; /* number of modules in "mods" */
size_t mods_size; /* size of all mods together */
int mod_offset; /* offset in .kallsyms_module_names */
+ /*
+ * If set at emission time, this points at another obj2mod entry that
+ * contains the module name we need (possibly at a slightly later
+ * offset, if the entry is for an objfile that appears in many modules).
+ */
+ struct obj2mod_elem *xref;
struct obj2mod_elem *obj2mod_next;
+ struct obj2mod_elem *mod2obj_next;
};

/*
- * Map from object files to obj2mod entries (a unique mapping).
+ * Map from object files to obj2mod entries (a unique mapping), and vice versa
+ * (not unique, but entries for objfiles in more than one module in this hash
+ * are ignored).
*/

static struct obj2mod_elem *obj2mod[OBJ2MOD_N];
+static struct obj2mod_elem *mod2obj[OBJ2MOD_N];
static size_t num_objfiles;

/*
@@ -143,6 +164,8 @@ static void obj2mod_add(char *obj, char *mod)

elem = obj2mod_get(obj);
if (!elem) {
+ int j = strhash(mod) & OBJ2MOD_MASK;
+
elem = malloc(sizeof(struct obj2mod_elem));
if (!elem)
goto oom;
@@ -156,8 +179,15 @@ static void obj2mod_add(char *obj, char *mod)

elem->obj2mod_next = obj2mod[i];
obj2mod[i] = elem;
+ elem->mod2obj_next = mod2obj[j];
+ mod2obj[j] = elem;
num_objfiles++;
} else {
+ /*
+ * TU appears in multiple modules. mod2obj for this entry will
+ * be ignored from now on, except insofar as it is needed to
+ * maintain the hash chain.
+ */
elem->mods = realloc(elem->mods, elem->mods_size +
strlen(mod) + 1);
if (!elem->mods)
@@ -177,6 +207,164 @@ static void obj2mod_add(char *obj, char *mod)
fprintf(stderr, "kallsyms: out of memory\n");
exit(1);
}
+
+/*
+ * Used inside optimize_obj2mod to identify duplicate module entries.
+ */
+struct obj2mod_modhash_elem {
+ struct obj2mod_elem *elem;
+ unsigned int modhash; /* hash value of this entry */
+};
+
+static int qstrcmp(const void *a, const void *b)
+{
+ return strcmp(*(const char * const *) a, *(const char * const *) b);
+}
+
+static int qmodhash(const void *a, const void *b)
+{
+ const struct obj2mod_modhash_elem *el_a = a;
+ const struct obj2mod_modhash_elem *el_b = b;
+ if (el_a->modhash < el_b->modhash)
+ return -1;
+ else if (el_a->modhash > el_b->modhash)
+ return 1;
+ return 0;
+}
+
+/*
+ * Associate all TUs in obj2mod which refer to the same module with a single
+ * obj2mod entry for emission, preferring to point into the module list in a
+ * multi-module objfile.
+ */
+static void optimize_obj2mod(void)
+{
+ size_t i;
+ size_t n = 0;
+ struct obj2mod_elem *elem;
+ struct obj2mod_elem *dedup;
+ /* An array of all obj2mod_elems, later sorted by hashval. */
+ struct obj2mod_modhash_elem *uniq;
+ struct obj2mod_modhash_elem *last;
+
+ /*
+ * Canonicalize all module lists by sorting them, then compute their
+ * hash values.
+ */
+ uniq = malloc(sizeof(struct obj2mod_modhash_elem) * num_objfiles);
+ if (uniq == NULL)
+ goto oom;
+
+ for (i = 0; i < OBJ2MOD_N; i++) {
+ for (elem = obj2mod[i]; elem; elem = elem->obj2mod_next) {
+ if (elem->nmods >= 2) {
+ char **sorter;
+ char *walk;
+ char *tmp_mods;
+ size_t j;
+
+ tmp_mods = malloc(elem->mods_size);
+ sorter = malloc(sizeof(char *) * elem->nmods);
+ if (sorter == NULL || tmp_mods == NULL)
+ goto oom;
+ memcpy(tmp_mods, elem->mods, elem->mods_size);
+
+ for (j = 0, walk = tmp_mods; j < elem->nmods;
+ j++) {
+ sorter[j] = walk;
+ walk += strlen(walk) + 1;
+ }
+ qsort(sorter, elem->nmods, sizeof (char *),
+ qstrcmp);
+ for (j = 0, walk = elem->mods; j < elem->nmods;
+ j++) {
+ strcpy(walk, sorter[j]);
+ walk += strlen(walk) + 1;
+ }
+ free(tmp_mods);
+ free(sorter);
+ }
+
+ uniq[n].elem = elem;
+ uniq[n].modhash = memhash(elem->mods, elem->mods_size);
+ n++;
+ }
+ }
+
+ qsort (uniq, num_objfiles, sizeof (struct obj2mod_modhash_elem),
+ qmodhash);
+
+ /*
+ * Work over multimodule entries. These must be emitted into
+ * .kallsyms_module_names as a unit, but we can still optimize by
+ * reusing some other identical entry. Single-file modules are amenable
+ * to the same optimization, but we avoid doing it for now so that we
+ * can prefer to point them directly inside a multimodule entry.
+ */
+ for (i = 0, last = NULL; i < num_objfiles; i++) {
+ const char *onemod;
+ size_t j;
+
+ if (uniq[i].elem->nmods < 2)
+ continue;
+
+ /* Duplicate multimodule. Reuse the first we saw. */
+ if (last != NULL && last->modhash == uniq[i].modhash) {
+ uniq[i].elem->xref = last->elem;
+ continue;
+ }
+
+ /*
+ * Single-module entries relating to modules also emitted as
+ * part of this multimodule entry can refer to it: later, we
+ * will hunt down the right specific module name within this
+ * multimodule entry and point directly to it.
+ */
+ onemod = uniq[i].elem->mods;
+ for (j = uniq[i].elem->nmods; j > 0; j--) {
+ int h = strhash(onemod) & OBJ2MOD_MASK;
+
+ for (dedup = mod2obj[h]; dedup;
+ dedup = dedup->mod2obj_next) {
+ if (dedup->nmods > 1)
+ continue;
+
+ if (strcmp(dedup->mods, onemod) != 0)
+ continue;
+ dedup->xref = uniq[i].elem;
+ assert (uniq[i].elem->xref == NULL);
+ }
+ onemod += strlen(onemod) + 1;
+ }
+
+ last = &uniq[i];
+ }
+
+ /*
+ * Now traverse all single-module entries, xreffing every one that
+ * relates to a given module to the first one we saw that refers to that
+ * module.
+ */
+ for (i = 0, last = NULL; i < num_objfiles; i++) {
+ if (uniq[i].elem->nmods > 1)
+ continue;
+
+ if (uniq[i].elem->xref != NULL)
+ continue;
+
+ /* Duplicate module name. Reuse the first we saw. */
+ if (last != NULL && last->modhash == uniq[i].modhash) {
+ uniq[i].elem->xref = last->elem;
+ assert (last->elem->xref == NULL);
+ continue;
+ }
+ last = &uniq[i];
+ }
+ return;
+oom:
+ fprintf(stderr, "kallsyms: out of memory optimizing module list\n");
+ exit(EXIT_FAILURE);
+}
#endif /* CONFIG_KALLMODSYMS */

static void usage(void)
@@ -479,7 +667,7 @@ static void output_kallmodsyms_modules(void)
size_t i;

/*
- * Traverse and emit, updating mod_offset accordingly.
+ * Traverse and emit, chasing xref and updating mod_offset accordingly.
* Emit a single \0 at the start, to encode non-modular objfiles.
*/
output_label("kallsyms_module_names");
@@ -489,9 +677,15 @@ static void output_kallmodsyms_modules(void)
elem = elem->obj2mod_next) {
const char *onemod;
size_t i;
+ struct obj2mod_elem *out_elem = elem;

- elem->mod_offset = offset;
- onemod = elem->mods;
+ if (elem->xref)
+ out_elem = elem->xref;
+ if (out_elem->mod_offset != 0)
+ continue; /* Already emitted. */
+
+ out_elem->mod_offset = offset;
+ onemod = out_elem->mods;

/*
* Technically this is a waste of space: we could just
@@ -500,13 +694,13 @@ static void output_kallmodsyms_modules(void)
* entry, but doing it this way makes it more obvious
* when an entry is a multimodule entry.
*/
- if (elem->nmods != 1) {
+ if (out_elem->nmods != 1) {
printf("\t.byte\t0\n");
- printf("\t.byte\t%zi\n", elem->nmods);
+ printf("\t.byte\t%zi\n", out_elem->nmods);
offset += 2;
}

- for (i = elem->nmods; i > 0; i--) {
+ for (i = out_elem->nmods; i > 0; i--) {
printf("\t.asciz\t\"%s\"\n", onemod);
offset += strlen(onemod) + 1;
onemod += strlen(onemod) + 1;
@@ -533,6 +727,13 @@ static void output_kallmodsyms_objfiles(void)
long long offset;
int overflow;

+ /*
+ * Fuse consecutive address ranges citing the same object file
+ * into one.
+ */
+ if (i > 0 && addrmap[i-1].objfile == addrmap[i].objfile)
+ continue;
+
if (base_relative) {
if (!absolute_percpu) {
offset = addrmap[i].addr - relative_base;
@@ -558,6 +759,13 @@ static void output_kallmodsyms_objfiles(void)

for (i = 0; i < addrmap_num; i++) {
struct obj2mod_elem *elem = addrmap[i].objfile;
+ int orig_nmods;
+ const char *orig_modname;
+ int mod_offset;
+
+ if (i > 0 && addrmap[i-1].objfile == addrmap[i].objfile)
+ continue;
+
/*
* Address range cites no object file: point at 0, the built-in
* module.
@@ -568,13 +776,53 @@ static void output_kallmodsyms_objfiles(void)
continue;
}

+ orig_nmods = elem->nmods;
+ orig_modname = elem->mods;
+
+ /*
+ * Chase down xrefs, if need be. There can only be one layer of
+ * these: from single-module entry to other single-module entry,
+ * or from single- or multi-module entry to another multi-module
+ * entry. Single -> single and multi -> multi always points at
+ * the start of the xref target, so its offset can be used as is.
+ */
+ if (elem->xref)
+ elem = elem->xref;
+
+ if (elem->nmods == 1 || orig_nmods > 1)
+ mod_offset = elem->mod_offset;
+ else {
+ /*
+ * If this is a reference from a single-module entry to
+ * a multi-module entry, hunt down the offset to this
+ * specific module's name (which is guaranteed to be
+ * present: see optimize_obj2mod).
+ */
+
+ size_t j = elem->nmods;
+ const char *onemod = elem->mods;
+ mod_offset = elem->mod_offset;
+
+ for (; j > 0; j--) {
+ if (strcmp(orig_modname, onemod) == 0)
+ break;
+ onemod += strlen(onemod) + 1;
+ }
+ assert (j > 0);
+ /*
+ * +2 to skip the null byte and count at the start of
+ * the multimodule entry.
+ */
+ mod_offset += onemod - elem->mods + 2;
+ }
+
/*
* Zero offset is the initial \0, there to catch uninitialized
* obj2mod entries, and is forbidden.
*/
- assert (elem->mod_offset != 0);
+ assert (mod_offset != 0);

- printf("\t.long\t0x%x\n", elem->mod_offset);
+ printf("\t.long\t0x%x\n", mod_offset);
emitted_objfiles++;
}

@@ -1093,6 +1341,7 @@ static void read_modules(const char *modules_builtin)

free(module_name);
modules_thick_iter_free(i);
+ optimize_obj2mod();

/*
* Read linker map.
--
2.35.0.260.gb82b153193.dirty
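Consumers of these tables need to decode the `.kallsyms_module_names` layout emitted by output_kallmodsyms_modules() above. Offset 0 holds a leading `\0` for non-modular code. A single-module entry is one NUL-terminated name. A multimodule entry is a 0 byte, a count byte, then that many names. Below is a minimal C decoding sketch. `module_names_at()` is hypothetical, not a kernel function. It also handles offsets that point directly at one name inside a multimodule list, as produced by the xref chasing (`mod_offset += onemod - elem->mods + 2`) in output_kallmodsyms_objfiles().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Decode the .kallsyms_module_names entry at `offset`.  Returns a pointer
 * to the first module name and stores the number of names in *nmods;
 * consecutive names are separated by their NUL terminators.  Offset 0 is
 * the leading \0 that marks non-modular code: returns NULL, *nmods = 0.
 */
static const char *module_names_at(const char *names, size_t offset,
				   size_t *nmods)
{
	const char *p = names + offset;

	if (offset == 0) {
		*nmods = 0;
		return NULL;
	}
	if (*p == '\0') {
		/* Multimodule entry: 0 byte, count byte, then the names. */
		*nmods = (unsigned char)p[1];
		return p + 2;
	}
	/* Single name, possibly pointing inside a multimodule list. */
	*nmods = 1;
	return p;
}
```

Walking a multimodule result is then just `p += strlen(p) + 1` for each of the `*nmods` names.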


2022-02-12 16:04:35

by Jiri Olsa

Subject: Re: [PATCH v8] kallsyms: new /proc/kallmodsyms with builtin modules

On Tue, Feb 08, 2022 at 06:43:03PM +0000, Nick Alcock wrote:
> [...]
>
> This is all controlled by a new config parameter CONFIG_KALLMODSYMS, which when
> set results in output in /proc/kallmodsyms that looks like this:
>
> ffffffff8b013d20 409 t pt_buffer_setup_aux
> ffffffff8b014130 11f T intel_pt_interrupt
> ffffffff8b014250 2d T cpu_emergency_stop_pt
> ffffffff8b014280 13a t rapl_pmu_event_init [intel_rapl_perf]
> ffffffff8b0143c0 bb t rapl_event_update [intel_rapl_perf]
> ffffffff8b014480 10 t rapl_pmu_event_read [intel_rapl_perf]
> ffffffff8b014490 a3 t rapl_cpu_offline [intel_rapl_perf]
> ffffffff8b014540 24 t __rapl_event_show [intel_rapl_perf]
> ffffffff8b014570 f2 t rapl_pmu_event_stop [intel_rapl_perf]

hi,
I tried this version and can't see the symbol sizes

[root@qemu jolsa]# cat /proc/kallmodsyms | grep ksys_ | head -5
ffffffff81094720 T ksys_ioperm
ffffffff81141110 T ksys_unshare
ffffffff81160410 T ksys_setsid
ffffffff811c64b0 T ksys_sync_helper
ffffffff813213c0 T ksys_fadvise64_64

I have CONFIG_KALLMODSYMS=y, but I haven't checked if I need
anything else

jirka

2022-02-14 21:13:58

by Nick Alcock

Subject: Re: [PATCH v8] kallsyms: new /proc/kallmodsyms with builtin modules

On 11 Feb 2022, Jiri Olsa verbalised:
> On Tue, Feb 08, 2022 at 06:43:03PM +0000, Nick Alcock wrote:
>> This is all controlled by a new config parameter CONFIG_KALLMODSYMS, which when
>> set results in output in /proc/kallmodsyms that looks like this:
>>
>> ffffffff8b013d20 409 t pt_buffer_setup_aux
>> ffffffff8b014130 11f T intel_pt_interrupt
>> ffffffff8b014250 2d T cpu_emergency_stop_pt
>> ffffffff8b014280 13a t rapl_pmu_event_init [intel_rapl_perf]
>> ffffffff8b0143c0 bb t rapl_event_update [intel_rapl_perf]
>> ffffffff8b014480 10 t rapl_pmu_event_read [intel_rapl_perf]
>> ffffffff8b014490 a3 t rapl_cpu_offline [intel_rapl_perf]
>> ffffffff8b014540 24 t __rapl_event_show [intel_rapl_perf]
>> ffffffff8b014570 f2 t rapl_pmu_event_stop [intel_rapl_perf]
>
> hi,
> I tried this version and can't see the symbol sizes
>
> [root@qemu jolsa]# cat /proc/kallmodsyms | grep ksys_ | head -5
> ffffffff81094720 T ksys_ioperm
> ffffffff81141110 T ksys_unshare
> ffffffff81160410 T ksys_setsid
> ffffffff811c64b0 T ksys_sync_helper
> ffffffff813213c0 T ksys_fadvise64_64

UGH, sorry, I should have regenerated the output in the cover letter!
The cover letter is buggy :)

This is entirely expected: I dropped the symbol size patch, because it's
formally unnecessary (you can get sizes from userspace by examining
vmlinux or the .ko files) and the symbol size representation is big,
adding hundreds of KiB to the kernel image. And then I failed to
regenerate the output to show this :/

In this series, /proc/kallmodsyms now looks identical in format to
/proc/kallsyms, except that you can have [multiple] [modules] on a line
and the meaning of a [module] entry is different. (This may be a small
enough change in semantics that merging the two is possible, but I doubt
it -- existing users will surely expect that a [module] entry means that
module.ko exists, which with /proc/kallmodsyms is not always true.)

--
NULL && (void)