On Wed, Nov 15, 2023 at 9:49 AM Andi Kleen <[email protected]> wrote:
This is a standalone script, which is unrelated to kbuild.
Please change the subject to
"scripts: Add inline-account tool"
>
> A common cause of binary code bloat is excessive inlining. Traditional
> tools (like nm --size-sort -t d) don't address that directly because
> they only see the final functions and don't know about inlines.
>
> This patch adds inline-account that makes it easy to track that down
> by accounting code bytes to all functions visible in the debug information,
> as well as code lines.
>
> Here are some examples:
>
> Show all inlines that increase code size by >1K in the core scheduler:
>
> $ inline-account.py --min-bytes=1000 kernel/sched/core.o
> Total code bytes seen 75690
>
> Code bytes by functions:
> Function Total Avg Num
> rq_pin_lock 1401 (0.02%) 35 39
> __sched_setscheduler 1277 (0.02%) 41 31
> perf_fetch_caller_regs 1012 (0.01%) 17 58
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> kernel/sched/sched.h:1610 1387 (0.02%)
> include/trace/events/sched.h:16 1172 (0.02%)
> include/trace/events/sched.h:222 1058 (0.01%)
>
> This indicates that rq_pin_lock should likely not be inline,
> and perhaps perf_fetch_caller_regs should not be either.
>
> Note that not all large inlines are necessary bloat. If there is only
necessarily bloat?
> a single call site it isn't bloat (the tool currently cannot distinguish
> that case). For example it is commonly seen with syscall definitions
> that use single large inlines with only a single caller. In the example
> above I think it's the case with __sched_setscheduler.
>
> Show the >1K inlines in lib/maple_tree.o, which for some reason
> comes in at an incredible 73k of code size:
>
> $ inline-account.py --min-bytes 1000 lib/maple_tree.o
> Total code bytes seen 73578
>
> Code bytes by functions:
> Function Total Avg Num
> mas_mab_cp 5537 (0.08%) 37 149
> mas_pop_node 3798 (0.05%) 28 131
> ma_slots 2368 (0.03%) 14 162
> ma_pivots 2353 (0.03%) 10 222
> mas_destroy_rebalance 2056 (0.03%) 42 48
> mas_start 1661 (0.02%) 13 125
> mas_set_parent 1454 (0.02%) 20 72
> mas_set_alloc_req 1410 (0.02%) 17 80
> mte_node_type 1360 (0.02%) 5 228
> mas_data_end 1189 (0.02%) 16 74
> mte_to_node 1085 (0.01%) 3 276
> mas_split 1053 (0.01%) 65 16
> mas_topiary_replace 1033 (0.01%) 38 27
> mas_root_expand 1001 (0.01%) 35 28
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> lib/maple_tree.c:210 1360 (0.02%)
> include/trace/events/maple_tree.h:80 1283 (0.02%)
> lib/maple_tree.c:649 1193 (0.02%)
> lib/maple_tree.c:288 1097 (0.01%)
>
> It's clear there is a lot of potential for shrinking here, as a quick
> experiment shows:
>
> $ size lib/maple_tree.o
> text data bss dec hex filename
> 72257 5312 8 77577 12f09 lib/maple_tree.o
> $ sed -i -e s/__always_inline// -e 's/ inline/ /' lib/maple_tree.c
> $ make -s lib/maple_tree.o
> $ size lib/maple_tree.o
> text data bss dec hex filename
> 47774 4720 8 52502 cd16 lib/maple_tree.o
>
> 34% reduction just from trusting the compiler. Most of it seems
> to come from abuse of __always_inline. I suppose a large-scale
> tree purge of that would give some decent binary size results.
>
> $ inline-account.py --show=5 kernel/workqueue.o
> Total code bytes seen 40403
>
> Code bytes by functions:
> Function Total Avg Num
> bitmap_copy 1477 (0.04%) 26 56
> show_pwq 912 (0.02%) 76 12
> workqueue_init_early 846 (0.02%) 29 29
> __flush_workqueue 753 (0.02%) 31 24
> alloc_and_link_pwqs 558 (0.01%) 69 8
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> include/linux/bitmap.h:268 1336 (0.03%)
> include/trace/events/workqueue.h:23 1038 (0.03%)
> include/trace/events/workqueue.h:108 732 (0.02%)
> include/trace/events/workqueue.h:59 694 (0.02%)
> include/trace/events/workqueue.h:82 670 (0.02%)
> $
>
> This is an interesting case because bitmap_copy is just
>
> static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
> unsigned int nbits)
> {
> unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
>
> if (small_const_nbits(nbits))
> *dst = *src;
> else
> memcpy(dst, src, len);
> }
>
> memcpy (which is a macro) must sometimes generate a lot of
> code. The small_const_nbits case definitely should be inlined though
> because it's likely even smaller than a call. Would need
> more investigation.
>
> The other large inlines are trace points. Perhaps there is something
> there that could be done to shrink that a bit.
>
> Finally we can do a global accounting (currently with multiple runs):
>
> (ignore the percentage numbers since they are just for the local file)
>
> $ find -name '*.o' | xargs -n1 inline-account.py > a
> $ sort -n -r -k 2 a | head -30
> ZSTD_count 81799 (0.19%) 32 2514
> ZSTD_count 52233 (0.25%) 33 1544
> kmalloc 43324 (0.00%) 12 3334
> pv_queued_spin_unlock 42027 (0.00%) 9 4580
> constant_test_bit 41667 (0.00%) 5 8005
> arch/x86/include/asm/paravirt.h:591 41044 (0.00%)
> arch/x86/include/asm/bitops.h:207 40153 (0.00%)
> __refcount_add 37968 (0.00%) 24 1532
> page_fixed_fake_head 36368 (0.00%) 19 1832
> include/linux/slab.h:599 35654 (0.00%)
> arch/x86/include/asm/jump_label.h:27 35156 (0.00%)
> spin_lock 32170 (0.00%) 10 3007
> __refcount_sub_and_test 32068 (0.00%) 17 1842
> include/linux/spinlock.h:351 31102 (0.00%)
> arch_static_branch 30874 (0.00%) 4 7022
> get_current 30714 (0.00%) 9 3351
> arch/x86/include/asm/current.h:41 29912 (0.00%)
> trace_trigger_soft_disabled 29814 (0.00%) 21 1368
> perf_fetch_caller_regs 27504 (0.00%) 16 1634
> ZSTD_storeSeq 26060 (0.06%) 30 862
> hid_map_usage 25582 (0.00%) 88 288
> ZSTD_compressBlock_lazy_generic 24953 (0.12%) 46 535
> ZSTD_compressBlock_lazy_generic 24953 (0.06%) 46 535
> paravirt_ret0 24152 (0.00%) 24152 1
> spin_unlock_irqrestore 23253 (0.00%) 10 2281
> include/linux/spinlock.h:406 22526 (0.00%)
> ZSTD_RowFindBestMatch 21527 (0.10%) 23 922
> ZSTD_RowFindBestMatch 21527 (0.05%) 23 922
> __list_add 21209 (0.00%) 11 1851
> include/linux/refcount.h:283 20642 (0.00%)
>
> - So my kernel is spending ~30K just for getting task_structs in
> current.
> - I'm sure ZSTD is great, but is >200K of duplicated code worth it?
> - page_fixed_fake_head probably shouldn't be inlined
> - There might be some potential in outlining reference counts
> (although that one might be truly performance critical)
> - There's maybe some potential in shrinking trace point bloat?
>
> ... and more similar insights.
>
> Also of course there can be critical inlines that really need
> to be inline in many sites for best performance. But that's rarely the case
> if they are big because it's unlikely the small call overhead is making
> a significant difference for a large chunk of code.
>
> In any case the tool is useful, so I think it deserves its place
> in scripts/
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
>
> v2: Address review comments. Change balancing to be address based.
> Add option to set objdump binary.
> ---
> scripts/inline-account.py | 201 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 201 insertions(+)
> create mode 100755 scripts/inline-account.py
>
> diff --git a/scripts/inline-account.py b/scripts/inline-account.py
> new file mode 100755
> index 000000000000..c32cb0547172
> --- /dev/null
> +++ b/scripts/inline-account.py
> @@ -0,0 +1,201 @@
> +#!/usr/bin/env python3
> +# account code bytes per source code / functions from objdump -Sl output
> +# useful to find inline bloat
> +#
> +# SPDX-License-Identifier: GPL-2.0-only
Please move this above.
The first line is the shebang.
The second line is SPDX.
> +# Author: Andi Kleen
> +
> +import os
> +import sys
> +import re
> +import argparse
> +import bisect
> +import multiprocessing
Please sort imports alphabetically.
> +from collections import Counter
> +from functools import reduce, partial
> +
> +def get_args():
> + p = argparse.ArgumentParser(
> + description="""
> +Account code bytes per source code / functions from objdump.
> +Useful to find inline bloat.
> +
> +The line numbers mark the beginning of a block, so the actual code can
> +start later. Line numbers can also be a little off due to objdump bugs,
> +and some misaccounting can happen due to inexact gcc debug information.
> +The number output for functions may account a single large function multiple
> +times. Program/object files need to be built with -g.
> +
> +This is somewhat slow due to objdump -S being slow. It helps to have
> +plenty of cores.""")
> + p.add_argument('--min-bytes', type=int, help='minimum bytes to report', default=100)
> + p.add_argument('--threads', '-t', type=int, default=multiprocessing.cpu_count(),
Please check that args.threads >= 1.
If not, please show an error message.
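One way to enforce that (a sketch using an argparse type callback; the helper name is illustrative):

```python
import argparse

def positive_int(s):
    # reject values below 1 at option-parsing time
    v = int(s)
    if v < 1:
        raise argparse.ArgumentTypeError("%r is not a positive integer" % s)
    return v

# e.g.: p.add_argument('--threads', '-t', type=positive_int, ...)
```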
> + help='Number of objdump processes to run')
> + p.add_argument('--verbose', '-v', action='store_true', help="Print more")
> + p.add_argument('--show', type=int, help='Number of results to show')
> + p.add_argument('--objdump', default="objdump", help="Set objdump binary to run")
> + p.add_argument('file', help='object file/program as input')
> + return p.parse_args()
> +
> +def get_syms(fn):
> + syms = []
> + pc = None
> + with os.popen("nm -n --print-size " + fn) as f:
> + for l in f:
> + n = l.split()
> + if len(n) > 2 and n[2].upper() == "T":
> + pc = int(n[0], 16)
> + syms.append(pc)
> + ln = int(n[1], 16)
> + if not pc:
> + sys.exit(fn + " has no symbols")
"if not pc" becomes true in two cases:
- pc is zero
- pc is None
Only the latter case is "no symbols" because
a symbol address may become zero in relocatable ELF.
if len(syms) == 0:
...
will work correctly.
And, "pc = None" is unneeded.
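The pitfall in miniature: address 0 is falsy, so "if not pc" conflates the two cases, while the length check does not:

```python
pc = 0                  # a valid symbol address in a relocatable ELF
assert not pc           # 'if not pc' would wrongly report "no symbols"

syms = [pc]             # yet a symbol was seen
assert len(syms) != 0   # the suggested check tells the cases apart
```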
> + syms.append(pc + ln)
> + return syms
> +
> +class Account:
> + def __init__(self):
> + self.funcbytes = Counter()
> + self.linebytes = Counter()
> + self.funccount = Counter()
> + self.nolinebytes = 0
> + self.nofuncbytes = 0
> + self.total = 0
> +
> + def add(self, b):
> + self.funcbytes += b.funcbytes
> + self.linebytes += b.linebytes
> + self.funccount += b.funccount
> + self.nolinebytes += b.nolinebytes
> + self.nofuncbytes += b.nofuncbytes
> + self.total += b.total
> + return self
> +
Since this script implements a class,
please follow this blank line rule.
https://peps.python.org/pep-0008/#blank-lines
> +# dont add sys.exit here, causes deadlocks
> +def account_range(args, r):
> + a = Account()
> + line = None
> + func = None
> + codefunc = None
> +
> + cmd = ("%s -Sl %s --start-address=%#x --stop-address=%#x" %
> + (args.objdump, args.file, r[0], r[1]))
> + if args.verbose:
> + print(cmd)
> + with os.popen(cmd) as f:
> + for l in f:
> + # 250: e8 00 00 00 00 callq 255 <proc_skip_spaces+0x5>
> + m = re.match(r'\s*([0-9a-fA-F]+):\s+(.*)', l)
> + if m:
> + bytes = len(re.findall(r'[0-9a-f][0-9a-f] ', m.group(2)))
> + if not func:
> + a.nofuncbytes += bytes
> + continue
> + if not line:
> + a.nolinebytes += bytes
> + continue
> + a.total += bytes
> + a.funcbytes[func] += bytes
> + a.linebytes[(file, line)] += bytes
> + codefunc = func
> + continue
> +
> + # sysctl_init():
> + m = re.match(r'([a-zA-Z_][a-zA-Z0-9_]*)\(\):$', l)
> + if m:
> + if codefunc and m.group(1) != codefunc:
> + a.funccount[codefunc] += 1
> + codefunc = None
> + func = m.group(1)
> + continue
> +
> + # /sysctl.c:1666
> + m = re.match(r'^([^:]+):(\d+)$', l)
> + if m:
> + file, line = m.group(1), int(m.group(2))
> + continue
> +
> + if codefunc:
> + a.funccount[codefunc] += 1
> + return a
> +
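As a side note, the byte counting above can be illustrated on a single (simplified) disassembly line; an operand that happens to contain a hex-pair-plus-space token would be counted too, which could be one source of the inexactness mentioned in the help text:

```python
import re

# one objdump -Sl style line; the operand is simplified so it
# contains no hex-lookalike tokens
line = " 250:\te8 00 00 00 00 \tcallq  foo"
m = re.match(r'\s*([0-9a-fA-F]+):\s+(.*)', line)
nbytes = len(re.findall(r'[0-9a-f][0-9a-f] ', m.group(2)))
assert nbytes == 5   # e8 00 00 00 00
```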
> +def get_boundaries(syms, sym_sizes, chunk):
> + run = 0
> + boundaries = [syms[0]]
> + for i, x in enumerate(sym_sizes):
> + run += x
> + if run >= chunk:
> + boundaries.append(syms[i])
> + run = 0
> + boundaries.append(syms[-1])
> + return boundaries
> +
> +
> +def process(args):
> + # objdump -S is slow, so we parallelize
> +
> + # split symbol table into chunks for parallelization
> + # we split on functions boundaries to avoid mis-accounting
> + syms = get_syms(args.file)
> + if len(syms) < 2:
> + print("not enough symbols")
> + return
This message should go to stderr instead of stdout.
sys.exit("not enough symbols")
> + sym_sizes = [syms[x + 1] - syms[x] for x, _ in enumerate(syms[:-1])]
> + sym_total = sum(sym_sizes)
This is equivalent to
sym_total = syms[-1] - syms[0]
isn't it?
> + chunk = max(int(sym_total / args.threads), 1)
Do you know the "//" operator?
int(sym_total / args.threads)
should be equivalent to
sym_total // args.threads
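For the non-negative values involved here the two spellings agree; they only differ for negative operands, where int() truncates toward zero but // floors:

```python
assert int(7 / 2) == 7 // 2 == 3   # identical for non-negative operands
assert int(-7 / 2) == -3           # int() truncates toward zero
assert -7 // 2 == -4               # // floors
```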
> + boundaries = get_boundaries(syms, sym_sizes, chunk)
> + ranges = [(boundaries[x], boundaries[x+1]) for x in range(0, len(boundaries) - 1)]
Very tedious code.
And, your get_boundaries() helper uses fewer threads than args.threads.
How about this to compute the ranges directly?
def get_ranges(syms, threads):
ranges = []
prev = syms[0]
i = 1
while threads > 0:
boundary = prev + 1 + (syms[-1] - prev - 1) // threads
while syms[i] < boundary:
i += 1
ranges.append((prev, syms[i]))
prev = syms[i]
threads -= 1
if (prev == syms[-1]):
break
return ranges
ranges = get_ranges(syms, args.threads)
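A quick sanity check of the proposed helper on a toy symbol list:

```python
def get_ranges(syms, threads):
    # split [syms[0], syms[-1]] into at most 'threads' address ranges,
    # always cutting on symbol boundaries
    ranges = []
    prev = syms[0]
    i = 1
    while threads > 0:
        boundary = prev + 1 + (syms[-1] - prev - 1) // threads
        while syms[i] < boundary:
            i += 1
        ranges.append((prev, syms[i]))
        prev = syms[i]
        threads -= 1
        if prev == syms[-1]:
            break
    return ranges

assert get_ranges([0, 10, 20, 30, 40], 2) == [(0, 20), (20, 40)]
assert get_ranges([0, 10, 20, 30, 40], 3) == [(0, 20), (20, 30), (30, 40)]
assert get_ranges([0, 5], 1) == [(0, 5)]
```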
> + assert ranges[0][0] == syms[0]
> + assert ranges[-1][1] == syms[-1]
> +
> + # map-reduce
> + account_func = partial(account_range, args)
> + if args.threads == 1:
> + al = list(map(account_func, ranges))
> + else:
> + al = multiprocessing.Pool(args.threads).map(account_func, ranges)
> + a = reduce(lambda a, b: a.add(b), al)
> +
> + print("Total code bytes seen", a.total)
> + if args.verbose:
> + print("Bytes with no function %d (%.2f%%)" % (a.nofuncbytes, 100.0*(float(a.nofuncbytes)/a.total)))
> + print("Bytes with no lines %d (%.2f%%)" % (a.nolinebytes, 100.0*(float(a.nolinebytes)/a.total)))
> +
> + def sort_map(m):
> + return sorted(list(m.keys()), key=lambda x: m[x], reverse=True)
> +
> + print("\nCode bytes by functions:")
> + print("%-50s %-5s %-5s %-5s %-5s" % ("Function", "Total", "", "Avg", "Num"))
> + for i, j in enumerate(sort_map(a.funcbytes)):
> + if a.funcbytes[j] < args.min_bytes:
> + break
> + if args.show and i >= args.show:
> + break
> + print("%-50s %-5d (%.2f%%) %-5d %-5d" % (
> + j,
> + a.funcbytes[j],
> + a.funcbytes[j] / float(a.total),
> + a.funcbytes[j] / a.funccount[j],
> + a.funccount[j]))
> +
> + for j in list(a.linebytes.keys()):
> + if a.linebytes[j] < args.min_bytes:
> + del a.linebytes[j]
> +
> + prefix = os.path.commonprefix([x[0] for x in list(a.linebytes.keys())])
> +
> + print("\nCode bytes by nearby source line blocks:")
> + print("prefix", prefix)
> +
> + print("%-50s %-5s" % ("Line", "Total"))
> + for i, j in enumerate(sort_map(a.linebytes)):
> + if args.show and i >= args.show:
> + break
> + print("%-50s %-5d (%.2f%%)" % (
> + "%s:%d" % (j[0][len(prefix):], j[1]),
> + a.linebytes[j],
> + a.linebytes[j] / float(a.total)))
> + if len(a.linebytes) == 0:
> + print("Nothing found. enable CONFIG_DEBUG_INFO / -g?")
I think this should be checked much earlier.
If CONFIG_DEBUG_INFO is disabled, there is nothing to print.
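A possible early check (hypothetical helper: it looks for a .debug_line section in `objdump -h` output before running the expensive -S pass):

```python
def has_debug_info(section_headers):
    # 'section_headers' is the text output of `objdump -h <file>`;
    # builds with -g / CONFIG_DEBUG_INFO carry a .debug_line section
    return ".debug_line" in section_headers

assert has_debug_info("  0 .text\n  1 .debug_line\n")
assert not has_debug_info("  0 .text\n  1 .data\n")
```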
> +
> +if __name__ == '__main__':
> + process(get_args())
> --
> 2.41.0
>
--
Best Regards
Masahiro Yamada
On Mon, Nov 20, 2023 at 1:32 AM Masahiro Yamada <[email protected]> wrote:
>
> [...]
>
> > + syms = get_syms(args.file)
> > + if len(syms) < 2:
> > + print("not enough symbols")
> > + return
>
> This message should go to stderr instead of stdout.
>
> sys.exit("not enough symbols")
>
> Or, you can remove this check entirely if you
> check the length in get_syms().
>
> get_syms() requires at least one element from the loop,
> and append() one more at the end.
> So, get_syms() always has two or more elements.
>
> > + sym_sizes = [syms[x + 1] - syms[x] for x, _ in enumerate(syms[:-1])]
> > + sym_total = sum(sym_sizes)
>
>
> This is equivalent to
>
> sym_total = syms[-1] - syms[0]
>
> isn't it?
No, it's not when there are gaps in the symbols, which always happens in
the kernel. The previous version used the range, but that led to imbalances.
> def get_ranges(syms, threads):
>
> ranges = []
> prev = syms[0]
> i = 1
>
> while threads > 0:
>
> boundary = prev + 1 + (syms[-1] - prev - 1) // threads
>
> while syms[i] < boundary:
> i += 1
Note the boundary is not the address, but the sum of symbol sizes
-Andi
On Tue, Nov 21, 2023 at 5:08 AM Andi Kleen <[email protected]> wrote:
>
> >
> > > + sym_sizes = [syms[x + 1] - syms[x] for x, _ in enumerate(syms[:-1])]
> > > + sym_total = sum(sym_sizes)
> >
> >
> > This is equivalent to
> >
> > sym_total = syms[-1] - syms[0]
> >
> > isn't it?
>
> No, it's not when there are gaps in the symbols, which always happens in
> the kernel. The previous version used the range, but that led to imbalances.
I do not understand, because there is nothing in your code
that handles gaps.
get_syms() appends only 'pc' into the list.
Then,
sym_sizes = [syms[x + 1] - syms[x] for x, _ in enumerate(syms[:-1])]
computes each symbol size by:
(address of the next symbol) - (address of the current symbol).
So, your code is equivalent to:
sym_total = syms[-1] - syms[0]
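The telescoping is easy to check on a toy symbol list, gap or no gap:

```python
# toy addresses with a large gap between the 2nd and 3rd symbol
syms = [0x100, 0x180, 0x400, 0x450]
sym_sizes = [syms[x + 1] - syms[x] for x, _ in enumerate(syms[:-1])]
# consecutive differences telescope regardless of gaps
assert sum(sym_sizes) == syms[-1] - syms[0] == 0x350
```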
> > def get_ranges(syms, threads):
> >
> > ranges = []
> > prev = syms[0]
> > i = 1
> >
> > while threads > 0:
> >
> > boundary = prev + 1 + (syms[-1] - prev - 1) // threads
> >
> > while syms[i] < boundary:
> > i += 1
>
>
> Note the boundary is not the address, but the sum of symbol sizes
>
>
> -Andi
--
Best Regards
Masahiro Yamada