2021-06-07 13:24:22

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

Em Fri, Jun 04, 2021 at 07:55:17PM -0700, Andrii Nakryiko escreveu:
> On Thu, Jun 3, 2021 at 7:57 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > Em Sat, May 29, 2021 at 05:40:17PM -0700, Andrii Nakryiko escreveu:

> > > At some point it probably would make sense to formalize
> > > "btf_encoder" as a struct with its own state instead of passing in
> > > multiple variables. It would probably also

> > Take a look at the tmp.master branch at:

> > https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=tmp.master

> Oh wow, that's a lot of commits! :) Great that you decided to do this
> refactoring, thanks!

> > that btf_elf class isn't used anymore by btf_loader, that uses only
> > libbpf's APIs, and now we have a btf_encoder class with all the globals,
> > etc, more baby steps are needed to finally ditch btf_elf altogether and
> > move on to the parallelization.

> So do you plan to try to parallelize as a next step? I'm pretty

So, I haven't looked at details but what I thought would be interesting
to investigate is to see if we can piggyback DWARF generation with BTF
one, i.e. when we generate a .o file with -g we encode the DWARF info,
so, right after this, we could call pahole as-is and encode BTF, then,
when vmlinux is linked, we would do the dedup.

I.e. when generating ../build/v5.13.0-rc4+/kernel/fork.o, that comes
with:

⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep debug
[78] .debug_info PROGBITS 0000000000000000 00daec 032968 00 0 0 1
[79] .rela.debug_info RELA 0000000000000000 040458 053b68 18 I 95 78 8
[80] .debug_abbrev PROGBITS 0000000000000000 093fc0 0012e9 00 0 0 1
[81] .debug_loclists PROGBITS 0000000000000000 0952a9 00aa43 00 0 0 1
[82] .rela.debug_loclists RELA 0000000000000000 09fcf0 009d98 18 I 95 81 8
[83] .debug_aranges PROGBITS 0000000000000000 0a9a88 000080 00 0 0 1
[84] .rela.debug_aranges RELA 0000000000000000 0a9b08 0000a8 18 I 95 83 8
[85] .debug_rnglists PROGBITS 0000000000000000 0a9bb0 001509 00 0 0 1
[86] .rela.debug_rnglists RELA 0000000000000000 0ab0c0 001bc0 18 I 95 85 8
[87] .debug_line PROGBITS 0000000000000000 0acc80 0086b7 00 0 0 1
[88] .rela.debug_line RELA 0000000000000000 0b5338 002550 18 I 95 87 8
[89] .debug_str PROGBITS 0000000000000000 0b7888 0177ad 01 MS 0 0 1
[90] .debug_line_str PROGBITS 0000000000000000 0cf035 001308 01 MS 0 0 1
[93] .debug_frame PROGBITS 0000000000000000 0d0370 000e38 00 0 0 8
[94] .rela.debug_frame RELA 0000000000000000 0d11a8 000e70 18 I 95 93 8
⬢[acme@toolbox perf]$

We would do:

⬢[acme@toolbox perf]$ pahole -J ../build/v5.13.0-rc4+/kernel/fork.o
⬢[acme@toolbox perf]$

Which would get us to have:

⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep BTF
[103] .BTF PROGBITS 0000000000000000 0db658 030550 00 0 0 1
⬢[acme@toolbox perf]

⬢[acme@toolbox perf]$ pahole -F btf -C hlist_node ../build/v5.13.0-rc4+/kernel/fork.o
struct hlist_node {
struct hlist_node * next; /* 0 8 */
struct hlist_node * * pprev; /* 8 8 */

/* size: 16, cachelines: 1, members: 2 */
/* last cacheline: 16 bytes */
};
⬢[acme@toolbox perf]$

So, a 'pahole --dedup_btf vmlinux' would just go on looking at:

⬢[acme@toolbox perf]$ readelf -wi ../build/v5.13.0-rc4+/vmlinux | grep -A10 DW_TAG_compile_unit | grep -w DW_AT_name | grep fork
<f220eb> DW_AT_name : (indirect line string, offset: 0x62e7): /var/home/acme/git/linux/kernel/fork.c

To go there and go on extracting those ELF sections to combine and
dedup.

This combine thing could be done even by the linker, I think, when all
the DWARF data in the .o file are combined into vmlinux, we could do it
for the .BTF sections as well, that way would be even more elegant, I
think. Then, the combined vmlinux .BTF section would be read and fed in
one go to libbtf's dedup arg.

This way the encoding of BTF would be as paralellized as the kernel build
process, following the same logic (-j NR_PROCESSORS).

wdyt?

If this isn't the case, we can process vmlinux as is today and go on
creating N threads and feeding each with a DW_TAG_compile_unit
"container", i.e. each thread would consume all the tags below each
DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
combined and deduped by libbpf.

Doing it as my first sketch above would take advantage of locality of
reference, i.e. the DWARF data would be freshly produced and in the
cache hierarchy when we first encode BTF, later, when doing the
combine+dedup we wouldn't be touching the more voluminous DWARF data.

- Arnaldo

> confident about BTF encoding part: dump each CU into its own BTF, use
> btf__add_type() to merge multiple BTFs together. Just need to re-map
> IDs (libbpf internally has API to visit each field that contains
> type_id, it's well-defined enough to expose that as a public API, if
> necessary). Then final btf_dedup().

> But the DWARF loading and parsing part is almost a black box to me, so
> I'm not sure how much work it would involve.

> > I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
> > very piecemeal as I'm doing will help bisecting any subtle bug this may
> > introduce.

> > > allow to parallelize BTF generation, where each CU would proceed in
> > > parallel generating local BTF, and then the final pass would merge and
> > > dedup BTFs. Currently reading and processing DWARF is the slowest part
> > > of the DWARF-to-BTF conversion, parallelization and maybe some other
> > > optimization seems like the only way to speed the process up.

> > > Acked-by: Andrii Nakryiko <[email protected]>

> > Thanks!


2021-06-07 15:46:39

by Yonghong Song

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22



On 6/7/21 6:20 AM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jun 04, 2021 at 07:55:17PM -0700, Andrii Nakryiko escreveu:
>> On Thu, Jun 3, 2021 at 7:57 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>>> Em Sat, May 29, 2021 at 05:40:17PM -0700, Andrii Nakryiko escreveu:
>
>>>> At some point it probably would make sense to formalize
>>>> "btf_encoder" as a struct with its own state instead of passing in
>>>> multiple variables. It would probably also
>
>>> Take a look at the tmp.master branch at:
>
>>> https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=tmp.master
>
>> Oh wow, that's a lot of commits! :) Great that you decided to do this
>> refactoring, thanks!
>
>>> that btf_elf class isn't used anymore by btf_loader, that uses only
>>> libbpf's APIs, and now we have a btf_encoder class with all the globals,
>>> etc, more baby steps are needed to finally ditch btf_elf altogether and
>>> move on to the parallelization.
>
>> So do you plan to try to parallelize as a next step? I'm pretty
>
> So, I haven't looked at details but what I thought would be interesting
> to investigate is to see if we can piggyback DWARF generation with BTF
> one, i.e. when we generate a .o file with -g we encode the DWARF info,
> so, right after this, we could call pahole as-is and encode BTF, then,
> when vmlinux is linked, we would do the dedup.
>
> I.e. when generating ../build/v5.13.0-rc4+/kernel/fork.o, that comes
> with:
>
> ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep debug
> [78] .debug_info PROGBITS 0000000000000000 00daec 032968 00 0 0 1
> [79] .rela.debug_info RELA 0000000000000000 040458 053b68 18 I 95 78 8
> [80] .debug_abbrev PROGBITS 0000000000000000 093fc0 0012e9 00 0 0 1
> [81] .debug_loclists PROGBITS 0000000000000000 0952a9 00aa43 00 0 0 1
> [82] .rela.debug_loclists RELA 0000000000000000 09fcf0 009d98 18 I 95 81 8
> [83] .debug_aranges PROGBITS 0000000000000000 0a9a88 000080 00 0 0 1
> [84] .rela.debug_aranges RELA 0000000000000000 0a9b08 0000a8 18 I 95 83 8
> [85] .debug_rnglists PROGBITS 0000000000000000 0a9bb0 001509 00 0 0 1
> [86] .rela.debug_rnglists RELA 0000000000000000 0ab0c0 001bc0 18 I 95 85 8
> [87] .debug_line PROGBITS 0000000000000000 0acc80 0086b7 00 0 0 1
> [88] .rela.debug_line RELA 0000000000000000 0b5338 002550 18 I 95 87 8
> [89] .debug_str PROGBITS 0000000000000000 0b7888 0177ad 01 MS 0 0 1
> [90] .debug_line_str PROGBITS 0000000000000000 0cf035 001308 01 MS 0 0 1
> [93] .debug_frame PROGBITS 0000000000000000 0d0370 000e38 00 0 0 8
> [94] .rela.debug_frame RELA 0000000000000000 0d11a8 000e70 18 I 95 93 8
> ⬢[acme@toolbox perf]$
>
> We would do:
>
> ⬢[acme@toolbox perf]$ pahole -J ../build/v5.13.0-rc4+/kernel/fork.o
> ⬢[acme@toolbox perf]$
>
> Which would get us to have:
>
> ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep BTF
> [103] .BTF PROGBITS 0000000000000000 0db658 030550 00 0 0 1
> ⬢[acme@toolbox perf]
>
> ⬢[acme@toolbox perf]$ pahole -F btf -C hlist_node ../build/v5.13.0-rc4+/kernel/fork.o
> struct hlist_node {
> struct hlist_node * next; /* 0 8 */
> struct hlist_node * * pprev; /* 8 8 */
>
> /* size: 16, cachelines: 1, members: 2 */
> /* last cacheline: 16 bytes */
> };
> ⬢[acme@toolbox perf]$
>
> So, a 'pahole --dedup_btf vmlinux' would just go on looking at:
>
> ⬢[acme@toolbox perf]$ readelf -wi ../build/v5.13.0-rc4+/vmlinux | grep -A10 DW_TAG_compile_unit | grep -w DW_AT_name | grep fork
> <f220eb> DW_AT_name : (indirect line string, offset: 0x62e7): /var/home/acme/git/linux/kernel/fork.c
>
> To go there and go on extracting those ELF sections to combine and
> dedup.
>
> This combine thing could be done even by the linker, I think, when all
> the DWARF data in the .o file are combined into vmlinux, we could do it
> for the .BTF sections as well, that way would be even more elegant, I

The linker will do the combine. It should just concatenate
all .BTF sections together like
.BTF section
.BTF data from file 1
.BTF data from file 2
...

> think. Then, the combined vmlinux .BTF section would be read and fed in
> one go to libbtf's dedup arg.

I think this should work based on today's implementation but we do have
a caveat here.

The issue is related to DATASEC's. In DATASEC, we tried to encode
section offset for variables. These section offset should be
relocated during linking stage. But currently pahole does not
generate reloations for such variables so linker will ignore
them.

This shouldn't be an issue for global variables as we can find its
name in VAR and look up final symbol table for its section offset.

But this might be an issue for static variables with the same
name and just matching names in VAR is not enough as their
may be multiple ones in the symbol table. We could have a
workaround though, e.g., rename all static variables with a unique name
like <file_name>.[<func_name>.]<var_name> and went to dwarf
to find this static variable offset. dwarf should have
static variable section offset properly relocated.

Another solution is for pahole to generate .rel.BTF which
encodes relocations.

Currently, we don't emit static variables in vmlinux BTF (only
percpu globals), but not sure whether in the future we still
won't.

>
> This way the encoding of BTF would be as paralellized as the kernel build
> process, following the same logic (-j NR_PROCESSORS).
>
> wdyt?
>
> If this isn't the case, we can process vmlinux as is today and go on
> creating N threads and feeding each with a DW_TAG_compile_unit
> "container", i.e. each thread would consume all the tags below each
> DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
> combined and deduped by libbpf.
>
> Doing it as my first sketch above would take advantage of locality of
> reference, i.e. the DWARF data would be freshly produced and in the
> cache hierarchy when we first encode BTF, later, when doing the
> combine+dedup we wouldn't be touching the more voluminous DWARF data.
>
> - Arnaldo
>
>> confident about BTF encoding part: dump each CU into its own BTF, use
>> btf__add_type() to merge multiple BTFs together. Just need to re-map
>> IDs (libbpf internally has API to visit each field that contains
>> type_id, it's well-defined enough to expose that as a public API, if
>> necessary). Then final btf_dedup().
>
>> But the DWARF loading and parsing part is almost a black box to me, so
>> I'm not sure how much work it would involve.
>
>>> I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
>>> very piecemeal as I'm doing will help bisecting any subtle bug this may
>>> introduce.
>
>>>> allow to parallelize BTF generation, where each CU would proceed in
>>>> parallel generating local BTF, and then the final pass would merge and
>>>> dedup BTFs. Currently reading and processing DWARF is the slowest part
>>>> of the DWARF-to-BTF conversion, parallelization and maybe some other
>>>> optimization seems like the only way to speed the process up.
>
>>>> Acked-by: Andrii Nakryiko <[email protected]>
>
>>> Thanks!

2021-06-08 00:57:08

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

On Mon, Jun 7, 2021 at 6:20 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Fri, Jun 04, 2021 at 07:55:17PM -0700, Andrii Nakryiko escreveu:
> > On Thu, Jun 3, 2021 at 7:57 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > > Em Sat, May 29, 2021 at 05:40:17PM -0700, Andrii Nakryiko escreveu:
>
> > > > At some point it probably would make sense to formalize
> > > > "btf_encoder" as a struct with its own state instead of passing in
> > > > multiple variables. It would probably also
>
> > > Take a look at the tmp.master branch at:
>
> > > https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=tmp.master
>
> > Oh wow, that's a lot of commits! :) Great that you decided to do this
> > refactoring, thanks!
>
> > > that btf_elf class isn't used anymore by btf_loader, that uses only
> > > libbpf's APIs, and now we have a btf_encoder class with all the globals,
> > > etc, more baby steps are needed to finally ditch btf_elf altogether and
> > > move on to the parallelization.
>
> > So do you plan to try to parallelize as a next step? I'm pretty
>
> So, I haven't looked at details but what I thought would be interesting
> to investigate is to see if we can piggyback DWARF generation with BTF
> one, i.e. when we generate a .o file with -g we encode the DWARF info,
> so, right after this, we could call pahole as-is and encode BTF, then,
> when vmlinux is linked, we would do the dedup.
>
> I.e. when generating ../build/v5.13.0-rc4+/kernel/fork.o, that comes
> with:
>
> ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep debug
> [78] .debug_info PROGBITS 0000000000000000 00daec 032968 00 0 0 1
> [79] .rela.debug_info RELA 0000000000000000 040458 053b68 18 I 95 78 8
> [80] .debug_abbrev PROGBITS 0000000000000000 093fc0 0012e9 00 0 0 1
> [81] .debug_loclists PROGBITS 0000000000000000 0952a9 00aa43 00 0 0 1
> [82] .rela.debug_loclists RELA 0000000000000000 09fcf0 009d98 18 I 95 81 8
> [83] .debug_aranges PROGBITS 0000000000000000 0a9a88 000080 00 0 0 1
> [84] .rela.debug_aranges RELA 0000000000000000 0a9b08 0000a8 18 I 95 83 8
> [85] .debug_rnglists PROGBITS 0000000000000000 0a9bb0 001509 00 0 0 1
> [86] .rela.debug_rnglists RELA 0000000000000000 0ab0c0 001bc0 18 I 95 85 8
> [87] .debug_line PROGBITS 0000000000000000 0acc80 0086b7 00 0 0 1
> [88] .rela.debug_line RELA 0000000000000000 0b5338 002550 18 I 95 87 8
> [89] .debug_str PROGBITS 0000000000000000 0b7888 0177ad 01 MS 0 0 1
> [90] .debug_line_str PROGBITS 0000000000000000 0cf035 001308 01 MS 0 0 1
> [93] .debug_frame PROGBITS 0000000000000000 0d0370 000e38 00 0 0 8
> [94] .rela.debug_frame RELA 0000000000000000 0d11a8 000e70 18 I 95 93 8
> ⬢[acme@toolbox perf]$
>
> We would do:
>
> ⬢[acme@toolbox perf]$ pahole -J ../build/v5.13.0-rc4+/kernel/fork.o
> ⬢[acme@toolbox perf]$
>
> Which would get us to have:
>
> ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep BTF
> [103] .BTF PROGBITS 0000000000000000 0db658 030550 00 0 0 1
> ⬢[acme@toolbox perf]
>
> ⬢[acme@toolbox perf]$ pahole -F btf -C hlist_node ../build/v5.13.0-rc4+/kernel/fork.o
> struct hlist_node {
> struct hlist_node * next; /* 0 8 */
> struct hlist_node * * pprev; /* 8 8 */
>
> /* size: 16, cachelines: 1, members: 2 */
> /* last cacheline: 16 bytes */
> };
> ⬢[acme@toolbox perf]$
>
> So, a 'pahole --dedup_btf vmlinux' would just go on looking at:
>
> ⬢[acme@toolbox perf]$ readelf -wi ../build/v5.13.0-rc4+/vmlinux | grep -A10 DW_TAG_compile_unit | grep -w DW_AT_name | grep fork
> <f220eb> DW_AT_name : (indirect line string, offset: 0x62e7): /var/home/acme/git/linux/kernel/fork.c
>
> To go there and go on extracting those ELF sections to combine and
> dedup.
>
> This combine thing could be done even by the linker, I think, when all
> the DWARF data in the .o file are combined into vmlinux, we could do it
> for the .BTF sections as well, that way would be even more elegant, I
> think. Then, the combined vmlinux .BTF section would be read and fed in
> one go to libbtf's dedup arg.
>
> This way the encoding of BTF would be as paralellized as the kernel build
> process, following the same logic (-j NR_PROCESSORS).
>
> wdyt?

I think it's very fragile and it will be easy to get
broken/invalid/incomplete BTF. Yonghong already brought up the case
for static variables. There might be some other issues that exist
today, or we might run into when we further extend BTF. Like some
custom linker script that will do something to vmlinux.o that we won't
know about.

And also this will be purely vmlinux-specific approach relying on
extra and custom Kbuild integration.

While if you parallelize DWARF loading and BTF generation, that will
be more reliably correct (modulo any bugs of course) and will work for
any DWARF-to-BTF cases that might come up in the future.

So I wouldn't even bother with individual .o's, tbh.

>
> If this isn't the case, we can process vmlinux as is today and go on
> creating N threads and feeding each with a DW_TAG_compile_unit
> "container", i.e. each thread would consume all the tags below each
> DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
> combined and deduped by libbpf.
>
> Doing it as my first sketch above would take advantage of locality of
> reference, i.e. the DWARF data would be freshly produced and in the
> cache hierarchy when we first encode BTF, later, when doing the
> combine+dedup we wouldn't be touching the more voluminous DWARF data.

Yep, that's what I'd do.

>
> - Arnaldo
>
> > confident about BTF encoding part: dump each CU into its own BTF, use
> > btf__add_type() to merge multiple BTFs together. Just need to re-map
> > IDs (libbpf internally has API to visit each field that contains
> > type_id, it's well-defined enough to expose that as a public API, if
> > necessary). Then final btf_dedup().
>
> > But the DWARF loading and parsing part is almost a black box to me, so
> > I'm not sure how much work it would involve.
>
> > > I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
> > > very piecemeal as I'm doing will help bisecting any subtle bug this may
> > > introduce.
>
> > > > allow to parallelize BTF generation, where each CU would proceed in
> > > > parallel generating local BTF, and then the final pass would merge and
> > > > dedup BTFs. Currently reading and processing DWARF is the slowest part
> > > > of the DWARF-to-BTF conversion, parallelization and maybe some other
> > > > optimization seems like the only way to speed the process up.
>
> > > > Acked-by: Andrii Nakryiko <[email protected]>
>
> > > Thanks!

2021-06-09 01:36:26

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> On Mon, Jun 7, 2021 at 6:20 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> >
> > Em Fri, Jun 04, 2021 at 07:55:17PM -0700, Andrii Nakryiko escreveu:
> > > On Thu, Jun 3, 2021 at 7:57 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > > > Em Sat, May 29, 2021 at 05:40:17PM -0700, Andrii Nakryiko escreveu:
> >
> > > > > At some point it probably would make sense to formalize
> > > > > "btf_encoder" as a struct with its own state instead of passing in
> > > > > multiple variables. It would probably also
> >
> > > > Take a look at the tmp.master branch at:
> >
> > > > https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=tmp.master
> >
> > > Oh wow, that's a lot of commits! :) Great that you decided to do this
> > > refactoring, thanks!
> >
> > > > that btf_elf class isn't used anymore by btf_loader, that uses only
> > > > libbpf's APIs, and now we have a btf_encoder class with all the globals,
> > > > etc, more baby steps are needed to finally ditch btf_elf altogether and
> > > > move on to the parallelization.
> >
> > > So do you plan to try to parallelize as a next step? I'm pretty
> >
> > So, I haven't looked at details but what I thought would be interesting
> > to investigate is to see if we can piggyback DWARF generation with BTF
> > one, i.e. when we generate a .o file with -g we encode the DWARF info,
> > so, right after this, we could call pahole as-is and encode BTF, then,
> > when vmlinux is linked, we would do the dedup.
> >
> > I.e. when generating ../build/v5.13.0-rc4+/kernel/fork.o, that comes
> > with:
> >
> > ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep debug
> > [78] .debug_info PROGBITS 0000000000000000 00daec 032968 00 0 0 1
> > [79] .rela.debug_info RELA 0000000000000000 040458 053b68 18 I 95 78 8
> > [80] .debug_abbrev PROGBITS 0000000000000000 093fc0 0012e9 00 0 0 1
> > [81] .debug_loclists PROGBITS 0000000000000000 0952a9 00aa43 00 0 0 1
> > [82] .rela.debug_loclists RELA 0000000000000000 09fcf0 009d98 18 I 95 81 8
> > [83] .debug_aranges PROGBITS 0000000000000000 0a9a88 000080 00 0 0 1
> > [84] .rela.debug_aranges RELA 0000000000000000 0a9b08 0000a8 18 I 95 83 8
> > [85] .debug_rnglists PROGBITS 0000000000000000 0a9bb0 001509 00 0 0 1
> > [86] .rela.debug_rnglists RELA 0000000000000000 0ab0c0 001bc0 18 I 95 85 8
> > [87] .debug_line PROGBITS 0000000000000000 0acc80 0086b7 00 0 0 1
> > [88] .rela.debug_line RELA 0000000000000000 0b5338 002550 18 I 95 87 8
> > [89] .debug_str PROGBITS 0000000000000000 0b7888 0177ad 01 MS 0 0 1
> > [90] .debug_line_str PROGBITS 0000000000000000 0cf035 001308 01 MS 0 0 1
> > [93] .debug_frame PROGBITS 0000000000000000 0d0370 000e38 00 0 0 8
> > [94] .rela.debug_frame RELA 0000000000000000 0d11a8 000e70 18 I 95 93 8
> > ⬢[acme@toolbox perf]$
> >
> > We would do:
> >
> > ⬢[acme@toolbox perf]$ pahole -J ../build/v5.13.0-rc4+/kernel/fork.o
> > ⬢[acme@toolbox perf]$
> >
> > Which would get us to have:
> >
> > ⬢[acme@toolbox perf]$ readelf -SW ../build/v5.13.0-rc4+/kernel/fork.o | grep BTF
> > [103] .BTF PROGBITS 0000000000000000 0db658 030550 00 0 0 1
> > ⬢[acme@toolbox perf]
> >
> > ⬢[acme@toolbox perf]$ pahole -F btf -C hlist_node ../build/v5.13.0-rc4+/kernel/fork.o
> > struct hlist_node {
> > struct hlist_node * next; /* 0 8 */
> > struct hlist_node * * pprev; /* 8 8 */
> >
> > /* size: 16, cachelines: 1, members: 2 */
> > /* last cacheline: 16 bytes */
> > };
> > ⬢[acme@toolbox perf]$
> >
> > So, a 'pahole --dedup_btf vmlinux' would just go on looking at:
> >
> > ⬢[acme@toolbox perf]$ readelf -wi ../build/v5.13.0-rc4+/vmlinux | grep -A10 DW_TAG_compile_unit | grep -w DW_AT_name | grep fork
> > <f220eb> DW_AT_name : (indirect line string, offset: 0x62e7): /var/home/acme/git/linux/kernel/fork.c
> >
> > To go there and go on extracting those ELF sections to combine and
> > dedup.
> >
> > This combine thing could be done even by the linker, I think, when all
> > the DWARF data in the .o file are combined into vmlinux, we could do it
> > for the .BTF sections as well, that way would be even more elegant, I
> > think. Then, the combined vmlinux .BTF section would be read and fed in
> > one go to libbtf's dedup arg.
> >
> > This way the encoding of BTF would be as paralellized as the kernel build
> > process, following the same logic (-j NR_PROCESSORS).
> >
> > wdyt?
>
> I think it's very fragile and it will be easy to get
> broken/invalid/incomplete BTF. Yonghong already brought up the case

I thought about that as it would be almost like the compiler generating
BTF, but you are right, the vmlinux prep process is a complex beast and
probably it is best to go with the second approach I outlined and you
agreed to be less fragile, so I'll go with that, thanks for your
comments.

- Arnaldo

> for static variables. There might be some other issues that exist
> today, or we might run into when we further extend BTF. Like some
> custom linker script that will do something to vmlinux.o that we won't
> know about.
>
> And also this will be purely vmlinux-specific approach relying on
> extra and custom Kbuild integration.
>
> While if you parallelize DWARF loading and BTF generation, that will
> be more reliably correct (modulo any bugs of course) and will work for
> any DWARF-to-BTF cases that might come up in the future.
>
> So I wouldn't even bother with individual .o's, tbh.
>
> >
> > If this isn't the case, we can process vmlinux as is today and go on
> > creating N threads and feeding each with a DW_TAG_compile_unit
> > "container", i.e. each thread would consume all the tags below each
> > DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
> > combined and deduped by libbpf.
> >
> > Doing it as my first sketch above would take advantage of locality of
> > reference, i.e. the DWARF data would be freshly produced and in the
> > cache hierarchy when we first encode BTF, later, when doing the
> > combine+dedup we wouldn't be touching the more voluminous DWARF data.
>
> Yep, that's what I'd do.
>
> >
> > - Arnaldo
> >
> > > confident about BTF encoding part: dump each CU into its own BTF, use
> > > btf__add_type() to merge multiple BTFs together. Just need to re-map
> > > IDs (libbpf internally has API to visit each field that contains
> > > type_id, it's well-defined enough to expose that as a public API, if
> > > necessary). Then final btf_dedup().
> >
> > > But the DWARF loading and parsing part is almost a black box to me, so
> > > I'm not sure how much work it would involve.
> >
> > > > I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
> > > > very piecemeal as I'm doing will help bisecting any subtle bug this may
> > > > introduce.
> >
> > > > > allow to parallelize BTF generation, where each CU would proceed in
> > > > > parallel generating local BTF, and then the final pass would merge and
> > > > > dedup BTFs. Currently reading and processing DWARF is the slowest part
> > > > > of the DWARF-to-BTF conversion, parallelization and maybe some other
> > > > > optimization seems like the only way to speed the process up.
> >
> > > > > Acked-by: Andrii Nakryiko <[email protected]>
> >
> > > > Thanks!

--

- Arnaldo

2021-06-15 19:02:39

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > I think it's very fragile and it will be easy to get
> > broken/invalid/incomplete BTF. Yonghong already brought up the case

> I thought about that as it would be almost like the compiler generating
> BTF, but you are right, the vmlinux prep process is a complex beast and
> probably it is best to go with the second approach I outlined and you
> agreed to be less fragile, so I'll go with that, thanks for your
> comments.

So, just to write some notes here from what I saw so far:

1. In the LTO cases there are inter-CU references, so the current code
combines all CUs into one and we end up not being able to parallelize
much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
think the current algorithm is ideal, can be improved.

2. The case where there's no inter CU refs, which so far is the most
common, seems easier, we create N threads, all sharing the dwarf_loader
state and the btf_encoder, as-is now. we can process one CU per thread,
and as soon as we finish it, just grab a lock and call
btf_encoder__encode_cu() with the just produced CU data structures
(tags, types, functions, variables, etc), consume them and delete the
CU.

So each thread will consume one CU, push it to the 'struct btf' class
as-is now and then ask for the next CU, using the dwarf_loader state,
still under that lock, then go back to processing dwarf tags, then
lock, btf add types, rinse, repeat.

The ordering will be different than what we have now, as some smaller
CUs (object files with debug) will be processed faster so will get its
btf encoding slot faster, but that, at btf__dedup() time shouldn't make
a difference, right?

I think I'm done with refactoring the btf_encoder code, which should be
by now a thin layer on top of the excellent libbpf BTF API, just getting
what the previous loader (DWARF) produced and feeding libbpf.

I thought about fancy thread pools, etc, researching some pre-existing
thing or doing some kthread + workqueue lifting from the kernel but will
instead start with the most spartan code, we can improve later.

There it is, dumped my thoughts on this, time to go do some coding
before I get preempted...

- Arnaldo

> - Arnaldo
>
> > for static variables. There might be some other issues that exist
> > today, or we might run into when we further extend BTF. Like some
> > custom linker script that will do something to vmlinux.o that we won't
> > know about.
> >
> > And also this will be purely vmlinux-specific approach relying on
> > extra and custom Kbuild integration.
> >
> > While if you parallelize DWARF loading and BTF generation, that will
> > be more reliably correct (modulo any bugs of course) and will work for
> > any DWARF-to-BTF cases that might come up in the future.
> >
> > So I wouldn't even bother with individual .o's, tbh.
> >
> > >
> > > If this isn't the case, we can process vmlinux as is today and go on
> > > creating N threads and feeding each with a DW_TAG_compile_unit
> > > "container", i.e. each thread would consume all the tags below each
> > > DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
> > > combined and deduped by libbpf.
> > >
> > > Doing it as my first sketch above would take advantage of locality of
> > > reference, i.e. the DWARF data would be freshly produced and in the
> > > cache hierarchy when we first encode BTF, later, when doing the
> > > combine+dedup we wouldn't be touching the more voluminous DWARF data.
> >
> > Yep, that's what I'd do.
> >
> > >
> > > - Arnaldo
> > >
> > > > confident about BTF encoding part: dump each CU into its own BTF, use
> > > > btf__add_type() to merge multiple BTFs together. Just need to re-map
> > > > IDs (libbpf internally has API to visit each field that contains
> > > > type_id, it's well-defined enough to expose that as a public API, if
> > > > necessary). Then final btf_dedup().
> > >
> > > > But the DWARF loading and parsing part is almost a black box to me, so
> > > > I'm not sure how much work it would involve.
> > >
> > > > > I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
> > > > > very piecemeal as I'm doing will help bisecting any subtle bug this may
> > > > > introduce.
> > >
> > > > > > allow to parallelize BTF generation, where each CU would proceed in
> > > > > > parallel generating local BTF, and then the final pass would merge and
> > > > > > dedup BTFs. Currently reading and processing DWARF is the slowest part
> > > > > > of the DWARF-to-BTF conversion, parallelization and maybe some other
> > > > > > optimization seems like the only way to speed the process up.
> > >
> > > > > > Acked-by: Andrii Nakryiko <[email protected]>
> > >
> > > > > Thanks!
>
> --
>
> - Arnaldo

--

- Arnaldo

2021-06-15 19:15:31

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > I think it's very fragile and it will be easy to get
> > > broken/invalid/incomplete BTF. Yonghong already brought up the case
>
> > I thought about that as it would be almost like the compiler generating
> > BTF, but you are right, the vmlinux prep process is a complex beast and
> > probably it is best to go with the second approach I outlined and you
> > agreed to be less fragile, so I'll go with that, thanks for your
> > comments.
>
> So, just to write some notes here from what I saw so far:
>
> 1. In the LTO cases there are inter-CU references, so the current code
> combines all CUs into one and we end up not being able to parallelize
> much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> think the current algorithm is ideal, can be improved.
>

Yeah, let's worry about LTO later.

> 2. The case where there's no inter CU refs, which so far is the most
> common, seems easier, we create N threads, all sharing the dwarf_loader
> state and the btf_encoder, as-is now. we can process one CU per thread,
> and as soon as we finish it, just grab a lock and call
> btf_encoder__encode_cu() with the just produced CU data structures
> (tags, types, functions, variables, etc), consume them and delete the
> CU.
>
> So each thread will consume one CU, push it to the 'struct btf' class
> as-is now and then ask for the next CU, using the dwarf_loader state,
> still under that lock, then go back to processing dwarf tags, then
> lock, btf add types, rinse, repeat.

Hmm... wouldn't keeping a "local" per-thread struct btf and just keep
appending to it for each processed CU until we run out of CUs be
simpler? So each thread does as much as possible locally without any
locks. And only at the very end we merge everything together and then
dedup. Or we can even dedup inside each worker before merging final
btf, that probably would give quite a lot of speed up and some memory
saving. Would be interesting to experiment with that.

So I like the idea of a fixed pool of threads (can be customized, and
I'd default to num_workers == num_cpus), but I think we can and should
keep them independent for as long as possible.

Another disadvantage of generating small struct btf and then lock +
merge is that we don't get as efficient string re-use, we'll churn
more on string memory allocation. Keeping bigger local struct btfs
allow for more efficient memory re-use (and probably a tiny bit of CPU
savings).

So please consider that, it also seems simpler overall.


>
> The ordering will be different than what we have now, as some smaller
> CUs (object files with debug) will be processed faster so will get its
> btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> a difference, right?

Right, order doesn't matter.

>
> I think I'm done with refactoring the btf_encoder code, which should be
> by now a thin layer on top of the excellent libbpf BTF API, just getting
> what the previous loader (DWARF) produced and feeding libbpf.

Great.

>
> I thought about fancy thread pools, etc, researching some pre-existing
> thing or doing some kthread + workqueue lifting from the kernel but will
> instead start with the most spartan code, we can improve later.

Agree, simple is good. Really curious how much faster we can get. I
think anything fancy will give a relatively small improvement. The
biggest one will come from any parallelization.

>
> There it is, dumped my thoughts on this, time to go do some coding
> before I get preempted...
>
> - Arnaldo
>
> > - Arnaldo
> >
> > > for static variables. There might be some other issues that exist
> > > today, or we might run into when we further extend BTF. Like some
> > > custom linker script that will do something to vmlinux.o that we won't
> > > know about.
> > >
> > > And also this will be purely vmlinux-specific approach relying on
> > > extra and custom Kbuild integration.
> > >
> > > While if you parallelize DWARF loading and BTF generation, that will
> > > be more reliably correct (modulo any bugs of course) and will work for
> > > any DWARF-to-BTF cases that might come up in the future.
> > >
> > > So I wouldn't even bother with individual .o's, tbh.
> > >
> > > >
> > > > If this isn't the case, we can process vmlinux as is today and go on
> > > > creating N threads and feeding each with a DW_TAG_compile_unit
> > > > "container", i.e. each thread would consume all the tags below each
> > > > DW_TAG_compile_unit and produce a foo.BTF file that in the end would be
> > > > combined and deduped by libbpf.
> > > >
> > > > Doing it as my first sketch above would take advantage of locality of
> > > > reference, i.e. the DWARF data would be freshly produced and in the
> > > > cache hierarchy when we first encode BTF, later, when doing the
> > > > combine+dedup we wouldn't be touching the more voluminous DWARF data.
> > >
> > > Yep, that's what I'd do.
> > >
> > > >
> > > > - Arnaldo
> > > >
> > > > > confident about BTF encoding part: dump each CU into its own BTF, use
> > > > > btf__add_type() to merge multiple BTFs together. Just need to re-map
> > > > > IDs (libbpf internally has API to visit each field that contains
> > > > > type_id, it's well-defined enough to expose that as a public API, if
> > > > > necessary). Then final btf_dedup().
> > > >
> > > > > But the DWARF loading and parsing part is almost a black box to me, so
> > > > > I'm not sure how much work it would involve.
> > > >
> > > > > > I'm doing 'pahole -J vmlinux && btfdiff' after each cset and doing it
> > > > > > very piecemeal as I'm doing will help bisecting any subtle bug this may
> > > > > > introduce.
> > > >
> > > > > > > allow to parallelize BTF generation, where each CU would proceed in
> > > > > > > parallel generating local BTF, and then the final pass would merge and
> > > > > > > dedup BTFs. Currently reading and processing DWARF is the slowest part
> > > > > > > of the DWARF-to-BTF conversion, parallelization and maybe some other
> > > > > > > optimization seems like the only way to speed the process up.
> > > >
> > > > > > > Acked-by: Andrii Nakryiko <[email protected]>
> > > >
> > > > > > Thanks!
> >
> > --
> >
> > - Arnaldo
>
> --
>
> - Arnaldo

2021-06-15 19:39:05

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

Em Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko escreveu:
> On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <[email protected]> wrote:

> > Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > > I think it's very fragile and it will be easy to get
> > > > broken/invalid/incomplete BTF. Yonghong already brought up the case

> > > I thought about that as it would be almost like the compiler generating
> > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > probably it is best to go with the second approach I outlined and you
> > > agreed to be less fragile, so I'll go with that, thanks for your
> > > comments.

> > So, just to write some notes here from what I saw so far:

> > 1. In the LTO cases there are inter-CU references, so the current code
> > combines all CUs into one and we end up not being able to parallelize
> > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > think the current algorithm is ideal, can be improved.

> Yeah, let's worry about LTO later.

> > 2. The case where there's no inter CU refs, which so far is the most
> > common, seems easier, we create N threads, all sharing the dwarf_loader
> > state and the btf_encoder, as-is now. we can process one CU per thread,
> > and as soon as we finish it, just grab a lock and call
> > btf_encoder__encode_cu() with the just produced CU data structures
> > (tags, types, functions, variables, etc), consume them and delete the
> > CU.
> >
> > So each thread will consume one CU, push it to the 'struct btf' class
> > as-is now and then ask for the next CU, using the dwarf_loader state,
> > still under that lock, then go back to processing dwarf tags, then
> > lock, btf add types, rinse, repeat.
>
> Hmm... wouldn't keeping a "local" per-thread struct btf and just keep
> appending to it for each processed CU until we run out of CUs be
> simpler?

I thought about this as a logical next step, I would love to have a
'btf__merge_argv(struct btf *btf[]), is there one?

But from what I've read after this first paragraph of yours, lemme try
to rephrase:

1. pahole calls btf_encoder__new(...)

Creates a single struct btf.

2. dwarf_loader will create N threads, each will call a
dwarf_get_next_cu() that is locked and will return a CU to process, when
it finishes this CU, calls btf_encoder__encode_cu() under an all-threads
lock. Rinse repeat.

Until all the threads have consumed all CUs.

then btf_encoder__encode(), which should be probably renamed to
btf_econder__finish() will call btf__dedup(encoder->btf) and write ELF
or raw file.

My first reaction to your first paragraph was:

Yeah, we can have multiple 'struct btf' instances, one per thread, that
will each contain a subset of DWARF CU's encoded as BTF, and then I have
to merge the per-thread BTF and then dedup. O think my rephrase above is
better, no?

> So each thread does as much as possible locally without any
> locks. And only at the very end we merge everything together and then
> dedup. Or we can even dedup inside each worker before merging final
> btf, that probably would give quite a lot of speed up and some memory
> saving. Would be interesting to experiment with that.
>
> So I like the idea of a fixed pool of threads (can be customized, and
> I'd default to num_workers == num_cpus), but I think we can and should
> keep them independent for as long as possible.

Sure, this should map the whatever the user passes to -j in the kernel
make command line, if nothing is passed as an argument, then default to
getconf(_NPROCESSORS_ONLN).

There is a nice coincidence here where we probably don't care about -J
anymore and want to deal only with -j (detached btf) that is the same as
what 'make' expects to state how many "jobs" (thread pool size) the user
wants 8-)

> Another disadvantage of generating small struct btf and then lock +
> merge is that we don't get as efficient string re-use, we'll churn
> more on string memory allocation. Keeping bigger local struct btfs
> allow for more efficient memory re-use (and probably a tiny bit of CPU
> savings).

I think we're in the same page, the contention for adding the CU to a
single 'struct btf' (amongst all DWARF loading threads) after we just
produced it should be minimal, so we grab all the advantages: locality
of reference, minimal contention as DWARF reading/creating the pahole
internal, neutral, data structures should be higher than adding
types/functions/variables via the libbpf BTF API.

I.e. we can leave paralellizing the BTF _encoding_ for later, what we're
trying to do now is to paralellize the DWARF _loading_, right?

> So please consider that, it also seems simpler overall.

> > The ordering will be different than what we have now, as some smaller
> > CUs (object files with debug) will be processed faster so will get its
> > btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> > a difference, right?

> Right, order doesn't matter.

> > I think I'm done with refactoring the btf_encoder code, which should be
> > by now a thin layer on top of the excellent libbpf BTF API, just getting
> > what the previous loader (DWARF) produced and feeding libbpf.

> Great.

> > I thought about fancy thread pools, etc, researching some pre-existing
> > thing or doing some kthread + workqueue lifting from the kernel but will
> > instead start with the most spartan code, we can improve later.

> Agree, simple is good. Really curious how much faster we can get. I
> think anything fancy will give a relatively small improvement. The
> biggest one will come from any parallelization.

And I think that is possible, modulo elfutils libraries saying no, I
hope that will not be the case.

- Arnaldo

2021-06-15 20:07:10

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

On Tue, Jun 15, 2021 at 12:38 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko escreveu:
> > On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> > > Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > > > I think it's very fragile and it will be easy to get
> > > > > broken/invalid/incomplete BTF. Yonghong already brought up the case
>
> > > > I thought about that as it would be almost like the compiler generating
> > > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > > probably it is best to go with the second approach I outlined and you
> > > > agreed to be less fragile, so I'll go with that, thanks for your
> > > > comments.
>
> > > So, just to write some notes here from what I saw so far:
>
> > > 1. In the LTO cases there are inter-CU references, so the current code
> > > combines all CUs into one and we end up not being able to parallelize
> > > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > > think the current algorithm is ideal, can be improved.
>
> > Yeah, let's worry about LTO later.
>
> > > 2. The case where there's no inter CU refs, which so far is the most
> > > common, seems easier, we create N threads, all sharing the dwarf_loader
> > > state and the btf_encoder, as-is now. we can process one CU per thread,
> > > and as soon as we finish it, just grab a lock and call
> > > btf_encoder__encode_cu() with the just produced CU data structures
> > > (tags, types, functions, variables, etc), consume them and delete the
> > > CU.
> > >
> > > So each thread will consume one CU, push it to the 'struct btf' class
> > > as-is now and then ask for the next CU, using the dwarf_loader state,
> > > still under that lock, then go back to processing dwarf tags, then
> > > lock, btf add types, rinse, repeat.
> >
> > Hmm... wouldn't keeping a "local" per-thread struct btf and just keep
> > appending to it for each processed CU until we run out of CUs be
> > simpler?
>
> I thought about this as a logical next step, I would love to have a
> 'btf__merge_argv(struct btf *btf[]), is there one?
>
> But from what I've read after this first paragraph of yours, lemme try
> to rephrase:
>
> 1. pahole calls btf_encoder__new(...)
>
> Creates a single struct btf.
>
> 2. dwarf_loader will create N threads, each will call a
> dwarf_get_next_cu() that is locked and will return a CU to process, when
> it finishes this CU, calls btf_encoder__encode_cu() under an all-threads
> lock. Rinse repeat.
>
> Until all the threads have consumed all CUs.
>
> then btf_encoder__encode(), which should be probably renamed to
> btf_econder__finish() will call btf__dedup(encoder->btf) and write ELF
> or raw file.
>
> My first reaction to your first paragraph was:
>
> Yeah, we can have multiple 'struct btf' instances, one per thread, that
> will each contain a subset of DWARF CU's encoded as BTF, and then I have
> to merge the per-thread BTF and then dedup. O think my rephrase above is
> better, no?

I think I understood what you want to do from the previous email, so
you didn't have to re-phrase it, it's pretty clear already. I just
don't feel like having per-thread struct btf adds any complexity at
all and gives more flexibility and more parallelism. The next most
expensive thing after loading DWARF is string deduplication
(btf__add_str()), so it would be good to do that at per-thread level
as well as much as possible.

>
> > So each thread does as much as possible locally without any
> > locks. And only at the very end we merge everything together and then
> > dedup. Or we can even dedup inside each worker before merging final
> > btf, that probably would give quite a lot of speed up and some memory
> > saving. Would be interesting to experiment with that.
> >
> > So I like the idea of a fixed pool of threads (can be customized, and
> > I'd default to num_workers == num_cpus), but I think we can and should
> > keep them independent for as long as possible.
>
> Sure, this should map the whatever the user passes to -j in the kernel
> make command line, if nothing is passed as an argument, then default to
> getconf(_NPROCESSORS_ONLN).
>

Yep, cool. I've been told that `make -j` puts no upper limit on number
of jobs, so we shouldn't follow make model completely :-P

> There is a nice coincidence here where we probably don't care about -J
> anymore and want to deal only with -j (detached btf) that is the same as
> what 'make' expects to state how many "jobs" (thread pool size) the user
> wants 8-)
>
> > Another disadvantage of generating small struct btf and then lock +
> > merge is that we don't get as efficient string re-use, we'll churn
> > more on string memory allocation. Keeping bigger local struct btfs
> > allow for more efficient memory re-use (and probably a tiny bit of CPU
> > savings).
>
> I think we're in the same page, the contention for adding the CU to a
> single 'struct btf' (amongst all DWARF loading threads) after we just
> produced it should be minimal, so we grab all the advantages: locality
> of reference, minimal contention as DWARF reading/creating the pahole
> internal, neutral, data structures should be higher than adding
> types/functions/variables via the libbpf BTF API.

I disagree, I think contention might be noticeable because merging
BTFs is still a relatively expensive/slow operation. But feel free to
start with that, I just thought that doing per-thread struct btf
wouldn't add any complexity, which is why I mentioned that.

>
> I.e. we can leave paralellizing the BTF _encoding_ for later, what we're
> trying to do now is to paralellize the DWARF _loading_, right?

We are trying to speed up DWARF-to-BTF generation in general, not
specifically DWARF loading. DWARF loading is an obvious most expensive
part, string deduplication is the next one, if you look at profiling
data. The third one will be btf__dedup, which is why I mentioned that
it might be faster still to do pre-dedup in each thread at the very
end, right before we do final dedup. Each individual dedup will
probably significantly reduce total size of data/strings, so I have a
feeling that it will result in a very nice speed-ups in the end.

So just my 2 cents.

>
> > So please consider that, it also seems simpler overall.
>
> > > The ordering will be different than what we have now, as some smaller
> > > CUs (object files with debug) will be processed faster so will get its
> > > btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> > > a difference, right?
>
> > Right, order doesn't matter.
>
> > > I think I'm done with refactoring the btf_encoder code, which should be
> > > by now a thin layer on top of the excellent libbpf BTF API, just getting
> > > what the previous loader (DWARF) produced and feeding libbpf.
>
> > Great.
>
> > > I thought about fancy thread pools, etc, researching some pre-existing
> > > thing or doing some kthread + workqueue lifting from the kernel but will
> > > instead start with the most spartan code, we can improve later.
>
> > Agree, simple is good. Really curious how much faster we can get. I
> > think anything fancy will give a relatively small improvement. The
> > biggest one will come from any parallelization.
>
> And I think that is possible, modulo elfutils libraries saying no, I
> hope that will not be the case.

You can't imagine how eagerly I'm awaiting this bright future of
faster BTF generation step in the kernel build process. :)

>
> - Arnaldo

2021-06-15 20:26:46

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

Em Tue, Jun 15, 2021 at 01:05:30PM -0700, Andrii Nakryiko escreveu:
> On Tue, Jun 15, 2021 at 12:38 PM Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> >
> > Em Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko escreveu:
> > > On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <[email protected]> wrote:
> >
> > > > Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > > > > I think it's very fragile and it will be easy to get
> > > > > > broken/invalid/incomplete BTF. Yonghong already brought up the case
> >
> > > > > I thought about that as it would be almost like the compiler generating
> > > > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > > > probably it is best to go with the second approach I outlined and you
> > > > > agreed to be less fragile, so I'll go with that, thanks for your
> > > > > comments.
> >
> > > > So, just to write some notes here from what I saw so far:
> >
> > > > 1. In the LTO cases there are inter-CU references, so the current code
> > > > combines all CUs into one and we end up not being able to parallelize
> > > > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > > > think the current algorithm is ideal, can be improved.
> >
> > > Yeah, let's worry about LTO later.
> >
> > > > 2. The case where there's no inter CU refs, which so far is the most
> > > > common, seems easier, we create N threads, all sharing the dwarf_loader
> > > > state and the btf_encoder, as-is now. we can process one CU per thread,
> > > > and as soon as we finish it, just grab a lock and call
> > > > btf_encoder__encode_cu() with the just produced CU data structures
> > > > (tags, types, functions, variables, etc), consume them and delete the
> > > > CU.
> > > >
> > > > So each thread will consume one CU, push it to the 'struct btf' class
> > > > as-is now and then ask for the next CU, using the dwarf_loader state,
> > > > still under that lock, then go back to processing dwarf tags, then
> > > > lock, btf add types, rinse, repeat.
> > >
> > > Hmm... wouldn't keeping a "local" per-thread struct btf and just keep
> > > appending to it for each processed CU until we run out of CUs be
> > > simpler?
> >
> > I thought about this as a logical next step, I would love to have a
> > 'btf__merge_argv(struct btf *btf[]), is there one?
> >
> > But from what I've read after this first paragraph of yours, lemme try
> > to rephrase:
> >
> > 1. pahole calls btf_encoder__new(...)
> >
> > Creates a single struct btf.
> >
> > 2. dwarf_loader will create N threads, each will call a
> > dwarf_get_next_cu() that is locked and will return a CU to process, when
> > it finishes this CU, calls btf_encoder__encode_cu() under an all-threads
> > lock. Rinse repeat.
> >
> > Until all the threads have consumed all CUs.
> >
> > then btf_encoder__encode(), which should be probably renamed to
> > btf_econder__finish() will call btf__dedup(encoder->btf) and write ELF
> > or raw file.
> >
> > My first reaction to your first paragraph was:
> >
> > Yeah, we can have multiple 'struct btf' instances, one per thread, that
> > will each contain a subset of DWARF CU's encoded as BTF, and then I have
> > to merge the per-thread BTF and then dedup. O think my rephrase above is
> > better, no?
>
> I think I understood what you want to do from the previous email, so
> you didn't have to re-phrase it, it's pretty clear already. I just
> don't feel like having per-thread struct btf adds any complexity at
> all and gives more flexibility and more parallelism. The next most
> expensive thing after loading DWARF is string deduplication
> (btf__add_str()), so it would be good to do that at per-thread level
> as well as much as possible.

So you think a per-thread dedup at the end of each thread is good, ok,
no locking, good.

But what about that question I made:

> > I thought about this as a logical next step, I would love to have a
> > 'btf__merge_argv(struct btf *btf[]), is there one?

I haven't checked, is there alredy an libbpf BTF API that can merge an
array or pre-deduped BTF, deduping it one more time?

Anyway, so you suggest I start by having each dwarf_loader thread tied
to a separate btf_encoder (a shim layer on top of a 'struct btf' and
then at the end dedup each one, then combine the N 'struct btf' into
one, then dump it into an ELF or raw file?

- Arnaldo

> > > So each thread does as much as possible locally without any
> > > locks. And only at the very end we merge everything together and then
> > > dedup. Or we can even dedup inside each worker before merging final
> > > btf, that probably would give quite a lot of speed up and some memory
> > > saving. Would be interesting to experiment with that.
> > >
> > > So I like the idea of a fixed pool of threads (can be customized, and
> > > I'd default to num_workers == num_cpus), but I think we can and should
> > > keep them independent for as long as possible.
> >
> > Sure, this should map the whatever the user passes to -j in the kernel
> > make command line, if nothing is passed as an argument, then default to
> > getconf(_NPROCESSORS_ONLN).
> >
>
> Yep, cool. I've been told that `make -j` puts no upper limit on number
> of jobs, so we shouldn't follow make model completely :-P
>
> > There is a nice coincidence here where we probably don't care about -J
> > anymore and want to deal only with -j (detached btf) that is the same as
> > what 'make' expects to state how many "jobs" (thread pool size) the user
> > wants 8-)
> >
> > > Another disadvantage of generating small struct btf and then lock +
> > > merge is that we don't get as efficient string re-use, we'll churn
> > > more on string memory allocation. Keeping bigger local struct btfs
> > > allow for more efficient memory re-use (and probably a tiny bit of CPU
> > > savings).
> >
> > I think we're in the same page, the contention for adding the CU to a
> > single 'struct btf' (amongst all DWARF loading threads) after we just
> > produced it should be minimal, so we grab all the advantages: locality
> > of reference, minimal contention as DWARF reading/creating the pahole
> > internal, neutral, data structures should be higher than adding
> > types/functions/variables via the libbpf BTF API.
>
> I disagree, I think contention might be noticeable because merging
> BTFs is still a relatively expensive/slow operation. But feel free to
> start with that, I just thought that doing per-thread struct btf
> wouldn't add any complexity, which is why I mentioned that.
>
> >
> > I.e. we can leave paralellizing the BTF _encoding_ for later, what we're
> > trying to do now is to paralellize the DWARF _loading_, right?
>
> We are trying to speed up DWARF-to-BTF generation in general, not
> specifically DWARF loading. DWARF loading is an obvious most expensive
> part, string deduplication is the next one, if you look at profiling
> data. The third one will be btf__dedup, which is why I mentioned that
> it might be faster still to do pre-dedup in each thread at the very
> end, right before we do final dedup. Each individual dedup will
> probably significantly reduce total size of data/strings, so I have a
> feeling that it will result in a very nice speed-ups in the end.
>
> So just my 2 cents.
>
> >
> > > So please consider that, it also seems simpler overall.
> >
> > > > The ordering will be different than what we have now, as some smaller
> > > > CUs (object files with debug) will be processed faster so will get its
> > > > btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> > > > a difference, right?
> >
> > > Right, order doesn't matter.
> >
> > > > I think I'm done with refactoring the btf_encoder code, which should be
> > > > by now a thin layer on top of the excellent libbpf BTF API, just getting
> > > > what the previous loader (DWARF) produced and feeding libbpf.
> >
> > > Great.
> >
> > > > I thought about fancy thread pools, etc, researching some pre-existing
> > > > thing or doing some kthread + workqueue lifting from the kernel but will
> > > > instead start with the most spartan code, we can improve later.
> >
> > > Agree, simple is good. Really curious how much faster we can get. I
> > > think anything fancy will give a relatively small improvement. The
> > > biggest one will come from any parallelization.
> >
> > And I think that is possible, modulo elfutils libraries saying no, I
> > hope that will not be the case.
>
> You can't imagine how eagerly I'm awaiting this bright future of
> faster BTF generation step in the kernel build process. :)
>
> >
> > - Arnaldo

--

- Arnaldo

2021-06-15 21:29:24

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22

On Tue, Jun 15, 2021 at 1:26 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Tue, Jun 15, 2021 at 01:05:30PM -0700, Andrii Nakryiko escreveu:
> > On Tue, Jun 15, 2021 at 12:38 PM Arnaldo Carvalho de Melo
> > <[email protected]> wrote:
> > >
> > > Em Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko escreveu:
> > > > On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > >
> > > > > Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > > > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > > > > > I think it's very fragile and it will be easy to get
> > > > > > > broken/invalid/incomplete BTF. Yonghong already brought up the case
> > >
> > > > > > I thought about that as it would be almost like the compiler generating
> > > > > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > > > > probably it is best to go with the second approach I outlined and you
> > > > > > agreed to be less fragile, so I'll go with that, thanks for your
> > > > > > comments.
> > >
> > > > > So, just to write some notes here from what I saw so far:
> > >
> > > > > 1. In the LTO cases there are inter-CU references, so the current code
> > > > > combines all CUs into one and we end up not being able to parallelize
> > > > > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > > > > think the current algorithm is ideal, can be improved.
> > >
> > > > Yeah, let's worry about LTO later.
> > >
> > > > > 2. The case where there's no inter CU refs, which so far is the most
> > > > > common, seems easier, we create N threads, all sharing the dwarf_loader
> > > > > state and the btf_encoder, as-is now. we can process one CU per thread,
> > > > > and as soon as we finish it, just grab a lock and call
> > > > > btf_encoder__encode_cu() with the just produced CU data structures
> > > > > (tags, types, functions, variables, etc), consume them and delete the
> > > > > CU.
> > > > >
> > > > > So each thread will consume one CU, push it to the 'struct btf' class
> > > > > as-is now and then ask for the next CU, using the dwarf_loader state,
> > > > > still under that lock, then go back to processing dwarf tags, then
> > > > > lock, btf add types, rinse, repeat.
> > > >
> > > > Hmm... wouldn't keeping a "local" per-thread struct btf and just keep
> > > > appending to it for each processed CU until we run out of CUs be
> > > > simpler?
> > >
> > > I thought about this as a logical next step, I would love to have a
> > > 'btf__merge_argv(struct btf *btf[]), is there one?
> > >
> > > But from what I've read after this first paragraph of yours, lemme try
> > > to rephrase:
> > >
> > > 1. pahole calls btf_encoder__new(...)
> > >
> > > Creates a single struct btf.
> > >
> > > 2. dwarf_loader will create N threads, each will call a
> > > dwarf_get_next_cu() that is locked and will return a CU to process, when
> > > it finishes this CU, calls btf_encoder__encode_cu() under an all-threads
> > > lock. Rinse repeat.
> > >
> > > Until all the threads have consumed all CUs.
> > >
> > > then btf_encoder__encode(), which should be probably renamed to
> > > btf_econder__finish() will call btf__dedup(encoder->btf) and write ELF
> > > or raw file.
> > >
> > > My first reaction to your first paragraph was:
> > >
> > > Yeah, we can have multiple 'struct btf' instances, one per thread, that
> > > will each contain a subset of DWARF CU's encoded as BTF, and then I have
> > > to merge the per-thread BTF and then dedup. O think my rephrase above is
> > > better, no?
> >
> > I think I understood what you want to do from the previous email, so
> > you didn't have to re-phrase it, it's pretty clear already. I just
> > don't feel like having per-thread struct btf adds any complexity at
> > all and gives more flexibility and more parallelism. The next most
> > expensive thing after loading DWARF is string deduplication
> > (btf__add_str()), so it would be good to do that at per-thread level
> > as well as much as possible.
>
> So you think a per-thread dedup at the end of each thread is good, ok,
> no locking, good.
>
> But what about that question I made:
>
> > > I thought about this as a logical next step, I would love to have a
> > > 'btf__merge_argv(struct btf *btf[]), is there one?
>

Right, sorry, got too excited about parallelisation, forgot to reply to this.

I thought about this a bit in the context of BPF static linker work.
This is a bit problematic as a general API, because merging two BTFs
is not just appending types one after the other and calling it a day.
For DATASEC, for instance, you need to take few DATASEC with the same
name and combine them into a single DATASEC, otherwise resulting BTF
is non-sensical. While you are doing that, you need to re-adjust
variable offsets, take into account original data section alignment
requirements, etc. This operation can't be done safely in BTF only,
you need to know original ELF information (e.g., that ELF section
alignment). This is all done by static linker explicitly because only
static linker has enough information to do that. It goes even further,
extern VARs and FUNCs have to be resolved and de-duplicated (e.g.,
extern can be replaced with globals), etc. There is too much attached
semantics to some of BTF data.

So as a general API I don't see how it can be done nicely. Unless we
say that we'll error out if any VAR or DATASEC is found, or any extern
FUNC. Which sounds like a not-so-great idea right now.

But pahole has a bit simpler case, because BTF vars and DATASEC(s) are
generated at the very end, so it shouldn't be so complicated for
pahole. libbpf provides a generic btf__add_type() API that copies over
any type and associated strings (field names, func/struct names, etc).
That's quite a reduction in amount of code written. The only thing is
that after that IDs have to be updated and adjusted, because libbpf
doesn't have enough info to do this. So take a look at btf__add_type()
as a starting point.

Next, libbpf internally has btf_type_visit_type_ids() which will
provide a callback for each place in any btf_type that contains an ID.
This is the best way to adjust those IDs. We can probably expose those
APIs as public API because they are well-defined and have
straightforward semantics. So let me know.


> I haven't checked, is there alredy an libbpf BTF API that can merge an
> array or pre-deduped BTF, deduping it one more time?

btf__dedup() can be called multiple times on any struct btf, so once
you merge (see above), you can just dedup the merged btf to make it
small again.

>
> Anyway, so you suggest I start by having each dwarf_loader thread tied
> to a separate btf_encoder (a shim layer on top of a 'struct btf' and
> then at the end dedup each one, then combine the N 'struct btf' into
> one, then dump it into an ELF or raw file?

Yes, exactly.

>
> - Arnaldo
>
> > > > So each thread does as much as possible locally without any
> > > > locks. And only at the very end we merge everything together and then
> > > > dedup. Or we can even dedup inside each worker before merging final
> > > > btf, that probably would give quite a lot of speed up and some memory
> > > > saving. Would be interesting to experiment with that.
> > > >
> > > > So I like the idea of a fixed pool of threads (can be customized, and
> > > > I'd default to num_workers == num_cpus), but I think we can and should
> > > > keep them independent for as long as possible.
> > >
> > > Sure, this should map the whatever the user passes to -j in the kernel
> > > make command line, if nothing is passed as an argument, then default to
> > > getconf(_NPROCESSORS_ONLN).
> > >
> >
> > Yep, cool. I've been told that `make -j` puts no upper limit on number
> > of jobs, so we shouldn't follow make model completely :-P
> >
> > > There is a nice coincidence here where we probably don't care about -J
> > > anymore and want to deal only with -j (detached btf) that is the same as
> > > what 'make' expects to state how many "jobs" (thread pool size) the user
> > > wants 8-)
> > >
> > > > Another disadvantage of generating small struct btf and then lock +
> > > > merge is that we don't get as efficient string re-use, we'll churn
> > > > more on string memory allocation. Keeping bigger local struct btfs
> > > > allow for more efficient memory re-use (and probably a tiny bit of CPU
> > > > savings).
> > >
> > > I think we're in the same page, the contention for adding the CU to a
> > > single 'struct btf' (amongst all DWARF loading threads) after we just
> > > produced it should be minimal, so we grab all the advantages: locality
> > > of reference, minimal contention as DWARF reading/creating the pahole
> > > internal, neutral, data structures should be higher than adding
> > > types/functions/variables via the libbpf BTF API.
> >
> > I disagree, I think contention might be noticeable because merging
> > BTFs is still a relatively expensive/slow operation. But feel free to
> > start with that, I just thought that doing per-thread struct btf
> > wouldn't add any complexity, which is why I mentioned that.
> >
> > >
> > > I.e. we can leave paralellizing the BTF _encoding_ for later, what we're
> > > trying to do now is to paralellize the DWARF _loading_, right?
> >
> > We are trying to speed up DWARF-to-BTF generation in general, not
> > specifically DWARF loading. DWARF loading is an obvious most expensive
> > part, string deduplication is the next one, if you look at profiling
> > data. The third one will be btf__dedup, which is why I mentioned that
> > it might be faster still to do pre-dedup in each thread at the very
> > end, right before we do final dedup. Each individual dedup will
> > probably significantly reduce total size of data/strings, so I have a
> > feeling that it will result in a very nice speed-ups in the end.
> >
> > So just my 2 cents.
> >
> > >
> > > > So please consider that, it also seems simpler overall.
> > >
> > > > > The ordering will be different than what we have now, as some smaller
> > > > > CUs (object files with debug) will be processed faster so will get its
> > > > > btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> > > > > a difference, right?
> > >
> > > > Right, order doesn't matter.
> > >
> > > > > I think I'm done with refactoring the btf_encoder code, which should be
> > > > > by now a thin layer on top of the excellent libbpf BTF API, just getting
> > > > > what the previous loader (DWARF) produced and feeding libbpf.
> > >
> > > > Great.
> > >
> > > > > I thought about fancy thread pools, etc, researching some pre-existing
> > > > > thing or doing some kthread + workqueue lifting from the kernel but will
> > > > > instead start with the most spartan code, we can improve later.
> > >
> > > > Agree, simple is good. Really curious how much faster we can get. I
> > > > think anything fancy will give a relatively small improvement. The
> > > > biggest one will come from any parallelization.
> > >
> > > And I think that is possible, modulo elfutils libraries saying no, I
> > > hope that will not be the case.
> >
> > You can't imagine how eagerly I'm awaiting this bright future of
> > faster BTF generation step in the kernel build process. :)
> >
> > >
> > > - Arnaldo
>
> --
>
> - Arnaldo