2022-01-24 23:29:36

by Hao Luo

Subject: [Question] How to reliably get BuildIDs from bpf prog

Dear BPF experts,

I'm working on collecting some kernel performance data using a BPF
tracing program. Our performance profiling team wants to associate the
data with user stack information. One of the requirements is to
reliably get BuildIDs from bpf_get_stackid() and other similar helpers
[1].

As part of an early investigation, we found a couple of issues that
make bpf_get_stackid() much less reliable than we'd like for our use:

1. The first page of many binaries (which contains the ELF headers and
thus the BuildID that we need) is often not in memory. The failure rate
of find_get_page() (called from build_id_parse()) is higher than we
would want.

2. When anonymous huge pages are used to hold some regions of process
text, build_id_parse() also fails to get a BuildID because
vma->vm_file is NULL.

These two issues are critical blockers for us to use BPF in
production. Can we do better? What do other users do to reliably get
build ids?

Thanks very much,
Hao

[1] https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
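
To make issue 1 concrete, here is a minimal user-space sketch of the
kind of lookup that depends on the first page: find the PT_NOTE program
header and the GNU build ID note using only the first page of the file.
It is loosely modeled on the behavior described above, not a copy of
build_id_parse(). It assumes a 4 KiB page and an ELF64 little-endian
binary, and the function names are hypothetical.

```python
import struct

PAGE_SIZE = 4096        # assumption: 4 KiB pages
PT_NOTE = 4             # program header type for note segments
NT_GNU_BUILD_ID = 3     # note type used for the GNU build ID

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr
NHDR = struct.Struct("<III")               # note header: namesz, descsz, type


def _align4(n):
    """Round n up to the 4-byte alignment used inside note segments."""
    return (n + 3) & ~3


def _scan_notes(page, start, size):
    """Walk notes in [start, start + size), clipped to the page."""
    end = min(start + size, len(page))
    pos = start
    while pos + NHDR.size <= end:
        namesz, descsz, ntype = NHDR.unpack_from(page, pos)
        name_off = pos + NHDR.size
        desc_off = name_off + _align4(namesz)
        if desc_off + descsz > end:
            return None  # note body is not fully inside the first page
        name = page[name_off:name_off + namesz]
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0" and descsz > 0:
            return page[desc_off:desc_off + descsz]
        pos = desc_off + _align4(descsz)
    return None


def build_id_from_first_page(page):
    """Return the build ID bytes found in the first page, or None."""
    page = page[:PAGE_SIZE]
    if len(page) < EHDR.size or page[:4] != b"\x7fELF":
        return None
    if page[4] != 2 or page[5] != 1:  # only ELFCLASS64 + ELFDATA2LSB here
        return None
    fields = EHDR.unpack_from(page, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        if off + PHDR.size > len(page):
            return None  # program headers spill past the first page
        p_type, _, p_offset, _, _, p_filesz, _, _ = PHDR.unpack_from(page, off)
        if p_type == PT_NOTE:
            found = _scan_notes(page, p_offset, p_filesz)
            if found is not None:
                return found
    return None
```

For example, `build_id_from_first_page(open("/proc/self/exe", "rb").read(4096))`
returns the build ID bytes when the running binary carries one in its
first page, and None otherwise. In user space the read simply faults the
page in; the problem described above is that the in-kernel lookup cannot
do that when the page is not resident.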


2022-01-25 09:12:08

by Song Liu

Subject: Re: [Question] How to reliably get BuildIDs from bpf prog

On Mon, Jan 24, 2022 at 2:43 PM Hao Luo <[email protected]> wrote:
>
> Dear BPF experts,
>
> I'm working on collecting some kernel performance data using BPF
> tracing prog. Our performance profiling team wants to associate the
> data with user stack information. One of the requirements is to
> reliably get BuildIDs from bpf_get_stackid() and other similar helpers
> [1].
>
> As part of an early investigation, we found that there are a couple
> issues that make bpf_get_stackid() much less reliable than we'd like
> for our use:
>
> 1. The first page of many binaries (which contains the ELF headers and
> thus the BuildID that we need) is often not in memory. The failure of
> find_get_page() (called from build_id_parse()) is higher than we would
> want.

In our top use case, bpf_get_stack() is called from NMI context, so
there isn't much we can do. Maybe it is possible to improve it by
changing the layout of the binary and the libraries? Specifically, if
the text is also in the first page, it is likely to stay in memory?

> 2. When anonymous huge pages are used to hold some regions of process
> text, build_id_parse() also fails to get a BuildID because
> vma->vm_file is NULL.

How did the text get into anonymous memory? I guess it is NOT from JIT?
We had a hack to use transparent huge pages for application text. The
hack looks like:

"At run time, the application creates an 8MB temporary buffer and the
hot section of the executable memory is copied to it. The 8MB region in
the executable memory is then converted to a huge page (by way of an
mmap() to anonymous pages and an madvise() to create a huge page), the
data is copied back to it, and it is made executable again using
mprotect()."

If your case is the same (or similar), it can probably be fixed with
CONFIG_READ_ONLY_THP_FOR_FS, and modified user space.

Thanks,
Song

2022-01-26 13:09:04

by Hao Luo

Subject: Re: [Question] How to reliably get BuildIDs from bpf prog

Thanks Song for your suggestion.

On Mon, Jan 24, 2022 at 11:08 PM Song Liu <[email protected]> wrote:
>
> On Mon, Jan 24, 2022 at 2:43 PM Hao Luo <[email protected]> wrote:
> >
> > Dear BPF experts,
> >
> > I'm working on collecting some kernel performance data using BPF
> > tracing prog. Our performance profiling team wants to associate the
> > data with user stack information. One of the requirements is to
> > reliably get BuildIDs from bpf_get_stackid() and other similar helpers
> > [1].
> >
> > As part of an early investigation, we found that there are a couple
> > issues that make bpf_get_stackid() much less reliable than we'd like
> > for our use:
> >
> > 1. The first page of many binaries (which contains the ELF headers and
> > thus the BuildID that we need) is often not in memory. The failure of
> > find_get_page() (called from build_id_parse()) is higher than we would
> > want.
>
> Our top use case of bpf_get_stack() is called from NMI, so there isn't
> much we can do. Maybe it is possible to improve it by changing the
> layout of the binary and the libraries? Specifically, if the text is
> also in the first page, it is likely to stay in memory?
>

We are seeing 30-40% of stack frames fail to get build IDs because of
this, so this is a place where the reliability of build IDs could be
improved.
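
One way a rate like this could be measured from user space is to decode
the per-frame records returned when build IDs are requested and count
the frames that fell back to a raw instruction pointer. The sketch below
assumes the record layout of struct bpf_stack_build_id and the status
values from the kernel UAPI header linux/bpf.h. The helper names are
hypothetical, and this is not necessarily how the figure above was
obtained.

```python
import struct

# Status values from enum bpf_stack_build_id_status in linux/bpf.h.
BPF_STACK_BUILD_ID_EMPTY = 0   # unused slot
BPF_STACK_BUILD_ID_VALID = 1   # build_id + file offset are filled in
BPF_STACK_BUILD_ID_IP = 2      # lookup failed, only the raw ip is stored

# struct bpf_stack_build_id: __s32 status; unsigned char build_id[20];
# union { __u64 offset; __u64 ip; };  -> 32 bytes, no internal padding.
RECORD = struct.Struct("=i20sQ")


def decode_frames(buf):
    """Decode a buffer of records; len(buf) must be a multiple of 32."""
    frames = []
    for status, build_id, value in RECORD.iter_unpack(buf):
        if status == BPF_STACK_BUILD_ID_EMPTY:
            break  # assumption: empty slots only appear as the unused tail
        frames.append((status, build_id, value))
    return frames


def fallback_rate(frames):
    """Fraction of frames that carry only an ip instead of a build ID."""
    if not frames:
        return 0.0
    missing = sum(1 for status, _, _ in frames
                  if status == BPF_STACK_BUILD_ID_IP)
    return missing / len(frames)
```

Summing the fallback count over all sampled stacks, rather than per
stack, gives a per-frame rate comparable to the percentage quoted above.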

A few proposals came up when we found this issue. One of them is to
have userspace mlock the first page. This would be the easiest fix, if
it works. Another proposal from Ian Rogers (cc'ed) is to embed the
build ID in the vma. This is an idea similar to [1], but it's unclear
(at least to me) where to store the string. I'm also wondering whether
we could introduce a sleepable version of bpf_get_stack(), if that
helps. When a page is not present, a sleepable bpf_get_stack() could
bring in the page.

[1] https://lwn.net/Articles/867818/

> > 2. When anonymous huge pages are used to hold some regions of process
> > text, build_id_parse() also fails to get a BuildID because
> > vma->vm_file is NULL.
>
> How did the text get in anonymous memory? I guess it is NOT from JIT?
> We had a hack to use transparent huge page for application text. The
> hack looks like:
>
> "At run time, the application creates an 8MB temporary buffer and the
> hot section of the executable memory is copied to it. The 8MB region in
> the executable memory is then converted to a huge page (by way of an
> mmap() to anonymous pages and an madvise() to create a huge page), the
> data is copied back to it, and it is made executable again using
> mprotect()."
>
> If your case is the same (or similar), it can probably be fixed with
> CONFIG_READ_ONLY_THP_FOR_FS, and modified user space.
>

In our use cases, we have text mapped to huge pages that are not
backed by files. vma->vm_file could be NULL or point to some fake file.
This makes it challenging for us to get build IDs for this code text.

> Thanks,
> Song

2022-01-26 13:41:24

by Song Liu

Subject: Re: [Question] How to reliably get BuildIDs from bpf prog

On Tue, Jan 25, 2022 at 3:54 PM Hao Luo <[email protected]> wrote:
>
> Thanks Song for your suggestion.
>
> On Mon, Jan 24, 2022 at 11:08 PM Song Liu <[email protected]> wrote:
> >
> > On Mon, Jan 24, 2022 at 2:43 PM Hao Luo <[email protected]> wrote:
> > >
> > > Dear BPF experts,
> > >
> > > I'm working on collecting some kernel performance data using BPF
> > > tracing prog. Our performance profiling team wants to associate the
> > > data with user stack information. One of the requirements is to
> > > reliably get BuildIDs from bpf_get_stackid() and other similar helpers
> > > [1].
> > >
> > > As part of an early investigation, we found that there are a couple
> > > issues that make bpf_get_stackid() much less reliable than we'd like
> > > for our use:
> > >
> > > 1. The first page of many binaries (which contains the ELF headers and
> > > thus the BuildID that we need) is often not in memory. The failure of
> > > find_get_page() (called from build_id_parse()) is higher than we would
> > > want.
> >
> > Our top use case of bpf_get_stack() is called from NMI, so there isn't
> > much we can do. Maybe it is possible to improve it by changing the
> > layout of the binary and the libraries? Specifically, if the text is
> > also in the first page, it is likely to stay in memory?
> >
>
> We are seeing 30-40% of stack frames not able to get build ids due to
> this. This is a place where we could improve the reliability of build
> id.
>
> There were a few proposals coming up when we found this issue. One of
> them is to have userspace mlock the first page. This would be the
> easiest fix, if it works. Another proposal from Ian Rogers (cc'ed) is
> to embed build id in vma. This is an idea similar to [1], but it's
> unclear (at least to me) where to store the string. I'm wondering if
> we can introduce a sleepable version of bpf_get_stack() if it helps.
> When a page is not present, sleepable bpf_get_stack() can bring in the
> page.

I guess it is possible to have different flavors of bpf_get_stack().
However, I am not sure whether the actual use case could use sleepable
BPF programs. Our user of bpf_get_stack() is a profiler. The BPF program
is triggered by a perf_event from NMI, where we really cannot sleep.

If we have a target use case that could sleep, a sleepable
bpf_get_stack() sounds reasonable to me.

>
> [1] https://lwn.net/Articles/867818/
>
> > > 2. When anonymous huge pages are used to hold some regions of process
> > > text, build_id_parse() also fails to get a BuildID because
> > > vma->vm_file is NULL.
> >
> > How did the text get in anonymous memory? I guess it is NOT from JIT?
> > We had a hack to use transparent huge page for application text. The
> > hack looks like:
> >
> > "At run time, the application creates an 8MB temporary buffer and the
> > hot section of the executable memory is copied to it. The 8MB region in
> > the executable memory is then converted to a huge page (by way of an
> > mmap() to anonymous pages and an madvise() to create a huge page), the
> > data is copied back to it, and it is made executable again using
> > mprotect()."
> >
> > If your case is the same (or similar), it can probably be fixed with
> > CONFIG_READ_ONLY_THP_FOR_FS, and modified user space.
> >
>
> In our use cases, we have text mapped to huge pages that are not
> backed by files. vma->vm_file could be null or points some fake file.
> This causes challenges for us on getting build id for these code text.

So, what is the ideal output in these cases? If there isn't a backing
file, we don't really have a good build ID for it, right?

Thanks,
Song

2022-02-07 05:58:44

by Hao Luo

Subject: Re: [Question] How to reliably get BuildIDs from bpf prog

On Tue, Jan 25, 2022 at 4:16 PM Song Liu <[email protected]> wrote:
>
> On Tue, Jan 25, 2022 at 3:54 PM Hao Luo <[email protected]> wrote:
> >
> > Thanks Song for your suggestion.
> >
> > On Mon, Jan 24, 2022 at 11:08 PM Song Liu <[email protected]> wrote:
> > >
> > > On Mon, Jan 24, 2022 at 2:43 PM Hao Luo <[email protected]> wrote:
> > > >
> > > > Dear BPF experts,
> > > >
> > > > I'm working on collecting some kernel performance data using BPF
> > > > tracing prog. Our performance profiling team wants to associate the
> > > > data with user stack information. One of the requirements is to
> > > > reliably get BuildIDs from bpf_get_stackid() and other similar helpers
> > > > [1].
> > > >
> > > > As part of an early investigation, we found that there are a couple
> > > > issues that make bpf_get_stackid() much less reliable than we'd like
> > > > for our use:
> > > >
> > > > 1. The first page of many binaries (which contains the ELF headers and
> > > > thus the BuildID that we need) is often not in memory. The failure of
> > > > find_get_page() (called from build_id_parse()) is higher than we would
> > > > want.
> > >
> > > Our top use case of bpf_get_stack() is called from NMI, so there isn't
> > > much we can do. Maybe it is possible to improve it by changing the
> > > layout of the binary and the libraries? Specifically, if the text is
> > > also in the first page, it is likely to stay in memory?
> > >
> >
> > We are seeing 30-40% of stack frames not able to get build ids due to
> > this. This is a place where we could improve the reliability of build
> > id.
> >
> > There were a few proposals coming up when we found this issue. One of
> > them is to have userspace mlock the first page. This would be the
> > easiest fix, if it works. Another proposal from Ian Rogers (cc'ed) is
> > to embed build id in vma. This is an idea similar to [1], but it's
> > unclear (at least to me) where to store the string. I'm wondering if
> > we can introduce a sleepable version of bpf_get_stack() if it helps.
> > When a page is not present, sleepable bpf_get_stack() can bring in the
> > page.
>
> I guess it is possible to have different flavors of bpf_get_stack().
> However, I am not sure whether the actual use case could use sleepable
> BPF programs. Our user of bpf_get_stack() is a profiler. The BPF program
> which triggers a perf_event from NMI, where we really cannot sleep.
>
> If we have target use case that could sleep, sleepable bpf_get_stack() sounds
> reasonable to me.
>
> >
> > [1] https://lwn.net/Articles/867818/
> >
> > > > 2. When anonymous huge pages are used to hold some regions of process
> > > > text, build_id_parse() also fails to get a BuildID because
> > > > vma->vm_file is NULL.
> > >
> > > How did the text get in anonymous memory? I guess it is NOT from JIT?
> > > We had a hack to use transparent huge page for application text. The
> > > hack looks like:
> > >
> > > "At run time, the application creates an 8MB temporary buffer and the
> > > hot section of the executable memory is copied to it. The 8MB region in
> > > the executable memory is then converted to a huge page (by way of an
> > > mmap() to anonymous pages and an madvise() to create a huge page), the
> > > data is copied back to it, and it is made executable again using
> > > mprotect()."
> > >
> > > If your case is the same (or similar), it can probably be fixed with
> > > CONFIG_READ_ONLY_THP_FOR_FS, and modified user space.
> > >
> >
> > In our use cases, we have text mapped to huge pages that are not
> > backed by files. vma->vm_file could be null or points some fake file.
> > This causes challenges for us on getting build id for these code text.
>
> So, what is the ideal output in these cases? If there isn't a back file, we
> don't really have good build-id for it, right?
>

Right, I don't have a solution for this case unfortunately. Probably
will just discard the failed frames. :(
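
As a rough illustration of what discarding the failed frames could look
like on the consumer side, the sketch below aggregates samples by
(build ID, file offset) and only counts what it drops. It assumes frames
are already decoded into (status, build_id, value) tuples and that a
status of 1 means a valid build ID, as in enum bpf_stack_build_id_status
in linux/bpf.h. The function name is hypothetical.

```python
from collections import Counter

BPF_STACK_BUILD_ID_VALID = 1  # value taken from linux/bpf.h


def aggregate_valid_frames(stacks):
    """Count samples per (build ID hex, offset); return (counts, dropped)."""
    counts = Counter()
    dropped = 0
    for frames in stacks:
        for status, build_id, value in frames:
            if status == BPF_STACK_BUILD_ID_VALID:
                counts[(build_id.hex(), value)] += 1
            else:
                dropped += 1  # no build ID: the frame cannot be attributed
    return counts, dropped
```

Keeping the dropped count alongside the aggregate makes the size of the
blind spot visible instead of silently shrinking the profile.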

But in the case where the problem is that the page is not in memory,
Song, do you also see a similarly high rate of build ID parsing
failures in your use case (30-40% of frames)? If not, we may have done
something wrong on our side. If so, is that a problem for your use case?

> Thanks,
> Song

2022-02-07 14:17:43

by Song Liu

Subject: Re: [Question] How to reliably get BuildIDs from bpf prog

On Fri, Feb 4, 2022 at 11:29 AM Hao Luo <[email protected]> wrote:
>
> On Tue, Jan 25, 2022 at 4:16 PM Song Liu <[email protected]> wrote:
> >
> > On Tue, Jan 25, 2022 at 3:54 PM Hao Luo <[email protected]> wrote:
> > >
> > > Thanks Song for your suggestion.
> > >
> > > On Mon, Jan 24, 2022 at 11:08 PM Song Liu <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 24, 2022 at 2:43 PM Hao Luo <[email protected]> wrote:
> > > > >
> > > > > Dear BPF experts,
> > > > >
> > > > > I'm working on collecting some kernel performance data using BPF
> > > > > tracing prog. Our performance profiling team wants to associate the
> > > > > data with user stack information. One of the requirements is to
> > > > > reliably get BuildIDs from bpf_get_stackid() and other similar helpers
> > > > > [1].
> > > > >
> > > > > As part of an early investigation, we found that there are a couple
> > > > > issues that make bpf_get_stackid() much less reliable than we'd like
> > > > > for our use:
> > > > >
> > > > > 1. The first page of many binaries (which contains the ELF headers and
> > > > > thus the BuildID that we need) is often not in memory. The failure of
> > > > > find_get_page() (called from build_id_parse()) is higher than we would
> > > > > want.
> > > >
> > > > Our top use case of bpf_get_stack() is called from NMI, so there isn't
> > > > much we can do. Maybe it is possible to improve it by changing the
> > > > layout of the binary and the libraries? Specifically, if the text is
> > > > also in the first page, it is likely to stay in memory?
> > > >
> > >
> > > We are seeing 30-40% of stack frames not able to get build ids due to
> > > this. This is a place where we could improve the reliability of build
> > > id.
> > >
> > > There were a few proposals coming up when we found this issue. One of
> > > them is to have userspace mlock the first page. This would be the
> > > easiest fix, if it works. Another proposal from Ian Rogers (cc'ed) is
> > > to embed build id in vma. This is an idea similar to [1], but it's
> > > unclear (at least to me) where to store the string. I'm wondering if
> > > we can introduce a sleepable version of bpf_get_stack() if it helps.
> > > When a page is not present, sleepable bpf_get_stack() can bring in the
> > > page.
> >
> > I guess it is possible to have different flavors of bpf_get_stack().
> > However, I am not sure whether the actual use case could use sleepable
> > BPF programs. Our user of bpf_get_stack() is a profiler. The BPF program
> > which triggers a perf_event from NMI, where we really cannot sleep.
> >
> > If we have target use case that could sleep, sleepable bpf_get_stack() sounds
> > reasonable to me.
> >
> > >
> > > [1] https://lwn.net/Articles/867818/
> > >
> > > > > 2. When anonymous huge pages are used to hold some regions of process
> > > > > text, build_id_parse() also fails to get a BuildID because
> > > > > vma->vm_file is NULL.
> > > >
> > > > How did the text get in anonymous memory? I guess it is NOT from JIT?
> > > > We had a hack to use transparent huge page for application text. The
> > > > hack looks like:
> > > >
> > > > "At run time, the application creates an 8MB temporary buffer and the
> > > > hot section of the executable memory is copied to it. The 8MB region in
> > > > the executable memory is then converted to a huge page (by way of an
> > > > mmap() to anonymous pages and an madvise() to create a huge page), the
> > > > data is copied back to it, and it is made executable again using
> > > > mprotect()."
> > > >
> > > > If your case is the same (or similar), it can probably be fixed with
> > > > CONFIG_READ_ONLY_THP_FOR_FS, and modified user space.
> > > >
> > >
> > > In our use cases, we have text mapped to huge pages that are not
> > > backed by files. vma->vm_file could be null or points some fake file.
> > > This causes challenges for us on getting build id for these code text.
> >
> > So, what is the ideal output in these cases? If there isn't a back file, we
> > don't really have good build-id for it, right?
> >
>
> Right, I don't have a solution for this case unfortunately. Probably
> will just discard the failed frames. :(
>
> But in the case where the problem is the page not in mem, Song, do you
> also see a similar high rate of build id parsing failure in your use
> case (30 ~ 40% of frames)? If no, we may have done something wrong on
> our side. If yes, is that a problem for your use case?

The latest data I found (which is not too recent) shows about 3% missing
symbols. I think there must be something different here.

Thanks,
Song