We have a service that forks a child process in a namespace-based
sandbox where the mount namespace is intentionally designed to reflect
a totally empty filesystem. Our use case is very similar to Chrome's
sandbox, for example, but on a server. Within the sandbox, not even
the service's own binary is present in the mount namespace.
Process tree looks like this:
$ sudo pstree -psc 63989
edgeworker(63989)─┬─edgeworker/sbox(255716)─┬─edgeworker/zygt(255718)
                  │                         ├─{edgeworker/sbox}(255719)
                  │                         ├─{edgeworker/sbox}(255720)
                  │                         ├─{edgeworker/sbox}(255721)
                  ├─edgeworker/stry(5803)
                  ├─edgeworker/stry(63990)
                  ├─edgeworker/stry(106218)
                  ├─edgeworker/stry(191905)
                  ├─edgeworker/stry(255695)
                  ├─edgeworker/supr(255717)
Here the sbox processes do the actual work, living in an empty mount
namespace, and stry is a helper process for error reporting. All tasks
come from the same binary, which lives in the root mount namespace and
is launched by systemd.
When "perf script" is run on a trace obtained from the system, there
are three possible outcomes:
1. The first pid to be processed is a non-namespaced helper and
symbols are present.
2. The first pid is not found and symbols are present.
3. The first pid is a sandboxed task and symbols are missing.
Symbols are missing, because "perf script" tries to jump into an empty
sandbox and find a binary there, when in fact it lives outside:
getcwd("/state/home/ivan", 4096) = 17
open("/proc/self/ns/mnt", O_RDONLY) = 5
open("/proc/255719/ns/mnt", O_RDONLY) = 6
setns(6, CLONE_NEWNS) = 0
stat("/usr/local/bin/edgeworker", 0x7ffedb9b3ca0) = -1 ENOENT (No such file or directory)
In the second outcome we don't have a PID to figure out the namespace
to jump into, so this doesn't happen. It's a good fallback, but it was
a bit confusing during debugging.
It's not entirely clear to me why sometimes a helper PID is picked,
even though it's not the first sample in the recorded trace (at least
not in the output). This happens deterministically, or at least
appears so. In my process tree it's 255695.
I think perf should fall back to the default namespace to look up
symbols when they are not found inside the task's namespace, which
would cover our case. The relevant piece of logic is here:
* https://elixir.free-electrons.com/linux/v5.4.1/source/tools/perf/util/dso.c#L520
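A minimal sketch of the lookup that the strace output shows, with the proposed fallback added. This is not perf code: the function name find_binary_for_pid and its return convention are made up for illustration, and it only checks where the path exists rather than loading any symbols.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of perf. Reports where `path` is a
 * regular file:
 *    1  inside the mount namespace of `pid`
 *    0  only in the caller's own mount namespace (the fallback)
 *   -1  in neither
 */
int find_binary_for_pid(pid_t pid, const char *path)
{
    char ns_path[64];
    struct stat st;
    int found_inside = 0;
    int own_ns, target_ns;

    own_ns = open("/proc/self/ns/mnt", O_RDONLY);
    snprintf(ns_path, sizeof(ns_path), "/proc/%d/ns/mnt", (int)pid);
    target_ns = open(ns_path, O_RDONLY);

    /*
     * Joining a mount namespace needs CAP_SYS_ADMIN and CAP_SYS_CHROOT
     * and a single-threaded caller. If any step fails we simply skip
     * the in-namespace check and go straight to the fallback.
     */
    if (own_ns >= 0 && target_ns >= 0 &&
        setns(target_ns, CLONE_NEWNS) == 0) {
        found_inside = stat(path, &st) == 0 && S_ISREG(st.st_mode);
        /* Return to the original namespace before doing anything else. */
        if (setns(own_ns, CLONE_NEWNS) != 0)
            perror("setns back to own mount namespace");
    }
    if (target_ns >= 0)
        close(target_ns);
    if (own_ns >= 0)
        close(own_ns);

    if (found_inside)
        return 1;

    /* Fallback: the binary may only exist outside the sandbox. */
    if (stat(path, &st) == 0 && S_ISREG(st.st_mode)) {
        fprintf(stderr,
                "warning: using %s from outside the mount namespace of pid %d\n",
                path, (int)pid);
        return 0;
    }
    return -1;
}
```

Called as find_binary_for_pid(255719, "/usr/local/bin/edgeworker") with sufficient privileges, this would be expected to return 0 in the setup described above, since the binary exists only in the root mount namespace. Note that joining a mount namespace also resets the working directory, which is presumably why perf records it with getcwd() first; the sketch sidesteps that by assuming absolute paths.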
On Wed, Dec 04, 2019 at 07:46:10PM -0800, Ivan Babrou wrote:
> I think perf should try to fallback to the default namespace to look
> up symbols if they are not found inside to cover our case. Relevant
> piece of logic is here:
That should work for your use case, as you're sure that looking up by
pathname only will find, outside the namespace, the binary you want.
Even with pathname-based lookups being fragile, it works for your
use case, so please consider providing a patch for such a fallback,
together with a pr_debug() or even a pr_warning(), if this doesn't get
too noisy, to warn the user.
- Arnaldo
> * https://elixir.free-electrons.com/linux/v5.4.1/source/tools/perf/util/dso.c#L520
I'm not very good at this, but the following works for me. If you think
this is in the general vicinity of what you expected, I can email the
patch properly.
Initially I hoped that setting dso->nsinfo->need_setns to false in
dso_open would do the trick, but it did not work.
$ cat 0001-perf-fallback-to-opening-dso-from-outside-of-mount-n.patch | sed 's/\t/ /g'
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Ivan Babrou <[email protected]>
Date: Thu, 5 Dec 2019 16:27:48 -0800
Subject: [PATCH] perf: fallback to opening dso from outside of mount namespace
Some tasks enter mount namespace for isolation and this fallback
allows perf to read symbols from binaries that live outside of
mount namespace of the running task.
Signed-off-by: Ivan Babrou <[email protected]>
---
tools/perf/util/dso.c | 7 +++++++
tools/perf/util/symbol.c | 20 +++++++++++++++-----
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index e11ddf86f2b3..dac6bf42e43e 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -527,6 +527,13 @@ static int open_dso(struct dso *dso, struct machine *machine)
fd = __open_dso(dso, machine);
if (dso->binary_type != DSO_BINARY_TYPE__BUILD_ID_CACHE)
nsinfo__mountns_exit(&nsc);
+
+ if (fd < 0) {
+ fd = __open_dso(dso, machine);
+ if (fd >= 0) {
+ pr_warning("Using debug info for %s from outside of its active mount namespace.\n", dso->long_name);
+ }
+ }
if (fd >= 0) {
dso__list_add(dso);
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index a8f80e427674..e85d57dfcc14 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1679,11 +1679,21 @@ int dso__load(struct dso *dso, struct map *map)
* Read the build id if possible. This is required for
* DSO_BINARY_TYPE__BUILDID_DEBUGINFO to work
*/
- if (!dso->has_build_id &&
- is_regular_file(dso->long_name)) {
- __symbol__join_symfs(name, PATH_MAX, dso->long_name);
- if (filename__read_build_id(name, build_id, BUILD_ID_SIZE) > 0)
- dso__set_build_id(dso, build_id);
+ if (!dso->has_build_id) {
+ bool is_reg = is_regular_file(dso->long_name);
+ if (!is_reg) {
+ nsinfo__mountns_exit(&nsc);
+ is_reg = is_regular_file(dso->long_name);
+ if (!is_reg) {
+ nsinfo__mountns_enter(dso->nsinfo, &nsc);
+ }
+ }
+
+ if (is_reg) {
+ __symbol__join_symfs(name, PATH_MAX, dso->long_name);
+ if (filename__read_build_id(name, build_id, BUILD_ID_SIZE) > 0)
+ dso__set_build_id(dso, build_id);
+ }
}
/*
--
2.24.0
On Thu, Dec 5, 2019 at 4:33 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
> That should work for your use case, as you're sure that looking up by
> pathname only will find, outside the namespace, the binary you want.
>
> Even with pathname based looukups being fragile, it works for your
> usecase, so please consider providing a patch for such fallback,
> together with a pr_debug() or even pr_warning() if this don't get too
> noisy, to warn the user.
On Fri, Dec 6, 2019 at 2:17 AM Ivan Babrou <[email protected]> wrote:
>
> I'm not very good at this, but the following works for me. If you this
> is in general vicinity of what you expected, I can email patch
> properly.
>
Thanks for the patch, I can confirm it works. I had this problem today
when playing with gvisor. Gvisor starts up in a fresh mount namespace
and perf fails to read the symbols. Stracing perf shows:
11913 openat(AT_FDCWD, "/proc/9512/ns/mnt", O_RDONLY) = 197
11913 setns(197, CLONE_NEWNS) = 0
11913 stat("/home/marek/bin/runsc-debug", 0x7fffffff8480) = -1 ENOENT (No such file or directory)
11913 setns(196, CLONE_NEWNS) = 0
Which of course makes no sense - the runsc-debug binary does not exist in the
empty mount namespace of the restricted runsc process.
Marek
On Tue, Feb 04, 2020 at 03:09:48PM +0000, Marek Majkowski wrote:
> Thanks for the patch, I can confirm it works. I had this problem today
> when playing
> with gvisor. Gvisor is starting up in a fresh mount namespace and perf fails
> to read the symbols. Stracing perf shows:
>
> 11913 openat(AT_FDCWD, "/proc/9512/ns/mnt", O_RDONLY) = 197
> 11913 setns(197, CLONE_NEWNS) = 0
> 11913 stat("/home/marek/bin/runsc-debug", 0x7fffffff8480) = -1 ENOENT
> (No such file or directory)
> 11913 setns(196, CLONE_NEWNS) = 0
>
> Which of course makes no sense - the runsc-debug binary does not exist in the
> empty mount namespace of the restricted runsc process.
hi,
could you guys please share more details on what you run exactly,
and perhaps that change you mentioned?
thanks,
jirka
Jirka,
On Tue, Feb 4, 2020 at 7:27 PM Jiri Olsa <[email protected]> wrote:
> > 11913 openat(AT_FDCWD, "/proc/9512/ns/mnt", O_RDONLY) = 197
> > 11913 setns(197, CLONE_NEWNS) = 0
> > 11913 stat("/home/marek/bin/runsc-debug", 0x7fffffff8480) = -1 ENOENT
> > (No such file or directory)
> > 11913 setns(196, CLONE_NEWNS) = 0
>
> hi,
> could you guys please share more details on what you run exactly,
> and perhaps that change you mentioned?
I was debugging gvisor (runsc), which does execve(/proc/self/exe) and
then messes with its mount namespace. The effect is that the running
binary doesn't exist in the process's mount namespace. This confuses
perf, which fails to load symbols for that process.
To my understanding, by default, perf looks for the binary in the
process's mount namespace. In this case the binary clearly wasn't there.
Ivan wrote a rough patch [1], which I just confirmed works. After a
failure to load the binary from the pid's mount namespace, the patch
attempts to load it from the default mount namespace (the one running
perf).
[1] https://lkml.org/lkml/2019/12/5/878
Marek
On Tue, Feb 11, 2020 at 10:06:35AM +0000, Marek Majkowski wrote:
> To my understanding, by default, perf looks for the binary in the
> process mount namespace. In this case clearly the binary wasn't there.
> Ivan wrote a rough patch [1], which I just confirmed works. The patch
> attempts, after a failure to load binary from pids mount namespace, to
> load binary from the default mount namespace (the one running perf).
>
> [1] https://lkml.org/lkml/2019/12/5/878
That is a fallback that works in this specific case and, with a warning
or some explicitly specified option, makes perf work for this specific
use case. But please, either a warning ("fallback to root namespace
binary /foo/bar") or the explicit option. Is that what that patch does?
- Arnaldo
On Tue, Feb 11, 2020 at 1:46 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> That is a fallback that works in this specific case, and, with a warning
> or some explicitely specified option makes perf work with this specific
> usecase, but either a warning ("fallback to root namespace binary
> /foo/bar") or the explicit option, please, is that what that patch does?
You got it right: it is a custom patch that does something custom (look
up in the top mount ns), and only on failure. I'm not sure how to make
it more generic.
Furthermore, there is one more use case this patch doesn't support:
the binary is reachable in some mount namespace, but not under a
sensible path. This can happen when we launch a command under gvisor.
The gvisor sandbox runs in an empty mount namespace, and the binary is
delivered over 9p from the gvisor-gofer process, from a potentially
arbitrary path. In that scenario we have three mount namespaces: the
empty one running the process, another one with access to the binary,
and the host one.
I have two ideas how to solve the symbol discovery here:
(a) give perf an explicit link (potentially including mount namespace
pid) to the binary
(b) supply perf with /tmp/perf-<pid>.map file with symbols, extracted
via some external helper.
I tried (b) but failed; I'm not sure how to produce perf-<pid>.map from
a proper binary using basic tools like readelf.
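On option (b): the map file format documented for perf's JIT interface is one "START SIZE symbolname" line per symbol, with both numbers in hex and no 0x prefix. A rough C sketch of an external helper that emits such lines from a binary's .symtab follows. The names emit_map_line and write_perf_map and the load_bias parameter are assumptions of the sketch, only 64-bit ELF files in host byte order are handled, and it does not establish whether perf consults the map file for a file-backed mapping.

```c
#include <elf.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Emit one map line: "START SIZE symbolname", hex without 0x. */
int emit_map_line(FILE *out, uint64_t start, uint64_t size, const char *name)
{
    return fprintf(out, "%" PRIx64 " %" PRIx64 " %s\n", start, size, name);
}

/* Read `len` bytes at file offset `off` into a fresh buffer (caller frees). */
static void *read_at(FILE *f, uint64_t off, size_t len)
{
    void *p = malloc(len ? len : 1);

    if (p && (fseek(f, (long)off, SEEK_SET) != 0 ||
              fread(p, 1, len, f) != len)) {
        free(p);
        p = NULL;
    }
    return p;
}

/*
 * Write a map line for every function symbol in .symtab of a 64-bit ELF
 * file. `load_bias` (hypothetical parameter) is the runtime address minus
 * the link-time address: 0 for non-PIE executables, the mapping base for
 * PIE ones. Returns the number of lines written, or -1 on error.
 */
int write_perf_map(const char *elf_path, uint64_t load_bias, FILE *out)
{
    FILE *f = fopen(elf_path, "rb");
    Elf64_Ehdr eh;
    Elf64_Shdr *sh = NULL;
    int count = -1;

    if (!f)
        return -1;
    if (fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_shentsize != sizeof(Elf64_Shdr))
        goto out;
    sh = read_at(f, eh.e_shoff, (size_t)eh.e_shnum * sizeof(*sh));
    if (!sh)
        goto out;

    count = 0;
    for (unsigned i = 0; i < eh.e_shnum; i++) {
        /* A stripped binary has no SHT_SYMTAB; SHT_DYNSYM is not tried. */
        if (sh[i].sh_type != SHT_SYMTAB || sh[i].sh_link >= eh.e_shnum)
            continue;
        Elf64_Shdr *str_sh = &sh[sh[i].sh_link];
        Elf64_Sym *syms = read_at(f, sh[i].sh_offset, sh[i].sh_size);
        char *strs = read_at(f, str_sh->sh_offset, str_sh->sh_size);
        size_t n = sh[i].sh_size / sizeof(Elf64_Sym);

        if (strs && str_sh->sh_size)
            strs[str_sh->sh_size - 1] = '\0'; /* guard against bad tables */
        for (size_t j = 0; syms && strs && j < n; j++) {
            if (ELF64_ST_TYPE(syms[j].st_info) != STT_FUNC ||
                syms[j].st_value == 0 ||
                syms[j].st_name >= str_sh->sh_size)
                continue;
            emit_map_line(out, syms[j].st_value + load_bias,
                          syms[j].st_size, strs + syms[j].st_name);
            count++;
        }
        free(syms);
        free(strs);
    }
out:
    free(sh);
    fclose(f);
    return count;
}
```

The output would be redirected to /tmp/perf-<pid>.map for the pid in question. The load bias has to come from somewhere else, for example the start address of the executable mapping in /proc/<pid>/maps.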
On Tue, Feb 11, 2020 at 01:54:33PM +0000, Marek Majkowski wrote:
> You got it right, custom patch, to do something custom (look up in top
> mount ns) yet on failure. I'm not sure how to make it more generic.
We have buildids in binaries:
[acme@quaco ~]$ file /bin/bash
/bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=0cb50a07a621d02a0d2c7efec6743fddec845bfb, stripped
[acme@quaco ~]$ file /lib64/libc-2.29.so
/lib64/libc-2.29.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7ddecbbf9f22ec76c9e4a256fd1c06004a1907ce, for GNU/Linux 3.2.0, not stripped, too many notes (256)
[acme@quaco ~]$
We need to get this somehow from a given executable map, this comes and
goes in situations like this :-\
I.e. this info is in an ELF section:
[acme@quaco ~]$ readelf -SW /bin/bash | grep build-id
[ 4] .note.gnu.build-id NOTE 0000000000000340 000340 000024 00 A 0 0 4
[acme@quaco ~]$
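For illustration, the note that readelf lists above can be read directly with a small C helper. This is a sketch, not what perf does internally: parse_build_id and file_build_id are hypothetical names, only 64-bit ELF files in host byte order are handled, and notes are assumed to use 4-byte padding.

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Round a note field size up to the 4-byte padding used by GNU notes. */
static size_t note_align(size_t n) { return (n + 3) & ~(size_t)3; }

/*
 * Walk a buffer of ELF notes and copy the NT_GNU_BUILD_ID descriptor
 * into `out`. Returns the build-id length in bytes, or -1 if not found.
 */
int parse_build_id(const unsigned char *buf, size_t len,
                   unsigned char *out, size_t out_cap)
{
    size_t off = 0;

    while (len - off >= sizeof(Elf64_Nhdr)) {
        Elf64_Nhdr nh;

        memcpy(&nh, buf + off, sizeof(nh));
        off += sizeof(nh);
        size_t name_sz = note_align(nh.n_namesz);
        size_t desc_sz = note_align(nh.n_descsz);
        if (name_sz > len - off || desc_sz > len - off - name_sz)
            return -1; /* truncated or malformed note */
        const unsigned char *name = buf + off;
        const unsigned char *desc = name + name_sz;
        if (nh.n_type == NT_GNU_BUILD_ID && nh.n_namesz == 4 &&
            memcmp(name, "GNU", 4) == 0 && nh.n_descsz <= out_cap) {
            memcpy(out, desc, nh.n_descsz);
            return (int)nh.n_descsz;
        }
        off += name_sz + desc_sz;
    }
    return -1;
}

/* Scan the PT_NOTE segments of a 64-bit ELF file for a build-id. */
int file_build_id(const char *path, unsigned char *out, size_t out_cap)
{
    FILE *f = fopen(path, "rb");
    Elf64_Ehdr eh;
    int ret = -1;

    if (!f)
        return -1;
    if (fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto out;

    for (unsigned i = 0; i < eh.e_phnum && ret < 0; i++) {
        Elf64_Phdr ph;
        uint64_t ph_off = eh.e_phoff + (uint64_t)i * eh.e_phentsize;

        if (fseek(f, (long)ph_off, SEEK_SET) != 0 ||
            fread(&ph, sizeof(ph), 1, f) != 1)
            break;
        /* Skip non-note segments and implausibly large ones. */
        if (ph.p_type != PT_NOTE || ph.p_filesz == 0 ||
            ph.p_filesz > (1u << 20))
            continue;
        unsigned char *notes = malloc(ph.p_filesz);
        if (!notes)
            break;
        if (fseek(f, (long)ph.p_offset, SEEK_SET) == 0 &&
            fread(notes, 1, ph.p_filesz, f) == ph.p_filesz)
            ret = parse_build_id(notes, ph.p_filesz, out, out_cap);
        free(notes);
    }
out:
    fclose(f);
    return ret;
}
```

This has the same limitation as the optimistic lookup discussed in this thread: it only helps when the file is readable at the given path at the time it is called.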
Somebody needs to associate that with that executable mmap at load time,
so that perf gets it via PERF_RECORD_MMAP3 instead of having to try,
optimistically, to get it from the binary (that may not be there when we
try to read it, or maybe in some place like you describe in this
message, or...) when generating its build-id perf.data header section:
[acme@seventh ~]$ perf record stress-ng --cpu 1 --timeout 1s
stress-ng: info: [17622] dispatching hogs: 1 cpu
stress-ng: info: [17622] successful run completed in 1.02s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.159 MB perf.data (4105 samples) ]
[acme@seventh ~]$ perf buildid-list
e9e69be73f7c5a4cee110ced52409371e95fe2a8 [kernel.kallsyms]
7133e5dbdfae821a9bbe4ba5467e49f6cf166e1d /usr/bin/stress-ng
bd5e36f101b175755c7943105390078dff596657 /usr/lib64/ld-2.29.so
1e292b30223c69eff986710c62eda48c561d43ca [vdso]
b8d7438178da8f84d89869addf6b5e36d356c555 /usr/lib64/libm-2.29.so
7ddecbbf9f22ec76c9e4a256fd1c06004a1907ce /usr/lib64/libc-2.29.so
[acme@seventh ~]$ file /usr/bin/stress-ng
/usr/bin/stress-ng: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=7133e5dbdfae821a9bbe4ba5467e49f6cf166e1d, stripped, too many notes (256)
[acme@seventh ~]$
> Furthermore, there is one more use case this patch doesn't support:
> namely a situation when the binary is reachable in some mount
> namespace, but not under sensible path. This can happen when we launch
> a command under gvisor. Gvisor-sandbox runs under empty mount
> namespace, the binary is delivered over 9p from gvisor-gofer process,
> from potentially arbitrary path. In that scenario we have three mount
> namespaces: the empty one running process, another one with access to
> the binary, and host one.
> I have two ideas how to solve the symbol discovery here:
> (a) give perf an explicit link (potentially including mount namespace
> pid) to the binary
> (b) supply perf with /tmp/perf-<pid>.map file with symbols, extracted
> via some external helper.
>
> I tried (b) but failed, I'm not sure how to produce perf-pid.map from
> a proper binary, using basic tools like readelf.
Have you looked at:
[acme@quaco ~]$ perf report -h symfs
Usage: perf report [<options>]
--symfs <directory>
Look for files with symbols relative to this directory
[acme@quaco ~]$
?
- Arnaldo