2022-05-09 11:40:53

by Craig Small

Subject: procfs: mnt namespace behaviour with block devices (resend)

(resending as plain text as the first got bounced)

Hi,
I'm the maintainer of the psmisc package that provides system tools
for things like fuser and killall. I am trying to establish if
something I have found with the proc filesystem is as intended
(knowing why would be nice) or if it's a strange corner-case bug.

Apologies to the non-procfs maintainers, but these two lists are what
the MAINTAINERS file said to use. If you could CC me on replies that
would be great.

When a process in a different mount namespace has a block device open,
stat() on its /proc/<pid>/fd/<n> entry reports the device ID seen in
that other namespace, not the device ID seen in the namespace of the
process calling stat().

The issue came up in fuser not finding certain processes that were
directly accessing a block device, see
https://gitlab.com/psmisc/psmisc/-/issues/39
Programs such as lsof are caught by this too.

My question is: when I am in the bash mount namespace (4026531840 below),
shouldn't all the device IDs be from that namespace? In other words, I
would expect the dereferenced symlink and what it points to to report
the same device ID (5), rather than 44 for the symlink and 5 for
/dev/dm-8.

I get that if I could look at the device IDs in qemu or use nsenter to
switch to its namespace, then the device should be 44 for the symlink
and device (which it is and seems correct to me).

How to replicate
=============
# uname -a
Linux elmo 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15) x86_64 GNU/Linux

The easiest way to replicate this is to make a qemu virtual machine and
have it mount a block device. I suspect there are other ways, but I
don't have many things that mount a device and switch namespaces. The
qemu process (here it is 136775) will have a different mount namespace.

# ps -o pid,mntns,comm $$ 136775
PID MNTNS COMMAND
136775 4026532762 qemu-system-x86
142359 4026531840 bash
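
For reference, the MNTNS value can also be read straight from procfs. A
minimal sketch, assuming Python 3 on Linux; the helper name mount_ns_id
is hypothetical, and reading the link for another user's process
generally needs root or equivalent privileges:

```python
import os
import re

def mount_ns_id(pid="self"):
    """Return the mount namespace ID of a process as an integer.

    /proc/<pid>/ns/mnt is a symlink whose text looks like
    'mnt:[4026531840]'; the number is what ps shows as MNTNS.
    """
    target = os.readlink(f"/proc/{pid}/ns/mnt")
    match = re.fullmatch(r"mnt:\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link text: {target!r}")
    return int(match.group(1))

if __name__ == "__main__":
    # Two processes share a mount namespace when these numbers are equal.
    print(mount_ns_id())
```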

File descriptor 23 is what qemu is using to access the block device:
# ls -l /proc/136775/fd/23
lrwx------ 1 libvirt-qemu libvirt-qemu 64 Apr 12 16:34 /proc/136775/fd/23 -> /dev/dm-8

However, the dereferenced symlink and where the symlink points to show
different data.

# stat -L /proc/136775/fd/23
File: /proc/136775/fd/23
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
Access: 2022-04-12 16:34:25.687147886 +1000
Modify: 2022-04-12 16:34:25.519151533 +1000
Change: 2022-04-12 16:34:25.595149882 +1000
Birth: -

# stat /dev/dm-8
File: /dev/dm-8
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 348 Links: 1 Device type: fd,8
Access: (0660/brw-rw----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-04-12 16:15:12.684434884 +1000
Modify: 2022-04-12 16:15:12.684434884 +1000
Change: 2022-04-12 16:15:12.684434884 +1000
Birth: -

If we change to the qemu process' mount namespace then we do see that
/dev/dm-8 has the same device/inode as the symlink.

# nsenter -m -t 136775 stat /dev/dm-8
File: /dev/dm-8
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
Access: 2022-04-12 16:34:25.687147886 +1000
Modify: 2022-04-12 16:34:25.519151533 +1000
Change: 2022-04-12 16:34:25.595149882 +1000
Birth: -
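
All three outputs above report the same "Device type: fd,8" (st_rdev),
even though the Device (st_dev) and Inode values differ between the
namespaces. One possible workaround for a tool that matches processes
to block devices is sketched below, assuming Python 3 on Linux. It is
not what psmisc actually does, and the function name same_file is
hypothetical: for block special files it compares st_rdev, and
otherwise falls back to the usual (st_dev, st_ino) pair.

```python
import os
import stat

def same_file(a, b):
    """Decide whether two stat results refer to the same object.

    For block special files, compare the device number the node
    represents (st_rdev), which does not depend on which /dev
    instance the node lives in. Otherwise compare (st_dev, st_ino).
    """
    if stat.S_ISBLK(a.st_mode) and stat.S_ISBLK(b.st_mode):
        return a.st_rdev == b.st_rdev
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)

if __name__ == "__main__":
    # Paths taken from the transcript above; they exist only on that host.
    fd_stat = os.stat("/proc/136775/fd/23")  # follows the fd link
    dev_stat = os.stat("/dev/dm-8")
    print(same_file(fd_stat, dev_stat))
```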

Thanks for your time.

- Craig


2022-05-09 18:54:08

by Stephen Brennan

Subject: Re: procfs: mnt namespace behaviour with block devices (resend)

Hi Craig,

On 5/9/22 03:20, Craig Small wrote:
> (resending as plain text as the first got bounced)
>
> Hi,
> I'm the maintainer of the psmisc package that provides system tools
> for things like fuser and killall. I am trying to establish if
> something I have found with the proc filesystem is as intended
> (knowing why would be nice) or if it's a strange corner-case bug.
>
> Apologies to the non-procfs maintainers but these two lists are what
> MAINTAINER said to go to. If you could CC me on replies that would be
> great.
>
> The proc file descriptor for a block device mounted in a different
> namespace will show the device id of that different namespace and not
> the device id of the process stat()ing the file.
>
> The issue came up in fuser not finding certain processes that were
> directly accessing a block device, see
> https://gitlab.com/psmisc/psmisc/-/issues/39 Programs such as lsof are
> caught by this too.
>
> My question is: When I am in the bash mount namespace (4026531840 below)
> then shouldn't all the device IDs be from that namespace? In other
> words, the device id of the dereferenced symlink and what it points to
> are the same (device id 5) and not symlink has 44 and /dev/dm-8 has 5.
I'm no expert here, but I think this is working as intended.
It's definitely confusing!

Consider a process in a separate mount namespace from the init
namespace, e.g. a container. Say I were to start python in that container
and then do `os.open("/etc/passwd", os.O_RDONLY)`. If I were to then look
at that process's file descriptors (from the host's perspective), I'd see
the following (pid 220854 is the python process in the container):

$ ls -lah /proc/220854/fd/
total 0
dr-x------ 2 stepbren stepbren 0 May 9 11:06 .
dr-xr-xr-x 9 stepbren stepbren 0 May 9 11:06 ..
lrwx------ 1 stepbren stepbren 64 May 9 11:06 0 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May 9 11:06 1 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May 9 11:06 2 -> /dev/pts/0
lr-x------ 1 stepbren stepbren 64 May 9 11:06 3 -> /etc/passwd

$ cat /proc/220854/fd/3
<contents of container /etc/passwd>

$ cat /etc/passwd
<contents of host /etc/passwd>

$ stat -L /proc/220854/fd/3
File: /proc/220854/fd/3
Size: 900 Blocks: 8 IO Block: 4096 regular file
Device: 4eh/78d Inode: 5508982 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-10-27 10:24:28.000000000 -0700
Modify: 2020-10-27 10:24:28.000000000 -0700
Change: 2020-10-27 10:24:30.255374190 -0700
Birth: 2020-10-27 10:24:30.255374190 -0700

$ stat /etc/passwd
File: /etc/passwd
Size: 3216 Blocks: 8 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 24917416 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-05-08 15:06:18.837117765 -0700
Modify: 2021-11-30 09:08:45.163873193 -0800
Change: 2021-11-30 09:08:45.167873237 -0800
Birth: 2021-11-30 09:08:45.163873193 -0800

## INSIDE CONTAINER'S MOUNT NAMESPACE
$ stat /etc/passwd
File: /etc/passwd
Size: 900 Blocks: 8 IO Block: 4096 regular file
Device: 4eh/78d Inode: 5508982 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-10-27 17:24:28.000000000 +0000
Modify: 2020-10-27 17:24:28.000000000 +0000
Change: 2020-10-27 17:24:30.255374190 +0000
Birth: -

As you can see, it's the same behavior: the path /etc/passwd resolves to
a different inode in the init mount namespace compared to the
container's mount namespace. The secret sauce of the /proc/$pid/fd/$fd
files is that they don't behave like a normal symlink: instead of using
the file path to look up the target inode, they directly look up the
file and inode in the target process's file descriptor table.
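
A small sketch of that difference, assuming Python 3 on Linux and using
the current process rather than a container (the helper names are
hypothetical): stat() through /proc/self/fd/N reports the same device
and inode as fstat() on the descriptor itself.

```python
import os
import tempfile

def identity_via_proc(fd):
    """(st_dev, st_ino) obtained by following /proc/self/fd/<fd>."""
    st = os.stat(f"/proc/self/fd/{fd}")
    return (st.st_dev, st.st_ino)

def identity_via_fstat(fd):
    """(st_dev, st_ino) obtained directly from the open descriptor."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        fd = tmp.fileno()
        # Both lookups end at the open file, not at a path string.
        print(identity_via_proc(fd) == identity_via_fstat(fd))
```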

When you do a readlink(), the kernel has to create a path string, and it
has to do it from the perspective of the mount namespace of $pid, not
your monitoring command. The reason is that there may not even be a
corresponding path outside the mount namespace of $pid. Imagine I
created and opened "/etc/foobar" inside the container: that file may not
exist outside the container, so how could readlink() make a path
specific to your mount namespace?
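
The same effect can be shown without a container. In this sketch
(Python 3 on Linux assumed, function names hypothetical) an unlinked
file stands in for a path that exists only in another mount namespace;
that substitution is an assumption made so the example runs without
root. The readlink() text no longer resolves, yet stat() through the fd
link still reaches the open file.

```python
import os
import tempfile

def link_text_and_identity(fd):
    """Return the readlink() text and (st_dev, st_ino) for an open fd."""
    link = f"/proc/self/fd/{fd}"
    text = os.readlink(link)   # just a string built by the kernel
    st = os.stat(link)         # follows the link to the open file
    return text, (st.st_dev, st.st_ino)

def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "foobar")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            os.unlink(path)    # the path is gone, the open file is not
            text, ident = link_text_and_identity(fd)
            st = os.fstat(fd)
            resolvable = os.path.exists(text)
            same = ident == (st.st_dev, st.st_ino)
            return text, resolvable, same
        finally:
            os.close(fd)

if __name__ == "__main__":
    print(demo())
```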

Hopefully this helps, but maybe I'm off base and missing the thrust of
your question. Let me know either way.

Stephen

>
> I get that if I could look at the device IDs in qemu or use nsenter to
> switch to its namespace, then the device should be 44 for the symlink
> and device (which it is and seems correct to me).
>
> How to replicate
> =============
> # uname -a
> Linux elmo 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15)
> x86_64 GNU/Linux
>
> The easiest way to replicate this is to make a qemu virtual machine and
> have it mount a block device. I suspect there are other ways, but I
> don't have many things that mount a device and switch namespaces. The
> qemu process (here it is 136775) will have a different mount namespace.
>
> # ps -o pid,mntns,comm $$ 136775
> PID MNTNS COMMAND
> 136775 4026532762 qemu-system-x86
> 142359 4026531840 bash
>
> File descriptor 23 is what qemu is using to mount the block device
> # ls -l /proc/136775/fd/23
> lrwx------ 1 libvirt-qemu libvirt-qemu 64 Apr 12 16:34
> /proc/136775/fd/23 -> /dev/dm-8
>
> However, the dereferenced symlink and where the symlink points to show
> different data.
>
> # stat -L /proc/136775/fd/23
> File: /proc/136775/fd/23
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
> Birth: -
>
> # stat /dev/dm-8
> File: /dev/dm-8
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 5h/5d Inode: 348 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2022-04-12 16:15:12.684434884 +1000
> Modify: 2022-04-12 16:15:12.684434884 +1000
> Change: 2022-04-12 16:15:12.684434884 +1000
> Birth: -
>
> If we change to the qemu process' mount namespace then we do see that
> /dev/dm-8 has the same device/inode as the symlink.
>
> # nsenter -m -t 136775 stat /dev/dm-8
> File: /dev/dm-8
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
> Birth: -
>
> Thanks for your time.
>
> - Craig