Alexey Gladkov writes:
> Greetings!
>
> Preface
> -------
> This patch set can be applied over:
>
> git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
I am not going to seriously look at this for merging until after the
merge window closes.
Have you thought about the possibility of relaxing the permission checks
to mount proc such that we don't need to verify there is an existing
mount of proc? With just the subset pids I think this is feasible. It
might not be worth it at this point, but it is definitely worth asking
the question. One of the benefits that early proponents of the idea of a
subset of proc touted was that they would not be as restricted as they
are with today's proc.
I ask because this has a bearing on the other options you are playing
with.
Do we want to find a way to have the benefit of relaxed permission
checks while still including a few more files?
> Overview
> --------
> Directories and files can be created and deleted by dynamically loaded modules.
> Not all of these files are virtualized and safe inside the container.
>
> However, subset=pid is not enough because many containers wants to have
> /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> files per procfs mountpoint.
Is it desirable to have meminfo and cpuinfo as they are today, or do
people want them to reflect the ``container'' context, so that
applications like the JVM don't allocate too many CPUs, don't try to
consume too much memory, and don't run on nodes that cgroups currently
make unavailable?
Are there any users or planned users of this functionality yet?
I am concerned that you might be adding functionality that no one will
ever use that will just add code to the kernel that no one cares about,
that will then accumulate bugs. Having had to work through a few of
those cases to make each mount of proc have its own super block, I am
not a great fan of adding another one.
If the runc, lxc and other container runtime folks can productively use
such an option to do useful things, and they are sensible things to do, I
don't have any fundamental objection. But I do want to be certain this
is a feature that is going to be used.
Eric
> Introduced changes
> ------------------
> Allow to specify the names of files and directories in the subset= parameter and
> thereby make a whitelist of top-level permitted names.
>
>
> Alexey Gladkov (2):
> proc: use subset option to hide some top-level procfs entries
> docs: proc: update documentation about subset= parameter
>
> Documentation/filesystems/proc.rst | 6 +++
> fs/proc/base.c | 15 +++++-
> fs/proc/generic.c | 75 +++++++++++++++++++++------
> fs/proc/inode.c | 18 ++++---
> fs/proc/internal.h | 12 +++++
> fs/proc/root.c | 81 ++++++++++++++++++++++++------
> include/linux/proc_fs.h | 11 ++--
> 7 files changed, 175 insertions(+), 43 deletions(-)
On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov writes:
>
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>
> I am not going to seriously look at this for merging until after the
> merge window closes.
>
> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc? With just the subset pids I think this is feasible. It
> might not be worth it at this point, but it is definitely worth asking
> the question. As one of the benefits early propopents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.
>
> I ask because this has a bearing on the other options you are playing
> with.
>
> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files.
>
> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers wants to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
>
> Is it desirable to have meminfo and cpuinfo as they are today or do
> people want them to reflect the ``container'' context. So that
> applications like the JVM don't allocation too many cpus or don't try
> and consume too much memory, or run on nodes that cgroups current make
> unavailable.
>
> Are there any users or planned users of this functionality yet?
>
> I am concerned that you might be adding functionality that no one will
> ever use that will just add code to the kernel that no one cares about,
> that will then accumulate bugs. Having had to work through a few of
> those cases to make each mount of proc have it's own super block I am
> not a great fan of adding another one.
>
> If the runc, lxc and other container runtime folks can productively use
> such and option to do useful things and they are sensible things to do I
> don't have any fundamental objection. But I do want to be certain this
> is a feature that is going to be used.
I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
I haven't had time to look at this patchset).
In any case, we are currently virtualizing:
/proc/cpuinfo
/proc/diskstats
/proc/loadavg
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
for each container with a tiny in-userspace filesystem LXCFS
( https://github.com/lxc/lxcfs )
and have been doing that for years.
Having meminfo and cpuinfo virtualized in procfs was something we have
been wanting for a long time and there have been patches by other people
(from Siteground, I believe) to achieve this a few years back, but they
were disregarded.
I think meminfo and cpuinfo would already be great. And if we're
virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
/proc/stat. It would also be great to virtualize /proc/uptime. Right now
we're achieving this essentially by subtracting the time the init
process of the pid namespace has started since system boot time, minus
the time when the system started, to get the actual reaper age (it's a
bit more involved, but that's the gist).
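As a rough illustration of the uptime adjustment described above (a sketch
only, not LXCFS's actual code): assuming the container's uptime is taken to
be the host uptime minus the start time of the pid namespace's init process,
it could be derived from the contents of /proc/uptime and /proc/<pid>/stat.
The function names and the default of 100 clock ticks per second are
assumptions made for this example.

```python
def parse_starttime_ticks(stat_line):
    """Return field 22 (starttime, in clock ticks since boot) of a /proc/<pid>/stat line."""
    # The comm field (2) is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 22 is rest[19].
    return int(rest[19])


def container_uptime(uptime_text, init_stat_line, ticks_per_second=100):
    """Approximate the age of a pid namespace's init process in seconds."""
    # The first number in /proc/uptime is the host uptime in seconds.
    host_uptime = float(uptime_text.split()[0])
    started_after_boot = parse_starttime_ticks(init_stat_line) / ticks_per_second
    return host_uptime - started_after_boot
```

On a real system ticks_per_second would come from os.sysconf("SC_CLK_TCK"),
and the two strings would be read from the host's /proc/uptime and the stat
file of the container's init process.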
This is all on the topic list for this year's virtual containers
microconference at Plumbers, and I would suggest we try to discuss the
various requirements for something like this there. (I'm about to send
the CFP out.)
Christian
On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov writes:
>
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>
> I am not going to seriously look at this for merging until after the
> merge window closes.
OK. I'll wait.
> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc? With just the subset pids I think this is feasible. It
> might not be worth it at this point, but it is definitely worth asking
> the question. As one of the benefits early propopents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.
I'm not sure I follow.
What do you mean by the possibility of relaxing the permission checks to
mount proc?
Are you suggesting that a user be allowed to mount procfs with the
hidepid=2,subset=pid options? If so, this is an interesting idea.
> I ask because this has a bearing on the other options you are playing
> with.
I cannot agree with this, because I do not touch the other options.
The hidepid and subset=pid options have no relation to the visibility of
regular files. On the other hand, procfs has absolutely no way to restrict
access other than SELinux.
> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files.
In fact, I see no problem allowing the user to mount procfs with the
hidepid=2,subset=pid options.
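For reference, a minimal sketch of what such a mount request looks like when
expressed as a mount(8) command line. The helper name and the target path
used below are made up for the example; the options themselves are the ones
named above. The function only builds the argument vector and does not run
anything, since actually mounting requires the appropriate privileges.

```python
def proc_mount_command(target, hidepid=2, subset="pid"):
    """Build the argv for mounting a restricted procfs instance at target."""
    # Equivalent shell form: mount -t proc -o hidepid=2,subset=pid proc <target>
    options = "hidepid={},subset={}".format(hidepid, subset)
    return ["mount", "-t", "proc", "-o", options, "proc", target]
```

The resulting list could be handed to subprocess.run() by a caller that is
root in its user and mount namespaces.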
We could add subset=self, which would allow not only the pid subset but
also the other symlinks that lead to self (/proc/net, /proc/mounts) and,
if we ever add virtualization to them, meminfo, cpuinfo, etc.
> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers wants to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
>
> Is it desirable to have meminfo and cpuinfo as they are today or do
> people want them to reflect the ``container'' context. So that
> applications like the JVM don't allocation too many cpus or don't try
> and consume too much memory, or run on nodes that cgroups current make
> unavailable.
Of course, it would be better if these files took into account the
limitations of cgroups or some kind of ``containerized'' context.
> Are there any users or planned users of this functionality yet?
I know that java uses meminfo for sure.
The purpose of this patch is to isolate the container from unwanted files
in procfs.
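The intended effect can be sketched as a filter over the top-level names of
a procfs mount. This is only an illustration of the allow-list idea, not the
kernel implementation: the exact syntax accepted by subset= is not shown in
this thread, so the permitted names are passed in as a plain set, and the
assumption that numeric pid directories always stay visible is mine.

```python
def visible_entries(top_level_entries, allowed_names):
    """Keep numeric pid directories and names on the allow list; hide the rest."""
    visible = []
    for name in top_level_entries:
        # Assumption: per-process directories are always shown, and every
        # other top-level name must be explicitly permitted.
        if name.isdigit() or name in allowed_names:
            visible.append(name)
    return visible
```

Anything a module registers later (for example an out-of-tree driver's file)
stays hidden unless its name is on the list.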
> I am concerned that you might be adding functionality that no one will
> ever use that will just add code to the kernel that no one cares about,
> that will then accumulate bugs. Having had to work through a few of
> those cases to make each mount of proc have it's own super block I am
> not a great fan of adding another one.
>
> If the runc, lxc and other container runtime folks can productively use
> such and option to do useful things and they are sensible things to do I
> don't have any fundamental objection. But I do want to be certain this
> is a feature that is going to be used.
OK, here is just one example of how docker or runc (actually almost all
golang-based container systems) try to block access to something in procfs:
$ docker run -it --rm busybox
# mount |grep /proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
For now I'm just trying to create a better way to restrict access in
procfs than this, since procfs is used in containers.
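To make the scale of this workaround visible, the listing above can be
summarized programmatically. The following is a small sketch that parses
mount(8)-style output and reports every mount stacked below /proc; the
function name and the tuple layout are choices made for this example.

```python
import re

# Matches lines of the form: "<source> on <target> type <fstype> (<options>)"
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")


def proc_overmounts(mount_output, root="/proc"):
    """Return (target, fstype, read_only) for every mount strictly below root."""
    found = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a mount line
        _source, target, fstype, options = match.groups()
        if target.startswith(root + "/"):
            read_only = "ro" in options.split(",")
            found.append((target, fstype, read_only))
    return found
```

Applied to the docker output above, this would report the read-only proc
bind mounts and the tmpfs mounts that mask individual files and directories,
while skipping the /proc mount itself.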
--
Rgrds, legion
On Thu, Jun 04, 2020 at 11:32:20PM +0200, Christian Brauner wrote:
> > Is it desirable to have meminfo and cpuinfo as they are today or do
> > people want them to reflect the ``container'' context. So that
> > applications like the JVM don't allocation too many cpus or don't try
> > and consume too much memory, or run on nodes that cgroups current make
> > unavailable.
> >
> > Are there any users or planned users of this functionality yet?
> >
> > I am concerned that you might be adding functionality that no one will
> > ever use that will just add code to the kernel that no one cares about,
> > that will then accumulate bugs. Having had to work through a few of
> > those cases to make each mount of proc have it's own super block I am
> > not a great fan of adding another one.
> >
> > If the runc, lxc and other container runtime folks can productively use
> > such and option to do useful things and they are sensible things to do I
> > don't have any fundamental objection. But I do want to be certain this
> > is a feature that is going to be used.
>
> I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
> I haven't had time to look at this patchset).
No. Not yet :) I just suggest a way to restrict access to files in the
procfs inside a container about which you know nothing.
> In any case, we are currently virtualizing:
> /proc/cpuinfo
> /proc/diskstats
> /proc/loadavg
> /proc/meminfo
> /proc/stat
> /proc/swaps
> /proc/uptime
> for each container with a tiny in-userspace filesystem LXCFS
> ( https://github.com/lxc/lxcfs )
> and have been doing that for years.
I know about it. The reason for the appearance of such a solution is also
clear.
> Having meminfo and cpuinfo virtualized in procfs was something we have
> been wanting for a long time and there have been patches by other people
> (from Siteground, I believe) to achieve this a few years back but were
> disregarded.
>
> I think meminfo and cpuinfo would already be great. And if we're
> virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
> /proc/stat. It would also be great to virtualize /proc/uptime. Right now
> we're achieving this essentially by substracting the time the init
> process of the pid namespace has started since system boot time, minus
> the time when the system started to get the actual reaper age (It's a
> bit more involved but that's the gist.).
>
> This is all on the topic list for this year's virtual container's
> microconference at Plumber's and I would suggest we try to discuss the
> various requirements for something like this there. (I'm about to send
> the CFP out.)
>
> Christian
>
--
Rgrds, legion
Alexey Gladkov writes:
> On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
>> Alexey Gladkov writes:
>>
>> > Greetings!
>> >
>> > Preface
>> > -------
>> > This patch set can be applied over:
>> >
>> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>>
>> I am not going to seriously look at this for merging until after the
>> merge window closes.
>
> OK. I'll wait.
That will mean your patches can be based on -rc1.
>> Have you thought about the possibility of relaxing the permission checks
>> to mount proc such that we don't need to verify there is an existing
>> mount of proc? With just the subset pids I think this is feasible. It
>> might not be worth it at this point, but it is definitely worth asking
>> the question. As one of the benefits early propopents of the idea of a
>> subset of proc touted was that they would not be as restricted as they
>> are with today's proc.
>
> I'm not sure I follow.
>
> What do you mean by the possibility of relaxing the permission checks to
> mount proc?
>
> Do you suggest to allow a user to mount procfs with hidepid=2,subset=pid
> options? If so then this is an interesting idea.
The key part would be subset=pid. You would still need to be root in
your user namespace, and mount namespace. You would not need to have a
separate copy of proc with nothing hidden already mounted.
>> I ask because this has a bearing on the other options you are playing
>> with.
>
> I can not agree with this because I do not touch on other options.
> The hidepid and subset=pid has no relation to the visibility of regular
> files. On the other hand, in procfs there is absolutely no way to restrict
> access other than selinux.
Untrue. At a practical level the user namespace greatly restricts
access to proc because many of the non-process files are limited to
global root only.
>> Do we want to find a way to have the benefit of relaxed permission
>> checks while still including a few more files.
>
> In fact, I see no problem allowing the user to mount procfs with the
> hidepid=2,subset=pid options.
>
> We can make subset=self, which would allow not only pids subset but also
> other symlinks that lead to self (/proc/net, /proc/mounts) and if we ever
> add virtualization to meminfo, cpuinfo etc.
>
>> > Overview
>> > --------
>> > Directories and files can be created and deleted by dynamically loaded modules.
>> > Not all of these files are virtualized and safe inside the container.
>> >
>> > However, subset=pid is not enough because many containers wants to have
>> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
>> > files per procfs mountpoint.
>>
>> Is it desirable to have meminfo and cpuinfo as they are today or do
>> people want them to reflect the ``container'' context. So that
>> applications like the JVM don't allocation too many cpus or don't try
>> and consume too much memory, or run on nodes that cgroups current make
>> unavailable.
>
> Of course, it would be better if these files took into account the
> limitations of cgroups or some kind of ``containerized'' context.
>
>> Are there any users or planned users of this functionality yet?
>
> I know that java uses meminfo for sure.
>
> The purpose of this patch is to isolate the container from unwanted files
> in procfs.
If what we want is the ability not to use the original but to have
a modified version of these files, we probably want empty files that
serve as mount points.
Or possibly a version of these files that takes into account
restrictions. In either event we need to do the research through real
programs and real kernel options to see what our best option is for
exporting the limitations that programs have and deciding on the long
term API for that.
If we research things and we decide the best way to let java know of
its limitations is to change /proc/meminfo, that needs to be a change
that always applies to meminfo and is not controlled by options.
>> I am concerned that you might be adding functionality that no one will
>> ever use that will just add code to the kernel that no one cares about,
>> that will then accumulate bugs. Having had to work through a few of
>> those cases to make each mount of proc have it's own super block I am
>> not a great fan of adding another one.
>>
>> If the runc, lxc and other container runtime folks can productively use
>> such and option to do useful things and they are sensible things to do I
>> don't have any fundamental objection. But I do want to be certain this
>> is a feature that is going to be used.
>
> Ok, just an example how docker or runc (actually almost all golang-based
> container systems) is trying to block access to something in procfs:
>
> $ docker run -it --rm busybox
> # mount |grep /proc
> proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> proc on /proc/bus type proc (ro,relatime)
> proc on /proc/fs type proc (ro,relatime)
> proc on /proc/irq type proc (ro,relatime)
> proc on /proc/sys type proc (ro,relatime)
> proc on /proc/sysrq-trigger type proc (ro,relatime)
> tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
>
> For now I'm just trying ti create a better way to restrict access in
> the procfs than this since procfs is used in containers.
Docker historically has been crap about having a sensible policy. The
problem is that Docker wanted to allow real root in a container and
somehow make it safe by blocking access to proc files and by dropping
capabilities.
Practically everything that Docker has done is done much better and more
simply by restricting the processes to a user namespace, with a root user
whose uid is not the global root user.
Which is why I want us to make certain we are doing something that makes
sense, and is architecturally sound.
You have cleared the big hurdle and proc now has options that are
usable. I really appreciate that. I am not opposed to the general
direction you are going to find a way to make proc more usable. I just
want our next step to be solid.
Eric
On Thu, Jun 04, 2020 at 11:17:38PM -0500, Eric W. Biederman wrote:
> >> I am not going to seriously look at this for merging until after the
> >> merge window closes.
> >
> > OK. I'll wait.
>
> That will mean your patches can be based on -rc1.
OK.
> > Do you suggest to allow a user to mount procfs with hidepid=2,subset=pid
> > options? If so then this is an interesting idea.
>
> The key part would be subset=pid. You would still need to be root in
> your user namespace, and mount namespace. You would not need to have a
> separate copy of proc with nothing hidden already mounted.
Can you tell me more about your idea? I thought I understood it, but it
seems my understanding is different.
I thought you were suggesting a move in the direction of allowing an
unprivileged user to mount procfs.
> > I can not agree with this because I do not touch on other options.
> > The hidepid and subset=pid has no relation to the visibility of regular
> > files. On the other hand, in procfs there is absolutely no way to restrict
> > access other than selinux.
>
> Untrue. At a practical level the user namespace greatly restricts
> access to proc because many of the non-process files are limited to
> global root only.
I am not worried about the files created in procfs by the kernel itself
because the permissions are set correctly and are checked correctly.
I worry about kernel modules, especially about modules out of tree.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/gadget/function/rndis.c#n904
I certainly understand that 0660 is not 0666, but still.
> > I know that java uses meminfo for sure.
> >
> > The purpose of this patch is to isolate the container from unwanted files
> > in procfs.
>
> If what we want is the ability not to use the original but to have
> a modified version of these files. We probably want empty files that
> serve as mount points.
>
> Or possibly a version of these files that takes into account
> restrictions. In either even we need to do the research through real
> programs and real kernel options to see what is our best option for
> exporting the limitations that programs have and deciding on the long
> term API for that.
Yes, but that's a slightly different story. It would be great if all of
these files provide modified information.
My patch is about those files that we don’t know about and which we don’t
want.
> If we research things and we decide the best way to let java know of
> it's limitations is to change /proc/meminfo. That needs to be a change
> that always applies to meminfo and is not controlled by options.
>
> > For now I'm just trying ti create a better way to restrict access in
> > the procfs than this since procfs is used in containers.
>
> Docker historically has been crap about having a sensible policy. The
> problem is that Docker wanted to allow real root in a container and
> somehow make it safe by blocking access to proc files and by dropping
> capabilities.
>
> Practically everything that Docker has done is much better and simpler by
> restricting the processes to a user namespace, with a root user whose
> uid is not the global root user.
>
> Which is why I want us to make certain we are doing something that makes
> sense, and is architecturally sound.
Ok. Then ignore this patchset.
> You have cleared the big hurdle and proc now has options that are
> usable. I really appreciate that. I am not opposed to the general
> direction you are going to find a way to make proc more usable. I just
> want our next step to be solid.
--
Rgrds, legion