Hi,
Many Linux kernel vulnerabilities including the recently exploited
Netfilter CVE-2024-1086 require CAP_NET_ADMIN in a namespace, yet a
typically recommended mitigation is to disable user namespaces (not just
network namespaces).
Further, while on Debian/Ubuntu it is possible to disable just
unprivileged user namespaces with the Debian-specific sysctl
kernel.unprivileged_userns_clone=0, on other distros we'd have to use
user.max_user_namespaces=0, which (unnecessarily) prevents starting of
containers even by root.
Fredrik Nystrom on Rocky Linux Mattermost channel Security pointed out
that it is reasonable to disable just network namespaces with
user.max_net_namespaces=0 instead, and that the negative effects of
doing so and how to cope with them are well-documented for Apptainer,
with its documentation also covering Docker, Podman, and systemd:
https://apptainer.org/docs/admin/latest/user_namespace.html#disabling-network-namespaces
I hope some of us in here find this useful, and maybe we (including
distros) will start recommending this milder mitigation when sufficient.
I include this section of the Apptainer documentation below, as taken
from its source at
https://github.com/apptainer/apptainer-admindocs/blob/main/user_namespace.rst
---
******************************
Disabling network namespaces
******************************
There have been many Linux kernel exploits that have made use of
unprivileged user namespaces as a point of entry, but almost all of them
in the last few years have been in combination with network namespaces.
Therefore even though the Apptainer project recommends enabling
unprivileged user namespaces, it recommends disabling network namespaces
when possible in order to substantially reduce the risk profile
and need for urgent updates when vulnerabilities are announced.
Network namespaces can be disabled on most Linux-based systems
like this:

.. code:: bash

    echo "user.max_net_namespaces = 0" \
        >/etc/sysctl.d/90-max_net_namespaces.conf
    sysctl -p /etc/sysctl.d/90-max_net_namespaces.conf

Apptainer does not by default make use of network namespaces, but it
does have some little-used privileged options beginning with ``--net``
that do.
Those options will not work when network namespaces are disabled.
Unfortunately it is not possible to disable only unprivileged
network namespaces, so this will affect programs that use them
even if run as root.
Some other container runtimes such as Docker and Podman do make use
of network namespaces by default.
Those two runtimes can still work when network namespaces are disabled
by adding the ``--net=host`` option.
Disabling network namespaces also blocks the systemd PrivateNetwork
feature.
To find services that use it, look for ``PrivateNetwork=true``
or ``PrivateNetwork=yes`` in ``/lib/systemd/system/*.service``.
This can be turned off for each service through a
``/etc/systemd/system/<service>.d/*.conf`` file, for example for
``systemd-hostnamed``:

.. code:: bash

    cd /etc/systemd/system
    mkdir -p systemd-hostnamed.service.d
    (echo "[Service]"; echo "PrivateNetwork=no") \
        >systemd-hostnamed.service.d/no-private-network.conf

If the service is enabled (that is, actively used) then restart it
and check its status:

.. code:: bash

    systemctl status systemd-hostnamed
    systemctl daemon-reload
    systemctl restart systemd-hostnamed
    systemctl status systemd-hostnamed

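The two systemd steps above (finding units that set PrivateNetwork, then overriding it per service) can be sketched as a pair of shell functions. This is only an illustration: the function names and the optional directory arguments are hypothetical helpers, not part of Apptainer or systemd, and the pattern matches only the ``true`` and ``yes`` spellings mentioned in the text.

```shell
# List unit files under a directory that enable PrivateNetwork.
# The directory defaults to /lib/systemd/system, as in the text above.
# Only the "true" and "yes" spellings are matched here.
find_private_network_units() {
    dir="${1:-/lib/systemd/system}"
    grep -lE '^[[:space:]]*PrivateNetwork[[:space:]]*=[[:space:]]*(true|yes)[[:space:]]*$' \
        "$dir"/*.service 2>/dev/null
}

# Write a drop-in that turns PrivateNetwork off for one service.
# The second argument (hypothetical, for dry runs) overrides the
# default /etc/systemd/system location.
write_no_private_network_dropin() {
    service="$1"
    root="${2:-/etc/systemd/system}"
    dropin_dir="$root/$service.service.d"
    mkdir -p "$dropin_dir" &&
        printf '[Service]\nPrivateNetwork=no\n' \
            >"$dropin_dir/no-private-network.conf"
}

# Example (needs root on a real system):
#   find_private_network_units
#   write_no_private_network_dropin systemd-hostnamed
#   systemctl daemon-reload && systemctl restart systemd-hostnamed
```

Passing a scratch directory as the last argument to either function allows a dry run without touching the live systemd configuration.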
---
Alexander
On Sun, Apr 14, 2024 at 09:08:55PM +0200, Solar Designer wrote:
> Hi,
>
> Many Linux kernel vulnerabilities including the recently exploited
> Netfilter CVE-2024-1086 require CAP_NET_ADMIN in a namespace, yet a
> typically recommended mitigation is to disable user namespaces (not just
> network namespaces).
>
> Further, while on Debian/Ubuntu it is possible to disable just
> unprivileged user namespaces with the Debian-specific sysctl
> kernel.unprivileged_userns_clone=0, on other distros we'd have to use
> user.max_user_namespaces=0, which (unnecessarily) prevents starting of
> containers even by root.
>
> Fredrik Nystrom on Rocky Linux Mattermost channel Security pointed out
> that it is reasonable to disable just network namespaces with
> user.max_net_namespaces=0 instead, and that the negative effects of
> doing so and how to cope with them are well-documented for Apptainer,
> with its documentation also covering Docker, Podman, and systemd:
>
> https://apptainer.org/docs/admin/latest/user_namespace.html#disabling-network-namespaces
>
> I hope some of us in here find this useful, and maybe we (including
> distros) will start recommending this milder mitigation when sufficient.
Is this still compatible with Firefox?
IMO an ideal solution would be:
1. Provide a privileged helper daemon that sets up containers based on
user requirements.
2. Port programs that use containers to use this helper.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
On Sun, Apr 14, 2024 at 06:47:26PM -0400, Demi Marie Obenour wrote:
> On Sun, Apr 14, 2024 at 09:08:55PM +0200, Solar Designer wrote:
> > Fredrik Nystrom on Rocky Linux Mattermost channel Security pointed out
> > that it is reasonable to disable just network namespaces with
> > user.max_net_namespaces=0 instead, and that the negative effects of
> > doing so and how to cope with them are well-documented for Apptainer,
> > with its documentation also covering Docker, Podman, and systemd:
> >
> > https://apptainer.org/docs/admin/latest/user_namespace.html#disabling-network-namespaces
> >
> > I hope some of us in here find this useful, and maybe we (including
> > distros) will start recommending this milder mitigation when sufficient.
>
> Is this still compatible with Firefox?
No. Per my testing, setting user.max_net_namespaces=0 while keeping
user.max_user_namespaces at greater than 0 is _not_ compatible with
Firefox 124.0.2. However, setting user.max_user_namespaces=0 is
compatible with it, regardless of whether user.max_net_namespaces is 0
or not. I guess it only has fallbacks (perhaps weakening its sandbox)
for the case when user namespaces can't be created, but not for this
mixed case when user can be, but net can't.
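To tell which of these cases a given system is in, something like the following could be used. It is a minimal sketch: the function name, the state labels, and the optional root argument are made up for illustration, and the Firefox notes in the comments only restate the test results above for 124.0.2.

```shell
# Classify the namespace limits discussed above.
# The optional argument is a hypothetical override of /proc/sys,
# useful for trying the function against a fake tree.
ns_limit_state() {
    root="${1:-/proc/sys}"
    user_max=$(cat "$root/user/max_user_namespaces" 2>/dev/null) || user_max=""
    net_max=$(cat "$root/user/max_net_namespaces" 2>/dev/null) || net_max=""

    if [ -z "$user_max" ] || [ -z "$net_max" ]; then
        echo "unknown"
    elif [ "$user_max" -eq 0 ]; then
        # Firefox 124.0.2 still ran in this case in the test above.
        echo "userns-disabled"
    elif [ "$net_max" -eq 0 ]; then
        # The mixed case: user namespaces work, network namespaces do not.
        # Firefox 124.0.2 did not work in this case in the test above.
        echo "netns-disabled-only"
    else
        echo "both-enabled"
    fi
}

ns_limit_state
```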
Breaking Firefox or weakening its sandbox is indeed not great.
I primarily meant these settings for headless servers, which wouldn't
commonly run Firefox. However, even there I can see how weakening
systemd service sandboxing is also not great. Maybe we need to invent a
kernel.unprivileged_netns_clone setting similar to Debian's
kernel.unprivileged_userns_clone, so that systemd (running as root)
would still be able to create network namespaces. And/or make Debian's
kernel.unprivileged_userns_clone official upstream and use that. Why
did Debian choose to deprecate (but not yet drop?) theirs and go with
upstream's user.max_user_namespaces, which doesn't provide exactly the
same functionality? Was there an attempt at upstreaming?
> IMO an ideal solution would be:
>
> 1. Provide a privileged helper daemon that sets up containers based on
> user requirements.
>
> 2. Port programs that use containers to use this helper.
Not likely to happen universally and not good in terms of introducing a
middle project and dependency that could dictate rules to others.
Alexander
On Sun, 14 Apr 2024 at 21:08:55 +0200, Solar Designer forwarded:
> Some other container runtimes such as Docker and Podman do make use
> of network namespaces by default.
As an example of a less traditional container environment, Flatpak
optionally uses network namespaces (as implemented by bubblewrap,
bwrap(1)) to isolate apps from the network, and disabling
network namespaces will break the ability to run apps that have
`--unshare=network` in their manifests. I believe it will "fail closed"
in this situation (refusing to run the affected app, rather than running
the app but giving it unintended network access).
A workaround would be to run the affected apps with
`flatpak run --share=network ...`, or permanently reconfigure their
sandboxing parameters with `flatpak override --share=network ...`, but
either of those workarounds would remove the network isolation feature
and give the affected apps unrestricted network access.
Similarly, libgnome-desktop uses bubblewrap to run sandboxed thumbnailers
with no network access, mitigating vulnerabilities that might exist in
thumbnailers or the libraries that they use. Again, I believe it will
"fail closed", but I haven't checked.
Similarly, WebKitGTK uses bubblewrap to sandbox parts of itself with no
network access, xdg-desktop-portal uses bubblewrap for sandboxed icon
validation, and I'm sure there are others.
(<https://codesearch.debian.net/search?q=--unshare-net>)
So I suspect that the mitigation of disabling network namespaces is
likely to be too disruptive to be applicable on desktops, and only useful
on servers.
smcv
On Mon, 15 Apr 2024 at 17:13:09 +0200, Solar Designer wrote:
> And/or make Debian's
> kernel.unprivileged_userns_clone official upstream and use that. Why
> did Debian choose to deprecate (but not yet drop?) theirs and go with
> upstream's user.max_user_namespaces, which doesn't provide exactly the
> same functionality? Was there an attempt at upstreaming?
I am not a kernel developer, so this is second-hand information; but I
believe the implementation of kernel.unprivileged_userns_clone used in
Debian (and subsequently copied from Debian by various other distros)
is derived from patches that were already proposed and rejected upstream,
so the feeling was that trying again to upstream that feature would be a
waste of time and upstream goodwill, because it would just get rejected
again by the same kernel maintainer.
kernel.unprivileged_userns_clone was a tradeoff between kernel attack
surface and user-space attack surface. Disabling it mitigates various
attacks that user-space can attempt on the kernel, but forces user-space
sandboxing things (such as bubblewrap and the Chromium sandbox) to
be setuid root if they are going to be used, which turns them into a
user-space root privilege escalation risk. Conversely, with unprivileged
namespaces, we can sandbox user-space processes without adding that risk,
but we're relying on a larger kernel attack surface being secure.
(Current versions of Debian still have the kernel.unprivileged_userns_clone
patch, but it's left enabled by default, resulting in behaviour that is
equivalent to upstream kernels.)
smcv
Hey.
There's even an allegedly "wontfix" bug of mine where I requested that
Debian switch back to a secure default and disable user namespaces, which
have a long history of being exploitable:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1012547
I don't think the current hole will have been the last one.
Unfortunately, it seems that a feature only some people need is
valued more highly than keeping users secure. :-(
Regards,
Philippe
On Tue, Apr 16, 2024 at 11:31:43PM +0200, Philippe Cerfon wrote:
> Hey.
>
> There's even an allegedly "wontfix" bug of mine where I requested that
> Debian switches back to a secure default and disables user namesapce which
> have a long history of being exploitable:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1012547
>
> Don't think the current hole one will have been the last one.
>
> Unfortunately it seems a feature that only a group of people will need is
> valued more important than keeping users secure. :-(
The problem with disabling unprivileged userns is that in the desktop
Linux case it actually causes serious problems, because creating a
sandbox is now a privileged operation. IMO Landlock + seccomp is a much
better solution for sandboxing, but I don't think it can do everything
browsers need yet.
For containers, I'm not aware of a good solution right now.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
On Sun, 2024-04-14 at 21:08 +0200, Solar Designer wrote:
> Hi,
>
> Many Linux kernel vulnerabilities including the recently exploited
> Netfilter CVE-2024-1086 require CAP_NET_ADMIN in a namespace, yet a
> typically recommended mitigation is to disable user namespaces (not just
> network namespaces).
>
> Further, while on Debian/Ubuntu it is possible to disable just
> unprivileged user namespaces with the Debian-specific sysctl
> kernel.unprivileged_userns_clone=0, on other distros we'd have to use
> user.max_user_namespaces=0, which (unnecessarily) prevents starting of
> containers even by root.
I just wanted to add that in the Ubuntu Noble Numbat release we are
using AppArmor to restrict unprivileged user namespaces.
Applications that don't have an AppArmor profile will use a default
profile which denies the use of capabilities within the user namespace.
Applications that need to use capabilities will have to be confined by
a profile. Since we understand that creating an AppArmor profile might
not be a trivial task for large programs, we introduced the
"unconfined" flag which makes the profile act as if it were unconfined
from the perspective of AppArmor, allowing all operations.
There are more details here:
https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890#unprivileged-user-namespace-restrictions-13
>
> Fredrik Nystrom on Rocky Linux Mattermost channel Security pointed out
> that it is reasonable to disable just network namespaces with
> user.max_net_namespaces=0 instead, and that the negative effects of
> doing so and how to cope with them are well-documented for Apptainer,
> with its documentation also covering Docker, Podman, and systemd:
>
> https://apptainer.org/docs/admin/latest/user_namespace.html#disabling-network-namespaces
>
> I hope some of us in here find this useful, and maybe we (including
> distros) will start recommending this milder mitigation when sufficient.
>
> I include this section of the Apptainer documentation below, as taken
> from its source at
> https://github.com/apptainer/apptainer-admindocs/blob/main/user_namespace.rst
>
> ---
> ******************************
> Disabling network namespaces
> ******************************
>
> There have been many Linux kernel exploits that have made use of
> unprivileged user namespaces as a point of entry, but almost all of them
> in the last few years have been in combination with network namespaces.
> Therefore even though the Apptainer project recommends enabling
> unprivileged user namespaces, it recommends disabling network namespaces
> when possible in order to substantially reduce the risk profile
> and need for urgent updates when vulnerabilities are announced.
>
> Network namespaces can be disabled on most Linux-based systems
> like this:
>
> .. code:: bash
>
> echo "user.max_net_namespaces = 0" \
> >/etc/sysctl.d/90-max_net_namespaces.conf
> sysctl -p /etc/sysctl.d/90-max_net_namespaces.conf
>
> Apptainer does not by default make use of network namespaces, but it
> does have some little-used privileged options beginning with ``--net``
> that do.
> Those options will not work when network namespaces are disabled.
> Unfortunately it is not possible to disable only unprivileged
> network namespaces, so this will affect programs that use them
> even if run as root.
>
> Some other container runtimes such as Docker and Podman do make use
> of network namespaces by default.
> Those two runtimes can still work when network namespaces are disabled
> by adding the ``--net=host`` option.
>
> Disabling network namespaces also blocks the systemd PrivateNetwork
> feature.
> To find services that use it, look for ``PrivateNetwork=true``
> or ``PrivateNetwork=yes`` in ``/lib/systemd/system/*.service``.
> This can be turned off for each service through a
> ``/etc/systemd/system/<service>.d/*.conf`` file, for example for
> ``systemd-hostnamed``:
>
> .. code:: bash
>
> cd /etc/systemd/system
> mkdir -p systemd-hostnamed.service.d
> (echo "[Service]"; echo "PrivateNetwork=no") \
> >systemd-hostnamed.service.d/no-private-network.conf
>
> If the service is enabled (that is, actively used) then restart it
> and check its status:
>
> .. code:: bash
>
> systemctl status systemd-hostnamed
> systemctl daemon-reload
> systemctl restart systemd-hostnamed
> systemctl status systemd-hostnamed
> ---
>
> Alexander
On Wed, Apr 17, 2024 at 09:52:10AM -0300, Georgia Garcia wrote:
> I just wanted to add that in the Ubuntu Noble Numbat release we are
> using AppArmor to restrict unprivileged user namespaces.
For those who like me are confused by release names, this is 24.04 LTS.
> Applications that don't have an AppArmor profile will use a default
> profile which denies the use of capabilities within the user namespace.
In other words, there's now precedent of allowing namespace creation
while disallowing use of capabilities in the namespace. I started
thinking of doing the same, but in a lightweight distro-neutral way, by
introducing a new sysctl.
Possible logic could be to set the maximum namespace nesting depth where
capabilities (or maybe specifically CAP_NET_ADMIN) still work. We could
have this apply to unprivileged user namespaces only or to all. I guess
systemd's PrivateNetwork services generally don't configure networking
(they just give up network access), so would continue to work even with
capabilities disallowed?
A max depth setting of 1 could allow network configuration in top-level
containers if needed, while reducing the kernel's attack surface exposed
to further sandboxed programs, nested containers, and unintended
namespaces created by exploits running as a user inside a top-level
container. My thinking is that if someone uses containers with custom
network configuration, they probably mostly care about attacks by
container users (and nested containers, if any) rather than by host
users. They could also care about attacks by top-level container root,
but there's little we can do here while allowing container root to
configure networking.
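To make the intended semantics concrete, here is a toy model of the rule, not kernel code. The depth numbering (0 for the host's initial namespace, 1 for a top-level container, 2 for anything nested inside it) is my assumption about how such a limit would be counted, and max_depth stands in for the new sysctl, which does not exist yet.

```shell
# Toy model of the proposed rule: capabilities remain effective only
# in namespaces nested no deeper than max_depth.
# Returns success (0) when capabilities would still work.
caps_effective() {
    depth="$1"
    max_depth="$2"
    [ "$depth" -le "$max_depth" ]
}

# With a max depth of 1, the host and top-level containers keep
# working capabilities, while nested namespaces do not.
for depth in 0 1 2; do
    if caps_effective "$depth" 1; then
        echo "depth $depth: capabilities work"
    else
        echo "depth $depth: capabilities ineffective"
    fi
done
```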
Does this sound like it has a chance of getting accepted upstream?
Meanwhile, this looks implementable via security_capable() LSM hook, and
I am thinking of experimenting with it in LKRG:
https://github.com/lkrg-org/lkrg/issues/331
Limiting this new logic only to unprivileged user namespaces feels
tricky or hackish as there doesn't appear to be an existing struct field
to indicate parent's capabilities at namespace creation time. There is
parent_could_setfcap, which we maybe could abuse. Any better ideas?
Detailed discussion of implementation wouldn't belong on oss-security.
We'll need to move to kernel-hardening or linux-hardening for that. But
initial feedback on the idea is fine to have in here, especially from
perspective of required functionality.
> Applications that need to use capabilities will have to be confined by
> a profile. Since we understand that creating an AppArmor profile might
> not be a trivial task for large programs, we introduced the
> "unconfined" flag which makes the profile act as if it were unconfined
> from the perspective of AppArmor, allowing all operations.
Thank you, this is helpful.
> There are more details here:
> https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890#unprivileged-user-namespace-restrictions-13
Looks like the direct link to this section of the release notes has
since changed, now it is:
https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890/1#unprivileged-user-namespace-restrictions-14
I'll quote this section's content below, for archival:
---
Unprivileged user namespace restrictions
In combination with the apparmor package, the Ubuntu kernel now
restricts the use of unprivileged user namespaces. This affects all
programs on the system that are unprivileged and unconfined. A default
AppArmor profile is provided that allows the use of user namespaces for
unprivileged and unconfined applications but will deny the subsequent
use of any capabilities within the user namespace. A common use-case for
unprivileged user namespaces is applications that construct their own
sandboxes or work with styles of container workloads. As such, AppArmor
profiles that allow the use of unprivileged user namespaces are also
provided for common applications and frameworks that come from the
Ubuntu archive, as well as popular third party applications like Google
Chrome, Discord and others. This is a subsequent step towards trying to
mitigate the larger attack surface presented by unprivileged user
namespaces (the first being the introduction of this feature in Ubuntu
23.10 where it was not enabled by default).
Whilst significant effort has been expended to try and identify all
applications that may require such profiles, it is expected that there
may be cases where additional profiles are required.
In this case, there are several options if you run into problems:
- Confine your applications with an AppArmor profile. Because this can
be potentially onerous, a new unconfined profile mode/flag has been
added to AppArmor. This designates the profile to essentially act like
the unconfined mode for AppArmor where an application is not restricted,
and it allows additional permissions to be added, such as the userns
permission. Such a profile for, e.g., Google Chrome, would look like the
following, and it would be located within the /etc/apparmor.d/chrome
file:

    abi <abi/4.0>,
    include <tunables/global>

    /opt/google/chrome/chrome flags=(unconfined) {
      userns,

      # Site-specific additions and overrides. See local/README for details.
      include if exists <local/chrome>
    }

Alternatively, a complete AppArmor profile for the application can be
created (see the AppArmor documentation).
- Launch your application in a way that doesn't use unprivileged user
namespaces, e.g. google-chrome-stable --no-sandbox. However, since this
disables the use of an internal security feature within the application,
this is not recommended. Use the unconfined profile mode
described above instead.
- Disable this restriction on the entire system for one boot by
executing echo 0 | sudo tee
/proc/sys/kernel/apparmor_restrict_unprivileged_userns. This setting is
lost on reboot. This is similar to the previous behaviour, but it does not
mitigate against kernel exploits that abuse the unprivileged user
namespaces feature.
- Disable this restriction using a persistent setting by adding a new
file (/etc/sysctl.d/60-apparmor-namespace.conf) with the following
contents:
kernel.apparmor_restrict_unprivileged_userns=0
Reboot. This is similar to the previous behaviour, but it does not
mitigate against kernel exploits that abuse the unprivileged user
namespaces feature.
---
Alexander
On Fri, 19 Apr 2024 at 17:44:35 +0200, Solar Designer wrote:
> I guess
> systemd's PrivateNetwork services generally don't configure networking
> (they just give up network access), so would continue to work even with
> capabilities disallowed?
I can't speak for systemd's PrivateNetwork services, but for the
bubblewrap use-cases that I described elsewhere in the thread (Flatpak,
libgnome-desktop etc.), `bwrap --unshare-net` does bring up the "lo"
interface with address 127.0.0.1 and a route to 127.0.0.0/8 before it
relinquishes its capabilities and execs the sandboxed program.
Presumably this is because it's common for ordinary user-space applications
to assume that they can "talk to themselves" via loopback, even if there is
no external connectivity.
smcv
On Wed, Apr 17, 2024 at 09:52:10AM GMT, Georgia Garcia wrote:
> I just wanted to add that in the Ubuntu Noble Numbat release we are
> using AppArmor to restrict unprivileged user namespaces.
> Applications that don't have an AppArmor profile will use a default
> profile which denies the use of capabilities within the user
> namespace. Applications that need to use capabilities will have to
> be confined by a profile. Since we understand that creating an
> AppArmor profile might not be a trivial task for large programs, we
> introduced the "unconfined" flag which makes the profile act as if
> it were unconfined from the perspective of AppArmor, allowing all
> operations.
> There are more details here:
> https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890#unprivileged-user-namespace-restrictions-13
I wonder if this (at least the kernel part of it) is already in the
latest PopOS rolling updates? I see some nodes in /proc/sys/kernel
that look very related.
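One way to check is to list the userns-related nodes and their values. This is a minimal sketch: the function name and the optional directory argument are made up for illustration, and it simply globs for "userns" in the node name, which would catch both kernel.apparmor_restrict_unprivileged_userns and kernel.unprivileged_userns_clone where they exist.

```shell
# Print every readable sysctl node whose name contains "userns",
# together with its current value.
# The optional argument (hypothetical) overrides /proc/sys/kernel.
list_userns_sysctls() {
    dir="${1:-/proc/sys/kernel}"
    for f in "$dir"/*userns*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -r "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

list_userns_sysctls
```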
--
Ian
On Fri, Apr 19, 2024 at 06:25:02PM +0100, Simon McVittie wrote:
> On Fri, 19 Apr 2024 at 17:44:35 +0200, Solar Designer wrote:
> > I guess
> > systemd's PrivateNetwork services generally don't configure networking
> > (they just give up network access), so would continue to work even with
> > capabilities disallowed?
>
> I can't speak for systemd's PrivateNetwork services, but for the
> bubblewrap use-cases that I described elsewhere in the thread (Flatpak,
> libgnome-desktop etc.), `bwrap --unshare-net` does bring up the "lo"
> interface with address 127.0.0.1 and a route to 127.0.0.0/8 before it
> relinquishes its capabilities and execs the sandboxed program.
>
> Presumably this is because it's common for ordinary user-space applications
> to assume that they can "talk to themselves" via loopback, even if there is
> no external connectivity.
Thank you. So with my idea/proposal, someone using these tools on a
desktop system would need to set the max depth to 1. That would leave
the kernel's full attack surface exposed on the host system, but not to
sandboxed programs because those would run with capabilities already
relinquished (per what you write above) and would not be able to regain
them by creating a nested namespace. Sounds like a worthwhile feature?
Does bubblewrap maybe already relinquish also the ability to create
nested namespaces, which it probably could do with seccomp? I guess not
as that would break its usage to sandbox programs like Firefox that also
create a namespace for their own sandbox. With namespace creation still
allowed but capabilities ineffective, I guess such programs maybe could
still work if they don't need to configure networking in the sandbox.
Alexander
On Sat, 20 Apr 2024 at 21:33:07 +0000, Jordan Glover wrote:
> On Saturday, April 20th, 2024 at 8:12 PM, Solar Designer wrote:
> > Does bubblewrap maybe already relinquish also the ability to create
> > nested namespaces, which it probably could do with seccomp?
bubblewrap doesn't rely on seccomp itself, because linking to libseccomp
and compiling seccomp programs would be a concerning amount of attack
surface for a program that is optionally setuid root, but it has options
that can be used to make it receive a precompiled seccomp program as a
binary blob and submit it to the kernel. The intention is that a larger
framework like Flatpak, WebKitGTK or similar, which isn't setuid, can
supply a seccomp program if it wants to.
The design of bubblewrap is that it's a toolkit for making sandboxes,
but is not, itself, a ready-made sandboxing solution - so the sandbox can
be secure but limited, insecure but versatile, or anywhere in between,
and larger frameworks like Flatpak are responsible for designing their
own security model and constructing a bubblewrap command-line that will
implement it.
> bubblewrap has --disable-userns option
...
> Flatpak uses this (or seccomp filter) to block nested namespaces
Development versions of Flatpak (1.15.6+) use both seccomp and
--disable-userns: possibly redundant, but it's better to be safe,
and Flatpak wants to disable some other syscalls with seccomp anyway
(for example access to the kernel keyring).
Older versions of Flatpak completely relied on seccomp, because
--disable-userns is a recent addition to bubblewrap.
bubblewrap also uses PR_SET_NO_NEW_PRIVS (this is hard-coded and not
optional), but creating a new user namespace in which you have all
capabilities is not considered to be a new privilege for the purposes
of that prctl, so that doesn't help us here.
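For anyone wanting to confirm that the flag is set inside a sandbox, the kernel exposes it as the NoNewPrivs field of /proc/<pid>/status. A minimal sketch follows; the function name and the optional file argument are made up for illustration.

```shell
# Print the NoNewPrivs flag (0 or 1) from a /proc/<pid>/status style file.
# With no argument it reads /proc/self/status, which reflects the flag
# inherited by the process doing the reading.
no_new_privs_flag() {
    status_file="${1:-/proc/self/status}"
    awk '$1 == "NoNewPrivs:" { print $2 }' "$status_file" 2>/dev/null
}

no_new_privs_flag
```

Run inside a bubblewrap sandbox, this should print 1 if the flag is set as described above. As noted, the flag says nothing about whether nested user namespaces can still be created.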
> For this reason firefox own sandbox doesn't use namespaces in flatpak
Flatpak does have a feature (the somewhat misleadingly named
"sub-sandboxes") where a sandboxed program can ask Flatpak to create a
new user namespace on its behalf, in parallel with the one it uses for
the original program. This can either be done with the same restrictions
as the original program and therefore no effective security boundary
between the original program and the sub-sandbox (Steam does this, to
run parts of itself with a different /usr), or with tighter restrictions
(the original purpose of this feature).
But, as noted on the Firefox bug, this implies some IPC, a new user
namespace and an execve(), so it's higher-overhead than just fork()ing:
if the original program wants to share state with the new program,
it needs to do that explicitly, perhaps by using shared memory or an
AF_UNIX-based protocol like D-Bus. A stronger security boundary means
more effort needs to be put into crossing that boundary safely and with
the desired performance characteristics, so it's a trade-off with no
single correct answer.
smcv
On Sat, 20 Apr 2024 at 20:12:11 +0200, Solar Designer wrote:
> So with my idea/proposal, someone using these tools on a
> desktop system would need to set the max depth to 1. That would leave
> the kernel's full attack surface exposed on the host system, but not to
> sandboxed programs because those would run with capabilities already
> relinquished (per what you write above) and would not be able to regain
> them by creating a nested namespace.
I believe that's all correct. If someone prototypes this, a way to verify
it would be, minimally:

    $ ip addr ls
    (should show all your IP addresses)
    $ bwrap --dev-bind / / -- ip addr ls
    (same output)
    $ bwrap --dev-bind / / --unshare-net -- ip addr ls
    (should show only lo with 127.0.0.1 and ::1)

or for a "whole stack" version with Flatpak, install any random Flatpak
app such as org.gnome.Recipes and do:

    $ flatpak run --unshare=network org.gnome.Recipes
    # or to explore the sandbox environment interactively
    $ flatpak run --command=bash --unshare=network org.gnome.Recipes

For simplicity, the use of bwrap shown above is not a security boundary:
it doesn't make any attempt to restrict access to the host filesystem
like e.g. Flatpak does. bwrap command-lines that implement a meaningful
security boundary, while still providing useful functionality, are much
longer than that!
> Sounds like a worthwhile feature?
I'm not sure. As with most security designs, it depends on your security
model.
To protect a trusted user from their own sandboxed apps, it should be
unnecessary/redundant for Flatpak users, because Flatpak already doesn't
let apps inherit CAP_NET_ADMIN or create new user namespaces - but it
could be useful for other sandboxed app frameworks, or as a second line
of defence against Flatpak not providing the boundary that it aims to.
To protect the OS and other users from a malicious or compromised
user account using kernel vulnerabilities to elevate privileges, it's
insufficient - if that's your security model then there isn't going to be
any substitute for either trusting the kernel to make CAP_NET_ADMIN in a
non-init user namespace be safe, or trusting a component like bwrap to
impose restrictions that its caller is not allowed to bypass.
Of course, any time we say things like "trusting a component to impose
restrictions that its caller is not allowed to bypass", we get into
the same territory as setuid/setgid/setcap, in terms of needing to
prevent LD_PRELOAD, LD_LIBRARY_PATH and similar ways to influence the
trusted component's behaviour from the outside - which is likely to be
impossible if the kernel isn't helping to defang those aspects of the
execution environment by flagging the process as AT_SECURE, either in
core kernel code or in an LSM like AppArmor.
I believe the kernel maintainers' position is that CAP_NET_ADMIN in
a non-init userns is meant to be safe for untrusted code to have, so
auditing and if necessary hardening the kernel's use of CAP_NET_ADMIN
might well be better-received upstream than trying to limit which parts
of user-space can obtain it.
smcv
On Sat, Apr 20, 2024 at 09:33:07PM +0000, Jordan Glover wrote:
> bubblewrap has --disable-userns option which prevents creation of nested namespaces (from manpage):
>
> --disable-userns
> Prevent the process in the sandbox from creating further user namespaces, so that it cannot rearrange the filesystem namespace or do other more complex namespace modification. This is currently implemented by setting the user.max_user_namespaces sysctl to 1, and then entering a nested user namespace which is unable to raise that limit in the outer namespace. This option requires --unshare-user, and doesn't work in the setuid version of bubblewrap.
>
> Flatpak uses this (or a seccomp filter) to block nested namespaces, as these can bypass its security design. For this reason Firefox's own sandbox doesn't use namespaces in Flatpak, see https://bugzilla.mozilla.org/show_bug.cgi?id=1756236
Thanks, I didn't expect it was this advanced already.
In what exact way would nested namespaces bypass the security design of
Flatpak? Is this about the kernel's attack surface exposed by
capabilities in a namespace or something else? I guess capabilities are
also dropped in the nested namespace?
After reviewing some kernel code, I have doubts as to how effective the
dropping of capabilities in a namespace actually is.
security/commoncap.c: cap_capable() includes this:
    /*
     * The owner of the user namespace in the parent of the
     * user namespace has all caps.
     */
    if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
        return 0;
This check is only reached when cap_capable() is called for a target
namespace other than the one the credentials are from. However, such uses
do exist, e.g. via Netlink, which would expose e.g. Netfilter:
net/netlink/af_netlink.c:
    /**
     * netlink_net_capable - Netlink network namespace message capability test
     * @skb: socket buffer holding a netlink command from userspace
     * @cap: The capability to use
     *
     * Test to see if the opener of the socket we received the message
     * from had when the netlink socket was created and the sender of the
     * message has the capability @cap over the network namespace of
     * the socket we received the message from.
     */
    bool netlink_net_capable(const struct sk_buff *skb, int cap)
    {
        return netlink_ns_capable(skb, sock_net(skb->sk)->user_ns, cap);
    }
So I worry whether even with all namespaces in a sandbox having dropped
capabilities, an attack can still be arranged (with a pair of namespaces
one nested in the other) where a task effectively "has all caps" for a
dangerous operation like configuring Netfilter due to it hitting code
paths like this, which bypass capability bit checks.
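To make the concern concrete, here is a minimal user-space sketch of the walk that cap_capable() performs, written in Python. It is a simplified model, not kernel code: the class names, the string capability names and the example uids are made up for illustration, and it ignores LSM hooks, uid mappings and everything else outside that one loop.

```python
"""Simplified model of the namespace walk in cap_capable() (illustrative only)."""


class UserNS:
    """A user namespace: who created it (owner kuid) and where it is nested."""

    def __init__(self, owner, parent=None):
        self.owner = owner
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1


class Cred:
    """Only the credential fields that the capability walk looks at."""

    def __init__(self, euid, user_ns, cap_effective=()):
        self.euid = euid
        self.user_ns = user_ns
        self.cap_effective = frozenset(cap_effective)


def cap_capable(cred, targ_ns, cap):
    """Return True if cred holds cap over targ_ns, following the kernel's walk."""
    ns = targ_ns
    while True:
        # Same namespace as the credentials: the capability bits decide.
        if ns is cred.user_ns:
            return cap in cred.cap_effective
        # Target is not below the credentials' namespace: no privilege.
        if ns.level <= cred.user_ns.level:
            return False
        # Owner of a direct child namespace "has all caps": no bit check here.
        if ns.parent is cred.user_ns and ns.owner == cred.euid:
            return True
        # Otherwise retry one level up: caps in a parent cover its children.
        ns = ns.parent


init_ns = UserNS(owner=0)
sandbox_ns = UserNS(owner=1000, parent=init_ns)    # level 1
nested_ns = UserNS(owner=1000, parent=sandbox_ns)  # level 2, same owner

# A task left in sandbox_ns with every capability dropped.
task = Cred(euid=1000, user_ns=sandbox_ns, cap_effective=())

print(cap_capable(task, sandbox_ns, "CAP_NET_ADMIN"))  # bit check in own ns
print(cap_capable(task, nested_ns, "CAP_NET_ADMIN"))   # parent-owner shortcut
```

In this model the first call is denied because the effective set is empty, while the second succeeds purely through the parent-owner rule. Whether a real sandbox can get a nested namespace created in the first place is a separate question.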
The above finding may be a reason for us to prefer making capabilities
in a namespace ineffective vs. dropping capabilities. In context of my
idea/proposal for a new sysctl, it could be better for it to work as I
had described, overriding security_capable() return, instead of e.g.
hooking return of create_user_ns() and dropping new cred's capabilities.
I hope the Ubuntu/AppArmor solution is also safe in this respect, as it
sounds like it similarly makes capabilities ineffective instead of
dropping them.
Alexander
On Sun, Apr 21, 2024 at 01:30:49PM +0100, Simon McVittie wrote:
> On Sat, 20 Apr 2024 at 20:12:11 +0200, Solar Designer wrote:
> > So with my idea/proposal, someone using these tools on a
> > desktop system would need to set the max depth to 1. That would leave
> > the kernel's full attack surface exposed on the host system, but not to
> > sandboxed programs because those would run with capabilities already
> > relinquished (per what you write above) and would not be able to regain
> > them by creating a nested namespace.
>
> I believe that's all correct. If someone prototypes this, a way to verify
> it would be, minimally:
>
> $ ip addr ls
> (should show all your IP addresses)
> $ bwrap --dev-bind / / -- ip addr ls
> (same output)
> $ bwrap --dev-bind / / --unshare-net -- ip addr ls
> (should show only lo with 127.0.0.1 and ::1)
>
> or for a "whole stack" version with Flatpak, install any random Flatpak
> app such as org.gnome.Recipes and do:
>
> $ flatpak run --unshare=network org.gnome.Recipes
>
> # or to explore the sandbox environment interactively
> $ flatpak run --command=bash --unshare=network org.gnome.Recipes
>
> For simplicity, the use of bwrap shown above is not a security boundary:
> it doesn't make any attempt to restrict access to the host filesystem
> like e.g. Flatpak does. bwrap command-lines that implement a meaningful
> security boundary, while still providing useful functionality, are much
> longer than that!
Thank you!
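If someone wants to script the check quoted above rather than eyeball it, something along these lines might do. It is only a sketch: it assumes an iproute2 new enough to support JSON output (`ip -j addr`), and the helper names are hypothetical.

```python
import json
import subprocess


def non_loopback_addresses(ip_json):
    """Return (ifname, address) pairs for addresses on non-loopback links."""
    found = []
    for link in json.loads(ip_json):
        # Loopback links carry the LOOPBACK flag in iproute2's JSON output.
        if "LOOPBACK" in link.get("flags", []):
            continue
        for addr in link.get("addr_info", []):
            if "local" in addr:
                found.append((link["ifname"], addr["local"]))
    return found


def visible_addresses(prefix=()):
    """Run `ip -j addr`, optionally behind a wrapper command, and parse it."""
    argv = [*prefix, "ip", "-j", "addr"]
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    return non_loopback_addresses(out)
```

Calling `visible_addresses(["bwrap", "--dev-bind", "/", "/", "--unshare-net", "--"])` should then return an empty list if the sandboxed process sees only lo, and `visible_addresses()` on the host should list the usual addresses.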
> > Sounds like a worthwhile feature?
>
> I'm not sure. As with most security designs, it depends on your security
> model.
My priorities are:
1. Systems and especially servers that do not use containers. They
nevertheless may use namespaces in some systemd services by default,
which I'd like to keep working seamlessly. On such systems, it should
be possible to reduce the kernel's exposure to a level we had prior to
unprivileged user namespaces. Right now, upstream offers only the max_*
settings, which break those systemd services (either just the sandboxing
aspect or the services entirely).
I hope my proposed setting with a depth of 0 (capabilities ineffective
in any namespace, or maybe in any created other than by host root) would
work for this.
With luck, the same might even work for Firefox if it does not need
capabilities (but my priority is server systems, so that would be a
pleasant extra, not a requirement).
2. Server and development systems that use containers, such as Docker
and Kubernetes. I guess for them a depth of 1 would commonly be needed,
but we'd still protect the kernel from attacks by nested containers
(intentional or attacker-created). I suppose a compromised task running
as non-root in a container (I mean non-root even from the container's
perspective) would no longer have capabilities and due to max depth
would not be able to usefully gain them by creating a nested namespace.
With luck, some setups like this could even work with a max depth of 0,
if we allow capabilities to remain effective when the container is
started by host root.
3. Desktop systems, Flatpak, etc. If we can provide useful settings and
hardening for these as well, that's a great bonus.
Overall, my thinking is that someone using containers may be more
concerned about attacks from within containers than from the host.
Similarly, someone using nested containers may be most concerned about
attacks from the deepest level. Ideally, we'd protect against attacks
from all levels, but since we can't do that easily, let's at least
protect from some - hopefully, the most relevant ones.
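As a rough illustration of the depth semantics described above, the decision could be modelled like this. No such sysctl exists; the function name, its parameters and the exact meaning of "created by host root" (modelled here as a first-level namespace owned by kuid 0) are all assumptions made for the sketch.

```python
def caps_effective(ns_level, ns_owner_kuid, max_depth, host_root_bypass=False):
    """Model of a hypothetical knob: are capabilities honoured in this userns?

    ns_level is 0 for the initial user namespace, 1 for its children, etc.
    max_depth is the deepest level at which capabilities stay effective.
    """
    # The initial namespace (level 0) is always within any max_depth >= 0.
    if ns_level <= max_depth:
        return True
    # Optional bypass: a first-level namespace created by host root.
    if host_root_bypass and ns_level == 1 and ns_owner_kuid == 0:
        return True
    return False


# Priority 1: no containers, depth 0, systemd services keep their namespaces
# but capabilities inside them are ineffective.
print(caps_effective(ns_level=1, ns_owner_kuid=1000, max_depth=0))
# Priority 2: containers at depth 1 work, nested ones gain nothing.
print(caps_effective(ns_level=1, ns_owner_kuid=1000, max_depth=1))
print(caps_effective(ns_level=2, ns_owner_kuid=1000, max_depth=1))
```

With the optional bypass enabled, a container started by host root would keep effective capabilities even at a max depth of 0, which is the "with luck" case in point 2.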
> To protect a trusted user from their own sandboxed apps, it should be
> unnecessary/redundant for Flatpak users, because Flatpak already doesn't
> let apps inherit CAP_NET_ADMIN or create new user namespaces - but it
> could be useful for other sandboxed app frameworks, or as a second line
> of defence against Flatpak not providing the boundary that it aims to.
>
> To protect the OS and other users from a malicious or compromised
> user account using kernel vulnerabilities to elevate privileges, it's
> insufficient - if that's your security model then there isn't going to be
> any substitute for either trusting the kernel to make CAP_NET_ADMIN in a
> non-init user namespace be safe, or trusting a component like bwrap to
> impose restrictions that its caller is not allowed to bypass.
Yes, with depth >= 1 allowed, such as to use Flatpak, there would be no
protection from host users.
> Of course, any time we say things like "trusting a component to impose
> restrictions that its caller is not allowed to bypass", we get into
> the same territory as setuid/setgid/setcap, in terms of needing to
> prevent LD_PRELOAD, LD_LIBRARY_PATH and similar ways to influence the
> trusted component's behaviour from the outside - which is likely to be
> impossible if the kernel isn't helping to defang those aspects of the
> execution environment by flagging the process as AT_SECURE, either in
> core kernel code or in an LSM like AppArmor.
To have a component impose restrictions, the feature would first need to
be made unavailable directly. Which basically means no _unprivileged_
user namespaces, and bwrap or such started as SUID root - in which case
it would have AT_SECURE.
That's not a setup I was thinking of, but now that you bring it up this
shows how upstream Linux is lacking support for it - this needs a
separate knob to control _unprivileged_ user namespaces like Debian has.
My proposed knob could also satisfy this need, if we do include a bypass
for namespaces created by host root.
> I believe the kernel maintainers' position is that CAP_NET_ADMIN in
> a non-init userns is meant to be safe for untrusted code to have, so
> auditing and if necessary hardening the kernel's use of CAP_NET_ADMIN
> might well be better-received upstream than trying to limit which parts
> of user-space can obtain it.
This seems to be the case, but those activities are orthogonal. We can
try and have both.
Alexander
On Mon, Apr 22, 2024 at 02:33:56PM +0000, Jordan Glover wrote:
> On Sunday, April 21st, 2024 at 10:06 PM, Solar Designer wrote:
>
> > In what exact way would nested namespaces bypass the security design of
> > Flatpak? Is this about the kernel's attack surface exposed by
> > capabilities in a namespace or something else? I guess capabilities are
> > also dropped in the nested namespace?
>
> In Flatpak, apps in the container communicate with the host through portals[1] using D-Bus.
> Portals identify a particular app through its unique appid (e.g. "org.mozilla.firefox"
> for Firefox) and grant permissions accordingly. The appid is read from
> /.flatpak-info, which exists inside the container and is immutable there. If namespaces
> were available inside the sandbox, then a malicious app could leverage a mount namespace
> to mount a crafted /.flatpak-info containing arbitrary data and lie to the portal
> about its appid - it could tell the portal that it's org.mozilla.firefox when it isn't.
>
> [1] https://github.com/flatpak/xdg-desktop-portal
>
> Jordan
Why is the appid read from /.flatpak-info, instead of having the flatpak
process that spawned the container pass the info to the dbus proxy along
with the FD used to communicate with the container?
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
On Mon, 22 Apr 2024 at 18:10:27 -0400, Demi Marie Obenour wrote:
> Why is the appid read from /.flatpak-info, instead of having the flatpak
> process that spawned the container pass the info to the dbus proxy along
> with the FD used to communicate with the container?
I didn't design this mechanism, so I can't say anything authoritative
about the motivations of its initial designer.
(I would appreciate it if this thread can avoid being derailed into asking
me "why can't you just?" about design decisions that were already made,
by people who weren't me.)
Some factors that may have been relevant:
D-Bus is not the only AF_UNIX-based protocol that can be used by sandboxed
apps to communicate with peers outside the sandbox: some others (subject
to suitable --socket and --filesystem permissions) include X11, Wayland,
PulseAudio, Pipewire, or in principle anything that exposes an AF_UNIX
socket in a well-known location. D-Bus is the only one of these that
currently uses a proxy.
The fact that a D-Bus proxy is necessary is not ideal, and ideally the
message bus would be able to do the firewall-like filtering of messages
itself (subject to Someone™ having enough time to design and implement
that, of course). If the design of Flatpak's interactions with portals
via D-Bus "baked in" an assumption that there will always be a trusted
proxy in the middle, which could be asked for more information about the
connection, then that would prevent us from being able to replace the
proxy with a suitably enhanced message bus at some point in the future.
There is already no D-Bus proxy used if the app has been given direct
access to the session bus - which makes that particular app effectively
non-sandboxed and part of the trusted computing base, so it would be
Very Bad for such an app to be compromised or malicious, but it's still
desirable to be able to query the identity of those apps in the same
way we would for an app that has been effectively sandboxed.
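For readers unfamiliar with the file being discussed, here is a small sketch of what "reading the appid from /.flatpak-info" amounts to. It assumes the file is a keyfile with an [Application] group containing a name= key. The function name is hypothetical, and how a portal locates the file for a given peer is deliberately left out.

```python
import configparser


def read_app_id(flatpak_info_text):
    """Return the app ID claimed by a .flatpak-info keyfile, or None."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keyfile keys are case-sensitive
    parser.read_string(flatpak_info_text)
    # The reader has no way to tell a genuine file from a crafted one:
    # it simply reports whatever the [Application] group says.
    return parser.get("Application", "name", fallback=None)


genuine = "[Application]\nname=org.gnome.Recipes\n"
crafted = "[Application]\nname=org.mozilla.firefox\n"

print(read_app_id(genuine))
print(read_app_id(crafted))
```

The point of the example is that the answer is only as trustworthy as the file, which is why the app must not be able to replace what is visible at that path.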
As discussed in this thread, creating new namespaces is a relatively
scary attack surface to be giving to the sort of semi-trusted apps that
you would typically want to sandbox with Flatpak, so even if the integrity
of /.flatpak-info wasn't being used as a security property, we would
probably still want to deny that ability to most Flatpak apps anyway
(on the same basis that Flatpak already uses seccomp to prevent various
more obscure or large-attack-surface syscalls by most sandboxed apps,
for example denying ptrace unless the app has --allow=devel, even though
in principle allowing ptrace "should" be safe).
smcv
On 4/21/24 13:06, Solar Designer wrote:
> On Sat, Apr 20, 2024 at 09:33:07PM +0000, Jordan Glover wrote:
>> bubblewrap has a --disable-userns option which prevents creation of nested namespaces (from the manpage):
>>
>> --disable-userns
>> Prevent the process in the sandbox from creating further user namespaces, so that it cannot rearrange the filesystem namespace or do other more complex namespace modification. This is currently implemented by setting the user.max_user_namespaces sysctl to 1, and then entering a nested user namespace which is unable to raise that limit in the outer namespace. This option requires --unshare-user, and doesn't work in the setuid version of bubblewrap.
>>
>> Flatpak uses this (or a seccomp filter) to block nested namespaces, as these can bypass its security design. For this reason Firefox's own sandbox doesn't use namespaces in Flatpak, see https://bugzilla.mozilla.org/show_bug.cgi?id=1756236
>
> Thanks, I didn't expect it was this advanced already.
>
> In what exact way would nested namespaces bypass the security design of
> Flatpak? Is this about the kernel's attack surface exposed by
> capabilities in a namespace or something else? I guess capabilities are
> also dropped in the nested namespace?
>
> After reviewing some kernel code, I have doubts as to how effective the
> dropping of capabilities in a namespace actually is.
>
> security/commoncap.c: cap_capable() includes this:
>
>     /*
>      * The owner of the user namespace in the parent of the
>      * user namespace has all caps.
>      */
>     if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
>         return 0;
>
> this check is only reached when cap_capable() is called for a target
> namespace other than one the credentials are from. However, such uses
> do exist, e.g. via Netlink, which would expose e.g. Netfilter:
>
> net/netlink/af_netlink.c:
>
>     /**
>      * netlink_net_capable - Netlink network namespace message capability test
>      * @skb: socket buffer holding a netlink command from userspace
>      * @cap: The capability to use
>      *
>      * Test to see if the opener of the socket we received the message
>      * from had when the netlink socket was created and the sender of the
>      * message has the capability @cap over the network namespace of
>      * the socket we received the message from.
>      */
>     bool netlink_net_capable(const struct sk_buff *skb, int cap)
>     {
>         return netlink_ns_capable(skb, sock_net(skb->sk)->user_ns, cap);
>     }
>
> So I worry whether even with all namespaces in a sandbox having dropped
> capabilities, an attack can still be arranged (with a pair of namespaces
> one nested in the other) where a task effectively "has all caps" for a
> dangerous operation like configuring Netfilter due to it hitting code
> paths like this, which bypass capability bit checks.
>
> The above finding may be a reason for us to prefer making capabilities
> in a namespace ineffective vs. dropping capabilities. In context of my
> idea/proposal for a new sysctl, it could be better for it to work as I
> had described, overriding security_capable() return, instead of e.g.
> hooking return of create_user_ns() and dropping new cred's capabilities.
>
> I hope the Ubuntu/AppArmor solution is also safe in this respect, as it
> sounds like it similarly makes capabilities ineffective instead of
> dropping them.
>
The AppArmor solution is flexible, allowing the policy author to decide
what is done. The namespace creation can be allowed, denied or the profile
can be transitioned on namespace creation. So the behavior can be tuned
selectively per application, and based on whether it is in a user namespace
or not.
The 24.04 Ubuntu behavior is for "unconfined" applications to transition
to a profile that denies further creation of user namespaces and denies
capabilities within the user namespace.
There are profiles for known applications allowing them to use user
namespaces. Most of these just allow the user namespace and maybe a
specific capability, currently without transitioning the user namespace
to tighter confinement. Ideally the policy would do more, and there are
plans to improve the policy around this set of applications.
Bubblewrap and unshare have additional behaviors around restricting what
the applications can do, as they also take advantage of the exec barrier.
Applications that embed bubblewrap to set up their sandbox, e.g.
Steam's pressure-vessel, can have their own profiles that control
bubblewrap separately from the system bubblewrap policy.
It's still early days, and the rollout/policy has been mostly to set
a default of allowing user namespaces but with no capabilities, and then
to provide a default, very open policy for applications that have been
found to need them, with plans to tighten that policy on a per-application
basis in the future.
AppImages and containers that users expect to be able to run from their
home or other user-writable locations are the big issue atm. They get
the default behavior of being allowed to create user namespaces without
any capabilities, but if they require more, we are requiring privileged
user intervention to individually enable running these applications.
We have found application behavior around restricting user namespaces
to be very inconsistent. E.g. qtwebkit will crash if you deny creation
of the user namespace, but will gracefully fall back to not using
user namespaces in its sandbox if it's denied capabilities within the
user namespace during sandbox setup. Firefox, on the other hand, crashes
when user namespaces or capabilities are denied.
On 4/19/24 12:01, [email protected] wrote:
> On Wed, Apr 17, 2024 at 09:52:10AM GMT, Georgia Garcia wrote:
>
>> I just wanted to add that in the Ubuntu Noble Numbat release we are
>> using AppArmor to restrict unprivileged user namespaces.
>
>> Applications that don't have an AppArmor profile will use a default
>> profile which denies the use of capabilities within the user
>> namespace. Applications that need to use capabilities will have to
>> be confined by a profile. Since we understand that creating an
>> AppArmor profile might not be a trivial task for large programs, we
>> introduced the "unconfined" flag which makes the profile act as if
>> it were unconfined from the perspective of AppArmor, allowing all
>> operations.
>
>> There are more details here:
>
>> https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890#unprivileged-user-namespace-restrictions-13
>
> I wonder if this (at least the kernel part of it) is already in the
> latest PopOS rolling updates? I see some nodes in /proc/sys/kernel
> that look very related.
>
Partially. The ability to straight up deny user namespace creation is
in the kernel already. The ability to transition the profile and the
default behavior for unconfined are not. In Ubuntu the behavior for
the unconfined profile is hard-coded, as there is still some work to be
done around allowing this to be replaced easily in policy (it's
possible, but has some limitations/costs that were not acceptable).
Once the work to make replacing unconfined easy is done, that will be
upstreamed and the hard-coded behavior will get dropped.
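For anyone wanting to see which of the knobs mentioned in this thread a given kernel exposes, a quick sketch follows. The first three names come from the thread. The last one, kernel.apparmor_restrict_unprivileged_userns, is the name I understand Ubuntu kernels to use and should be treated as an assumption, as should the helper itself.

```python
from pathlib import Path

# Paths relative to /proc/sys; absent files simply mean "not on this kernel".
KNOBS = (
    "user/max_user_namespaces",
    "user/max_net_namespaces",
    "kernel/unprivileged_userns_clone",              # Debian-derived kernels
    "kernel/apparmor_restrict_unprivileged_userns",  # assumed Ubuntu name
)


def read_userns_knobs(proc_sys="/proc/sys"):
    """Map sysctl name -> current value as a string, or None if unavailable."""
    result = {}
    for knob in KNOBS:
        name = knob.replace("/", ".")
        try:
            result[name] = (Path(proc_sys) / knob).read_text().strip()
        except OSError:
            # Missing or unreadable: report it rather than fail.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_userns_knobs().items():
        print(f"{name} = {value if value is not None else '(not present)'}")
```

The proc_sys parameter exists only so the function can be pointed at a test directory. On a real system the default is what you want.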
On Mon, Apr 15, 2024 at 11:33:32PM +0000, Jordan Glover wrote:
> On Monday, April 15th, 2024 at 5:47 PM, Simon McVittie wrote:
>
> > On Mon, 15 Apr 2024 at 17:13:09 +0200, Solar Designer wrote:
> >
> > I am not a kernel developer, so this is second-hand information; but I
> > believe the implementation of kernel.unprivileged_userns_clone used in
> > Debian (and subsequently copied from Debian by various other distros)
> > is derived from patches that were already proposed and rejected upstream,
> > so the feeling was that trying again to upstream that feature would be a
> > waste of time and upstream goodwill, because it would just get rejected
> > again by the same kernel maintainer.
> >
>
> Perhaps it's best to link old article covering the situation back then:
> https://lwn.net/Articles/673597/
>
> And yes, the current kernel maintainers are the biggest proponents of unpriv
> userns, so any restriction is a rather impossible sell.
Landlock [1] could be extended to control user namespace creation the
same way we will be able to deny socket creation [2]. I'll definitely
consider any relevant sandboxing feature such as user namespace and
fine-grained capability control (that cannot already be done with
existing kernel features). Contributions are welcome!
[1] https://docs.kernel.org/userspace-api/landlock.html
[2] https://github.com/landlock-lsm/linux/issues/6
Regards,
Mickaël