Reply-To: oss-security@lists.openwall.com
Date: Sun, 21 Apr 2024 22:47:12 +0200
From: Solar Designer <solar@openwall.com>
To: oss-security@lists.openwall.com
Message-ID: <20240421204712.GA17034@openwall.com>
References: <20240414190855.GA12716@openwall.com> <354b913bc1c154c1e3a2fc34ed8ed6b0d4641f11.camel@canonical.com> <20240419154435.GA7046@openwall.com> <ZiKo7shztRpgvAIC@remnant.pseudorandom.co.uk> <20240420181211.GA12463@openwall.com> <ZiUG-cMNJgFl-zCO@remnant.pseudorandom.co.uk>
In-Reply-To: <ZiUG-cMNJgFl-zCO@remnant.pseudorandom.co.uk>
Subject: Re: [oss-security] Linux: Disabling network namespaces

On Sun, Apr 21, 2024 at 01:30:49PM +0100, Simon McVittie wrote:
> On Sat, 20 Apr 2024 at 20:12:11 +0200, Solar Designer wrote:
> > So with my idea/proposal, someone using these tools on a
> > desktop system would need to set the max depth to 1.  That would leave
> > the kernel's full attack surface exposed on the host system, but not to
> > sandboxed programs because those would run with capabilities already
> > relinquished (per what you write above) and would not be able to regain
> > them by creating a nested namespace.
> 
> I believe that's all correct. If someone prototypes this, a way to verify
> it would be, minimally:
> 
>     $ ip addr ls
>     (should show all your IP addresses)
>     $ bwrap --dev-bind / / -- ip addr ls
>     (same output)
>     $ bwrap --dev-bind / / --unshare-net -- ip addr ls
>     (should show only lo with 127.0.0.1 and ::1)
> 
> or for a "whole stack" version with Flatpak, install any random Flatpak
> app such as org.gnome.Recipes and do:
> 
>     $ flatpak run --unshare=network org.gnome.Recipes
> 
>       # or to explore the sandbox environment interactively
>     $ flatpak run --command=bash --unshare=network org.gnome.Recipes
> 
> For simplicity, the use of bwrap shown above is not a security boundary:
> it doesn't make any attempt to restrict access to the host filesystem
> like e.g. Flatpak does. bwrap command-lines that implement a meaningful
> security boundary, while still providing useful functionality, are much
> longer than that!

Thank you!
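
The minimal check quoted above could be scripted roughly as follows.  This
is only a sketch: the helper names are hypothetical (not part of bwrap or
Flatpak), it looks at interface names only, not addresses, and it assumes
it is run inside the sandbox in place of "ip addr ls".

```python
import socket


def visible_interfaces():
    """Return the names of the network interfaces visible to this process."""
    return [name for _index, name in socket.if_nameindex()]


def only_loopback(names):
    """True if lo is the only interface, as expected after --unshare-net."""
    return list(names) == ["lo"]
```

On most setups, only_loopback(visible_interfaces()) should be False on a
networked host and True when run under "bwrap --dev-bind / / --unshare-net".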

> > Sounds like a worthwhile feature?
> 
> I'm not sure. As with most security designs, it depends on your security
> model.

My priorities are:

1. Systems and especially servers that do not use containers.  They
nevertheless may use namespaces in some systemd services by default,
which I'd like to keep working seamlessly.  On such systems, it should
be possible to reduce the kernel's exposure to a level we had prior to
unprivileged user namespaces.  Right now, the only settings upstream has,
max_*, break those systemd services (either just their sandboxing or the
services entirely).  I hope my proposed setting with a depth of 0
(capabilities ineffective in any namespace, or maybe in any namespace not
created by host root) would work for this.

With luck, the same might even work for Firefox if it does not need
capabilities (but my priority is server systems, so that would be a
pleasant extra, not a requirement).

2. Server and development systems that use containers, such as Docker
and Kubernetes.  I guess for them a depth of 1 would commonly be needed,
but we'd still protect the kernel from attacks by nested containers
(intentional or attacker-created).  I suppose a compromised task running
as non-root in a container (I mean non-root even from the container's
perspective) would no longer have capabilities and due to max depth
would not be able to usefully gain them by creating a nested namespace.

With luck, some setups like this could even work with a max depth of 0,
if we allow capabilities to remain effective when the container is
started by host root.

3. Desktop systems, Flatpak, etc.  If we can provide useful settings and
hardening for these as well, that's a great bonus.

Overall, my thinking is that someone using containers may be more
concerned about attacks from within containers than from the host.
Similarly, someone using nested containers may be most concerned about
attacks from the deepest level.  Ideally, we'd protect against attacks
from all levels, but since we can't do that easily, let's at least protect
against some - hopefully, the most relevant ones.
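
To make the three cases above concrete, here is a toy model of the proposed
knob.  It is a sketch of the idea only, not kernel code: the function name,
the parameters, and the exact treatment of the host root bypass are
assumptions, since none of this has been prototyped.

```python
def caps_effective(level, max_depth, created_by_host_root=False,
                   root_bypass=False):
    """Model whether capabilities are effective in a user namespace.

    level: nesting level of the namespace (0 is the initial/host namespace).
    max_depth: hypothetical knob, the deepest level where capabilities work.
    created_by_host_root: the namespace was created by root on the host.
    root_bypass: whether the optional bypass for host root is enabled.
    """
    if level == 0:
        return True  # the host namespace is not restricted by the knob
    if root_bypass and created_by_host_root:
        return True  # optional exemption for namespaces made by host root
    return level <= max_depth
```

In this model, a max depth of 0 makes capabilities ineffective in every
non-initial namespace unless the bypass applies, while a max depth of 1
keeps them in a first-level container and denies them in anything nested
inside it.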

> To protect a trusted user from their own sandboxed apps, it should be
> unnecessary/redundant for Flatpak users, because Flatpak already doesn't
> let apps inherit CAP_NET_ADMIN or create new user namespaces - but it
> could be useful for other sandboxed app frameworks, or as a second line
> of defence against Flatpak not providing the boundary that it aims to.
> 
> To protect the OS and other users from a malicious or compromised
> user account using kernel vulnerabilities to elevate privileges, it's
> insufficient - if that's your security model then there isn't going to be
> any substitute for either trusting the kernel to make CAP_NET_ADMIN in a
> non-init user namespace be safe, or trusting a component like bwrap to
> impose restrictions that its caller is not allowed to bypass.

Yes, with depth >= 1 allowed, such as to use Flatpak, there would be no
protection from host users.

> Of course, any time we say things like "trusting a component to impose
> restrictions that its caller is not allowed to bypass", we get into
> the same territory as setuid/setgid/setcap, in terms of needing to
> prevent LD_PRELOAD, LD_LIBRARY_PATH and similar ways to influence the
> trusted component's behaviour from the outside - which is likely to be
> impossible if the kernel isn't helping to defang those aspects of the
> execution environment by flagging the process as AT_SECURE, either in
> core kernel code or in an LSM like AppArmor.

To have a component impose restrictions, the feature would first need to
be made unavailable directly, which basically means no _unprivileged_
user namespaces, and bwrap or such started as SUID root - in which case
it would have AT_SECURE.

That's not a setup I was thinking of, but now that you bring it up, it
shows that upstream Linux lacks support for it: this needs a separate
knob to control _unprivileged_ user namespaces, like the one Debian has.

My proposed knob could also satisfy this need, if we do include a bypass
for namespaces created by host root.
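
For reference, the existing knobs can be inspected with a short script
like the one below.  This is a sketch under assumptions: it takes
user.max_user_namespaces as an example of the upstream max_* settings and
kernel.unprivileged_userns_clone as the Debian-specific knob, and the
helper names are made up for illustration.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return a sysctl's integer value, or None if absent or unreadable."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def userns_knobs(root="/proc/sys"):
    """Collect the user namespace knobs discussed here (None if missing)."""
    names = ("user.max_user_namespaces", "kernel.unprivileged_userns_clone")
    return {name: read_sysctl(name, root) for name in names}
```

A value of None for kernel.unprivileged_userns_clone would suggest a kernel
without the distribution patch, which is the gap described above.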

> I believe the kernel maintainers' position is that CAP_NET_ADMIN in
> a non-init userns is meant to be safe for untrusted code to have, so
> auditing and if necessary hardening the kernel's use of CAP_NET_ADMIN
> might well be better-received upstream than trying to limit which parts
> of user-space can obtain it.

This seems to be the case, but those activities are orthogonal.  We can
try and have both.

Alexander