2024-02-17 20:57:04

by Kent Overstreet

Subject: [LSF TOPIC] beyond uidmapping, & towards a better security model

AKA - integer identifiers considered harmful

Any time you've got a namespace that's just integers, if you ever end up
needing to subdivide it you're going to have a bad time.

This comes up all over the place - for another example, consider ioctl
numbering, where keeping them organized and collision free is a major
headache.

For UIDs, we need to be able to subdivide the UID namespace for e.g.
containers and mounting filesystems as an unprivileged user - but since
we just have an integer identifier, this requires complicated remapping
and updating and maintaining a global table.

Subdividing a UID to create new permissions domains should be a cheap,
easy operation, and it's not.

The solution (originally from plan9, of course) is - UIDs shouldn't be
numbers, they should be strings; and additionally, the strings should be
paths.

Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
subusers, created by alice without any privileged operations or mucking
with outside system state, and 'alice' would be superuser w.r.t.
'alice.foo' and 'alice.bar'.
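
As a rough sketch of the naming rule above (Python, purely for
illustration; `is_superuser_over` is a made-up name, not an existing or
proposed kernel interface), the relationship can be read as a plain
prefix test on '.'-separated names. It assumes a user is superuser over
every name beneath it, at any depth:

```python
def is_superuser_over(actor: str, target: str) -> bool:
    """Return True if `target` is a subuser (at any depth) of `actor`."""
    prefix = actor + "."
    # A subuser name is the parent's name, a '.', and a non-empty remainder.
    return target.startswith(prefix) and len(target) > len(prefix)


print(is_superuser_over("alice", "alice.foo"))      # alice owns alice.foo
print(is_superuser_over("alice.foo", "alice.bar"))  # siblings are unrelated
```

Note that no table lookup is involved: the name alone carries the
hierarchy, which is the point of the proposal.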

What's this get us?

Much better, easier to use sandboxing - and maybe we can kill off a
_whole_ lot of other stuff, too.

AppArmor and SELinux are fundamentally just about sandboxing programs
so they can't own everything owned by the user they're run by.

But if we have an easy way to say "exec this program as a subuser of the
current user..."

Then we can control what that program can access with just our existing
UNIX permissions and ACLs.

This would be a pretty radical change, and there's a number of things to
explore - lots of brainstorming to do.

- How can we do this without breaking absolutely everything? Obviously,
  any syscalls that communicate in terms of UIDs and GIDs are a
  problem; can we come up with a compat layer so that most stuff more
  or less still works?

- How can we do this in a way that's the most orthogonal, that gets us
  the most bang for our buck? How can we kill off as much security model
  stupidity as possible? How can we make sandboxing _dead easy_ for new
  applications?

Cheers,
Kent


2024-02-17 22:54:36

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Sat, Feb 17, 2024 at 10:31:29PM +0000, Matthew Wilcox wrote:
> On Sat, Feb 17, 2024 at 03:56:40PM -0500, Kent Overstreet wrote:
> > AKA - integer identifiers considered harmful
>
> Sure, but how far are you willing to take this? You've recently been
> complaining about inode numbers:
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> > The solution (originally from plan9, of course) is - UIDs shouldn't be
> > numbers, they should be strings; and additionally, the strings should be
> > paths.
> >
> > Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
> > subusers, created by alice without any privileged operations or mucking
> > with outside system state, and 'alice' would be superuser w.r.t.
> > 'alice.foo' and 'alice.bar'.
>
> Waitwaitwait. You start out saying "they are paths" and then you use
> '.' as the path separator. I mean, I come from a tradition that *does*
> use '.' as the path separator (RISC OS, from Acorn DFS, which I believe
> was influenced by the Phoenix command interpreter), but Unix tends to
> use / as the separator.

To me, / indicates that it's a filesystem object, which these are not.
Languages tend to use : as the path separator for object namespacing -
hierarchical paths, but not filesystem paths.

> One of the critical things about plan9 that means you have to think
> hard before transposing its ideas to Linux is that it doesn't have suid
> programs. So if I create willy/root, it's essential that a program
> which is suid only becomes suid with respect to other programs inside
> willy's domain. And it doesn't just apply to filesystem things, but
> "can I send signals" and dozens of other things. So there's a lot
> to be fleshed out here.

My proposal is that a user is superuser only over direct sub-users; so
in your example, willy.root would only be superuser over willy.root.*;
it's just your normal willy user that's superuser over all your
sub-users.

That means that our 'root' user doesn't fit with this scheme - the
superuser over the whole system would be the "" user.

Or perhaps we just map our existing users to be sub-users of root?

root
root.willy
root.kent?

User namespacing in this scheme just becomes "prepend this username when
leaving the namespace", so this might actually work; bit odd in that in
this scheme there's nothing implicitly special about the 'root'
username, so that becomes a (mild) wart... easily addressed with
capabilities, though.
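
A small sketch of the "prepend this username when leaving the
namespace" idea, in Python for illustration only. The function names
and the choice to return None for users outside the namespace are
assumptions, not part of the proposal:

```python
from typing import Optional


def leave_namespace(ns_owner: str, inner_name: str) -> str:
    """Translate a name seen inside a namespace to the name seen outside."""
    return ns_owner + "." + inner_name


def enter_namespace(ns_owner: str, outer_name: str) -> Optional[str]:
    """Inverse translation; None if the outer user is not under ns_owner."""
    prefix = ns_owner + "."
    if outer_name.startswith(prefix) and len(outer_name) > len(prefix):
        return outer_name[len(prefix):]
    return None


# 'root' inside a namespace owned by root.willy is root.willy.root outside.
print(leave_namespace("root.willy", "root"))
print(enter_namespace("root.willy", "root.kent"))
```

Nesting namespaces then amounts to applying the same prepend once per
level, with no per-namespace table to maintain.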

2024-02-18 02:10:14

by Matthew Wilcox

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Sat, Feb 17, 2024 at 03:56:40PM -0500, Kent Overstreet wrote:
> AKA - integer identifiers considered harmful

Sure, but how far are you willing to take this? You've recently been
complaining about inode numbers:
https://lore.kernel.org/linux-fsdevel/[email protected]/

> The solution (originally from plan9, of course) is - UIDs shouldn't be
> numbers, they should be strings; and additionally, the strings should be
> paths.
>
> Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
> subusers, created by alice without any privileged operations or mucking
> with outside system state, and 'alice' would be superuser w.r.t.
> 'alice.foo' and 'alice.bar'.

Waitwaitwait. You start out saying "they are paths" and then you use
'.' as the path separator. I mean, I come from a tradition that *does*
use '.' as the path separator (RISC OS, from Acorn DFS, which I believe
was influenced by the Phoenix command interpreter), but Unix tends to
use / as the separator.

One of the critical things about plan9 that means you have to think
hard before transposing its ideas to Linux is that it doesn't have suid
programs. So if I create willy/root, it's essential that a program
which is suid only becomes suid with respect to other programs inside
willy's domain. And it doesn't just apply to filesystem things, but
"can I send signals" and dozens of other things. So there's a lot
to be fleshed out here.

2024-02-19 14:27:19

by James Bottomley

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Sat, 2024-02-17 at 15:56 -0500, Kent Overstreet wrote:
> AKA - integer identifiers considered harmful
>
> Any time you've got a namespace that's just integers, if you ever end
> up needing to subdivide it you're going to have a bad time.
>
> This comes up all over the place - for another example, consider
> ioctl numbering, where keeping them organized and collision free is a
> major headache.
>
> For UIDs, we need to be able to subdivide the UID namespace for e.g.
> containers and mounting filesystems as an unprivileged user - but
> since we just have an integer identifier, this requires complicated
> remapping and updating and maintaining a global table.
>
> Subdividing a UID to create new permissions domains should be a
> cheap, easy operation, and it's not.
>
> The solution (originally from plan9, of course) is - UIDs shouldn't
> be numbers, they should be strings; and additionally, the strings
> should be paths.
>
> Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
> subusers, created by alice without any privileged operations or
> mucking with outside system state, and 'alice' would be superuser
> w.r.t. 'alice.foo' and 'alice.bar'.
>
> What's this get us?

I would have to say that changing kuid for a string doesn't really buy
us anything except a load of complexity for no very real gain.
However, since the current kuid is u32 and exposed uid is u16 and there
is already a proposal to make use of this somewhat in the way you
envision, there might be a possibility to re-express kuid as an array
of u16s without much disruption. Each adjacent pair could represent
the owner at the top and the userns assigned uid underneath. That
would neatly solve the nesting problem the current upper 16 bits
proposal has.

However, neither proposal would get us out of the problem of mount
mapping because we'd have to keep the filesystem permission check on
the owning uid unless told otherwise.

> Much better, easier to use sandboxing - and maybe we can kill off a
> _whole_ lot of other stuff, too.
>
> Apparmour and selinux are fundamentally just about sandboxing
> programs so they can't own everything owned by the user they're run
> by.
>
> But if we have an easy way to say "exec this program as a subuser of
> the current user..."
>
> Then we can control what that program can access with just our
> existing UNIX permission and acls.
>
> This would be a pretty radical change, and there's a number of things
> to explore - lots of brainstorming to do.
>
>  - How can we do this without breaking absolutely everything?
>    Obviously, any syscalls that communicate in terms of UIDs and GIDs
>    are a problem; can we come up with a compat layer so that most
>    stuff more or less still works?
>
>  - How can we do this a way that's the most orthogonal, that gets us
>    the most bang for our buck? How can we kill off as much security
>    model stupidity as possible? How can we make sandboxing _dead easy_
>    for new applications?

So all of the above could be covered by a u16 kuid array with the last
element exposed to the user as the uid. However, there are still
problems even with that approach: the unmapped uid/gid is something
some containers rely on and, as I said above, the mount mapping still
would have to be admin assigned.
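
One possible reading of the u16 array idea, sketched in Python for
illustration. The helper names are hypothetical and the layout (each
nesting level appends one u16, with the last element being the uid the
user sees) is an assumption drawn from the description above, not from
any patch:

```python
from typing import Tuple

U16_MAX = 0xFFFF
Kuid = Tuple[int, ...]


def nest_kuid(owner: Kuid, assigned_uid: int) -> Kuid:
    """Build the kuid of a uid assigned inside a userns owned by `owner`."""
    if not 0 <= assigned_uid <= U16_MAX:
        raise ValueError("assigned uid must fit in 16 bits")
    return owner + (assigned_uid,)


def exposed_uid(kuid: Kuid) -> int:
    """The uid shown to userspace is the last element of the array."""
    return kuid[-1]


def is_nested_under(kuid: Kuid, owner: Kuid) -> bool:
    """True if `kuid` was created somewhere beneath `owner`."""
    return len(kuid) > len(owner) and kuid[:len(owner)] == owner


host_user = (1000,)
container_root = nest_kuid(host_user, 0)
print(container_root, exposed_uid(container_root))
```

Comparing two such arrays is still element-wise integer comparison,
which is the property being argued for here.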

James


2024-02-21 00:26:27

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Mon, Feb 19, 2024 at 09:26:25AM -0500, James Bottomley wrote:
> On Sat, 2024-02-17 at 15:56 -0500, Kent Overstreet wrote:
> > AKA - integer identifiers considered harmful
> >
> > Any time you've got a namespace that's just integers, if you ever end
> > up needing to subdivide it you're going to have a bad time.
> >
> > This comes up all over the place - for another example, consider
> > ioctl numbering, where keeping them organized and collision free is a
> > major headache.
> >
> > For UIDs, we need to be able to subdivide the UID namespace for e.g.
> > containers and mounting filesystems as an unprivileged user - but
> > since we just have an integer identifier, this requires complicated
> > remapping and updating and maintaining a global table.
> >
> > Subdividing a UID to create new permissions domains should be a
> > cheap, easy operation, and it's not.
> >
> > The solution (originally from plan9, of course) is - UIDs shouldn't
> > be numbers, they should be strings; and additionally, the strings
> > should be paths.
> >
> > Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
> > subusers, created by alice without any privileged operations or
> > mucking with outside system state, and 'alice' would be superuser
> > w.r.t. 'alice.foo' and 'alice.bar'.
> >
> > What's this get us?
>
> I would have to say that changing kuid for a string doesn't really buy
> us anything except a load of complexity for no very real gain.
> However, since the current kuid is u32 and exposed uid is u16 and there
> is already a proposal to make use of this somewhat in the way you
> envision,

Got a link to that proposal?

> there might be a possibility to re-express kuid as an array
> of u16s without much disruption. Each adjacent pair could represent
> the owner at the top and the userns assigned uid underneath. That
> would neatly solve the nesting problem the current upper 16 bits
> proposal has.

At a high level, there's no real difference between a variable length
integer, or a variable length array of integers, or a string.

But there's real advantages to getting rid of the string <-> integer
identifier mapping and plumbing strings all the way through:

- creating a new sub-user can be done with nothing more than the new
  username version of setuid(); IOW, we can start a new named subuser
  for e.g. firefox without mucking with _any_ system state or tables

- sharing filesystems between machines is always a pita because
  usernames might be the same but uids never are - let's kill that off,
  please

Doing anything as big as an array of integers is going to be a major
compatibility break anyways, so we might as well do it right.

Either way we're going to need a mapping to 16 bit uids for
compatibility; doing this right gives userspace an incentive to get
_off_ that compatibility layer so we're not dealing with that impedance
mismatch forever.

> However, neither proposal would get us out of the problem of mount
> mapping because we'd have to keep the filesystem permission check on
> the owning uid unless told otherwise.

Not sure I follow?

We're always going to need mount mapping, but if the mount mapping is
just "usernames here get mapped to this subtree of the system username
namespace", then that potentially simplifies things quite a bit - the
mount mapping is no longer a _table_.

And it wouldn't have to be administrator assigned. Some administrator
assignment might be required for the username <-> 16 bit uid mapping,
but if those mappings are ephemeral (i.e. if we get filesystems
persistently storing usernames, which is easy enough with xattrs) then
that just becomes "reserve x range of the 16 bit uid space for ephemeral
translations".
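
To make the "reserve x range for ephemeral translations" idea concrete,
here is a minimal Python sketch. The class name, the range bounds and
the exhaustion behaviour are all assumptions for illustration; the
usernames are treated as the persistent source of truth and the numeric
uids exist only for the lifetime of the table:

```python
from typing import Dict, Optional


class EphemeralUidMap:
    """Hand out uids from a reserved range to usernames, on demand."""

    def __init__(self, first: int, last: int) -> None:
        self._next = first
        self._last = last
        self._by_name: Dict[str, int] = {}
        self._by_uid: Dict[int, str] = {}

    def uid_for(self, username: str) -> int:
        """Return the uid for `username`, allocating one on first use."""
        uid = self._by_name.get(username)
        if uid is None:
            if self._next > self._last:
                raise RuntimeError("reserved uid range exhausted")
            uid = self._next
            self._next += 1
            self._by_name[username] = uid
            self._by_uid[uid] = username
        return uid

    def name_for(self, uid: int) -> Optional[str]:
        """Reverse lookup for legacy callers that only hold a uid."""
        return self._by_uid.get(uid)


# Hypothetical reserved range; a real system would pick its own.
uids = EphemeralUidMap(60000, 60999)
print(uids.uid_for("alice.firefox"), uids.name_for(60000))
```

Because the table can be rebuilt from scratch at any time, nothing in
it needs administrator assignment beyond choosing the reserved range.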

2024-02-21 00:57:00

by Stéphane Graber

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

Hey there,

Sorry, I don't have the time to go through all the details in this
post to provide an adequate response, I'm adding Aleksandr who may be
able to provide more details on what we've been up to (what James
alluded to).

Our proposal is effectively bumping the in-kernel kuid_t/kgid_t from
uint32 to uint64, which allows for individual user namespaces to get a
full usable uint32 uid/gid range in the kernel. Obviously any kind of
data persistence needs some mapping (VFS idmap) and there are a bunch
of other corner cases as to how this is all exposed to userspace.
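
A sketch of how a 64-bit kernel id could give each user namespace its
own full 32-bit range, in Python for illustration. The split used here
(namespace identifier in the upper 32 bits, uid in the lower 32) and
the helper names are assumptions; the actual layout is whatever the
patchset linked below defines:

```python
UID_BITS = 32
UID_MASK = (1 << UID_BITS) - 1


def make_kuid(ns_id: int, uid: int) -> int:
    """Pack a namespace identifier and a 32-bit uid into one 64-bit value."""
    if not 0 <= ns_id <= UID_MASK or not 0 <= uid <= UID_MASK:
        raise ValueError("ns_id and uid must each fit in 32 bits")
    return (ns_id << UID_BITS) | uid


def split_kuid(kuid: int) -> tuple:
    """Unpack a 64-bit value back into (ns_id, uid)."""
    return kuid >> UID_BITS, kuid & UID_MASK


# uid 0 in two different namespaces never collides in the kernel's view.
print(hex(make_kuid(1, 0)), hex(make_kuid(2, 0)))
```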

The idea around this stuff started back at Plumbers / Kernel summit
all the way back in 2019 with a bit of refinement on the idea on and
off ever since.
We now have a functional patchset and example userspace code at:
- https://github.com/mihalicyn/isolated-userns
- https://github.com/mihalicyn/linux/commits/isolated_userns

If you don't mind watching a video, we have a reasonably detailed talk
on the topic as well as demo and useful audience questions and
feedback from FOSDEM here: https://www.youtube.com/watch?v=mOLzSzpVwHU

After talking about this with folks at a number of LPC / kernel summit
/ FOSDEM events by this point, our next step is going to be an RFC
patchset; I think at this point we just want the cgroupfs issue sorted
out before sending that out.

I'll try to set some time to go through your full e-mail later this
week if Alex doesn't get to it first!

Stéphane

On Tue, Feb 20, 2024 at 7:26 PM Kent Overstreet
<[email protected]> wrote:
>
> On Mon, Feb 19, 2024 at 09:26:25AM -0500, James Bottomley wrote:
> > On Sat, 2024-02-17 at 15:56 -0500, Kent Overstreet wrote:
> > > AKA - integer identifiers considered harmful
> > >
> > > Any time you've got a namespace that's just integers, if you ever end
> > > up needing to subdivide it you're going to have a bad time.
> > >
> > > This comes up all over the place - for another example, consider
> > > ioctl numbering, where keeping them organized and collision free is a
> > > major headache.
> > >
> > > For UIDs, we need to be able to subdivide the UID namespace for e.g.
> > > containers and mounting filesystems as an unprivileged user - but
> > > since we just have an integer identifier, this requires complicated
> > > remapping and updating and maintaining a global table.
> > >
> > > Subdividing a UID to create new permissions domains should be a
> > > cheap, easy operation, and it's not.
> > >
> > > The solution (originally from plan9, of course) is - UIDs shouldn't
> > > be numbers, they should be strings; and additionally, the strings
> > > should be paths.
> > >
> > > Then, if 'alice' is a user, 'alice.foo' and 'alice.bar' would be
> > > subusers, created by alice without any privileged operations or
> > > mucking with outside system state, and 'alice' would be superuser
> > > w.r.t. 'alice.foo' and 'alice.bar'.
> > >
> > > What's this get us?
> >
> > I would have to say that changing kuid for a string doesn't really buy
> > us anything except a load of complexity for no very real gain.
> > However, since the current kuid is u32 and exposed uid is u16 and there
> > is already a proposal to make use of this somewhat in the way you
> > envision,
>
> Got a link to that proposal?
>
> > there might be a possibility to re-express kuid as an array
> > of u16s without much disruption. Each adjacent pair could represent
> > the owner at the top and the userns assigned uid underneath. That
> > would neatly solve the nesting problem the current upper 16 bits
> > proposal has.
>
> At a high level, there's no real difference between a variable length
> integer, or a variable length array of integers, or a string.
>
> But there's real advantages to getting rid of the string <-> integer
> identifier mapping and plumbing strings all the way through:
>
> - creating a new sub-user can be done with nothing more than the new
> username version of setuid(); IOW, we can start a new named subuser
> for e.g. firefox without mucking with _any_ system state or tables
>
> - sharing filesystems between machines is always a pita because
> usernames might be the same but uids never are - let's kill that off,
> please
>
> Doing anything as big as an array of integers is going to be a major
> compatibiltiy break anyways, so we might as well do it right.
>
> Either way we're going to need a mapping to 16 bit uids for
> compatibility; doing this right gives userspace an incentive to get
> _off_ that compatibility layer so we're not dealing with that impedence
> mismatch forever.
>
> > However, neither proposal would get us out of the problem of mount
> > mapping because we'd have to keep the filesystem permission check on
> > the owning uid unless told otherwise.
>
> Not sure I follow?
>
> We're always going to need mount mapping, but if the mount mapping is
> just "usernames here get mapped to this subtree of the system username
> namespace", then that potentially simplifies things quite a bit - the
> mount mapping is no longer a _table_.
>
> And it wouldn't have to be administrator assigned. Some administrator
> assignment might be required for the username <-> 16 bit uid mapping,
> but if those mappings are ephemeral (i.e. if we get filesystems
> persistently storing usernames, which is easy enough with xattrs) then
> that just becomes "reserve x range of the 16 bit uid space for ephemeral
> translations".



--
Stéphane

2024-02-21 01:02:13

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Tue, Feb 20, 2024 at 07:56:32PM -0500, Stéphane Graber wrote:
> Hey there,
>
> Sorry, I don't have the time to go through all the details in this
> post to provide an adequate response, I'm adding Aleksandr who may be
> able to provide more details on what we've been up to (what James
> alluded to).
>
> Our proposal is effectively bumping the in-kernel kuid_t/kgid_t from
> uint32 to uint64, which allows for individual user namespaces to get a
> full usable uint32 uid/gid range in the kernel. Obviously any kind of
> data persistence needs some mapping (VFS idmap) and there are a bunch
> of other corner cases as to how this is all exposed to userspace.
>
> The idea around this stuff started back at Plumbers / Kernel summit
> all the way back in 2019 with a bit of refinement on the idea on and
> off ever since.
> We now have a functional patchset and example userspace code at:
> - https://github.com/mihalicyn/isolated-userns
> - https://github.com/mihalicyn/linux/commits/isolated_userns
>
> If you don't mind watching a video, we have a reasonably detailed talk
> on the topic as well as demo and useful audience questions and
> feedback from FOSDEM here: https://www.youtube.com/watch?v=mOLzSzpVwHU
>
> After talking about this with folks at a number of LPC / kernel summit
> / FOSDEM by this point, our next step is going to be an RFC patchset,
> I think at this point we just want the cgroupfs issue sorted out
> before sending that out.
>
> I'll try to set some time to go through your full e-mail later this
> week if Alex doesn't get to it first!

Looking forward to it!

2024-02-21 02:07:34

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Wed, Feb 21, 2024 at 01:22:55AM +0000, Matthew Wilcox wrote:
> On Tue, Feb 20, 2024 at 07:25:58PM -0500, Kent Overstreet wrote:
> > But there's real advantages to getting rid of the string <-> integer
> > identifier mapping and plumbing strings all the way through:
> >
> > - creating a new sub-user can be done with nothing more than the new
> > username version of setuid(); IOW, we can start a new named subuser
> > for e.g. firefox without mucking with _any_ system state or tables
> >
> > - sharing filesystems between machines is always a pita because
> > usernames might be the same but uids never are - let's kill that off,
> > please
>
> I feel like we need a bit of a survey of filesystems to see what is
> already supported and what are desirable properties. Block filesystems
> are one thing, but network filesystems have been dealing with crap like
> this for decades. I don't have a good handle on who supports what at
> this point.

I'm not sure it's critical. 9p supports it already. The big one is NFS,
but if this takes off getting it into the next version of NFS or as an
extension is going to be the easy part.

The critical part is going to be coming up with the new syscall
interface, and figuring out the compatibility shims so that the minimum
amount of userspace has to be modified to take advantage of it, and
figuring out the compatibility code so that non-username-aware code
does something sensible.

But with filesystem support, so that we're persisting usernames and
those are the source of truth and old style UIDs are just ephemeral,
that part looks tractable.

I'm glad the container people are looking at this already, and I
hope they're up for something even more ambitious :) I'd love to kill
off the problems with integer identifiers once and for all.

One of my professors way way back, who was a big influence on me, made
the point that the purpose of the operating system is to virtualize the
hardware, so that every program can pretend it has the whole machine to
itself. Hence things like virtual memory, and filesystems that let you
recursively divide your storage.

But that job isn't complete until the operating system lets you
recursively subdivide every resource it provides, physical or virtual:
hence containers and namespaces.

Hence - 64 bit identifiers aren't enough, if we're going to solve this
once and for all it's got to be a variable length path.

> As far as usernames being the same ... well, maybe. I've been willy,
> mrw103, wilma (twice!), mawilc01 and probably a bunch of others I don't
> remember. I don't think we'll ever get away from having a mapping
> between different naming authorities.

*nod* There will still be situations where remapping is needed, but name
<-> name mapping is way easier for users to deal with than integer <->
integer.

Also - I'd like to get some security model people involved with this, if
anyone knows the right people to loop in. That's the part that's the
most interesting to me (and what motivated me to post this the other day
was Al bitching about apparmor on IRC).

I think, in hindsight, that we grew a lot of this strange security model
stuff that doesn't at all fit with the Unix security model for the
simple reason that creating new permissions domains is just not
something you do on an as needed basis, as a normal user.

The next natural thing to do with permissions is to extend the
permissions model with rwx bits for
- parent of current user
- subusers of current user

..and probably various acl variants of these.
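
One way to picture those extra permission classes, as a Python sketch.
Everything here is an assumption for illustration: the class names, the
dict-based mode, and the reading that "parent" means the accessor is an
ancestor of the file's owner while "subuser" means the accessor sits
beneath the owner:

```python
R, W, X = 4, 2, 1


def relation(accessor: str, owner: str) -> str:
    """Classify how `accessor` relates to the file's `owner`."""
    if accessor == owner:
        return "owner"
    if owner.startswith(accessor + "."):
        return "parent"
    if accessor.startswith(owner + "."):
        return "subuser"
    return "other"


def may_access(accessor: str, owner: str, mode: dict, want: int) -> bool:
    """Check the rwx bits of whichever class the accessor falls into."""
    granted = mode.get(relation(accessor, owner), 0)
    return (granted & want) == want


# A file owned by alice that her subusers may read but not write.
mode = {"owner": R | W, "parent": 0, "subuser": R, "other": 0}
print(may_access("alice.firefox", "alice", mode, R))
print(may_access("alice.firefox", "alice", mode, W))
```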

There's some interesting territory to be explored there for sure.

2024-02-21 02:15:54

by NeilBrown

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Wed, 21 Feb 2024, Matthew Wilcox wrote:
> On Tue, Feb 20, 2024 at 07:25:58PM -0500, Kent Overstreet wrote:
> > But there's real advantages to getting rid of the string <-> integer
> > identifier mapping and plumbing strings all the way through:
> >
> > - creating a new sub-user can be done with nothing more than the new
> > username version of setuid(); IOW, we can start a new named subuser
> > for e.g. firefox without mucking with _any_ system state or tables
> >
> > - sharing filesystems between machines is always a pita because
> > usernames might be the same but uids never are - let's kill that off,
> > please
>
> I feel like we need a bit of a survey of filesystems to see what is
> already supported and what are desirable properties. Block filesystems
> are one thing, but network filesystems have been dealing with crap like
> this for decades. I don't have a good handle on who supports what at
> this point.

NFSv4 uses textual user and group names. We have an "idmap" service
which maps between name and number on each end.
This is needed when krb5 is used as kerberos identities are names, not
numbers.

But in my (admittedly limited) experience, when krb5 isn't used (and
probably also when it is), uids do match across the network.
While the original NFSv4 didn't support it, an addendum allows
usernames made entirely of digits to be treated as numerical uids, and
that is what (almost) everyone uses.
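
The two behaviours described above (name lookup through an idmap
service, and all-digit names taken as numeric uids) can be sketched as
follows. This is illustrative Python, not the real idmapper; the
function name and the table contents are made up:

```python
from typing import Dict, Optional


def owner_to_uid(owner: str, name_table: Dict[str, int]) -> Optional[int]:
    """Map an NFSv4-style owner string to a local uid."""
    # An owner made entirely of ASCII digits is taken as a numeric uid.
    if owner.isascii() and owner.isdigit():
        return int(owner)
    # Otherwise fall back to a name lookup; None means "no mapping".
    return name_table.get(owner)


# Hypothetical local table standing in for the idmap service.
table = {"neil@example.com": 1000}
print(owner_to_uid("1000", table), owner_to_uid("neil@example.com", table))
```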

It is certainly useful to mount "my" files from some other machine and
have them appear to have "my" uid locally which might be different from
the remote uid. I think when two different machines both have two or
more particular users, it is extremely likely that a central uid data
base will be in use (ldap?) and so all uids will match. No mapping
needed.

(happy to be proved wrong...)

NeilBrown


>
> As far as usernames being the same ... well, maybe. I've been willy,
> mrw103, wilma (twice!), mawilc01 and probably a bunch of others I don't
> remember. I don't think we'll ever get away from having a mapping
> between different naming authorities.
>
>


2024-02-21 02:39:58

by Matthew Wilcox

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Tue, Feb 20, 2024 at 07:25:58PM -0500, Kent Overstreet wrote:
> But there's real advantages to getting rid of the string <-> integer
> identifier mapping and plumbing strings all the way through:
>
> - creating a new sub-user can be done with nothing more than the new
> username version of setuid(); IOW, we can start a new named subuser
> for e.g. firefox without mucking with _any_ system state or tables
>
> - sharing filesystems between machines is always a pita because
> usernames might be the same but uids never are - let's kill that off,
> please

I feel like we need a bit of a survey of filesystems to see what is
already supported and what are desirable properties. Block filesystems
are one thing, but network filesystems have been dealing with crap like
this for decades. I don't have a good handle on who supports what at
this point.

As far as usernames being the same ... well, maybe. I've been willy,
mrw103, wilma (twice!), mawilc01 and probably a bunch of others I don't
remember. I don't think we'll ever get away from having a mapping
between different naming authorities.

2024-02-21 03:56:54

by James Bottomley

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Tue, 2024-02-20 at 19:25 -0500, Kent Overstreet wrote:
> On Mon, Feb 19, 2024 at 09:26:25AM -0500, James Bottomley wrote:
> > I would have to say that changing kuid for a string doesn't really
> > buy us anything except a load of complexity for no very real gain.
> > However, since the current kuid is u32 and exposed uid is u16 and
> > there is already a proposal to make use of this somewhat in the way
> > you envision,
>
> Got a link to that proposal?

I think this is the latest presentation on it:

https://fosdem.org/2024/schedule/event/fosdem-2024-3217-converting-filesystems-to-support-idmapped-mounts/

>
> > there might be a possibility to re-express kuid as an array
> > of u16s without much disruption.  Each adjacent pair could
> > represent the owner at the top and the userns assigned uid
> > underneath.  That would neatly solve the nesting problem the
> > current upper 16 bits proposal has.
>
> At a high level, there's no real difference between a variable length
> integer, or a variable length array of integers, or a string.

Right, so the advantage is the kernel already does an integer
comparison all over the place.

> But there's real advantages to getting rid of the string <-> integer
> identifier mapping and plumbing strings all the way through:
>
>  - creating a new sub-user can be done with nothing more than the new
>    username version of setuid(); IOW, we can start a new named
>    subuser for e.g. firefox without mucking with _any_ system state
>    or tables
>
>  - sharing filesystems between machines is always a pita because
>    usernames might be the same but uids never are - let's kill that
>    off, please
>
> Doing anything as big as an array of integers is going to be a major
> compatibiltiy break anyways, so we might as well do it right.

I'm not really convinced it's right. Strings are trickier to handle
and compare than integer arrays and all of the above can be done by
either.

> Either way we're going to need a mapping to 16 bit uids for
> compatibility; doing this right gives userspace an incentive to get
> _off_ that compatibility layer so we're not dealing with that
> impedence mismatch forever.

Fundamentally we have a load of integer to pretty name things we use in
the kernel (protocol, port, ...). The point though is the kernel
doesn't need to know the pretty name, it deals with integers and user
space does the conversion.

> > However, neither proposal would get us out of the problem of mount
> > mapping because we'd have to keep the filesystem permission check
> > on the owning uid unless told otherwise.
>
> Not sure I follow?

Mounting a filesystem inside a userns can cause huge security problems
if we map fs root to inner root without the admin blessing it. Think
of binding /bin into the userns and then altering one of the root owned
binaries as inner root: if the permission check passes, the change
appears in system /bin.
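
A toy model of that failure mode, in Python for illustration. Real
permission checks involve far more than owner equality; the function
name, the dict-based uid map and the uid values are assumptions chosen
only to show why the choice of mapping matters:

```python
from typing import Dict


def may_write_as_owner(file_owner: int, inner_uid: int,
                       uid_map: Dict[int, int]) -> bool:
    """Owner check after translating the namespace uid to a host uid."""
    host_uid = uid_map.get(inner_uid)
    return host_uid is not None and host_uid == file_owner


BIN_LS_OWNER = 0  # a root-owned binary in the host's /bin

# Admin-controlled map: inner root is an unprivileged host uid.
controlled = {0: 100000}
# Unblessed map: inner root is host root.
unblessed = {0: 0}

print(may_write_as_owner(BIN_LS_OWNER, 0, controlled))  # check fails
print(may_write_as_owner(BIN_LS_OWNER, 0, unblessed))   # check passes
```

With the second map the write goes through, and because the bind mount
shares the underlying files, the modified binary is what the host runs.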

> We're always going to need mount mapping, but if the mount mapping is
> just "usernames here get mapped to this subtree of the system
> username namespace", then that potentially simplifies things quite a
> bit - the mount mapping is no longer a _table_.

But what then is it? If you allow the user arbitrarily to assign
subuids, you can't trust them for the mapping to the fs uid. The
current newidmap/newgidmap are somewhat nasty but at least they're
controlled.

I did try a prototype where all we cared about was the root<->root
mapping, but a unix system has other uids that are privileged as well,
so it didn't solve the security problem.

> And it wouldn't have to be administrator assigned. Some administrator
> assignment might be required for the username <-> 16 bit uid mapping,
> but if those mappings are ephemeral (i.e. if we get filesystems
> persistently storing usernames, which is easy enough with xattrs)
> then that just becomes "reserve x range of the 16 bit uid space for
> ephemeral translations".

*if* the user names you're dealing with are all unprivileged. When we
have a mix of privileged and unprivileged users owning the files, the
problems begin.

James


2024-02-21 23:01:28

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Tue, Feb 20, 2024 at 10:53:58PM -0500, James Bottomley wrote:
> On Tue, 2024-02-20 at 19:25 -0500, Kent Overstreet wrote:
> > On Mon, Feb 19, 2024 at 09:26:25AM -0500, James Bottomley wrote:
> > > I would have to say that changing kuid for a string doesn't really
> > > buy us anything except a load of complexity for no very real gain.
> > > However, since the current kuid is u32 and exposed uid is u16 and
> > > there is already a proposal to make use of this somewhat in the way
> > > you envision,
> >
> > Got a link to that proposal?
>
> I think this is the latest presentation on it:
>
> https://fosdem.org/2024/schedule/event/fosdem-2024-3217-converting-filesystems-to-support-idmapped-mounts/
>
> >
> > > there might be a possibility to re-express kuid as an array
> > > of u16s without much disruption.  Each adjacent pair could
> > > represent the owner at the top and the userns assigned uid
> > > underneath.  That would neatly solve the nesting problem the
> > > current upper 16 bits proposal has.
> >
> > At a high level, there's no real difference between a variable length
> > integer, or a variable length array of integers, or a string.
>
> Right, so the advantage is the kernel already does an integer
> comparison all over the place.
>
> > But there's real advantages to getting rid of the string <-> integer
> > identifier mapping and plumbing strings all the way through:
> >
> >  - creating a new sub-user can be done with nothing more than the new
> >    username version of setuid(); IOW, we can start a new named subuser
> >    for e.g. firefox without mucking with _any_ system state or tables
> >
> >  - sharing filesystems between machines is always a pita because
> >    usernames might be the same but uids never are - let's kill that off,
> >    please
> >
> > Doing anything as big as an array of integers is going to be a major
> > compatibility break anyways, so we might as well do it right.
>
> I'm not really convinced it's right. Strings are trickier to handle
> and compare than integer arrays and all of the above can be done by
> either.

Strings are just arrays of integers, and anyways this stuff would be
within helpers.

But what you're not seeing is the beauty and simplicity of killing the
mapping layer.

When usernames are strings all the way into the kernel, creating and
switching to a new user is a single syscall. You can't do that if users
are small integer identifiers to the kernel; you have to create a new
entry in /etc/passwd or some equivalent, and that is strictly required
in order to avoid collisions. Users also can't be ephemeral.

To sketch out an example of how this would work, say we've got a new
set_subuser() syscall and the username equivalent of chown().

Now if we want to run firefox as a subuser, giving it access only to
.local/state/firefox, we'd do the following sequence of syscalls at the
start of the new firefox process:

mkdir(".local/state/firefox", 0700);
chown_subuser(".local/state/firefox", "firefox"); /* now owned by $USER.firefox */
set_subuser("firefox");

If we want to guarantee uniqueness, we'd append a UUID to the
subusername for the chown_subuser() call, and then for subsequent
invocations read it with statx() (or subuser enabled equivalent) for the
set_subuser() call.

Now firefox is running in a sandbox, where it has no access to the rest
of your home directory - unless explicitly granted with normal ACLs. And
the sandbox requires no system configuration; rm -rfing the
.local/state/firefox directory cleans everything up.

And these trivially nest: Firefox itself wants to sandbox individual
tabs from each other, so firefox could run each sub-process as a
different subuser.

This is dead easy compared to what we've been doing.
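
The "a user is superuser over its subusers" rule boils down to a prefix
check on the dotted name. A minimal userspace sketch, assuming the '.'
separator from the 'alice.foo' notation; user_is_self_or_subuser() is a
hypothetical helper name, not an existing kernel or libc interface, and
treating the empty username as everyone's ancestor is an assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not an existing kernel or libc API.
 * Returns true if `name` is `parent` itself or a subuser below it,
 * e.g. "alice.firefox" under "alice".
 */
static bool user_is_self_or_subuser(const char *parent, const char *name)
{
    size_t plen;

    assert(parent != NULL && name != NULL);
    plen = strlen(parent);

    /* Assumption: the empty username is root, an ancestor of everyone. */
    if (plen == 0)
        return true;

    if (strncmp(parent, name, plen) != 0)
        return false;

    /* Reject "alice" vs "alice2": the match must end on a boundary. */
    return name[plen] == '\0' || name[plen] == '.';
}
```

With this, "alice" passes the check for "alice.firefox" and for
"alice.firefox.tab1", but "alice.firefox" does not pass it for "alice".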

> > > However, neither proposal would get us out of the problem of mount
> > > mapping because we'd have to keep the filesystem permission check
> > > on the owning uid unless told otherwise.
> >
> > Not sure I follow?
>
> Mounting a filesystem inside a userns can cause huge security problems
> if we map fs root to inner root without the admin blessing it. Think
> of binding /bin into the userns and then altering one of the root owned
> binaries as inner root: if the permission check passes, the change
> appears in system /bin.

So with this proposal mount mapping becomes "map all users on this
filesystem to subusers of username x". That's a much simpler mapping
than mapping integer ranges to integer ranges, much easier to verify
that there aren't accidental root escapes.
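
As a rough userspace sketch of that mapping (map_fs_user() is a
hypothetical name, and mapping the filesystem's own empty/root username
onto the mount owner itself is an assumption), the whole "table" is one
prefix:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: map a username stored on a mounted filesystem into
 * the system-wide namespace by prefixing the mount's owner. Assumes a
 * non-empty mount_owner. Returns 0 on success, -1 if the result would not
 * fit in `out`.
 */
static int map_fs_user(const char *mount_owner, const char *fs_user,
                       char *out, size_t outlen)
{
    int n;

    assert(mount_owner != NULL && fs_user != NULL && out != NULL);

    /* Assumption: the filesystem's root (empty name) becomes the owner. */
    if (fs_user[0] == '\0')
        n = snprintf(out, outlen, "%s", mount_owner);
    else
        n = snprintf(out, outlen, "%s.%s", mount_owner, fs_user);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Every name that comes out of this starts with the mount owner, so nothing
on the filesystem can map to a user outside that subtree.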

> > And it wouldn't have to be administrator assigned. Some administrator
> > assignment might be required for the username <-> 16 bit uid mapping,
> > but if those mappings are ephemeral (i.e. if we get filesystems
> > persistently storing usernames, which is easy enough with xattrs)
> > then that just becomes "reserve x range of the 16 bit uid space for
> > ephemeral translations".
>
> *if* the user names you're dealing with are all unprivileged. When we
> have a mix of privileged and unprivileged users owning the files, the
> problems begin.

Yes, all subusers are unprivileged - only one username, the empty
username (which we'd probably map to root) maps to existing uid 0.
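
For the compat side quoted above ("reserve x range of the 16 bit uid
space for ephemeral translations"), a toy sketch of what such a window
could look like. The base, window size and all names here are made up
for illustration; a real implementation would need locking, reclaim and
persistence decisions that this ignores:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reserved window of the 16-bit uid space. */
#define EPH_UID_BASE  60000u
#define EPH_UID_COUNT 16u
#define EPH_NAME_MAX  64u

struct eph_table {
    char names[EPH_UID_COUNT][EPH_NAME_MAX];
    unsigned int used;
};

/*
 * Return the compat uid for `name`, assigning the next free slot on first
 * use. Returns -1 when the name is too long or the window is exhausted.
 */
static int32_t eph_uid_for(struct eph_table *t, const char *name)
{
    assert(t != NULL && name != NULL);

    /* Reuse the existing translation if this name was seen before. */
    for (unsigned int i = 0; i < t->used; i++)
        if (strcmp(t->names[i], name) == 0)
            return (int32_t)(EPH_UID_BASE + i);

    if (t->used == EPH_UID_COUNT || strlen(name) >= EPH_NAME_MAX)
        return -1;

    strcpy(t->names[t->used], name);
    return (int32_t)(EPH_UID_BASE + t->used++);
}
```

The table only exists for legacy callers that still want an integer; the
username stays the real identity.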

2024-02-22 00:33:47

by James Bottomley

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> On Tue, Feb 20, 2024 at 10:53:58PM -0500, James Bottomley wrote:
> > On Tue, 2024-02-20 at 19:25 -0500, Kent Overstreet wrote:
> > > On Mon, Feb 19, 2024 at 09:26:25AM -0500, James Bottomley wrote:
> > > > I would have to say that changing kuid for a string doesn't
> > > > really buy us anything except a load of complexity for no very
> > > > real gain. However, since the current kuid is u32 and exposed
> > > > uid is u16 and there is already a proposal to make use of this
> > > > somewhat in the way you envision,
> > >
> > > Got a link to that proposal?
> >
> > I think this is the latest presentation on it:
> >
> > https://fosdem.org/2024/schedule/event/fosdem-2024-3217-converting-filesystems-to-support-idmapped-mounts/
> >
> > >
> > > > there might be a possibility to re-express kuid as an array
> > > > of u16s without much disruption.  Each adjacent pair could
> > > > represent the owner at the top and the userns assigned uid
> > > > underneath.  That would neatly solve the nesting problem the
> > > > current upper 16 bits proposal has.
> > >
> > > At a high level, there's no real difference between a variable
> > > length integer, or a variable length array of integers, or a
> > > string.
> >
> > Right, so the advantage is the kernel already does an integer
> > comparison all over the place.
> >
> > > But there's real advantages to getting rid of the string <->
> > > integer identifier mapping and plumbing strings all the way through:
> > >
> > >  - creating a new sub-user can be done with nothing more than the
> > >    new username version of setuid(); IOW, we can start a new named
> > >    subuser for e.g. firefox without mucking with _any_ system state
> > >    or tables
> > >
> > >  - sharing filesystems between machines is always a pita because
> > >    usernames might be the same but uids never are - let's kill
> > >    that off, please
> > >
> > > Doing anything as big as an array of integers is going to be a
> > > major compatibility break anyways, so we might as well do it right.
> >
> > I'm not really convinced it's right.  Strings are trickier to
> > handle and compare than integer arrays and all of the above can be
> > done by either.
>
> Strings are just arrays of integers, and anyways this stuff would be
> within helpers.

Length limits and comparisons are the problem.

>
> But what you're not seeing is the beauty and simplicity of killing
> the mapping layer.

Well, that's the problem: you don't for certain use cases. That's what
I've been trying to explain. For the fully unprivileged use case,
sure, it all works (as does the upper 32 bits proposal or the integer
array ...) equally well.

Once you're representing to the userns contained entity that they have a
privileged admin that can write to the fsimage as an apparently
privileged user, then the problems begin.

> When usernames are strings all the way into the kernel, creating and
> switching to a new user is a single syscall. You can't do that if
> users are small integer identifiers to the kernel; you have to create
> a new entry in /etc/passwd or some equivalent, and that is strictly
> required in order to avoid collisions. Users also can't be ephemeral.
>
> To sketch out an example of how this would work, say we've got a new
> set_subuser() syscall and the username equivalent of chown().
>
> Now if we want to run firefox as a subuser, giving it access only
> .local/state/firefox, we'd do the following sequence of syscalls
> within the start of the new firefox process:
>
> mkdir(".local/state/firefox");
> chown_subuser(".local/state/firefox", "firefox"); /* now owned by $USER.firefox */
> set_subuser("firefox");
>
> If we want to guarantee uniqueness, we'd append a UUID to the
> subusername for the chown_subuser() call, and then for subsequent
> invocations read it with statx() (or subuser enabled equivalent) for
> the set_subuser() call.
>
> Now firefox is running in a sandbox, where it has no access to the
> rest of your home directory - unless explicitly granted with normal
> ACLs. And the sandbox requires no system configuration; rm -rfing the
> .local/state/firefox directory cleans everything up.
>
> And these trivially nest: Firefox itself wants to sandbox individual
> tabs from each other, so firefox could run each sub-process as a
> different subuser.
>
> This is dead easy compared to what we've been doing.

The above is the unprivileged use case. It works, but it's not all we
have to support.

> > > > However, neither proposal would get us out of the problem of
> > > > mount mapping because we'd have to keep the filesystem
> > > > permission check on the owning uid unless told otherwise.
> > >
> > > Not sure I follow?
> >
> > Mounting a filesystem inside a userns can cause huge security
> > problems if we map fs root to inner root without the admin blessing
> > it.  Think of binding /bin into the userns and then altering one of
> > the root owned binaries as inner root: if the permission check
> > passes, the change appears in system /bin.
>
> So with this proposal mount mapping becomes "map all users on this
> filesystem to subusers of username x". That's a much simpler mapping
> than mapping integer ranges to integer ranges, much easier to verify
> that there aren't accidental root escapes.

That doesn't work for the privileged container run in unprivileged
userns containment use case because we need a mapping from inner to
outer root.

> > > And it wouldn't have to be administrator assigned. Some
> > > administrator assignment might be required for the username <->
> > > 16 bit uid mapping, but if those mappings are ephemeral (i.e. if
> > > we get filesystems persistently storing usernames, which is easy
> > > enough with xattrs) then that just becomes "reserve x range of
> > > the 16 bit uid space for ephemeral translations".
> >
> > *if* the user names you're dealing with are all unprivileged.  When
> > we have a mix of privileged and unprivileged users owning the
> > files, the problems begin.
>
> Yes, all subusers are unprivileged - only one username, the empty
> username (which we'd probably map to root) maps to existing uid 0.

But, as I said above, that's only a subset of the use cases. The
equally big use case is figuring out how to run privileged containers
in a deprivileged mode and yet still allow them to update images (and
other things).

The unprivileged case is also solved by the proposal I referred you to,
so it's unclear what the advantage of strings (or even int arrays) adds
to this unless we can get a better handle on the privileged container
use case with them.

James


2024-02-22 03:47:00

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Thu, Feb 22, 2024 at 01:33:14AM +0100, James Bottomley wrote:
> On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> > Strings are just arrays of integers, and anyways this stuff would be
> > within helpers.
>
> Length limits and comparisons are the problem

We'd be using qstrs for this, not C strings, so they really are
equivalent to arrays for this purpose.
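
To illustrate the point, here is a userspace stand-in loosely modelled
on the kernel's struct qstr (precomputed hash plus explicit length plus
name pointer). struct uname_key and its helpers are hypothetical names,
and FNV-1a is swapped in purely for illustration; it is not the hash the
kernel uses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in loosely modelled on the kernel's struct qstr. */
struct uname_key {
    uint32_t hash;              /* precomputed hash of name[0..len) */
    uint32_t len;               /* explicit length; no NUL scan needed */
    const unsigned char *name;
};

/* FNV-1a, used here only for illustration. */
static uint32_t fnv1a(const unsigned char *s, uint32_t len)
{
    uint32_t h = 2166136261u;

    for (uint32_t i = 0; i < len; i++) {
        h ^= s[i];
        h *= 16777619u;
    }
    return h;
}

static struct uname_key uname_key_init(const char *s)
{
    struct uname_key k;

    assert(s != NULL);
    k.len = (uint32_t)strlen(s);
    k.name = (const unsigned char *)s;
    k.hash = fnv1a(k.name, k.len);
    return k;
}

/* Cheap integer compares first; memcmp only when hash and length agree. */
static bool uname_key_eq(const struct uname_key *a, const struct uname_key *b)
{
    return a->hash == b->hash && a->len == b->len &&
           memcmp(a->name, b->name, a->len) == 0;
}
```

In the common mismatch case the comparison is two integer compares, and
the length is bounded up front rather than discovered by scanning.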

>
> >
> > But what you're not seeing is the beauty and simplicity of killing
> > the mapping layer.
>
> Well, that's the problem: you don't for certain use cases. That's what
> I've been trying to explain. For the fully unprivileged use case,
> sure, it all works (as does the upper 32 bits proposal or the integer
> array ...) equally well.
>
> Once you're representing to the userns contained entity they have a
> privileged admin that can write to the fsimage as an apparently
> privileged user then the problems begin.

In what sense?

If they're in a userns and all their mounts are username mapped, that's
completely fine from a userns POV; they can put a suid root binary into
the fs image, but when they mount it that suid root will be suid to the
root user of their userns.

>
> > When usernames are strings all the way into the kernel, creating and
> > switching to a new user is a single syscall. You can't do that if
> > users are small integer identifiers to the kernel; you have to create
> > a new entry in /etc/passwd or some equivalent, and that is strictly
> > required in order to avoid collisions. Users also can't be ephemeral.
> >
> > To sketch out an example of how this would work, say we've got a new
> > set_subuser() syscall and the username equivalent of chown().
> >
> > Now if we want to run firefox as a subuser, giving it access only
> > .local/state/firefox, we'd do the following sequence of syscalls
> > within the start of the new firefox process:
> >
> > mkdir(".local/state/firefox");
> > chown_subuser(".local/state/firefox", "firefox"); /* now owned by $USER.firefox */
> > set_subuser("firefox");
> >
> > If we want to guarantee uniqueness, we'd append a UUID to the
> > subusername for the chown_subuser() call, and then for subsequent
> > invocations read it with statx() (or subuser enabled equivalent) for
> > the set_subuser() call.
> >
> > Now firefox is running in a sandbox, where it has no access to the
> > rest of your home directory - unless explicitly granted with normal
> > ACLs. And the sandbox requires no system configuration; rm -rfing the
> > .local/state/firefox directory cleans everything up.
> >
> > And these trivially nest: Firefox itself wants to sandbox individual
> > tabs from each other, so firefox could run each sub-process as a
> > different subuser.
> >
> > This is dead easy compared to what we've been doing.
>
> The above is the unprivileged use case. It works, but it's not all we
> have to support.

There is only one root user, in the sense of _actual_ root -
CAP_SYS_ADMIN and all that.
>
> > > > > However, neither proposal would get us out of the problem of
> > > > > mount mapping because we'd have to keep the filesystem
> > > > > permission check on the owning uid unless told otherwise.
> > > >
> > > > Not sure I follow?
> > >
> > > Mounting a filesystem inside a userns can cause huge security
> > > problems if we map fs root to inner root without the admin blessing
> > > it.  Think of binding /bin into the userns and then altering one of
> > > the root owned binaries as inner root: if the permission check
> > > passes, the change appears in system /bin.
> >
> > So with this proposal mount mapping becomes "map all users on this
> > filesystem to subusers of username x". That's a much simpler mapping
> > than mapping integer ranges to integer ranges, much easier to verify
> > that there aren't accidental root escapes.
>
> That doesn't work for the privileged container run in unprivileged
> userns containment use case because we need a mapping from inner to
> outer root.

I can't parse this. "Privileged container in an unprivileged
containment"? Do you just mean a container that has a root user (which is
only root over that container, not the rest of the system, of course)?

Any user is root over its subusers - so that works perfectly.

Or do you mean something else by "privileged container"? Do you mean a
container that actually has CAP_SYS_ADMIN?

> > > > And it wouldn't have to be administrator assigned. Some
> > > > administrator assignment might be required for the username <->
> > > > 16 bit uid mapping, but if those mappings are ephemeral (i.e. if
> > > > we get filesystems persistently storing usernames, which is easy
> > > > enough with xattrs) then that just becomes "reserve x range of
> > > > the 16 bit uid space for ephemeral translations".
> > >
> > > *if* the user names you're dealing with are all unprivileged.  When
> > > we have a mix of privileged and unprivileged users owning the
> > > files, the problems begin.
> >
> > Yes, all subusers are unprivileged - only one username, the empty
> > username (which we'd probably map to root) maps to existing uid 0.
>
> But, as I said above, that's only a subset of the use cases. The
> equally big use case is figuring out how to run privileged containers
> in a deprivileged mode and yet still allow them to update images (and
> other things).

If you're running in a userns, all your mounts get the same user mapping
as your userns - where that usermapping is just prepending the username
of the userns. That part is easy.

The big difficulty with letting them update images is that our current
filesystems really aren't ready for the mounting of untrusted images -
they're ~100k loc codebases each and the amount of hardening required is
significant. I would hazard a guess that XFS is the furthest along in
this respect (from all the screaming I hear from Darrick about syzkaller
it sounds like they're taking this the most seriously) - but I would
hesitate to depend on any of our filesystems to be secure in this
respect, even my own - not until we get them rewritten in Rust...

2024-02-22 08:45:53

by James Bottomley

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Wed, 2024-02-21 at 22:37 -0500, Kent Overstreet wrote:
> On Thu, Feb 22, 2024 at 01:33:14AM +0100, James Bottomley wrote:
> > On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> > > Strings are just arrays of integers, and anyways this stuff would
> > > be within helpers.
> >
> > Length limits and comparisons are the problem
>
> We'd be using qstrs for this, not c strings, so they really are
> equivalent to arrays for this purpose.
>
> >
> > >
> > > But what you're not seeing is the beauty and simplicity of
> > > killing the mapping layer.
> >
> > Well, that's the problem: you don't for certain use cases.  That's
> > what I've been trying to explain.  For the fully unprivileged use
> > case, sure, it all works (as does the upper 32 bits proposal or the
> > integer array ...) equally well.
> >
> > Once you're representing to the userns contained entity they have a
> > privileged admin that can write to the fsimage as an apparently
> > privileged user then the problems begin.
>
> In what sense?
>
> If they're in a userns and all their mounts are username mapped,
> that's completely fine from a userns POV; they can put a suid root
> binary into the fs image but when they mount that suid root will be
> suid to the root user of their userns.

If userns root can alter a suid root binary that's bind mounted from
the root namespace then that's a security violation because a user in
the root ns could use the altered binary to do a privilege escalation
attack.

> > > When usernames are strings all the way into the kernel, creating
> > > and switching to a new user is a single syscall. You can't do
> > > that if users are small integer identifiers to the kernel; you
> > > have to create a new entry in /etc/passwd or some equivalent, and
> > > that is strictly required in order to avoid collisions. Users
> > > also can't be ephemeral.
> > >
> > > To sketch out an example of how this would work, say we've got a
> > > new set_subuser() syscall and the username equivalent of chown().
> > >
> > > Now if we want to run firefox as a subuser, giving it access only
> > > .local/state/firefox, we'd do the following sequence of syscalls
> > > within the start of the new firefox process:
> > >
> > > mkdir(".local/state/firefox");
> > > chown_subuser(".local/state/firefox", "firefox"); /* now owned by $USER.firefox */
> > > set_subuser("firefox");
> > >
> > > If we want to guarantee uniqueness, we'd append a UUID to the
> > > subusername for the chown_subuser() call, and then for subsequent
> > > invocations read it with statx() (or subuser enabled equivalent)
> > > for the set_subuser() call.
> > >
> > > Now firefox is running in a sandbox, where it has no access to
> > > the rest of your home directory - unless explicitly granted with
> > > normal ACLs. And the sandbox requires no system configuration; rm
> > > -rfing the .local/state/firefox directory cleans everything up.
> > >
> > > And these trivially nest: Firefox itself wants to sandbox
> > > individual tabs from each other, so firefox could run each sub-
> > > process as a different subuser.
> > >
> > > This is dead easy compared to what we've been doing.
> >
> > The above is the unprivileged use case.  It works, but it's not all
> > we have to support.
>
> There is only one root user, in the sense of _actual_ root -
> CAP_SYS_ADMIN and all that.

No, that's not correct. CAP_SYS_ADMIN is replaced by ns_capable() for
the user namespace. The creating entity of the userns becomes the ID
for which ns_capable() returns true. The whole goal of deprivileging
containers is to get the container root to seem like it has
CAP_SYS_ADMIN but in fact it's only ns_capable(). Certain features
which are allowed to the userns admin (like filesystem mappings of
inner root) are policy decisions the root namespace admin needs to
make.

> >
> > > > > > However, neither proposal would get us out of the problem
> > > > > > of mount mapping because we'd have to keep the filesystem
> > > > > > permission check on the owning uid unless told otherwise.
> > > > >
> > > > > Not sure I follow?
> > > >
> > > > Mounting a filesystem inside a userns can cause huge security
> > > > problems if we map fs root to inner root without the admin
> > > > blessing it.  Think of binding /bin into the userns and then
> > > > altering one of the root owned binaries as inner root: if the
> > > > permission check passes, the change appears in system /bin.
> > >
> > > So with this proposal mount mapping becomes "map all users on
> > > this filesystem to subusers of username x". That's a much simpler
> > > mapping than mapping integer ranges to integer ranges, much
> > > easier to verify that there aren't accidental root escapes.
> >
> > That doesn't work for the privileged container run in unprivileged
> > userns containment use case because we need a mapping from inner to
> > outer root.
>
> I can't parse this. "Privileged container in an unprivileged
> containment"? Do you just mean a container that has root user (which
> is only root over that container, not the rest of the system, of
> course).

A privileged container is one that has services that run as root, yes.

> Any user is root over its subusers - so that works perfectly.

That's only one aspect of what container root might need to be able to
do.

> Or do you mean something else by "privileged container"? Do you mean
> a container that actually has CAP_SYS_ADMIN?

That's what docker currently does when it creates a privileged
container, yes. However, CAP_SYS_ADMIN is too powerful and can
trivially break containment, meaning this isn't a workable solution for
container security. What we need is a container that can bring up
privileged services without root namespace CAP_SYS_ADMIN.

>
> > > > > And it wouldn't have to be administrator assigned. Some
> > > > > administrator assignment might be required for the username
> > > > > <-> 16 bit uid mapping, but if those mappings are ephemeral
> > > > > (i.e. if we get filesystems persistently storing usernames,
> > > > > which is easy enough with xattrs) then that just becomes
> > > > > "reserve x range of the 16 bit uid space for ephemeral
> > > > > translations".
> > > >
> > > > *if* the user names you're dealing with are all unprivileged. 
> > > > When we have a mix of privileged and unprivileged users owning
> > > > the files, the problems begin.
> > >
> > > Yes, all subusers are unprivileged - only one username, the
> > > empty username (which we'd probably map to root) maps to existing
> > > uid 0.
> >
> > But, as I said above, that's only a subset of the use cases.  The
> > equally big use case is figuring out how to run privileged
> > containers in a deprivileged mode and yet still allow them to
> > update images (and other things).
>
> If you're running in a userns, all your mounts get the same user
> mapping as your userns - where that usermapping is just prepending
> the username of the userns. That part is easy.

No, it's not. Any filesystem that's specific *only* to the container
can do an inner root to real root mapping. Any bind mount visible from
outside can't be allowed to do this because of the suid security issue
above. Determining this "visibility" is really hard, which is why it's
become a policy based mapping controlled by the root namespace admin.

> The big difficulty with letting them update images is that our
> current filesystems really aren't ready for the mounting of untrusted
> images - they're ~100k loc codebases each and the amount of hardening
> required is significant. I would hazard to guess that XFS is the
> furthest along is this respect (from all the screaming I hear from
> Darrick about syzkaller it sounds like they're taking this the most
> seriously) - but I would hesitate to depend on any of our filesystems
> to be secure in this respect, even my own - not until we get them
> rewritten in Rust...

This is a completely separate issue: whether we can allow an
unprivileged container to mount a fs image that might have been crafted
to attack the system. Most FS developers believe we'll never reach
the point where any specially crafted fs image is safe to mount by an
unprivileged user, so again, whether the container is allowed to mount a
fs from a block or network device becomes a policy issue for the root
namespace admin rather than something we can globally allow.

James


2024-02-22 11:54:53

by Kent Overstreet

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Thu, Feb 22, 2024 at 09:45:32AM +0100, James Bottomley wrote:
> On Wed, 2024-02-21 at 22:37 -0500, Kent Overstreet wrote:
> > On Thu, Feb 22, 2024 at 01:33:14AM +0100, James Bottomley wrote:
> > > On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> > > > Strings are just arrays of integers, and anyways this stuff would
> > > > be within helpers.
> > >
> > > Length limits and comparisons are the problem
> >
> > We'd be using qstrs for this, not c strings, so they really are
> > equivalent to arrays for this purpose.
> >
> > >
> > > >
> > > > But what you're not seeing is the beauty and simplicity of
> > > > killing the mapping layer.
> > >
> > > Well, that's the problem: you don't for certain use cases.  That's
> > > what I've been trying to explain.  For the fully unprivileged use
> > > case, sure, it all works (as does the upper 32 bits proposal or the
> > > integer array ...) equally well.
> > >
> > > Once you're representing to the userns contained entity they have a
> > > privileged admin that can write to the fsimage as an apparently
> > > privileged user then the problems begin.
> >
> > In what sense?
> >
> > If they're in a userns and all their mounts are username mapped,
> > that's completely fine from a userns POV; they can put a suid root
> > binary into the fs image but when they mount that suid root will be
> > suid to the root user of their userns.
>
> if userns root can alter a suid root binary that's bind mounted from
> the root namespace then that's a security violation because a user in
> the root ns could use the altered binary to do a privilege escalation
> attack.

That's a completely different situation; now you're talking about suid
root, where root is _outside_ the userns. If you're playing tricks to
make something owned by a user from outside the ns, one that is not
representable in the ns, visible in that ns, and then you're making that
something suid, of course you're going to have trouble defining
self-consistent behaviour.

So I'm not sure what point you were trying to make, but it does
illustrate some key points.

Any time you're creating a system where different agents can have
different but overlapping views of the world, you're going to have some
really fun corner cases and it's going to be hard to reason about.

(if you buy me a really nice scotch sometime I'll tell you about fsck for
a snapshotting filesystem, where for performance reasons fsck has to
process keys from all snapshots simultaneously).

So such things are best avoided, if we can. For another example, see if
you know anyone who's had to track down what's keeping a mount alive,
when the something was a systemd service running in a private namespace.

Systems where we can recursively enumerate the world are much nicer to
work with.

Now, back to user namespaces: they shouldn't exist.

And they wouldn't exist, if usernames had started out as a recursive
structure instead of a flat namespace. But since they started out as a
flat namespace, and only _later_ we realized they actually needed to be
a tree structure - but we have to preserve for compatibility the _view_
of the world as a flat namespace! - that's why we have user namespaces.

And you get all sorts of super weird corner cases like you just
described.

So let's take a step back from all that, and instead of reasoning from
"what weird corner cases from our current system do we have to support"
- instead, just see what we can do with a cleaner model and get that
properly specified. A good model helps you make sense of the world even
in crazy situations.

With that in mind, back to your bind mount thing: if you chroot(), and
you try to access a symlink that points outside the chroot, what
happens?

2024-02-29 08:52:21

by Shyam Prasad N

Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Wed, Feb 21, 2024 at 7:45 AM NeilBrown wrote:
>
> On Wed, 21 Feb 2024, Matthew Wilcox wrote:
> > On Tue, Feb 20, 2024 at 07:25:58PM -0500, Kent Overstreet wrote:
> > > But there's real advantages to getting rid of the string <-> integer
> > > identifier mapping and plumbing strings all the way through:
> > >
> > > - creating a new sub-user can be done with nothing more than the new
> > > username version of setuid(); IOW, we can start a new named subuser
> > > for e.g. firefox without mucking with _any_ system state or tables
> > >
> > > - sharing filesystems between machines is always a pita because
> > > usernames might be the same but uids never are - let's kill that off,
> > > please
> >
> > I feel like we need a bit of a survey of filesystems to see what is
> > already supported and what are desirable properties. Block filesystems
> > are one thing, but network filesystems have been dealing with crap like
> > this for decades. I don't have a good handle on who supports what at
> > this point.
>
> NFSv4 uses textual user and group names. We have an "idmap" service
> which maps between name and number on each end.
> This is needed when krb5 is used as kerberos identities are names, not
> numbers.
>
> But in my (admittedly limited) experience, when krb5 isn't used (and
> probably also when it is), uids do match across the network.
> While the original NFSv4 didn't support it, an addendum allows
> usernames made entirely of digits to be treated as numerical uids, and
> that is what (almost) everyone uses.
>

This may not always be a fair assumption. In today's world, Linux
systems need to co-exist with other systems.
Take Linux SMB as an example, which in many ways is
similar to NFS in this context. The difference here is that the
servers (or clients) that we need to interact with deal with variable
length identifiers. Traditionally, these have been Windows SIDs. More
recently, the Azure identity service (Microsoft Entra) has moved on to
even more generic identifiers.
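
As an illustration of why such identifiers do not fit a fixed-width
integer, here is a small sketch that decodes the binary form of a
Windows SID into its usual "S-1-..." string. The function name is made
up for this example. The layout it assumes is the standard one: a
revision byte, a sub-authority count, a 6-byte big-endian authority,
then that many 32-bit little-endian sub-authorities.

```python
import struct


def sid_to_string(raw: bytes) -> str:
    """Decode a binary SID into its textual S-R-A-S1-S2-... form."""
    revision, count = raw[0], raw[1]
    if len(raw) != 8 + 4 * count:
        raise ValueError("length does not match sub-authority count")
    authority = int.from_bytes(raw[2:8], "big")
    subs = struct.unpack_from(f"<{count}I", raw, 8)
    return "-".join(["S", str(revision), str(authority), *map(str, subs)])


# Hand-built sample: authority 5 with two sub-authorities (32, 544).
sample = bytes([1, 2]) + (5).to_bytes(6, "big") + struct.pack("<2I", 32, 544)
print(sid_to_string(sample), len(sample))
```

The encoded length grows by four bytes per sub-authority, so even this
short two-component example takes 16 bytes, well beyond a 32-bit uid.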

I actually agree with Kent on this point (on the ability to map UIDs
to variable length identifiers). We are making an assumption here that
there is a global numerical identifier that all identity providers
provide, and that it fits in a 32- or 64-bit space, which may not always
be true. The Linux SMB ecosystem (kernel SMB client/server, Samba etc.)
has to map these identifiers in a lot of hacky ways. And I don't
think this problem is limited just to SMB filesystems.

Having native support in the kernel to at least map UID/GID to a
variable length identifier (using user namespaces) would really help.
Of course, it can be done in a backward-compatible way, where existing
systems can survive without any changes to their design.
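
A rough sketch of the kind of mapping meant here, written as plain
user-space Python purely for illustration. The class name, its methods
and the uid range are all hypothetical; this is not an existing kernel
or Samba interface. Each namespace gets a table from numeric uids to
variable-length identifiers, and any uid without an entry keeps behaving
as a bare number, which is the backward-compatible path.

```python
class IdMap:
    """Per-namespace table between numeric uids and variable-length identifiers."""

    def __init__(self, first_uid: int, count: int):
        self._next = first_uid           # next uid to hand out
        self._end = first_uid + count    # one past the last usable uid
        self._by_uid: dict[int, str] = {}
        self._by_name: dict[str, int] = {}

    def uid_for(self, identifier: str) -> int:
        """Return the uid for an identifier, allocating one on first use."""
        if identifier in self._by_name:
            return self._by_name[identifier]
        if self._next >= self._end:
            raise OverflowError("uid range for this namespace is exhausted")
        uid = self._next
        self._next += 1
        self._by_uid[uid] = identifier
        self._by_name[identifier] = uid
        return uid

    def identifier_for(self, uid: int) -> str:
        """Unmapped uids fall back to their decimal form (legacy behaviour)."""
        return self._by_uid.get(uid, str(uid))


idmap = IdMap(first_uid=100000, count=65536)
uid = idmap.uid_for("S-1-5-21-1-2-3-1001")
print(uid, idmap.identifier_for(uid), idmap.identifier_for(1000))
```

Existing tools would keep seeing ordinary uids, while a filesystem that
knows about the table could store and send the full identifier.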

> It is certainly useful to mount "my" files from some other machine and
> have them appear to have "my" uid locally which might be different from
> the remote uid. I think when two different machines both have two or
> more particular users, it is extremely likely that a central uid data
> base will be in use (ldap?) and so all uids will match. No mapping
> needed.
>
> (happy to be proved wrong...)
>
> NeilBrown
>
>
> >
> > As far as usernames being the same ... well, maybe. I've been willy,
> > mrw103, wilma (twice!), mawilc01 and probably a bunch of others I don't
> > remember. I don't think we'll ever get away from having a mapping
> > between different naming authorities.
> >
> >
>
>


--
Regards,
Shyam