Hi,
Some issues have been raised by Christoph Hellwig over how I'm handling
filesystem security in my CacheFiles module, and I'd like advice on how to deal
with them.
CacheFiles stores its cache objects as files and directories in a tree under a
directory nominated by the configuration. This means the data it is holding
(a) is potentially exposed to userspace, and (b) must be labelled for access
control according to the usual filesystem rules.
Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
own pathwalk through the cache and whilst creating files and directories in the
cache. This allows it to deal with DAC security directly. All the directories
it creates are given permissions mask 0700 and all files 0000.
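For illustration, here is roughly the kind of fsuid/fsgid switch being
described; this is a minimal sketch only - the helper names and details are
illustrative, not the actual CacheFiles code, and it assumes the 2.6-era
fsuid/fsgid fields on task_struct:

	static void cachefiles_begin_secure(uid_t *saved_fsuid, gid_t *saved_fsgid)
	{
		/* remember the caller's fs identity and become root for cache I/O */
		*saved_fsuid = current->fsuid;
		*saved_fsgid = current->fsgid;
		current->fsuid = 0;
		current->fsgid = 0;
	}

	static void cachefiles_end_secure(uid_t saved_fsuid, gid_t saved_fsgid)
	{
		/* restore the caller's fs identity before returning to the netfs */
		current->fsuid = saved_fsuid;
		current->fsgid = saved_fsgid;
	}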
However, Christoph has objected to this practice, and has said that I'm not
allowed to change fsuid and fsgid. The problem with not doing so is that this
code is running in the context of the process that issued the original open(),
read(), write(), etc, and so any accesses or creations it does would be done
with that process's fsuid and fsgid, which would lead to a cache with bits that
can't be shared between users.
Another thing I'm currently doing is bypassing the usual calls to the LSM
hooks. This means that I'm not setting and checking security labels and MACs.
The reason for this is again that I'm running in some random process's context
and labelling and MAC'ing will affect the sharability of the cache. This was
objected to also.
This also bypasses auditing (I think). I don't want the CacheFiles module's
access to the cache to be logged against whatever process was accessing, say,
an NFS file. That process didn't ask to access the cache, and the cache is
meant to be transparent.
I can see a few ways to deal with this:
(1) Do all the cache operations in their own thread (sort of like knfsd).
(2) Add further security ops for the caching code to call. These might be of
use elsewhere in the kernel. These would set cache-specific security
labels and check for them.
(3) Add a flag or something to current to override the normal security on the
basis that it should be using the cache's security rather than the
process's security.
Option (1) would lead to serialisation of all cached NFS operations over the
accesses to disk. Some of these accesses to disk involve sequences of mkdir(),
lookup(), [sg]etxattr() and create() inode op calls - all of which are
synchronous. On the other hand, option (1) gives the simplest way of
overriding security, and would entail the least work. It would, though, give
the biggest performance degradation.
Option (2) would require additional security ops, though possibly not many. It
might be sufficient to just add two: to set and to check the security for a
cache object (be it file or directory). This option would also preclude use of
the vfs_mkdir() VFS op and suchlike, as they use the wrong sort of security.
Some other way of handling the UNIX file ownership would also have to be sought.
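To make option (2) concrete, a purely illustrative sketch of the two ops; the
hooks don't exist and the names and signatures are made up for the example:

	struct cache_security_operations_sketch {
		/* label a newly created cache object (file or directory) */
		int (*set_cache_object_security)(struct dentry *dentry);
		/* check that the cache may use an existing cache object */
		int (*check_cache_object_security)(struct dentry *dentry);
	};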
Option (3) would require the current security handling in both the filesystem
(DAC) and security modules (MAC) to check for the overriding context in
current, and to go with that instead. It might be possible to make such a
thing incorporate fsuid and fsgid.
Thoughts anyone?
David
On 10/25/06, David Howells <[email protected]> wrote:
>
> Hi,
>
> Some issues have been raised by Christoph Hellwig over how I'm handling
> filesystem security in my CacheFiles module, and I'd like advice on how to deal
> with them.
>
> CacheFiles stores its cache objects as files and directories in a tree under a
> directory nominated by the configuration. This means the data it is holding
> (a) is potentially exposed to userspace, and (b) must be labelled for access
> control according to the usual filesystem rules.
>
> Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
> own pathwalk through the cache and whilst creating files and directories in the
> cache. This allows it to deal with DAC security directly. All the directories
> it creates are given permissions mask 0700 and all files 0000.
>
> However, Christoph has objected to this practice, and has said that I'm not
> allowed to change fsuid and fsgid. The problem with not doing so is that this
> code is running in the context of the process that issued the original open(),
> read(), write(), etc, and so any accesses or creations it does would be done
> with that process's fsuid and fsgid, which would lead to a cache with bits that
> can't be shared between users.
I don't really understand the objection here. Is it likely to cause
security breaches? None of the proposed solutions seem particularly
elegant, so arguing that the current approach is a hack doesn't hold
much water with me.
> Another thing I'm currently doing is bypassing the usual calls to the LSM
> hooks. This means that I'm not setting and checking security labels and MACs.
> The reason for this is again that I'm running in some random process's context
> and labelling and MAC'ing will affect the sharability of the cache. This was
> objected to also.
>
> This also bypasses auditing (I think). I don't want the CacheFiles module's
> access to the cache to be logged against whatever process was accessing, say,
> an NFS file. That process didn't ask to access the cache, and the cache is
> meant to be transparent.
Christoph, are you objecting to this behavior as well? This seems
like the desired outcome. Do you think there is buggy behavior here,
or do you just have issues with David's design? Can you suggest any
alternatives of your own?
NATE
SELinux support addresses all of these issues for B1 level security quite
well with mandatory access controls at the fs layers. In fact, it works so
well that, when enabled, you cannot even run apache on top of an FS unless
configured properly.
Jeff
Jeff V. Merkey <[email protected]> wrote:
> SELinux support addresses all of these issues for B1 level security quite
> well with mandatory access controls at the fs layers. In fact, it works so
> well, when enabled you cannot even run apache on top of an FS unless
> configured properly.
How? The problem I've got is that the caching code would be creating and
accessing files and directories with the wrong security context - that of the
calling process - and not a context suitable for sharing things in the cache
whilst protecting them from userspace as best we can.
David
David Howells wrote:
>Jeff V. Merkey <[email protected]> wrote:
>
>
>
>>SELinux support addresses all of these issues for B1 level security quite
>>well with mandatory access controls at the fs layers. In fact, it works so
>>well, when enabled you cannot even run apache on top of an FS unless
>>configured properly.
>>
>>
>
>How? The problem I've got is that the caching code would be creating and
>accessing files and directories with the wrong security context - that of the
>calling process - and not a context suitable for sharing things in the cache
>whilst protecting them from userspace as best we can.
>
>
Have it access them as 0.0 (root) when you change the fsuid, etc. and I
think this would satisfy security concerns. I agree that it sounds like
someone needs to instrument MAC layers with this subsystem.
Jeff
Jeff V. Merkey <[email protected]> wrote:
> Have it access them as 0.0 (root) when you change the fsuid, etc. and I think
> this would satisfy security concerns.
That's what I'm currently doing, but Christoph objected and said I'm not
allowed to change fsuid and fsgid.
That also doesn't cover the MAC issues.
> I agree that it sounds like someone needs to instrument MAC layers with this
> subsystem.
Yes.
David
On Wed, Oct 25, 2006 at 11:14:16AM +0100, David Howells wrote:
..
> Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
> own pathwalk through the cache and whilst creating files and directories in the
> cache. This allows it to deal with DAC security directly. All the directories
> it creates are given permissions mask 0700 and all files 0000.
Unionfs used to do the same thing, until we decided that it was better to do
it some other way. (We went with a work queue approach.)
Hrm. How do you do DAC checks if you don't copy the permissions over
unaltered?
I'm wondering, why don't you just duplicate all the attributes of the files
(including xattrs)? That would take care of most if not all the DAC/MAC
issues, no?
> I can see a few ways to deal with this:
>
> (1) Do all the cache operations in their own thread (sort of like knfsd).
In our case it works well, however we have only very specific times when we
need to use the work queue, so the performance hit doesn't hurt us as much
as it would hurt you - I'm assuming you'd be using the thread for a sizable
portion of calls you get.
> (2) Add further security ops for the caching code to call. These might be of
> use elsewhere in the kernel. These would set cache-specific security
> labels and check for them.
I'm thinking that it would be nice to combine the caching related security
code with those for stackable filesystems. I realize that there may not
really be many things that need to have LSM hooks, but stackable filesystems
should be something to keep in mind now that ecryptfs is in (and hopefully
Unionfs will follow shortly :) ). The SELinux guys would probably know
what's needed.
> (3) Add a flag or something to current to override the normal security on the
> basis that it should be using the cache's security rather than the
> process's security.
Umm... this sounds a little bit too hacky (and a fair amount of code would
have to get changed). I'd prefer a more general solution that applies to more
than just caching.
Josef "Jeff" Sipek.
--
*NOTE: This message is ROT-13 encrypted twice for extra protection*
On Wed, Oct 25, 2006 at 04:21:55PM -0400, Josef Sipek wrote:
...
> I'm wondering, why don't just you duplicate all the attributes of the files
> (including xattrs)? That would take care of most if not all the DAC/MAC
> issues, no?
By that I mean issues associated with accessing the cache, not creating the
cache copies.
Josef "Jeff" Sipek.
--
Note 96.3% of all statistics are fiction.
On Wed, 2006-10-25 at 11:14 +0100, David Howells wrote:
> Hi,
>
> Some issues have been raised by Christoph Hellwig over how I'm handling
> filesystem security in my CacheFiles module, and I'd like advice on how to deal
> with them.
>
> CacheFiles stores its cache objects as files and directories in a tree under a
> directory nominated by the configuration. This means the data it is holding
> (a) is potentially exposed to userspace, and (b) must be labelled for access
> control according to the usual filesystem rules.
>
> Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
> own pathwalk through the cache and whilst creating files and directories in the
> cache. This allows it to deal with DAC security directly. All the directories
> it creates are given permissions mask 0700 and all files 0000.
>
> However, Christoph has objected to this practice, and has said that I'm not
> allowed to change fsuid and fsgid. The problem with not doing so is that this
> code is running in the context of the process that issued the original open(),
> read(), write(), etc, and so any accesses or creations it does would be done
> with that process's fsuid and fsgid, which would lead to a cache with bits that
> can't be shared between users.
>
> Another thing I'm currently doing is bypassing the usual calls to the LSM
> hooks. This means that I'm not setting and checking security labels and MACs.
> The reason for this is again that I'm running in some random process's context
> and labelling and MAC'ing will affect the sharability of the cache. This was
> objected to also.
>
> This also bypasses auditing (I think). I don't want the CacheFiles module's
> access to the cache to be logged against whatever process was accessing, say,
> an NFS file. That process didn't ask to access the cache, and the cache is
> meant to be transparent.
>
> I can see a few ways to deal with this:
>
> (1) Do all the cache operations in their own thread (sort of like knfsd).
>
> (2) Add further security ops for the caching code to call. These might be of
> use elsewhere in the kernel. These would set cache-specific security
> labels and check for them.
>
> (3) Add a flag or something to current to override the normal security on the
> basis that it should be using the cache's security rather than the
> process's security.
>
>
> Option (1) would lead to serialisation of all cached NFS operations over the
> accesses to disk. Some of these accesses to disk involve sequences of mkdir(),
> lookup(), [sg]etxattr() and create() inode op calls - all of which are
> synchronous. On the other hand, option (1) gives the simplest way of
> overriding security, and would entail the least work. It would, though, give
> the biggest performance degradation.
>
> Option (2) would require additional security ops, though possibly not many. It
> might be sufficient to just add two: to set and to check the security for a
> cache object (be it file or directory). This option would also preclude use of
> the vfs_mkdir() VFS op and suchlike, as they use the wrong sort of security.
> Some other way of handling the UNIX file ownership would also have to be sought.
>
> Option (3) would require the current security handling in both the filesystem
> (DAC) and security modules (MAC) to check for the overriding context in
> current, and to go with that instead. It might be possible to make such a
> thing incorporate fsuid and fsgid.
>
> Thoughts anyone?
So you have two problems:
1) You want the cachefiles kernel module to be able to internally access
the cache files without regard to the permissions of the current
process.
2) You want to control how userspace may access the cache files in a way
that allows the cache daemon to do its job without exposing the cache to
corruption or leakage by other processes.
With regard to (1), bypassing security checks for internal access is
already done in certain cases (e.g. fs-private inodes, kernel-internal
sockets). That can be done just by a flag on the task as in option (3);
it is separate from the issue of how to label the new file.
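For illustration, the "flag on the task" idea could look something like this
purely hypothetical sketch (the PF_ bit below does not exist and its value is
made up):

	#define PF_CACHE_ACCESS	0x80000000	/* hypothetical: doing internal cache access */

	static inline int task_doing_cache_access(struct task_struct *tsk)
	{
		return (tsk->flags & PF_CACHE_ACCESS) != 0;
	}

Permission-checking code could then bypass or relax its checks when the flag
is set on the current task.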
Your solutions for (2) seem to treat the entire cache as having the same
security properties regardless of the files that are being cached - a
single ownership and security context for all cache files. This
requires a fully trusted cache daemon that can arbitrarily tamper with
any content cached in this manner. I think you need to support at least
per-cache security attributes, and allow people to run separate
instances of the cache daemon for caches with different security
attributes (if not already possible), so that no single cache daemon
process needs to be all powerful.
If I understand correctly, you don't want to just propagate the security
attributes of the original file being cached to the cache file, because
that would expose the cache to corruption by processes that can access
the original file and because you would then have to deal with updating
those attributes upon changes to the original's attributes. But
per-cache security attributes would be a reasonable approach to enabling
finer-grained protection.
Then the question is just how the cachefile module sets the security
attributes (including uid/gid and security context) to the appropriate
per-cache value when creating the cache files. That could just be a
matter of setting the fsuid/fsgid and fscreate attributes (the latter
requiring a hook call) prior to creation, if that approach is palatable.
It would help to understand the objection to setting fsuid/fsgid more
clearly - it may have had more to do with always setting them to 0 for
all cache files, or with using that as a way to work around the lack of
an explicit mechanism for bypassing security checks for internal
accesses than with the setting of the fsuid/fsgid itself.
--
Stephen Smalley
National Security Agency
On Wed, 2006-10-25 at 11:14 +0100, David Howells wrote:
> Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
> own pathwalk through the cache and whilst creating files and directories in the
> cache. This allows it to deal with DAC security directly. All the directories
> it creates are given permissions mask 0700 and all files 0000.
That seems sensible and fine. It is precisely why we added a separate
fsuid in the first place so that the user space nfsd could take on an fs
identity without breaking signal and other security based forms.
> (1) Do all the cache operations in their own thread (sort of like knfsd).
Slowing it down to keep Christoph happy seems iffy.
> (2) Add further security ops for the caching code to call. These might be of
> use elsewhere in the kernel. These would set cache-specific security
> labels and check for them.
I can see good arguments for this in some cases where you want strict
divisions for extremely secure computing, but not usually.
> Thoughts anyone?
I'd like to know more about why Christoph is objecting, whether he has
actual real world examples of races/problems it introduces by altering
fsuid or what his concern is.
On Thu, Oct 26, 2006 at 12:37:39AM +0100, Alan Cox wrote:
> On Wed, 2006-10-25 at 11:14 +0100, David Howells wrote:
> > Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst doing its
> > own pathwalk through the cache and whilst creating files and directories in the
> > cache. This allows it to deal with DAC security directly. All the directories
> > it creates are given permissions mask 0700 and all files 0000.
>
> That seems sensible and fine. It is precisely why we added a separate
> fsuid in the first place so that the user space nfsd could take on an fs
> identity without breaking signal and other security based forms.
I see a problem with that; not sure if that's what Christoph is objecting
to. What about access to the cache tree by a root process that has nothing
to do with that daemon? Should it get free access to that stuff, regardless
of what policy might say about access to cached files? Or should we at
least try to make sure that we have the instances in cache no more permissive
than originals on NFS?
David Howells wrote:
...
> I can see a few ways to deal with this:
>
> (1) Do all the cache operations in their own thread (sort of like knfsd).
>
> (2) Add further security ops for the caching code to call. These might be of
> use elsewhere in the kernel. These would set cache-specific security
> labels and check for them.
>
> (3) Add a flag or something to current to override the normal security on the
> basis that it should be using the cache's security rather than the
> process's security.
Why again no local userspace daemon to do the caching? That would
put the policy out of the kernel. The additional context switches
are probably pretty cheap compared to the I/O operations.
Thanks,
Jan
Josef Sipek <[email protected]> wrote:
> Hrm. How do you do DAC checks if you don't copy over the permissions without
> alteration?
You have to remember there are two filesystem layers involved. NFS or other
netfs does the DAC, MAC, whatever checks to see whether the user can access a
file. NFS then asks the cache to back the netfs file and the cache creates a
file in the local filesystem to do that.
The cache file doesn't need the DAC/MAC/whatever attributes applied to the
netfs file, and, in fact, may not be able to support what the netfs deals
with.
> I'm wondering, why don't just you duplicate all the attributes of the files
> (including xattrs)? That would take care of most if not all the DAC/MAC
> issues, no?
You're forgetting that the userspace cache manager daemon still has to access
the cache.
> > (1) Do all the cache operations in their own thread (sort of like knfsd).
>
> In our case it works well, however we have only very specific times when we
> need to use the work queue, so the performance hit doesn't hurt us as much
> as it would hurt you - I'm assuming you'd be using the thread for a sizable
> portion of calls you get.
I'm not sure exactly. Actually, I could probably deal with read/write ops
inline - though I don't have a file struct to carry a security context - but
getting and releasing inodes would certainly wind up being farmed off.
Consider the automounter releasing an NFS share that's been heavily used...
> I'm thinking that it would be nice to combine the caching related security
> code with those for stackable filesystems.
That's fine by me, though I want the security on a cache file to be different
to that on the netfs file it's backing.
David
Stephen Smalley <[email protected]> wrote:
> So you have two problems:
> 1) You want the cachefiles kernel module to be able to internally access
> the cache files without regard to the permissions of the current
> process.
> 2) You want to control how userspace may access the cache files in a way
> that allows the cache daemon to do its job without exposing the cache to
> corruption or leakage by other processes.
That's about it, yes.
I wanted the management daemon to live in userspace because it has to create a
couple of big tables of cullable objects and then keep them around for when
culling becomes a necessity. Filling the tables takes a long time, and it
gets longer the more objects you have in the cache.
> With regard to (1), bypassing security checks for internal access is
> already done in certain cases (e.g. fs-private inodes, kernel-internal
> sockets). That can be done just by a flag on the task as in option (3);
> it is separate from the issue of how to label the new file.
Yeah...
> Your solutions for (2) seem to treat the entire cache as having the same
> security properties regardless of the files that are being cached - a
> single ownership and security context for all cache files.
Yes. NFS handles the security for the user, so for files in the cache that
are acting as data storage objects, I can just have as restrictive a security
as I can manage.
> This requires a fully trusted cache daemon that can arbitrarily tamper with
> any content cached in this manner.
Well, we could say that the cache daemon isn't actually allowed to read or
write any of these files; all it needs to do is read their cache xattrs, names
and atimes, and rename and delete them. It also needs to be able to do
readdir and lookup on the directories contained therein.
> I think you need to support at least per-cache security attributes, and
> allow people to run separate instances of the cache daemon for caches with
> different security attributes (if not already possible), so that no single
> cache daemon process needs to be all powerful.
My plan is to have a separate cache daemon for each cache. That way the cull
table filling isn't serialised over all caches.
> If I understand correctly, you don't want to just propagate the security
> attributes of the original file being cached to the cache file,
That is correct.
> because that would expose the cache to corruption by processes that can
> access the original file
Yes.
> and because you would then have to deal with updating those attributes upon
> changes to the original's attributes.
Yes.
But also because the netfs may impose stricter access controls than can be
represented in the filesystem underlying the cache.
But mainly because the cache management daemon has to be able to operate on
the cache.
> But per-cache security attributes would be a reasonable approach to enabling
> finer-grained protection.
I'm not sure what you mean by this exactly. I'd be happy to set the same
security attributes on the files and directories in the cache that impose
draconian restrictions, but whatever I do has to be representable in the
backing filesystem across reboots.
> Then the question is just how the cachefile module sets the security
> attributes (including uid/gid and security context) to the appropriate
> per-cache value when creating the cache files.
Yes.
> That could just be a matter of setting the fsuid/fsgid and fscreate
> attributes (the latter requiring a hook call) prior to creation, if that
> approach is palatable.
It is to me.
> It would help to understand the objection to setting fsuid/fsgid more
> clearly - it may have had more to do with always setting them to 0 for all
> cache files, or with using that as a way to work around the lack of an
> explicit mechanism for bypassing security checks for internal accesses than
> with the setting of the fsuid/fsgid itself.
Christoph did enscribe thus:
| - in cachefiles_walk_to_object reseting the fsuid/fsgid is not allowed
Which I take to mean that I'm not allowed to change fsuid and fsgid. Why not,
I'm not entirely certain. I wouldn't have thought it would matter as long as
I put them back again before returning.
David
Al Viro <[email protected]> wrote:
> > > Currently, CacheFiles temporarily changes fsuid and fsgid to 0 whilst
> > > doing its own pathwalk through the cache and whilst creating files and
> > > directories in the cache. This allows it to deal with DAC security
> > > directly. All the directories it creates are given permissions mask
> > > 0700 and all files 0000.
> >
> > That seems sensible and fine. It is precisely why we added a separate
> > fsuid in the first place so that the user space nfsd could take on an fs
> > identity without breaking signal and other security based forms.
>
> I see a problem with that; not sure if that's what Christoph is objecting
> to. What about access to cache tree by root process that has nothing
> to do with that daemon? Should it get free access to that stuff, regardless
> of what policy might say about access to cached files? Or should we at
> least try to make sure that we have the instances in cache no more permissive
> than originals on NFS?
Well, if the data is being copied to where userspace can get at it, and if MAC
is disabled, then it doesn't matter: root can always gain access if it wants
to, no matter what the DAC-related attributes are.
I'm quite happy to attach security data to the files in the cache, and use MAC
if it's available, but most of the time, it can't be the MAC-related attributes
of the process in whose context we're running.
David
On Thu, 2006-10-26 at 01:32 +0100, Al Viro wrote:
> to. What about access to cache tree by root process that has nothing
> to do with that daemon? Should it get free access to that stuff, regardless
> of what policy might say about access to cached files? Or should we at
> least try to make sure that we have the instances in cache no more permissive
> than originals on NFS?
This is already the case however. Root has ptrace, people have /proc
access (even more than before because the chroot check was broken
recently), root has CAP_SYS_RAWIO.
Jan Dittmer <[email protected]> wrote:
> > I can see a few ways to deal with this:
> >
> > (1) Do all the cache operations in their own thread (sort of like knfsd).
> >
> > (2) Add further security ops for the caching code to call. These might
> > be of use elsewhere in the kernel. These would set cache-specific
> > security labels and check for them.
> >
> > (3) Add a flag or something to current to override the normal security on
> > the basis that it should be using the cache's security rather than
> > the process's security.
>
> Why again no local userspace daemon to do the caching?
Because that reduces the problem to option (1), and then we add context
switches and the pushing of metadata in and out of userspace. In addition, the
cache calls back into the netfs to get information.
I'm guessing you're thinking of a halfway-house with userspace deciding which
files to open and then opening the file and telling the kernel to use that
file to back a particular netfs inode, with the kernel handling the actual
read/write ops. If you are, then this brings us back to the ENFILE problem,
and also raises EMFILE as a new possible problem. But this time, there isn't
a way to escape the ENFILE problem - you're opening files in userspace.
We could have userspace tell the kernel to use a file by name rather than
actually opening it and handing over the fd, I suppose; but you still have the
serialisation issue to deal with.
> That would put the policy out of the kernel. The additional context switches
> are probably pretty cheap compared to the io operations.
It's not just a pair of context switches you have to add in.
David
On Thu, 2006-10-26 at 11:14 +0200, Jan Dittmer wrote:
> Why again no local userspace daemon to do the caching? That would
> put the policy out of the kernel. The additional context switches
> are probably pretty cheap compared to the io operations.
It costs a lot in latency, and you don't need a daemon in the first
place. If you want to do a user space caching fs, of course, you can
already do it with FUSE.
On Thu, 2006-10-26 at 11:40 +0100, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
> > This requires a fully trusted cache daemon that can arbitrarily tamper with
> > any content cached in this manner.
>
> Well, we could say that the cache daemon isn't actually allowed to read or
> write any of these files; all it need to do is read their cache xattrs, names
> and atimes, and rename and delete them. It also needs to be able to do
> readdir and lookup on the directories contained therein.
Yes, SELinux policy could impose such a restriction, as well as ensuring
that only the cache daemon can access the cache in the first place (vs.
any arbitrary uid 0 process). However, can the cache daemon effectively
redirect an access by a process to a file that is being cached to
another file by renaming cache files within the cache, such that the
higher level permission checking and auditing would be applied to one
file but the actual access would occur to another one? If so, then it
can effectively downgrade any file it is caching to the security context
of any other file it is caching.
> > But per-cache security attributes would be a reasonable approach to enabling
> > finer-grained protection.
>
> I'm not sure what you mean by this exactly. I'd be happy to set the same
> security attributes on the files and directories in the cache that impose
> draconian restrictions, but whatever I do has to be representable in the
> backing filesystem across reboots.
What I mean is that the cachefiles configuration would allow one to
specify a set of security attributes (uid, gid, security context) per
cache, and those attributes would then be applied by the cachefiles
kernel module when creating files within that cache. This means that
all files within a given cache have the same security attributes, but
different caches can have different attributes. Then one can run
different cache daemon instances with different process security
attributes, and ensure that each one can only access the caches it is
responsible for managing. Thus, the cache daemon instance for external
data can't tamper with the cache of internal data, or vice versa. Now,
your default configuration may just use a single set of security
attributes for all caches, and you might have a default definition in
your configuration that is applied in the absence of any per-cache
value, but you would be providing a mechanism by which people who want
to isolate different caches can do so.
> > It would help to understand the objection to setting fsuid/fsgid more
> > clearly - it may have had more to do with always setting them to 0 for all
> > cache files, or with using that as a way to work around the lack of an
> > explicit mechanism for bypassing security checks for internal accesses than
> > with the setting of the fsuid/fsgid itself.
>
> Christoph did enscribe thus:
>
> | - in cachefiles_walk_to_object reseting the fsuid/fsgid is not allowed
>
> Which I take to mean that I'm not allowed to change fsuid and fsgid. Why not,
> I'm not entirely certain. I wouldn't have thought it would matter as long as
> I put them back again before returning.
I think he also talked about binding the right access control
credentials to the cache files, which I think is more along the lines of
the discussion above (and Viro's concern). I don't think it is so much
about setting fsuid/fsgid per se, but about being able to protect the
cache at finer granularity.
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> > Well, we could say that the cache daemon isn't actually allowed to read or
> > write any of these files; all it need to do is read their cache xattrs,
> > names and atimes, and rename and delete them. It also needs to be able to
> > do readdir and lookup on the directories contained therein.
>
> Yes, SELinux policy could impose such a restriction, as well as ensuring
> that only the cache daemon can access the cache in the first place (vs. any
> arbitrary uid 0 process).
Okay.
> However, can the cache daemon effectively redirect an access by a process to
> a file that is being cached to another file by renaming cache files within
> the cache, such that the higher level permission checking and auditing would
> be applied to one file but the actual access would occur to another one? If
> so, then it can effectively downgrade any file it is caching to the security
> context of any other file it is caching.
Whilst it can do a rename trick like that, there's then another check made by
the module once it has the dentry it wants. This involves checking the
netfs's idea of the net file state against the cache's idea of the net file
state (coherency management) which is stored in an xattr attached to the cache
file.
This could be extended to revalidate the key also in some manner, but
currently if the mtime, ctime or file size are different to that on the
server, the cache file will be scrapped.
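For illustration, the coherency check amounts to something like the sketch
below; the structure layout, xattr name and function names here are
assumptions made for the example, not the actual CacheFiles code:

	struct cachefiles_coherency_sketch {
		__u64	mtime;
		__u64	ctime;
		__u64	size;
	};

	static char cachefiles_xattr_name[] = "user.CacheFiles.cache";

	static int cachefiles_check_coherency(struct dentry *cachefile,
					      const struct cachefiles_coherency_sketch *netfs)
	{
		struct cachefiles_coherency_sketch ondisk;
		ssize_t n;

		n = vfs_getxattr(cachefile, cachefiles_xattr_name,
				 &ondisk, sizeof(ondisk));
		if (n != sizeof(ondisk))
			return -ESTALE;	/* missing or short xattr: treat as stale */

		if (ondisk.mtime != netfs->mtime ||
		    ondisk.ctime != netfs->ctime ||
		    ondisk.size != netfs->size)
			return -ESTALE;	/* server copy changed: scrap the cache file */

		return 0;
	}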
> > > But per-cache security attributes would be a reasonable approach to
> > > enabling finer-grained protection.
> >
> > I'm not sure what you mean by this exactly. I'd be happy to set the same
> > security attributes on the files and directories in the cache that impose
> > draconian restrictions, but whatever I do has to be representable in the
> > backing filesystem across reboots.
>
> What I mean is that the cachefiles configuration would allow one to
> specify a set of security attributes (uid, gid, security context) per
> cache, and those attributes would then be applied by the cachefiles
> kernel module when creating files within that cache.
Oh, I see what you mean. Yes, that's what I'm thinking of.
> This means that all files within a given cache have the same security
> attributes, but different caches can have different attributes.
That's fine too.
> Then one can run different cache daemon instances with different process
> security attributes, and ensure that each one can only access the caches it
> is responsible for managing.
Okay.
> Thus, the cache daemon instance for external data can't tamper with the
> cache of internal data, or vice versa. Now, your default configuration may
> just use a single set of security attributes for all caches, and you might
> have a default definition in your configuration that is applied in the
> absence of any per-cache value, but you would be providing a mechanism by
> which people who want to isolate different caches can do so.
Sounds okay. I'm not sure how I'd allow that to be configured. I suspect
it'd have to involve the cache module calling an LSM hook for each cache.
> > > It would help to understand the objection to setting fsuid/fsgid more
> > > clearly - it may have had more to do with always setting them to 0 for
> > > all cache files, or with using that as a way to work around the lack of
> > > an explicit mechanism for bypassing security checks for internal
> > > accesses than with the setting of the fsuid/fsgid itself.
> >
> > Christoph did enscribe thus:
> >
> > | - in cachefiles_walk_to_object reseting the fsuid/fsgid is not allowed
> >
> > Which I take to mean that I'm not allowed to change fsuid and fsgid. Why
> > not, I'm not entirely certain. I wouldn't have thought it would matter as
> > long as I put them back again before returning.
>
> I think he also talked about binding the right access control credentials to
> the cache files, which I think is more along the lines of the discussion
> above (and Viro's concern). I don't think it is so much about setting
> fsuid/fsgid per se, but about being able to protect the cache at finer
> granularity.
He did state both objections, yes; but he did specifically say something about
fsuid and fsgid. However, he can always be overridden.
David
On Thu, 2006-10-26 at 17:04 +0100, David Howells wrote:
> > Thus, the cache daemon instance for external data can't tamper with the
> > cache of internal data, or vice versa. Now, your default configuration may
> > just use a single set of security attributes for all caches, and you might
> > have a default definition in your configuration that is applied in the
> > absence of any per-cache value, but you would be providing a mechanism by
> > which people who want to isolate different caches can do so.
>
> Sounds okay. I'm not sure how I'd allow that to be configured. I suspect
> it'd have to involve the cache module calling an LSM hook for each cache.
I'd expect some userspace process to provide the proper context for each
cache to the cache module during normal configuration, and that context
would come from the same config file that defines the rest of the cache
info. There would need to be a permission check (via a security hook
call) at that point between the process' context and the provided
context to prevent a given process from setting the context arbitrarily.
Is the configuration of the cache module done by the cache daemon
itself? What access checks is it normally subjected to? Do you already
perform a check against the cache dir?
The cache module would then internally store the per-cache context
values, and prior to creating a file within the cache, it would set the
current fscreate attribute (via a security hook call) to that per-cache
value, in much the same way it is presently setting the fsuid/fsgid.
The fscreate attribute is normally set via the security_setprocattr hook
(when a task writes to /proc/self/attr/fscreate); however, you may want
a specialized hook for this purpose that distinguishes the fact that you
are doing this internally. Or possibly your mechanism for exempting the
cache module from permission checking would allow you to just use
security_setprocattr as is. You may also want to convert the context
value to a secid once when it is first configured, and then later just
pass the secid to the hook call for setting the fscreate value for
efficiency. We would need a security_secctx_to_secid() hook for that,
and then a variant of security_setprocattr() that takes a secid instead
of a context value. Some similar handling already exists for audit,
secmark, peersec, and netlink, although some of those are using selinux
specific interfaces presently (but they can be turned into LSM hooks
fairly easily).
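For reference, the normal userspace side of the fscreate mechanism is just a
write to /proc/self/attr/fscreate; a minimal illustration (error handling is
simplified and the context string would come from policy):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	/* set the file-creation context of the calling thread */
	static int set_fscreate_context(const char *ctx)
	{
		int fd, ret = -1;
		ssize_t len = strlen(ctx) + 1;

		fd = open("/proc/self/attr/fscreate", O_WRONLY);
		if (fd < 0)
			return -1;
		if (write(fd, ctx, len) == len)
			ret = 0;
		close(fd);
		return ret;
	}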
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> > Sounds okay. I'm not sure how I'd allow that to be configured. I suspect
> > it'd have to involve the cache module calling an LSM hook for each cache.
>
> I'd expect some userspace process to provide the proper context for each
> cache to the cache module during normal configuration, and that context would
> come from the same config file that defines the rest of the cache info.
This is read by the cachefilesd daemon and fed to the module prior to the
start of the caching.
> There would need to be a permission check (via a security hook call) at that
> point between the process' context and the provided context to prevent a
> given process from setting the context arbitrarily.
Okay.
> Is the configuration of the cache module done by the cache daemon itself?
Yes.
> What access checks is it normally subjected to? Do you already perform a
> check against the cache dir?
The module lets the VFS DAC and MAC stuff check that root has writable access
to the cache dir and will also do checks of the xattr if it's there already.
> The cache module would then internally store the per-cache context values,
Okay.
> and prior to creating a file within the cache, it would set the current
> fscreate attribute (via a security hook call) to that per-cache value, in
> much the same way it is presently setting the fsuid/fsgid.
That sounds reasonable. There has to be some way to revert it, though.
> The fscreate attribute is normally set via the security_setprocattr hook
> (when a task writes to /proc/self/attr/fscreate);
That makes it sound like fscreate is a per-process attribute, not a per-thread
attribute. The former would definitely be a problem.
> however, you may want a specialized hook for this purpose that distinguishes
> the fact that you are doing this internally. Or possibly your mechanism for
> exempting the cache module from permission checking would allow you to just
> use security_setprocattr as is.
Sounds feasible, I think; but I still need to revert the change I imposed.
> You may also want to convert the context value to a secid once when it is
> first configured, and then later just pass the secid to the hook call for
> setting the fscreate value for efficiency.
I'm not sure what you mean by that. Can you give a pseudo-code example?
> We would need a security_secctx_to_secid() hook for that, and then a variant
> of security_setprocattr() that takes a secid instead of a context value.
> Some similar handling already exists for audit, secmark, peersec, and
> netlink, although some of those are using selinux specific interfaces
> presently (but they can be turned into LSM hooks fairly easily).
You make it sound so simple:-)
David
On Thu, 2006-10-26 at 18:09 +0100, David Howells wrote:
> > and prior to creating a file within the cache, it would set the current
> > fscreate attribute (via a security hook call) to that per-cache value, in
> > much the same way it is presently setting the fsuid/fsgid.
>
> That sounds reasonable. There has to be some way to revert it, though.
Yes, that can be done. See below.
> > The fscreate attribute is normally set via the security_setprocattr hook
> > (when a task writes to /proc/self/attr/fscreate);
>
> That makes it sound like fscreate is a per-process attribute, not a per-thread
> attribute. The former would definitely be a problem.
It is actually per-task (thread), sorry for the confusing reference
to /proc/self.
> > You may also want to convert the context value to a secid once when it is
> > first configured, and then later just pass the secid to the hook call for
> > setting the fscreate value for efficiency.
>
> I'm not sure what you mean by that. Can you give a pseudo-code example?
When the daemon writes the context value (a string) to the cachefiles
module interface for a given cache, the cachefiles module would do
something like the following:
	/* Map the context to an integer. */
	rc = security_secctx_to_secid(value, size, &secid);
	if (rc)
		goto err;

	/* Check permission of current to set this context. */
	rc = security_cache_set_context(secid);
	if (rc)
		goto err;

	cache->secid = secid;
SELinux would then provide selinux_secctx_to_secid() and
selinux_cache_set_context() implementations; the former would just be a
call to selinux_string_to_sid(), while the latter would require some new
permission check to be defined unless we can treat this as equivalent to
some existing operation. You'll find that there is already a
security_secid_to_secctx() hook defined for LSM, so the first hook just
adds the other direction.
Later, when going to create a file in that cache, the cachefiles module
would do something like:
	/* Save and switch the fs secid for creation. */
	fssecid = security_getfssecid();
	security_setfssecid(cache->secid);

	<create file>

	/* Restore the original fs secid. */
	security_setfssecid(fssecid);

SELinux would then provide selinux_getfssecid() and selinux_setfssecid()
implementations that are just:

	u32 selinux_getfssecid(void)
	{
		struct task_security_struct *tsec = current->security;
		return tsec->create_sid;
	}

	void selinux_setfssecid(u32 secid)
	{
		struct task_security_struct *tsec = current->security;
		tsec->create_sid = secid;
	}
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> When the daemon writes the context value (a string) to the cachefiles
> module interface for a given cache, the cachefiles module would do
> something like the following:
This looks reasonable.
> SELinux would then provide selinux_secctx_to_secid() and
> selinux_cache_set_context() implementations; the former would just be call to
> selinux_string_to_sid(),
That sounds fairly easy.
> while the latter would require some new permission check to be defined
> unless we can treat this as equivalent to some existing operation.
So what does this actually check? I assume it checks that the process's
current context permits the use of the specified secid in this snippet:
/* Check permission of current to set this context. */
rc = security_cache_set_context(secid);
> You'll find that there is already a security_secid_to_secctx() hook defined
> for LSM, so the first hook just adds the other direction.
Okay.
> cache->secid = secid;
I was wondering if the cache struct should have a "void *security" that the LSM
modules can set, free and assert temporarily, but this sounds like it will do.
> Later, when going to create a file in that cache, the cachefiles module
> would do something like:
> /* Save and switch the fs secid for creation. */
> fssecid = security_getfssecid();
> security_setfssecid(cache->secid);
> <create file>
> /* Restore the original fs secid. */
> security_setfssecid(fssecid);
> SELinux would then provide selinux_getfsecid() and selinux_setfssecid()
> implementations that are just:
> u32 selinux_getfssecid(void)
> {
> struct task_security_struct *tsec = current->security;
> return tsec->create_sid;
> }
> void selinux_setfssecid(u32 secid)
> {
> struct task_security_struct *tsec = current->security;
> tsec->create_sid = secid;
> }
That sounds doable. I presume I should attend to fsuid/fsgid myself, much as
I'm doing now?
David
Stephen Smalley <[email protected]> wrote:
> When the daemon writes the context value (a string) to the cachefiles
> module interface for a given cache, the cachefiles module would do
> something like the following:
What would such a context value look like? I don't really know much about
configuring SELinux. Would it just be the name of a security label?
David
On Thu, 2006-10-26 at 23:53 +0100, David Howells wrote:
> > while the latter would require some new permission check to be defined
> > unless we can treat this as equivalent to some existing operation.
>
> So what does this actually check? I assume it checks that the process's
> current context permits the use of the specified secid in this snippet:
>
> /* Check permission of current to set this context. */
> rc = security_cache_set_context(secid);
That's right, it would perform a check between the task's sid and the
specified secid. We might want more information passed into the hook,
like the cache directory itself, so that we can apply checks based on
that as well (e.g. can the cache daemon apply this secid for use on
files created within this cache directory?). We would either need to
introduce new permission definitions to SELinux to distinguish this
operation, or we would need to map it to something similar, e.g. apply
the same checks that we would perform if the cache daemon was directly
trying to set its fscreate value to this context and create files in
that context. Possibly something like:
int security_cache_set_context(struct vfsmount *mnt, struct dentry *dentry, u32 secid)
{
	struct task_security_struct *tsec = current->security;
	struct superblock_security_struct *sbsec = dentry->d_inode->i_sb->s_security;
	int rc;

	/* Can current set its fscreate attribute? */
	rc = task_has_perm(current, current, PROCESS__SETFSCREATE);
	if (rc)
		return rc;

	/* Can current add entries to the cache dir? */
	rc = dentry_has_perm(current, mnt, dentry,
			     DIR__ADD_NAME|DIR__SEARCH);
	if (rc)
		return rc;

	/* Can current create files with this secid? */
	rc = avc_has_perm(tsec->sid, secid, SECCLASS_FILE,
			  FILE__CREATE, NULL);
	if (rc)
		return rc;

	/* Can files with this secid be placed in that filesystem? */
	return avc_has_perm(secid, sbsec->sid, SECCLASS_FILESYSTEM,
			    FILESYSTEM__ASSOCIATE, NULL);
}
> > Later, when going to create a file in that cache, the cachefiles module
> > would do something like:
> > /* Save and switch the fs secid for creation. */
> > fssecid = security_getfssecid();
> > security_setfssecid(cache->secid);
> > <create file>
> > /* Restore the original fs secid. */
> > security_setfssecid(fssecid);
> > SELinux would then provide selinux_getfsecid() and selinux_setfssecid()
> > implementations that are just:
> > u32 selinux_getfssecid(void)
> > {
> > struct task_security_struct *tsec = current->security;
> > return tsec->create_sid;
> > }
> > void selinux_setfssecid(u32 secid)
> > {
> > struct task_security_struct *tsec = current->security;
> > tsec->create_sid = secid;
> > }
>
> That sounds doable. I presume I should attend to fsuid/fsgid myself, much as
> I'm doing now?
Yes, I think so, although you may want to make those values configurable
too (although it isn't clear that a cache daemon can be run as non-root
at present).
I think we likely also want to refer to the above secid as fscreateid or
similar instead of just fssecid for clarity, because it isn't quite the
same thing as a fsuid/fsgid; it is _only_ used for labeling of new
files, not as the label of the process for permission checking purposes.
So it would be security_getfscreateid() and security_setfscreateid().
Process labels and file labels are distinct.
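Pulling the pieces together, the creation path in the cachefiles module might
then look roughly like this sketch; the cache struct fields are assumptions
and the security_*fscreateid() hooks are the proposed ones, not existing
interfaces:

	static void cachefiles_begin_create(struct cachefiles_cache *cache,
					    uid_t *fsuid, gid_t *fsgid, u32 *createid)
	{
		/* switch fs identity and file-creation label to the cache's own */
		*fsuid = current->fsuid;
		*fsgid = current->fsgid;
		current->fsuid = cache->cache_uid;
		current->fsgid = cache->cache_gid;
		*createid = security_getfscreateid();
		security_setfscreateid(cache->secid);
	}

	static void cachefiles_end_create(uid_t fsuid, gid_t fsgid, u32 createid)
	{
		/* restore the caller's fs identity and creation label */
		security_setfscreateid(createid);
		current->fsuid = fsuid;
		current->fsgid = fsgid;
	}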
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> We might want more information passed into the hook, like the cache
> directory itself,
I can do that. I have the cache directory path and the cache tag name both
available as strings.
> int security_cache_set_context(struct vfsmount *mnt, struct dentry *dentry, u32 secid)
> {
Where are you envisioning this going? In SELinux, in the LSM core or in
cachefiles? I was also wondering if I could generalise it to handle all cache
types, but the permissions checks are probably going to be quite different for
each type. For instance, CacheFiles uses files on a mounted fs, whilst CacheFS
uses a block device.
> We would either need to introduce new permission definitions to SELinux to
> distinguish this operation, or we would need to map it to something similar,
> e.g. apply the same checks that we would perform if the cache daemon was
> directly trying to set its fscreate value to this context and create files in
> that context.
Mapping it to something similar sounds reasonable, though I'd quite like
something general, so that I can check for any type of cache coming up.
Also, with your multiple cache example, how would I bring each cachefilesd
daemon up in a different context so that it could handle a different cache with
a different context?
> > That sounds doable. I presume I should attend to fsuid/fsgid myself, much
> > as I'm doing now?
>
> Yes, I think so, although you may want to make those values configurable
> too (although it isn't clear that a cache daemon can be run as non-root
> at present).
Actually, that's not something I had considered, but there's no particular
reason that files in the cache have to be owned by root specifically; nor is
there a reason why cachefilesd needs to run as root. The main thing is that
the files are owned by whatever user the daemon runs as. It'd mean creating
another special user for it, but we do that all the time...
> I think we likely also want to refer to the above secid as fscreateid or
> similar instead of just fsecid for clarity, because it isn't quite the
> same thing as a fsuid/fsgid; it is _only_ used for labeling of new
> files, not as the label of the process for permission checking purposes.
> So it would be security_getfscreateid() and security_setfscreateid().
> Process labels and file labels are distinct.
Gotcha.
David
On Thu, Oct 26, 2006 at 10:56:08AM +0100, David Howells wrote:
> Josef Sipek <[email protected]> wrote:
...
> > I'm wondering, why don't you just duplicate all the attributes of the files
> > (including xattrs)? That would take care of most if not all the DAC/MAC
> > issues, no?
>
> You're forgetting that the userspace cache manager daemon still has to access
> the cache.
Yeah, I did forget. Since there is a userspace daemon, it sounds like what
you're doing is right.
> > I'm thinking that it would be nice to combine the caching-related security
> > code with that for stackable filesystems.
>
> That's fine by me, though I want the security on a cache file to be different
> to that on the netfs file it's backing.
It sounds reasonable, and my guess is that the two (stackable fs and caching)
don't really overlap too much.
Josef "Jeff" Sipek.
--
UNIX is user-friendly ... it's just selective about who its friends are
On Fri, 2006-10-27 at 16:42 +0100, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
> > We might want more information passed into the hook, like the cache
> > directory itself,
>
> I can do that. I have the cache directory path and the cache tag name both
> available as strings.
>
> > int security_cache_set_context(struct vfsmount *mnt, struct dentry *dentry, u32 secid)
> > {
>
> Where are you envisioning this going? In SELinux, in the LSM core or in
> cachefiles?
Sorry, that should have been named selinux_cache_set_context(), a
SELinux-specific implementation of a security_cache_set_context() LSM
hook. The hook would be called after the cache directory pathname has
been looked up by your module, yielding a (mnt, dentry) pair and after
the security context string has been mapped to a secid.
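Roughly, the calling sequence in cachefiles would then be something like this
(a sketch only; cachefiles_context_to_secid is a made-up placeholder for
however the context string ends up being mapped to a secid):

static int cachefiles_check_cache_dir(const char *dirname, const char *context)
{
	struct nameidata nd;
	u32 secid;
	int ret;

	/* Look up the nominated cache directory, yielding (mnt, dentry). */
	ret = path_lookup(dirname, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
	if (ret)
		return ret;

	/* Map the configured security context string to a secid. */
	ret = cachefiles_context_to_secid(context, &secid);
	if (ret == 0)
		ret = security_cache_set_context(nd.mnt, nd.dentry, secid);

	path_release(&nd);
	return ret;
}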
> I was also wondering if I could generalise it to handle all cache
> types, but the permissions checks are probably going to be quite different for
> each type. For instance, CacheFiles uses files on a mounted fs, whilst CacheFS
> uses a block device.
So in the latter case, the daemon supplies the path of a block device
node? I suppose the hook could internally check the type of inode to
decide what checks to apply, using the checks I previously sketched when
it is a directory and using a different set of checks for the block
device (substituting a write check against the block device for the
directory-specific checks). The hook interface itself would look the
same IIUC, i.e. providing the (mnt, dentry) pair to which the path
resolved and the secid to which the context resolved.
> Also, with your multiple cache example, how would I bring each cachefilesd
> daemon up in a different context so that it could handle a different cache with
> a different context?
runcon will run a program in a specified context, so if you defined
cachefilesd_internal_t and cachefilesd_external_t domains in policy,
you could then do:
runcon -t cachefilesd_internal_t -- /path/to/cachefilesd args...
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> > I was also wondering if I could generalise it to handle all cache types,
> > but the permissions checks are probably going to be quite different for
> > each type. For instance, CacheFiles uses files on a mounted fs, whilst
> > CacheFS uses a block device.
>
> So in the latter case, the daemon supplies the path of a block device
> node?
No. In the latter case, there is no userspace daemon. As there are no
dentries, filenames and paths in CacheFS, keeping track of the cull table
consumes less space than it does for CacheFiles.
You start the cache by mounting it:
mount -t cachefs /dev/hdx9 /cachefs
Then it's online. However, you might want to check that whoever's calling
mount has permission to bring a cache online...
Actually, I think the permission to bring a cache online applies in all cases,
and is probably separate from checking that CacheFiles(d) is permitted to
mangle the filesystem it's using for a cache. With CacheFS, we could do the
equivalent and do a MAC check to make sure we're permitted to read and write
the blockdev, as you suggest in the next bit:
> I suppose the hook could internally check the type of inode to decide what
> checks to apply, using the checks I previously sketched when it is a
> directory and using a different set of checks for the block device
> (substituting a write check against the block device for the
> directory-specific checks). The hook interface itself would look the same
> IIUC, i.e. providing the (mnt, dentry) pair to which the path resolved and
> the secid to which the context resolved.
So, to summarise, is it worth having two checks (sketched roughly below):
(1) Permission to bring a cache online or to take a cache offline.
(2) Permission for the process bringing the cache online (cachefilesd or
mount) to access the backing store, be it a set of files and directories,
or be it a blockdev.
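As hook prototypes, those two might look something like this (names and
arguments purely illustrative; (2) is essentially the
security_cache_set_context() discussed above, with the dentry's inode type
telling the hook whether it is looking at a directory tree or a blockdev):

	/* (1) May the caller bring a cache online or take it offline? */
	int (*cache_on_off)(int online);

	/* (2) May the caller use this backing store as a cache, labelled
	 *     with the given secid? */
	int (*cache_set_context)(struct vfsmount *mnt, struct dentry *dentry,
				 u32 secid);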
David
On Fri, 2006-10-27 at 17:25 +0100, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
> > > I was also wondering if I could generalise it to handle all cache types,
> > > but the permissions checks are probably going to be quite different for
> > > each type. For instance, CacheFiles uses files on a mounted fs, whilst
> > > CacheFS uses a block device.
> >
> > So in the latter case, the daemon supplies the path of a block device
> > node?
>
> No. In the latter case, there is no userspace daemon. As there are no
> dentries, filenames and paths in CacheFS, keeping track of the cull table
> consumes less space than it does for CacheFiles.
>
> You start the cache by mounting it:
>
> mount -t cachefs /dev/hdx9 /cachefs
>
> Then it's online. However, you might want to check that whoever's calling
> mount has permission to bring a cache online...
Hmmm...that raises a separate issue - how does SELinux label cachefs
inodes? Does cachefs support xattrs? The other option is to use a mount
context (mount -o context=...) to apply a single context to all inodes
within it. Where exactly is the cachefs code available?
I'm also unclear on where you establish the binding between the files
being cached and the cache. What specifies that e.g. a given NFS mount
should be backed to a given cache? We need to be able to control that
relationship too, to establish that the cache is being protected
adequately for the source data.
> Actually, I think the permission to bring a cache online applies in all cases,
> and is probably separate from checking that CacheFiles(d) is permitted to
> mangle the filesystem it's using for a cache. With CacheFS, we could do the
> equivalent and do a MAC check to make sure we're permitted to read and write
> the blockdev, as you suggest in the next bit:
>
> > I suppose the hook could internally check the type of inode to decide what
> > checks to apply, using the checks I previously sketched when it is a
> > directory and using a different set of checks for the block device
> > (substituting a write check against the block device for the
> > directory-specific checks). The hook interface itself would look the same
> > IIUC, i.e. providing the (mnt, dentry) pair to which the path resolved and
> > the secid to which the context resolved.
>
> So, to summarise, is it worth having two checks:
>
> (1) Permission to bring a cache online or to take a cache offline.
At present, this will show up as the usual checking on mount (security
hooks in do_mount and vfs_kern_mount) and on umount (security hook in
do_umount) by SELinux. I'm not sure whether you need anything specific
to the cache.
> (2) Permission for the process bringing the cache online (cachefilesd or
> mount) to access the backing store, be it a set of files and directories,
> or be it a blockdev.
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> > You start the cache by mounting it:
> >
> > mount -t cachefs /dev/hdx9 /cachefs
> >
> > Then it's online. However, you might want to check that whoever's calling
> > mount has permission to bring a cache online...
>
> Hmmm...that raises a separate issue - how does SELinux label cachefs
> inodes? Does cachefs support xattrs? Other option is to use a mount
> context (mount -o context=...) to apply a single context to all inodes
> within it.
There are only two inodes publicly available through cachefs - the root dir
and a file of statistics - and they're both read only. Everything else, for
the moment, is strictly internal and unavailable to userspace. In fact, there
may not be any other inodes.
> Where exactly is the cachefs code available?
It'll need a little bit of work to make it compilable again, but I can probably
get you a copy on Monday. I've been concentrating on CacheFiles.
> I'm also unclear on where you establish the binding between the files
> being cached and the cache. What specifies that e.g. a given NFS mount
> should be backed to a given cache? We need to be able to control that
> relationship too, to establish that the cache is being protected
> adequately for the source data.
Yes. I haven't worked that one out yet. Currently it's not a problem as only
CacheFiles is available, and that only supports one cache.
But I have an idea that I need to work on to make it possible to associate
directories and files with caches (or other useful parameters).
> > (1) Permission to bring a cache online or to take a cache offline.
>
> At present, this will show up as the usual checking on mount (security
> hooks in do_mount and vfs_kern_mount) and on umount (security hook in
> do_umount) by SELinux. I'm not sure whether you need anything specific
> to the cache.
That doesn't apply to CacheFiles, but okay; we can always add it later if we
decide we want it.
David
David Howells <[email protected]> wrote:
> Some issues have been raised by Christoph Hellwig over how I'm handling
> filesystem security in my CacheFiles module, and I'd like advice on how to
> deal with them.
Having discussed this with Stephen Smalley and Karl MacMillan, this is, I think,
the security model for CacheFiles:
(*) There will be four security labels per cache:
(a) A security label attached to the caching directory and all the files
and directories contained therein. This identifies those files as
being part of a particular cache's working set.
(b) A security label that defines the context under which the daemon
(cachefilesd) operates. This permits cachefilesd to be restricted to
only accessing files labelled in (a), and only to do things like stat,
list and delete them - not read or write them.
(c) A security label that defines the context under which the module
operates when accessing the cache. This allows the module, when
accessing the cache, to only operate within the bounds of the cache.
It also permits the module to set a common security label on all the
files it creates in the cache.
(d) A security label to be attached to the cachefiles control character
device. This limits access to processes with label (b).
(*) The module will obtain label (a) - the security label with which to label
the files it creates (create_sid) - by reading the security label on the
cache base directory (inode->i_security->sid).
(*) The module will obtain label (c) by reading label (b) from the cachefilesd
process when it opens the cachefiles control chardev and then passing it
through security_change_sid() to ask the security policy for label (c).
(*) When accessing the cache to look up a cache object (equivalent to NFS read
inode), the CacheFiles module will make temporary substitutions for the
following process security attributes:
(1) current->fsuid and current->fsgid will both become 0.
(2) current->security->create_sid will be set to label (a) so that
vfs_mkdir() and vfs_create() will set the correct labels.
(3) current->security->sid will be set to label (c) so that vfs_mkdir(),
vfs_create() and lookup ops will check for the correct labels.
After the access, the old label will be restored.
Point (3) shouldn't cause a cross-thread race as it would appear that the
security label can only be changed on single-threaded processes. Attempts
to do so on multi-threaded processes are rejected.
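As a sketch of how the substitutions in (1)-(3) might be wrapped up (the
helper names, saved-context struct and cache fields are made up, and this
glosses over how the module would reach into the SELinux-private
task_security_struct):

struct cachefiles_saved_ctx {
	uid_t	fsuid;
	gid_t	fsgid;
	u32	create_sid;
	u32	sid;
};

static void cachefiles_begin_secure(struct cachefiles_cache *cache,
				    struct cachefiles_saved_ctx *saved)
{
	struct task_security_struct *tsec = current->security;

	saved->fsuid = current->fsuid;
	saved->fsgid = current->fsgid;
	saved->create_sid = tsec->create_sid;
	saved->sid = tsec->sid;

	current->fsuid = 0;				/* (1) */
	current->fsgid = 0;
	tsec->create_sid = cache->file_secid;		/* (2) label (a) */
	tsec->sid = cache->access_secid;		/* (3) label (c) */
}

static void cachefiles_end_secure(struct cachefiles_saved_ctx *saved)
{
	struct task_security_struct *tsec = current->security;

	current->fsuid = saved->fsuid;
	current->fsgid = saved->fsgid;
	tsec->create_sid = saved->create_sid;
	tsec->sid = saved->sid;
}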
David
On Tue, 2006-10-31 at 21:26 +0000, David Howells wrote:
> David Howells <[email protected]> wrote:
>
> > Some issues have been raised by Christoph Hellwig over how I'm handling
> > filesystem security in my CacheFiles module, and I'd like advice on how to
> > deal with them.
>
> Having discussed this with Stephen Smalley and Karl MacMillan, this is, I think,
> the security model for CacheFiles:
>
> (*) There will be four security labels per cache:
>
> (a) A security label attached to the caching directory and all the files
> and directories contained therein. This identifies those files as
> being part of a particular cache's working set.
>
> (b) A security label that defines the context under which the daemon
> (cachefilesd) operates. This permits cachefilesd to be restricted to
> only accessing files labelled in (a), and only to do things like stat,
> list and delete them - not read or write them.
>
> (c) A security label that defines the context under which the module
> operates when accessing the cache. This allows the module, when
> accessing the cache, to only operate within the bounds of the cache.
Well, only if the module is well-behaved in the first place, since a
kernel module can naturally bypass SELinux at will. What drives this
approach vs. exempting the module from SELinux checking via a task flag
that it raises and lowers around the access (vs. setting and resetting
the sid around the access to the per-cache module context)?
> It also permits the module to set a common security label on all the
> files it creates in the cache.
>
> (d) A security label to be attached to the cachefiles control character
> device. This limits access to processes with label (b).
>
> (*) The module will obtain label (a) - the security label with which to label
> the files it creates (create_sid) - by reading the security label on the
> cache base directory (inode->i_security->sid).
>
> (*) The module will obtain label (c) by reading label (b) from the cachefilesd
> process when it opens the cachefiles control chardev and then passing it
> through security_change_sid() to ask the security policy for label (c).
Do you mean security_transition_sid()? security_change_sid() doesn't
seem suited to that purpose (relabeling vs. transition computations).
What would you use as the target SID and class? What happens if there
is no transition/change rule in the policy? It seems like the default
behavior would leave it operating in the context of the daemon itself,
in which case you might end up denying access to the module upon
creation?
> (*) When accessing the cache to look up a cache object (equivalent to NFS read
> inode), the CacheFiles module will make temporary substitutions for the
> following process security attributes:
>
> (1) current->fsuid and current->fsgid will both become 0.
>
> (2) current->security->create_sid will be set to label (a) so that
> vfs_mkdir() and vfs_create() will set the correct labels.
>
> (3) current->security->sid will be set to label (c) so that vfs_mkdir(),
> vfs_create() and lookup ops will check for the correct labels.
I think you would want this to be a new ->fssid field instead, and
adjust SELinux to use it if set for permission checking (which could
also be leveraged by NFS later). Or just use a task flag to disable
checking on the module internal accesses.
> After the access, the old label will be restored.
>
> Point (3) shouldn't cause a cross-thread race as it would appear that the
> security label can only be changed on single-threaded processes. Attempts
> to do so on multi-threaded processes are rejected.
I don't quite follow this. The cache module needs to be able to perform
cache access on behalf of any process, including multi-threaded ones,
and the security struct is per-task just like fsuid/fsgid. But mutating
->sid could yield unfortunate behavior if e.g. another process happens
to be sending that task a signal at the same time, so if you go this
route, you want a ->fssid.
--
Stephen Smalley
National Security Agency
Stephen Smalley <[email protected]> wrote:
> > (c) A security label that defines the context under which the module
> > operates when accessing the cache. This allows the module, when
> > accessing the cache, to only operate within the bounds of the
> > cache.
>
> Well, only if the module is well-behaved in the first place, since a
> kernel module can naturally bypass SELinux at will. What drives this
> approach vs. exempting the module from SELinux checking via a task flag
> that it raises and lowers around the access (vs. setting and resetting
> the sid around the access to the per-cache module context)?
Christoph objected very strongly to my bypassing of vfs_mkdir() and co, and Al
wasn't too happy about it either. This should allow me, for example, to call
vfs_mkdir() rather than calling the inode op directly; the reason I wasn't doing
so was that I had to avoid the security checks it made.
Stephen Smalley <[email protected]> wrote:
> > (*) The module will obtain label (c) by reading label (b) from the
> > cachefilesd process when it opens the cachefiles control chardev and
> > then passing it through security_change_sid() to ask the security
> > policy for label (c).
>
> Do you mean security_transition_sid()? security_change_sid() doesn't
> seem suited to that purpose
That's what Karl said to use.
> What would you use as the target SID and class?
I've no idea. I tried to find out how to use this function from Karl, but he
said I should ask on the list.
> > (3) current->security->sid will be set to label (c) so that
> > vfs_mkdir(), vfs_create() and lookup ops will check for the
> > correct labels.
>
> I think you would want this to be a new ->fssid field instead, and
> adjust SELinux to use it if set for permission checking (which could
> also be leveraged by NFS later).
I could do that. Does it actually gain anything? Or are there good reasons
for not altering current->security->sid? For instance, does that affect the
label seen on /proc/pid/ files?
> Or just use a task flag to disable checking on the module internal accesses.
I could do that too.
> > Point (3) shouldn't cause a cross-thread race as it would appear that
> > the security label can only be changed on single-threaded processes.
> > Attempts to do so on multi-threaded processes are rejected.
>
> I don't quite follow this.
Sorry, I meant that a process can only change its own security label if it's a
single-threaded process. A kernel module can, of course, change the security
label at any time.
> But mutating ->sid could yield unfortunate behavior if e.g. another process
> happens to be sending that task a signal at the same time, so if you go this
> route, you want a ->fssid.
Okay... that seems like a good reason to use the ->fssid approach. How do I
tell if ->fssid is set? Is zero usable as 'unset'? Alternatively, would it be
reasonable to have ->fssid track ->sid when the latter changes?
David
On Wed, 2006-11-01 at 15:34 +0000, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
> > > (c) A security label that defines the context under which the module
> > > operates when accessing the cache. This allows the module, when
> > > accessing the cache, to only operate within the bounds of the
> > > cache.
> >
> > Well, only if the module is well-behaved in the first place, since a
> > kernel module can naturally bypass SELinux at will. What drives this
> > approach vs. exempting the module from SELinux checking via a task flag
> > that it raises and lowers around the access (vs. setting and resetting
> > the sid around the access to the per-cache module context)?
>
> Christoph objected very strongly to my bypassing of vfs_mkdir() and co, and Al
> wasn't too happy about it either. This should allow me, for example, to call
> vfs_mkdir() rather than calling the inode op directly; the reason I wasn't doing
> so was that I had to avoid the security checks it made.
>
> Stephen Smalley <[email protected]> wrote:
>
> > > (*) The module will obtain label (c) by reading label (b) from the
> > > cachefilesd process when it opens the cachefiles control chardev and
> > > then passing it through security_change_sid() to ask the security
> > policy for label (c).
> >
> > Do you mean security_transition_sid()? security_change_sid() doesn't
> > seem suited to that purpose
>
> That's what Karl said to use.
>
I suggested change instead of transition because, like most uses of
change, this was a manual relabel rather than automatic. Transition
probably makes more sense, though.
> > What would you use as the target SID and class?
>
> I've no idea. I tried to find out how to use this function from Karl, but he
> said I should ask on the list.
>
This is all predicated on the notion that there is a need to have the
normal SELinux checks performed. Since this serves only as a sanity
check and doesn't add any real security, the best option seems to be bypass,
but I guess that isn't an option.
Additionally, whether the security context comes from a config file or
the policy there is still a chance that we will end up with a security
context without sufficient access. Debugging this is going to be
mysterious for the policy author as it will require detailed knowledge
of the kernel mechanism.
I guess the target SID could be the kernel initial sid and the class
process - though that is basically arbitrary. This is an abuse of the
interface to avoid a config file. Any other options?
> > > (3) current->security->sid will be set to label (c) so that
> > > vfs_mkdir(), vfs_create() and lookup ops will check for the
> > > correct labels.
> >
> > I think you would want this to be a new ->fssid field instead, and
> > adjust SELinux to use it if set for permission checking (which could
> > also be leveraged by NFS later).
>
> I could do that. Does it actually gain anything? Or are there good reasons
> for not altering current->security->sid? For instance, does that affect the
> label seen on /proc/pid/ files?
>
> > Or just use a task flag to disable checking on the module internal accesses.
>
> I could do that too.
>
> > > Point (3) shouldn't cause a cross-thread race as it would appear that
> > > the security label can only be changed on single-threaded processes.
> > > Attempts to do so on multi-threaded processes are rejected.
> >
> > I don't quite follow this.
>
> Sorry, I meant that a process can only change its own security label if it's a
> single-threaded process. A kernel module can, of course, change the security
> label at any time.
>
> > But mutating ->sid could yield unfortunate behavior if e.g. another process
> > happens to be sending that task a signal at the same time, so if you go this
> > route, you want a ->fssid.
>
> Okay... that seems like a good reason to use the ->fssid approach. How do I
> tell if ->fssid is set? Is zero usable as 'unset'? Alternatively, would it be
> reasonable to have ->fssid track ->sid when the latter changes?
>
fssid seems like the wrong name, though it does match the DAC concept.
This is really more general impersonation of another domain by the
kernel and might have other uses.
Karl
> David
On Wed, 2006-11-01 at 15:34 +0000, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
> > Well, only if the module is well-behaved in the first place, since a
> > kernel module can naturally bypass SELinux at will. What drives this
> > approach vs. exempting the module from SELinux checking via a task flag
> > that it raises and lowers around the access (vs. setting and resetting
> > the sid around the access to the per-cache module context)?
>
> Christoph objected very strongly to my bypassing of vfs_mkdir() and co, and Al
> wasn't too happy about it either. This should allow me, for example, to call
> vfs_mkdir() rather than calling the inode op directly; the reason I wasn't doing
> so was that I had to avoid the security checks it made.
I wasn't suggesting bypassing the vfs helpers; I was only asking what
motivated the approach of substituting a different SID for the
permission checks vs. using a task flag to disable the permission
checking entirely when the module performs an internal access. The task
flag approach is simpler and avoids the need to set up a label and
policy for the module's internal accesses, which should always succeed
if the module is operating correctly. The only reason to apply a check
on the module's internal accesses is if you want policy to help
safeguard the module against unintentional access attempts to e.g. the
wrong cache or a file outside of the cache. And even that is only a
discretionary safeguard wrt the module - it still relies on the module
to not bypass the security hooks and to set the SID correctly for the
cache it wants to access.
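For illustration, the task flag approach would amount to something like this
(the flag name and value are made up and would need an unused PF_ bit), with
the permission checks returning 0 early whenever the flag is set on current:

#define PF_CACHE_ACCESS	0x40000000	/* hypothetical: internal cache access */

static inline int cachefiles_begin_internal_access(void)
{
	int old = current->flags & PF_CACHE_ACCESS;

	current->flags |= PF_CACHE_ACCESS;
	return old;
}

static inline void cachefiles_end_internal_access(int old)
{
	if (!old)
		current->flags &= ~PF_CACHE_ACCESS;
}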
> Stephen Smalley <[email protected]> wrote:
> > Do you mean security_transition_sid()? security_change_sid() doesn't
> > seem suited to that purpose
>
> That's what Karl said to use.
I think security_transition_sid() is more appropriate if you use this
approach.
> > What would you use as the target SID and class?
>
> I've no idea. I tried to find out how to use this function from Karl, but he
> said I should ask on the list.
In normal practice, one passes the task SID as the first argument, the
SID of a related object as the second argument, and the class of object
for which you are computing a SID as the third argument. For program
execution, this takes the form of the current task SID, the executable
file SID, and the process class, and computes the new task SID of the
transformed process.
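Schematically, the exec case looks like this (compute_exec_sid is just an
illustrative wrapper, with the variable names abbreviated):

static int compute_exec_sid(struct task_security_struct *tsec,
			    struct inode_security_struct *isec, u32 *newsid)
{
	return security_transition_sid(tsec->sid,	/* current task SID */
				       isec->sid,	/* executable file SID */
				       SECCLASS_PROCESS, /* class being computed */
				       newsid);
}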
> > I think you would want this to be a new ->fssid field instead, and
> > adjust SELinux to use it if set for permission checking (which could
> > also be leveraged by NFS later).
>
> I could do that. Does it actually gain anything? Or are there good reasons
> for not altering current->security->sid? For instance, does that affect the
> label seen on /proc/pid/ files?
Yes, upon the next d_revalidate, and it also would affect signal
permission checking and any other permission check against that task.
> > Or just use a task flag to disable checking on the module internal accesses.
>
> I could do that too.
Unless there is some benefit to setting a ->fssid and checking against
it (e.g. safeguarding the module against unintentional internal access),
I think the task flag approach is preferable.
> > But mutating ->sid could yield unfortunate behavior if e.g. another process
> > happens to be sending that task a signal at the same time, so if you go this
> > route, you want a ->fssid.
>
> Okay... that seems like a good reason to use the ->fssid approach. How do I
> tell if ->fssid is set? Is zero usable as 'unset'? Alternatively, would it be
> reasonable to have ->fssid track ->sid when the latter changes?
Zero aka SECSID_NULL is unset. But a task flag may be sufficient here.
--
Stephen Smalley
National Security Agency
On Wed, 2006-11-01 at 10:58 -0500, Karl MacMillan wrote:
> I suggested change instead of transition because, like most uses of
> change, this was a manual relabel rather than automatic. Transition
> probably makes more sense, though.
Yes, if we need it at all.
> This is all predicated on the notion that there is a need to have the
> normal SELinux checks performed. Since this serves only as a sanity
> check and doesn't add any real security, the best option seems to be bypass,
> but I guess that isn't an option.
"Bypass" in the sense of directly calling the inode ops rather than the
vfs helpers is undesirable. "Bypass" in the sense of temporarily
setting a task flag indicating that permission checking should be
disabled for an internal access attempt seems ok.
> fssid seems like the wrong name, though it does match the DAC concept.
> This is really more general impersonation of another domain by the
> kernel and might have other uses.
NFS will want a fssid in order to have file access checks applied
against the client process' SID if/when the client process' context
becomes available. But it isn't really necessary here as far as I can
see; the cachefiles module is not trying to act on behalf of a task, but
instead is performing an internal access to local cache that should
always succeed, and the usual permission checking for userspace is
handled by the fs before cachefiles is called.
--
Stephen Smalley
National Security Agency
On Wed, 2006-11-01 at 12:45 -0500, Stephen Smalley wrote:
> On Wed, 2006-11-01 at 10:58 -0500, Karl MacMillan wrote:
<snip>
> > fssid seems like the wrong name, though it does match the DAC concept.
> > This is really more general impersonation of another domain by the
> > kernel and might have other uses.
>
> NFS will want a fssid in order to have file access checks applied
> against the client process' SID if/when the client process' context
> becomes available.
I was suggesting that it might be helpful if this applied to all checks
- not just file access - and would therefore need a different name.
Karl
Stephen Smalley <[email protected]> wrote:
> Unless there is some benefit to setting a ->fssid and checking against
> it (e.g. safeguarding the module against unintentional internal access),
> I think the task flag approach is preferable.
Well, I think the use of ->fssid is simpler and faster from an implementation
standpoint (see the attached patch), and it can always be used later for things
such as the in-kernel nfsd.
The way I've done it in this patch is to have ->fssid shadow ->sid as long as
they're the same. But ->fssid can be set to something else and then later
reset, at which point it becomes the same as ->sid again. A hook is provided
to perform both of these operations:
security_set_fssid(overriding_SID); //set
...
security_set_fssid(SECSID_NULL); //reset
The rest of the patch is that more or less anywhere ->sid is used to represent
a process as an actor, this is replaced with ->fssid. This part requires no
conditional jumps. It becomes a bit tricky around execve() time, but I think
it's reasonable to ignore that as execve() is unlikely to happen in an
overridden context; or maybe the execve() related ops should be failed if
->sid != ->fssid.
Do you think this is reasonable? Or do you definitely want me to use the
suppression flag approach instead?
The flag approach is a bit more work to implement and will slow most ops down
by virtue of including an extra check, though not all ops can be bypassed (for
instance the op that assigns a security label to an inode can't be bypassed).
There are a couple of things I'm not sure about with the ->fssid approach:
(1) What will auditing do? Is it possible to have an SID that isn't audited?
(2) How do I come up with a security label for the module to use?
David
---
CacheFiles: Add an FS-access SID override in task_security_struct
From: David Howells <[email protected]>
Add a SID override to task_security_struct that is equivalent to fsuid/fsgid in
task_struct. This permits a task to make accesses to filesystem objects as if
it were the overriding SID, without changing its own SID, which might be needed
to control ptrace, signals, /proc access, etc.
This is useful for CacheFiles in that it allows CacheFiles to access the cache
files and directories using the cache's security context rather than the
security context of the process on whose behalf it is working, and in the
context of which it is running.
Signed-off-by: David Howells <[email protected]>
---
include/linux/security.h | 16 ++++
security/dummy.c | 6 ++
security/selinux/hooks.c | 148 ++++++++++++++++++++++---------------
security/selinux/include/objsec.h | 1
4 files changed, 109 insertions(+), 62 deletions(-)
diff --git a/include/linux/security.h b/include/linux/security.h
index ff20ec4..48e9e43 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1166,6 +1166,11 @@ #ifdef CONFIG_SECURITY
* Set the current FS security ID.
* @secid contains the security ID to set.
*
+ * @set_fssid:
+ * Set or reset the access override security ID, returning the old
+ * override security ID.
+ * @secid contains the security ID to set or SECSID_NULL to reset.
+ *
* This is the main security structure.
*/
struct security_operations {
@@ -1351,6 +1356,7 @@ struct security_operations {
void (*release_secctx)(char *secdata, u32 seclen);
u32 (*get_fscreate_secid)(void);
u32 (*set_fscreate_secid)(u32 secid);
+ u32 (*set_fssid)(u32 secid);
#ifdef CONFIG_SECURITY_NETWORK
int (*unix_stream_connect) (struct socket * sock,
@@ -2161,6 +2167,11 @@ static inline u32 security_set_fscreate_
return security_ops->set_fscreate_secid(secid);
}
+static inline u32 security_set_fssid(u32 secid)
+{
+ return security_ops->set_fssid(secid);
+}
+
/* prototypes */
extern int security_init (void);
extern int register_security (struct security_operations *ops);
@@ -2851,6 +2862,11 @@ static inline u32 security_set_fscreate_
return 0;
}
+static inline u32 security_set_fssid(u32 secid)
+{
+ return 0;
+}
+
#endif /* CONFIG_SECURITY */
#ifdef CONFIG_SECURITY_NETWORK
diff --git a/security/dummy.c b/security/dummy.c
index 7784bc4..9fcadbb 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -942,6 +942,11 @@ static u32 dummy_set_fscreate_secid(u32
return 0;
}
+static u32 dummy_set_fssid(u32 secid)
+{
+ return 0;
+}
+
#ifdef CONFIG_KEYS
static inline int dummy_key_alloc(struct key *key, struct task_struct *ctx,
unsigned long flags)
@@ -1099,6 +1104,7 @@ void security_fixup_ops (struct security
set_to_dummy_if_null(ops, release_secctx);
set_to_dummy_if_null(ops, get_fscreate_secid);
set_to_dummy_if_null(ops, set_fscreate_secid);
+ set_to_dummy_if_null(ops, set_fssid);
#ifdef CONFIG_SECURITY_NETWORK
set_to_dummy_if_null(ops, unix_stream_connect);
set_to_dummy_if_null(ops, unix_may_send);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index b315559..c048a98 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -162,7 +162,8 @@ static int task_alloc_security(struct ta
return -ENOMEM;
tsec->task = task;
- tsec->osid = tsec->sid = tsec->ptrace_sid = SECINITSID_UNLABELED;
+ tsec->osid = tsec->fssid = tsec->sid = tsec->ptrace_sid =
+ SECINITSID_UNLABELED;
task->security = tsec;
return 0;
@@ -190,7 +191,7 @@ static int inode_alloc_security(struct i
isec->inode = inode;
isec->sid = SECINITSID_UNLABELED;
isec->sclass = SECCLASS_FILE;
- isec->task_sid = tsec->sid;
+ isec->task_sid = tsec->fssid;
inode->i_security = isec;
return 0;
@@ -220,8 +221,8 @@ static int file_alloc_security(struct fi
return -ENOMEM;
fsec->file = file;
- fsec->sid = tsec->sid;
- fsec->fown_sid = tsec->sid;
+ fsec->sid = tsec->fssid;
+ fsec->fown_sid = tsec->fssid;
file->f_security = fsec;
return 0;
@@ -338,12 +339,12 @@ static int may_context_mount_sb_relabel(
{
int rc;
- rc = avc_has_perm(tsec->sid, sbsec->sid, SECCLASS_FILESYSTEM,
+ rc = avc_has_perm(tsec->fssid, sbsec->sid, SECCLASS_FILESYSTEM,
FILESYSTEM__RELABELFROM, NULL);
if (rc)
return rc;
- rc = avc_has_perm(tsec->sid, sid, SECCLASS_FILESYSTEM,
+ rc = avc_has_perm(tsec->fssid, sid, SECCLASS_FILESYSTEM,
FILESYSTEM__RELABELTO, NULL);
return rc;
}
@@ -353,7 +354,7 @@ static int may_context_mount_inode_relab
struct task_security_struct *tsec)
{
int rc;
- rc = avc_has_perm(tsec->sid, sbsec->sid, SECCLASS_FILESYSTEM,
+ rc = avc_has_perm(tsec->fssid, sbsec->sid, SECCLASS_FILESYSTEM,
FILESYSTEM__RELABELFROM, NULL);
if (rc)
return rc;
@@ -1030,7 +1031,7 @@ static int task_has_perm(struct task_str
tsec1 = tsk1->security;
tsec2 = tsk2->security;
- return avc_has_perm(tsec1->sid, tsec2->sid,
+ return avc_has_perm(tsec1->fssid, tsec2->sid,
SECCLASS_PROCESS, perms, NULL);
}
@@ -1047,7 +1048,7 @@ static int task_has_capability(struct ta
ad.tsk = tsk;
ad.u.cap = cap;
- return avc_has_perm(tsec->sid, tsec->sid,
+ return avc_has_perm(tsec->fssid, tsec->fssid,
SECCLASS_CAPABILITY, CAP_TO_MASK(cap), &ad);
}
@@ -1059,7 +1060,7 @@ static int task_has_system(struct task_s
tsec = tsk->security;
- return avc_has_perm(tsec->sid, SECINITSID_KERNEL,
+ return avc_has_perm(tsec->fssid, SECINITSID_KERNEL,
SECCLASS_SYSTEM, perms, NULL);
}
@@ -1084,7 +1085,7 @@ static int inode_has_perm(struct task_st
ad.u.fs.inode = inode;
}
- return avc_has_perm(tsec->sid, isec->sid, isec->sclass, perms, adp);
+ return avc_has_perm(tsec->fssid, isec->sid, isec->sclass, perms, adp);
}
/* Same as inode_has_perm, but pass explicit audit data containing
@@ -1127,8 +1128,8 @@ static int file_has_perm(struct task_str
ad.u.fs.mnt = mnt;
ad.u.fs.dentry = dentry;
- if (tsec->sid != fsec->sid) {
- rc = avc_has_perm(tsec->sid, fsec->sid,
+ if (tsec->fssid != fsec->sid) {
+ rc = avc_has_perm(tsec->fssid, fsec->sid,
SECCLASS_FD,
FD__USE,
&ad);
@@ -1162,7 +1163,7 @@ static int may_create(struct inode *dir,
AVC_AUDIT_DATA_INIT(&ad, FS);
ad.u.fs.dentry = dentry;
- rc = avc_has_perm(tsec->sid, dsec->sid, SECCLASS_DIR,
+ rc = avc_has_perm(tsec->fssid, dsec->sid, SECCLASS_DIR,
DIR__ADD_NAME | DIR__SEARCH,
&ad);
if (rc)
@@ -1171,13 +1172,13 @@ static int may_create(struct inode *dir,
if (tsec->create_sid && sbsec->behavior != SECURITY_FS_USE_MNTPOINT) {
newsid = tsec->create_sid;
} else {
- rc = security_transition_sid(tsec->sid, dsec->sid, tclass,
+ rc = security_transition_sid(tsec->fssid, dsec->sid, tclass,
&newsid);
if (rc)
return rc;
}
- rc = avc_has_perm(tsec->sid, newsid, tclass, FILE__CREATE, &ad);
+ rc = avc_has_perm(tsec->fssid, newsid, tclass, FILE__CREATE, &ad);
if (rc)
return rc;
@@ -1194,7 +1195,8 @@ static int may_create_key(u32 ksid,
tsec = ctx->security;
- return avc_has_perm(tsec->sid, ksid, SECCLASS_KEY, KEY__CREATE, NULL);
+ return avc_has_perm(tsec->fssid, ksid, SECCLASS_KEY, KEY__CREATE,
+ NULL);
}
#define MAY_LINK 0
@@ -1222,7 +1224,7 @@ static int may_link(struct inode *dir,
av = DIR__SEARCH;
av |= (kind ? DIR__REMOVE_NAME : DIR__ADD_NAME);
- rc = avc_has_perm(tsec->sid, dsec->sid, SECCLASS_DIR, av, &ad);
+ rc = avc_has_perm(tsec->fssid, dsec->sid, SECCLASS_DIR, av, &ad);
if (rc)
return rc;
@@ -1241,7 +1243,7 @@ static int may_link(struct inode *dir,
return 0;
}
- rc = avc_has_perm(tsec->sid, isec->sid, isec->sclass, av, &ad);
+ rc = avc_has_perm(tsec->fssid, isec->sid, isec->sclass, av, &ad);
return rc;
}
@@ -1266,16 +1268,16 @@ static inline int may_rename(struct inod
AVC_AUDIT_DATA_INIT(&ad, FS);
ad.u.fs.dentry = old_dentry;
- rc = avc_has_perm(tsec->sid, old_dsec->sid, SECCLASS_DIR,
+ rc = avc_has_perm(tsec->fssid, old_dsec->sid, SECCLASS_DIR,
DIR__REMOVE_NAME | DIR__SEARCH, &ad);
if (rc)
return rc;
- rc = avc_has_perm(tsec->sid, old_isec->sid,
+ rc = avc_has_perm(tsec->fssid, old_isec->sid,
old_isec->sclass, FILE__RENAME, &ad);
if (rc)
return rc;
if (old_is_dir && new_dir != old_dir) {
- rc = avc_has_perm(tsec->sid, old_isec->sid,
+ rc = avc_has_perm(tsec->fssid, old_isec->sid,
old_isec->sclass, DIR__REPARENT, &ad);
if (rc)
return rc;
@@ -1285,13 +1287,13 @@ static inline int may_rename(struct inod
av = DIR__ADD_NAME | DIR__SEARCH;
if (new_dentry->d_inode)
av |= DIR__REMOVE_NAME;
- rc = avc_has_perm(tsec->sid, new_dsec->sid, SECCLASS_DIR, av, &ad);
+ rc = avc_has_perm(tsec->fssid, new_dsec->sid, SECCLASS_DIR, av, &ad);
if (rc)
return rc;
if (new_dentry->d_inode) {
new_isec = new_dentry->d_inode->i_security;
new_is_dir = S_ISDIR(new_dentry->d_inode->i_mode);
- rc = avc_has_perm(tsec->sid, new_isec->sid,
+ rc = avc_has_perm(tsec->fssid, new_isec->sid,
new_isec->sclass,
(new_is_dir ? DIR__RMDIR : FILE__UNLINK), &ad);
if (rc)
@@ -1312,7 +1314,7 @@ static int superblock_has_perm(struct ta
tsec = tsk->security;
sbsec = sb->s_security;
- return avc_has_perm(tsec->sid, sbsec->sid, SECCLASS_FILESYSTEM,
+ return avc_has_perm(tsec->fssid, sbsec->sid, SECCLASS_FILESYSTEM,
perms, ad);
}
@@ -1376,7 +1378,7 @@ static int selinux_ptrace(struct task_st
rc = task_has_perm(parent, child, PROCESS__PTRACE);
/* Save the SID of the tracing process for later use in apply_creds. */
if (!(child->ptrace & PT_PTRACED) && !rc)
- csec->ptrace_sid = psec->sid;
+ csec->ptrace_sid = psec->fssid;
return rc;
}
@@ -1445,7 +1447,7 @@ static int selinux_sysctl(ctl_table *tab
/* The op values are "defined" in sysctl.c, thereby creating
* a bad coupling between this module and sysctl.c */
if(op == 001) {
- error = avc_has_perm(tsec->sid, tsid,
+ error = avc_has_perm(tsec->fssid, tsid,
SECCLASS_DIR, DIR__SEARCH, NULL);
} else {
av = 0;
@@ -1454,7 +1456,7 @@ static int selinux_sysctl(ctl_table *tab
if (op & 002)
av |= FILE__WRITE;
if (av)
- error = avc_has_perm(tsec->sid, tsid,
+ error = avc_has_perm(tsec->fssid, tsid,
SECCLASS_FILE, av, NULL);
}
@@ -1546,7 +1548,7 @@ static int selinux_vm_enough_memory(long
rc = secondary_ops->capable(current, CAP_SYS_ADMIN);
if (rc == 0)
- rc = avc_has_perm_noaudit(tsec->sid, tsec->sid,
+ rc = avc_has_perm_noaudit(tsec->fssid, tsec->fssid,
SECCLASS_CAPABILITY,
CAP_TO_MASK(CAP_SYS_ADMIN),
NULL);
@@ -1598,7 +1600,7 @@ static int selinux_bprm_set_security(str
isec = inode->i_security;
/* Default to the current task SID. */
- bsec->sid = tsec->sid;
+ bsec->sid = tsec->fssid;
/* Reset fs, key, and sock SIDs on execve. */
tsec->create_sid = 0;
@@ -1611,7 +1613,7 @@ static int selinux_bprm_set_security(str
tsec->exec_sid = 0;
} else {
/* Check for a default transition on this program. */
- rc = security_transition_sid(tsec->sid, isec->sid,
+ rc = security_transition_sid(tsec->fssid, isec->sid,
SECCLASS_PROCESS, &newsid);
if (rc)
return rc;
@@ -1622,16 +1624,16 @@ static int selinux_bprm_set_security(str
ad.u.fs.dentry = bprm->file->f_dentry;
if (bprm->file->f_vfsmnt->mnt_flags & MNT_NOSUID)
- newsid = tsec->sid;
+ newsid = tsec->fssid;
- if (tsec->sid == newsid) {
- rc = avc_has_perm(tsec->sid, isec->sid,
+ if (tsec->fssid == newsid) {
+ rc = avc_has_perm(tsec->fssid, isec->sid,
SECCLASS_FILE, FILE__EXECUTE_NO_TRANS, &ad);
if (rc)
return rc;
} else {
/* Check permissions for the transition. */
- rc = avc_has_perm(tsec->sid, newsid,
+ rc = avc_has_perm(tsec->fssid, newsid,
SECCLASS_PROCESS, PROCESS__TRANSITION, &ad);
if (rc)
return rc;
@@ -1810,6 +1812,8 @@ static void selinux_bprm_apply_creds(str
return;
}
}
+ if (tsec->fssid == tsec->sid)
+ tsec->fssid = sid;
tsec->sid = sid;
}
}
@@ -2084,7 +2088,7 @@ static int selinux_inode_init_security(s
if (tsec->create_sid && sbsec->behavior != SECURITY_FS_USE_MNTPOINT) {
newsid = tsec->create_sid;
} else {
- rc = security_transition_sid(tsec->sid, dsec->sid,
+ rc = security_transition_sid(tsec->fssid, dsec->sid,
inode_mode_to_security_class(inode->i_mode),
&newsid);
if (rc) {
@@ -2275,7 +2279,7 @@ static int selinux_inode_setxattr(struct
AVC_AUDIT_DATA_INIT(&ad,FS);
ad.u.fs.dentry = dentry;
- rc = avc_has_perm(tsec->sid, isec->sid, isec->sclass,
+ rc = avc_has_perm(tsec->fssid, isec->sid, isec->sclass,
FILE__RELABELFROM, &ad);
if (rc)
return rc;
@@ -2284,12 +2288,12 @@ static int selinux_inode_setxattr(struct
if (rc)
return rc;
- rc = avc_has_perm(tsec->sid, newsid, isec->sclass,
+ rc = avc_has_perm(tsec->fssid, newsid, isec->sclass,
FILE__RELABELTO, &ad);
if (rc)
return rc;
- rc = security_validate_transition(isec->sid, newsid, tsec->sid,
+ rc = security_validate_transition(isec->sid, newsid, tsec->fssid,
isec->sclass);
if (rc)
return rc;
@@ -2693,7 +2697,7 @@ static int selinux_task_alloc_security(s
tsec2 = tsk->security;
tsec2->osid = tsec1->osid;
- tsec2->sid = tsec1->sid;
+ tsec2->fssid = tsec2->sid = tsec1->sid;
/* Retain the exec, fs, key, and sock SIDs across fork */
tsec2->exec_sid = tsec1->exec_sid;
@@ -2837,7 +2841,7 @@ static int selinux_task_kill(struct task
perm = signal_to_av(sig);
tsec = p->security;
if (secid)
- rc = avc_has_perm(secid, tsec->sid, SECCLASS_PROCESS, perm, NULL);
+ rc = avc_has_perm(secid, tsec->fssid, SECCLASS_PROCESS, perm, NULL);
else
rc = task_has_perm(current, p, perm);
return rc;
@@ -2872,6 +2876,8 @@ static void selinux_task_reparent_to_ini
tsec = p->security;
tsec->osid = tsec->sid;
+ if (tsec->fssid == tsec->sid)
+ tsec->fssid = SECINITSID_KERNEL;
tsec->sid = SECINITSID_KERNEL;
return;
}
@@ -2882,7 +2888,7 @@ static void selinux_task_to_inode(struct
struct task_security_struct *tsec = p->security;
struct inode_security_struct *isec = inode->i_security;
- isec->sid = tsec->sid;
+ isec->sid = tsec->fssid;
isec->initialized = 1;
return;
}
@@ -3054,7 +3060,7 @@ static int socket_has_perm(struct task_s
AVC_AUDIT_DATA_INIT(&ad,NET);
ad.u.net.sk = sock->sk;
- err = avc_has_perm(tsec->sid, isec->sid, isec->sclass, perms, &ad);
+ err = avc_has_perm(tsec->fssid, isec->sid, isec->sclass, perms, &ad);
out:
return err;
@@ -3071,8 +3077,8 @@ static int selinux_socket_create(int fam
goto out;
tsec = current->security;
- newsid = tsec->sockcreate_sid ? : tsec->sid;
- err = avc_has_perm(tsec->sid, newsid,
+ newsid = tsec->sockcreate_sid ? : tsec->fssid;
+ err = avc_has_perm(tsec->fssid, newsid,
socket_type_to_security_class(family, type,
protocol), SOCKET__CREATE, NULL);
@@ -3092,7 +3098,7 @@ static int selinux_socket_post_create(st
isec = SOCK_INODE(sock)->i_security;
tsec = current->security;
- newsid = tsec->sockcreate_sid ? : tsec->sid;
+ newsid = tsec->sockcreate_sid ? : tsec->fssid;
isec->sclass = socket_type_to_security_class(family, type, protocol);
isec->sid = kern ? SECINITSID_KERNEL : newsid;
isec->initialized = 1;
@@ -3904,7 +3910,7 @@ static int ipc_alloc_security(struct tas
isec->sclass = sclass;
isec->ipc_perm = perm;
- isec->sid = tsec->sid;
+ isec->sid = tsec->fssid;
perm->security = isec;
return 0;
@@ -3953,7 +3959,7 @@ static int ipc_has_perm(struct kern_ipc_
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = ipc_perms->key;
- return avc_has_perm(tsec->sid, isec->sid, isec->sclass, perms, &ad);
+ return avc_has_perm(tsec->fssid, isec->sid, isec->sclass, perms, &ad);
}
static int selinux_msg_msg_alloc_security(struct msg_msg *msg)
@@ -3984,7 +3990,7 @@ static int selinux_msg_queue_alloc_secur
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = msq->q_perm.key;
- rc = avc_has_perm(tsec->sid, isec->sid, SECCLASS_MSGQ,
+ rc = avc_has_perm(tsec->fssid, isec->sid, SECCLASS_MSGQ,
MSGQ__CREATE, &ad);
if (rc) {
ipc_free_security(&msq->q_perm);
@@ -4010,7 +4016,7 @@ static int selinux_msg_queue_associate(s
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = msq->q_perm.key;
- return avc_has_perm(tsec->sid, isec->sid, SECCLASS_MSGQ,
+ return avc_has_perm(tsec->fssid, isec->sid, SECCLASS_MSGQ,
MSGQ__ASSOCIATE, &ad);
}
@@ -4062,7 +4068,7 @@ static int selinux_msg_queue_msgsnd(stru
* Compute new sid based on current process and
* message queue this message will be stored in
*/
- rc = security_transition_sid(tsec->sid,
+ rc = security_transition_sid(tsec->fssid,
isec->sid,
SECCLASS_MSG,
&msec->sid);
@@ -4074,11 +4080,11 @@ static int selinux_msg_queue_msgsnd(stru
ad.u.ipc_id = msq->q_perm.key;
/* Can this process write to the queue? */
- rc = avc_has_perm(tsec->sid, isec->sid, SECCLASS_MSGQ,
+ rc = avc_has_perm(tsec->fssid, isec->sid, SECCLASS_MSGQ,
MSGQ__WRITE, &ad);
if (!rc)
/* Can this process send the message */
- rc = avc_has_perm(tsec->sid, msec->sid,
+ rc = avc_has_perm(tsec->fssid, msec->sid,
SECCLASS_MSG, MSG__SEND, &ad);
if (!rc)
/* Can the message be put in the queue? */
@@ -4105,10 +4111,10 @@ static int selinux_msg_queue_msgrcv(stru
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = msq->q_perm.key;
- rc = avc_has_perm(tsec->sid, isec->sid,
+ rc = avc_has_perm(tsec->fssid, isec->sid,
SECCLASS_MSGQ, MSGQ__READ, &ad);
if (!rc)
- rc = avc_has_perm(tsec->sid, msec->sid,
+ rc = avc_has_perm(tsec->fssid, msec->sid,
SECCLASS_MSG, MSG__RECEIVE, &ad);
return rc;
}
@@ -4131,7 +4137,7 @@ static int selinux_shm_alloc_security(st
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = shp->shm_perm.key;
- rc = avc_has_perm(tsec->sid, isec->sid, SECCLASS_SHM,
+ rc = avc_has_perm(tsec->fssid, isec->sid, SECCLASS_SHM,
SHM__CREATE, &ad);
if (rc) {
ipc_free_security(&shp->shm_perm);
@@ -4157,7 +4163,7 @@ static int selinux_shm_associate(struct
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = shp->shm_perm.key;
- return avc_has_perm(tsec->sid, isec->sid, SECCLASS_SHM,
+ return avc_has_perm(tsec->fssid, isec->sid, SECCLASS_SHM,
SHM__ASSOCIATE, &ad);
}
@@ -4230,7 +4236,7 @@ static int selinux_sem_alloc_security(st
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = sma->sem_perm.key;
- rc = avc_has_perm(tsec->sid, isec->sid, SECCLASS_SEM,
+ rc = avc_has_perm(tsec->fssid, isec->sid, SECCLASS_SEM,
SEM__CREATE, &ad);
if (rc) {
ipc_free_security(&sma->sem_perm);
@@ -4256,7 +4262,7 @@ static int selinux_sem_associate(struct
AVC_AUDIT_DATA_INIT(&ad, IPC);
ad.u.ipc_id = sma->sem_perm.key;
- return avc_has_perm(tsec->sid, isec->sid, SECCLASS_SEM,
+ return avc_has_perm(tsec->fssid, isec->sid, SECCLASS_SEM,
SEM__ASSOCIATE, &ad);
}
@@ -4500,14 +4506,19 @@ static int selinux_setprocattr(struct ta
error = avc_has_perm_noaudit(tsec->ptrace_sid, sid,
SECCLASS_PROCESS,
PROCESS__PTRACE, &avd);
- if (!error)
+ if (!error) {
+ if (tsec->fssid == tsec->sid)
+ tsec->fssid = sid;
tsec->sid = sid;
+ }
task_unlock(p);
avc_audit(tsec->ptrace_sid, sid, SECCLASS_PROCESS,
PROCESS__PTRACE, &avd, error, NULL);
if (error)
return error;
} else {
+ if (tsec->fssid == tsec->sid)
+ tsec->fssid = sid;
tsec->sid = sid;
task_unlock(p);
}
@@ -4550,6 +4561,18 @@ static u32 selinux_set_fscreate_secid(u3
return oldsid;
}
+static u32 selinux_set_fssid(u32 secid)
+{
+ struct task_security_struct *tsec = current->security;
+ u32 oldfssid = tsec->fssid;
+
+ if (secid == SECSID_NULL)
+ tsec->fssid = tsec->sid;
+ else
+ tsec->fssid = secid;
+ return oldfssid;
+}
+
#ifdef CONFIG_KEYS
static int selinux_key_alloc(struct key *k, struct task_struct *tsk,
@@ -4599,7 +4622,7 @@ static int selinux_key_permission(key_re
if (perm == 0)
return 0;
- return avc_has_perm(tsec->sid, ksec->sid,
+ return avc_has_perm(tsec->fssid, ksec->sid,
SECCLASS_KEY, perm, NULL);
}
@@ -4735,6 +4758,7 @@ static struct security_operations selinu
.release_secctx = selinux_release_secctx,
.get_fscreate_secid = selinux_get_fscreate_secid,
.set_fscreate_secid = selinux_set_fscreate_secid,
+ .set_fssid = selinux_set_fssid,
.unix_stream_connect = selinux_socket_unix_stream_connect,
.unix_may_send = selinux_socket_unix_may_send,
@@ -4800,7 +4824,7 @@ static __init int selinux_init(void)
if (task_alloc_security(current))
panic("SELinux: Failed to initialize initial task.\n");
tsec = current->security;
- tsec->osid = tsec->sid = SECINITSID_KERNEL;
+ tsec->osid = tsec->fssid = tsec->sid = SECINITSID_KERNEL;
sel_inode_cache = kmem_cache_create("selinux_inode_security",
sizeof(struct inode_security_struct),
diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
index ef2267f..e51eaf3 100644
--- a/security/selinux/include/objsec.h
+++ b/security/selinux/include/objsec.h
@@ -30,6 +30,7 @@ struct task_security_struct {
struct task_struct *task; /* back pointer to task object */
u32 osid; /* SID prior to last execve */
u32 sid; /* current SID */
+ u32 fssid; /* access override SID (normally == sid) */
u32 exec_sid; /* exec SID */
u32 create_sid; /* fscreate SID */
u32 keycreate_sid; /* keycreate SID */
On Thu, 2006-11-02 at 11:29 -0500, Karl MacMillan wrote:
> On Wed, 2006-11-01 at 12:45 -0500, Stephen Smalley wrote:
> > On Wed, 2006-11-01 at 10:58 -0500, Karl MacMillan wrote:
>
> <snip>
>
>
> > > fssid seems like the wrong name, though it does match the DAC concept.
> > > This is really more general impersonation of another domain by the
> > > kernel and might have other uses.
> >
> > NFS will want a fssid in order to have file access checks applied
> > against the client process' SID if/when the client process' context
> > becomes available.
>
> I was suggesting that it might be helpful if this applied to all checks
> - not just file access - and would therefore need a different name.
Not all checks (e.g. not when checking the ability of another task to
send a signal to the task), but all checks where the task SID is used as
the subject/source.
--
Stephen Smalley
National Security Agency
On Thu, 2006-11-02 at 17:16 +0000, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
> > Unless there is some benefit to setting a ->fssid and checking against
> > it (e.g. safeguarding the module against unintentional internal access),
> > I think the task flag approach is preferable.
>
> Well, I think the use of ->fssid is simpler and faster from an implementation
> standpoint (see the attached patch), and it can always be used later for things
> such as the in-kernel nfsd.
>
> The way I've done it in this patch is to have ->fssid shadow ->sid as long as
> they're the same. But ->fssid can be set to something else and then later
> reset, at which point it becomes the same as ->sid again. A hook is provided
> to perform both of these operations:
>
> security_set_fssid(overriding_SID); //set
> ...
> security_set_fssid(SECSID_NULL); //reset
>
> The rest of the patch is that more or less anywhere ->sid is used to represent
> a process as an actor, this is replaced with ->fssid. This part requires no
> conditional jumps. It becomes a bit tricky around execve() time, but I think
> it's reasonable to ignore that as execve() is unlikely to happen in an
> overridden context; or maybe the execve() related ops should be failed if
> ->sid != ->fssid.
>
> Do you think this is reasonable? Or do you definitely want me to use the
> suppression flag approach instead?
Just why are you doing all this? Why do we need a back-end that requires
all this extra client-side security infrastructure in order to work?
IOW: What is wrong with the existing CacheFS?
Trond
On Thu, 2006-11-02 at 17:16 +0000, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
> > Unless there is some benefit to setting a ->fssid and checking against
> > it (e.g. safeguarding the module against unintentional internal access),
> > I think the task flag approach is preferable.
>
> Well, I think the use of ->fssid is simpler and faster from an implementation
> standpoint (see the attached patch), and it can always be used later for things
> such as the in-kernel nfsd.
>
> The way I've done it in this patch is to have ->fssid shadow ->sid as long as
> they're the same. But ->fssid can be set to something else and then later
> reset, at which point it becomes the same as ->sid again. A hook is provided
> to perform both of these operations:
>
> security_set_fssid(overriding_SID); //set
> ...
> security_set_fssid(SECSID_NULL); //reset
>
> The rest of the patch is that more or less anywhere ->sid is used to represent
> a process as an actor, this is replaced with ->fssid. This part requires no
> conditional jumps. It becomes a bit tricky around execve() time, but I think
> it's reasonable to ignore that as execve() is unlikely to happen in an
> overridden context; or maybe the execve() related ops should be failed if
> ->sid != ->fssid.
>
> Do you think this is reasonable? Or do you definitely want me to use the
> suppression flag approach instead?
It doesn't look simpler, but if you take this approach, then:
1) fssid needs to be renamed to reflect its use as the subject/actor
label (e.g. => "subjsid"),
2) Use "secid" in the core code and LSM interfaces rather than just
"sid" to avoid confusion with session ids (e.g. => "subjsecid"),
3) Define and use a SECID_NULL or similar in the LSM interface
(SECSID_NULL is SELinux private),
4) Be careful about overuse of this id (see below for some examples).
> The flag approach is a bit more work to implement and will slow most ops down
> by virtue of including an extra check, though not all ops can be bypassed (for
> instance the op that assigns a security label to an inode can't be bypassed).
In most cases, you could just generalize the existing IS_PRIVATE(inode)
tests in the security static inlines in security.h to also incorporate a
task flag test. security_inode_init_security() would be an exception.
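For instance, the mkdir inline could become something like this (a sketch of
that generalisation, reusing the same hypothetical PF_CACHE_ACCESS flag as
above):

static inline int security_inode_mkdir(struct inode *dir,
				       struct dentry *dentry, int mode)
{
	if (unlikely(IS_PRIVATE(dir)) || (current->flags & PF_CACHE_ACCESS))
		return 0;
	return security_ops->inode_mkdir(dir, dentry, mode);
}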
> There are a couple of things I'm not sure about with the ->fssid approach:
>
> (1) What will auditing do? Is it possible to have an SID that isn't audited?
>
> (2) How do I come up with a security label for the module to use?
Note that these issues don't exist in the task flag approach. I'm not
entirely sure what you mean by (1); the existing syscall audit support
would never look at your fssid, just the ->sid, and only at syscall
entry and exit. SELinux avc audit messages from permission checks would
include the fssid's context if that was the basis of the check.
For (2), you have to make up a SID. This differs from how NFS would use
such a facility, since it could just use the client process' context (if
that were available), but in this case you have no process credential
that is exactly what you want to apply (not even the cache daemon's SID
is precisely right). This goes back to the earlier discussion of
security_transition_sid. But this requires policy configuration to make
it work properly, and the absence of the necessary type transition and
allow rules could yield a broken cache. This makes me doubtful about
this approach for cachefiles (vs. NFS, where it makes more sense).
> diff --git a/include/linux/security.h b/include/linux/security.h
> index ff20ec4..48e9e43 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -1166,6 +1166,11 @@ #ifdef CONFIG_SECURITY
> * Set the current FS security ID.
> * @secid contains the security ID to set.
> *
> + * @set_fssid:
s/fssid/subjsecid/ or similar. Likewise throughout.
> @@ -2837,7 +2841,7 @@ static int selinux_task_kill(struct task
> perm = signal_to_av(sig);
> tsec = p->security;
> if (secid)
> - rc = avc_has_perm(secid, tsec->sid, SECCLASS_PROCESS, perm, NULL);
> + rc = avc_has_perm(secid, tsec->fssid, SECCLASS_PROCESS, perm, NULL);
The task is the target of a check here, so you don't want to use the
overriding SID for it. Otherwise, you may have a false denial or a
false allow of the signal.
> @@ -2882,7 +2888,7 @@ static void selinux_task_to_inode(struct
> struct task_security_struct *tsec = p->security;
> struct inode_security_struct *isec = inode->i_security;
>
> - isec->sid = tsec->sid;
> + isec->sid = tsec->fssid;
Here we are labeling a /proc/pid inode with the task SID, when it is
created or revalidated, so you don't want to use the overriding SID here
either.
--
Stephen Smalley
National Security Agency
Trond Myklebust <[email protected]> wrote:
> Just why are you doing all this? Why do we need a back-end that requires
> all this extra client-side security infrastructure in order to work?
Well, both Christoph and Al are of the opinion that I should be using
vfs_mkdir() and co rather than bypassing the security and calling inode ops
directly.
Also I should be setting security labels on the files I create.
> IOW: What is wrong with the existing CacheFS?
Well, you did require it to be reviewed by Christoph...
David
Stephen Smalley <[email protected]> wrote:
> It doesn't look simpler, but if you take this approach, then:
> 1) fssid needs to be renamed to reflect its use as the subject/actor
> label (e.g. => "subjsid"),
Okay.
> 2) Use "secid" in the core code and LSM interfaces rather than just
> "sid" to avoid confusion with session ids (e.g. => "subjsecid"),
Okay.
> 3) Define and use a SECID_NULL or similar in the LSM interface
> (SECSID_NULL is SELinux private),
Defining a reset function would be better.
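Something like this pair, say (only a sketch, using the subjsecid/subjsid
names you suggested above; nothing here is final):

static int selinux_set_subjsecid(u32 secid)
{
	struct task_security_struct *tsec = current->security;

	tsec->subjsid = secid;		/* act as this secid for fs ops */
	return 0;
}

static void selinux_reset_subjsecid(void)
{
	struct task_security_struct *tsec = current->security;

	tsec->subjsid = tsec->sid;	/* go back to shadowing ->sid */
}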
> 4) Be careful about overuse of this id (see below for some examples).
Yeah.
> > The flag approach is a bit more work to implement and will slow most ops
> > down by virtue of including an extra check, though not all ops can be
> > bypassed (for instance the op that assigns a security label to an inode
> > can't be bypassed).
>
> In most cases, you could just generalize the existing IS_PRIVATE(inode)
> tests in the security static inlines in security.h to also incorporate a
> task flag test. security_inode_init_security() would be an exception.
So you were thinking of doing it in security.h?
> > There are a couple of things I'm not sure about with the ->fssid approach:
> >
> > (1) What will auditing do? Is it possible to have an SID that isn't
> > audited?
> >
> > (2) How do I come up with a security label for the module to use?
>
> Note that these issues don't exist in the task flag approach.
Yes.
> I'm not entirely sure what you mean by (1); the existing syscall audit
> support would never look at your fssid, just the ->sid, and only at syscall
> entry and exit.
Which is irrelevant since the CacheFiles module doesn't make syscalls.
> SELinux avc audit messages from permission checks would include the fssid's
> context if that was the basis of the check.
That'd probably be sufficient since at least you could tell them apart.
> For (2), you have to make up a SID.
I've done this and got it working with Karl's help.
In cachefilesd.te, I have:
...
type cachefilesd_t;
type cachefilesd_exec_t;
domain_type(cachefilesd_t)
init_daemon_domain(cachefilesd_t, cachefilesd_exec_t)
require { type kernel_t; }
type cachefiles_kernel_t;
domain_type(cachefiles_kernel_t)
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
...
Then I added a CacheFiles hook to do:
static int selinux_cachefiles_getsid(u32 secid, u32 *modsecid)
{
	return security_transition_sid(secid, SECINITSID_KERNEL,
				       SECCLASS_PROCESS, modsecid);
}
It takes the daemon's security ID and translates it to the module's security
ID.
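Roughly, when the daemon brings the cache online, CacheFiles does something
like this to work out the SID it should act as (sketch only: cache_secid and
the security_cachefiles_getsid() wrapper name are illustrative):

static int cachefiles_determine_cache_secid(struct cachefiles_cache *cache)
{
	u32 daemon_secid;

	/* label the cache after the daemon that brought it online
	 * (assuming the task_getsecid hook is usable here) */
	security_task_getsecid(current, &daemon_secid);

	/* translate that to the module's own acting SID via the new hook */
	return security_cachefiles_getsid(daemon_secid, &cache->cache_secid);
}

It then acts as cache->cache_secid whenever it has to touch the backing
filesystem.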
The way I was thinking about it is that a CacheFiles cache has a presence in
the kernel that lasts as long as the cache is in active service. This could
be considered the equivalent of a process, although it doesn't have a
task_struct of its own, but rather makes use of the task_struct of processes
that want to use the service.
> This differs from how NFS would use such a facility, since it could just use
> the client process' context (if that were available),
Indeed; but once NFS had a SID, the two would then be the same.
> But this requires policy configuration to make it work properly, and the
> absence of the necessary type transition and allow rules could yield a
> broken cache.
If the rule isn't there then the cache would refuse to start if SELinux is in
enforcing mode, assuming security_transition_sid() returns an error in that
case. Otherwise it'll run in cachefilesd's context, I think, which will cause
it to give an error and refuse to cache in enforcing mode because that context
doesn't permit it to do certain things it checks for (like creating files and
directories).
> > + rc = avc_has_perm(secid, tsec->fssid, SECCLASS_PROCESS, perm, NULL);
> ...
> The task is the target of a check here, so you don't want to use the
> overriding SID for it. Otherwise, you may have a false denial or a
> false allow of the signal.
Yep... I got that one wrong.
> > - isec->sid = tsec->sid;
> > + isec->sid = tsec->fssid;
>
> Here we are labeling a /proc/pid inode with the task SID, when it is
> created or revalidated, so you don't want to use the overriding SID here
> either.
And again.
Anyway, I'll expand what I've done and give it a go. I may find compelling
reasons[*] not to do it this way. I can always ditch the patches after all...
[*] Overzealous auditing most probably.
Thanks,
David
On Thu, 2006-11-02 at 20:38 +0000, David Howells wrote:
> Trond Myklebust <[email protected]> wrote:
>
> > Just why are you doing all this? Why do we need a back-end that requires
> > all this extra client-side security infrastructure in order to work?
>
> Well, both Christoph and Al are of the opinion that I should be using
> vfs_mkdir() and co rather than bypassing the security and calling inode ops
> directly.
...but why are you needing to call vfs_mkdir? I thought the standard
cachefs backend just uses a pool of files, rather like the original AFS
cache did. Are you trying to mirror the layout and the permissions of
the NFS filesystem? That is a lot more work than it is worth...
> Also I should be setting security labels on the files I create.
To what end? These files shouldn't need to be made visible to userland
at all.
Cheers,
Trond
Trond Myklebust <[email protected]> wrote:
> > Well, both Christoph and Al are of the opinion that I should be using
> > vfs_mkdir() and co rather than bypassing the security and calling inode ops
> > directly.
>
> ...but why are you needing to call vfs_mkdir? I thought the standard
> cachefs backend just uses a pool of files, rather like the original AFS
> cache did. Are you trying to mirror the layout and the permissions of
> the NFS filesystem? That is a lot more work than it is worth...
No.
I have to have a way of mapping a netfs file onto a cache storage object. I do
this by translating the keys the netfs gives me into a path of directory and
file names in the cache.
In NFS's case there are three keys describing a file:
(1) "NFS"
(2) { Server IP, Server port, NFS version }
(3) NFS File Handle
These wind up in the filesystem looking something like:
INDEX INDEX INDEX DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
See Documentation/filesystems/caching/cachefiles.txt for more information, in
particular the section "Cache Structure".
Anyway, it's not just vfs_mkdir(), there's also vfs_create(), vfs_rename(),
vfs_unlink(), vfs_setxattr(), vfs_getxattr(), and I'm going to need a
vfs_lookup() or something (a pathwalk to next dentry).
Yes, I'd prefer not to have to use these, but that doesn't seem to be an
option.
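To give a flavour, the per-component pathwalk step comes out looking vaguely
like this (heavily simplified: cachefiles_walk_one is an illustrative name,
and error handling and the xattr coherency checks are omitted):

static struct dentry *cachefiles_walk_one(struct dentry *dir, const char *name)
{
	struct dentry *next;

	mutex_lock(&dir->d_inode->i_mutex);

	next = lookup_one_len(name, dir, strlen(name));
	if (!IS_ERR(next) && !next->d_inode) {
		/* not in the cache yet, so create it - and this is where
		 * the DAC/MAC checks inside vfs_mkdir() bite */
		int ret = vfs_mkdir(dir->d_inode, next, 0700);
		if (ret < 0) {
			dput(next);
			next = ERR_PTR(ret);
		}
	}

	mutex_unlock(&dir->d_inode->i_mutex);
	return next;
}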
> > Also I should be setting security labels on the files I create.
>
> To what end? These files shouldn't need to be made visible to userland
> at all.
But they are visible to userland and they have to be visible to userland. They
exist in the filesystem in which the cache resides, and they need to be visible
so that cachefilesd can do the culling work. If you were thinking of using
"deleted" files, remember that I want it to be persistent across reboots, or
even when an NFS inode is flushed from memory to make space and then reloaded
later.
David
On Fri, 2006-11-03 at 10:27 +0000, David Howells wrote:
> Anyway, it's not just vfs_mkdir(), there's also vfs_create(), vfs_rename(),
> vfs_unlink(), vfs_setxattr(), vfs_getxattr(), and I'm going to need a
> vfs_lookup() or something (a pathwalk to next dentry).
>
> Yes, I'd prefer not to have to use these, but that doesn't seem to be an
> option.
It is not as if we care about an extra context switch here, and we
really don't want to do that file i/o in the context of the rpciod
process if we can avoid it. It might be nice to be able to do those
calls that only involve lookup+read in the context of the user's process
in order to avoid the context switch when paging in data from the cache,
but writing to it both can and should be done as a write-behind process.
IOW: we should rather set up a separate workqueue to write data to disk,
and just concentrate on working out a way to lookup and read data with
no fsuid/fsgid changes and preferably a minimum of selinux magic.
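Something along these lines, say (very rough sketch - the struct, the names
and the queueing granularity are all up for grabs, and fscache_write_page()
stands in for whatever your existing store call looks like):

struct fscache_store_req {
	struct work_struct	work;
	struct fscache_cookie	*cookie;	/* identifies the backing file */
	struct page		*page;		/* page just read from the server */
};

static struct workqueue_struct *fscache_store_wq;

static void fscache_store_worker(void *data)
{
	struct fscache_store_req *req = data;

	/* best effort: a failure here must not affect the NFS read that
	 * has already completed */
	fscache_write_page(req->cookie, req->page, GFP_NOFS);
	page_cache_release(req->page);
	kfree(req);
}

/* called from the read completion path instead of writing synchronously */
static void fscache_queue_store(struct fscache_cookie *cookie,
				struct page *page)
{
	struct fscache_store_req *req = kmalloc(sizeof(*req), GFP_ATOMIC);

	if (!req)
		return;		/* dropping the store is acceptable */
	page_cache_get(page);
	req->cookie = cookie;
	req->page = page;
	INIT_WORK(&req->work, fscache_store_worker, req);
	queue_work(fscache_store_wq, &req->work);
}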
> > > Also I should be setting security labels on the files I create.
> >
> > To what end? These files shouldn't need to be made visible to userland
> > at all.
>
> But they are visible to userland and they have to be visible to userland. They
> exist in the filesystem in which the cache resides, and they need to be visible
> so that cachefilesd can do the culling work. If you were thinking of using
> "deleted" files, remember that I want it to be persistent across reboots, or
> even when an NFS inode is flushed from memory to make space and then reloaded
> later.
No. I was thinking of keeping the cache on its own partition and using
kernel mounts. cachefilesd could possibly mount the thing in its own
private namespace.
Trond
Trond Myklebust <[email protected]> wrote:
> It is not as if we care about an extra context switch here,
Avoiding context switches isn't the main problem; avoiding serialisation is.
> and we really don't want to do that file i/o in the context of the rpciod
> process if we can avoid it.
We aren't doing file I/O in the context of the rpciod process, at least not to
the cache. It's done in the context of the process that issued the read, write
or pagefault; keventd handles the copying of data from the backing file pages
to the netfs pages when we satisfy a read from the cache.
> It might be nice to be able to do those calls that only involve lookup+read
> in the context of the user's process in order to avoid the context switch
> when paging in data from the cache,
We do most of both in the user's process's context already.
The security problems come from the lookups, creates, xattr ops that we have to
do when acquiring a cookie. All of these are done in the context of the
process that called iget5 for NFS. We could farm them out to another process
to avoid the security, but that would then cause serialisation and context
switches.
> but writing to it both can and should be done as a write-behind process.
>
> IOW: we should rather set up a separate workqueue to write data to disk,
> and just concentrate on working out a way to lookup and read data with
> no fsuid/fsgid changes and preferably a minimum of selinux magic.
Writing data to the cache is done by the pagecache at the moment. I'd really
like to use direct I/O instead as that'd mean I wouldn't need shadow pages in
the page cache and I'd also be able to avoid the page-to-page copy I'm
currently doing.
David
Trond Myklebust <[email protected]> wrote:
> No. I was thinking of keeping the cache on its own partition
That's a requirement I am specifically avoiding with CacheFiles. I might, for
instance, want to use it on my laptop, and I don't really have enough space to
set aside a partition just for that. The whole point of CacheFiles is that
you don't have to set one aside. If you're going to do that, then CacheFS
should be a better option.
> and using kernel mounts. cachefilesd could possibly mount the thing in its
> own private namespace.
That's still user visible, and SELinux in enforcing mode would still apply.
David
On Fri, 2006-11-03 at 15:23 +0000, David Howells wrote:
> Trond Myklebust <[email protected]> wrote:
>
> > It is not as if we care about an extra context switch here,
>
> Avoiding context switches aren't the main problem; avoiding serialisation is.
Why? It is a backing cache. The only case where serialisation ought to
bother you is the case where the client has to invalidate the cache due
to a server-side update of the file.
> > and we really don't want to do that file i/o in the context of the rpciod
> > process if we can avoid it.
>
> We aren't doing file I/O in the context of the rpciod process, at least not to
> the cache. It's either done in the context of the process that issued the
> read, write or pagefault.
Once the RPC calls have been launched, the process returns to the VM
layer and just waits for the next page to be unlocked. It never returns
to the filesystem layer. So where are you using the process context to
write out the cached data?
> The security problems come from the lookups, creates, xattr ops that we have to
> do when acquiring a cookie. All of these are done in the context of the
> process that called iget5 for NFS. We could farm them out to another process
> to avoid the security, but that would then cause serialisation and context
> switches.
The cookie lookups need to be synchronous, but why would the file
creation need to be synchronous? Creating the cachefs file and waiting
on that to complete etc are all utterly useless activities as far as
satisfying the user request for data goes. Just start the process of
creating a backing file, and then get on with the actual syscall.
> Writing data to the cache is done by the pagecache at the moment. I'd really
> like to use direct I/O instead as that'd mean I wouldn't need shadow pages in
> the page cache and I'd also be able to avoid the page-to-page copy I'm
> currently doing.
Agreed.
Trond
Trond Myklebust <[email protected]> wrote:
> > Avoiding context switches aren't the main problem; avoiding serialisation
> > is.
>
> Why? It is a backing cache. The only case where serialisation ought to
> bother you is the case where the client has to invalidate the cache due
> to a server-side update of the file.
Cache invalidation is not so much of a problem as at that point we know exactly
where whatever it is that we're invalidating is, and if it's a big object we
just move it somewhere for the userspace daemon to splat.
The serialisation problem is that if we put cache lookup in its own thread,
then in effect every open[*] of an NFS file, AFS file, whatever, will be
serialised.
It almost certainly wouldn't matter if what we did was to asynchronously look
up the cache cookie for a file. In the common case, an open(O_RDONLY) syscall
is followed almost immediately by a read(), so there's not much to be gained
from asynchronising things as the cache cookie has to be available by the time
we come to process the read, but we can't get the cache cookie before
completing the server checks made by open, as we need the coherency data before
attempting to acquire a cookie.
The serialisation would stem from having to do several synchronous filesystem
ops for each cache lookup, but only having one thread in which to do them.
Okay, I could have several worker threads, but why? Each process attempting to
access the cache provides me with a suitable worker thread, and then I can have
as many as there are tasks on the system.
[*] Note that for NFS I've now incorporated a patch from Steve Dickson to
acquire file cookies on the NFS open() file op, rather than during iget because
NFS readdir calls iget.
> Once the RPC calls have been launched, the process returns to the VM
> layer and just waits for the next page to be unlocked. It never returns
> to the filesystem layer. So where are you using the process context to
> write out the cached data?
What do you mean by "write out the cached data"? Do you mean write the data to
the cache?
If so, that'd be nfs_readpage_to_fscache() as called from nfs_readpage_sync()
or nfs_readpage_release().
That calls fscache_write_page() which calls cachefiles_write_page() which calls
generic_file_buffered_write_one_kernel_page().
That last copies the data into the pagecache attached to an ext3 inode to be
written out (hopefully) asynchronously.
However, that may do other disk accesses, I suppose, as it calls
prepare_write() and commit_write() on ext3.
I could try and make it asynchronous, but that means more overhead in other
ways:-( I presume this will then sometimes be running in rpciod context?
> The cookie lookups need to be synchronous, but why would the file
> creation need to be synchronous? Creating the cachefs file and waiting
> on that to complete etc are all utterly useless activities as far as
> satisfying the user request for data goes. Just start the process of
> creating a backing file, and then get on with the actual syscall.
vfs_mkdir() is synchronous. vfs_create() is synchronous. vfs_[sg]etxattr is
synchronous. Lookup is synchronous.
Yes, I could make them all asynchronous, but it'd be a lot more work, and
mostly unnecessary, and I'd probably have to fight down lots of objections.
Remember: in the common case, open(O_RDONLY) is going to be followed quickly by
a read(). I suppose there may be an intervening stat() and malloc(), but even
so...
David
On Tue, 2006-11-14 at 19:22 +0000, David Howells wrote:
> Trond Myklebust <[email protected]> wrote:
>
> > > Avoiding context switches aren't the main problem; avoiding serialisation
> > > is.
> >
> > Why? It is a backing cache. The only case where serialisation ought to
> > bother you is the case where the client has to invalidate the cache due
> > to a server-side update of the file.
>
> Cache invalidation is not so much of a problem as at that point we know exactly
> where whatever it is that we're invalidating is, and if it's a big object we
> just move it somewhere for the userspace daemon to splat.
>
> The serialisation problem is that if we put cache lookup in its own thread,
> then in effect every open[*] of an NFS file, AFS file, whatever, will be
> serialised.
>
> It almost certainly wouldn't matter if what we did was to asynchronously look
> up the cache cookie for a file. In the common case, an open(O_RDONLY) syscall
> is followed almost immediately by a read(), so there's not much to be gained
> from asynchronising things as the cache cookie has to be available by the time
> we come to process the read, but we can't get the cache cookie before
> completing the server checks made by open, as we need the coherency data before
> attempting to acquire a cookie.
No. All you should need is the result of the lookup(). The coherency
data needs to be checked against an eventual existing CacheFiles entry
during the call to read(), not before.
> The serialisation would stem from having to do several synchronous filesystem
> ops for each cache lookup, but only having one thread in which to do them.
> Okay, I could have several worker threads, but why? Each process attempting to
> access the cache provides me with a suitable worker thread, and then I can have
> as many as there are tasks on the system.
Umm...because the former is a model which actually fits your security
requirements (i.e. one privileged daemon gets lookup()+open()+mkdir()...
rights on the CacheFiles partition), whereas the latter is not (all
tasks need lookup()+open()+mkdir().... privileges)?
> [*] Note that for NFS I've now incorporated a patch from Steve Dickson to
> acquire file cookies on the NFS open() file op, rather than during iget because
> NFS readdir calls iget.
>
> > Once the RPC calls have been launched, the process returns to the VM
> > layer and just waits for the next page to be unlocked. It never returns
> > to the filesystem layer. So where are you using the process context to
> > write out the cached data?
>
> What do you mean by "write out the cached data"? Do you mean write the data to
> the cache?
>
> If so, that'd be nfs_readpage_to_fscache() as called from nfs_readpage_sync()
> or nfs_readpage_release().
nfs_readpage_release() is an rpciod context, _not_ a user thread
context.
> That calls fscache_write_page() which calls cachefiles_write_page() which calls
> generic_file_buffered_write_one_kernel_page().
>
> That last copies the data into the pagecache attached to an ext3 inode to be
> written out (hopefully) asynchronously.
>
> However, that may do other disk accesses, I suppose, as it calls
> prepare_write() and commit_write() on ext3.
Which would generally be forbidden under the rpciod context, BTW, since
they imply calls to generic memory allocation (== nasty tricksy deadlock
potential, since rpciod may be called upon to help write out NFS pages
via shrink_page_list and friends).
> I could try and make it asynchronous, but that means more overhead in other
> ways:-( I presume this will then sometimes be running in rpciod context?
>
> > The cookie lookups need to be synchronous, but why would the file
> > creation need to be synchronous? Creating the cachefs file and waiting
> > on that to complete etc are all utterly useless activities as far as
> > satisfying the user request for data goes. Just start the process of
> > creating a backing file, and then get on with the actual syscall.
>
> vfs_mkdir() is synchronous. vfs_create() is synchronous. vfs_[sg]etxattr is
> synchronous. Lookup is synchronous.
All of them are synchronous as far as accessing the remote filesystem is
concerned. Why would the user process care if a privileged daemon has
completed the shadow mkdir() or create() on the CacheFiles system or
not?
> Yes, I could make them all asynchronous, but it'd be a lot more work, and
> mostly unnecessary, and I'd probably have to fight down lots of objections.
>
>
> Remember: in the common case, open(O_RDONLY) is going to be followed quickly by
> a read(). I suppose there may be an intervening stat() and malloc(), but even
> so...
Which is why lookup() + open() + read() needs to be fast in the case
where you have a CacheFiles hit. It does not justify mkdir, create, etc
being fast, nor does it justify the open() + read() part needing to be
fast in the case of a CacheFiles miss.
Trond
Trond Myklebust <[email protected]> wrote:
> No. All you should need is the result of the lookup().
Plus the mkdirs, etc., but I'll assume you implied those.
> The coherency data needs to be checked against an eventual existing
> CacheFiles entry during the call to read(), not before.
Or mmap().
But, yes, whilst that's true, the read() is almost certainly going to occur
very quickly after the open(), and so the likelihood is that asynchronicity is
not going to matter.
> Umm...because the former is a model which actually fits your security
> requirements (i.e. one privileged daemon gets lookup()+open()+mkdir()...
> rights on the CacheFiles partition), whereas the latter is not (all
> tasks need lookup()+open()+mkdir().... privileges)?
The former model is much more complex than the latter, and also has performance
issues.
Besides, the latter model seems to work for me (as long as I'm allowed to
change the security ID a process is acting as).
> > If so, that'd be nfs_readpage_to_fscache() as called from
> > nfs_readpage_sync() or nfs_readpage_release().
>
> nfs_readpage_release() is an rpciod context, _not_ a user thread
> context.
But it *is* process context...
I'm not sure what to do about that. I could maintain a record of every page
being serviced by FS-Cache, but that could be an awful lot of pages, and so
consume an awful lot of memory.
I'm hesitant about allocating memory to hold a list of pages in that context as
it might fail.
> > However, that may do other disk accesses, I suppose, as it calls
> > prepare_write() and commit_write() on ext3.
>
> Which would generally be forbidden under the rpciod context, BTW, since
> they imply calls to generic memory allocation (== nasty tricksy deadlock
> potential, since rpciod may be called upon to help write out NFS pages
> via shrink_page_list and friends).
I wish you'd mentioned that six months or more ago when I first waved these
patches in front of you.
> > vfs_mkdir() is synchronous. vfs_create() is synchronous. vfs_[sg]etxattr
> > is synchronous. Lookup is synchronous.
>
> All of them are synchronous as far as accessing the remote filesystem is
> concerned.
What has "accessing the remote filesystem" got to do with it?
> Why would the user process care if a privileged daemon has completed the
> shadow mkdir() or create() on the CacheFiles system or not?
Sigh.
The user process may not do a read() or an mmap() on a read-only file it has
just opened until we've found the data storage object in the cache associated
with that file, or it has created one.
However, several user processes may be accessing several different files on
several different mounts of maybe several different netfs's all at the same
time. This could mean that on behalf of each of these processes, CacheFiles
has to create several directories and a file.
But it may only have one thread in which to do this, and so each "cache lookup"
will wind up being serialised around the synchronous ext3 operations that
create files and directories and set xattrs. Consider:
SERVICE PROC A PROC B PROC C PROC D
======= ======= ======= ======= =======
open() open() open() open()
A:lookup()
A:mkdir()
A:setxattr()
A:mkdir()
A:setxattr()
A:mkdir()
A:setxattr()
A:mkdir()
A:setxattr()
A:mkdir()
A:setxattr()
A:create()
A:setxattr()
read()
B:lookup()
B:mkdir()
B:setxattr()
B:mkdir()
B:setxattr()
B:mkdir()
B:setxattr()
B:mkdir()
B:setxattr()
B:mkdir()
B:setxattr()
B:create()
B:setxattr()
read()
C:lookup()
C:mkdir()
C:setxattr()
C:mkdir()
C:setxattr()
C:mkdir()
C:setxattr()
C:mkdir()
C:setxattr()
C:mkdir()
C:setxattr()
C:create()
C:setxattr()
read()
D:lookup()
D:mkdir()
D:setxattr()
D:mkdir()
D:setxattr()
D:mkdir()
D:setxattr()
D:mkdir()
D:setxattr()
D:mkdir()
D:setxattr()
D:create()
D:setxattr()
read()
Which is what you're suggesting. Whereas what I have at the moment is this:
PROC A PROC B PROC C PROC D
=============== =============== =============== ===============
open() open() open() open()
A:lookup() B:lookup() C:lookup() D:lookup()
A:mkdir() B:mkdir() C:mkdir() D:mkdir()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
A:mkdir() B:mkdir() C:mkdir() D:mkdir()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
A:mkdir() B:mkdir() C:mkdir() D:mkdir()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
A:mkdir() B:mkdir() C:mkdir() D:mkdir()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
A:mkdir() B:mkdir() C:mkdir() D:mkdir()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
A:create() B:create() C:create() D:create()
A:setxattr() B:setxattr() C:setxattr() D:setxattr()
read() read() read() read()
Even if you replace the mkdir()'s and create()'s with lookup()'s and the
setxattr()'s with getxattr()'s, it doesn't make much difference: lookup() and
getxattr() are both reads and both synchronous.
> > Remember: in the common case, open(O_RDONLY) is going to be followed
> > quickly by a read(). I suppose there may be an intervening stat() and
> > malloc(), but even so...
>
> Which is why lookup() + open() + read() needs to be fast in the case
> where you have a CacheFiles hit. It does not justify mkdir, create, etc
> being fast, nor does it justify the open() + read() part needing to be
> fast in the case of a CacheFiles miss.
My point is that if userspace does open() + read() with no or little
intervening stuff, then asynchronicity is irrelevant as the whole process is
rendered effectively synchronous because the first read() has to wait on
completion.
And once an inode has been assayed for cache presence, subsequent open() calls
don't need to go to the cache as the state is retained in memory until the
inode is discarded.
David
On Wed, 2006-11-15 at 15:28 +0000, David Howells wrote:
> Trond Myklebust <[email protected]> wrote:
>
> > No. All you should need is the result of the lookup().
>
> Plus the mkdirs, etc., but I'll assume you implied those.
No! The success or failure of the NFS operation does _not_ depend on
CacheFiles success or failure. As far as I'm concerned, the mkdirs etc
actually _want_ to be farmed out to some background daemon so that they
have no effect on NFS performance. See below.
> > The coherency data needs to be checked against an eventual existing
> > CacheFiles entry during the call to read(), not before.
>
> Or mmap().
>
> But, yes, whilst that's true, the read() is almost certainly going to occur
> very quickly after the open(), and so the likelihood is that asynchronicity is
> not going to matter.
In the event of a mkdir() or create(), asynchronicity _does_ make a
difference, because there is no cached file to read.
> > Umm...because the former is a model which actually fits your security
> > requirements (i.e. one privileged daemon gets lookup()+open()+mkdir()...
> > rights on the CacheFiles partition), whereas the latter is not (all
> > tasks need lookup()+open()+mkdir().... privileges)?
>
> The former model is much more complex than the latter, and also has performance
> issues.
See below.
> Besides, the latter model seems to work for me (as long as I'm allowed to
> change the security ID a process is acting as).
>
> > > If so, that'd be nfs_readpage_to_fscache() as called from
> > > nfs_readpage_sync() or nfs_readpage_release().
> >
> > nfs_readpage_release() is an rpciod context, _not_ a user thread
> > context.
>
> But it *is* process context...
No. It has a whole bunch of restrictions that normal process contexts do
not.
> I'm not sure what to do about that. I could maintain a record of every page
> being serviced by FS-Cache, but that could be an awful lot of pages, and so
> consume an awful lot of memory.
>
> I'm hesitant about allocating memory to hold a list of pages in that context as
> it might fail.
All that you want to _try_ to ensure is that the page gets written out
before the memory shrinker kicks it out of the page cache. There is
absolutely no need to do this synchronously. There is not even a need to
guarantee that all read pages will hit fscache.
Again, this is something that a background daemon could happily do for
you.
> > > However, that may do other disk accesses, I suppose, as it calls
> > > prepare_write() and commit_write() on ext3.
> >
> > Which would generally be forbidden under the rpciod context, BTW, since
> > they imply calls to generic memory allocation (== nasty tricksy deadlock
> > potential, since rpciod may be called upon to help write out NFS pages
> > via shrink_page_list and friends).
>
> I wish you'd mentioned that six months or more ago when I first waved these
> patches in front of you.
>
> > > vfs_mkdir() is synchronous. vfs_create() is synchronous. vfs_[sg]etxattr
> > > is synchronous. Lookup is synchronous.
> >
> > All of them are synchronous as far as accessing the remote filesystem is
> > concerned.
>
> What has "accessing the remote filesystem" got to do with it?
>
> > Why would the user process care if a privileged daemon has completed the
> > shadow mkdir() or create() on the CacheFiles system or not?
It doesn't, and that is my point. If the fscache backing store doesn't
have the file, then the user process wants to talk to the remote server
ASAP. It doesn't want to waste its time creating CacheFiles entries.
> Sigh.
>
> The user process may not do a read() or an mmap() on a read-only file it has
> just opened until we've found the data storage object in the cache associated
> with that file, or it has created one.
'or it has created one' is a completely artificial bottleneck that _you_
are imposing.
> However, several user processes may be accessing several different files on
> several different mounts of maybe several different netfs's all at the same
> time. This could mean that on behalf of each of these processes, CacheFiles
> has to create several directories and a file.
>
> But it may only have one thread in which to do this, and so each "cache lookup"
> will wind up being serialised around the synchronous ext3 operations that
> create files and directories and set xattrs. Consider:
>
> SERVICE PROC A PROC B PROC C PROC D
> ======= ======= ======= ======= =======
> open() open() open() open()
> A:lookup()
> A:mkdir()
> A:setxattr()
> A:mkdir()
> A:setxattr()
> A:mkdir()
> A:setxattr()
> A:mkdir()
> A:setxattr()
> A:mkdir()
> A:setxattr()
> A:create()
> A:setxattr()
> read()
> B:lookup()
> B:mkdir()
> B:setxattr()
> B:mkdir()
> B:setxattr()
> B:mkdir()
> B:setxattr()
> B:mkdir()
> B:setxattr()
> B:mkdir()
> B:setxattr()
> B:create()
> B:setxattr()
> read()
> C:lookup()
> C:mkdir()
> C:setxattr()
> C:mkdir()
> C:setxattr()
> C:mkdir()
> C:setxattr()
> C:mkdir()
> C:setxattr()
> C:mkdir()
> C:setxattr()
> C:create()
> C:setxattr()
> read()
> D:lookup()
> D:mkdir()
> D:setxattr()
> D:mkdir()
> D:setxattr()
> D:mkdir()
> D:setxattr()
> D:mkdir()
> D:setxattr()
> D:mkdir()
> D:setxattr()
> D:create()
> D:setxattr()
> read()
>
> Which is what you're suggesting. Whereas what I have at the moment is this:
No. Why the hell should read() have to wait on what it knows to be an
empty file? As soon as that lookup() is done you can figure out that the
data is not cached, and that the read() needs to go talk to the NFS
server.
There is no reason whatsoever to wait for 50 synchronous mkdir()s and
setxattr()s to complete. Those operations can and should be farmed out
to a background process to take care of.
The ideal model looks like
PROC A PROC B PROC C PROC D
=============== =============== =============== ===============
open() open() open() open()
A:lookup() B:lookup() C:lookup() D:lookup()
read() read() read() read()
irrespective of whether or not fscache has a stored file.
Trond Myklebust <[email protected]> wrote:
> > > No. All you should need is the result of the lookup().
> >
> > Plus the mkdirs, etc., but I'll assume you implied those.
>
> No! The success or failure of the NFS operation does _not_ depend on
> CacheFiles success or failure.
I have not said that it does, and in fact I've made sure that it does not.
> As far as I'm concerned, the mkdirs etc actually _want_ to be farmed out to
> some background daemon so that they have no effect on NFS performance. See
> below.
In effect they will. Yes, ideally everything would be done in the background
on a cache miss, but it's not as simple as you make it out to be.
Your assertion is that I should deal with the security issue by doing
everything in an auxiliary thread - but that means doing the lookups there too.
> In the event of a mkdir() or create(), asynchronicity _does_ make a
> difference, because there is no cached file to read.
But it does make a difference: the file has to exist before the first
readpages(), unless you want to special-case the first readpages(), in which
case it has to exist before the second, because the second readpage() or
readpages() might require re-reading pages the first readpages() obtained (the
VM may have discarded them under memory pressure).
> > I'm hesitant about allocating memory to hold a list of pages in that
> > context as it might fail.
Though I might be able to make use of the spent nfs_page struct.
> All that you want to _try_ to ensure is that the page gets written out
> before the memory shrinker kicks it out of the page cache. There is
> absolutely no need to do this synchronously. There is not even a need to
> guarantee that all read pages will hit fscache.
Hah! I've already got bug reports because it appears that not all pages are
hitting the cache.
Generally, I would agree though. I would say that unless there is a lot of
memory pressure, all pages should wind up in the cache. If there is memory
pressure, then it may be okay to abort the writes to the cache. This might
have to be tunable.
> Again, this is something that a background daemon could happily do for
> you.
Again, generally I'd agree, though I'm not sure one thread is sufficient.
Perhaps one per CPU. With CacheFiles as it stands at the moment, that thread
is going to spend a lot of time copying data between pages, and so recycling of
the pagecache is going to be affected.
Unless you can suggest a way of reducing the queue of pages waiting on writing
to the cache under memory pressure (mainly how to work out that there is memory
pressure).
> > > All of them are synchronous as far as accessing the remote filesystem is
> > > concerned.
> >
> > What has "accessing the remote filesystem" got to do with it?
> >
> > > Why would the user process care if a privileged daemon has completed the
> > > shadow mkdir() or create() on the CacheFiles system or not?
>
> It doesn't, and that is my point. If the fscache backing store doesn't
> have the file, then the user process wants to talk to the remote server
> ASAP. It doesn't want to waste its time creating CacheFiles.
Erm... I think you're arguing with your own question there.
Note that NFS has to talk to the server *first*. We want to know the coherency
details at the time we look the file up, so that if it exists, we can decide
whether we want to keep it or not.
Furthermore, we need the data storage file available to store pages in as we
read them. We can farm everything out to other threads, but the total
processing time is increased, and it requires more resources and more
complexity.
> > The user process may not do a read() or an mmap() on a read-only file it
> > has just opened until we've found the data storage object in the cache
> > associated with that file, or it has created one.
>
> 'or it has created one' is a completely artificial bottleneck that _you_
> are imposing.
Not really. Unless readpages() special-cases the first invocation on an inode
we know we didn't previously have in the cache, this bottleneck stands. The
second and subsequent readpages() invocations and any readpage() invocation
must assume that there may be pages in the cache already.
Yes, I agree that it'd be very nice to be completely asynchronous with respect
to the cache, but it makes various things much more complex and more resource
intensive as we have to (a) farm things off to other threads, and (b) keep
track of a whole load of extra things.
> > Which is what you're suggesting. Whereas what I have at the moment is this:
>
> No. Why the hell should read() have to wait on what it knows to be an
> empty file? As soon as that lookup() is done you can figure out that the
> data is not cached, and that the read() needs to go talk to the NFS
> server.
In your world, lookup() must also be farmed off to that other thread.
> There is no reason whatsoever to wait for 50 synchronous mkdir()s and
> setxattr()s to complete. Those operations can and should be farmed out
> to a background process to take care of.
> The ideal model looks like
>
>
> PROC A PROC B PROC C PROC D
> =============== =============== =============== ===============
> open() open() open() open()
> A:lookup() B:lookup() C:lookup() D:lookup()
> read() read() read() read()
Nope. You aren't permitted to do that. You've defined your security model as
being context-switch based, so you have to do your lookups in the service
thread. Assuming you're going to get a hit on all your lookups, what you end
up with is:
SERVICE PROC A PROC B PROC C PROC D
======= ======= ======= ======= =======
open() open() open() open()
A:lookup()
A:getxattr()
A:lookup()
A:getxattr()
A:lookup()
A:getxattr()
A:lookup()
A:getxattr()
A:lookup()
A:getxattr()
A:open()
read()
B:lookup()
B:getxattr()
B:lookup()
B:getxattr()
B:lookup()
B:getxattr()
B:lookup()
B:getxattr()
B:lookup()
B:getxattr()
B:open()
read()
C:lookup()
C:getxattr()
C:lookup()
C:getxattr()
C:lookup()
C:getxattr()
C:lookup()
C:getxattr()
C:lookup()
C:getxattr()
C:open()
read()
D:lookup()
D:getxattr()
D:lookup()
D:getxattr()
D:lookup()
D:getxattr()
D:lookup()
D:getxattr()
D:lookup()
D:getxattr()
D:open()
read()
Remember: lookup() and getxattr() are, by their very nature, synchronous, and
so you end up with serialisation.
David