Subject: Re: Security issues with local filesystem caching
From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: David Howells <dhowells@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>, Karl MacMillan <kmacmill@redhat.com>,
       jmorris@namei.org, chrisw@sous-sol.org, selinux@tycho.nsa.gov,
       linux-kernel@vger.kernel.org, aviro@redhat.com
In-Reply-To: <11957.1163532168@redhat.com>
References: <1162575034.5629.24.camel@lade.trondhjem.org>
	 <1162561291.5635.64.camel@lade.trondhjem.org>
	 <1162502667.6071.66.camel@lade.trondhjem.org>
	 <1162496968.6071.38.camel@lade.trondhjem.org>
	 <1162402218.32614.230.camel@moss-spartans.epoch.ncsc.mil>
	 <1162387735.32614.184.camel@moss-spartans.epoch.ncsc.mil>
	 <16969.1161771256@redhat.com> <31035.1162330008@redhat.com>
	 <4417.1162395294@redhat.com> <25037.1162487801@redhat.com>
	 <32754.1162499917@redhat.com> <12984.1162549621@redhat.com>
	 <22908.1162567413@redhat.com>   <11957.1163532168@redhat.com>
Content-Type: text/plain
Date: Wed, 15 Nov 2006 09:05:42 -0500
Message-Id: <1163599542.5691.84.camel@lade.trondhjem.org>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5164
Lines: 110

On Tue, 2006-11-14 at 19:22 +0000, David Howells wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > > Avoiding context switches aren't the main problem; avoiding serialisation
> > > is.
> > 
> > Why? It is a backing cache. The only case where serialisation ought to
> > bother you is the case where the client has to invalidate the cache due
> > to a server-side update of the file.
> 
> Cache invalidation is not so much of a problem as at that point we know exactly
> where whatever it is that we're invalidating is, and if it's a big object we
> just move it somewhere for the userspace daemon to splat.
> 
> The serialisation problem is that if we put cache lookup in its own thread,
> then in effect every open[*] of an NFS file, AFS file, whatever, will be
> serialised.
>
> It almost certainly wouldn't matter if what we did was to asynchronously look
> up the cache cookie for a file.  In the common case, an open(O_RDONLY) syscall
> is followed almost immediately by a read(), so there's not much to be gained
> from asynchronising things as the cache cookie has to be available by the time
> we come to process the read, but we can't get the cache cookie before
> completing the server checks made by open, as we need the coherency data before
> attempting to acquire a cookie.

No. All you should need is the result of the lookup(). The coherency
data needs to be checked against an eventual existing CacheFiles entry
during the call to read(), not before.

> The serialisation would stem from having to do several synchronous filesystem
> ops for each cache lookup, but only having one thread in which to do them.
> Okay, I could have several worker threads, but why?  Each process attempting to
> access the cache provides me with a suitable worker thread, and then I can have
> as many as there are tasks on the system.

Umm...because the former is a model which actually fits your security
requirements (i.e. one privileged daemon gets lookup()+open()+mkdir()...
rights on the CacheFiles partition), whereas the latter is not (all
tasks need lookup()+open()+mkdir().... privileges)?

> [*] Note that for NFS I've now incorporated a patch from Steve Dickson to
> acquire file cookies on the NFS open() file op, rather than during iget because
> NFS readdir calls iget.
> 
> > Once the RPC calls have been launched, the process returns to the VM
> > layer and just waits for the next page to be unlocked. It never returns
> > to the filesystem layer. So where are you using the process context to
> > write out the cached data?
> 
> What do you mean by "write out the cached data"?  Do you mean write the data to
> the cache?
> 
> If so, that'd be nfs_readpage_to_fscache() as called from nfs_readpage_sync()
> or nfs_readpage_release().

nfs_readpage_release() is an rpciod context, _not_ a user thread
context.

> That calls fscache_write_page() which calls cachefiles_write_page() which calls
> generic_file_buffered_write_one_kernel_page().
> 
> That last copies the data into the pagecache attached to an ext3 inode to be
> written out (hopefully) asynchronously.
> 
> However, that may do other disk accesses, I suppose, as it calls
> prepare_write() and commit_write() on ext3.

Which would generally be forbidden under the rpciod context, BTW, since
they imply calls to generic memory allocation (== nasty tricksy deadlock
potential, since rpciod may be called upon to help write out NFS pages
via shrink_page_list and friends).

> I could try and make it asynchronous, but that means more overhead in other
> ways:-(  I presume this will then sometimes be running in rpciod context?
> 
> > The cookie lookups need to be synchronous, but why would the file
> > creation need to be synchronous? Creating the cachefs file and waiting
> > on that to complete etc are all utterly useless activities as far as
> > satisfying the user request for data goes. Just start the process of
> > creating a backing file, and then get on with the actual syscall.
> 
> vfs_mkdir() is synchronous.  vfs_create() is synchronous.  vfs_[sg]etxattr is
> synchronous.  Lookup is synchronous.

All of them are synchronous as far as accessing the remote filesystem is
concerned. Why would the user process care if a privileged daemon has
completed the shadow mkdir() or create() on the CacheFiles system or
not?

> Yes, I could make them all asynchronous, but it'd be a lot more work, and
> mostly unnecessary, and I'd probably have to fight down lots of objections.
> 
> 
> Remember: in the common case, open(O_RDONLY) is going to be followed quickly by
> a read().  I suppose there may be an intervening stat() and malloc(), but even
> so...

Which is why lookup() + open() + read() needs to be fast in the case
where you have a CacheFiles hit. It does not justify mkdir, create, etc
being fast, nor does it justify the open() + read() part needing to be
fast in the case of a CacheFiles miss.

  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/