From: David Howells
To: Linus Torvalds
Cc: David Howells, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] AFS filesystem for Linux (2/2)
Date: Thu, 03 Oct 2002 10:05:39 +0100
Message-ID: <13691.1033635939@warthog.cambridge.redhat.com>
In-Reply-To: Message from Linus Torvalds of "Wed, 02 Oct 2002 17:36:06 PDT."

Linus Torvalds wrote:

> On Wed, 2 Oct 2002, David Howells wrote:
> >
> > This patch adds an Andrew File System (AFS) driver to the
> > kernel. Currently it only provides read-only, uncached, non-automounted
> > and unsecured support.
>
> Are you sure this is the right way to go?

I think so. I think it makes sense for the AFS VFS interface to go as
directly as possible to the network without having to make context switches
to get into userspace.

> As far as I can tell, this is a dead end, because we fundamentally cannot
> do the local backing store from the kernel.

I disagree. I think we can (besides which OpenAFS does so), and most of it
is probably easier to do here than in userspace. For example:

 (*) Readpage

     The filesystem can either use a BIO to read directly into a new page if
     the page is already in the cache, or it can read from across the
     network and then use a BIO to write the updated page into the cache.
     This should avoid as much page-aliasing as possible. (A rough sketch of
     this path is appended below.)

 (*) Writepage

     The filesystem can use a BIO to write a page into the cache, whilst
     simultaneously dispatching it across the network. Alternatively, it can
     write the page to the cache (fast) immediately and queue it up to be
     sent across the network (slow) from the cache if memory pressure is
     high.

 (*) Index searching

     The cache needs to keep an on-disc index if the contents are to survive
     rebooting. This can be searched more efficiently from within the
     kernel. I've written a scanning algorithm that can scan any file for
     fixed-length records in a manner that allows the disc blocks to be
     scanned in whatever order they come off the disc. This should make
     scanning the index files faster, and may not actually be possible in
     userspace.

 (*) On-disc file layout

     The on-disc file layout I'm using is the same as many other unix
     filesystems (direct, indirect and double-indirect block pointers) and
     is fairly simple. The biggest difference is that where I see a hole, I
     know I have to fetch the page from the server, rather than just
     assuming there's an implicit empty page there.

I'm fetching files a page at a time (on demand from the VM), though I may
extend this to get bunches of pages for efficiency reasons. What I'm not
going to do is fetch each file into the cache and maintain it there in its
entirety from the moment it is opened to the moment it is closed. This has
two definite disadvantages: you can't open a file bigger than the remaining
space in the cache, and the size of the cache and the sizes of the files
opened limit the number of files you can have open.
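To make the readpage path mentioned above a bit more concrete, here's a
rough sketch of the sort of thing I have in mind. All of the cache helpers
named here (afs_cache_lookup_block and friends) are purely illustrative -
they are not the functions in the patch:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Illustrative cache helpers - not the names used in the patch. */
    extern int afs_cache_lookup_block(struct inode *inode, unsigned long index,
                                      sector_t *block);
    extern int afs_cache_read_page(struct inode *inode, sector_t block,
                                   struct page *page);
    extern int afs_cache_write_page(struct inode *inode, struct page *page);
    extern int afs_fetch_page_from_server(struct inode *inode,
                                          struct page *page);

    /*
     * Rough sketch of the readpage path: read from the cache with a BIO if
     * the page is there, otherwise fetch it from the server and then write
     * it into the cache with a BIO.
     */
    static int afs_file_readpage(struct file *file, struct page *page)
    {
            struct inode *inode = page->mapping->host;
            sector_t block;
            int ret;

            /* consult the on-disc index to see if this page is cached */
            if (afs_cache_lookup_block(inode, page->index, &block) == 0) {
                    /* cache hit: BIO read straight into the new page */
                    ret = afs_cache_read_page(inode, block, page);
                    goto out;
            }

            /* cache miss (a hole in the cache file): go to the network */
            ret = afs_fetch_page_from_server(inode, page);
            if (ret < 0)
                    goto out;

            /* write the now up-to-date page into the cache with a BIO */
            afs_cache_write_page(inode, page);

    out:
            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }

The writepage path would be roughly the mirror image: a BIO write into the
cache, plus either an immediate or a deferred transmission to the server.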
Currently my plans are not to support disconnected operation (as there are
likely to be holes in the files cached). This means that I don't need to
cache security information on disc, since I can "retrieve" it from the
server upon opening a file anyway.

One thing I'm currently undecided on is whether the security tokens for a
user should be attached to the struct file * being used, or whether they
should be retrieved from a list attached to the current process in some way.

> From my (nonexistent) understanding of how AFS works, would it not be a
> whole lot more sensible to implement it as a coda client or something like
> that (with the networking support in-kernel, but with the caching logic
> etc in user space).

See above.

As has been suggested to me, it may be possible to offload just the space
reclamation algorithm to userspace. It may also be possible to put the index
maintainer in userspace, and have afs_iget() call it to search for and add
records. "Callbacks" from the server (which indicate that a file has changed
in some way) could also be passed to the index maintainer so that it can
release all the cache blocks for the changed file.

However, the biggest problem with splitting the caching like this is that
there then has to be some sort of locking between kernel space and userspace
to govern access to the allocation bitmaps (or whatever). Arjan van de Ven's
suggestion is that all the cached data files should be exposed as files in
the cachefs, which would then have an unlink method available so that the
userspace daemon can tell the kernel-side cache manager to reclaim a
particular file.

> I dunno, I just get the feeling that a good AFS client simply cannot be
> done entirely in kernel space, and if you start off like this, you'll
> never get where you really want to go. Pls comment on this (and yeah, the
> comment can be a "Boy, you're really a stupid git, and here's why: xyz",
> but I really want the "xyz" part too ;)

I think it can (and should) be done in the kernel, at least for the most
part - there are auxiliary userspace tools that consult the server directly.
See the reasons given above.

> Now, admittedly maybe the user-space deamon approach is crap, and what we
> really want is to have some way to cache network stuff on the disk
> directly from the kernel, ie just implement a real mapping/page-indexed
> cachefs that people could mount and use together with different network
> filesystems.

Hmmm... Interesting idea. There is the problem of working out which files
belong to which source. The AFS filesystem has a three-tier approach to
identifying the source of a file: {cell,volume,vnode}, where any given
volume can be on more than one server in a particular cell, and a vnode is
the equivalent of an inode. I suppose NFS, say, could be handled similarly
as {server,export,inode}, and SMB would probably be {server,share,filename}.

The biggest hurdle here is the difference in potential record lengths :-/

                        CELL RECORD CONSISTS OF
        AFS   64 + 16*4   name + 16 volume location servers
        NFS   4           IPv4 address
        SMB   ?           server name (maybe just IP address)

                        VLDB RECORD CONSISTS OF
        AFS   64 + 64     volume name, numbers and server info
        NFS   4096?       export path length
        SMB   ?           share name

                        VNODE INDEX CONSISTS OF
        AFS   4 + 4       vnode ID number, vnode ID version
        NFS   8           inode number
        SMB   4096?       full file name within share
     or SMB   4 + 256     cache dir index and filename
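Purely as an illustration of the sort of fixed-size records this implies for
the AFS case (the structure and field names are made up for the sake of
example, sized roughly as in the table above; they are not an actual on-disc
format):

    #include <linux/types.h>

    /* Illustrative only: roughly table-sized records for the AFS case. */
    struct cachefs_cell_record {
            char    name[64];               /* cell name */
            u32     vl_servers[16];         /* up to 16 volume location servers */
    };

    struct cachefs_vldb_record {
            char    volume_name[64];        /* volume name */
            u8      volume_info[64];        /* volume numbers and server info */
    };

    struct cachefs_vnode_index_entry {
            u32     vnode;                  /* vnode ID number */
            u32     vnode_version;          /* vnode ID version */
            u32     metadata_slot;          /* where the cached data is described */
    };

The NFS and SMB equivalents in the table clearly wouldn't fit into records
this small, which is exactly the problem.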
To have a heterogeneous cache, the VLDB and vnode index records could be
extended to 2K or 4K in size, or maybe separate catalogues and indices could
be maintained for the different filesystem types, and a 0th tier could be a
catalogue of the different types held within this cache, complete with
information as to the entry sizes of the tier 1, 2 and 3 catalogues.

David


THE CACHE LAYOUT
================

The local cache will be structured as a set of files:

 (1) Meta-data file.

     Contains meta-data records (~= inodes) for every "file" in the cache
     (including itself). Each meta-data record contains a set of direct
     block pointers, an indirect block pointer and a double-indirect block
     pointer by which the data to which it points can be located on disc.
     (An illustrative sketch of such a record is given at the end of this
     description.)

     There will always be enough direct pointers to refer to all the blocks
     in a directory directly.

 (2) Cell cache catalogue.

     Any cell for which we have data cached will be recorded in this file.

 (3) Volume location catalogue.

     Any volume for which we have data cached will be recorded in this
     file. Each VL entry points to the cell record to which it belongs.

 (4) A set of indexes (hash-table type thing) with fixed-size records, as
     small as I can make them, that cross-reference a volume location and a
     vnode number with an entry in the meta-data file that describes where
     to find the cached data on disc. Index entries are also time-stamped to
     show the last time they were accessed.

 (5) The cached vnodes (files/dirs/symlinks) themselves.

     Note that in the case of a cached file, any hole in the file _actually_
     represents a page not yet fetched from the server.

There is also a bitmap to indicate which blocks are currently allocated.

I wasn't planning on storing user accessibility data in the cache itself
(though I can always change my mind later), because this differs from user
to user and is subject to change without notice from one attempt to read to
another, due to the lack of notifications from the protection server when
ACLs change. This sort of thing will have to be stored in each "struct
file".
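As an illustration of the sort of meta-data record described in (1), and of
how a hole translates into "fetch from the server", something along these
lines (again, the names, sizes and constants are illustrative only, not an
actual on-disc format):

    #include <linux/types.h>

    #define CACHEFS_NR_DIRECT 16            /* illustrative figure only */

    /*
     * Illustrative meta-data record: direct, indirect and double-indirect
     * block pointers in the traditional unix filesystem style.  A zero
     * pointer is a hole; for a cached vnode a hole means "not yet fetched
     * from the server", not "implicitly a page of zeros".
     */
    struct cachefs_metadata_record {
            u32     direct[CACHEFS_NR_DIRECT];  /* direct block pointers */
            u32     indirect;                   /* indirect block pointer */
            u32     double_indirect;            /* double-indirect block pointer */
            u32     size;                       /* size of the cached object */
    };

    /*
     * Map a page index onto a cache block (direct pointers only, for
     * brevity).  A return value of 0 means "hole": the page has to be
     * fetched from the server and then written into the cache.
     */
    static u32 cachefs_block_for_page(const struct cachefs_metadata_record *meta,
                                      unsigned long index)
    {
            if (index < CACHEFS_NR_DIRECT)
                    return meta->direct[index];

            /* indirect and double-indirect lookups would need the pointer
             * blocks to be read from the cache first; omitted here */
            return 0;
    }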