From: David Howells
To: Linus Torvalds
Cc: David Howells, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] AFS filesystem for Linux (2/2)
Date: Thu, 03 Oct 2002 10:05:39 +0100
Message-ID: <13691.1033635939@warthog.cambridge.redhat.com>
In-Reply-To: Message from Linus Torvalds of "Wed, 02 Oct 2002 17:36:06 PDT."

Linus Torvalds wrote:

> On Wed, 2 Oct 2002, David Howells wrote:
> >
> > This patch adds an Andrew File System (AFS) driver to the
> > kernel. Currently it only provides read-only, uncached, non-automounted
> > and unsecured support.
>
> Are you sure this is the right way to go?

I think so. I think it makes sense for the AFS VFS interface to go as
directly as possible to the network without having to make context switches
to get into userspace.

> As far as I can tell, this is a dead end, because we fundamentally cannot
> do the local backing store from the kernel.

I disagree. I think we can (besides which OpenAFS does so), and most of it
is probably easier to do here than in userspace. For example:

 (*) Readpage

     The filesystem can either use a BIO to read directly into a new page if
     the page is already in the cache, or it can read from across the
     network and then use a BIO to write the updated page into the cache.
     This should avoid as much page-aliasing as possible. (A rough sketch of
     this path is appended below.)

 (*) Writepage

     The filesystem can use a BIO to write a page into the cache, whilst
     simultaneously dispatching it across the network. Alternatively, it can
     write the page to the cache (fast) immediately and queue it up to be
     sent across the network (slow) from the cache if memory pressure is
     high.

 (*) Index searching

     The cache needs to keep an on-disc index if the contents are to survive
     rebooting. This can be searched more efficiently from within the
     kernel. I've written a scanning algorithm that can scan any file for
     fixed-length records in a manner that allows the disc blocks to be
     scanned in whatever order they come off the disc. This should make
     scanning the index files faster, and may not actually be possible in
     userspace.

 (*) On-disc file layout

     The on-disc file layout I'm using is the same as many other unix
     filesystems (direct, indirect and double-indirect block pointers) and
     is fairly simple. The biggest difference is that where I see a hole, I
     know I have to fetch the page from the server, rather than just
     assuming there's an implicit empty page there.

I'm fetching files a page at a time (on demand from the VM), though I may
extend this to get bunches of pages for efficiency reasons. What I'm not
going to do is fetch each file into the cache and maintain it there in its
entirety from the moment it is opened to the moment it is closed. This has
two definite disadvantages: you can't open a file bigger than the remaining
space in the cache, and the size of the cache and the sizes of the files
opened limit the number of files you can have open.
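To make the readpage path mentioned above a bit more concrete, here's a
rough sketch of the sort of thing I have in mind. All of the cache helpers
named here (afs_cache_lookup_block and friends) are purely illustrative -
they are not the functions in the patch:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Illustrative cache helpers - not the names used in the patch. */
    extern int afs_cache_lookup_block(struct inode *inode, unsigned long index,
                                      sector_t *block);
    extern int afs_cache_read_page(struct inode *inode, sector_t block,
                                   struct page *page);
    extern int afs_cache_write_page(struct inode *inode, struct page *page);
    extern int afs_fetch_page_from_server(struct inode *inode,
                                          struct page *page);

    /*
     * Rough sketch of the readpage path: read from the cache with a BIO if
     * the page is there, otherwise fetch it from the server and then write
     * it into the cache with a BIO.
     */
    static int afs_file_readpage(struct file *file, struct page *page)
    {
            struct inode *inode = page->mapping->host;
            sector_t block;
            int ret;

            /* consult the on-disc index to see if this page is cached */
            if (afs_cache_lookup_block(inode, page->index, &block) == 0) {
                    /* cache hit: BIO read straight into the new page */
                    ret = afs_cache_read_page(inode, block, page);
                    goto out;
            }

            /* cache miss (a hole in the cache file): go to the network */
            ret = afs_fetch_page_from_server(inode, page);
            if (ret < 0)
                    goto out;

            /* write the now up-to-date page into the cache with a BIO */
            afs_cache_write_page(inode, page);

    out:
            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }

The writepage path would be roughly the mirror image: a BIO write into the
cache, plus either an immediate or a deferred transmission to the server.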
Currently my plans are not to support disconnected operation (as there are
likely to be holes in the files cached). This means that I don't need to
cache security information on disc, since I can "retrieve" it from the
server upon opening a file anyway.

One thing I'm currently undecided on is whether the security tokens for a
user should be attached to the struct file * being used, or whether they
should be retrieved from a list attached to the current process in some way.

> From my (nonexistent) understanding of how AFS works, would it not be a
> whole lot more sensible to implement it as a coda client or something like
> that (with the networking support in-kernel, but with the caching logic
> etc in user space).

See above.

As has been suggested to me, it may be possible to offload just the space
reclamation algorithm to userspace. It may also be possible to put the index
maintainer in userspace, and have afs_iget() call it to search for and add
records. "Callbacks" from the server (which indicate that a file has changed
in some way) could also be passed to the index maintainer so that it can
release all the cache blocks for the changed file.

However, the biggest problem with splitting the caching like this is that
there then has to be some sort of locking between kernel space and userspace
to govern access to the allocation bitmaps (or whatever). Arjan van de Ven's
suggestion is that all the cached data files should be exposed as files in
the cachefs, which would then have an unlink method available so that the
userspace daemon can tell the kernel-side cache manager to reclaim a
particular file.

> I dunno, I just get the feeling that a good AFS client simply cannot be
> done entirely in kernel space, and if you start off like this, you'll
> never get where you really want to go. Pls comment on this (and yeah, the
> comment can be a "Boy, you're really a stupid git, and here's why: xyz",
> but I really want the "xyz" part too ;)

I think it can (and should) be done in the kernel, at least for the most
part - there are auxiliary userspace tools that consult the server directly.
See the reasons given above.

> Now, admittedly maybe the user-space deamon approach is crap, and what we
> really want is to have some way to cache network stuff on the disk
> directly from the kernel, ie just implement a real mapping/page-indexed
> cachefs that people could mount and use together with different network
> filesystems.

Hmmm... Interesting idea. There is the problem of working out which files
belong to which source. The AFS filesystem has a three-tier approach to
identifying the source of a file: {cell,volume,vnode}, where any given
volume can be on more than one server in a particular cell, and a vnode is
the equivalent of an inode. I suppose NFS, say, could be handled similarly
as {server,export,inode}, and SMB would probably be {server,share,filename}.

The biggest hurdle here is the difference in potential record lengths :-/

                        CELL RECORD CONSISTS OF
        AFS   64 + 16*4   name + 16 volume location servers
        NFS   4           IPv4 address
        SMB   ?           server name (maybe just IP address)

                        VLDB RECORD CONSISTS OF
        AFS   64 + 64     volume name, numbers and server info
        NFS   4096?       export path length
        SMB   ?           share name

                        VNODE INDEX CONSISTS OF
        AFS   4 + 4       vnode ID number, vnode ID version
        NFS   8           inode number
        SMB   4096?       full file name within share
     or SMB   4 + 256     cache dir index and filename
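Purely as an illustration of the sort of fixed-size records this implies for
the AFS case (the structure and field names are made up for the sake of
example, sized roughly as in the table above; they are not an actual on-disc
format):

    #include <linux/types.h>

    /* Illustrative only: roughly table-sized records for the AFS case. */
    struct cachefs_cell_record {
            char    name[64];               /* cell name */
            u32     vl_servers[16];         /* up to 16 volume location servers */
    };

    struct cachefs_vldb_record {
            char    volume_name[64];        /* volume name */
            u8      volume_info[64];        /* volume numbers and server info */
    };

    struct cachefs_vnode_index_entry {
            u32     vnode;                  /* vnode ID number */
            u32     vnode_version;          /* vnode ID version */
            u32     metadata_slot;          /* where the cached data is described */
    };

The NFS and SMB equivalents in the table clearly wouldn't fit into records
this small, which is exactly the problem.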
To have a heterogeneous cache, the VLDB and vnode index records could be
extended to 2K or 4K in size, or maybe separate catalogues and indices could
be maintained for the different filesystem types, and a 0th tier could be a
catalogue of the different types held within this cache, complete with
information as to the entry sizes of the tier 1, 2 and 3 catalogues.

David


THE CACHE LAYOUT
================

The local cache will be structured as a set of files:

 (1) Meta-data file.

     Contains meta-data records (~= inodes) for every "file" in the cache
     (including itself). Each meta-data record contains a set of direct
     block pointers, an indirect block pointer and a double-indirect block
     pointer by which the data to which it points can be located on disc.
     (An illustrative sketch of such a record is given at the end of this
     description.)

     There will always be enough direct pointers to refer to all the blocks
     in a directory directly.

 (2) Cell cache catalogue.

     Any cell for which we have data cached will be recorded in this file.

 (3) Volume location catalogue.

     Any volume for which we have data cached will be recorded in this
     file. Each VL entry points to the cell record to which it belongs.

 (4) A set of indexes (hash-table type thing) with fixed-size records, as
     small as I can make them, that cross-reference a volume location and a
     vnode number with an entry in the meta-data file that describes where
     to find the cached data on disc. Index entries are also time-stamped to
     show the last time they were accessed.

 (5) The cached vnodes (files/dirs/symlinks) themselves.

     Note that in the case of a cached file, any hole in the file _actually_
     represents a page not yet fetched from the server.

There is also a bitmap to indicate which blocks are currently allocated.

I wasn't planning on storing user accessibility data in the cache itself
(though I can always change my mind later), because this differs from user
to user and is subject to change without notice from one attempt to read to
another, due to the lack of notifications from the protection server when
ACLs change. This sort of thing will have to be stored in each "struct
file".
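As an illustration of the sort of meta-data record described in (1), and of
how a hole translates into "fetch from the server", something along these
lines (again, the names, sizes and constants are illustrative only, not an
actual on-disc format):

    #include <linux/types.h>

    #define CACHEFS_NR_DIRECT 16            /* illustrative figure only */

    /*
     * Illustrative meta-data record: direct, indirect and double-indirect
     * block pointers in the traditional unix filesystem style.  A zero
     * pointer is a hole; for a cached vnode a hole means "not yet fetched
     * from the server", not "implicitly a page of zeros".
     */
    struct cachefs_metadata_record {
            u32     direct[CACHEFS_NR_DIRECT];  /* direct block pointers */
            u32     indirect;                   /* indirect block pointer */
            u32     double_indirect;            /* double-indirect block pointer */
            u32     size;                       /* size of the cached object */
    };

    /*
     * Map a page index onto a cache block (direct pointers only, for
     * brevity).  A return value of 0 means "hole": the page has to be
     * fetched from the server and then written into the cache.
     */
    static u32 cachefs_block_for_page(const struct cachefs_metadata_record *meta,
                                      unsigned long index)
    {
            if (index < CACHEFS_NR_DIRECT)
                    return meta->direct[index];

            /* indirect and double-indirect lookups would need the pointer
             * blocks to be read from the cache first; omitted here */
            return 0;
    }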