To: David Howells <dhowells@cambridge.redhat.com>,
       linux-kernel@vger.kernel.org
Subject: Re: [PATCH] AFS filesystem for Linux (2/2) 
In-Reply-To: Message from Jan Harkes <jaharkes@cs.cmu.edu> 
   of "Fri, 04 Oct 2002 12:07:08 EDT." <20021004160708.GA14737@ravel.coda.cs.cmu.edu> 
User-Agent: EMH/1.14.1 SEMI/1.14.3 (Ushinoya) FLIM/1.14.3
 (=?ISO-8859-4?Q?Unebigory=F2mae?=) APEL/10.3 Emacs/21.2 (i686-pc-linux-gnu)
 MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
Date: Fri, 04 Oct 2002 17:56:36 +0100
Message-ID: <27504.1033750596@warthog.cambridge.redhat.com>
From: David Howells <dhowells@cambridge.redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6351
Lines: 142


Hi Jan,

> We automatically resolve directory conflict, file conflicts are 'punted' to
> userspace. There is a manual repair tool, where a user can pick the version
> that he believes is the most appropriate, or he can substitute a replacement
> or merged copy.
>
> There is also a framework for application specific resolvers, these can try
> to automatically merge f.i. calendar files, or email boxes, or
> remove/recompile object files when there is a conflict, etc.

I see. So users have to know how to resolve conflicts themselves...

> For AFS you don't have to care, it's semantics without locking are that
> the 'last writer to close it's fd' wins.

Is that a requirement for AFS? I'm looking at trying to pass writes to the
server as soon as possible.

> I thought you were implementing AFS, how can you choose your model?

Where there's leeway, I can choose.

> How about dirty data, that was locally modified?

In some ways, what I'd ideally like to do is have write() contact the server
and actually send the data direct, then assuming that wasn't rejected, update
the local cache. Any bits of the write that don't overlap pages that are
cached _can_ just be discarded once written to the server, as I can always
retrieve them from the server later, complete with their surrounding data.

Alternatively, I can use prepare-write to make sure I hold copies of all the
pages I want to modify, and then use commit-write to write them back. The
problem with doing that is that I potentially have bits either side that may
end up reverting a write made to the server by someone else.

The third option - using writepage() - has the same problems as the second.

However, I could get around this by doing a binary "diff" of the page in the
local cache against the modified page in memory and just send the changes to
the server. Or, I could then retrieve a fresh copy of the page and apply the
differences to that. The problem there is that I can't detect changes that
have made no difference.

> There is a network latency that comes into play here. If you want last
> writer wins you should not zap dirty data.

I have to zap pages that some other client has modified. The server has told
me to invalidate it, and so I need to reread it from the server.

This is probably only going to be a problem with shared-writable mappings, and
the problem is due to AFS, not how I do my caching.

> So other clients will read inconsistent data if they don't have that
> chunk cached. And if the server generates callbacks on the write of a
> chunk, all clients see the update?

Not with prepare-/commit-write if commit-write is synchronous.

In any case, if they don't have that chunk cached they'll have to fetch it
from the server.

> Ehh, access permissions should already be checked when the file is
> opened. NFS is dealing with the same problems here.

AFS fileservers don't have a concept of an open file. Every time you do a
fetch or a store (be for it data or metadata) you get a short-term "lease" on
the file that the server breaks if some other client modifies it.

Furthermore, every operation a client issues is checked at the RxRPC
level. This way a client can have several users holding a file open in the
local VFS and just label the operations sent to the server with tokens
representing the users doing the access.

How does Coda do it? Does it get "open" the file on the server with a
particular set of credentials for each user that wants to use it?

> >      How does Coda deal with this security problem?
> 
> What problem?

For instance: you have one file in your cache, several people write to it, but
one of the people in the middle had permission taken away as far as the server
is concerned.

Also, when writing the file back, if several people have changed it, whose
credentials do you present to the server as having made the change?

> If you don't have write permission you can't write to the file. If you do
> have write permissions you can write. When the last writer closes it's
> filehandle the data is sent to the server with his permissions.

What if someone changes the permissions on the server's copy of the file
whilst you have the file open?

> Now if the ACL is changed on the servers before the close so that this last
> writer loses write access we get a 'conflict' that is punted to userspace,
> similar to the case when writing to an already updated file.

This is what I was getting at. The main issue is how do you deal with a file
in the cache that has been written to by several people, some of whom don't
have write permission any more.

> So I store it in VM on the disk and you store it in your private FS on
> disk. Looks like the same thing to me, except I get to enjoy the benefits of
> the page/swapcache keeping the data in memory when it is frequently used
> without having to do anything smart.

And I get to enjoy the benefits of the pagecache keeping the data in memory
when it is frequently used, and I can also do smart stuff and try to optimise
access.

> Virtual Memory. A private mmap of a file.

I thought you might have meant a "Volume Manager" rather than the kernel's VM
subsystem.

> Updates dirty the pages, so they end up in swap, but we log the same update
> in a logfile (journal), when the log fills up, the changes written directly
> to the underlying file.

So you keep everything in triplicate whilst running? (or at least duplicate
plus a frequently emptied journal).

> Optionally, when we know that the swap copy and the file copy of the page
> are identical, the page is remapped to reduce swap usage. That is where we
> store all the metadata and the Coda file -> cache file mappings so that they
> can survive a reboot or system crash.

I see... you can have dirty data residing in the cache over a reboot, and thus
you can't just nuke your cache partially or wholly, and you have to keep
careful account of all the entries in there... Hmmm... I'll have to think
about that, but I don't think it'll be a problem, provided I don't let close()
complete until all the dirty data from a file is written to the server.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/