2002-10-04 16:51:09

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] AFS filesystem for Linux (2/2)


Hi Jan,

> We automatically resolve directory conflict, file conflicts are 'punted' to
> userspace. There is a manual repair tool, where a user can pick the version
> that he believes is the most appropriate, or he can substitute a replacement
> or merged copy.
>
> There is also a framework for application specific resolvers, these can try
> to automatically merge f.i. calendar files, or email boxes, or
> remove/recompile object files when there is a conflict, etc.

I see. So users have to know how to resolve conflicts themselves...

> For AFS you don't have to care, it's semantics without locking are that
> the 'last writer to close it's fd' wins.

Is that a requirement for AFS? I'm looking at trying to pass writes to the
server as soon as possible.

> I thought you were implementing AFS, how can you choose your model?

Where there's leeway, I can choose.

> How about dirty data, that was locally modified?

In some ways, what I'd ideally like to do is have write() contact the server
and actually send the data direct, then assuming that wasn't rejected, update
the local cache. Any bits of the write that don't overlap pages that are
cached _can_ just be discarded once written to the server, as I can always
retrieve them from the server later, complete with their surrounding data.

Alternatively, I can use prepare-write to make sure I hold copies of all the
pages I want to modify, and then use commit-write to write them back. The
problem with doing that is that I potentially have bits either side that may
end up reverting a write made to the server by someone else.

The third option - using writepage() - has the same problems as the second.

However, I could get around this by doing a binary "diff" of the page in the
local cache against the modified page in memory and just send the changes to
the server. Or, I could then retrieve a fresh copy of the page and apply the
differences to that. The problem there is that I can't detect changes that
have made no difference.

> There is a network latency that comes into play here. If you want last
> writer wins you should not zap dirty data.

I have to zap pages that some other client has modified. The server has told
me to invalidate it, and so I need to reread it from the server.

This is probably only going to be a problem with shared-writable mappings, and
the problem is due to AFS, not how I do my caching.

> So other clients will read inconsistent data if they don't have that
> chunk cached. And if the server generates callbacks on the write of a
> chunk, all clients see the update?

Not with prepare-/commit-write if commit-write is synchronous.

In any case, if they don't have that chunk cached they'll have to fetch it
from the server.

> Ehh, access permissions should already be checked when the file is
> opened. NFS is dealing with the same problems here.

AFS fileservers don't have a concept of an open file. Every time you do a
fetch or a store (be for it data or metadata) you get a short-term "lease" on
the file that the server breaks if some other client modifies it.

Furthermore, every operation a client issues is checked at the RxRPC
level. This way a client can have several users holding a file open in the
local VFS and just label the operations sent to the server with tokens
representing the users doing the access.

How does Coda do it? Does it get "open" the file on the server with a
particular set of credentials for each user that wants to use it?

> > How does Coda deal with this security problem?
>
> What problem?

For instance: you have one file in your cache, several people write to it, but
one of the people in the middle had permission taken away as far as the server
is concerned.

Also, when writing the file back, if several people have changed it, whose
credentials do you present to the server as having made the change?

> If you don't have write permission you can't write to the file. If you do
> have write permissions you can write. When the last writer closes it's
> filehandle the data is sent to the server with his permissions.

What if someone changes the permissions on the server's copy of the file
whilst you have the file open?

> Now if the ACL is changed on the servers before the close so that this last
> writer loses write access we get a 'conflict' that is punted to userspace,
> similar to the case when writing to an already updated file.

This is what I was getting at. The main issue is how do you deal with a file
in the cache that has been written to by several people, some of whom don't
have write permission any more.

> So I store it in VM on the disk and you store it in your private FS on
> disk. Looks like the same thing to me, except I get to enjoy the benefits of
> the page/swapcache keeping the data in memory when it is frequently used
> without having to do anything smart.

And I get to enjoy the benefits of the pagecache keeping the data in memory
when it is frequently used, and I can also do smart stuff and try to optimise
access.

> Virtual Memory. A private mmap of a file.

I thought you might have meant a "Volume Manager" rather than the kernel's VM
subsystem.

> Updates dirty the pages, so they end up in swap, but we log the same update
> in a logfile (journal), when the log fills up, the changes written directly
> to the underlying file.

So you keep everything in triplicate whilst running? (or at least duplicate
plus a frequently emptied journal).

> Optionally, when we know that the swap copy and the file copy of the page
> are identical, the page is remapped to reduce swap usage. That is where we
> store all the metadata and the Coda file -> cache file mappings so that they
> can survive a reboot or system crash.

I see... you can have dirty data residing in the cache over a reboot, and thus
you can't just nuke your cache partially or wholly, and you have to keep
careful account of all the entries in there... Hmmm... I'll have to think
about that, but I don't think it'll be a problem, provided I don't let close()
complete until all the dirty data from a file is written to the server.

David


2002-10-04 17:30:55

by Jan Harkes

[permalink] [raw]
Subject: Re: [PATCH] AFS filesystem for Linux (2/2)

On Fri, Oct 04, 2002 at 05:56:36PM +0100, David Howells wrote:
> > For AFS you don't have to care, it's semantics without locking are that
> > the 'last writer to close it's fd' wins.
>
> Is that a requirement for AFS? I'm looking at trying to pass writes to the
> server as soon as possible.

Last time I checked, yes. Callbacks should be sent only when the file is
closed.

> > There is a network latency that comes into play here. If you want last
> > writer wins you should not zap dirty data.
>
> I have to zap pages that some other client has modified. The server has told
> me to invalidate it, and so I need to reread it from the server.
>
> This is probably only going to be a problem with shared-writable mappings, and
> the problem is due to AFS, not how I do my caching.

But local modification are per definition more up-to-date because you
haven't closed the file yet. It is very much a problem of how you do
your caching. I'm trying to tell you AFS has a well defined consistency
model called session semantics. That directly influenzes every decision
you make wrt. to caching and everything I've seen up until now shows me
that you're simply implementing an NFS client that happens to speak to
AFS servers.

> > Ehh, access permissions should already be checked when the file is
> > opened. NFS is dealing with the same problems here.
>
> AFS fileservers don't have a concept of an open file. Every time you do a
> fetch or a store (be for it data or metadata) you get a short-term "lease" on
> the file that the server breaks if some other client modifies it.

Actually AFS is most definitely a stateful filesystem and as such has
the concept of an open file.

Just google for "AFS NFS stateful stateless" and you get literally
hundreds of handouts and lecture notes from distributes systems classes
that describe exactly these differences. UNIX/session semantics,
stateless/stateful, etc.

> Furthermore, every operation a client issues is checked at the RxRPC
> level. This way a client can have several users holding a file open in the
> local VFS and just label the operations sent to the server with tokens
> representing the users doing the access.
>
> How does Coda do it? Does it get "open" the file on the server with a
> particular set of credentials for each user that wants to use it?

Yes. If the user hasn't opened the file before, we have to check the
permissions with the server. So every user that opens the same file will
generate an "open" rpc to the server.

> > > How does Coda deal with this security problem?
> >
> > What problem?
>
> For instance: you have one file in your cache, several people write to it, but
> one of the people in the middle had permission taken away as far as the server
> is concerned.

First of all, typical write-write sharing is extremely low, otherwise
Coda wouldn't have even worked with it's optimistic replication and
caching. Second of all permission changes are even rarer.

So on the off chance that a write-write shared file on happens to be
affected by a permission change.

- We either succeed in writing the file when the person who lost access
closed before the last writer who still had access. Big deal, he had
access when he opened the file, and it is next to impossible to tell
wheter his close arrived before or after the permission was taken away
except when every operation is somehow given an absolute linear
ordering in time. Generally too expensive on a distributed system
with clients that may come and go as they please.

- Or the person who lost access is the last one to close the file and
trigger the writeback. In this case the servers will deny the write
operation and the client will declare a conflict on that object and
switch to disconnected mode. Someone who has write permission can
repair the conflict.

> Also, when writing the file back, if several people have changed it, whose
> credentials do you present to the server as having made the change?
>
> > If you don't have write permission you can't write to the file. If you do
> > have write permissions you can write. When the last writer closes it's
> > filehandle the data is sent to the server with his permissions.

Like I said in the section you quoted yourself, the last writer who
closes the file.

> > Updates dirty the pages, so they end up in swap, but we log the same update
> > in a logfile (journal), when the log fills up, the changes written directly
> > to the underlying file.
>
> So you keep everything in triplicate whilst running? (or at least duplicate
> plus a frequently emptied journal).

Actually, for any unmodified data there is only one copy. Once it is
modified there are two (the dirty page and the logged change), why would
there be a third, I don't care about the copy in the backing file...

It has the advantage that even if you pull the powercord, we can still
recover the last consistent set of metadata and from that validate both
the data in the cache files, and revalidate everything against the
servers, which we already need for the disconnected operation.

Jan