2002-10-04 15:29:07

by David Howells

Subject: Re: [PATCH] AFS filesystem for Linux (2/2)


Hi Jan,

> We don't. Coda has the all or nothing fetch. When we get an open upcall and
> the file isn't cached we get the whole file, and return the filehandle we
> just used to fetch the file, that way even when pages haven't been flushed
> to disk yet, the kernel will see the same data. All reads and writes are
> wrapped in such a way that readpage and writepage directly access the cache
> file, when an mmap is active the Coda inode isn't even touched anymore.

So, say you've got a really big file in your cache. Someone changes a single
byte in it. You can work out that the file has changed by one of a number of
means. How do you avoid having to throw the entire file back to the server?

> Pretty much, but we need that extra space for the disconnected writes
> anyways, that way we can always roll back to a consistent version.

How do you sync up with the server on reconnection? This is similar to the
problem you pointed out that I have to solve - how to deal with the file on
the server being changed by another client whilst I'm also trying to write to
it.

> Ok, traditional AFS semantics is 'session semantics'. Very strict, whenever
> you read from a file, everything is consistent wrt. the time you opened the
> file. Whatever you write isn't committed on the server until you close the
> file. This model has great advantages such as minimizing network traffic,
> giving lock free read/write consistency.

Whatever model you choose, you have to accept some compromises.

The model I'm thinking of is as follows (a rough sketch in code follows the list):

(*) Data that the client doesn't have immediately to hand (ie is not yet
cached) is fetched a chunk at a time upon request of the VM (thus
allowing data readahead to be driven by the VM).

(*) All data cached by a client for a particular file is zapped if I get a
callback from the server and/or the data version number of that file
appears to have changed.

(*) In O_SYNC mode, data is written back to the server as promptly as
possible within the write() call (maybe through the auspices of
prepare_write and commit_write).

(*) In non-O_SYNC mode, I would like the data to be written back through the
page cache's writepage() routine(s). By setting the dirty bits on pages,
the write will be scheduled by the VM at some point. This would permit
better write coalescing locally. However, security becomes a problem,
since I have to say to the server which user I'm doing a store as, and if
the data is coalesced from writes done as several different users, then
there could be a problem if the store is rejected.
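
Something like this, in terms of the 2.4-style address_space_operations (just
a sketch, not the actual patch; the afs_* names are placeholders):

#include <linux/fs.h>
#include <linux/mm.h>

/* Placeholder handlers; signatures as in the 2.4 VFS. */
static int afs_file_readpage(struct file *file, struct page *page);
static int afs_file_writepage(struct page *page);
static int afs_file_prepare_write(struct file *file, struct page *page,
				  unsigned from, unsigned to);
static int afs_file_commit_write(struct file *file, struct page *page,
				 unsigned from, unsigned to);

static struct address_space_operations afs_fs_aops = {
	readpage:	afs_file_readpage,	/* fetch one chunk on demand;
						 * readahead driven by the VM */
	writepage:	afs_file_writepage,	/* non-O_SYNC: VM-scheduled
						 * writeback of dirty pages */
	prepare_write:	afs_file_prepare_write,	/* O_SYNC: stage the write()
						 * data... */
	commit_write:	afs_file_commit_write,	/* ...and store it back to the
						 * server promptly */
};

Zapping on callback then just means invalidating that mapping's pages when the
data version changes.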

How does Coda deal with this security problem?

I admit there are a number of problems with this model that might be
alleviated by better operations being available in the AFS spec (such as data
insertion without having to nominate a new EOF position, and data appending
without needing to know the old or specify a new EOF position).

> Now people started throwing big databases in the filesystem, and the cache
> issues became important. So they introduced 'chunked access', dirty chunks
> are still written when the file is closed, but also when the cache is full
> (oops, lost write consistency).

Anyone using a network filesystem of any type to store big databases is
probably just asking for trouble. IMHO they're far better off talking to a
distributed DB through its own network access protocol. But that's beside the
point.

> > (4) "Diff" the page in the pagecache against a copy stored in the cache
> > and try to send the changes to the server.
>
> As far as I know this is impossible without changing the existing AFS
> servers.

It can be done entirely in the client, provided it has a copy of the
unmodified page still in its cache.
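
As a toy illustration (userspace C, just to show the idea; the function is
mine): with the pristine copy of the page still in the cache, the client can
compute the smallest extent that actually changed and store only that byte
range, e.g. as the offset/length of a single StoreData call.

#include <stddef.h>

/*
 * Compare the current page contents against the unmodified copy still
 * held in the cache and find the smallest extent that changed.  Only
 * the bytes in [*off, *off + *len) then need to be sent to the server.
 */
static void dirty_extent(const unsigned char *oldp, const unsigned char *newp,
			 size_t size, size_t *off, size_t *len)
{
	size_t start = 0, end = size;

	while (start < size && oldp[start] == newp[start])
		start++;
	while (end > start && oldp[end - 1] == newp[end - 1])
		end--;

	*off = start;
	*len = end - start;	/* zero means the page is unchanged */
}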

> Doesn't AFS3 give callbacks when it updates a chunk of the file? I guess it
> still has retained at least that part of the original semantics, send
> callbacks when the file is closed (and the data is 'officially'
> committed). It is still up in the air what clients see that read the file
> between the chunked writes and the actual file close.

All it says is that a given file has changed. It doesn't provide a clue as to
where. Hence the entire file has to be flushed.

> Disconnected operation has never been 'AFS semantics'. That's a Coda thing.

I didn't say it was. It's explicitly denied in the AFS docs, though it's
discussed under the future developments section.

> A Coda cell is simply a FQDN, whenever the userspace cachemanager accesses a
> new cell it assigns a locally unique ID, which will exist as long as there are
> objects from that cell in the cache.

The 32-bit ID still has to be mapped through a catalogue somewhere (which may
be a text file that the cache manager reads on starting), and if the catalogue
is external to the cache, the catalogue may change what the ID corresponds to
without the cache being invalidated.
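
For example (a purely hypothetical format), suppose the catalogue is a text
file of ID-to-cell lines:

	42	coda.cs.cmu.edu
	43	coda.example.org

If someone then points ID 42 at a different cell, every cached object tagged
with 42 is silently rebound, and nothing forces those cache entries to be
invalidated.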

> Why would it probably be swapped out to disc? If you're really worried about
> that you could mlock the memory. And if you think that is too expensive, it
> is still better to mlock memory in userspace than to allocate that same
> memory in kernel space.

I'm storing mine on disc as do normal disc-based FS's. That means it can be a
lot bigger. Besides, you don't really want to mlock memory or store it in the
kernel - that would be a big chunk of memory permanently committed and
unavailable for other uses.

> Yeah, that's why Coda is using a recoverable VM, basically an mmapped
> file with a log where modifications are recorded so that we can
> replay/rollback uncommitted operations when we're restarting.

What's "VM" in this context?

David


2002-10-04 16:01:43

by Jan Harkes

Subject: Re: [PATCH] AFS filesystem for Linux (2/2)

On Fri, Oct 04, 2002 at 04:34:32PM +0100, David Howells wrote:
> > We don't. Coda has the all or nothing fetch. When we get an open upcall and
> > the file isn't cached we get the whole file, and return the filehandle we
> > just used to fetch the file, that way even when pages haven't been flushed
> > to disk yet, the kernel will see the same data. All reads and writes are
> > wrapped in such a way that readpage and writepage directly access the cache
> > file, when an mmap is active the Coda inode isn't even touched anymore.
>
> So, say you've got a really big file in your cache. Someone changes a single
> byte in it. You can work out that the file has changed by one of a number of
> means. How do you avoid having to throw the entire file back to the server?

Coda actually writes the whole file back. Someone was working on an
rsync-based method, but I haven't heard anything for a while. It is
mostly databases that update single bytes in large files.
Anything like a text editor will move the original file to a backup copy
and write a new version anyways. Compilers tend to throw away the old
object files and create a new file, etc.

> > Pretty much, but we need that extra space for the disconnected writes
> > anyways, that way we can always roll back to a consistent version.
>
> How do you sync up with the server on reconnection? This is similar to the
> problem you pointed out that I have to solve - how to deal with the file on
> the server being changed by another client whilst I'm also trying to write to
> it.

We automatically resolve directory conflicts; file conflicts are 'punted'
to userspace. There is a manual repair tool, where a user can pick the
version that he believes is the most appropriate, or he can substitute a
replacement or merged copy.

There is also a framework for application-specific resolvers; these can
try to automatically merge, for instance, calendar files or email boxes, or
remove/recompile object files when there is a conflict, etc.

For AFS you don't have to care; its semantics without locking are that
the 'last writer to close its fd' wins.

> Whatever model you choose, you have to accept some compromises.

I thought you were implementing AFS; how can you choose your model?

> The model I'm thinking of is as follows:
...
> (*) All data cached by a client for a particular file is zapped if I get a
> callback from the server and/or the data version number of that file
> appears to have changed.

How about dirty data that was locally modified? Network latency comes
into play here: if you want last-writer-wins you should not zap dirty
data.

> (*) In O_SYNC mode, data is written back to the server as promptly as
> possible within the write() call (maybe through the auspices of
> prepare_write and commit_write).

So other clients will read inconsistent data if they don't have that
chunk cached. And if the server generates callbacks on the write of a
chunk, all clients see the update?

> (*) In non-O_SYNC mode, I would like the data to be written back through the
> page cache's writepage() routine(s). By setting the dirty bits on pages,
> the write will be scheduled by the VM at some point. This would permit
> better write coalescing locally. However, security becomes a problem,
> since I have to say to the server which user I'm doing a store as, and if
> the data is coalesced from writes done as several different users, then
> there could be a problem if the store is rejected.

Ehh, access permissions should already be checked when the file is
opened. NFS is dealing with the same problems here.

> How does Coda deal with this security problem?

What problem? If you don't have write permission you can't write to the
file. If you do have write permissions you can write. When the last
writer closes its filehandle the data is sent to the server with his
permissions.

Now if the ACL is changed on the servers before the close so that this
last writer loses write access, we get a 'conflict' that is punted to
userspace, similar to the case when writing to an already-updated file.

> > Why would it probably be swapped out to disc? If you're really worried about
> > that you could mlock the memory. And if you think that is too expensive, it
> > is still better to mlock memory in userspace than to allocate that same
> > memory in kernel space.
>
> I'm storing mine on disc as do normal disc-based FS's. That means it can be a
> lot bigger. Besides, you don't really want to mlock memory or store it in the
> kernel - that would be a big chunk of memory permanently committed and
> unavailable for other uses.

So I store it in VM on the disk and you store it in your private FS on
disk. Looks like the same thing to me, except I get to enjoy the
benefits of the page/swapcache keeping the data in memory when it is
frequently used without having to do anything smart.

> > Yeah, that's why Coda is using a recoverable VM, basically an mmapped
> > file with a log where modifications are recorded so that we can
> > replay/rollback uncommitted operations when we're restarting.
>
> What's "VM" in this context?

Virtual Memory. A private mmap of a file. Updates dirty the pages, so
they end up in swap, but we log the same update in a logfile (journal),
when the log fills up, the changes are written directly to the underlying
file. Optionally, when we know that the swap copy and the file copy of
the page are identical, the page is remapped to reduce swap usage. That
is where we store all the metadata and the Coda file -> cache file
mappings so that they can survive a reboot or system crash.
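
Roughly, in userspace terms (a sketch of the idea only, not the actual RVM
code; the names are invented):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct rvm_log_record {
	off_t	offset;		/* where in the mapped region */
	size_t	length;		/* how many bytes changed */
	/* 'length' bytes of new data follow this header in the log */
};

/*
 * 'region' is a MAP_PRIVATE mmap of the metadata file: dirtied pages
 * are backed by swap rather than the file, so the file itself only
 * changes when the log is applied to it (e.g. when the log fills up).
 */
static int rvm_modify(int log_fd, char *region, off_t off,
		      const void *data, size_t len)
{
	struct rvm_log_record rec = { off, len };

	/* journal the change first so it can be replayed or rolled back
	 * when restarting after a crash */
	if (write(log_fd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec) ||
	    write(log_fd, data, len) != (ssize_t)len ||
	    fsync(log_fd) != 0)
		return -1;

	/* then apply the change to the private mapping */
	memcpy(region + off, data, len);
	return 0;
}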

Jan