LinuxLists.cc - [PATCH,RFC] more graceful sunrpc cache updates for HA

2009-01-12 10:33:12

Subject: [PATCH,RFC] more graceful sunrpc cache updates for HA

G'day,

This is a pair of patches which attempt to make the sunrpc caching
mechanism more graceful with respect to administrative exportfs changes
from userspace while traffic is flowing to other exports.

The patches are not to be applied and are for comment, to allow someone
else to finish this work, and to document the problem. The patches are
incomplete and not tested. The patches are against the archaic SLES10
versions of the kernel and nfs-utils and will need some work for modern
software.

Here's a description of the problem. The current behaviour when an
export is changed (added, removed, or the flags are changed) using
exportfs is that exporfs writes a new etab and then tells the kernel to
flush the entire kernel-side export state (including the state for
exports and clients which are not logically affected by the change) and
refresh everything by triggering upcalls to rpc.mountd. Most of the
time this works fine, but there's an ugly interaction which can happen
when a client is generating a workload comprising entirely WRITE calls.

2009-01-12 15:51:52

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [PATCH,RFC] more graceful sunrpc cache updates for HA

On Mon, Jan 12, 2009 at 09:25:02PM +1100, Greg Banks wrote:
> The kernel keeps an effectively global generation number (genid).
> Interfaces are provided to userspace to allow querying the genid, and to
> allow atomically incrementing and querying the genid. After exportfs
> makes a change to etab it asks the kernel to increment the genid. When
> mountd wants to know if the etab has changed, it checks the genid and
> re-reads etab if the genid has changed since the last read. The export
> updates that mountd writes into the kernel are tagged with the genid
> that mountd thinks they belong to, and this is stored in the cache
> entry. Missing is a hunk to make cache_fresh() compare the genids of
> the entry and the cache_detail and if they differ start an upcall (but
> *not* consider the entry invalid, i.e. behave like the age >
> refresh_age/2 case).

So the result is just to give userspace a way to tell the kernel that it
should start making upcalls without yet dropping the existing cache
entries?

I'd like to guarantee that nfsd behavior reflects the updated exports
by the time exportfs returns. From your description, it doesn't sound
like you're trying to meet such a guarantee? Or is there some way for
exportfs to wait till it sees the updates made?

It also might be possible to teach exportfs and/or mountd how to write
the "diff" between the current kernel exports and the new exports into
the export cache.

> a) allow large NFS calls to be deferred, up to the maximum wsize rather
> than just a page, or
>
> b) change call deferral to always block the calling thread instead of
> using a deferral record and returning -EAGAIN

Any deferral method sufficient to handle reads and writes already
requires saving a fair amount of state, so I wonder whether the extra
overhead just to keep another thread around is worth the trouble of
avoiding....

--b.

> Both approaches have interesting and potentially frightening side
> effects, but could be made to work. I've discussed option b) with
> Bruce, and I understand the NFSv4.1 guys have their own reasons for
> wanting to do something like that. Maybe the above will help explain
> why the current call deferral behaviour gives me the irrits :-)

2009-01-12 21:23:17

by Greg Banks

[permalink] [raw]

Subject: Re: [PATCH,RFC] more graceful sunrpc cache updates for HA

J. Bruce Fields wrote:
> On Mon, Jan 12, 2009 at 09:25:02PM +1100, Greg Banks wrote:
>
> So the result is just to give userspace a way to tell the kernel that it
> should start making upcalls without yet dropping the existing cache
> entries?
>
Precisely.

Also, it's a method with a finer grain than 1 second. One of my earlier
attempts involved adding an extra flag to the data passed in the
.../flush pseudofile to make the flushing "soft". One of the problems
this suffered from was that all the logic was working in time_t units.
> I'd like to guarantee that nfsd behavior reflects the updated exports
> by the time exportfs returns.
This is certainly necessary for the "exportfs -u" case so that "exportfs
-u ; umount" always works. The patch needs improvement to ensure that.
I'm not convinced it's strictly necessary for other changes.

> From your description, it doesn't sound
> like you're trying to meet such a guarantee?
We don't have that guarantee now except for the "exportfs -u" case. The
only synchronisation mechanism that I can see is when exportfs writes to
the cache .../flush pseudofile, which blocks until the cache is
completely flushed. This technique doesn't wait for new entries to be
filled in by upcalls, so if the change involved adding new exports or
changing the flags on existing exports it will not yet be in effect when
exportfs exits.

> It also might be possible to teach exportfs and/or mountd how to write
> the "diff" between the current kernel exports and the new exports into
> the export cache.
>
Indeed. The difficulty there is working out how to synchronise between
two exportfs instances, and how to synchronise between an exportfs
instance and calls from new clients coming in over the wire.
>
>> a) allow large NFS calls to be deferred, up to the maximum wsize rather
>> than just a page, or
>>
>> b) change call deferral to always block the calling thread instead of
>> using a deferral record and returning -EAGAIN
>>
>
> Any deferral method sufficient to handle reads and writes already
> requires saving a fair amount of state, so I wonder whether the extra
> overhead just to keep another thread around is worth the trouble of
> avoiding....
>

Agreed.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.