From: "David P. Quigley" <dpquigl@tycho.nsa.gov>
Subject: Re: [NFS] Interesting problem with sunrpc cache
Date: Thu, 18 Oct 2007 11:47:56 -0400
Message-ID: <1192722476.7466.28.camel@moss-terrapins.epoch.ncsc.mil>
References: <1192715235.7466.13.camel@moss-terrapins.epoch.ncsc.mil>
	<20071018145549.GA24088@fieldses.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc: nfs@lists.sourceforge.net, nfsv4@linux-nfs.org
To: "J. Bruce Fields" <bfields@fieldses.org>
In-Reply-To: <20071018145549.GA24088@fieldses.org>
Sender: nfsv4-bounces@linux-nfs.org
Errors-To: nfsv4-bounces@linux-nfs.org

On Thu, 2007-10-18 at 10:55 -0400, J. Bruce Fields wrote:
> On Thu, Oct 18, 2007 at 09:47:15AM -0400, David P. Quigley wrote:
> > Hello,
> >     I have been working on a Domain of Interpretation mapper for the
> > labeled nfs work and I seem to have hit a wall. I started with idmapd as
> > a base and proceeded to modify it to work with DOIs instead. Because of
> > this with a few minor exceptions I expected everything to work. For the
> > most part it does however I seem to have a minor problem with the cache
> > on the nfsd side.
> > 
> >     I am running into a problem where I can't mount the export because
> > the server keeps trying to translate the label. I initially get a
> > success but then it repeatedly attempts to retry.
> 
> So you're just getting repeated NFS4ERR_DELAY responses to the same
> request from the client?  Or does the server just stop responding?  Is
> this always reproduceble?

The server stops responding since it is hung in the
nfs_map_local_to_global function so these retries are between the kernel
and the userspace daemon.

> 
> > I have tracked it down to a bit of code which essentially is a
> > duplication of do_idmap_lookup_nowait.
> 
> When exactly does the label translation occur?

We added a new recommended attribute so we have a function called
nfsd4_encode_security_label in nfs4xdr.c. This grabs the label using
security_inode_getsecurity and then sends it to the translation daemon.

> 
> The idmapper is the only server cache that uses this custom
> deferral/revisit code.  We did it because currently deferrals require
> setting aside the whole compound and then reprocessing it later.  Since
> a getattr (possibly including file owners, requiring idmapping) seemed
> likely to occur after a nonidempotent operation, this seemed like a bad
> idea.
> 
> But the solution in nfs4idmap.c is a little ugly and idmapper-specific.
> (There are also some (much less likely) races that could occur due to
> the export-related upcalls).  Also, if your DOI translation typically
> only happens at the beginning of a compound (for example), then may
> never see this problem anyway.
> 
> I think the real solution is to augment the existing
> svc_defer/svc_revisit to allow us to store a lot more state so we don't
> have to start the compound processing over from scratch each time.  It
> shouldn't be hard.  The problem is more that I don't know how to measure
> the improvement.  And at some point I wonder what we'll save by storing
> all this other state, just to save a kernel task.  Are tasks really that
> expensive?
> 
> But anyway, it'd still be interesting to understand this bug since it's
> likely to be a bug in the nfs4idmap.c as well.
> 
> > In this function we set ret to -ETIMEDOUT which
> > is -110 and then we test three conditions. The conditional checks if the
> > cache entry has the CACHE_VALID bit set, that the epiration time hasn't
> > lapsed and that the flush time hasn't lapsed. The last two conditions
> > here are true so I have determined it is because the CACHE_VALID bit
> > isn't being set properly. 
> 
> Right, so there could be odd race conditions in the custom defer/revisit
> code, but as long as CACHE_VALID eventually gets set you'd think it'd
> recover.  Perhaps there's more than one cache item involved?

I don't think there is more than one cache item. I can't tell for sure
since there doesn't seem to be more than one iteration before it goes
into the loop but I do a lookup on the local representation and I get a
negative cache entry back. I'm not sure if the negative bit is set so I
will check that in a bit (no pun intended). Then I make the upcall to
userspace and the parse function comes back with a sucessful translation
and no error code. At this point I put the node into the update function
and in theory the negative entry we pulled from the lookup should update
the global field for the entry. I supposed that I could toss another
lookup call in there to make sure that this is actually happening.
However, update seems to be working properly in that it returns a cache
entry that has the appropriate values. 

> 
> >     This lead me to check deeper into where this bit is set. The answer
> > to this is that it should be set inside sunrpc_cache_update using the
> > cache_fresh_locked function. Entering this function we hit the two
> > conditions up top where it says that the valid bit isn't set and we
> > update it directly. The line below should update the valid bit since the
> > internals of cache_fresh_locked just set the expiry_time entry of cache
> > head and then use test_and_set_bit to set the CACHE_VALID bit in the
> > cache head's flags.
> > 
> > is_new = cache_fresh_locked(old, new->expiry_time);
> > 
> >     I am very perplexed why something like this is happening. I've been
> > trying to figure it out for the better part of a week and I have finally
> > hit the wall at this point. Has anyone else encountered this problem or
> > have any insight into why this might happen.
> 
> I don't have any bright ideas right now, sorry.

Any bit of insight helps me get closer to solving the problem. The
interesting thing is that if this bit was being set properly it seems as
if everything else would be working perfectly.
> 
> --b.


Dave