LinuxLists.cc - Interesting problem with sunrpc cache

2007-10-18 13:47:20

Subject: Interesting problem with sunrpc cache

Hello,
I have been working on a Domain of Interpretation mapper for the
labeled nfs work and I seem to have hit a wall. I started with idmapd as
a base and proceeded to modify it to work with DOIs instead. Because of
this with a few minor exceptions I expected everything to work. For the
most part it does however I seem to have a minor problem with the cache
on the nfsd side.

I am running into a problem where I can't mount the export because
the server keeps trying to translate the label. I initially get a
success but then it repeatedly attempts to retry. I have tracked it down
to a bit of code which essentially is a duplication of
do_idmap_lookup_nowait. In this function we set ret to -ETIMEDOUT which
is -110 and then we test three conditions. The conditional checks if the
cache entry has the CACHE_VALID bit set, that the epiration time hasn't
lapsed and that the flush time hasn't lapsed. The last two conditions
here are true so I have determined it is because the CACHE_VALID bit
isn't being set properly.

This lead me to check deeper into where this bit is set. The answer
to this is that it should be set inside sunrpc_cache_update using the
cache_fresh_locked function. Entering this function we hit the two
conditions up top where it says that the valid bit isn't set and we
update it directly. The line below should update the valid bit since the
internals of cache_fresh_locked just set the expiry_time entry of cache
head and then use test_and_set_bit to set the CACHE_VALID bit in the
cache head's flags.

is_new = cache_fresh_locked(old, new->expiry_time);

I am very perplexed why something like this is happening. I've been
trying to figure it out for the better part of a week and I have finally
hit the wall at this point. Has anyone else encountered this problem or
have any insight into why this might happen.

David Quigley

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-18 14:55:49

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [NFS] Interesting problem with sunrpc cache

On Thu, Oct 18, 2007 at 09:47:15AM -0400, David P. Quigley wrote:
> Hello,
> I have been working on a Domain of Interpretation mapper for the
> labeled nfs work and I seem to have hit a wall. I started with idmapd as
> a base and proceeded to modify it to work with DOIs instead. Because of
> this with a few minor exceptions I expected everything to work. For the
> most part it does however I seem to have a minor problem with the cache
> on the nfsd side.
>
> I am running into a problem where I can't mount the export because
> the server keeps trying to translate the label. I initially get a
> success but then it repeatedly attempts to retry.

So you're just getting repeated NFS4ERR_DELAY responses to the same
request from the client? Or does the server just stop responding? Is
this always reproduceble?

> I have tracked it down to a bit of code which essentially is a
> duplication of do_idmap_lookup_nowait.

When exactly does the label translation occur?

The idmapper is the only server cache that uses this custom
deferral/revisit code. We did it because currently deferrals require
setting aside the whole compound and then reprocessing it later. Since
a getattr (possibly including file owners, requiring idmapping) seemed
likely to occur after a nonidempotent operation, this seemed like a bad
idea.

But the solution in nfs4idmap.c is a little ugly and idmapper-specific.
(There are also some (much less likely) races that could occur due to
the export-related upcalls). Also, if your DOI translation typically
only happens at the beginning of a compound (for example), then may
never see this problem anyway.

I think the real solution is to augment the existing
svc_defer/svc_revisit to allow us to store a lot more state so we don't
have to start the compound processing over from scratch each time. It
shouldn't be hard. The problem is more that I don't know how to measure
the improvement. And at some point I wonder what we'll save by storing
all this other state, just to save a kernel task. Are tasks really that
expensive?

But anyway, it'd still be interesting to understand this bug since it's
likely to be a bug in the nfs4idmap.c as well.

> In this function we set ret to -ETIMEDOUT which
> is -110 and then we test three conditions. The conditional checks if the
> cache entry has the CACHE_VALID bit set, that the epiration time hasn't
> lapsed and that the flush time hasn't lapsed. The last two conditions
> here are true so I have determined it is because the CACHE_VALID bit
> isn't being set properly.

Right, so there could be odd race conditions in the custom defer/revisit
code, but as long as CACHE_VALID eventually gets set you'd think it'd
recover. Perhaps there's more than one cache item involved?

> This lead me to check deeper into where this bit is set. The answer
> to this is that it should be set inside sunrpc_cache_update using the
> cache_fresh_locked function. Entering this function we hit the two
> conditions up top where it says that the valid bit isn't set and we
> update it directly. The line below should update the valid bit since the
> internals of cache_fresh_locked just set the expiry_time entry of cache
> head and then use test_and_set_bit to set the CACHE_VALID bit in the
> cache head's flags.
>
> is_new = cache_fresh_locked(old, new->expiry_time);
>
> I am very perplexed why something like this is happening. I've been
> trying to figure it out for the better part of a week and I have finally
> hit the wall at this point. Has anyone else encountered this problem or
> have any insight into why this might happen.

I don't have any bright ideas right now, sorry.

--b.

2007-10-18 15:47:56

by David P. Quigley

[permalink] [raw]

Subject: Re: [NFS] Interesting problem with sunrpc cache

On Thu, 2007-10-18 at 10:55 -0400, J. Bruce Fields wrote:
> On Thu, Oct 18, 2007 at 09:47:15AM -0400, David P. Quigley wrote:
> > Hello,
> > I have been working on a Domain of Interpretation mapper for the
> > labeled nfs work and I seem to have hit a wall. I started with idmapd as
> > a base and proceeded to modify it to work with DOIs instead. Because of
> > this with a few minor exceptions I expected everything to work. For the
> > most part it does however I seem to have a minor problem with the cache
> > on the nfsd side.
> >
> > I am running into a problem where I can't mount the export because
> > the server keeps trying to translate the label. I initially get a
> > success but then it repeatedly attempts to retry.
>
> So you're just getting repeated NFS4ERR_DELAY responses to the same
> request from the client? Or does the server just stop responding? Is
> this always reproduceble?

The server stops responding since it is hung in the
nfs_map_local_to_global function so these retries are between the kernel
and the userspace daemon.

>
> > I have tracked it down to a bit of code which essentially is a
> > duplication of do_idmap_lookup_nowait.
>
> When exactly does the label translation occur?

We added a new recommended attribute so we have a function called
nfsd4_encode_security_label in nfs4xdr.c. This grabs the label using
security_inode_getsecurity and then sends it to the translation daemon.

>
> The idmapper is the only server cache that uses this custom
> deferral/revisit code. We did it because currently deferrals require
> setting aside the whole compound and then reprocessing it later. Since
> a getattr (possibly including file owners, requiring idmapping) seemed
> likely to occur after a nonidempotent operation, this seemed like a bad
> idea.
>
> But the solution in nfs4idmap.c is a little ugly and idmapper-specific.
> (There are also some (much less likely) races that could occur due to
> the export-related upcalls). Also, if your DOI translation typically
> only happens at the beginning of a compound (for example), then may
> never see this problem anyway.
>
> I think the real solution is to augment the existing
> svc_defer/svc_revisit to allow us to store a lot more state so we don't
> have to start the compound processing over from scratch each time. It
> shouldn't be hard. The problem is more that I don't know how to measure
> the improvement. And at some point I wonder what we'll save by storing
> all this other state, just to save a kernel task. Are tasks really that
> expensive?
>
> But anyway, it'd still be interesting to understand this bug since it's
> likely to be a bug in the nfs4idmap.c as well.
>
> > In this function we set ret to -ETIMEDOUT which
> > is -110 and then we test three conditions. The conditional checks if the
> > cache entry has the CACHE_VALID bit set, that the epiration time hasn't
> > lapsed and that the flush time hasn't lapsed. The last two conditions
> > here are true so I have determined it is because the CACHE_VALID bit
> > isn't being set properly.
>
> Right, so there could be odd race conditions in the custom defer/revisit
> code, but as long as CACHE_VALID eventually gets set you'd think it'd
> recover. Perhaps there's more than one cache item involved?

I don't think there is more than one cache item. I can't tell for sure
since there doesn't seem to be more than one iteration before it goes
into the loop but I do a lookup on the local representation and I get a
negative cache entry back. I'm not sure if the negative bit is set so I
will check that in a bit (no pun intended). Then I make the upcall to
userspace and the parse function comes back with a sucessful translation
and no error code. At this point I put the node into the update function
and in theory the negative entry we pulled from the lookup should update
the global field for the entry. I supposed that I could toss another
lookup call in there to make sure that this is actually happening.
However, update seems to be working properly in that it returns a cache
entry that has the appropriate values.

>
> > This lead me to check deeper into where this bit is set. The answer
> > to this is that it should be set inside sunrpc_cache_update using the
> > cache_fresh_locked function. Entering this function we hit the two
> > conditions up top where it says that the valid bit isn't set and we
> > update it directly. The line below should update the valid bit since the
> > internals of cache_fresh_locked just set the expiry_time entry of cache
> > head and then use test_and_set_bit to set the CACHE_VALID bit in the
> > cache head's flags.
> >
> > is_new = cache_fresh_locked(old, new->expiry_time);
> >
> > I am very perplexed why something like this is happening. I've been
> > trying to figure it out for the better part of a week and I have finally
> > hit the wall at this point. Has anyone else encountered this problem or
> > have any insight into why this might happen.
>
> I don't have any bright ideas right now, sorry.

Any bit of insight helps me get closer to solving the problem. The
interesting thing is that if this bit was being set properly it seems as
if everything else would be working perfectly.
>
> --b.

Dave

2007-10-18 17:01:04

by J. Bruce Fields

[permalink] [raw]

Subject: Re: Interesting problem with sunrpc cache

On Thu, Oct 18, 2007 at 11:47:56AM -0400, David P. Quigley wrote:
> On Thu, 2007-10-18 at 10:55 -0400, J. Bruce Fields wrote:
> > On Thu, Oct 18, 2007 at 09:47:15AM -0400, David P. Quigley wrote:
> > > Hello,
> > > I have been working on a Domain of Interpretation mapper for the
> > > labeled nfs work and I seem to have hit a wall. I started with idmapd as
> > > a base and proceeded to modify it to work with DOIs instead. Because of
> > > this with a few minor exceptions I expected everything to work. For the
> > > most part it does however I seem to have a minor problem with the cache
> > > on the nfsd side.
> > >
> > > I am running into a problem where I can't mount the export because
> > > the server keeps trying to translate the label. I initially get a
> > > success but then it repeatedly attempts to retry.
> >
> > So you're just getting repeated NFS4ERR_DELAY responses to the same
> > request from the client? Or does the server just stop responding? Is
> > this always reproduceble?
>
> The server stops responding since it is hung in the
> nfs_map_local_to_global function so these retries are between the kernel
> and the userspace daemon.

OK. It's not doing exactly the same thing as the nfs4idmap.c code,
then--maybe you'll want to post your patch to help us understand?

> > > I have tracked it down to a bit of code which essentially is a
> > > duplication of do_idmap_lookup_nowait.
> >
> > When exactly does the label translation occur?
>
> We added a new recommended attribute so we have a function called
> nfsd4_encode_security_label in nfs4xdr.c. This grabs the label using
> security_inode_getsecurity and then sends it to the translation daemon.

When is this attribute requested?

> I don't think there is more than one cache item. I can't tell for sure
> since there doesn't seem to be more than one iteration before it goes
> into the loop but I do a lookup on the local representation and I get a
> negative cache entry back.

It shouldn't have CACHE_NEGATIVE set at this point. (CACHE_VALID should
be cleared, if that's what you mean.)

> I'm not sure if the negative bit is set so I will check that in a bit
> (no pun intended).

Hmph.

> Then I make the upcall to userspace and the parse function comes back
> with a sucessful translation and no error code. At this point I put
> the node into the update function and in theory the negative entry we
> pulled from the lookup should update the global field for the entry. I
> supposed that I could toss another lookup call in there to make sure
> that this is actually happening. However, update seems to be working
> properly in that it returns a cache entry that has the appropriate
> values.

You could also printk() the address of the cache items in question just
to make sure the right one's getting updated.

> Any bit of insight helps me get closer to solving the problem. The
> interesting thing is that if this bit was being set properly it seems as
> if everything else would be working perfectly.

Happy to help if I'm able, but I'm leaving for an early weekend sometime
this afternoon, so will probably be mostly unresponsive till Monday.

--b.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-18 18:22:09

by David P. Quigley

[permalink] [raw]

Subject: Re: Interesting problem with sunrpc cache

On Thu, 2007-10-18 at 13:01 -0400, J. Bruce Fields wrote:
> On Thu, Oct 18, 2007 at 11:47:56AM -0400, David P. Quigley wrote:
> > On Thu, 2007-10-18 at 10:55 -0400, J. Bruce Fields wrote:
> > > On Thu, Oct 18, 2007 at 09:47:15AM -0400, David P. Quigley wrote:
> > > > Hello,
> > > > I have been working on a Domain of Interpretation mapper for the
> > > > labeled nfs work and I seem to have hit a wall. I started with idmapd as
> > > > a base and proceeded to modify it to work with DOIs instead. Because of
> > > > this with a few minor exceptions I expected everything to work. For the
> > > > most part it does however I seem to have a minor problem with the cache
> > > > on the nfsd side.
> > > >
> > > > I am running into a problem where I can't mount the export because
> > > > the server keeps trying to translate the label. I initially get a
> > > > success but then it repeatedly attempts to retry.
> > >
> > > So you're just getting repeated NFS4ERR_DELAY responses to the same
> > > request from the client? Or does the server just stop responding? Is
> > > this always reproduceble?
> >
> > The server stops responding since it is hung in the
> > nfs_map_local_to_global function so these retries are between the kernel
> > and the userspace daemon.
>
> OK. It's not doing exactly the same thing as the nfs4idmap.c code,
> then--maybe you'll want to post your patch to help us understand?

That's a good idea. I'm gonna table my translation stuff for a day or so
to incorporate some other patches from my other team mates and I will
post a patch set probably on Monday. I wanted to have a working DOI
translation daemon sent out with it but unfortunately that doesn't seem
like it is going to be an option.

>
> > > > I have tracked it down to a bit of code which essentially is a
> > > > duplication of do_idmap_lookup_nowait.
> > >
> > > When exactly does the label translation occur?
> >
> > We added a new recommended attribute so we have a function called
> > nfsd4_encode_security_label in nfs4xdr.c. This grabs the label using
> > security_inode_getsecurity and then sends it to the translation daemon.
>
> When is this attribute requested?

The call to nfsd4_encode_security_label is in nfsd4_encode_fattr. We
usually pull this across with every getattr request. The decoding which
we haven't hit yet is in nfsd4_decode_fattr
>
> > I don't think there is more than one cache item. I can't tell for sure
> > since there doesn't seem to be more than one iteration before it goes
> > into the loop but I do a lookup on the local representation and I get a
> > negative cache entry back.
>
> It shouldn't have CACHE_NEGATIVE set at this point. (CACHE_VALID should
> be cleared, if that's what you mean.)
>
> > I'm not sure if the negative bit is set so I will check that in a bit
> > (no pun intended).
>
> Hmph.
>
> > Then I make the upcall to userspace and the parse function comes back
> > with a sucessful translation and no error code. At this point I put
> > the node into the update function and in theory the negative entry we
> > pulled from the lookup should update the global field for the entry. I
> > supposed that I could toss another lookup call in there to make sure
> > that this is actually happening. However, update seems to be working
> > properly in that it returns a cache entry that has the appropriate
> > values.
>
> You could also printk() the address of the cache items in question just
> to make sure the right one's getting updated.

I'll give that a try.

>
> > Any bit of insight helps me get closer to solving the problem. The
> > interesting thing is that if this bit was being set properly it seems as
> > if everything else would be working perfectly.
>
> Happy to help if I'm able, but I'm leaving for an early weekend sometime
> this afternoon, so will probably be mostly unresponsive till Monday.

Have a nice weekend :)

>
> --b.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs