2011-02-22 13:31:15

by Rob Landley

Subject: CACHE_NEW_EXPIRY is 120, nextcheck initialized to 30*60=1800?

In net/sunrpc/cache.c line 416 or so (function cache_clean()) there's
this bit:

	else {
		current_index = 0;
		current_detail->nextcheck = seconds_since_boot()+30*60;
	}

The other uses of seconds_since_boot() add CACHE_NEW_EXPIRY (which is
120). This is A) more than ten times that, B) a magic inline constant.

Is there a reason for this? (Some subtle cache lifetime balancing thing?)

Rob


2011-02-22 21:07:39

by NeilBrown

Subject: Re: CACHE_NEW_EXPIRY is 120, nextcheck initialized to 30*60=1800?

On Tue, 22 Feb 2011 07:31:12 -0600 Rob Landley <[email protected]> wrote:

> In net/sunrpc/cache.c line 416 or so (function cache_clean()) there's
> this bit:
>
> 	else {
> 		current_index = 0;
> 		current_detail->nextcheck = seconds_since_boot()+30*60;
> 	}
>
> The other uses of seconds_since_boot() add CACHE_NEW_EXPIRY (which is
> 120). This is A) more than ten times that, B) a magic inline constant.
>
> Is there a reason for this? (Some subtle cache lifetime balancing thing?)

Apples and oranges are both fruit, but don't taste the same...


'nextcheck' is when to next clean old data out of the cache. There is no
rush to remove this data; it is just about freeing up memory, so every
half hour is fine. Sometimes an immediate flush is requested if there is a
pressing need to remove stuff, but by default an occasional pass is enough.

CACHE_NEW_EXPIRY is the expiry time for cache entries that are incomplete and
have not yet been filled in by a down-call. When user-space fills in a cache
entry it gets an expiry time, typically half an hour I think, though that
is up to user-space. If no down-call arrives for 120 seconds we forget about
it. I don't recall the exact point of this - maybe it is to encourage a new
up-call.

But I agree that the 30*60 should be a #defined constant. Patches welcome :-)
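
Something along these lines is all I mean (untested, and the constant name is
just a suggestion, not something that exists in the tree today):

	/* routine interval between scans for expired cache entries */
	#define CACHE_CLEAN_INTERVAL	(30*60)
	...
	else {
		current_index = 0;
		current_detail->nextcheck = seconds_since_boot() + CACHE_CLEAN_INTERVAL;
	}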

NeilBrown


2011-02-23 05:53:29

by NeilBrown

Subject: Re: CACHE_NEW_EXPIRY is 120, nextcheck initialized to 30*60=1800?

On Tue, 22 Feb 2011 21:59:27 -0600 Rob Landley <[email protected]> wrote:

> On 02/22/2011 03:07 PM, NeilBrown wrote:
> > On Tue, 22 Feb 2011 07:31:12 -0600 Rob Landley <[email protected]> wrote:
> >
> >> In net/sunrpc/cache.c line 416 or so (function cache_clean()) there's
> >> this bit:
> >>
> >> 	else {
> >> 		current_index = 0;
> >> 		current_detail->nextcheck = seconds_since_boot()+30*60;
> >> 	}
> >>
> >> The other uses of seconds_since_boot() add CACHE_NEW_EXPIRY (which is
> >> 120). This is A) more than ten times that, B) a magic inline constant.
> >>
> >> Is there a reason for this? (Some subtle cache lifetime balancing thing?)
> >
> > Apples and oranges are both fruit, but don't taste the same...
>
> I know what "apples and oranges" means, thanks.
>
> I'm trying to understand this code, and finding a lot of it hard to
> figure out. For example, in net/sunrpc/svcauth_unix.c there are two
> instances of:
>
> 	expiry = get_expiry(&mesg);
> 	if (expiry == 0)
> 		return -EINVAL;
>

The value '0' means that the (textual) mesg didn't look like a valid number.


> Except that get_expiry() defined in include/linux/sunrpc/cache.h returns
> the difference between the int stored at &mesg and getboottime(), which
> implies that the value can go negative fairly easily if the system is
> busy with something else for a second, so comparing for equality with
> zero seems odd if it's easy to _miss_. Possibly some kind of timer is
> scheduled to force this test to happen at the expiry time, but if so I
> haven't found it yet...

The value in 'mesg' should always be well in the future (wrt 'gettimeofday').
The value returned by getboottime will always be in the past (wrt
'gettimeofday').
So the difference will only be negative if userspace requested an expiry time
that was before the time when the system was booted.
It could get an EINVAL, which is what it would deserve, but it would be more
likely to get an entry that is already expired (which is a reasonable result).

The 'expiry' number is a 'seconds since epoch' number, in case that wasn't
obvious.
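
Spelled out as code, the conversion is just this (an illustration only, not
the actual get_expiry() implementation, and the helper name is made up):

	/* Convert userspace's absolute "seconds since epoch" expiry into the
	 * "seconds since boot" form the cache works with.  Only <= 0 if the
	 * requested expiry is earlier than the moment the machine booted. */
	static time_t epoch_to_boot_relative(time_t requested_epoch, time_t boot_epoch)
	{
		return requested_epoch - boot_epoch;
	}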

>
> (I'm trying to hunt down a specific bug where a cached value of some
> kind is using the wrong struct net * context, and thus if I mount nfsv3
> from the host context it works, and from a container it also works, but
> if I have different (overlapping) network routings in host and container
> and I mount the same IP from the host from the container it doesn't
> work, even if I _unmount_ the host's copy before mounting the
> container's copy (or vice versa). But that it starts working again when
> I give it a couple minutes after the umount for the cache data to time
> out...)

I'm a little fuzzy about the whole 'struct net * context' thing, but in
cache.c it only seems to be connected with server-side things, while you
seem to be talking about client-side things, so maybe there is a disconnect
there. Not sure though.


>
> Mostly I'm assuming you guys know what you're doing and that my
> understanding of the enormous layers of nested cacheing is incomplete,
> but there's a lot of complexity to dig through here...

You are too kind. "Once thought we knew what we were doing" is about as
much as I'd own up to :-)

If you are looking at client-side handling of net contexts, you probably want
to start at rpc_create which does something with args->net.
Find out where that value came from, and where it is going to. Maybe that
will help. (But if you find any wild geese, let me know!)
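
What I mean is something of this shape - abridged and from memory, so treat
everything except the .net line as approximate:

	struct rpc_clnt *example_make_client(struct net *net)
	{
		struct rpc_create_args args = {
			.net	= net,	/* which namespace the client's sockets live in */
			/* ... address, program, version, authflavor, etc ... */
		};
		return rpc_create(&args);
	}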

NeilBrown

2011-02-23 19:30:05

by Rob Landley

Subject: NFS in containers

On 02/22/2011 11:53 PM, NeilBrown wrote:
> On Tue, 22 Feb 2011 21:59:27 -0600 Rob Landley <[email protected]> wrote:
>> (I'm trying to hunt down a specific bug where a cached value of some
>> kind is using the wrong struct net * context, and thus if I mount nfsv3
>> from the host context it works, and from a container it also works, but
>> if I have different (overlapping) network routings in host and container
>> and I mount the same IP from the host from the container it doesn't
>> work, even if I _unmount_ the host's copy before mounting the
>> container's copy (or vice versa). But that it starts working again when
>> I give it a couple minutes after the umount for the cache data to time
>> out...)
>
> I'm a little fuzzy about the whole 'struct net * context' thing,

Look at the clone(2) man page for CLONE_NEWNET.

It basically allows a process group to see its own set of network
interfaces, with different routing (and even different iptables rules)
than other groups of processes.

The network namespace pointer is passed as an argument to
__sock_create() so each socket that gets created lives within a given
network namespace, and all operations that happen on it after that are
relative to that network namespace.

There's a global "init_net" context which is what PID 1 gets and which
things inherit by default, but as soon as you unshare(CLONE_NEWNET) you
get your own struct net instance in current->nsproxy->net_ns.
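
A minimal userspace demonstration (needs root or CAP_SYS_ADMIN and a kernel
built with CONFIG_NET_NS; error handling mostly trimmed):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		/* Detach from the parent's network namespace: from here on this
		 * process sees only its own (initially empty) set of interfaces,
		 * routes, and iptables rules. */
		if (unshare(CLONE_NEWNET) < 0) {
			perror("unshare(CLONE_NEWNET)");
			return 1;
		}
		/* Only a lone, down loopback interface shows up here. */
		system("ip link");
		return 0;
	}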

All the networking that userspace processes do is automatically relative to
their network namespace, but when the kernel opens its own sockets, the
kernel code has to supply a namespace. Things like CIFS and NFS were
doing a lot of stuff relative to the PID 1 namespace because &init_net
is a global that you can reach out and grab without having to care about
which namespace applies (or worry about reference counting).
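
To make that concrete, the two socket creation paths in net/socket.c look
roughly like this (abridged from memory, so treat the details as approximate):

	/* Userspace sockets pick up the calling process's namespace... */
	int sock_create(int family, int type, int protocol, struct socket **res)
	{
		return __sock_create(current->nsproxy->net_ns, family, type,
				     protocol, res, 0);
	}

	/* ...while kernel-internal sockets are hardwired to the boot namespace. */
	int sock_create_kern(int family, int type, int protocol, struct socket **res)
	{
		return __sock_create(&init_net, family, type, protocol, res, 1);
	}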

> but in
> cache.c, it only seems to be connected with server-side things, while you
> seem to be talking about client-side things so maybe there is a disconnect
> there. Not sure though.

Nope, it's client side. The server I'm currently testing against is
entirely userspace (unfs3.sourceforge.net) and running on a different
machine. (Well, my test environment is in KVM and the server is running
on the host laptop.)

What I'm trying to do is set up a new network namespace a la
unshare(CLONE_NEWNET), set up different network routing for that process
than the host has (current->nsproxy->net_ns != &init_net), and then
mount NFS from within that. And I made that part work: as long as only
_one_ network context ever mounts an NFS share. If multiple contexts do
their own mounts, the rpc caches mix together and interfere with each
other. (At least I think that's what's going wrong.)

Here's documentation on how I set up my test environment. I set up a
containers test environment using the LXC package here:

http://landley.livejournal.com/47024.html

Then I set up network routing for it here:

http://landley.livejournal.com/47205.html

Here are my blog entries about getting Samba to work:

http://landley.livejournal.com/47476.html
http://landley.livejournal.com/47761.html

Those resulted in commit f1d0c998653f1eeec60, a patch to make
CIFS mounting work inside a container (although Kerberos doesn't work there yet).

Notice that my test setup intentionally sets up conflicting addresses:
outside the container, packets get routed through eth0, where "10.0.2.2"
is an alias for 127.0.0.1 on the machine running KVM. But inside the
container, packets get routed through eth1, which is a tap interface and
can talk to a 10.0.2.2 address on the machine running KVM. So both
contexts see a 10.0.2.2, but they route to different places, meaning I
can demonstrate this _failing_ as well as succeeding.

Of course you can't reach out and dereference current-> unless you know
you're always called from process context (and the RIGHT process context
at that), so you have to cache these values at mount time and refer to
the cached copies later. (And do a get_net() to increment the reference
counter in case the process that called you goes away, and do a
put_net() when you discard your reference. And don't ask me what's
supposed to happen when you call mount -o remount on a network
filesystem _after_ calling unshare(CLONE_NEWNET). Keep the old network
context, I guess. Doing that may be considered pilot error anyway, I'm
not sure.)
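
The shape of what the mount path has to end up doing is something like this
(an illustrative sketch, not my actual patch - the struct and function names
here are made up):

	/* Hypothetical per-mount data: grab the mounting process's namespace
	 * once, at mount time, and hold a reference so it can't go away. */
	struct example_mount_data {
		struct net *net;
		/* ... whatever else the mount needs ... */
	};

	static void example_capture_net(struct example_mount_data *d)
	{
		d->net = get_net(current->nsproxy->net_ns);	/* take a reference */
	}

	static void example_release_net(struct example_mount_data *d)
	{
		put_net(d->net);	/* drop it at umount time */
	}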

My first patch (nfs2.patch, attached) made mounting NFSv3 in the
container work for a fairly restricted test case (mountd and nfsd
explicitly specified by port number, so no portmapper involved), and
only as long as I don't ALSO mount an NFS share on the same IP address
from some other context (such as outside the container). When I do, the
cache mixes the two 10.0.2.2 instances together somehow. (Among other
things, I have to teach the idempotent action replay mechanism that
matching addresses isn't enough, you also have to match network
namespaces. Except my test is just "mount; ls; cat file; umount" and
it's still not working, so those are additional todo items for later. I
haven't even started on lockd yet.)
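
Conceptually, the change the matching code needs is just this, sketched with
made-up type and function names:

	/* Sketch only: "same address" isn't enough to call two cache entries
	 * equal, because 10.0.2.2 in two namespaces can be two different
	 * machines; "same namespace" has to be part of the match too. */
	struct example_entry {
		struct sockaddr_storage	addr;
		size_t			addrlen;
		struct net		*net;
	};

	static bool example_match(const struct example_entry *e,
				  const struct sockaddr *addr, size_t alen,
				  const struct net *net)
	{
		return e->addrlen == alen &&
		       memcmp(&e->addr, addr, alen) == 0 &&
		       e->net == net;	/* the new requirement */
	}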

The problem persists after I umount, but times out after a couple of
minutes and eventually starts working again. Of the many different
caches, I don't know WHICH ones I need to fix yet, or even what they all
do. I haven't submitted this patch yet because I'm still making sure
get_sb() is only ever called from the correct process context, such as via
the mount() system call, so dereferencing current like I'm doing to grab
the network context is ok. (I think this is the case, but proving a
negative is time consuming and I've got several balls in the air. If it
is the case, I need to add comments to that effect.)

My second patch (sunrpc1.patch) was an attempt to fix my first guess at
which cache wasn't taking network namespace into account when matching
addresses. It compiled and worked but didn't fix the problem I was
seeing, and I'm significantly less certain my use of
current->nsproxy->net_ns in there is correct, or that I'm not missing an
existing .net buried in a structure somewhere.

I'm working on about three other patches now, but still trying to figure
out where the actual _failure_ is. (The entire transaction is a dozen
or so packets. These packets are generated in an elaborate ceremony
during which a chicken is sacrificed to the sunrpc layer.)

>> Mostly I'm assuming you guys know what you're doing and that my
>> understanding of the enormous layers of nested cacheing is incomplete,
>> but there's a lot of complexity to dig through here...
>
> You are too kind. "Once thought we knew what we were doing" is about as
> much as I'd own up to :-)

Don't get me wrong, I hate NFS at the design level.

I consider it "the COBOL of filesystems", am convinced that at least 2/3
of its code is premature optimization left over from the 80's, hate the
way it reimplements half the VFS, consider a "stateless filesystem
server" to be a contradiction in terms, thought that khttpd and the tux
web server had already shown that having servers in kernel space simply
isn't worth it, can't figure out why ext3 was a separate filesystem from
ext2 but nfsv4/nfsv3 are mixed together, consider the NFSv4 design to be
a massive increase in complexity without necessarily being an
improvement over v3 in cache coherency or things like "my build broke
when it tried to rm -rf a directory a process still had a file open in",
and am rooting for the Plan 9 filesystem (or something) to eventually
replace it. I'm only working on it because my employer told me to.

That said, a huge amount of development and testing has gone into the
code that's there, and it's been working for a lot of people for a long
time under some serious load. (But picking apart the layers of strange
asynchronous caching and propagating namespace information through them
would not be my first choice of recreational activities.)

> If you are looking at client-side handling of net contexts, you probably want
> to start at rpc_create which does something with args->net.
> Find out where that value came from, and where it is going to. Maybe that
> will help. (But if you find any wild geese, let me know!)

I've been reading through this code, and various standards documents,
and conference papers, and so on, for weeks now. (You are in a maze of
twisty little cache management routines with lifetime rules people have
written multiple conflicting conference papers on.)

I'm now aware of an awful lot more things I really don't understand
about NFS than I was when I started, and although I now know more about
what it does I understand even _less_ about _why_.

Still, third patch is the charm. I should go get a chicken.

> NeilBrown

Rob


Attachments:
nfs2.patch (3.10 kB)
sunrpc1.patch (3.76 kB)

2011-02-23 03:59:34

by Rob Landley

Subject: Re: CACHE_NEW_EXPIRY is 120, nextcheck initialized to 30*60=1800?

On 02/22/2011 03:07 PM, NeilBrown wrote:
> On Tue, 22 Feb 2011 07:31:12 -0600 Rob Landley <[email protected]> wrote:
>
>> In net/sunrpc/cache.c line 416 or so (function cache_clean()) there's
>> this bit:
>>
>> 	else {
>> 		current_index = 0;
>> 		current_detail->nextcheck = seconds_since_boot()+30*60;
>> 	}
>>
>> The other uses of seconds_since_boot() add CACHE_NEW_EXPIRY (which is
>> 120). This is A) more than ten times that, B) a magic inline constant.
>>
>> Is there a reason for this? (Some subtle cache lifetime balancing thing?)
>
> Apples and oranges are both fruit, but don't taste the same...

I know what "apples and oranges" means, thanks.

I'm trying to understand this code, and finding a lot of it hard to
figure out. For example, in net/sunrpc/svcauth_unix.c there are two
instances of:

	expiry = get_expiry(&mesg);
	if (expiry == 0)
		return -EINVAL;

Except that get_expiry() defined in include/linux/sunrpc/cache.h returns
the difference between the int stored at &mesg and getboottime(), which
implies that the value can go negative fairly easily if the system is
busy with something else for a second, so comparing for equality with
zero seems odd if it's easy to _miss_. Possibly some kind of timer is
scheduled to force this test to happen at the expiry time, but if so I
haven't found it yet...

(I'm trying to hunt down a specific bug where a cached value of some
kind is using the wrong struct net * context, and thus if I mount nfsv3
from the host context it works, and from a container it also works, but
if I have different (overlapping) network routings in host and container
and I mount the same IP from the host from the container it doesn't
work, even if I _unmount_ the host's copy before mounting the
container's copy (or vice versa). But that it starts working again when
I give it a couple minutes after the umount for the cache data to time
out...)

Mostly I'm assuming you guys know what you're doing and that my
understanding of the enormous layers of nested caching is incomplete,
but there's a lot of complexity to dig through here...

Rob