2008-06-09 14:31:37

by Jeff Layton

Subject: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

Apologies for the long email, but I ran into an interesting problem the
other day and am looking for some feedback on my general approach to
fixing it before I spend too much time on it:

We (RH) have a cluster-suite product that some people use for making HA
NFS services. When our QA folks test this, they often will start up
some operations that do activity on an NFS mount from the cluster and
then rapidly do failovers between cluster machines and make sure
everything keeps moving along. The cluster is designed to not shut down
nfsd's when a failover occurs. nfsd's are considered a "shared
resource". It's possible that there could be multiple clustered
services for NFS-sharing, so when a failover occurs, we just manipulate
the exports table.

The problem we've run into is that occasionally they fail over to the
alternate machine and then back very rapidly. Because nfsd's are not
shut down on failover, sockets are not closed. So what happens is
something like this on TCP mounts:

- client has NFS mount from clustered NFS service on one server

- service fails over, new server doesn't know anything about the
existing socket, so it sends a RST back to the client when data
comes in. Client closes connection and reopens it and does some
I/O on the socket.

- service fails back to original server. The original socket there
is still open, but now the TCP sequence numbers are off. When
packets come into the server we end up with an ACK storm, and the
client hangs for a long time.

Neil Horman did a good writeup of this problem here for those that
want the gory details:

https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16

I can think of 3 ways to fix this:

1) Add something like the "unlock_ip" interface that was recently
added for NLM. Maybe a "close_ip" file that allows us to close all
nfsd sockets connected to a given local IP address. So clustering
software could do something like:

# echo 10.20.30.40 > /proc/fs/nfsd/close_ip

...and make sure that all of the sockets are closed.

2) Just use the same "unlock_ip" interface and have it also
close sockets in addition to dropping locks.

3) Have an nfsd close all non-listening connections when it gets a
certain signal (maybe SIGUSR1 or something). Connections on sockets
that aren't failing over would just get a RST, and the clients would
reopen their connections.

...my preference would probably be approach #1.
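
To make #1 a bit more concrete, here's a rough sketch of what the write
handler for such a file might look like, modeled loosely on the other
handlers in fs/nfsd/nfsctl.c. This is only an illustration of the idea,
not working code: svc_close_xprts_by_ip() is a made-up name for whatever
helper would walk the server's transport list and flag sockets bound to
that local address for close.

/*
 * Hypothetical sketch only -- not actual nfsd code.
 * Parse an IPv4 address written to a new /proc/fs/nfsd/close_ip file
 * and ask the sunrpc layer to close every nfsd socket whose *local*
 * (bound) address matches it.
 */
static ssize_t write_close_ip(struct file *file, char *buf, size_t size)
{
        __be32 addr;
        int err;

        if (size == 0 || buf[size - 1] != '\n')
                return -EINVAL;
        buf[size - 1] = '\0';

        if (!in4_pton(buf, -1, (u8 *)&addr, -1, NULL))
                return -EINVAL;

        /*
         * Hypothetical helper: walk the server's socket list, compare
         * each socket's bound address to addr, and mark matches for
         * close so the nfsd threads tear them down; the clients then
         * see the connection drop and reconnect to whichever node
         * currently owns that IP.
         */
        err = svc_close_xprts_by_ip(nfsd_serv, addr);
        return err ? err : size;
}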

I've only really done some rudimentary perusing of the code, so there
may be roadblocks with some of these approaches I haven't considered.
Does anyone have thoughts on the general problem or ideas for a solution?

The situation is a bit specific to failover testing -- most people failing
over don't do it so rapidly, but we'd still like to ensure that this
problem doesn't occur if someone does do it.

Thanks,
--
Jeff Layton <[email protected]>


2008-06-09 17:24:25

by Jeff Layton

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, 09 Jun 2008 13:14:56 -0400
Wendy Cheng <[email protected]> wrote:

> Jeff Layton wrote:
> > The problem we've run into is that occasionally they fail over to the
> > alternate machine and then back very rapidly.
>
> It is a well-known issue in the NFS-TCP failover arena (or more
> specifically, for floating-IP applications) that failing over from server A
> to server B and then immediately failing back from server B to A would
> *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> support folks, we concluded that most of the applications/users *can*
> tolerate this restriction.
>
> Maybe a more basic question: other than QA efforts, are there
> real NFSv2/v3 applications depending on this "feature"? Or might this
> take a ton of effort for something that will not have much usage when
> it is finally delivered?
>

Certainly a valid question...

While rapid failover like this is unusual, it's easily possible for a
sysadmin to do it. Maybe they moved the wrong service, or their downtime
was for something very brief but the service had to be off of the host to
make the change. In that case, a quick failover and back could easily
be something that happens in a real environment.

As to whether it's worth a ton of effort, that's a tough call. People want
HA services to guard against outages. Anything that jeopardizes that is
probably worth fixing. This could be solved with documentation, but a note
like:

"Be sure to wait for X minutes between failovers"

...wouldn't instill me with a lot of confidence. We'd have to have
some sort of mechanism to enforce this, and that would be less than
ideal.

IMO, the ideal thing would be to make sure that the "old" server is
ready to pick up the service again as soon as possible after the service
leaves it.

--
Jeff Layton <[email protected]>

2008-06-09 19:13:06

by Talpey, Thomas

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

At 03:01 PM 6/9/2008, Jeff Layton wrote:
>On Mon, 09 Jun 2008 13:51:05 -0400
>"Talpey, Thomas" <[email protected]> wrote:
>
>> At 01:24 PM 6/9/2008, Jeff Layton wrote:
>> >
>> >"Be sure to wait for X minutes between failovers"
>>
>> At least one grace period.
>>
>
>Actually, we have to wait until all of the sockets on the old server
>time out. This is difficult to predict and can be quite long.

I just gave the floor. The ceiling is yours. :-)

Orphaned server TCP sockets, btw, in general last forever without keepalive.
Even with keepalive, they can last many tens of minutes.
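
For anyone who wants to see the knobs involved, here's a minimal
user-space sketch of per-socket keepalive tuning -- nothing nfsd-specific,
just an illustration of why the defaults are so slow to notice a dead peer
(7200s of idle time plus 9 probes at 75s intervals is well over two hours):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Illustration only: enable and tighten TCP keepalive on one socket.
 * Without SO_KEEPALIVE an idle, orphaned connection is never probed
 * and can linger forever; these per-socket options override the
 * tcp_keepalive_time/intvl/probes sysctl defaults.
 */
static int enable_keepalive(int fd)
{
        int on = 1, idle = 60, intvl = 10, cnt = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
                return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
                return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}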

Tom.


2008-06-09 17:59:03

by Talpey, Thomas

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

At 01:51 PM 6/9/2008, Talpey, Thomas wrote:
>and NSM/NLM. NSM provides only notification; there's no way for
>either server to know for sure all the clients have completed
>either switch-to or switch-back.

Just in case it helps to understand why relying on NSM is so risky:

<http://www.connectathon.org/talks06/talpey-cthon06-nsm.pdf>

Slides 16, 17 and 23, especially.

Tom.


2008-06-09 16:02:43

by Jeff Layton

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, 9 Jun 2008 11:51:36 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > I can think of 3 ways to fix this:
> >
> > 1) Add something like the "unlock_ip" interface that was recently
> > added for NLM. Maybe a "close_ip" file that allows us to close all
> > nfsd sockets connected to a given local IP address. So clustering
> > software could do something like:
> >
> > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> >
> > ...and make sure that all of the sockets are closed.
> >
> > 2) Just use the same "unlock_ip" interface and have it also
> > close sockets in addition to dropping locks.
> >
> > 3) Have an nfsd close all non-listening connections when it gets a
> > certain signal (maybe SIGUSR1 or something). Connections on sockets
> > that aren't failing over would just get a RST, and the clients would
> > reopen their connections.
> >
> > ...my preference would probably be approach #1.
>
> What do you see as the advantage of #1 over #2? Are there cases where
> someone would want to drop locks but not also close connections (or
> vice-versa)?
>

There's no real advantage that I can see (maybe if they're running a
cluster with no NLM services somehow). Mostly that "unlock_ip" seems to
imply that it deals with locking, and this doesn't. I'd be OK with #2
if it's a reasonable solution. Given what Chuck mentioned, it sounds
like we'll also need to take care to make sure that existing calls
complete and the replies get flushed out too, so this could be more
complicated than I had anticipated.

--
Jeff Layton <[email protected]>

2008-06-09 18:07:32

by Neil Horman

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 01:14:56PM -0400, Wendy Cheng wrote:
> Jeff Layton wrote:
> >The problem we've run into is that occasionally they fail over to the
> >alternate machine and then back very rapidly.
>
> It is a well-known issue in the NFS-TCP failover arena (or more
> specifically, for floating-IP applications) that failing over from server A
> to server B and then immediately failing back from server B to A would
> *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> support folks, we concluded that most of the applications/users *can*
> tolerate this restriction.

I think the big problem here is that this restriction has a window that can be
particularly long-lived. If an application doesn't close its sockets, the time
between a failover event and the time when it is safe to fail back is bounded
by the lifetime of the socket on the 'failed' server. Given the right
configuration, this could be indefinite. Worse, you could fail back at just the
wrong time after the sequence number wraps completely, and pick up where you left
off, not knowing you lost 4GB of data in the process.
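
To put a number on that: sequence numbers are only 32 bits wide, so they
wrap after roughly 4GB of traffic, and at that point the stale socket's
notion of the next expected sequence number can look perfectly valid again.
A trivial illustration of the arithmetic (the values are made up, just to
show the wrap):

#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: sequence arithmetic is modulo 2^32, so after
 * about 4GB sent on the "new" connection the numbers line back up
 * with what the stale socket on the old server expects, and the
 * data loss in between goes unnoticed.
 */
int main(void)
{
        uint32_t old_expected = 0x10000000;     /* where the old socket left off */
        uint64_t sent_elsewhere = 1ULL << 32;   /* ~4GB sent via the other server */
        uint32_t new_seq = (uint32_t)(old_expected + sent_elsewhere);

        printf("old expected=%#x new seq=%#x match=%d\n",
               old_expected, new_seq, old_expected == new_seq);
        return 0;
}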


>
> Maybe a more basic question: other than QA efforts, are there
> real NFSv2/v3 applications depending on this "feature"? Or might this
> take a ton of effort for something that will not have much usage when
> it is finally delivered?
>
> -- Wendy
>
>
>

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2008-06-09 17:51:05

by Talpey, Thomas

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

At 01:24 PM 6/9/2008, Jeff Layton wrote:
>
>"Be sure to wait for X minutes between failovers"

At least one grace period.

>
>...wouldn't instill me with a lot of confidence. We'd have to have
>some sort of mechanism to enforce this, and that would be less than
>ideal.
>
>IMO, the ideal thing would be to make sure that the "old" server is
>ready to pick up the service again as soon as possible after the service
>leaves it.

A great goal, but it seems to me you've bundled a lot of other
incompatible requirements along with it. Having some services
restart and not others, for example. And mixing transparent IP
address takeover with stateful recovery such as TCP reconnect
and NSM/NLM. NSM provides only notification; there's no way for
either server to know for sure all the clients have completed
either switch-to or switch-back.

Of course, you could switch to UDP-only; that would fix the
TCP issue. But it won't fix NSM/NLM.

Tom.


2008-06-09 18:10:48

by Neil Horman

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 01:24:25PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 13:14:56 -0400
> Wendy Cheng <[email protected]> wrote:
>
> > Jeff Layton wrote:
> > > The problem we've run into is that occasionally they fail over to the
> > > alternate machine and then back very rapidly.
> >
> > It is a well-known issue in the NFS-TCP failover arena (or more
> > specifically, for floating-IP applications) that failing over from server A
> > to server B and then immediately failing back from server B to A would
> > *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> > support folks, we concluded that most of the applications/users *can*
> > tolerate this restriction.
> >
> > Maybe a more basic question: other than QA efforts, are there
> > real NFSv2/v3 applications depending on this "feature"? Or might this
> > take a ton of effort for something that will not have much usage when
> > it is finally delivered?
> >
>
> Certainly a valid question...
>
> While rapid failover like this is unusual, it's easily possible for a
> sysadmin to do it. Maybe they moved the wrong service, or their downtime
> was for something very brief but the service had to be off of the host to
> make the change. In that case, a quick failover and back could easily
> be something that happens in a real environment.
>
> As to whether it's worth a ton of effort, that's a tough call. People want
> HA services to guard against outages. Anything that jeopardizes that is
> probably worth fixing. This could be solved with documentation, but a note
> like:
>
> "Be sure to wait for X minutes between failovers"
>
That's the real problem here. Given the problem as we've described it, it's
possible for X to be _large_, potentially indefinite.

> IMO, the ideal thing would be to make sure that the "old" server is
> ready to pick up the service again as soon as possible after the service
> leaves it.
>
Yes, this is really what needs to happen. In this environment, a floating IP
address effectively means that nfsd services can inadvertently 'share' a tcp
connection, and if nfsd is to play in a floating IP environment it needs to be
able to handle that sharing...

Neil

> --
> Jeff Layton <[email protected]>

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2008-06-09 17:14:56

by Wendy Cheng

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

Jeff Layton wrote:
> The problem we've run into is that occasionally they fail over to the
> alternate machine and then back very rapidly.

It is a well-known issue in the NFS-TCP failover arena (or more
specifically, for floating-IP applications) that failing over from server A
to server B and then immediately failing back from server B to A would
*not* work well. IIRC, in the last round of discussions with Red Hat GPS and
support folks, we concluded that most of the applications/users *can*
tolerate this restriction.

Maybe a more basic question: other than QA efforts, are there
real NFSv2/v3 applications depending on this "feature"? Or might this
take a ton of effort for something that will not have much usage when
it is finally delivered?

-- Wendy




2008-06-09 15:51:36

by J. Bruce Fields

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> I can think of 3 ways to fix this:
>
> 1) Add something like the "unlock_ip" interface that was recently
> added for NLM. Maybe a "close_ip" file that allows us to close all
> nfsd sockets connected to a given local IP address. So clustering
> software could do something like:
>
> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>
> ...and make sure that all of the sockets are closed.
>
> 2) Just use the same "unlock_ip" interface and have it also
> close sockets in addition to dropping locks.
>
> 3) Have an nfsd close all non-listening connections when it gets a
> certain signal (maybe SIGUSR1 or something). Connections on sockets
> that aren't failing over would just get a RST, and the clients would
> reopen their connections.
>
> ...my preference would probably be approach #1.

What do you see as the advantage of #1 over #2? Are there cases where
someone would want to drop locks but not also close connections (or
vice-versa)?

--b.

2008-06-09 19:10:11

by Jeff Layton

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, 9 Jun 2008 13:23:13 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote:
> > On Mon, 9 Jun 2008 11:51:36 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > > > I can think of 3 ways to fix this:
> > > >
> > > > 1) Add something like the "unlock_ip" interface that was recently
> > > > added for NLM. Maybe a "close_ip" file that allows us to close all
> > > > nfsd sockets connected to a given local IP address. So clustering
> > > > software could do something like:
> > > >
> > > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> > > >
> > > > ...and make sure that all of the sockets are closed.
> > > >
> > > > 2) Just use the same "unlock_ip" interface and have it also
> > > > close sockets in addition to dropping locks.
> > > >
> > > > 3) Have an nfsd close all non-listening connections when it gets a
> > > > certain signal (maybe SIGUSR1 or something). Connections on sockets
> > > > that aren't failing over would just get a RST, and the clients would
> > > > reopen their connections.
> > > >
> > > > ...my preference would probably be approach #1.
> > >
> > > What do you see as the advantage of #1 over #2? Are there cases where
> > > someone would want to drop locks but not also close connections (or
> > > vice-versa)?
> > >
> >
> > There's no real advantage that I can see (maybe if they're running a
> > cluster with no NLM services somehow). Mostly that "unlock_ip" seems to
> > imply that it deals with locking, and this doesn't. I'd be OK with #2
> > if it's a reasonable solution. Given what Chuck mentioned, it sounds
> > like we'll also need to take care to make sure that existing calls
> > complete and the replies get flushed out too, so this could be more
> > complicated than I had anticipated.
>
> It seems to me that in the long run what we'd like is a virtualized NFS
> service--you should be able to start and stop independent "servers"
> hosted on a single kernel, and to clients they should look like
> completely independent servers.
>
> And I guess the question is how little "virtualization" you can get away
> with and still have the whole thing work.

Yep. That was Lon's exact question. Could we start nfsd's that just
work for certain exports? The answer (of course) is currently no.

As an idle side thought, I wonder whether/how we could make nfsd
containerized? I wonder if it's possible to run a local nfsd in
a Solaris zone/container thingy.

>
> But anyway, ideally I think there'd be a single interface that says
> "shut down the nfs service provided via server ip x.y.z.w, for possible
> migration to another host". That's the only operation anyone really
> wants to do--independent control over the tcp connections, and the locks,
> and the rpc cache, and whatever else needs to be dealt with, sounds
> unlikely to be useful.
>

Ok. When I get some time to work on this, I'll plan to hook into the
current unlock_ip interface rather than creating a new procfile. That
does seem to make the most sense, though the name "unlock_ip" might not
adequately convey what it will now be doing...

--
Jeff Layton <[email protected]>

2008-06-09 17:23:13

by J. Bruce Fields

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote:
> On Mon, 9 Jun 2008 11:51:36 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > > I can think of 3 ways to fix this:
> > >
> > > 1) Add something like the "unlock_ip" interface that was recently
> > > added for NLM. Maybe a "close_ip" file that allows us to close all
> > > nfsd sockets connected to a given local IP address. So clustering
> > > software could do something like:
> > >
> > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> > >
> > > ...and make sure that all of the sockets are closed.
> > >
> > > 2) Just use the same "unlock_ip" interface and have it also
> > > close sockets in addition to dropping locks.
> > >
> > > 3) Have an nfsd close all non-listening connections when it gets a
> > > certain signal (maybe SIGUSR1 or something). Connections on sockets
> > > that aren't failing over would just get a RST, and the clients would
> > > reopen their connections.
> > >
> > > ...my preference would probably be approach #1.
> >
> > What do you see as the advantage of #1 over #2? Are there cases where
> > someone would want to drop locks but not also close connections (or
> > vice-versa)?
> >
>
> There's no real advantage that I can see (maybe if they're running a
> cluster with no NLM services somehow). Mostly that "unlock_ip" seems to
> imply that it deals with locking, and this doesn't. I'd be OK with #2
> if it's a reasonable solution. Given what Chuck mentioned, it sounds
> like we'll also need to take care to make sure that existing calls
> complete and the replies get flushed out too, so this could be more
> complicated than I had anticipated.

It seems to me that in the long run what we'd like is a virtualized NFS
service--you should be able to start and stop independent "servers"
hosted on a single kernel, and to clients they should look like
completely independent servers.

And I guess the question is how little "virtualization" you can get away
with and still have the whole thing work.

But anyway, ideally I think there'd be a single interface that says
"shut down the nfs service provided via server ip x.y.z.w, for possible
migration to another host". That's the only operation anyone really
wants to do--independent control over the tcp connections, and the locks,
and the rpc cache, and whatever else needs to be dealt with, sounds
unlikely to be useful.

--b.

2008-06-09 18:03:46

by J. Bruce Fields

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 12:40:05PM -0400, Talpey, Thomas wrote:
> At 12:22 PM 6/9/2008, Jeff Layton wrote:
> >That might be worth investigating, but sounds like it might cause problems
> >with the services associated with IP addresses that are staying on the
> >victim server.
>
> Jeff, I think you have many years of job security to look forward to, here. :-)
>
> Since you sent this to the NFSv4 list - is there any chance you're thinking
> of not transparently taking over IP addresses, but using NFSv4 locations and
> referrals for these "migrations"?

Yeah, definitely. We've got a prototype and some other work in
progress--hopefully there'll be something "real" in the coming months!

There's some overlap with nfsv2/v3, though (not in this case, but in the
need for lock migration, for example). And people really are using this
floating-ip address stuff now, so anything we can do to make it more
reliable or easier to use is welcome.

--b.

> Yes, I know some clients may not quite be
> there yet.

2008-06-09 19:01:05

by Jeff Layton

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, 09 Jun 2008 13:51:05 -0400
"Talpey, Thomas" <[email protected]> wrote:

> At 01:24 PM 6/9/2008, Jeff Layton wrote:
> >
> >"Be sure to wait for X minutes between failovers"
>
> At least one grace period.
>

Actually, we have to wait until all of the sockets on the old server
time out. This is difficult to predict and can be quite long.

> >
> >...wouldn't instill me with a lot of confidence. We'd have to have
> >some sort of mechanism to enforce this, and that would be less than
> >ideal.
> >
> >IMO, the ideal thing would be to make sure that the "old" server is
> >ready to pick up the service again as soon as possible after the service
> >leaves it.
>
> A great goal, but it seems to me you've bundled a lot of other
> incompatible requirements along with it. Having some services
> restart and not others, for example. And mixing transparent IP
> address takeover with stateful recovery such as TCP reconnect
> and NSM/NLM. NSM provides only notification; there's no way for
> either server to know for sure all the clients have completed
> either switch-to or switch-back.
>

Thanks for the slides -- very interesting.

Yep. NSM is risky, but this is really the same situation as a solo NFS
server spontaneously rebooting. The failover we're doing is really just
simulating that (for the case of lockd anyway). The unreliability is just
an unfortunate fact of life with NFSv2/3...

> Of course, you could switch to UDP-only; that would fix the
> TCP issue. But it won't fix NSM/NLM.
>

Right. Nothing can really fix that so we just have to make do. All of
the NSM/NLM stuff here is really separate from the main problem I'm
interested in at the moment, which is how to deal with the old, stale
sockets that nfsd has open after the local address disappears.

--
Jeff Layton <[email protected]>

2008-06-09 20:19:46

by Lon Hohberger

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?


On Mon, 2008-06-09 at 15:10 -0400, Jeff Layton wrote:
> > It seems to me that in the long run what we'd like is a virtualized NFS
> > service--you should be able to start and stop independent "servers"
> > hosted on a single kernel, and to clients they should look like
> > completely independent servers.
> >
> > And I guess the question is how little "virtualization" you can get away
> > with and still have the whole thing work.
>
> Yep. That was Lon's exact question. Could we start nfsd's that just
> work for certain exports? The answer (of course) is currently no.

s/exports/IP addresses/

-- Lon



2008-06-09 16:01:22

by Chuck Lever

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 9, 2008 at 11:49 AM, Jeff Layton <[email protected]> wrote:
> On Mon, 09 Jun 2008 11:37:27 -0400
> Peter Staubach <[email protected]> wrote:
>
>> Neil Horman wrote:
>> > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>> >
>> >> Jeff Layton wrote:
>> >>
>> >>> Apologies for the long email, but I ran into an interesting problem the
>> >>> other day and am looking for some feedback on my general approach to
>> >>> fixing it before I spend too much time on it:
>> >>>
>> >>> We (RH) have a cluster-suite product that some people use for making HA
>> >>> NFS services. When our QA folks test this, they often will start up
>> >>> some operations that do activity on an NFS mount from the cluster and
>> >>> then rapidly do failovers between cluster machines and make sure
>> >>> everything keeps moving along. The cluster is designed to not shut down
>> >>> nfsd's when a failover occurs. nfsd's are considered a "shared
>> >>> resource". It's possible that there could be multiple clustered
>> >>> services for NFS-sharing, so when a failover occurs, we just manipulate
>> >>> the exports table.
>> >>>
>> >>> The problem we've run into is that occasionally they fail over to the
>> >>> alternate machine and then back very rapidly. Because nfsd's are not
>> >>> shut down on failover, sockets are not closed. So what happens is
>> >>> something like this on TCP mounts:
>> >>>
>> >>> - client has NFS mount from clustered NFS service on one server
>> >>>
>> >>> - service fails over, new server doesn't know anything about the
>> >>> existing socket, so it sends a RST back to the client when data
>> >>> comes in. Client closes connection and reopens it and does some
>> >>> I/O on the socket.
>> >>>
>> >>> - service fails back to original server. The original socket there
>> >>> is still open, but now the TCP sequence numbers are off. When
>> >>> packets come into the server we end up with an ACK storm, and the
>> >>> client hangs for a long time.
>> >>>
>> >>> Neil Horman did a good writeup of this problem here for those that
>> >>> want the gory details:
>> >>>
>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
>> >>>
>> >>> I can think of 3 ways to fix this:
>> >>>
>> >>> 1) Add something like the "unlock_ip" interface that was recently
>> >>> added for NLM. Maybe a "close_ip" file that allows us to close all
>> >>> nfsd sockets connected to a given local IP address. So clustering
>> >>> software could do something like:
>> >>>
>> >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>> >>>
>> >>> ...and make sure that all of the sockets are closed.
>> >>>
>> >>> 2) Just use the same "unlock_ip" interface and have it also
>> >>> close sockets in addition to dropping locks.
>> >>>
>> >>> 3) Have an nfsd close all non-listening connections when it gets a
>> >>> certain signal (maybe SIGUSR1 or something). Connections on sockets
>> >>> that aren't failing over would just get a RST, and the clients would
>> >>> reopen their connections.
>> >>>
>> >>> ...my preference would probably be approach #1.
>> >>>
>> >>> I've only really done some rudimentary perusing of the code, so there
>> >>> may be roadblocks with some of these approaches I haven't considered.
>> >>> Does anyone have thoughts on the general problem or ideas for a solution?
>> >>>
>> >>> The situation is a bit specific to failover testing -- most people failing
>> >>> over don't do it so rapidly, but we'd still like to ensure that this
>> >>> problem doesn't occur if someone does do it.
>> >>>
>> >>> Thanks,
>> >>>
>> >>>
>> >> This doesn't sound like it would be an NFS specific situation.
>> >> Why doesn't TCP handle this, without causing an ACK storm?
>> >>
>> >>
>> >
>> > You're right, it's not a problem specific to NFS; any TCP-based service in
>> > which sockets are not explicitly closed by the application is subject to this
>> > problem. However, I think NFS is currently the only clustered service that we
>> > offer in which we explicitly leave nfsd running during such a 'soft' failover,
>> > and so, practically speaking, this is the only place that this issue manifests
>> > itself. If we could shut down nfsd on the server doing a failover, that would
>> > solve this problem (as it prevents the problem with all other clustered
>> > TCP-based services), but from what I'm told, that's a non-starter.
>> >
>> >
>>
>> I think that this last would be a good thing to pursue anyway,
>> or at least be able to understand why it would be considered to
>> be a "non-starter". When failing away a service, why not stop
>> the service on the original node?
>>
>
> Suppose you have more than one "NFS service". People do occasionally set
> up NFS exports in separate services. Also, there's the possibility of a
> mix of clustered + non-clustered exports. So shutting down nfsd could
> disrupt NFS services on any IP addresses that remain on the box.
>
> That said, we could maybe shut down nfsd and trust that retransmissions
> will take care of the problem. That could be racy though.

In that case, it might make sense to have an nfsd-specific mechanism
that allows you to fence exports instead of whole servers.

--
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales

2008-06-09 17:14:46

by J. Bruce Fields

Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

On Mon, Jun 09, 2008 at 12:09:48PM -0400, Talpey, Thomas wrote:
> At 12:01 PM 6/9/2008, Jeff Layton wrote:
> >On Mon, 09 Jun 2008 11:51:51 -0400
> >"Talpey, Thomas" <[email protected]> wrote:
> >
> >> At 11:18 AM 6/9/2008, Jeff Layton wrote:
> >> >No, it's not specific to NFS. It can happen to any "service" that
> >> >floats IP addresses between machines, but does not close the sockets
> >> >that are connected to those addresses. Most services that fail over
> >> >(at least in RH's cluster server) shut down the daemons on failover
> >> >too, which tends to mitigate this problem elsewhere.
> >>
> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> >> victim server?
> >
> >The victim server might have other nfsd/lockd's running on it. Stopping
> >all the nfsd's could bring down lockd, and then you have to deal with lock
> >recovery on the stuff that isn't moving to the other server.
>
> But but but... the IP address is the only identification the client can use
> to isolate a server.

Right.

> You're telling me that some locks will migrate and some won't? Good
> luck with that! The clients are going to be mightily confused.

Locks migrate or not depending on the server ip address. Where do you
see the confusion?

--b.