2007-05-02 22:05:13

by Glen Turner

Subject: Detecting process death for anycast named process monitoring


Hi folks,

Anycast services are a nice way of robustly offering DNS and other
services. We create an interface which reflects the availability
of the service and advertise that into the network using an OSPF
router like Quagga.

For more detail see
http://www.aarnet.edu.au/~gdt/presentations/2006-07-18-linuxsa-anycast/
which is a summary of work which was presented at linux.conf.au.

The question is, how can a process with no relationship to another
process detect that process unexpectedly dying? If named goes
away to a better place, we want to shut down the interface
which causes Quagga to inject the anycast route.

We don't want to be the parent of the running process, because that
doesn't add robustness. If the parent process dies, then the service
dies, and the interface still stays up.

We don't want to poll, because that isn't pretty and the polling
interval needs to be very short on a big ISP's DNS servers.
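
For reference, the polling approach being ruled out here can be sketched
roughly as follows. This is only an illustration: the function names and
the 0.1 second interval are assumptions, not anything from the deployed
setup. It relies on signal 0, which checks for a process's existence
without delivering anything.

```python
import os
import time


def is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def wait_for_exit(pid: int, interval: float = 0.1) -> None:
    """Block until the process is gone, checking once per interval (hypothetical default)."""
    while is_alive(pid):
        time.sleep(interval)
```

The weaknesses are the ones described above: death is noticed up to one
interval late, a zombie still counts as alive, and a recycled PID can be
mistaken for the original process.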

I have tried using the various notify functions against /proc, but
they don't work for that filesystem. I have tried using notify
against a UNIX domain socket, but notify doesn't work for
that either.

Suggestions, or a patch to support notify for /proc or to push
process death notifications into DBUS or whatever, are welcome.

Thank you, Glen


2007-05-02 22:31:08

by Chris Friesen

Subject: Re: Detecting process death for anycast named process monitoring

Glen Turner wrote:

> The question is, how can a process with no relationship to another
> process detect that process unexpectedly dying? If named goes
> away to a better place, we want to shut down the interface
> which causes Quagga to inject the anycast route.
>
> We don't want to be the parent of the running process, because that
> doesn't add robustness. If the parent process dies, then the service
> dies, and the interface still stays up.
>
> We don't want to poll, because that isn't pretty and the polling
> interval needs to be very short on a big ISP's DNS servers.

We did something similar: arbitrary processes can register to be sent
an arbitrary signal when the state of another process changes. The
caller passes in the pid, the signal to be sent, and an event mask
describing which events it is interested in (stop/start/exit/kill/etc.).

A signal number of 0 means to deregister interest in the specified pid.

Chris

2007-05-03 08:01:51

by Russell King

Subject: Re: Detecting process death for anycast named process monitoring

On Thu, May 03, 2007 at 06:34:42AM +0930, Glen Turner wrote:
> We don't want to be the parent of the running process, because that
> doesn't add robustness. If the parent process dies, then the service
> dies, and the interface still stays up.

Okay.

> We don't want to poll, because that isn't pretty and the polling
> interval needs to be very short on a big ISP's DNS servers.

If you did have a process which polls for the service, what happens if
that process dies?

> I have tried using the various notify functions against /proc, but
> they don't work for that filesystem. I have tried using notify
> against a UNIX domain socket, but notify doesn't work for
> that either.
>
> Suggestions, or a patch to support notify for /proc or to push
> process death notifications into DBUS or whatever, are welcome.

What if the dbus system dies? What if your monitoring process dies?

Surely a simple solution is going to be the best solution? Given that
you're always going to have another process (which might be killed)
your thought about having a parent process monitor the death of the
child seems to be the simplest.

You could also have that process interact with a watchdog, so failures
with that process cause a reboot.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2007-05-03 08:55:13

by Glen Turner

Subject: Re: Detecting process death for anycast named process monitoring


Hi Russell,

Thanks for your answer.

> If you did have a process which polls for the service, what happens if
> that process dies?

The failure mode is good. The monitoring process dies, the
interface stays up, ospfd keeps advertising the route, named
keeps running. We pick up the lack of a monitoring process
in Nagios, manually down lo:1 so the traffic goes elsewhere
and investigate the fault. No customer impact unless
named dies for some independent reason before the NOC staff
down lo:1.

> ... Given that
> you're always going to have another process (which might be killed)
> your thought about having a parent process monitor the death of the
> child seems to be the simplest.

The failure mode for a parent monitoring process is not good.
The monitoring process dies, the interface stays up, ospfd
keeps advertising the route, the child named dies. Since
we still have incoming DNS requests but no running DNS server,
customers will need to timeout and try the next DNS server
in their /etc/resolv.conf. So the customer impact is severely
reduced web performance until the NOC staff log in and down
lo:1.

As you can see, the basic requirement is for the lo:1 interface
to track the state of the named process at all times.

> What if the dbus system dies? What if your monitoring process dies?

As long as these don't kill named whilst failing, we have enough
time to sort it out manually. Nagios (or whatever system health
monitor you choose to configure) will hassle the Network Operations
Center in short order.

> Surely a simple solution is going to be the best solution?

That's why I'm posting here. I'd settle for some simple answer,
even if it is particular to Linux.

> You could also have that process interact with a watchdog, so failures
> with that process cause a reboot.

No need. Dropping lo:1 makes the DNS traffic go to a healthier
server. Then the box can be left as-is so the sysadmins can take
it apart to find the basic cause of the fault.

Thanks for your thoughts. Some monitoring mechanism that didn't
kill named if it goes wrong would be fantastic.

Glen

2007-05-03 09:40:57

by Andrew Morton

Subject: Re: Detecting process death for anycast named process monitoring

On Thu, 03 May 2007 06:34:42 +0930 Glen Turner wrote:

> The question is, how can a process with no relationship to another
> process detect that process unexpectedly dying?

Monitor the system using the taskstats interface. There is a sample
application and documentation in Documentation/accounting/.

Your monitoring application will receive a netlink packet each time a process
exits. It includes the exit code and the process's name.
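
The decision a monitor makes on each exit record might look something
like the sketch below. It deliberately leaves out the netlink side (the
sample application in Documentation/accounting/ covers that) and assumes
some receiver has already turned each record into a pid, a process name
and an exit code. ExitEvent, action_for and the use of "ifconfig lo:1
down" as the reaction are illustrative assumptions, not part of the
taskstats interface.

```python
from typing import List, NamedTuple, Optional


class ExitEvent(NamedTuple):
    """Hypothetical, already-decoded view of one exit record."""
    pid: int
    comm: str       # process name carried in the exit record
    exit_code: int


def action_for(event: ExitEvent,
               watched_comm: str = "named",
               watched_pid: Optional[int] = None,
               interface: str = "lo:1") -> Optional[List[str]]:
    """Return the command to run for this exit, or None to ignore it."""
    if event.comm != watched_comm:
        return None
    # Optionally pin to one PID (e.g. read from a pidfile) so that another
    # process that happens to share the name does not trigger the action.
    if watched_pid is not None and event.pid != watched_pid:
        return None
    return ["ifconfig", interface, "down"]
```

For example, action_for(ExitEvent(100, "named", 1)) yields the command
to down lo:1, while an exit of any other process is ignored. Matching on
the name alone is the simplest option; pinning the PID is stricter but
needs updating whenever named is restarted.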

2007-05-03 10:17:21

by Glen Turner

Subject: Re: Detecting process death for anycast named process monitoring

On Thu, 2007-05-03 at 02:40 -0700, Andrew Morton wrote:
> Monitor the system using the taskstats interface. There is a sample
> application and documentation in Documentation/accounting/.
>
> Your monitoring application will receive a netlink packet each time a process
> exits. It includes the exit code and the process's name.

Marvellous, just what is needed. Thank you Andrew.

2007-05-04 01:59:22

by David M. Lloyd

Subject: Re: Detecting process death for anycast named process monitoring

On Wed, 2007-05-02 at 16:30 -0600, Chris Friesen wrote:
> Glen Turner wrote:
>
> > The question is, how can a process with no relationship to another
> > process detect that process unexpectedly dying? If named goes
> > away to a better place, we want to shut down the interface
> > which causes Quagga to inject the anycast route.

> We did something similar where arbitrary processes can register to be
> sent an arbitrary signal when the state of other processes changes.

What about something like inotify, but for processes? That would be
cool...

- DML

2007-05-04 05:54:38

by Russell King

Subject: Re: Detecting process death for anycast named process monitoring

On Wed, May 02, 2007 at 06:12:27PM -0500, David M. Lloyd wrote:
> On Wed, 2007-05-02 at 16:30 -0600, Chris Friesen wrote:
> > Glen Turner wrote:
> >
> > > The question is, how can a process with no relationship to another
> > > process detect that process unexpectedly dying? If named goes
> > > away to a better place, we want to shut down the interface
> > > which causes Quagga to inject the anycast route.
>
> > We did something similar where arbitrary processes can register to be
> > sent an arbitrary signal when the state of other processes changes.
>
> What about something like inotify, but for processes? That would be
> cool...

Or maybe just ignoring the SIGHUP before exec'ing the named process as
a child.
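
One reading of that suggestion, as a rough sketch: the parent sets SIGHUP
to "ignore" in the child just before the exec. An ignored disposition
survives exec, so a hangup that takes out the parent's session would not
also take out the service. The function names and the on_exit callback
are assumptions for illustration, and a program that installs its own
SIGHUP handler after starting would override the inherited setting.

```python
import signal
import subprocess


def spawn_ignoring_sighup(argv):
    """Start argv as a child with SIGHUP ignored from the moment it execs."""
    def ignore_hup():
        # Runs in the child between fork and exec; ignored signals stay ignored across exec.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
    return subprocess.Popen(argv, preexec_fn=ignore_hup)


def supervise(argv, on_exit):
    """Run argv as a child, block until it exits, then call on_exit(status)."""
    child = spawn_ignoring_sighup(argv)
    status = child.wait()
    on_exit(status)  # e.g. a hypothetical hook that downs the anycast interface
    return status
```

This only covers death by SIGHUP. It does not help if the parent is
killed some other way while the child keeps running, and that is the
case in which the interface would stop tracking the service.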

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: