Hi Kernel-people,
while playing with the Zebra routing daemon, I realized that neither
Linux nor Zebra is capable of detecting the operative state of an
interface, e.g. the ethernet link beat. This is a major show stopper
against using Linux for "serious" IP routing.
While my Zebra patches still need some testing, I'd like to present my
ideas for Linux to the community and get some feedback on whether this
is the right way to go.
Under Unix systems, the flag IFF_UP represents the administrative state
of an interface, while IFF_RUNNING should be set whenever the driver is
convinced that it actually can send and receive data.
Currently, IFF_RUNNING in net_device->flags is not used at all;
rtnetlink and the query ioctls use netif_carrier_ok(), which tests the
__LINK_STATE_NO_CARRIER bit in net_device->state. Looking back into the
LKML archives, the reason is that the flags element must not be changed
from interrupt context.
So what about the following idea: The network interface drivers use the
netif_carrier_on() and netif_carrier_off() functions to update their
interface card status (a bunch of drivers already do). To get this
information forwarded to user mode via netlink socket, we use a kernel
thread that goes through the device list, and every time IFF_RUNNING
and netif_carrier_ok() differ, IFF_RUNNING is updated and a message is
sent via netlink.
Attached you'll find a patch against 2.4.17 for the tulip driver to
forward (again) link beat information for MII transceivers and a small
kernel module for 2.4 that implements the kernel thread. Please forgive
the crude way of unloading, this is my first kernel hacking attempt
since 1.2 ;-)
If this is the right direction, I'll submit a more elegant patch for 2.5
in the coming weeks.
Note that there is still source code in drivers/net that plays with
IFF_RUNNING directly. As I don't own these cards, let me just list the
files so that the maintainers can have a look: bmac.c,
wan/lmc/lmc_main.c, wan/comx-proto-fr.c, tlan.c, eepro100.c,
au1000_eth.c, tokenring/ibmtr.c and bonding.c.
diff -urN -X dontdiff linux-2.4.17/drivers/net/tulip/21142.c linux-2.4.17stefan/drivers/net/tulip/21142.c
--- linux-2.4.17/drivers/net/tulip/21142.c Fri Nov 9 22:45:35 2001
+++ linux-2.4.17stefan/drivers/net/tulip/21142.c Sat Jan 19 15:28:57 2002
@@ -39,8 +39,12 @@
printk(KERN_INFO"%s: 21143 negotiation status %8.8x, %s.\n",
dev->name, csr12, medianame[dev->if_port]);
if (tulip_media_cap[dev->if_port] & MediaIsMII) {
- tulip_check_duplex(dev);
- next_tick = 60*HZ;
+ if (tulip_check_duplex(dev) < 0) {
+ netif_carrier_off(dev);
+ } else {
+ netif_carrier_on(dev);
+ }
+ next_tick = 3*HZ;
} else if (tp->nwayset) {
/* Don't screw up a negotiated session! */
if (tulip_debug > 1)
diff -urN -X dontdiff linux-2.4.17/drivers/net/tulip/timer.c linux-2.4.17stefan/drivers/net/tulip/timer.c
--- linux-2.4.17/drivers/net/tulip/timer.c Wed Jun 20 20:15:44 2001
+++ linux-2.4.17stefan/drivers/net/tulip/timer.c Sat Jan 19 15:30:35 2002
@@ -137,10 +137,10 @@
medianame[mleaf->media & MEDIA_MASK]);
if ((p[2] & 0x61) == 0x01) /* Bogus Znyx board. */
goto actually_mii;
- /* netif_carrier_on(dev); */
+ netif_carrier_on(dev);
break;
}
- /* netif_carrier_off(dev); */
+ netif_carrier_off(dev);
if (tp->medialock)
break;
select_next_media:
@@ -164,10 +164,11 @@
}
case 1: case 3: /* 21140, 21142 MII */
actually_mii:
- if (tulip_check_duplex(dev) < 0)
- { /* netif_carrier_off(dev); */ }
- else
- { /* netif_carrier_on(dev); */ }
+ if (tulip_check_duplex(dev) < 0) {
+ netif_carrier_off(dev);
+ } else {
+ netif_carrier_on(dev);
+ }
next_tick = 60*HZ;
break;
case 2: /* 21142 serial block has no link beat. */
And the module source code dev_watch.c:
/* Linux network device status watcher
(C) 2002 Stefan Rompf
GPL Version 2 applies
compile with something like
gcc -c -DMODULE -D__KERNEL__ -O2 dev_watch.c
*/
#include <linux/sched.h>
#include <linux/config.h>
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/if.h>
#include <linux/rtnetlink.h>
static volatile pid_t me;
static volatile int running = 1;
static int devwatcher(void *dummy)
{
	struct net_device *dev;

	daemonize();
	strcpy(current->comm, "devwatchd");

	while (running) {
		/* Writers of the device list hold the rtnl semaphore, so
		   holding it is enough to walk dev_base. rtnl_lock() and
		   netdev_state_change() may sleep, so they must not be
		   called with the dev_base_lock spinlock held. */
		rtnl_lock();
		for (dev = dev_base; dev; dev = dev->next) {
			/* State of netif_carrier_ok() is reflected
			   into dev->flags by this loop, and a netlink
			   message is emitted whenever the state
			   changes */
			if (dev->flags & IFF_RUNNING) {
				if (!netif_carrier_ok(dev)) {
					dev->flags &= ~IFF_RUNNING;
					netdev_state_change(dev);
				}
			} else {
				if (netif_carrier_ok(dev)) {
					dev->flags |= IFF_RUNNING;
					netdev_state_change(dev);
				}
			}
		}
		rtnl_unlock();

		/* Sleep for one second between scans */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(HZ);
	}
	me = 0;
	return 0;
}
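static int __init devwatch_init(void)
{
	me = kernel_thread(devwatcher, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
	/* kernel_thread() returns the new pid, or a negative errno on failure */
	if (me < 0)
		return me;
	return 0;
}
static void __exit devwatch_cleanup(void)
{
	/* Ask the thread to stop, then wait until it has cleared 'me' */
	running = 0;
	while (me)
		schedule();
}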
EXPORT_NO_SYMBOLS;
module_init(devwatch_init);
module_exit(devwatch_cleanup);
Cheers, Stefan
Stefan Rompf wrote:
> So what about the following idea: The network interface drivers use the
> netif_carrier_on() and netif_carrier_off() functions to update their
> interface card status (a bunch of drivers already do). To get this
> information forwarded to user mode via netlink socket, we use a kernel
> thread that goes through the device list, and everytime IFF_RUNNING
> and netif_carrier_ok() differ, IFF_RUNNING is updated and a message is
> sent via netlink.
In fact, that's the 2.5 plan -- send link up/down notifications via
netlink. If you provide a patch, so much the better.
For 2.4, you'll need to poll ETHTOOL_GLINK, assuming that the driver
supports ethtool.
In any case, your tulip patch looks pretty good on first review. I'll
likely apply it to 2.4 and 2.5 after testing and further review.
Jeff
--
Jeff Garzik | Alternate titles for LOTR:
Building 1024 | Fast Times at Uruk-Hai
MandrakeSoft | The Took, the Elf, His Daughter and Her Lover
| Samwise Gamgee: International Hobbit of Mystery
Stefan Rompf writes:
> while playing with the Zebra routing daemon, I realized that neither
> Linux nor Zebra are capable of detecting the operative state of an
> interface, f.e. the ethernet link beat. This is a major show stopper
> against using Linux for "serious" IP routing.
It's only when you assume that the link beat is a "serious" sign for
link healthiness. Unfortunately there are many error cases where a link
can fail, but the link beat is still there - for example the software
on the other machine crashing but the NIC still working fine. These
seem to be the majority of the cases in fact except for demo situations
where people pull cables on purpose. To handle all the other cases you
need a separate heartbeat protocol that actually checks if the higher
layers above the networking card are alive on the peer too. Most routing
protocols do this already in fact, e.g. OSPF or RIP with their 'hello'
packets. The Linux IP stack does it also using ARP probes. When a probe
is not answered the routing daemon eventually notices and takes action.
While waiting for probes is a bit slower (30-60s usually), checking
the link beat only handles such a small subset of cases that it is not
worth it to optimize these rare ones.
Commercial vendors seem to like it because it looks good in demos ;),
but Linux fortunately doesn't have to be concerned with such marketing
reasoning.
In short - Linux doesn't have this feature because it's not needed.
If your routing protocol relies on link state checking without other
probing it's broken. Zebra isn't.
-Andi
Hi,
> interface, f.e. the ethernet link beat. This is a major show stopper
> against using Linux for "serious" IP routing.
[...]
> So what about the following idea: The network interface drivers use the
> netif_carrier_on() and netif_carrier_off() functions to update their
> interface card status (a bunch of drivers already do). To get this
> information forwarded to user mode via netlink socket, we use a kernel
> thread that goes through the device list, and everytime IFF_RUNNING
> and netif_carrier_ok() differ, IFF_RUNNING is updated and a message is
> sent via netlink.
A slightly different approach could be to use the alternative routes
using the patch from Julian Anastasov [1] and set RTNH_F_BADSTATE. With
the RTNH_F_DEAD we can select a new route in the multipath route setup.
You set up multiple routes and in case of link state problems you mark
the route dead and the routing code will not select the routes defined
with that interface anymore. Together with an active NUD you get a
fairly decent responsive HA system. But maybe I'm way off-topic and I
misunderstood your concerns.
[1] http://www.linux-vs.org/~julian/01_alt_routes-2.4.12-5.diff
Cheers,
Roberto Nibali, ratz
Andi Kleen wrote:
> It's only when you assume that the link beat is a "serious" sign for
> link healthiness. Unfortunately there are many error cases where a link
> can fail, but the link beat is still there - for example the software
> on the other machine crashing but the NIC still working fine. These
> seem to be the majority of the cases in fact except for demo situations
> where people pull cables on purpose.
I disagree. There are two very real situations that come to my mind
spontaneously:
- Many telecom providers loop back a synchronous serial line whenever the
connection fails somewhere on the WAN. While this actually isn't a link
beat, it can be detected as an interface failure instantly
(!IFF_RUNNING) and is therefore much faster than any (even highly tuned)
routing protocol.
- You have two redundant routers pointing into an ethernet segment, and
if the NIC of one router starts failing, the switch might decide to turn
off the port in question because of excessive errors (many switches do).
Or the switch may simply lose power. Without link beat detection, the
router in question cannot see these situations and will happily continue
advertising the connected network, making the automated backup
mechanism fail. ARP probing is no solution as it can only detect a host
on the network going down, not the interface.
Of course, link beat detection is not the magic lantern making all
networking problems vanish, but from my networking experience it is
important enough to support it.
Stefan
Stefan Rompf wrote:
> Note that there is still sourcecode in drivers/net that plays with
> IFF_RUNNING directly. As I don't own these cards, let me just list the
> files so that the maintainers can have a look: bmac.c,
> wan/lmc/lmc_main.c, wan/comx-proto-fr.c, tlan.c, eepro100.c,
> au1000_eth.c, tokenring/ibmtr.c and bonding.c.
I recently fixed this in eepro100.c, in 2.4.18-pre9 and later. Patches
accepted to clear up the others.
> diff -urN -X dontdiff linux-2.4.17/drivers/net/tulip/21142.c linux-2.4.17stefan/drivers/net/tulip/21142.c
> --- linux-2.4.17/drivers/net/tulip/21142.c Fri Nov 9 22:45:35 2001
> +++ linux-2.4.17stefan/drivers/net/tulip/21142.c Sat Jan 19 15:28:57
Thanks, patch applied to 2.4 and 2.5, with slight modifications.
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com
Andi,
> In short - Linux doesn't have this feature because it's not needed.
> If your routing protocol relies on link state checking without other
> probing it's broken. Zebra isn't.
Link state checking _would_ be nice, though, for stuff like laptops
using dhcp. Right now, if I start my laptop without being plugged into
the network, I have to wait for DHCP to time out before it'll finish
booting - about 1 minute. (yech!)
Is there a good way for a dhcp daemon to find out whether the laptop is
connected to the network or not? It'd be really sweet if dhcpcd could:
- down the interface and remove routes whenever the network cable was
unplugged
- wait for the interface to get link beat again, and send a new request
This would greatly enhance desktop usability. I'd be happy to do the
dhcpcd hacking if the right kernel interfaces were available.
regards,
David
--
David L. Parsley
Network Administrator, Roanoke College
"If I have seen further it is by standing on ye shoulders of Giants."
--Isaac Newton
"David L. Parsley" wrote:
> Is there a good way for a dhcp daemon to find out whether the laptop is
> connected to the network or not? It'd be really sweet if dhcpcd could:
>
> - down the interface and remove routes whenever the network cable was
> unplugged
> - wait for the interface to get link beat again, and send a new request
>
> This would greatly enhance desktop usability. I'd be happy to do the
> dhcpcd hacking if the right kernel interfaces were available.
If the network driver supports querying of the MII registers (usually done with
the private ioctl stuff) then you can query the link beat from userspace.
Depending on the driver/chip this may be a very fast query, or it may take
upwards of 100us with interrupts disabled.
I can post example code if you like...
Now what I would *really* like to see would be a way to get asynchronous
notification of userspace processes on link beat change. Of course, depending
on NIC this would require an interrupt handler or a kernel thread periodically
checking the link state, as well as some way to pass that information to the
user (netlink socket, interrupt...not sure what the best would be).
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
Hi,
> Now what I would *really* like to see would be a way to get asynchronous
> notification of userspace processes on link beat change. Of course, depending
> on NIC this would require an interrupt handler or a kernel thread periodically
> checking the link state, as well as some way to pass that information to the
> user (netlink socket, interrupt...not sure what the best would be).
I have already written code that sends a link state notification via
netlink socket, but still have to create the patches for 2.5, maybe
2.4ac and later 2.4. Just give me a few more days; if you work as a
software developer in your day job, you are not always motivated to
run straight to your private computer after work, at least I am not ;-)
Stefan