2006-01-17 12:20:54

by Richard Mueller

[permalink] [raw]
Subject: Strange Problems with ARP and Linux

Hy... (the same was posted to linux-net yesterday)

I experienced some strange behaviour with linux and the arp protocol.

1.) Kernel-Version: 2.6.11.7 plus grsec-patches

2.) Setup:

+--------+
| Router |
+---+----+
|
|
+------+-------+
| |
| | Transitnet for
| | Cluster/Router
+----+-----+ +-----+------+
| Primary | | Secondary |
+----+-----+ +-----+------+
| |
| | LAN
+--------------+

Router: C2600 router from ISP

Primary: First(active) linux router
Secondary: Secondary(standby) linux router

Primary/Secondary are configured as a cluster
with the heartbeat package.

The cluster shares a IP-Alias in the transitnet and
many IPs in the LAN-segments. The IP-Alias is always
bound to one node at the same time.

Following IPs and MACs are used for this example:

transit-net:
Router: 10.0.0.1/24 | 00:10:F3:09:10:70
Primary: 10.0.0.10/24 | 00:10:F3:09:11:71
Secondary: 10.0.0.11/24 | 00:10:F3:09:12:72
IP-Alias: 10.0.0.20/24 | depends where it ist bound to

lan:
Primary: 10.1.0.10/24 | 00:10:F3:10:11:71
Secondary: 10.1.0.11/24 | 00:10:F3:10:12:72
IP-Alias: 10.1.0.20/24 | depends where it ist bound to

3.) The Problem

First everything works fine. If I fail the primary node,
the secondary does the take over. The ARP-Entrys are
changing to the MAC of the secondary, and everything is
fine.

Now if you want to ping/ssh/somewhat the shared IP-Alias
in the LAN from the networks behind the C2600 everthing begins:

I. The C2600 is able to deliver the IP-packet to the node because
it has a valid arp-entry.

II. The Linux-machine (secondary) does not have any arp-entrys
(because it was inactive for a while) so it has to initiate
ARP before it can deliver the answer IP-packet.

Then IT HAPPENS:

The Linux Box asks in the transit net:

0.000000 10.1.0.20 -> Broadcast ARP Who has 10.0.0.1? Tell 10.1.0.20

Why does Linux make ARP-requests with SRC-IPs from a different subnet?
This can't be the expected behaviour... :(

BTW:
The C2600 is so "smart" to put an entry with
"10.1.0.20 -> 00:10:F3:09:12:72"
in its ARP-Cache, based on this single ARP-Broadcast
from 10.1.0.20 and after a failback to the primary nobody can reach the
10.1.0.20... :-)


4.) Solution: Dirty Userspace Fix
Ping the C2600 from the primary/secondary infinitely.
The same does a ping-group in heartbeat.
This can't be the real truth... ;-)

5.) Solution: Dirty Kernel-Patch
With my skillful hands I wrote a dirty hack:
<patch>
--- arp.c Fri Jan 13 16:44:06 2006
+++ arp.c.new Fri Jan 13 16:43:52 2006
@@ -342,9 +342,9 @@
switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
default:
case 0: /* By default announce any local IP */
- if (skb && inet_addr_type(skb->nh.iph->saddr) == RTN_LOCAL)
+ /* if (skb && inet_addr_type(skb->nh.iph->saddr) == RTN_LOCAL)
saddr = skb->nh.iph->saddr;
- break;
+ break; */
case 1: /* Restrict announcements of saddr in same subnet */
if (!skb)
break;
</patch>

6.) Solution: Clean Kernel-Patch
Can anybody improve this patch above to a clean one so that it finds
it way to the vanilla kernel?


bye
richard

--
Richard M?ller
Gesch?ftsf?hrer Technik

team(ix) GmbH
Powering Enterprise Linux Networks
S?dwestpark 35
90449 N?rnberg

fon: +49 (911) 30999- 0
fax: +49 (911) 30999-99
mail: [email protected]
web: http://www.teamix.de
vcf: http://www.teamix.de/vcf/rm.vcf
gpg: 296C 0BAF 8FC8 DCE2 99BD
5777 FA73 ECDC F9F1 8FF7


Subject: Re: Strange Problems with ARP and Linux

[email protected] (Richard Mueller) writes:
> Hy... (the same was posted to linux-net yesterday)
> [..]

It's the expected behavior. Look for arp_ignore and arp_announce under
linux/Documentation/networking/ip-sysctl.txt

--
Mathieu Chouquet-Stringer
"Le disparu, si l'on v?n?re sa m?moire, est plus pr?sent et
plus puissant que le vivant".
-- Antoine de Saint-Exup?ry, Citadelle --