2002-03-21 20:29:42

by Patrick McHardy

Subject: Updated Equalize patch

Hi!

I've updated the equalize patch to apply on 2.4.18.
The patch also addresses two race conditions in
ip_route_input(..) and ip_route_output_key(..).
The rt_hash_table entry is only read locked although elements
from the chain can be freed if there is a matching entry with
RTCF_EQUALIZE set.
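
For readers who want to see the shape of the fix outside the kernel, here is a minimal single-threaded sketch in plain C. It is not code from the patch: `struct entry`, `struct bucket`, `bucket_add` and `lookup_or_evict` are hypothetical names, and the lock is modeled as a simple state field rather than a real reader/writer lock. It only illustrates why a lookup that may unlink and free a chain element has to hold the lock exclusively.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy lock state; a real implementation would use an actual rwlock. */
enum lock_state { UNLOCKED, READ_LOCKED, WRITE_LOCKED };

/* Hypothetical stand-in for a route cache entry (not struct rtable). */
struct entry {
    struct entry *next;
    unsigned key;
    int equalize;            /* plays the role of RTCF_EQUALIZE */
};

/* One hash bucket: a singly linked chain plus its lock. */
struct bucket {
    enum lock_state lock;
    struct entry *chain;
};

/* Prepend a new entry to the chain. */
static void bucket_add(struct bucket *b, unsigned key, int equalize)
{
    struct entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->key = key;
    e->equalize = equalize;
    e->next = b->chain;
    b->chain = e;
}

/*
 * Look up key. A matching entry flagged equalize is unlinked and freed,
 * and NULL is returned so the caller recomputes the route. Because the
 * chain may be modified, the lock is taken for writing, not for reading.
 */
static struct entry *lookup_or_evict(struct bucket *b, unsigned key)
{
    struct entry *e, **ep;
    struct entry *hit = NULL;

    assert(b->lock == UNLOCKED);
    b->lock = WRITE_LOCKED;
    for (ep = &b->chain; (e = *ep) != NULL; ep = &e->next) {
        if (e->key != key)
            continue;
        if (e->equalize) {
            /* Unlinking needs exclusive access to the chain. */
            assert(b->lock == WRITE_LOCKED);
            *ep = e->next;   /* unlink through the pointer-to-pointer */
            free(e);
        } else {
            hit = e;         /* real code would take a reference here */
        }
        break;
    }
    b->lock = UNLOCKED;
    return hit;
}
```

With only a read lock, two lookups could walk the same chain at once while one of them frees an element the other is still visiting, which is the race described above.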

Bye,
Patrick


Attachments:
equalize_2.4.18.patch (10.78 kB)

2002-03-21 20:39:02

by Guus Sliepen

Subject: Re: Updated Equalize patch

On Thu, Mar 21, 2002 at 09:28:32PM +0100, Patrick McHardy wrote:

> I've updated the equalize patch to apply on 2.4.18.
> The patch also addresses two race conditions in
> ip_route_input(..) and ip_route_output_key(..).
> The rt_hash_table entry is only read locked although elements
> from the chain can be freed if there is a matching entry with
> RTCF_EQUALIZE set.

Thank you very much! I've added it to the FTP site. I'd like to know if
there is anything the patch does that can't be done with the bonding
module, because otherwise I'd suggest using the latter (it's much
cleaner and handles all Ethernet protocols).

--
Met vriendelijke groet / with kind regards,
Guus Sliepen <[email protected]>



2002-03-21 20:55:22

by Patrick McHardy

Subject: Re: Updated Equalize patch

Guus Sliepen wrote:
>
> On Thu, Mar 21, 2002 at 09:28:32PM +0100, Patrick McHardy wrote:
>
> > I've updated the equalize patch to apply on 2.4.18.
> > The patch also addresses two race conditions in
> > ip_route_input(..) and ip_route_output_key(..).
> > The rt_hash_table entry is only read locked although elements
> > from the chain can be freed if there is a matching entry with
> > RTCF_EQUALIZE set.
>
> Thank you very much! I've added it to the FTP site. I'd like to know if
> there is anything the patch does that can't be done with the bonding
> module, because otherwise I'd suggest using the latter (it's much
> cleaner and handles all Ethernet protocols).

You're welcome :)
I'm using it (or rather, trying to) to bond two ppp links together
(actually the uplinks) without provider support, by distributing the
traffic according to uplink bandwidth and spoofing the source IP of
link1 on link2.
I've not tried the bonding module, but AFAIK it can't be used with ppp
links and has to be supported by the other end, so equalize looks like
the only way to do it, especially since I've got only very few
high-bandwidth connections and one link is a lot slower than the other,
so I can't rely on normal multipath routing to distribute traffic
correctly.
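
To illustrate the idea of splitting packets in proportion to link bandwidth, here is a minimal sketch in plain C. It is a simplified weighted round-robin driven by a per-packet counter, not the kernel's `fib_select_multipath()`; `struct link`, `pick_link` and the device names `ppp0`/`ppp1` are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one uplink and its relative weight. */
struct link {
    const char *name;
    unsigned weight;         /* e.g. proportional to link bandwidth */
};

/*
 * Choose the link for the next packet. The counter advances once per
 * packet, so over any window of sum(weights) packets each link is
 * chosen exactly weight times.
 */
static size_t pick_link(const struct link *links, size_t n_links,
                        unsigned *counter)
{
    unsigned total = 0;
    unsigned w;
    size_t i;

    for (i = 0; i < n_links; i++)
        total += links[i].weight;
    assert(total > 0);

    w = (*counter)++ % total;
    for (i = 0; i < n_links; i++) {
        if (w < links[i].weight)
            return i;
        w -= links[i].weight;
    }
    return n_links - 1;      /* not reached while total > 0 */
}
```

For example, with weights 3 and 1 the first link carries three of every four packets. Note that this simple scheme sends packets to the same link in short bursts; it does not interleave them.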

Bye,
Patrick

2002-03-21 21:05:45

by Guus Sliepen

Subject: Re: Updated Equalize patch

On Thu, Mar 21, 2002 at 09:54:53PM +0100, Patrick McHardy wrote:

> > > I've updated the equalize patch to apply on 2.4.18.
[...]
> > Thank you very much! I've added it to the FTP site. I'd like to know if
> > there is anything the patch does that can't be done with the bonding
> > module, because otherwise I'd suggest using the latter (it's much
> > cleaner and handles all Ethernet protocols).
[...]
> I've not tried the bonding module but afaik it can't be used with ppp
> links

True, it works only with Ethernet devices AFAIK. But from what I saw the
design of the bonding device might actually be made to work with ppp
devices as well...

> and has to be supported by the other end so equalize looks like

That's not true.

> the only way to do it, especially since i've got only very few high
> bandwidth connections and one link is a lot slower than the other so i
> can't rely on normal multipath routing to distribute traffic correct.

Aha. I'll try out your updated patch later, but can you tell me if it
works without warnings or errors for you? What is the maximum throughput
you get?

--
Met vriendelijke groet / with kind regards,
Guus Sliepen <[email protected]>



2002-03-22 16:36:02

by Patrick McHardy

Subject: Re: Updated Equalize patch

diff -urN linux-2.4.18-clean/Documentation/networking/load-balancing.txt linux-2.4.18/Documentation/networking/load-balancing.txt
--- linux-2.4.18-clean/Documentation/networking/load-balancing.txt Thu Jan 1 01:00:00 1970
+++ linux-2.4.18/Documentation/networking/load-balancing.txt Fri Mar 22 17:21:21 2002
@@ -0,0 +1,125 @@
+Load balancing using multipaths (patch version: 5)
+==================================================
+
+Contact Guus Sliepen <[email protected]> if you need help, want to know
+more, have remarks or further ideas relating to this.
+
+Intro
+-----
+
+If you have multiple physical network links to another computer, and you want
+some kind of load balancing, you can now do so. Please note that this only
+applies to IPv4 traffic, not to IPX, IPv6 or any other protocol (yet).
+
+Needed
+------
+
+* LATEST iproute package from ftp://ftp.inr.ac.ru/ip-routing/
+* CONFIG_IP_ROUTE_MULTIPATH enabled in kernel configuration (it's in
+ Networking options, below the Advanced Router option you'll have to enable
+ too)
+* Of course you must also have patched your kernel and recompiled it for this
+ feature to be enabled.
+
+To do
+-----
+
+* Make sure the devices you want to combine are up and that they all accept
+ the packets you want to send (i.e., they must all have the same IP
+ address/netmask or something clever to get the same result)
+* Just to make sure, remove any routes via those devices (route del ...)
+* Now add all routes via one iproute command using the 'nexthop' statement:
+
+ ip route add <destaddress>/<netmask> equalize \
+ nexthop dev <first device> \
+ nexthop dev <second device> \
+ nexthop ...
+
+* Just to make sure, flush route cache:
+
+ echo 1 >/proc/sys/net/ipv4/route/flush
+
+Example
+-------
+
+This is an example showing how to make a 20 Mbit connection between two
+computers using two 10 Mbit Ethernet cards per computer. Computer 1 has IP
+192.168.1.1 and computer 2 has IP 192.168.1.2. We start from scratch:
+
+[computer1]~/>ifconfig eth0 192.168.1.1 netmask 255.255.255.0
+[computer1]~/>route del -net 192.168.1.0 netmask 255.255.255.0
+[computer1]~/>ifconfig eth1 192.168.1.1 netmask 255.255.255.0
+[computer1]~/>route del -net 192.168.1.0 netmask 255.255.255.0
+[computer1]~/>ip route add 192.168.1.0/24 equalize nexthop dev eth0 nexthop dev eth1
+[computer1]~/>echo 1 >/proc/sys/net/ipv4/route/flush
+
+[computer2]~/>ifconfig eth0 192.168.1.2 netmask 255.255.255.0
+[computer2]~/>route del -net 192.168.1.0 netmask 255.255.255.0
+[computer2]~/>ifconfig eth1 192.168.1.2 netmask 255.255.255.0
+[computer2]~/>route del -net 192.168.1.0 netmask 255.255.255.0
+[computer2]~/>ip route add 192.168.1.0/24 equalize nexthop dev eth0 nexthop dev eth1
+[computer2]~/>echo 1 >/proc/sys/net/ipv4/route/flush
+
+You can even add more computers, just replace the x in 192.168.1.x with the
+number of your computer, and make sure all eth0's are connected to each other
+and all eth1's. You can also use more devices, just ifconfig them all and
+remove the default routes that are generated, and add extra nexthops.
+
+Notes
+-----
+
+If you want to add a gateway entry in your routing table, and want it to be
+balanced too, you first have to make singlepath entries for every network
+interface you want to use, after that add the gateway with all the nexthops
+filled in, then delete the singlepath routes and then add the normal
+multipath route.
+
+Older patch versions used a /proc entry to control load-balancing. This does
+not work anymore. You should use the 'equalize' flag instead while adding new
+routes. You need a fresh version of iproute for that.
+
+Status
+------
+
+Packet type:        Balanced?  Note
+----------------------------------------------------------------------
+ARP                 no         But we don't want them to ;)
+ICMP                yes
+Connectionless UDP  yes
+Connected UDP       yes
+Broadcast UDP       no         Would be nice if it would,
+                               but this is rarely used for
+                               high bandwidth data transfers.
+TCP                 yes        At least all data packets are,
+                               maybe some control packets are
+                               not.
+
+(Known) Bugs
+------------
+
+Due to the nature of the patch, every packet that follows a multipath uses
+a little memory that is not instantly cleaned up, but after a short period.
+This means that if your load gets higher, memory usage is higher. Since
+there is a limit to the memory that can be allocated for the packets, there
+is also a load limit. I cannot give exact numbers, however this patch does
+work with a load of 20 Mbit/s without problems on a 486 dx2 66, but not
+with a load of 400 Mbit/s on a box with multiple 400 MHz Xeon processors.
+If the load gets too high, no memory is left for network IO, which stops
+for a while if that happens. The kernel should not crash if this happens.
+
+Technically
+-----------
+
+Load balancing needed a slight adjustment to the unpatched linux kernel,
+because of the route cache. Multipath is an option already found in the old
+2.1.x kernels. However, once a packet arrives, and it matches a multipath
+route, a (quasi random) device out of the list of nexthops is taken for its
+destination. That's okay, but after that the kernel puts everything into a
+hash table, and the next time a packet with the same source/dest/tos arrives,
+it finds it is in the hash table, and routes it via the same device as last
+time. The adjustment I made is as follows: If the kernel sees that the route
+to be taken has got the 'equalize' flag set, it not only selects the random
+device, but also tags the packet with the RTCF_EQUALIZE flag. If another
+packet of the same kind arrives, it is looked up in the hash table. It then
+checks if our flag is set, and if so, it deletes the entry in the cache and
+has to recalculate the destination again.
diff -urN linux-2.4.18-clean/include/linux/in_route.h linux-2.4.18/include/linux/in_route.h
--- linux-2.4.18-clean/include/linux/in_route.h Fri Jun 12 07:52:33 1998
+++ linux-2.4.18/include/linux/in_route.h Fri Mar 22 17:21:21 2002
@@ -18,6 +18,7 @@
#define RTCF_MASQ 0x00400000
#define RTCF_SNAT 0x00800000
#define RTCF_DOREDIRECT 0x01000000
+#define RTCF_EQUALIZE 0x02000000
#define RTCF_DIRECTSRC 0x04000000
#define RTCF_DNAT 0x08000000
#define RTCF_BROADCAST 0x10000000
diff -urN linux-2.4.18-clean/net/ipv4/fib_semantics.c linux-2.4.18/net/ipv4/fib_semantics.c
--- linux-2.4.18-clean/net/ipv4/fib_semantics.c Mon Feb 25 20:38:14 2002
+++ linux-2.4.18/net/ipv4/fib_semantics.c Fri Mar 22 17:21:21 2002
@@ -101,6 +101,10 @@
};


+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+unsigned int mp_counter=0;
+#endif
+
/* Release a nexthop info record */

void free_fib_info(struct fib_info *fi)
@@ -955,7 +959,7 @@
it is pretty bad approximation.
*/

- w = jiffies % fi->fib_power;
+ w = mp_counter++ % fi->fib_power;

change_nexthops(fi) {
if (!(nh->nh_flags&RTNH_F_DEAD) && nh->nh_power) {
diff -urN linux-2.4.18-clean/net/ipv4/ip_output.c linux-2.4.18/net/ipv4/ip_output.c
--- linux-2.4.18-clean/net/ipv4/ip_output.c Wed Oct 17 23:16:39 2001
+++ linux-2.4.18/net/ipv4/ip_output.c Fri Mar 22 17:21:21 2002
@@ -354,7 +354,7 @@

/* Make sure we can route this packet. */
rt = (struct rtable *)__sk_dst_check(sk, 0);
- if (rt == NULL) {
+ if (rt == NULL || rt->u.dst.obsolete || rt->rt_flags&RTCF_EQUALIZE) {
u32 daddr;

/* Use correct destination address if we have options. */
diff -urN linux-2.4.18-clean/net/ipv4/route.c linux-2.4.18/net/ipv4/route.c
--- linux-2.4.18-clean/net/ipv4/route.c Mon Feb 25 20:38:14 2002
+++ linux-2.4.18/net/ipv4/route.c Fri Mar 22 17:23:03 2002
@@ -1419,8 +1419,11 @@
goto martian_destination;

#ifdef CONFIG_IP_ROUTE_MULTIPATH
- if (res.fi->fib_nhs > 1 && key.oif == 0)
+ if (res.fi->fib_nhs > 1 && key.oif == 0) {
fib_select_multipath(&key, &res);
+ if (res.fi->fib_flags&RTM_F_EQUALIZE)
+ flags |= RTCF_EQUALIZE;
+ }
#endif
out_dev = in_dev_get(FIB_RES_DEV(res));
if (out_dev == NULL) {
@@ -1622,15 +1625,15 @@
int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
u8 tos, struct net_device *dev)
{
- struct rtable * rth;
+ struct rtable * rth, **rthp;
unsigned hash;
int iif = dev->ifindex;

tos &= IPTOS_RT_MASK;
hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);

- read_lock(&rt_hash_table[hash].lock);
- for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
+ write_lock(&rt_hash_table[hash].lock);
+ for (rthp=&rt_hash_table[hash].chain; (rth=*rthp); rthp=&rth->u.rt_next) {
if (rth->key.dst == daddr &&
rth->key.src == saddr &&
rth->key.iif == iif &&
@@ -1639,16 +1642,22 @@
rth->key.fwmark == skb->nfmark &&
#endif
rth->key.tos == tos) {
+ if (rth->rt_flags&RTCF_EQUALIZE) {
+ *rthp = rth->u.rt_next;
+ rth->u.rt_next = NULL;
+ rt_free(rth);
+ break;
+ }
rth->u.dst.lastuse = jiffies;
dst_hold(&rth->u.dst);
rth->u.dst.__use++;
rt_cache_stat[smp_processor_id()].in_hit++;
- read_unlock(&rt_hash_table[hash].lock);
+ write_unlock(&rt_hash_table[hash].lock);
skb->dst = (struct dst_entry*)rth;
return 0;
}
}
- read_unlock(&rt_hash_table[hash].lock);
+ write_unlock(&rt_hash_table[hash].lock);

/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
@@ -1852,8 +1861,11 @@
}

#ifdef CONFIG_IP_ROUTE_MULTIPATH
- if (res.fi->fib_nhs > 1 && key.oif == 0)
+ if (res.fi->fib_nhs > 1 && key.oif == 0) {
fib_select_multipath(&key, &res);
+ if (res.fi->fib_flags&RTM_F_EQUALIZE)
+ flags |= RTCF_EQUALIZE;
+ }
else
#endif
if (!res.prefixlen && res.type == RTN_UNICAST && !key.oif)
@@ -1984,12 +1996,12 @@
int ip_route_output_key(struct rtable **rp, const struct rt_key *key)
{
unsigned hash;
- struct rtable *rth;
+ struct rtable *rth, **rthp;

hash = rt_hash_code(key->dst, key->src ^ (key->oif << 5), key->tos);

- read_lock_bh(&rt_hash_table[hash].lock);
- for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
+ write_lock_bh(&rt_hash_table[hash].lock);
+ for (rthp=&rt_hash_table[hash].chain; (rth=*rthp); rthp=&rth->u.rt_next) {
if (rth->key.dst == key->dst &&
rth->key.src == key->src &&
rth->key.iif == 0 &&
@@ -1999,16 +2011,22 @@
#endif
!((rth->key.tos ^ key->tos) &
(IPTOS_RT_MASK | RTO_ONLINK))) {
+ if (rth->rt_flags&RTCF_EQUALIZE) {
+ *rthp = rth->u.rt_next;
+ rth->u.rt_next = NULL;
+ rt_free(rth);
+ break;
+ }
rth->u.dst.lastuse = jiffies;
dst_hold(&rth->u.dst);
rth->u.dst.__use++;
rt_cache_stat[smp_processor_id()].out_hit++;
- read_unlock_bh(&rt_hash_table[hash].lock);
+ write_unlock_bh(&rt_hash_table[hash].lock);
*rp = rth;
return 0;
}
}
- read_unlock_bh(&rt_hash_table[hash].lock);
+ write_unlock_bh(&rt_hash_table[hash].lock);

return ip_route_output_slow(rp, key);
}
diff -urN linux-2.4.18-clean/net/ipv4/udp.c linux-2.4.18/net/ipv4/udp.c
--- linux-2.4.18-clean/net/ipv4/udp.c Mon Feb 25 20:38:14 2002
+++ linux-2.4.18/net/ipv4/udp.c Fri Mar 22 17:21:21 2002
@@ -740,6 +740,14 @@
sk->state = TCP_ESTABLISHED;
sk->protinfo.af_inet.id = jiffies;

+ #ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if(rt->rt_flags&RTCF_EQUALIZE) {
+ ip_rt_put(rt);
+ sk->dst_cache=NULL;
+ }
+ else
+ #endif
+
sk_dst_set(sk, &rt->u.dst);
return(0);
}


Attachments:
equalize_2.4.18.patch (11.07 kB)