2004-09-01 12:43:26

by Einar Lueck

Subject: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

The following small patch (applies to BK head) addresses issues relevant to transparent NIC failover (especially for NFS). It allows an IP address (Source Virtual IP Address, or Source VIPA) to be configured per device via sysctl; that address is then used as the IP source address for all connections on which no bind has been applied. To allow for NIC failover one then just needs:
1. A dummy device set up with the Source VIPA
2. Outbound routes via both/all redundant NICs for the relevant packets (more precisely: dynamic routing with, for example, ZEBRA)
3. Routes to the Source VIPA on the relevant router having the IPs of the redundant NICs configured as gateways (more precisely: dynamic routing with, for example, ZEBRA)
Dynamic routing is mandatory because dead routes (e.g. a dead NIC) must be removed at the relevant router.

We tested this patch in the intended NFS failover scenario and, of course, also without any Source VIPA configured.
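
The rule the patch adds at the end of inet_select_addr can be summarized in a small userspace model. This is only a sketch under stated assumptions: struct dev_conf and pick_source are hypothetical names, not kernel symbols, and addresses are treated as plain 32-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device configuration; only the field
 * added by the patch is modelled. 0 means "no Source VIPA configured". */
struct dev_conf {
    uint32_t source_vipa;
};

/* `natural` is the address the kernel would select anyway (0 if none).
 * The Source VIPA replaces it only when both values are non-zero. */
uint32_t pick_source(uint32_t natural, const struct dev_conf *cnf)
{
    assert(cnf != NULL);
    if (natural != 0 && cnf->source_vipa != 0)
        return cnf->source_vipa;
    return natural;
}
```

With source_vipa left at 0 the function returns the natural address unchanged, which corresponds to the "no Source VIPA configured" test case mentioned above.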

We developed this patch because the alternatives we considered have serious limitations for the intended usage scenarios:
1. A user-space tool intercepting connects and issuing binds (configuration on a per-application basis) (refer to: http://oss.software.ibm.com/linux390/useful_add-ons_vipa.shtml)
This approach does not allow for NFS failover, which we consider a very important use case, because NFS works in the kernel.
2. ip route xxx.xxx.xxx.xxx/xx src SourceVIPA
OSPF, etc. do not support automatic setup and discovery of desired source addresses. As a consequence, one would have to configure static routes for all use cases, which is not desirable in complex routing scenarios and especially in the presence of dynamic routing.
3. netfilter ((S)NAT)
NAT takes place after routing is applied and an IP address (e.g. the IP of the output NIC) has been set for a packet. Consequently, returned packets are routed to the original IP address. As a result no failover is possible.
4. NIC bonding
There is a strong dependence on the switches' timeout for the IP/MAC pair. In addition, as far as we know, not all NICs support bonding with failover.
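
Alternative 1 boils down to binding each socket to the virtual address before it connects. A minimal sketch of that step, assuming a plain IPv4 stream socket; bind_source is a hypothetical helper name, and 127.0.0.1 would stand in for the Source VIPA on a machine that has none configured:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Bind an unconnected IPv4 socket to src_ip with an ephemeral port, so that
 * a later connect() uses src_ip as the source address.
 * Returns 0 on success, -1 on a malformed address or bind() failure. */
int bind_source(int fd, const char *src_ip)
{
    struct sockaddr_in src;

    assert(src_ip != NULL);
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0);            /* let the kernel choose the port */
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1)
        return -1;
    return bind(fd, (struct sockaddr *)&src, sizeof(src));
}
```

A user-space program can call this between socket() and connect(). Sockets created inside the kernel, as for NFS, never pass through such a wrapper, which is the limitation described in item 1.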

I hope I have described the overall use case clearly enough to explain why we consider this patch very useful and important.

Einar Lueck


diff -ruN linux-2.6.8.1/include/linux/inetdevice.h linux-2.6.8.1.new/include/linux/inetdevice.h
--- linux-2.6.8.1/include/linux/inetdevice.h 2004-08-31 17:50:03.000000000 +0200
+++ linux-2.6.8.1.new/include/linux/inetdevice.h 2004-08-31 18:07:01.000000000 +0200
@@ -27,6 +27,9 @@
int no_policy;
int force_igmp_version;
void *sysctl;
+#ifdef CONFIG_IP_SOURCEVIPA
+ __u32 source_vipa;
+#endif
};

extern struct ipv4_devconf ipv4_devconf;
diff -ruN linux-2.6.8.1/include/linux/sysctl.h linux-2.6.8.1.new/include/linux/sysctl.h
--- linux-2.6.8.1/include/linux/sysctl.h 2004-08-31 17:50:04.000000000 +0200
+++ linux-2.6.8.1.new/include/linux/sysctl.h 2004-08-31 18:08:13.000000000 +0200
@@ -393,6 +393,9 @@
NET_IPV4_CONF_FORCE_IGMP_VERSION=17,
NET_IPV4_CONF_ARP_ANNOUNCE=18,
NET_IPV4_CONF_ARP_IGNORE=19,
+#ifdef CONFIG_IP_SOURCEVIPA
+ NET_IPV4_CONF_SOURCE_VIPA = 20
+#endif
};

/* /proc/sys/net/ipv4/netfilter */
diff -ruN linux-2.6.8.1/net/ipv4/Kconfig linux-2.6.8.1.new/net/ipv4/Kconfig
--- linux-2.6.8.1/net/ipv4/Kconfig 2004-08-31 17:50:04.000000000 +0200
+++ linux-2.6.8.1.new/net/ipv4/Kconfig 2004-08-31 18:10:11.000000000 +0200
@@ -115,6 +115,20 @@
handled by the klogd daemon which is responsible for kernel messages
("man klogd").

+config IP_SOURCEVIPA
+ bool "IP: Source Virtual IP Address"
+ help
+ If you say Y you are able to configure on a per device basis
+ virtual source ip addresses to be set for not explicitly
+ bound sockets. Thereby, one may force applications like
+ FTP, NFS, etc. to implicitly bind to dummy interfaces.
+ On the basis of dummy interfaces one may decouple applications
+ from physical interfaces and may as a consequence achieve a higher
+ degree of fault tolerance.
+
+ If unsure, say N.
+
+
config IP_PNP
bool "IP: kernel level autoconfiguration"
depends on INET
diff -ruN linux-2.6.8.1/net/ipv4/devinet.c linux-2.6.8.1.new/net/ipv4/devinet.c
--- linux-2.6.8.1/net/ipv4/devinet.c 2004-08-31 17:50:04.000000000 +0200
+++ linux-2.6.8.1.new/net/ipv4/devinet.c 2004-08-31 18:27:25.000000000 +0200
@@ -57,6 +57,7 @@
#include <linux/sysctl.h>
#endif
#include <linux/kmod.h>
+#include <linux/ctype.h>

#include <net/ip.h>
#include <net/route.h>
@@ -67,6 +68,9 @@
.send_redirects = 1,
.secure_redirects = 1,
.shared_media = 1,
+#ifdef CONFIG_IP_SOURCEVIPA
+ .source_vipa = 0,
+#endif
};

static struct ipv4_devconf ipv4_devconf_dflt = {
@@ -75,6 +79,10 @@
.secure_redirects = 1,
.shared_media = 1,
.accept_source_route = 1,
+#ifdef CONFIG_IP_SOURCEVIPA
+ .source_vipa = 0,
+#endif
+
};

static void rtmsg_ifa(int event, struct in_ifaddr *);
@@ -767,6 +775,9 @@
u32 inet_select_addr(const struct net_device *dev, u32 dst, int scope)
{
u32 addr = 0;
+#ifdef CONFIG_IP_SOURCEVIPA
+ u32 source_vipa = 0;
+#endif
struct in_device *in_dev;

rcu_read_lock();
@@ -784,11 +795,17 @@
if (!addr)
addr = ifa->ifa_local;
} endfor_ifa(in_dev);
+#ifdef CONFIG_IP_SOURCEVIPA
+ source_vipa = in_dev->cnf.source_vipa;
+#endif
no_in_dev:
rcu_read_unlock();

if (addr)
goto out;
+#ifdef CONFIG_IP_SOURCEVIPA
+ source_vipa = 0;
+#endif

/* Not loopback addresses on loopback should be preferred
in this case. It is importnat that lo is the first interface
@@ -804,6 +821,9 @@
if (ifa->ifa_scope != RT_SCOPE_LINK &&
ifa->ifa_scope <= scope) {
addr = ifa->ifa_local;
+#ifdef CONFIG_IP_SOURCEVIPA
+ source_vipa = in_dev->cnf.source_vipa;
+#endif
goto out_unlock_both;
}
} endfor_ifa(in_dev);
@@ -812,6 +832,14 @@
read_unlock(&dev_base_lock);
rcu_read_unlock();
out:
+#ifdef CONFIG_IP_SOURCEVIPA
+ /* Set Source Virtual IP Address (Source VIPA) if one is
+ configured for the device and the device has a natural
+ IP */
+ if (addr != 0 && source_vipa != 0) {
+ addr = source_vipa;
+ }
+#endif
return addr;
}

@@ -1151,6 +1179,158 @@
return ret;
}

+#ifdef CONFIG_IP_SOURCEVIPA
+
+static int
+ipv4_inet_addr(const char *cp, void *dst)
+{
+ unsigned long value;
+ char *endp;
+ const char *startp;
+ unsigned char bytes[4];
+ int byteNo;
+
+ *((int*)bytes) = 0;
+
+ startp = cp;
+ for (byteNo = 0; byteNo < 4; ++byteNo) {
+ value = simple_strtoul( startp, &endp, 10 );
+ if ( value > 0xFF ) {
+ return -EINVAL;
+ }
+ bytes[byteNo] = (char) value;
+ if ( *endp == 0 ) {
+ *((int *)dst) = *((int *)bytes);
+ return 0;
+ }
+ else if ( *endp == '.' ) {
+ startp = endp + 1;
+ }
+ else {
+ return -EINVAL;
+ }
+ }
+
+ return -EINVAL;
+}
+
+
+/**
+ * ipv4_doinetaddrstring_and_flush - read an ip address string sysctl
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @filp: the file structure
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ *
+ * Reads/writes a string representing an IP address from/to the user buffer.
+ * It converts the string to an integer value through the use of
+ * ipv4_inet_addr.
+ * If the buffer provided is not large enough to hold the string, the
+ * string is truncated. The copied string is %NULL-terminated.
+ * If the string is being read by the user process, it is copied
+ * and a newline '\n' is added. It is truncated if the buffer is
+ * not large enough.
+ * On write operations the routing cache is flushed.
+ *
+ * Returns 0 on success.
+ */
+int
+ipv4_doinetaddrstring_and_flush(ctl_table *table, int write,
+ struct file *filp,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ char *p;
+ char *kerneltempbuffer;
+ int nullterminationpos;
+ int retval;
+
+
+ if (!table->data || !table->maxlen || !*lenp ||
+ (*ppos && !write)) {
+ *lenp = 0;
+ return 0;
+ }
+
+ if (write) {
+ kerneltempbuffer = (char*) kmalloc(table->maxlen, GFP_KERNEL);
+ if (!kerneltempbuffer) {
+ return -ENOMEM;
+ }
+
+ /* copy data to kernel space */
+ if(strncpy_from_user(kerneltempbuffer, buffer,
+ table->maxlen) < 0) {
+ retval = -EFAULT;
+ goto cleanup;
+ }
+
+ /* set null-termination if necessary */
+ nullterminationpos = 0;
+ p = kerneltempbuffer;
+ while (nullterminationpos < table->maxlen) {
+ if (*p == '\n' || *p == 0)
+ break;
+
+ ++p;
+ ++nullterminationpos;
+ }
+ if ( nullterminationpos == table->maxlen )
+ nullterminationpos--;
+ kerneltempbuffer[nullterminationpos] = 0;
+
+ /* convert address */
+ retval = ipv4_inet_addr(kerneltempbuffer, table->data);
+ if ( retval != 0 ) {
+ goto cleanup;
+ }
+ *ppos += *lenp;
+
+ /* flush routing cache */
+ rt_cache_flush(0);
+ printk( KERN_DEBUG "%s: new IP written: %s(%u)\n",
+ __FUNCTION__, kerneltempbuffer,
+ *((__u32*)table->data) );
+ cleanup:
+ kfree( kerneltempbuffer );
+ goto out;
+
+ } else {
+ char inetaddrstr[16];
+ size_t len;
+ sprintf( inetaddrstr, "%u.%u.%u.%u",
+ *((unsigned char*)table->data),
+ *((unsigned char*)table->data+1),
+ *((unsigned char*)table->data+2),
+ *((unsigned char*)table->data+3) );
+ len = strlen( inetaddrstr );
+ if ( len > table->maxlen)
+ len = table->maxlen;
+ if (len > *lenp)
+ len = *lenp;
+ if (len)
+ if(copy_to_user(buffer, inetaddrstr, len)) {
+ retval = -EFAULT;
+ goto out;
+ }
+ if (len < *lenp) {
+ if(put_user('\n', ((char __user *) buffer) + len)) {
+ retval = -EFAULT;
+ goto out;
+ }
+ len++;
+ }
+ *lenp = len;
+ *ppos += len;
+ }
+ retval = 0;
+ out:
+ return retval;
+}
+
+#endif /* CONFIG_IP_SOURCEVIPA */
+
int ipv4_doint_and_flush(ctl_table *ctl, int write,
struct file* filp, void __user *buffer,
size_t *lenp, loff_t *ppos)
@@ -1209,7 +1389,11 @@

static struct devinet_sysctl_table {
struct ctl_table_header *sysctl_header;
- ctl_table devinet_vars[20];
+#ifdef CONFIG_IP_SOURCEVIPA
+ ctl_table devinet_vars[21];
+#else
+ ctl_table devinet_vars[20];
+#endif
ctl_table devinet_dev[2];
ctl_table devinet_conf_dir[2];
ctl_table devinet_proto_dir[2];
@@ -1371,6 +1555,16 @@
.proc_handler = &ipv4_doint_and_flush,
.strategy = &ipv4_doint_and_flush_strategy,
},
+#ifdef CONFIG_IP_SOURCEVIPA
+ {
+ .ctl_name = NET_IPV4_CONF_SOURCE_VIPA,
+ .procname = "source_vipa",
+ .data = &ipv4_devconf.source_vipa,
+ .maxlen = 16,
+ .mode = 0644,
+ .proc_handler = &ipv4_doinetaddrstring_and_flush
+ },
+#endif
},
.devinet_dev = {
{
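
For comparison, here is a userspace sketch of the dotted-quad handling that ipv4_inet_addr and the read branch of the sysctl handler perform in the patch above. It is not the patch's code: it relies on the standard inet_pton and inet_ntop calls, and parse_vipa and format_vipa are hypothetical names. One difference worth noting, as far as can be told from reading the patch: its parsing loop returns success as soon as it reaches the terminating NUL, so a short input such as "1.2" appears to be accepted with the remaining bytes zero, whereas inet_pton requires all four octets.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse a dotted-quad string into a network-byte-order address.
 * Returns 0 on success, -1 if the text is not a complete IPv4 address. */
int parse_vipa(const char *text, uint32_t *out)
{
    struct in_addr a;

    assert(text != NULL && out != NULL);
    if (inet_pton(AF_INET, text, &a) != 1)
        return -1;
    *out = a.s_addr;
    return 0;
}

/* Format a network-byte-order address back into dotted-quad text.
 * Returns buf on success, NULL on failure. */
const char *format_vipa(uint32_t addr, char buf[INET_ADDRSTRLEN])
{
    struct in_addr a = { .s_addr = addr };

    assert(buf != NULL);
    return inet_ntop(AF_INET, &a, buf, INET_ADDRSTRLEN);
}
```

Because both helpers keep the address in network byte order, the value can be stored and compared without caring about host endianness, which is also what the patch's byte-wise read and write of table->data amounts to.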


2004-09-01 13:31:10

by Alan

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Mer, 2004-09-01 at 13:41, Einar Lueck wrote:
> The following small patch (applies to BK head) addresses issues relevant for transparent NIC failover (especially in case of NFS). It allows to configure on a per device basis via sysctl an IP address (Source Virtual IP Address - Source VIPA) that is set as IP source address for all connections for which no bind has been applied. To allow for NIC failover one then just needs:
> 1. A dummy-Device set up with the Source VIPA
> 2. Outbound routes via both/all redundant NICs for the relevant packets (more precisely: dynamic routing with for example ZEBRA)
> 3. Routes to the Source VIPA on the relevant router having the IPs of the redundant NICs configured as gateways (more precisely: dynamic routing with for example ZEBRA)
> Dynamic routing is mandatory as it is necessary that dead routes (e.g. NIC dead) are removed at the relevant router.

Is there anything here that cannot already be done by the ip route
command and iptables nat ?


2004-09-02 12:01:47

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Mer, 2004-09-01 at 14:25, Alan Cox wrote:
> Is there anything here that cannot already be done by the ip route
> command and iptables nat ?
Our experience with customers operating highly available enterprise
installations indicates that ip route and iptables NAT do not fulfill all the
relevant requirements:
The high-availability requirements drive those customers to ensure that
all routes are dynamic, which is not possible with the proposed ip route
approach.
The complexity of the relevant setups necessitates an easy,
requirements-driven configuration and solution approach. The overall
concept of setting up a virtual device with a virtual IP address and
assigning this virtual IP address as a Source VIPA to the devices that should
allow for failover is well known from other operating systems and expected
by the relevant enterprise customers. IP routes and NAT achieve the
same effect, with the exception mentioned above, but in the opinion of
customers with enterprise setups the corresponding configuration overhead
is too complex. Our overall approach introduces
this concept as a facility that addresses these requirements very clearly.

Einar

2004-09-02 12:09:22

by Alan

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Iau, 2004-09-02 at 13:01, Einar Lueck wrote:
> The high-availability requirements drive those customers to ensure that
> all routes are dynamic which is not possible with the proposed ip route
> approach.

Why. You are simply doing NAT for that port from all locally sourced
packets to your "virtual" address ?

Alan

2004-09-02 12:20:51

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Donnerstag, 2. September 2004 13:05 Alan Cox wrote:
> On Iau, 2004-09-02 at 13:01, Einar Lueck wrote:
> > The high-availability requirements drive those customers to ensure that
> > all routes are dynamic which is not possible with the proposed ip route
> > approach.
>
> Why. You are simply doing NAT for that port from all locally sourced
> packets to your "virtual" address ?
I thought you proposed to use ip route in the following way:
"ip route xxx.xxx.xxx.xxx/xx via xxx dev xxx src MY_VIRTUAL_IP_ADDRESS"
This approach does not work with the setups we have in mind. So I misunderstood
you. As we pointed out in the other part of the mail, iptables NAT is not an option
for the relevant customers.

Einar

2004-09-02 12:26:32

by Alan

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Iau, 2004-09-02 at 13:20, Einar Lueck wrote:
> You. As we pointed out in the other part of the mail iptables NAT is not an option
> for the relevant customers.

You've failed as far as I can see to explain why NAT doesn't do the
right thing in this case. I don't care whether the customers like it, I
care whether it works. If it works then we don't need to add junk to the
kernel. If it works but is hard to configure then it's an opportunity to
write a nice tool to manage it.

2004-09-02 12:52:29

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Donnerstag, 2. September 2004 13:24 Alan Cox wrote:
> You've failed as far as I can see to explain why NAT doesn't do the
> right thing in this case. I don't care whether the customers like it, I
> care whether it works. If it works then we don't need to add junk to the
> kernel. If it works but is hard to configure then it's an opportunity to
> write a nice tool to manage it.
I am sorry: I failed to point out that NAT does the job!

We think that the proposed patch, which is a really small one,
introduces a facility that works well in existing operating systems
and is desired by customers. Consequently, it enriches the kernel with
a concept that has already proven its value. That is why we
think the kernel could profit from applying the corresponding
patch, especially as it makes more customers happy. As the whole feature
can be configured out, we don't see what's wrong with making more, and
especially large, customers happy with an enriched kernel.

Einar

2004-09-02 13:14:05

by Bill Rugolsky Jr.

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Thu, Sep 02, 2004 at 02:01:42PM +0200, Einar Lueck wrote:
> The complexity of the relevant setups necessitates an easy and
> requirements-driven configuration and solution approach. The overall
> concept of setting up a virtual device with a virtual IP address and
> assigning this virtual IP address as a Source VIPA to the devices which should
> allow for a failover is well known from other operating systems and expected
> by relevant enterprise customers. IP routes and NAT allow to achieve the
> same effect with the exception mentioned above, but the corresponding
> configuration overhead is in the opinion of customers having enterprise
> setups too complex and complicated. Our overall approach introduces
> this concept as a facility to address these requirements very clearly.

The problem here is not the kernel, it's the awful state of network
management in the common initscripts and routing daemons, which have not
evolved far beyond the traditional (static) BSD model of interfaces,
despite the qualitative leap in functionality brought on by Alexey's
contributions to Linux 2.2, netfilter and vlans in 2.4, and now IPsec in 2.6.

While policy routing, traffic control, netfilter, etc., are extremely
powerful separately and in combination, there is no unified mechanism
for managing different aspects of network configuration (addressing,
redundancy, security, QoS, etc.) that cut across each of these kernel
mechanisms.

Several of the routing daemons (Quagga, Bird, etc.) have gotten partway
there, but none is (AFAIK, correct me if I'm wrong) in a state where
one can eliminate all of the other networking-related scripts from
initscripts and just expect the daemon(s) to manage networking.

It seems that effort would be better spent cleaning up userspace, by
looking at use cases, identifying the right abstractions, and reifying
them. Choose a routing daemon, and turn it into a "Network Mgmt daemon".

Should this be built on top of HAL/D-BUS? I don't know, will HAL/D-BUS
remain lightweight enough to be used in an embedded router? I sure
hope so.

Regards,

Bill Rugolsky

2004-09-02 16:22:17

by Paul Jakma

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Wed, 1 Sep 2004, Einar Lueck wrote:

> The following small patch (applies to BK head) addresses issues relevant for transparent NIC failover (especially in case of NFS). It allows to configure on a per device basis via sysctl an IP address (Source Virtual IP Address - Source VIPA) that is set as IP source address for all connections for which no bind has been applied. To allow for NIC failover one then just needs:

> 1. A dummy-Device set up with the Source VIPA

Why can't you use loopback as the placeholder interface for the VIPA
IP?

> 2. Outbound routes via both/all redundant NICs for the relevant
> packets (more precisely: dynamic routing with for example ZEBRA)

# telnet localhost zebra
...
host# en
host(config)# in lo
host(config-if)# ip address <desired address>/32
host(config-if)# end
host# wri fi

hey presto, "virtual IP" which you can redistribute connected in
ospfd/ripd whatever and publish in DNS.

> The reason for the development of this patch is that the
> alternatives we thought of have serious limitations for the
> intended usage scenarios:

> 1. A User space tool intercepting connects and issuing binds
> (configuration on a per application basis) (refer to:
> http://oss.software.ibm.com/linux390/useful_add-ons_vipa.shtml)
> This approach does not allow for NFS failover which we consider to
> be a very important use case because NFS works in kernel.

What problems did you have?

> 2. ip route xxx.xxx.xxx.xxx/xx src SourceVIPA OSPF, etc. do not
> support automatic setup of and discovery of desired source
> addresses.

ip route add default via <gateway> src <virtual ip>

Having zebra be able to apply route-maps to routes and set src would
be useful too.

But why do you need this even? See below about replies. Just publish
the virtual IP in DNS, not the interface addresses.

> As a consequence one would have to configure static routes for all
> use cases which is not desirable in complex routing scenarios and
> especially in presence of dynamic routing. 3. netfilter ((S)NAT)
> NAT takes place after routing is applied and an IP address (e.g. IP
> of the output NIC) has been set for a packet. Consequently,
> returned packets are routed to the original IP address. As a result
> no failover is possible.

TCP responses to requests that come in addressed to the virtual IP
will automatically have source of that virtual IP. For UDP too if app
binds to the virtual IP or uses IP_HDRINCL. Not sure about kernel UDP
though, but I bet linux-nfs list would be amenable to any changes
needed for knfsd to work nicely with virtual/loopback IPs.
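
The point about TCP replies can be checked from user space. The sketch below is an illustration only (accepted_local_addr is a hypothetical helper): it listens on the wildcard address, connects to a chosen destination address from the same process, and reports the local address of the accepted connection, which is the address the replies on that connection are sent from.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 and fills `out` with the accepted socket's local address,
 * or -1 on any failure. */
int accepted_local_addr(const char *dst_ip, char out[INET_ADDRSTRLEN])
{
    struct sockaddr_in addr, local;
    socklen_t len = sizeof(addr);
    int lfd, cfd = -1, afd = -1, rc = -1;

    assert(dst_ip != NULL && out != NULL);
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* not tied to one address */
    addr.sin_port = htons(0);                   /* ephemeral port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto done;

    /* Client side: connect to the chosen destination address. */
    if (inet_pton(AF_INET, dst_ip, &addr.sin_addr) != 1)
        goto done;
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* Server side: read back the local address of the accepted socket. */
    afd = accept(lfd, NULL, NULL);
    len = sizeof(local);
    if (afd < 0 || getsockname(afd, (struct sockaddr *)&local, &len) < 0)
        goto done;
    if (inet_ntop(AF_INET, &local.sin_addr, out, INET_ADDRSTRLEN) != NULL)
        rc = 0;
done:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return rc;
}
```

Passing a virtual IP configured on lo as dst_ip should report that same address, even though the listener was never bound to it. This covers the reply direction only, not connections the host initiates itself.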

> 4. NIC bonding

> There is a strong dependence on the switches' timeout for the
> IP/MAC pair. In addition to that, as far as we know not all NICs
> support bonding with failover.

How does bonding come into it?

> I hope I described the overall use case comprehensible enough to
> clarify why we consider this patch as very useful and important.
>
> Einar Lueck

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
<<<<< EVACUATION ROUTE <<<<<

2004-09-02 16:58:45

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

My apologies: I fear I have not specified the setups we address
precisely enough. I think it helps to imagine the following scenario:
1. SERVER1 mounts some NFS stuff exported by SERVER2.
2. SERVER1 has two NICs.
3. The source IP address of the related packets is the IP address
of the NIC the kernel identifies for the outbound route, or the IP
address specified via a static route (ip route .... src SVIPA)
4. In case no static routes are allowed and iptables NAT is no
option, the relevant packets thus have the IP of a NIC.
5. In case the corresponding NIC dies, we have a serious problem.
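
The scenario can be expressed as a toy reachability model. It is a deliberate simplification and rests on one assumption stated in the code: a NIC's own address is reachable only through that NIC, while the virtual address is advertised through every NIC that is still up. struct nic and reply_reachable are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct nic {
    uint32_t addr;   /* address configured on this NIC */
    bool up;         /* link state */
};

/* Can a reply addressed to `src` still reach the host?
 * Assumption of this model: a NIC's own address is reachable only through
 * that NIC, while the VIPA is routed through any NIC that is up. */
bool reply_reachable(uint32_t src, uint32_t vipa,
                     const struct nic *nics, size_t n)
{
    assert(nics != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!nics[i].up)
            continue;
        if (src == vipa || src == nics[i].addr)
            return true;
    }
    return false;
}
```

With two NICs of which the first is down, a connection whose source is the first NIC's address is stuck (step 5), while one whose source is the virtual address stays reachable through the second NIC.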

On Donnerstag, 2. September 2004 18:22 Paul Jakma wrote:
> On Wed, 1 Sep 2004, Einar Lueck wrote:
>
> > The following small patch (applies to BK head) addresses issues relevant for transparent NIC failover (especially in case of NFS). It allows to configure on a per device basis via sysctl an IP address (Source Virtual IP Address - Source VIPA) that is set as IP source address for all connections for which no bind has been applied. To allow for NIC failover one then just needs:
>
> > 1. A dummy-Device set up with the Source VIPA
>
> Why can't you use loopback as the placeholder interface for the VIPA
> IP?

This works too, of course. But it does not solve the problem described
above. But maybe I misunderstood you. Please correct me if I am wrong.

>
> > 2. Outbound routes via both/all redundant NICs for the relevant
> > packets (more precisely: dynamic routing with for example ZEBRA)
>
> # telnet localhost zebra
> ...
> host# en
> host(config)# in lo
> host(config-if)# ip address <desired address>/32
> host(config-if)# end
> host# wri fi
>
> hey presto, "virtual IP" which you can redistribute connected in
> ospfd/ripd whatever and publish in DNS.

I think this is the same as the first point.

>
> > The reason for the development of this patch is that the
> > alternatives we thought of have serious limitations for the
> > intended usage scenarios:
>
> > 1. A User space tool intercepting connects and issuing binds
> > (configuration on a per application basis) (refer to:
> > http://oss.software.ibm.com/linux390/useful_add-ons_vipa.shtml)
> > This approach does not allow for NFS failover which we consider to
> > be a very important use case because NFS works in kernel.
>
> What problems did you have?

NFS works in kernel. Therefore we could not intercept the corresponding
"connect" calls. As a result, the problem scenario described above could
not be solved.

>
> TCP responses to requests that come in addressed to the virtual IP
> will automatically have source of that virtual IP. For UDP too if app
> binds to the virtual IP or uses IP_HDRINCL. Not sure about kernel UDP
> though, but I bet linux-nfs list would be amenable to any changes
> needed for knfsd to work nicely with virtual/loopback IPs.

You are right, but the packets do not come in, they go out, as I tried to
illustrate above. Anyway, in the other case you are completely right.

>
> > 4. NIC bonding
>
> > There is a strong dependence on the switches' timeout for the
> > IP/MAC pair. In addition to that, as far as we know not all NICs
> > support bonding with failover.
>
> How does bonding come into it?

Bonding offers a failover facility. For more details, please refer to:
Documentation/networking/bonding.txt in the kernel tree.

Regards
Einar

2004-09-02 20:52:45

by David Miller

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Thu, 2 Sep 2004 14:51:55 +0200
Einar Lueck <[email protected]> wrote:

> On Donnerstag, 2. September 2004 13:24 Alan Cox wrote:
> > You've failed as far as I can see to explain why NAT doesn't do the
> > right thing in this case. I don't care whether the customers like it, I
> > care whether it works. If it works then we don't need to add junk to the
> > kernel. If it works but is hard to configure then it's an opportunity to
> > write a nice tool to manage it.
> I am sorry: I failed to point out that NAT does the job!
>
> We think that the proposed patch, that is a really small one,
> introduces a facility that works well for existing operating systems
> and is desired by customers. Consequently, it enriches the kernel with
> a concept that has already proven its value.

We never add patches that duplicate existing functionality just to
make it somehow "easier" for the user. That's a job for scripts
and good tools, that make use of existing kernel facilities.

Alan is saying you can hide whatever complexity you claim exists via
tools, without any kernel modifications. If you continue to ignore
that part of the discussion, it seems likely we will just the same
ignore your patch.

I, frankly, see no reason at all to even remotely consider your patch.
Furthermore, you'll get more discussion by bringing this up in the proper
place to propose such networking changes ([email protected]).

2004-09-02 21:01:52

by Paul Jakma

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Thu, 2 Sep 2004, Einar Lueck wrote:

> My apologies: I fear I have not specified the setups we address
> precisely enough: I think it helps to imagine the following scenario:
> 1. SERVER1 mounts some NFS stuff exported by SERVER2.
> 2. SERVER1 has two NICs.
> 3. The source IP address of the related packets is the IP address
> of the NIC the kernel identifies for the outbound route, or the IP
> address specified via a static route (ip route .... src SVIPA)
> 4. In case no static routes are allowed and iptables NAT is no
> option, the relevant packets thus have the IP of a NIC.
> 5. In case the corresponding NIC dies, we have a serious problem.

> This is works of course, too. But it does not solve the problem
> described above. But, maybe I misunderstood you. Please correct me,
> if I am wrong.

I don't see why it wouldn't work; it almost undoubtedly will work for
NFS over TCP. And any problems that cause it not to work would be best
taken up on the linux-nfs list in order to have a "bind to address"
option added to knfsd.

Why would it not work?

>> hey presto, "virtual IP" which you can redistribute connected in
>> ospfd/ripd whatever and publish in DNS.
>
> I think this is the same as the first point.

But you must describe why it would not work. *Why*?

> NFS works in kernel. Therefore we could not intercept the
> corresponding "connect" calls. As a result, the problem scenario
> described above could not be solved.

Why could it not be solved? And why is the answer not "ask the knfsd
people to provide bind-to-ip option"?

> You are right, but the packets do not come in, they go out, as I
> tried to illustrate above. Anyway, in the other case you are
> completely right.

But on a server, the packets that go out tend to be replies to
requests. Or at least, the only packets of interest are replies. It's
a rare server that just off its own bat goes and talks to clients
which have not communicated first with the server before.

Anyway, even if the server for some reason initiated traffic, the
correct answer surely is "modify the server to bind to a specific
address", no?

> Bonding offers a failover facility. For more details, please refer to:
> Documentation/networking/bonding.txt in the kernel tree.

Right, but what does bonding (layer 2) have to do with virtual IPs
and IP source address?

> Regards
> Einar

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
People say I live in my own little fantasy world... well, at least they
*know* me there!
-- D.L. Roth

2004-09-03 08:17:46

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Donnerstag, 2. September 2004 22:59 Paul Jakma wrote:
>
> I dont see why it wouldnt work, it almost undoubtedly will work for
> NFS over TCP. And any problems to cause it to not work would be best
> taking up on the linux-nfs list in order to have a "bind to address"
> option added to knfsd.

I just set up the loopback interface via ZEBRA/OSPF as you described
and checked the source IP address of the related NFS packets with tcpdump.
The kernel chooses the IP address of the NIC it routes the packets over as
the source IP address, not the Source VIPA configured for loopback.
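
The same observation can be made without tcpdump by asking the kernel which source address it would pick for a destination. The sketch below uses the common trick of connecting a UDP socket and reading back its local address; route_source_addr is a hypothetical helper name, and the port number is arbitrary.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Report the source address the kernel selects for traffic to dst_ip.
 * Returns 0 and fills `out` on success, -1 on failure. */
int route_source_addr(const char *dst_ip, char out[INET_ADDRSTRLEN])
{
    struct sockaddr_in dst, src;
    socklen_t len = sizeof(src);
    int fd, rc = -1;

    assert(dst_ip != NULL && out != NULL);
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(2049);     /* arbitrary; 2049 is the NFS port */
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* connect() on a UDP socket transmits nothing; it only resolves the
     * route and fixes the local address for this socket. */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0 &&
        getsockname(fd, (struct sockaddr *)&src, &len) == 0 &&
        inet_ntop(AF_INET, &src.sin_addr, out, INET_ADDRSTRLEN) != NULL)
        rc = 0;

    close(fd);
    return rc;
}
```

Called with the NFS server's address, this would be expected to print the outgoing NIC's address in the setup described here, and the virtual address once a route with a src hint (or the proposed sysctl) is in effect.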

You are right, one option would be a "bind to address" option in KNFSD.
But our idea was to bring to Linux a feature well known from other operating
systems such as AIX, because this feature is quite popular, especially with
large customers. As you have surely read, a feature that adds
redundant functionality to the kernel is not desired. So maybe we
should continue our discussion privately. Thanks for your suggestions!

>
> Why could it not be solved? And why is the answer not "ask the knfsd
> people to provide bind-to-ip option"?
>

We would gain a facility that allows a Source VIPA for all
kinds of servers that do not offer an explicit bind option. So: because of
the feature-port idea mentioned above.

> But on a server, the packets that go out tend to be replies to
> requests. Or at least, the only packets of interest are replies. It's
> a rare server that just off its own bat goes and talks to clients
> which have not communicated first with the server before.

The enterprise customers we care about have, for example, servers
that use other servers (application servers using a database or
an NFS server, etc.). So to generate replies, these servers need
replies from other servers.

>
> Anyway, even if the server for some reason initiated traffic, the
> correct answer surely is "modify the server to bind to a specific
> address", no?

As mentioned above ;)

>
> > Bonding offers a failover facility. For more details, please refer to:
> > Documentation/networking/bonding.txt in the kernel tree.
>
> Right, but what does bonding (layer 2) have to do with virtual IPs
> and IP source address?
>

If we focus for a moment just on the NIC failover issue (not caring
about layers, virtual IPs, etc.), then bonding offers the desired failover with
some restrictions. This is the reason why I mentioned it in this context.

Again, thanks for your suggestions, and maybe we should continue our
discussion privately.

Regards

Einar.

2004-09-03 16:58:29

by Paul Jakma

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Fri, 3 Sep 2004, Einar Lueck wrote:

> I just set up the loopback interface via ZEBRA/OSPF as You
> described it and checked via tcpdump the source IP address of the
> related NFS packets. The kernel chooses the IP address of the NIC
> he routes the packets over as the source IP address and not the
> Source VIPA configured for loopback.

Ah, I didn't say adding an address to loopback would make everything
use it. Merely that loopback already exists as an interface from
which you can 'hang' your VIPA - no need for a new interface.

You could try:

ip route change default via <gateway> src <vipa>

Presuming the NFS clients are behind a gateway. If also onlink, you
need to modify the connected routes and change the src there too.
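
To check what a route's src hint actually does, without reaching for tcpdump, one can ask the kernel which source address it would pick for a given destination. The following is a minimal sketch, not part of the original thread: the function name chosen_source is hypothetical, and the default port 2049 (NFS) is only a placeholder, since no packet is sent.

```python
import socket


def chosen_source(dest_ip, dest_port=2049):
    """Return the source address the kernel would select for dest_ip.

    Hypothetical helper for illustration only. The socket is never
    bound explicitly, so the kernel's own source address selection
    (including any preferred source set on the matching route) applies.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket transmits nothing; it performs the
        # route lookup and fixes the local address for this socket.
        sock.connect((dest_ip, dest_port))
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Compare the output before and after changing the route's src.
    print(chosen_source("127.0.0.1"))
```

Running it against an NFS client's address before and after the "ip route change" should show whether unbound sockets pick up the VIPA as their source.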

> You are right, it would be one option to have a "bind to address"
> in KNFSD.

It might even already exist.. who knows. ;)
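
For user-space programs, "bind to address" simply means calling bind() with the desired source address before connect(). A minimal sketch, assuming a TCP client; the names connect_from and observed_source are hypothetical, and this says nothing about how knfsd itself would implement such an option:

```python
import socket


def connect_from(source_ip, dest_host, dest_port, timeout=5.0):
    """Open a TCP connection whose source address is pinned to source_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # Port 0 lets the kernel choose an ephemeral port; only the
        # source address is fixed by this bind().
        sock.bind((source_ip, 0))
        sock.connect((dest_host, dest_port))
    except OSError:
        sock.close()
        raise
    return sock


def observed_source(source_ip):
    """Return the peer address a loopback listener sees for a bound client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        client = connect_from(source_ip, "127.0.0.1", port)
        try:
            conn, peer = listener.accept()
            conn.close()
        finally:
            client.close()
        return peer[0]


if __name__ == "__main__":
    print(observed_source("127.0.0.1"))
```

In a VIPA setup, source_ip would be the virtual address configured on the dummy or loopback interface. The standard library's socket.create_connection() accepts a source_address argument that does the same thing in one call.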

> But our idea was to implement a feature well known from other
> operating systems like AIX to Linux because this feature is quite
> popular and liked especially by large customers.

Right, but Linux can already do it. The configuration might not be
the same as AIX, but that's not a good reason.. if it were, you
should also be porting smit to Linux to satisfy your customers ;)

Linux can already do what you want I think. Just a matter of
configuring it.

> We would win a facility allowing for a Source VIPA for all kinds of
> servers not offering an explicit bind option. So: Due to the
> feature port idea mentioned above.

Have you tried playing with ip route?

ip route <destination> ...... src <source address>

> If we focus for a moment just on the NIC-fail-over issue (not
> caring about layers, virtual IPs, etc.) then bonding offers the
> desired failover with some restriction. This is the reason why I
> mentioned it in this context.

Ah.

> Again, thanks for Your suggestions and maybe we should continue our
> discussion privately.

Sure.

> Regards
>
> Einar.

regards,
--
Paul Jakma Key ID: 64A2FF6A
Fortune:
kernel panic: write-only-memory (/dev/wom0) capacity exceeded.

2004-09-05 23:02:23

by Daniel Roesen

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Thu, Sep 02, 2004 at 05:22:01PM +0100, Paul Jakma wrote:
> ip route add default via <gateway> src <virtual ip>

Side note: unfortunately, "src <...>" doesn't work for IPv6 routes, at
least in 2.4 -- can someone confirm whether this problem still exists in
current 2.6?

Copying netdev for the record.


Regards,
Daniel

2004-09-06 03:24:30

by David Miller

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Mon, 6 Sep 2004 01:02:19 +0200
Daniel Roesen wrote:

> On Thu, Sep 02, 2004 at 05:22:01PM +0100, Paul Jakma wrote:
> > ip route add default via <gateway> src <virtual ip>
>
> Side note: unfortunately, "src <...>" doesn't work for IPv6 routes, at
> least in 2.4 -- can someone confirm whether this problem still exists in
> current 2.6?

That's right, no routing by source in ipv6 yet, the folks
working with the USAGI guys on MIPV6 support will add the
feature.

2004-09-06 07:50:45

by Einar Lueck

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

Paul Jakma wrote:

> You could try:
>
> ip route change default via <gateway> src <vipa>
>
> Presuming the NFS clients are behind a gateway. If also onlink, you
> need to modify the connected routes and change the src there too.
>
We have tried it this way, of course, and it works. But the
corresponding routes are static. In our context, all routes have to be
dynamic (for fault tolerance and due to configuration overhead). Zebra
and OSPFD do not support this type of route. As a consequence, this
approach does not work for us. Anyway, iptables is the way to do it.

Regards,

Einar.

2004-09-06 07:58:00

by Paul Jakma

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Mon, 6 Sep 2004, Einar Lueck wrote:

> We have tried it this way, of course, and it works. But the
> corresponding routes are static. In our context, all routes have to
> be dynamic (for fault tolerance and due to configuration overhead).

then change the 'connected' routes. that's all you need to affect
source address selection. All your other dynamic routes will still
work.

> Zebra and OSPFD do not support this type of route.

ospfd never would. You could add the required support to zebra if you
wished. ;)

> As a consequence, this approach does not work for us. Anyway,
> iptables is the way to do it.

One way, but quite a kludge to effectively NAT local machines. You'd
be much better off configuring the routes which affect source address
selection to taste.

anyway..

regards,
--
Paul Jakma Key ID: 64A2FF6A
Fortune:
All of the animals except man know that the principal business of life is
to enjoy it.

2004-09-07 16:09:34

by Herbert Poetzl

Subject: Re: [PATCH] net/ipv4 for Source VIPA support, kernel BK Head

On Fri, Sep 03, 2004 at 10:07:29AM +0200, Einar Lueck wrote:
> On Donnerstag, 2. September 2004 22:59 Paul Jakma wrote:
> >
> > I don't see why it wouldn't work; it almost undoubtedly will work for
> > NFS over TCP. And any problems that cause it not to work would be best
> > taken up on the linux-nfs list in order to have a "bind to address"
> > option added to knfsd.
>
> I just set up the loopback interface via ZEBRA/OSPF as You described it
> and checked via tcpdump the source IP address of the related NFS packets.
> The kernel chooses the IP address of the NIC it routes the packets over as
> the source IP address and not the Source VIPA configured for loopback.
>
> You are right, it would be one option to have a "bind to address" in KNFSD.
> But our idea was to implement a feature well known from other operating
> systems like AIX to Linux because this feature is quite popular and liked
> especially by large customers. As You have read for sure such a feature
> adding redundant functionality to the kernel is not desired. So maybe we
> should continue our discussion privately. Thanks for Your suggestions!
>
> >
> > Why could it not be solved? And why is the answer not "ask the knfsd
> > people to provide bind-to-ip option"?
> >
>
> We would win a facility allowing for a Source VIPA for all
> kinds of servers not offering an explicit bind option. So: Due to the
> feature port idea mentioned above.

btw, something very similar is implemented and used
by linux-vserver (it's called chbind) to restrict
0.0.0.0 (INADDR_ANY) binds to specified address(es)

if you need more details, just let me know ...
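
For readers unfamiliar with the distinction, the difference between a wildcard (INADDR_ANY) bind and an address-specific bind can be sketched as follows. This only illustrates the two kinds of bind on the loopback interface; it is not a description of how chbind is implemented, and the function name local_address_of_accepted is hypothetical.

```python
import socket


def local_address_of_accepted(bind_ip, connect_ip):
    """Bind a listener to bind_ip, connect to it via connect_ip, and return
    (address the listener is bound to, local address of the accepted socket).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind((bind_ip, 0))
        listener.listen(1)
        listen_ip, port = listener.getsockname()
        with socket.create_connection((connect_ip, port), timeout=5.0):
            conn, _peer = listener.accept()
            with conn:
                # For an accepted TCP connection, the local address is the
                # address the client connected to, even for a wildcard bind.
                return listen_ip, conn.getsockname()[0]


if __name__ == "__main__":
    # Wildcard bind: reachable on every local address.
    print(local_address_of_accepted("0.0.0.0", "127.0.0.1"))
    # Specific bind: reachable only on that one address.
    print(local_address_of_accepted("127.0.0.1", "127.0.0.1"))
```

The wildcard listener accepts connections on any local address, while the specific one is reachable only on the address it was bound to; restricting wildcard binds means turning the former into the latter.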

best,
Herbert

> > But on a server, the packets that go out tend to be replies to
> > requests. Or at least, the only packets of interest are replies. It's
> > a rare server that just off its own bat goes and talks to clients
> > which have not communicated first with the server before.
>
> The enterprise customers we care about have, for example, servers
> that utilize other servers (application servers utilizing a database or
> an NFS server, etc.). So to generate replies, these servers need
> replies from other servers.
>
> >
> > Anyway, even if the server for some reason initiated traffic, the
> > correct answer surely is "modify the server to bind to a specific
> > address", no?
>
> As mentioned above ;)
>
> >
> > > Bonding offers a failover facility. For more details, please refer to:
> > > Documentation/networking/bonding.txt in the kernel tree.
> >
> > Right, but what does bonding (layer 2) have to do with virtual IPs
> > and IP source address?
> >
>
> If we focus for a moment just on the NIC-fail-over issue (not caring
> about layers, virtual IPs, etc.) then bonding offers the desired failover with
> some restriction. This is the reason why I mentioned it in this context.
>
> Again, thanks for Your suggestions and maybe we should continue our
> discussion privately.
>
> Regards
>
> Einar.