Inter Process Networking:
a kernel module (and some simple kernel patches) to provide
AF_IPN: a new address family for process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).
WHAT IS IT?
-----------
Berkeley sockets were designed for client-server or point-to-point
communication. All existing Address Families implement this idea.
IPN is a new address family designed for one-to-many, many-to-many and
peer-to-peer communication among processes.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.
On IPN, processes can interoperate using real networking protocols
(e.g. Ethernet) but also using application-defined protocols (maybe
just sending ASCII strings, video or audio frames, etc.).
IPN provides networking (in the broadest sense you can imagine) to
the processes. Processes can be Ethernet nodes, run their own TCP-IP stacks
if they like (e.g. virtual machines), mount ATA-over-Ethernet disks, etc.
IPN networks can be interconnected with real networks, and IPN networks
running on different computers can interoperate (they can be connected by
virtual cables).
IPN is part of the Virtual Square Project (vde, lwipv6, view-os,
umview/kmview, see wiki.virtualsquare.org).
WHY?
----
Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or more processes feed their
data to several consuming processes (which may join the stream at run time).
IPN sockets can also be connected to tap (tuntap)-like interfaces or
to real interfaces (like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.
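For illustration only, attaching a tap interface could look like the
sketch below; IPN_JOIN_NETDEV is an invented placeholder name, not the
module's actual ioctl, and the real interface is documented with the
experimental code.

#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL: attach a tap interface named "ipn0" to a bound IPN socket */
void attach_tap(int s)
{
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "ipn0", IFNAMSIZ);
        ioctl(s, IPN_JOIN_NETDEV, &ifr);   /* invented request code */
}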
Several existing services could be implemented (often with extended
features) on top of IPN:
- kernel bridge
- tuntap
- macvlan
IPN could be used (IMHO) to provide multicast services to processes.
Audio or video frames could be multiplexed so that multiple
applications can use them. I think that something like Jack could be
implemented on top of IPN. Something like a VideoJack could
provide video frames to several applications: e.g. the same image from a
camera can be viewed by xawtv, recorded, and sent to a streaming service.
IPN sockets can be used wherever there is the idea of a broadcast channel,
i.e. where processes can "join (and leave) the information flow" at
runtime.
Different delivery policies can be defined as IPN protocols (loaded
as submodules of ipn.ko).
E.g. an Ethernet switch is a policy (kvde_switch.ko: packets are
delivered unicast if the MAC address is already in the switching hash
table). We are designing an extended switch, full of interesting
features like our userland vde_switch (with vlan/fst/management etc.),
and a layer-3 switch, but other policies can be defined to implement
the specific requirements of other services. I feel there is no limit
to creativity in multicast services for processes.
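To make the idea concrete, a delivery policy could plausibly be shaped
like the sketch below. All names here are invented for illustration;
the real submodule interface is whatever ipn.ko in the VDE svn tree
defines.

/* HYPOTHETICAL sketch of a delivery-policy submodule interface */
struct ipn_network;                       /* one bus/network instance */

struct ipn_policy {
        /* Inspect a message and mark the recipient ports in the bitmap.
         * A hub marks every port but the sender's; a switch marks only
         * the port that owns the destination MAC address. */
        int (*handle_msg)(struct ipn_network *net,
                          const void *msg, int len,
                          unsigned long *recipient_bitmap);
};

/* a submodule (e.g. kvde_switch.ko) would register its tag at load time */
int ipn_register_policy(int protocol_tag, struct ipn_policy *p);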
Userspace services (like vde or jack) do exist, but IPN provides
faster, unified support.
HOW?
----
The complete specifications for IPN can be found here:
http://wiki.virtualsquare.org/index.php/IPN
bind() creates the socket (if it does not already exist). When bind() succeeds,
the process has the right to manage the "network".
No data can be received or sent while the socket is not connected
(only get/setsockopt and ioctl work on bound, unconnected sockets).
connect() is used to join the network. When the socket is connected it
is possible to send/receive data. If the socket is already bound it is
useless to specify the address again (you can use NULL, or specify the same
address).
connect() can also be used without bind(). In this case the process sends and
receives data but it cannot manage the network (here the socket
address specification is required).
listen() and accept() are for servers, thus they do not exist for IPN.
Examples:
1- Peer-to-Peer Communication:
Several processes run the same code:
struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
err=connect(s,NULL,0);
In this case all the messages sent by each process get received by all the
other processes (IPN_BROADCAST).
The processes must be ready to receive data when there are pending packets,
e.g. by using poll/select with event-driven programming, or multithreading,
as in the sketch below.
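As a concrete illustration, here is a minimal event-loop sketch for the
peer code above. It assumes the AF_IPN and IPN_BROADCAST definitions
from the experimental module's headers, since they are not in the
standard system headers.

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_un sun = {.sun_family = AF_IPN,
                                  .sun_path = "/tmp/sockipn"};
        struct pollfd pfd;
        char buf[2048];
        int s = socket(AF_IPN, SOCK_RAW, IPN_BROADCAST);

        if (s < 0 || bind(s, (struct sockaddr *)&sun, sizeof(sun)) < 0 ||
            connect(s, NULL, 0) < 0)
                return 1;

        send(s, "hello", 5, 0);          /* every peer can also send */
        pfd.fd = s;
        pfd.events = POLLIN;
        for (;;) {                       /* event loop: wait, then read */
                if (poll(&pfd, 1, -1) < 0)
                        break;
                if (pfd.revents & POLLIN) {
                        ssize_t n = recv(s, buf, sizeof(buf), 0);
                        if (n < 0)
                                break;
                        printf("received %zd bytes\n", n);
                }
        }
        close(s);
        return 0;
}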
2- (One or) Some senders/many receivers
The sender runs the following code:
struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
err=shutdown(s,SHUT_RD);
err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
err=connect(s,NULL,0);
The receivers do not need to define the network, thus they skip the bind():
struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
err=shutdown(s,SHUT_WR);
err=connect(s,(struct sockaddr *)&sun,sizeof(sun));
In the previous examples processes can send and receive any kind of
data.
When the messages are Ethernet packets (maybe from virtual machines),
IPN works like a hub when using the IPN_BROADCAST protocol.
Different protocols (delivery policies) can be specified by replacing
IPN_BROADCAST with a different tag.
An IPN protocol-specific submodule must have registered the protocol
tag in advance (e.g. when kvde_switch.ko is loaded, IPN_VDESWITCH can
be used too).
The basic broadcasting protocol IPN_BROADCAST is built-in (all the
messages get delivered to all the connected processes but the sender).
IPN sockets use the filesystem for naming and access control.
srwxr-xr-x 1 renzo renzo 0 2007-12-04 22:28 /tmp/sockipn
An IPN socket appears in the filesystem like a UNIX socket.
r/w permissions represent the right to receive data from/send data to the
socket. The 'x' permission represents the right to manage the socket.
connect() automatically shuts down SHUT_WR or SHUT_RD if the user lacks
the corresponding right.
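For example, a manager could make a network group-writable and
world-readable with an ordinary chmod (a sketch; the path is the one
from the examples above):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        /* owner: rwx = manage+send+receive; group: rw- = send+receive;
         * others: r-- = receive only, so connect() applies SHUT_WR to them */
        if (chmod("/tmp/sockipn", 0764) < 0)
                perror("chmod");
        return 0;
}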
WHAT WE NEED FROM THE LINUX KERNEL COMMUNITY
--------------------------------------------
0- (Constructive) comments.
1- The "official" assignment of an Address Family.
(It is enough for everything but interface grabbing, see 2)
in include/linux/net.h:
- #define NPROTO 34 /* should be enough for now.. */
+ #define NPROTO 35 /* should be enough for now.. */
in include/linux/socket.h
+ #define AF_IPN 34
+ #define PF_IPN AF_IPN
- #define AF_MAX 34 /* For now.. */
+ #define AF_MAX 35 /* For now.. */
This seems to be quite simple.
2- Another "grabbing hook" for interfaces (like the ones already
existing for the kernel bridge and for the macvlan).
In include/linux/netdevice.h:
among the fields of struct net_device:
/* bridge stuff */
struct net_bridge_port *br_port;
/* macvlan */
struct macvlan_port *macvlan_port;
+ /* ipn */
+ struct ipn_node *ipn_port;
/* class/net/name entry */
struct device dev;
In net/core/dev.c, we need another section for grabbing packets....
like the ones defined for CONFIG_BRIDGE and CONFIG_MACVLAN.
I can write the patch (it needs just tens of minutes of cut&paste).
We are studying some way to register/deregister grabbing services;
I feel this would be the cleanest way.
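Such a registration facility might look roughly like the following.
This is a purely hypothetical sketch: none of these names exist in the
kernel; they only illustrate the shape of the API we have in mind.

/* HYPOTHETICAL: a generic replacement for the hardcoded bridge/macvlan hooks */
struct sk_buff;
struct net_device;

struct netdev_grab_ops {
        /* return NULL if the skb was consumed, or the skb itself to let
         * it continue up the normal receive path */
        struct sk_buff *(*grab)(struct net_device *dev, struct sk_buff *skb);
};

int netdev_register_grab(struct net_device *dev,
                         const struct netdev_grab_ops *ops);
void netdev_unregister_grab(struct net_device *dev);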
WHERE?
------
There is an experimental version in the VDE svn tree.
http://sourceforge.net/projects/vde
The current implementation can be compiled as a module on Linux >= 2.6.22.
We have currently "stolen" the AF_RXRPC number and the kernel bridge hook,
thus this experimental implementation is incompatible with RXRPC and the
kernel bridge (it shares the same data structures). This is just to show
the effectiveness of the idea; this way it can be compiled as a module
without patching the kernel.
We'll migrate IPN to its own AF and grabbing hook as soon as they
have been defined.
renzo
(V^2 project)
On Wed, 5 Dec 2007 17:40:55 +0100
[email protected] (Renzo Davoli) wrote:
> WHAT WE NEED FROM THE LINUX KERNEL COMMUNITY
> --------------------------------------------
> 0- (Constructive) comments.
> 1- The "official" assignment of an Address Family.
> [...]
> 2- Another "grabbing hook" for interfaces (like the ones already
> existing for the kernel bridge and for the macvlan).
> [...]
> WHERE?
> ------
> There is an experimental version in the VDE svn tree.
> http://sourceforge.net/projects/vde
Post complete source code for kernel part to [email protected].
If you want the hooks, you need to include the full source code for inclusion
in mainline. All the Documentation/SubmittingPatches rules apply;
you can't just ask for "facilitators" and expect to keep your stuff out of tree.
[email protected] (Renzo Davoli) writes:
> Berkeley sockets were designed for client-server or point-to-point
> communication. All existing Address Families implement this idea.
Netlink is multicast/broadcast by default, for one. And BC/MC certainly
works for IPv[46] and a couple of other protocols too.
> IPN is an Inter Process Communication paradigm where all the processes
> appear as if they were connected by a networking bus.
Sounds like netlink. See also RFC 3549
Haven't read further I admit.
-Andi
On Thu, Dec 06, 2007 at 12:39:22AM +0100, Andi Kleen wrote:
> [email protected] (Renzo Davoli) writes:
>
> > Berkeley sockets were designed for client-server or point-to-point
> > communication. All existing Address Families implement this idea.
> Netlink is multicast/broadcast by default, for one. And BC/MC certainly
> works for IPv[46] and a couple of other protocols too.
>
> > IPN is an Inter Process Communication paradigm where all the processes
> > appear as if they were connected by a networking bus.
>
> Sounds like netlink. See also RFC 3549
RFC 3549 says:
"This document describes Linux Netlink, which is used in Linux both as
an intra-kernel messaging system as well as between kernel and user
space."
We know AF_NETLINK, our user-space stack lwipv6 supports it.
AF_IPN is different.
AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX.
It supports communication among *user* processes.
Example:
Qemu, User-Mode Linux, Kvm, and our umview machines can use IPN as an
Ethernet hub and communicate among themselves, with the hosting computer,
and with the world through a tap-like interface.
You can also grab an interface (say eth1) and use eth0 for your hosting
computer and eth1 for the IPN network of virtual machines.
If you load the kvde_switch submodule IPN can be a virtual Ethernet switch.
This example is already working using the svn versions of ipn and
vdeplug.
Another Example:
You have a continuous stream of data packets generated by a process,
and you want to send this data to many processes.
Maybe the set of processes is not known in advance; you want to send the
data to any interested process. Some kind of publish&subscribe
communication service (among unix processes, not on TCP-IP).
Without IPN you need a server. With IPN the sender creates the socket,
connects to it, and feeds it with data packets. All the interested
receivers connect to it and start reading. That's all.
I hope that this message can give a better understanding of what IPN is.
renzo
On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> On Wed, 5 Dec 2007 17:40:55 +0100
> [email protected] (Renzo Davoli) wrote:
> > 0- (Constructive) comments.
> > 1- The "official" assignment of an Address Family.
> > 2- Another "grabbing hook" for interfaces (like the ones already
> > We are studying some way to register/deregister grabbing services,
> > I feel this would be the cleanest way.
>
> Post complete source code for kernel part to [email protected].
I'll do it as soon as possible.
> If you want the hooks, you need to include the full source code for inclusion
> in mainline. All the Documentation/SubmittingPatches rules apply;
> you can't just ask for "facilitators" and expect to keep your stuff out of tree.
I am sorry if I was misunderstood.
I did not want any "facilitator", nor did I want to keep my code outside
the kernel, quite the contrary.
It is perfectly okay for me to provide the entire code for inclusion.
The purposes of my message were the following:
- I wanted to introduce the idea and say to the linux kernel community
that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF?
  Is there an "AF registry" somewhere? (like the device major/minor
  registry or the well-known port assignment for TCP-IP).
- Hook: we have two different options. We can add another grabbing
inline function like those used by the bridge and macvlan or we can
design a grabbing service registration facility. Which one is preferable?
The former is simpler; the latter is more elegant but it requires some
changes in the kernel bridge code.
So the choice is between the former (less invasive, safer, inelegant) and
the latter (more invasive, less safe, elegant).
We need a bit of time to stabilize the code: deeply testing the existing
features and implementing some more ideas we have on it.
In the meanwhile we would be grateful if the community could kindly ask to the
questions above.
renzo
> In the meanwhile we would be grateful if the community could kindly ask to the
> questions above.
Obviously I meant:
In the meanwhile we would be grateful if the community could kindly *answer*
to the questions above
sorry (it is early morning here, it happens ;-)
renzo
On Thu, 6 Dec 2007 06:38:21 +0100
[email protected] (Renzo Davoli) wrote:
> On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> [...]
> > Post complete source code for kernel part to [email protected].
> I'll do it as soon as possible.
> > If you want the hooks, you need to include the full source code for inclusion
> > in mainline. All the Documentation/SubmittingPatches rules apply;
> > you can't just ask for "facilitators" and expect to keep your stuff out of tree.
> I am sorry if I was misunderstood.
> I did not want any "facilitator", nor did I want to keep my code outside
> the kernel, quite the contrary.
Great
> It is perfectly okay for me to provide the entire code for inclusion.
> The purposes of my message were the following:
> - I wanted to introduce the idea and say to the linux kernel community
> that a team is working on it.
> - Address family: is it okay to send a patch that adds a new AF?
>   Is there an "AF registry" somewhere? (like the device major/minor
>   registry or the well-known port assignment for TCP-IP).
The usual process is to just add the value as part of the patchset.
You then need to tell the glibc maintainers so it gets included appropriately
in userspace.
> - Hook: we have two different options. We can add another grabbing
> inline function like those used by the bridge and macvlan or we can
> design a grabbing service registration facility. Which one is preferable?
The problems with making it a registration facility are:
 * risk of making it easier for non-GPL out-of-tree abuse
 * possible ordering issues: i.e. by hardcoding each hook, the
   behaviour is defined in the case of multiple usages on the same
   machine.
> The former is simpler; the latter is more elegant but it requires some
> changes in the kernel bridge code.
Not a big deal, but see above.
> So the choice is between the former (less invasive, safer, inelegant) and
> the latter (more invasive, less safe, elegant).
> We need a bit of time to stabilize the code: deeply testing the existing
> features and implementing some more ideas we have on it.
> In the meanwhile we would be grateful if the community could kindly ask to the
> questions above.
I am a believer in review early and often. It is easier to just deal with
the nuisance issues (style, naming, configuration) at the beginning rather
than at the final stage of the project.
On Dec 06, 2007, at 00:30:16, Renzo Davoli wrote:
> AF_IPN is different. AF_IPN is the broadcast and peer-to-peer
> extension of AF_UNIX. It supports communication among *user*
> processes.
Ok, you say it's different, but then you describe how IP unicast and
broadcast work. Both are frequently used for communication among
"*user* processes". Please provide significantly more details about
exactly *how* it's different.
> Example:
>
> Qemu, User-Mode Linux, Kvm, and our umview machines can use IPN as an
> Ethernet hub and communicate among themselves, with the hosting
> computer, and with the world through a tap-like interface.
You say "tap-like" interface, but people do this already with
existing infrastructure. You can connect Qemu, UML, and KVM to a
standard Linux "tap" interface, and then use the standard Linux
bridging code to connect the "tap" interface to your existing network
interfaces. Alternatively you could use the standard and well-tested
IP routing/firewalling/NAT code to move your packets around. None of
this requires new network infrastructure in the slightest. If you
have problems with the existing code, please improve it instead of
creating a slightly incompatible replacement which has different bugs
and workarounds.
> You can also grab an interface (say eth1) and use eth0 for your
> hosting computer and eth1 for the IPN network of virtual machines.
You can do that already with the bridging code.
> If you load the kvde_switch submodule IPN can be a virtual Ethernet
> switch.
As I described above, this can be done with the existing bridging and
tun/tap code.
> Another Example:
>
> You have a continuous stream of data packets generated by a
> process, and you want to send this data to many processes. Maybe
> the set of processes is not known in advance; you want to send the
> data to any interested process. Some kind of publish&subscribe
> communication service (among unix processes, not on TCP-IP). Without
> IPN you need a server. With IPN the sender creates the socket,
> connects to it, and feeds it with data packets. All the interested
> receivers connect to it and start reading. That's all.
This is already done frequently in userspace. Just register a port
number with IANA on which to implement a "registration" server and
write a little daemon to listen on 127.0.0.1:${YOUR_PORT}. Your
interconnecting programs then use either unicast or multicast sockets
to bind, then report to the registration server what service you are
offering and what port it's on. Your "receivers" then connect to the
registration server, ask what port a given service is on, and then
multicast-listen or unicast-connect to access that service. The best
part is that all of the performance implications are already
thoroughly understood. Furthermore, if you want to extend your
communication protocol to other hosts as well, you just have to
replace the 127.0.0.1 bind with a global bind. This is exactly how
the standard-specified multiple-participant "SIP" protocol works, for
example.
So if you really think this is something that belongs in the kernel
you need to provide much more detailed descriptions and use-cases for
why it cannot be implemented in user-space or with small
modifications to existing UDP/TCP networking.
Cheers,
Kyle Moffett
Kyle Moffett wrote:
> On Dec 06, 2007, at 00:30:16, Renzo Davoli wrote:
>> AF_IPN is different. AF_IPN is the broadcast and peer-to-peer
>> extension of AF_UNIX. It supports communication among *user* processes.
>
> Ok, you say it's different, but then you describe how IP unicast and
> broadcast work.
Renzo also described something new (in the socket() arena): the
multi-reader, multi-writer is just not available in IP.
I wonder if this solves the same problem as D-Bus?
> So if you really think this is something that belongs in the kernel
> you need to provide much more detailed descriptions and use-cases for
> why it cannot be implemented in user-space or with small modifications
> to existing UDP/TCP networking.
I would strengthen this sentiment: If you think something belongs in the
kernel, you need to argue your case (provide much more detailed
descriptions and use-cases.)
> Renzo also described something new (in the socket() arena): the
> multi-reader, multi-writer is just not available in IP.
How is that different from a multicast group?
-Andi
> "This document describes Linux Netlink, which is used in Linux both as
> an intra-kernel messaging system as well as between kernel and user
> space."
It can be used between user space daemons as well. In fact it is.
e.g. they often listen to each other's messages.
-Andi
Andi Kleen wrote:
>>"This document describes Linux Netlink, which is used in Linux both as
>> an intra-kernel messaging system as well as between kernel and user
>> space."
>
>
> It can be used between user space daemons as well. In fact it is.
> e.g. they often listen to each other's messages.
One problem we ran into was that there are only 32 multicast groups per
netlink protocol family.
We had a situation where we could have used netlink, but we needed the
equivalent of thousands of multicast groups. Latency was very
important, so we ended up doing essentially a multicast unix socket
rather than taking the extra penalty for UDP multicast.
Chris
> Latency was very
> important, so we ended up doing essentially a multicast unix socket
> rather than taking the extra penalty for UDP multicast.
What extra penalty? Local UDP shouldn't be much more expensive than Unix.
-Andi
Andi Kleen wrote:
>>Latency was very
>>important, so we ended up doing essentially a multicast unix socket
>>rather than taking the extra penalty for UDP multicast.
>
>
> What extra penalty? Local UDP shouldn't be much more expensive than Unix.
On a 1.4GHz P4 I measured a 44% increase in latency between a unix
datagram and a UDP datagram.
For UDP it has to go down the udp stack, then the ip stack, then through
the routing tables and back up the receive side.
The unix socket just hashes to get the destination and delivers the message.
Chris
On Thu, Dec 06, 2007 at 03:49:51PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> >>Latency was very
> >>important, so we ended up doing essentially a multicast unix socket
> >>rather than taking the extra penalty for UDP multicast.
> >
> >
> >What extra penalty? Local UDP shouldn't be much more expensive than Unix.
>
> On a 1.4GHz P4 I measured a 44% increase in latency between a unix
> datagram and a UDP datagram.
That's weird.
>
> For UDP it has to go down the udp stack, then the ip stack, then through
UDP doesn't really have much stack. IP is also very little overhead
assuming a cached route (connect called first).
I would expect the copies to dominate in both cases.
-Andi
Some more explanations trying to describe what IPN is and what it is
useful for. We are writing the complete patch....
Summary:
* IPN is for inter-process communication. It is *not* directly related
to TCP-IP or Ethernet.
* IPN itself is a *level 1* virtual physical network.
* IPN services (like AF_UNIX) do not require root privileges.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.
----
* IPN is for inter-process communication. It is *not* directly related
to TCP-IP or Ethernet.
If you want you can call it Inter Process Bus Communication. It is an
extension of AF_UNIX. Comments saying that some services can be
implemented by using TCP-IP multicast protocols are unrelated to IPN.
All AF_UNIX services could be implemented as TCP-IP services on
127.0.0.1. Should we abolish AF_UNIX, then? The problem is that to use
TCP-IP you'd need to wrap the packets with TCP or UDP, IP and Ethernet
headers, and the stack would waste time managing useless protocols. If you
just want to send strings to a set of local processes, TCP-IP is
overkill. Even X Window uses AF_UNIX sockets to talk with local
clients; it is a performance issue... I think Chris is right.
* IPN itself is a *level 1* virtual physical network.
Like any physical network you can run higher level protocols on it, thus
Ethernet, and then TCP-IP can be services you can run on IPN, but there
can be IPN networks running neither TCP-IP nor Ethernet.
* IPN services (like AF_UNIX) do not require root privileges.
There are many communication services where users need broadcast or
p2p among user processes. If a user (not root) wants to run several
User-Mode Linux, Qemu, or Kvm VMs, the only way to have them connected
together is our Virtual Distributed Ethernet. (For this reason VDE
exists in almost all the distros, has been ported to other OSs, and is
already supported in the Linux kernel for User-Mode Linux.) VDE is a
userland daemon, hence requires two context switches to deliver a
packet: VM1 -> K -> Daemon -> K -> VM2. Kvde running on IPN needs just
one: VM1 -> K -> VM2. I think D-Bus could use IPN, too; the same cutoff
of context switches applies. May I speculate that there will be a
significant increase in performance? *nix systems are multiuser: there
exist people who need to set up services without root access. And even
if you have root access, the less you need to work as root, the safer
your system is.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.
Some IPN networks do use Ethernet as their data-link protocol. It is useful
to provide means to connect the IPN socket to a virtual (TAP) interface
or to a real (GRAB) interface. I know that a lot of people use tap
interfaces and the kernel bridge to connect virtual machines. The
access can be restricted to some users or processes by iptables, but it
is not as simple as a chmod on the socket. A lot of people also use tunctl
to define a priori tap interfaces for users. They must define as many
tuntap interfaces as the number of VMs the users may want, and each user
has his/her own taps. Some users run a userland VDE switch to
interconnect their VMs. IPN itself could use a userland process to
define a standard TAP interface and waste its time and cpu cycles
moving packets from tap to ipn and vice versa. But IPN is already
kernel code, so all those context switches and cpu cycles can be saved
by accessing the tap or grabbed interface directly from the kernel.
(TAP and GRAB obviously require CAP_NET_ADMIN.) Using IPN with TAP you
can define one single TAP interface connected to an IPN socket. Several
VMs can use that IPN socket; in this way the VMs are connected by a
(virtual Ethernet) network which includes the TAP interface. The access
control to the network (and then to the TAP) is done by setting the
permissions on the socket. Tunctl is *not* able to create a tap where
all the users belonging to a group can start their VMs. IPN can.
Andi Kleen wrote:
>> Renzo also described something new (in the socket() arena): the
>> multi-reader, multi-writer is just not available in IP.
>>
>
> How is that different from a multicast group?
>
Good question. I don't know much about multicast IP; it's a bit new
for me. I knew it uses Martian addresses! After a little reading, I
now know that it does allow many-to-many communication.
Renzo's IPN is a local protocol--you can't multicast to localhost.
On Thu, Dec 06, 2007 at 11:18:37PM +0100, Renzo Davoli wrote:
> * IPN is for inter-process communication. It is *not* directly related
> to TCP-IP or Ethernet.
>
> If you want you can call it Inter Process Bus Communication. It is an
> extension of AF_UNIX. Comments saying that some services can be
> implemented by using TCP-IP multicast protocols are unrelated to IPN.
> All AF_UNIX services could be implemented as TCP-IP services on
> 127.0.0.1. Should we abolish AF_UNIX, then? The problem is that to use
> TCP-IP you'd need to wrap the packets with TCP or UDP, IP and Ethernet
No ethernet headers on localhost. Just to give you a perspective:
IP+TCP headers are 50 bytes (with timestamps) and IP+UDP is 28 bytes.
On the other hand the sk_buff+skb_shared_info header which are used for
all socket communication in Linux and have to be mostly set up always
are 192+312bytes on 64bit [parts of the 312 bytes is an array that is
typically only partly used] or 156+236 bytes on 32bit. So the network
headers dwarf the internal data structures.
There might be other reasons why TCP/IP is slower, but arguing
with the size of the headers is just bogus.
My personal feeling would be that if TCP/IP is too slow for something
it is better to just improve the stack than to add a completely
new socket family. That will benefit many more applications without
requiring changes to them.
About the only good reason to use UNIX sockets is when you need to use
file system permissions.
> * IPN services (like AF_UNIX) do not require root privileges.
>
> There are many communication services where users need broadcast or
> p2p among user processes. If a user (not root) wants to run several
IP Multicast, when properly set up, also doesn't need root.
Broadcast is kind of obsolete anyway.
> User-Mode Linux, Qemu, or Kvm VMs, the only way to have them connected
> together is our Virtual Distributed Ethernet. (For this reason VDE
They could easily just tunnel over a local multicast group for example.
-Andi
> Renzo's IPN is a local protocol--you can't multicast to localhost.
You don't need to. All local clients can join the same group without
using localhost.
-Andi
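As an illustration of the approach Andi describes, a minimal sketch of
an unprivileged local multicast subscriber (the group address and port
are arbitrary examples; TTL 0 keeps sent traffic on this host):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sin = {.sin_family = AF_INET,
                                  .sin_port = htons(7777)};
        struct ip_mreq mreq;
        unsigned char ttl = 0;    /* do not leave this host */

        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr *)&sin, sizeof(sin));

        memset(&mreq, 0, sizeof(mreq));
        mreq.imr_multiaddr.s_addr = inet_addr("239.255.0.1");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));
        /* now recvfrom() the group to subscribe, sendto() it to publish */
        return 0;
}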
Andi Kleen wrote:
>>On a 1.4GHz P4 I measured a 44% increase in latency between a unix
>>datagram and a UDP datagram.
> That's weird.
I just reran on a 3.2GHz P4 running 2.6.11 (Fedora Core 4): a 42% latency
increase.
For stream sockets, unix gives approximately a 62% bandwidth increase
over tcp. (Although tcp could probably be tuned to do better than this.)
Chris
On Thu, Dec 06, 2007 at 05:02:40PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
>
> >>On a 1.4GHz P4 I measured a 44% increase in latency between a unix
> >>datagram and a UDP datagram.
>
> >That's weird.
>
> I just reran on a 3.2GHz P4 running 2.6.11 (Fedora Core 4): a 42% latency
> increase.
Sounds like something that should be looked into. I know of no
principal reasons for that.
> For stream sockets, unix gives approximately a 62% bandwidth increase
> over tcp. (Although tcp could probably be tuned to do better than this.)
How long a stream did you test? You might be measuring slow start.
-Andi
Andi Kleen wrote:
> On Thu, Dec 06, 2007 at 05:02:40PM -0600, Chris Friesen wrote:
>>I just reran on a 3.2GHz P4 running 2.6.11 (Fedora Core 4): a 42% latency
>>increase.
> Sounds like something that should be looked into. I know of no
> principal reasons for that.
>>For stream sockets, unix gives approximately a 62% bandwidth increase
>>over tcp. (Although tcp could probably be tuned to do better than this.)
> How long a stream did you test? You might be measuring slow start.
No idea. These are just the standard local networking tests in lmbench
v2. In our case the latency was the big concern and we were using
datagrams anyway.
Chris
I have done some raw tests
(you can read the code here: http://www.cs.unibo.it/~renzo/rawperftest/).
The programs are quite simple: the sender sends "Hello World" as fast as
it can, while the receiver prints time() after every million messages
received.
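Roughly, the two programs do the following (a sketch reconstructed from
the description above, not the actual code at the URL):

#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

/* sender: blast fixed-size messages as fast as possible */
void sender(int s)
{
        for (;;)
                send(s, "Hello World", 12, 0);
}

/* receiver: print a timestamp after every million messages */
void receiver(int s)
{
        char buf[64];
        long count = 0;

        while (recv(s, buf, sizeof(buf), 0) > 0)
                if (++count % 1000000 == 0)
                        printf("%ld %ld\n", count, (long)time(NULL));
}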
On my laptop, tests on 20,000,000 "Hello World" packets:
One receiver:
  multicast 244,000 msg/sec
  IPN       333,000 msg/sec (36% faster)
Two receivers:
  multicast 174,000 msg/sec
  IPN       250,000 msg/sec (43% faster)
Apart from this, how could I implement policies over a multicast
socket? E.g. how does a kernel VDE switch work on multicast sockets?
If I send an Ethernet packet over a multicast socket it can emulate
just a hub. (It seems quite unnatural to me to have TCP/UDP over IP
over Ethernet over UDP over IP - okay, we can skip the Ethernet on
localhost, and long Ethernet frames get fragmented, but... details.)
On a multicast socket you cannot use policies; I mean that an IPN
network (or bus, or group) can have a policy that reads some info in
the packet to decide the set of recipients.
For a vde_switch it is the destination MAC address, when found in the
MAC hash table, that selects the recipient port. For MIDI communication
it could be the channel number....
Moving the switching fabric to userland, the performance figures are
quite different.
renzo
From: "Chris Friesen" <[email protected]>
Date: Thu, 06 Dec 2007 14:36:54 -0600
> One problem we ran into was that there are only 32 multicast groups per
> netlink protocol family.
I'm pretty sure we've removed this limitation.
David Miller wrote:
> From: "Chris Friesen" <[email protected]>
> Date: Thu, 06 Dec 2007 14:36:54 -0600
>
>
>>One problem we ran into was that there are only 32 multicast groups per
>>netlink protocol family.
>
>
> I'm pretty sure we've removed this limitation.
As of 2.6.23 nl_groups is a 32-bit bitmask with one bit per group.
Also, it appears that only root is allowed to use multicast netlink.
Chris
"Chris Friesen" <[email protected]> writes:
> David Miller wrote:
>> From: "Chris Friesen" <[email protected]>
>>> One problem we ran into was that there are only 32 multicast groups
>>> per netlink protocol family.
>> I'm pretty sure we've removed this limitation.
> As of 2.6.23 nl_groups is a 32-bit bitmask with one bit per
> group.
Use setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, ...) to
join an arbitrary Netlink multicast group.
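For example, a minimal sketch of that call (NETLINK_ROUTE and
RTNLGRP_LINK are arbitrary choices here; the point is that the group id
is passed as a plain int rather than as a bit in a 32-bit mask):

#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(void)
{
        int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        int group = RTNLGRP_LINK;    /* group id as an int, so families
                                        with more than 32 groups work */

        if (s < 0 || setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
                                &group, sizeof(group)) < 0)
                perror("netlink membership");
        return 0;
}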
--
"A computer is a state machine.
Threads are for people who cant [sic] program state machines."
--Alan Cox
From: "Chris Friesen" <[email protected]>
Date: Thu, 06 Dec 2007 22:21:39 -0600
> David Miller wrote:
> > From: "Chris Friesen" <[email protected]>
> > Date: Thu, 06 Dec 2007 14:36:54 -0600
> >
> >
> >>One problem we ran into was that there are only 32 multicast groups per
> >>netlink protocol family.
> >
> >
> > I'm pretty sure we've removed this limitation.
>
> As of 2.6.23 nl_groups is a 32-bit bitmask with one bit per group.
> Also, it appears that only root is allowed to use multicast netlink.
The kernel supports many more than 32 groups; see nlk->groups, which is
a bitmap that can be sized arbitrarily. nlk->nl_groups is
for backwards compatibility only.
netlink_change_ngroups() does the bitmap resizing when necessary.
The root multicast listening restriction can be relaxed in some
circumstances, whatever is needed to fill your needs.
Stop making excuses, with minor adjustments we have the facilities to
meet your needs. There is no need for yet-another-protocol to do what
you're trying to do, we already have too much duplicated
functionality.
> Stop making excuses, with minor adjustments we have the facilities to
> meet your needs. There is no need for yet-another-protocol to do what
I suspect they would be better off just using IP multicast. But the localhost
latency penalty vs Unix that Chris was talking about probably needs to be
investigated.
-Andi
Andi, David,
I disagree. If you suspect we would be better off using IP multicast, I think
your suspicion is not supported.
Try the following exercises, please... Can you provide better solutions
without IPN?
renzo
Exercise #1.
I am a user (NOT ROOT), I like kvm, qemu etc. I want an efficient network
between my VMs.
My solution:
I create an IPN socket with protocol IPN_VDESWITCH and all the VMs can
communicate.
Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I beg the sysadmin to give me some pre-allocated taps connected together
by a kernel bridge.
- I create a multicast socket limited to this host (TTL=0) and I use it
like a hub. It cannot switch the packets.
Exercise #2.
I am a sysadmin (maybe a lab administrator). I want my users (not root)
in the group "vmenabled" to run their VMs connected to a network.
I have hundreds of users in vmenabled (say, students).
My solution:
I create an IPN socket with protocol IPN_VDESWITCH, connected to a
virtual interface, say ipn0. I give the socket permission 760, owner
root:vmenabled.
Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I create a multicast socket connected to a tap and then define iptables
filters to prevent unauthorized users from joining the net.
- I create hundreds of preallocated tap interfaces, at least one per user.
Exercise #3.
I am a user (NOT ROOT) and I have a heavy stream of *very private data*
generated by some processes that must be received by several processes.
I am looking for an efficient solution.
Data can be ASCII strings, or a binary stream.
It is not a "networking" issue, it is just IPC.
My solution:
I create an IPN socket with permission 700, IPN_BROADCAST protocol. All
the processes connect to the socket either for writing or for reading
(or both).
Your solution:
- I am condemned by two kernel developers to use inefficient userland
solutions like named pipes, tee, or a user daemon among AF_UNIX sockets.
- If I use multicast, others can read the stream.
(security by obscurity? the attacker does not know the address?)
- I use a multicast socket with SSL (it sounds funny to use encryption
to talk with myself, exposing the stream to crypto attacks).
From: [email protected] (Renzo Davoli)
Date: Fri, 7 Dec 2007 22:18:05 +0100
> I disagree. If you suspect we would be better off using IP multicast, I think
> your suspicion is not supported.
> Try the following exercises, please... Can you provide better solutions
> without IPN?
I personally have not purely advocated IP, although the performance
differences between UDP and AF_UNIX should be investigated.
Instead I advocated using AF_NETLINK with some minor multicast
permission modifications to suit your needs.
David Miller wrote:
> The kernel supports many more than 32 groups; see nlk->groups, which is
> a bitmap that can be sized arbitrarily. nlk->nl_groups is
> for backwards compatibility only.
>
> netlink_change_ngroups() does the bitmap resizing when necessary.
Thanks for the explanation. Given that it's a bitmap, doesn't that
result in a cost of O(number of groups) when processing messages? In
our case we potentially need thousands of groups.
> The root multicast listening restriction can be relaxed in some
> circumstances, whatever is needed to fill your needs.
Also, good to know.
> Stop making excuses, with minor adjustments we have the facilities to
> meet your needs. There is no need for yet-another-protocol to do what
> you're trying to do, we already have too much duplicated
> functionality.
You may have confused me with the OP... I just chimed in because of some
of the limitations we found when we wanted to do similar things. In our
case we created a new unix-like protocol to allow multicast, and we have
been using it for a few years.
However, if we could use netlink instead in our next release that would
be a good thing. A couple questions:
1) Is it possible to register to receive all netlink messages for a
particular netlink family? This is useful for debugging--it allows a
tcpdump equivalent.
2) Is there any up-to-date netlink programming guide? I found this one:
http://people.redhat.com/nhorman/papers/netlink.pdf
but it's three years old now.
Thanks,
Chris