2007-12-17 09:24:20

by Renzo Davoli

Subject: [PATCH 0/1] IPN: Inter Process Networking

Inter Process Networking (PATCH):

This patch adds a new address family for inter process communication.
AF_IPN: inter process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).

Contents of this document:

1. What is IPN?
2. Why IPN?
2.1 Why IPN instead of IP Multicast?
2.2 Why IPN instead of AF_NETLINK?
3. How?

We've read all the comments in the previous thread about IPN and we've
tried to answer them.

1. WHAT IS IPN?
---------------

IPN is a new address family designed for one-to-many, many-to-many and
peer-to-peer communication among processes.
Berkeley sockets have been designed for client-server or point-to-point
communication; AF_UNIX does not support multicast/broadcast. AF_IPN
does, in a simple, efficient but extensible way.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.

On IPN, processes can interoperate using real networking protocols
(e.g. ethernet) but also using application defined protocols (maybe
just sending ascii strings, video or audio frames, etc).
IPN provides networking (in the broadest sense you can imagine) to
the processes. Processes can be ethernet nodes, run their own TCP-IP stacks
if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.

IPN networks can be interconnected with real networks or IPN networks
running on different computers can interoperate (can be connected by
virtual cables).

IPN is part of the Virtual Square Project (vde, lwipv6, view-os,
umview/kmview, see wiki.virtualsquare.org).

2. WHY IPN?
-----------
Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or some processes feed their
data (*any kind* of data, not only networking-related messages) to several
consuming processes (maybe joining the stream at run time). IPN sockets
can also be connected to tap-like (tuntap) interfaces or to real interfaces
(like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.

Several existing services could be implemented (and often could have extended
features) on the top of IPN:
- kernel Ethernet bridging
- TUN/TAP
- MACVLAN

IPN could be used (IMHO) to provide multicast services to processes.
Audio frames or video frames could be multiplexed such that multiple
applications can use them. I think that something like Jack can be
implemented on top of IPN. Something like a VideoJack could
provide video frames to several applications: e.g. the same image from a
camera can be viewed by xawtv, recorded and sent to a streaming service.
IPN sockets can be used wherever there is the idea of a broadcast channel,
i.e. where processes can "join (and leave) the information flow" at
runtime. IPN can be seen as "publish and subscribe".
Different delivery policies can be defined as IPN protocols (loaded
as submodules of ipn.ko).
For instance, an ethernet switch is a policy (kvde_switch.ko: packets
are unicast delivered if the MAC address is already in the switching
hash table). We are designing an extended switch, full of interesting
features like our userland vde_switch (with vlan/fst/management etc.),
and a layer3 switch, but other policies can be defined to implement the
specific requirements of other services. I feel that there are no limits
to creativity about multicast services for processes. Userspace
services (like vde) do exist, but IPN provides faster and unified
support.
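
As a rough userspace illustration of what a "switch" delivery policy
decides (this is NOT the kvde_switch.ko code; every name below is
hypothetical, and the direct-mapped table that simply overwrites on a hash
collision is an assumption made to keep the sketch short): learn the
source MAC of each frame, then deliver to a single port if the destination
MAC is already known, otherwise flood.

```c
#include <stdint.h>
#include <string.h>

#define TBL_SIZE   256   /* hypothetical table size */
#define PORT_FLOOD (-1)  /* "deliver to every port but the sender" */

struct mac_entry {
    uint8_t mac[6];
    int port;
    int used;
};

static struct mac_entry mac_table[TBL_SIZE];

/* Simple multiplicative hash over the six MAC bytes. */
static unsigned mac_hash(const uint8_t *mac)
{
    unsigned h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % TBL_SIZE;
}

/* Remember that this MAC address was last seen on this port. */
static void mac_learn(const uint8_t *src, int port)
{
    struct mac_entry *e = &mac_table[mac_hash(src)];
    memcpy(e->mac, src, 6);
    e->port = port;
    e->used = 1;
}

/* Return the port for a known destination, PORT_FLOOD otherwise. */
static int mac_lookup(const uint8_t *dst)
{
    const struct mac_entry *e = &mac_table[mac_hash(dst)];
    if (e->used && memcmp(e->mac, dst, 6) == 0)
        return e->port;
    return PORT_FLOOD;
}

/* frame: Ethernet header, destination MAC in bytes 0..5, source in 6..11. */
static int switch_decide(const uint8_t *frame, int in_port)
{
    mac_learn(frame + 6, in_port);
    return mac_lookup(frame);
}
```

A hub (IPN_BROADCAST) corresponds to always returning PORT_FLOOD.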

2.1 Why IPN instead of IP Multicast?
------------------------------------
- IPN seems to be faster than IP Multicast. (see my message to LKML
of Dec 06).
- IPN provides file system permission to access the communication medium,
and it uses the file system for naming.
- IPN does not need any tunneling or packet encapsulation, it works as a
layer 1 virtual network.
- IPN protocols (implemented by kernel submodules) provide forwarding
policies: the set of recipients for each message is computed from the
contents of the message itself.
Ethernet virtual switches or other routing rules for any kind of data
can be implemented as IPN protocols.

2.2 Why IPN instead of AF_NETLINK?
----------------------------------
- Netlink has been designed for user to kernel communication.
- Netlink lacks many of the features needed to provide services similar
to IPN.
- Currently multicast seems to be allowed for root only. Access control
would have to be added from scratch.
- The Netlink interface for user processes is not very straightforward
(libnl has been developed as a higher level solution to that).
- Netlink already seems to suffer from "overpopulation":
NETLINK_GENERIC has been added for "simplified netlink usage" but it
adds yet another header and rules to be followed.
- Netlink is quite rigid about message delivery guarantees: unicast
implies lossless communication, multicast implies best-effort
delivery.
- Netlink does not support different forwarding policies.
We are trying to compare netlink performance with IPN.

3. HOW?
-------
The complete specifications for IPN can be found here:

http://wiki.virtualsquare.org/index.php/IPN

- bind() creates the socket (if it does not already exist). When bind()
succeeds, the process has the right to manage the "network". No data
can be sent or received while the socket is not connected (only
get/setsockopt and ioctl work on bound unconnected sockets).

- connect() is used to join the network. When the socket is connected
it is possible to send/receive data. If the socket is already bound
there is no need to specify the address again (you can use NULL, or
specify the same address). connect() can also be used without
bind(). In this case the process sends and receives data but it
cannot manage the network (and the socket address specification is
required).

- listen() and accept() are for servers, thus they do not exist for IPN.

Examples:
1. Peer-to-Peer Communication:
Several processes run the same code:

struct sockaddr_un sun = { .sun_family = AF_IPN, .sun_path = "/tmp/sockipn" };
int s = socket(AF_IPN, SOCK_RAW, IPN_BROADCAST);
int err = bind(s, (struct sockaddr*) &sun, sizeof(sun));
err = connect(s, NULL, 0);

In this case all the messages sent by each process get received by all
the other processes (IPN_BROADCAST). The processes need to be able to
receive data when there are pending packets, e.g. by using poll/select
and event driven programming or multithreading.
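
A minimal sketch of the event-driven receive side mentioned above: the
helper waits on a connected descriptor with poll(2) and then reads one
pending message. The name ipn_wait_and_read and its return convention are
hypothetical. Nothing in it is IPN-specific, so it can be tried on any
descriptor that poll() supports.

```c
#include <poll.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: wait up to timeout_ms for a pending packet on s,
 * then read it into buf. Returns the number of bytes read, 0 if nothing
 * arrived within the timeout, and -1 on error. */
static ssize_t ipn_wait_and_read(int s, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = s, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;              /* poll failed, errno is set */
    if (n == 0)
        return 0;               /* timeout: no packet pending */
    if (!(pfd.revents & POLLIN))
        return -1;              /* error or hangup on the socket */
    return read(s, buf, len);   /* one message per read on a packet socket */
}
```

A peer would typically call this in a loop, interleaved with its own
sends, so that incoming packets never pile up unread.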

2. (One or) Some senders/many receivers
The sender runs the following code:

struct sockaddr_un sun = { .sun_family = AF_IPN, .sun_path = "/tmp/sockipn" };
int s = socket(AF_IPN, SOCK_RAW, IPN_BROADCAST);
int err = shutdown(s, SHUT_RD);
err = bind(s, (struct sockaddr*) &sun, sizeof(sun));
err = connect(s, NULL, 0);

The receivers do not need to define the network, thus they skip the bind():

struct sockaddr_un sun = { .sun_family = AF_IPN, .sun_path = "/tmp/sockipn" };
int s = socket(AF_IPN, SOCK_RAW, IPN_BROADCAST);
int err = shutdown(s, SHUT_WR);
err = connect(s, (struct sockaddr*) &sun, sizeof(sun));

In the previous examples processes can send and receive every kind of
data.

When messages are ethernet packets (maybe from virtual machines), IPN
works like a hub when using the IPN_BROADCAST protocol. Different
protocols (delivery policies) can be selected by replacing IPN_BROADCAST
with a different tag. An IPN protocol-specific submodule must have
registered the protocol tag in advance (e.g. when kvde_switch.ko is
loaded IPN_VDESWITCH can be used too). The basic broadcasting protocol
IPN_BROADCAST is built-in (all the messages get delivered to all the
connected processes but the sender).

IPN sockets use the filesystem for naming and access control. For example,

srwxr-xr-x 1 renzo renzo 0 2007-12-04 22:28 /tmp/sockipn

An IPN socket appears in the file system like a UNIX socket. The 'r/w'
permissions represent the right to receive data from/send data to the
socket. The 'x' permission represents the right to manage the socket.
connect() automatically shuts down SHUT_WR or SHUT_RD if the user does
not have the corresponding right.
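
A small sketch of how a program could preview which rights it would get
on an IPN socket path, using access(2). The IPN_RIGHT_* names and the
helper are hypothetical. Note that access() checks against the real
UID/GID, so this is only an approximation of the check the kernel
performs at bind()/connect() time.

```c
#include <unistd.h>

/* Hypothetical bit names for the three rights described above. */
#define IPN_RIGHT_RECV   0x1  /* 'r': receive data from the network */
#define IPN_RIGHT_SEND   0x2  /* 'w': send data to the network */
#define IPN_RIGHT_MANAGE 0x4  /* 'x': manage the network (bind) */

/* Return the set of rights the caller appears to have on path,
 * or 0 if the path is missing or not accessible at all. */
static int ipn_rights(const char *path)
{
    int rights = 0;

    if (access(path, R_OK) == 0)
        rights |= IPN_RIGHT_RECV;
    if (access(path, W_OK) == 0)
        rights |= IPN_RIGHT_SEND;
    if (access(path, X_OK) == 0)
        rights |= IPN_RIGHT_MANAGE;
    return rights;
}
```

With the listing shown above, a user other than renzo would get
IPN_RIGHT_RECV | IPN_RIGHT_MANAGE but not IPN_RIGHT_SEND.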

The patch also supports run time reconfiguration of the max # of
nodes and out of band messages for protocol log messages.

IPN provides OOB messages to notify changes in the number of senders and
receivers. For example it is possible to write sender programs that stop
feeding data when there are no receivers and restart as soon as a new
receiver joins the network.
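
The stop/restart logic can be sketched as follows. struct numnode_oob
and IPN_OOB_NUMNODE_TAG are copied from include/net/af_ipn.h in the patch.
The helper name is hypothetical, how the OOB message is actually read
from the socket is not shown here, and the sketch assumes the sender has
shut down its read side so it is not counted among the readers. It also
ignores the level field.

```c
#include <stdbool.h>

/* Copied from include/net/af_ipn.h in this patch. */
#define IPN_OOB_NUMNODE_TAG 0

struct numnode_oob {
    int level;
    int tag;
    int numreaders;
    int numwriters;
};

/* Hypothetical helper: given the current "feeding" state and a freshly
 * received OOB message, return the new state. */
static bool ipn_sender_update(bool sending, const struct numnode_oob *oob)
{
    if (oob->tag != IPN_OOB_NUMNODE_TAG)
        return sending;          /* unrelated OOB message: keep state */
    return oob->numreaders > 0;  /* feed data only while someone listens */
}
```

A sender would call this for every OOB message it receives and skip
producing data while the result is false.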

renzo,
Ludovico Gardenghi and the V^2 project group


2007-12-17 09:28:09

by Renzo Davoli

Subject: [PATCH 1/1] IPN: Inter Process Networking

Inter Process Networking Patch.

It applies to 2.6.24-rc5 and includes the documentation, the new kernel
option (experimental), the kernel include file include/net/af_ipn.h and the
protocol directory net/ipn.

renzo

Signed-off-by: Renzo Davoli <[email protected]>

diff -Naur linux-2.6.24-rc5/Documentation/networking/ipn.txt linux-2.6.24-rc5-ipn/Documentation/networking/ipn.txt
--- linux-2.6.24-rc5/Documentation/networking/ipn.txt 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/Documentation/networking/ipn.txt 2007-12-16 16:30:01.000000000 +0100
@@ -0,0 +1,326 @@
+Inter Process Networking (IPN)
+
+IPN is an Inter Process Communication service. It uses the same programming
+interface and protocols used for networking. Processes using IPN are connected
+to a "network" (many to many communication). The messages or packets sent by a
+process on an IPN network can be delivered to many other processes connected to
+the same IPN network, potentially to all the other processes. Different
+protocols can be defined on the IPN service. The basic one is the broadcast
+(level 1) protocol: all the packets get received by all the processes but the
+sender. It is also possible to define more sophisticated protocols. For example
+it is possible to have IPN sockets dispatching packets using the Ethernet
+protocol (like a Virtual Distributed Ethernet - VDE switch), or Internet
+Protocol (like a layer 3 switch). These are just examples, several other
+policies can be defined.
+
+Description:
+------------
+
+The Berkeley socket Application Programming Interface (API) was designed for
+client server applications and for point-to-point communications. There is no
+support for broadcasting/multicasting domains.
+
+IPN updates the interface by introducing a new protocol family (PF_IPN or
+AF_IPN). PF_IPN is similar to PF_UNIX but for IPN the Socket API calls have a
+different (extended) behavior.
+
+ #include <sys/socket.h>
+ #include <sys/un.h>
+ #include <sys/ipn.h>
+
+ sockfd = socket(AF_IPN, int socket_type, int protocol);
+
+creates a communication socket. The only socket_type defined is SOCK_RAW, other
+socket_types can be used for future extensions. A socket cannot be used to send
+or receive data until it gets connected (using the "connect" call). The
+protocol argument defines the policy used by the socket. Protocol IPN_BROADCAST
+(1) is the basic policy: a packet is sent to all the recipients but the sender
+itself. The policy IPN_ANY (0) can be used to connect or bind a pre-existing
+IPN network regardless of the policy used. (2 will be IPN_VDESWITCH and 3
+IPN_VDESWITCH_L3).
+
+The address format is the same as for PF_UNIX (a.k.a. PF_LOCAL), see unix(7).
+
+ int bind(int sockfd, const struct sockaddr *my_addr, socklen_t addrlen);
+
+This call creates an IPN network if it does not exist, or joins an existing
+network (just for management) if it already exists. The policy of the network
+must be consistent with the protocol argument of the "socket" call. A new
+network has the policy defined for the socket. "bind" or "connect" operations
+on existing networks fail if the policy of the socket is neither IPN_ANY nor
+the same as the network's. (A network should not be created by an IPN_ANY socket).
+An IPN network appears in the file system as a unix socket. The execution
+permission (x) on this file is required for "bind" to succeed (otherwise -EPERM
+is returned). Similarly the read/write permissions (rw) permit the "connect"
+operation for reading (receiving) or writing (sending) packets respectively.
+When a socket is bound (but not connected) to an IPN network the process does
+not receive or send any data but it can call "ioctl" or "setsockopt" to
+configure the network.
+
+ int connect(int sockfd, const struct sockaddr *serv_addr, socklen_t addrlen);
+
+This call connects a socket to an existing IPN network. The socket can be
+already bound (through the "bind" call) or unbound. Unbound connected sockets
+receive and send data but they cannot configure the network. The read or write
+permission on the socket (rw) is required to "connect" the channel and
+read/write respectively. When "connect" succeeds and provided the socket has
+appropriate permissions, the process can send packets and receive all the
+packets sent by other processes and delivered to it by the network policy. The
+socket can receive data at any time (like a network interface) so the process
+must be able to handle incoming data (using select/poll or multithreading).
+Obviously higher level protocols can also prevent the reception of unexpected
+messages by design. This is the case for networks used with exactly one
+sender: all the other processes simply receive the data and the sender will
+never receive any packet. It is also possible to have sockets with different
+roles, assigning read permission to some and write permission to others.
+If data overrun occurs there can be data loss or the sender can be blocked
+depending on the policy of the socket (LOSSY or LOSSLESS, see below). When
+both are used, bind must be called before connect. The correct sequences are:
+socket+bind (just for management), socket+bind+connect (management and
+communication) and socket+connect (communication without management).
+
+The calls "accept" and "listen" are not defined for AF_IPN, as there is not any
+server. All the communication takes place among peers.
+
+Data can be sent and received using read, write, send, recv, sendto, recvfrom, sendmsg, recvmsg.
+
+Socket options and flags.
+-------------------------
+
+These options can be read and set using getsockopt and setsockopt.
+
+There are two different kinds of options: network options and node options. The
+former define the structure of the network and must be set prior to bind. It
+is not currently possible to change this flag of an existing network. When a
+socket is bound and/or connected to an existing network getsockopt gives the
+current value of the options. Node options define parameters of the node. These
+must be set prior to connect.
+
+***Network Options (These options can be set prior to bind/connect)
+
+IPN_SO_FLAGS: This tag permits to set/get the network flags.
+
+IPN_FLAG_LOSSLESS: this flag defines the behavior in case of network
+overloading or data overrun, i.e. when some processes are too slow in consuming
+the packets for the network buffers. When the network is LOSSY (the flag is
+cleared) packets get dropped in case of buffer overflow. A LOSSLESS (flag set)
+IPN network blocks the sender if the buffer is full. LOSSY is the default
+behavior.
+
+IPN_SO_NUMNODES: max number of connected sockets (default value 32)
+
+IPN_SO_MTU: maximum transfer unit: maximum size of packets (default value 1514,
+Ethernet frame, including VLAN).
+
+IPN_SO_MSGPOOLSIZE: size of the buffer (#of pending packets, default value 8).
+This option has two different meanings depending on the LOSSY/LOSSLESS behavior
+of the network. For LOSSY networks, this is the maximum number of pending
+packets of each node. For LOSSLESS networks this is the global number of the
+pending packets in the network. When the same packet is sent to many
+destinations it is counted just once.
+
+IPN_SO_MODE: this option specifies the permission to use when the socket gets
+created on the file system. It is modified by the process' umask in the usual
+way. The created socket permissions are (mode & ~umask).
+
+***Network Options (Options for bound/connected sockets)
+
+IPN_SO_CHANGE_NUMNODES: (runtime) change of the number of ipn network ports.
+
+***Node Options
+
+IPN_SO_PORT: (default value IPN_PORTNO_ANY) This option specifies the port
+number where the socket must be connected. With IPN_PORTNO_ANY the port number
+is decided by the service. There can be network services where different ports
+have different definitions (e.g. different VLANs for ports of virtual Ethernet
+switches).
+
+IPN_SO_DESCR: This is the description of the node. It is a string, having
+maxlength IPN_DESCRLEN. It is just used by debugging tools.
+
+IPN_SO_HANDLE_OOB: The node is able to manage Out Of Band protocol messages
+
+IPN_SO_WANT_OOB_NUMNODES: The socket wants OOB messages to notify the change
+of #writers #readers (requires IPN_SO_HANDLE_OOB)
+
+TAP and GRAB nodes for IPN networks
+-----------------------------------
+
+It is possible to connect IPN sockets to virtual and real network interfaces
+using specific ioctl and provided the user has the permission to configure the
+network (e.g. the CAP_NET_ADMIN Posix capability). A virtual interface
+connected to an IPN network is similar to a tap interface (provided by the
+tuntap module). A tap interface appears as an ethernet interface to the hosting
+operating system, all the packets sent and received through the tap interface
+get received and sent by the application which created the tap interface. IPN
+virtual network interface appears in the same way but the packets are received
+and sent through the IPN network and delivered consistently with the policy
+(BROADCAST acts as a basic HUB for the connected processes). It is also
+possible to *grab* a real interface. In this case the closest example is the
+Linux kernel ethernet bridge. When a real interface is connected to an IPN, all
+the packets received from the real network are injected also into the IPN and
+all the packets sent by the IPN through the real network 'port' get sent on the
+real network.
+
+ioctl is used for creation or control of TAP or GRAB interfaces.
+
+ int ioctl(int d, int request, .../* arg */);
+
+A list of the request values currently supported follows.
+
+IPN_CONN_NETDEV: (struct ifreq *arg). This call creates a TAP interface or
+implements a GRAB on an existing interface and connects it to a bound IPN
+socket. The field ifr_flags can be IPN_NODEFLAG_TAP for a TAP interface,
+IPN_NODEFLAG_GRAB to grab an existing interface. The field ifr_name is the
+desired name for the new TAP interface or is the name of the interface to grab
+(e.g. eth0). For TAP interfaces, ifr_name can be an empty string. The interface
+in this latter case is named ipn followed by a number (e.g. ipn0, ipn1, ...).
+This ioctl must be used on a bound but unconnected socket. When the call
+succeeds, the socket gets the connected status, but the packets are sent and
+received through the interface. Persistence applies only to interface nodes (TAP
+or GRAB).
+
+IPN_SETPERSIST (int arg). If (arg != 0) it gives the interface the persistent
+status: the network interface survives and stays connected to the IPN network
+when the socket is closed. When (arg == 0) the standard behavior is resumed:
+the interface is deleted or the grabbing is terminated when the socket is
+closed.
+
+IPN_JOIN_NETDEV: (struct ifreq *arg). This call reconnects a socket to an
+existing persistent node. The interface can be defined either by name
+(ifr_name) or by index (ifr_index). If there is already a socket controlling
+the interface this call fails (EADDRNOTAVAIL).
+
+There are also some ioctls that can be used by a sysadmin to give/clear
+persistence on existing IPN interfaces. These calls apply to unbound sockets.
+
+IPN_SETPERSIST_NETDEV: (struct ifreq *arg). This call sets the persistence
+status of an IPN interface. The interface can be defined either by name
+(ifr_name) or by index (ifr_index).
+
+IPN_CLRPERSIST_NETDEV: (struct ifreq *arg). This call clears the persistence
+status of an IPN interface. The interface is specified as in the opposite call
+above. The interface is deleted (TAP) or the grabbing is terminated when the
+socket is closed, or immediately if the interface is not controlled by a
+socket. If the IPN network had the interface as its sole node, the IPN network
+is terminated, too.
+
+When unloading the ipn kernel module, all the persistent flags of interfaces
+are cleared.
+
+Related Work.
+-------------
+
+IPN is able to give a unifying solution to several problems and creates new
+opportunities for applications.
+
+Several existing tools can be implemented using IPN sockets:
+
+ * VDE. Level 2 service implements a VDE switch in the kernel, providing a
+ considerable speedup.
+ * Tap (tuntap) networking for virtual machines
+ * Kernel ethernet bridge
+ * All the applications which need multicasting of data streams, like tee
+
+A continuous stream of data (like audio/video/midi etc) can be sent on an IPN
+network and several applications can receive the broadcast just by joining the
+channel.
+
+It is possible to write programs that forward packets between different IPN
+networks running on the same or on different systems extending the IPN in the
+same way as cables extend ethernet networks connecting switches or hubs
+together. (VDE cables are examples of such a kind of programs).
+
+IPN interface to protocol modules
+---------------------------------
+
+struct ipn_protocol {
+ int refcnt;
+ int (*ipn_p_newport)(struct ipn_node *newport);
+ int (*ipn_p_handlemsg)(struct ipn_node *from,struct msgpool_item *msgitem, int depth);
+ void (*ipn_p_delport)(struct ipn_node *oldport);
+ void (*ipn_p_postnewport)(struct ipn_node *newport);
+ void (*ipn_p_predelport)(struct ipn_node *oldport);
+ int (*ipn_p_newnet)(struct ipn_network *newnet);
+ int (*ipn_p_resizenet)(struct ipn_network *net,int oldsize,int newsize);
+ void (*ipn_p_delnet)(struct ipn_network *oldnet);
+ int (*ipn_p_setsockopt)(struct ipn_node *port,int optname,
+ char __user *optval, int optlen);
+ int (*ipn_p_getsockopt)(struct ipn_node *port,int optname,
+ char __user *optval, int *optlen);
+ int (*ipn_p_ioctl)(struct ipn_node *port,unsigned int request,
+ unsigned long arg);
+};
+
+int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service);
+int ipn_proto_deregister(int protocol);
+
+void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg, int depth);
+
+
+A protocol (sub) module must define its own ipn_protocol structure (maybe a
+global static variable).
+
+ipn_proto_register must be called in the module init to register the protocol
+to the IPN core module. ipn_proto_deregister must be called in the destructor
+of the module. It fails if there are already running networks based on this
+protocol.
+
+Only two fields must be initialized in any case: ipn_p_newport and
+ipn_p_handlemsg.
+
+ipn_p_newport is the new network node notification. The return value is the
+port number of the new node. This call can be used to allocate and set private
+data used by the protocol (the field proto_private of the struct ipn_node has
+been defined for this purpose).
+
+ipn_p_handlemsg is the notification of a message that must be dispatched. This
+function should call ipn_proto_sendmsg for each recipient. It is possible for
+the protocol to change the message (provided the global length of the packet
+does not exceed the MTU of the network). Depth is for loop control. Two IPN
+networks can be interconnected by kernel cables (not implemented yet). Cycles
+of cables would generate infinite loops of packets. After a pre-defined number
+of hops the packet gets dropped (it is like ELOOP for symbolic links). The depth
+value must be copied to all ipn_proto_sendmsg calls. Usually the handlemsg
+function has the following structure:
+
+static int ipn_xxxxx_handlemsg(struct ipn_node *from, struct msgpool_item *msgitem, int depth)
+{
+ /* compute the set of recipients */
+ for (/*each recipient "to"*/)
+ ipn_proto_sendmsg(to,msgitem,depth);
+}
+
+It is also possible to send different packets to different recipients.
+
+struct msgpool_item *newitem=ipn_msgpool_alloc(from->ipn);
+/* create a new contents for the packet by filling in newitem->len and newitem->data */
+ipn_proto_sendmsg(recipient1,newitem,depth);
+ipn_proto_sendmsg(recipient2,newitem,depth);
+....
+ipn_msgpool_put(newitem,from->ipn);
+
+(please remember to call ipn_msgpool_put after the sendmsg of packets allocated
+by the protocol submodule).
+
+ipn_p_delport is used to deallocate port related data structures.
+
+ipn_p_postnewport and ipn_p_predelport are used to notify new nodes or deleted
+nodes. newport and delport get called before activating the port and after
+deactivating it respectively, therefore it is not possible to use the new port
+or deleted port to signal the change on the net itself. ipn_p_postnewport and
+ipn_p_predelport get called just after the activation and just before the
+deactivation thus the protocols can already send packets on the network.
+
+ipn_p_newnet and ipn_p_delnet notify the creation/deletion of an IPN network
+using the given protocol.
+
+ipn_p_resizenet notifies a change in the number of ports.
+
+ipn_p_setsockopt and ipn_p_getsockopt can be used to provide specific socket
+options.
+
+ipn_p_ioctl: protocols can also implement specific ioctl services.
+
+Further documentation and examples can be found in the Virtual Square project
+web site: wiki.virtualsquare.org
diff -Naur linux-2.6.24-rc5/MAINTAINERS linux-2.6.24-rc5-ipn/MAINTAINERS
--- linux-2.6.24-rc5/MAINTAINERS 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/MAINTAINERS 2007-12-16 16:30:01.000000000 +0100
@@ -2094,6 +2094,15 @@
W: http://openipmi.sourceforge.net/
S: Supported

+IPN INTER PROCESS NETWORKING
+P: Renzo Davoli
+M: [email protected]
+P: Ludovico Gardenghi
+M: [email protected]
+L: [email protected]
+W: http://wiki.virtualsquare.org
+S: Maintained
+
IPX NETWORK LAYER
P: Arnaldo Carvalho de Melo
M: [email protected]
diff -Naur linux-2.6.24-rc5/include/linux/net.h linux-2.6.24-rc5-ipn/include/linux/net.h
--- linux-2.6.24-rc5/include/linux/net.h 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/include/linux/net.h 2007-12-16 16:30:03.000000000 +0100
@@ -25,7 +25,7 @@
struct inode;
struct net;

-#define NPROTO 34 /* should be enough for now.. */
+#define NPROTO 35 /* should be enough for now.. */

#define SYS_SOCKET 1 /* sys_socket(2) */
#define SYS_BIND 2 /* sys_bind(2) */
diff -Naur linux-2.6.24-rc5/include/linux/netdevice.h linux-2.6.24-rc5-ipn/include/linux/netdevice.h
--- linux-2.6.24-rc5/include/linux/netdevice.h 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/include/linux/netdevice.h 2007-12-16 16:30:03.000000000 +0100
@@ -705,6 +705,8 @@
struct net_bridge_port *br_port;
/* macvlan */
struct macvlan_port *macvlan_port;
+ /* ipn */
+ struct ipn_node *ipn_port;

/* class/net/name entry */
struct device dev;
diff -Naur linux-2.6.24-rc5/include/linux/socket.h linux-2.6.24-rc5-ipn/include/linux/socket.h
--- linux-2.6.24-rc5/include/linux/socket.h 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/include/linux/socket.h 2007-12-16 16:30:03.000000000 +0100
@@ -189,7 +189,8 @@
#define AF_BLUETOOTH 31 /* Bluetooth sockets */
#define AF_IUCV 32 /* IUCV sockets */
#define AF_RXRPC 33 /* RxRPC sockets */
-#define AF_MAX 34 /* For now.. */
+#define AF_IPN 34 /* IPN sockets */
+#define AF_MAX 35 /* For now.. */

/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
@@ -224,6 +225,7 @@
#define PF_BLUETOOTH AF_BLUETOOTH
#define PF_IUCV AF_IUCV
#define PF_RXRPC AF_RXRPC
+#define PF_IPN AF_IPN
#define PF_MAX AF_MAX

/* Maximum queue length specifiable by listen. */
diff -Naur linux-2.6.24-rc5/include/net/af_ipn.h linux-2.6.24-rc5-ipn/include/net/af_ipn.h
--- linux-2.6.24-rc5/include/net/af_ipn.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/include/net/af_ipn.h 2007-12-16 16:30:03.000000000 +0100
@@ -0,0 +1,233 @@
+#ifndef __LINUX_NET_AFIPN_H
+#define __LINUX_NET_AFIPN_H
+
+#define IPN_ANY 0
+#define IPN_BROADCAST 1
+#define IPN_HUB 1
+#define IPN_VDESWITCH 2
+#define IPN_VDESWITCH_L3 3
+
+#define IPN_SO_PREBIND 0x80
+#define IPN_SO_PORT 0
+#define IPN_SO_DESCR 1
+#define IPN_SO_CHANGE_NUMNODES 2
+#define IPN_SO_HANDLE_OOB 3
+#define IPN_SO_WANT_OOB_NUMNODES 4
+#define IPN_SO_MTU (IPN_SO_PREBIND | 0)
+#define IPN_SO_NUMNODES (IPN_SO_PREBIND | 1)
+#define IPN_SO_MSGPOOLSIZE (IPN_SO_PREBIND | 2)
+#define IPN_SO_FLAGS (IPN_SO_PREBIND | 3)
+#define IPN_SO_MODE (IPN_SO_PREBIND | 4)
+
+#define IPN_PORTNO_ANY -1
+
+#define IPN_DESCRLEN 128
+
+#define IPN_FLAG_LOSSLESS 1
+#define IPN_FLAG_TERMINATED 0x1000
+
+/* Ioctl defines */
+#define IPN_SETPERSIST_NETDEV _IOW('I', 200, int)
+#define IPN_CLRPERSIST_NETDEV _IOW('I', 201, int)
+#define IPN_CONN_NETDEV _IOW('I', 202, int)
+#define IPN_JOIN_NETDEV _IOW('I', 203, int)
+#define IPN_SETPERSIST _IOW('I', 204, int)
+
+#define IPN_OOB_NUMNODE_TAG 0
+
+/* OOB message for change of numnodes
+ * Common fields for oob IPN signaling:
+ * @level=level of the service who generated the oob
+ * @tag=tag of the message
+ * Specific fields:
+ * @numreaders=number of readers
+ * @numwriters=number of writers
+ * */
+struct numnode_oob {
+ int level;
+ int tag;
+ int numreaders;
+ int numwriters;
+};
+
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#include <linux/mutex.h>
+#include <linux/un.h>
+#include <net/sock.h>
+#include <linux/netdevice.h>
+
+#define IPN_HASH_SIZE 256
+
+/* The AF_IPN socket */
+struct msgpool_item;
+struct ipn_network;
+struct pre_bind_parms;
+
+/*
+ * ipn_node
+ *
+ * @nodelist=pointers for connectqueue or unconnectqueue (see network)
+ * @protocol=kind of service 0->standard broadcast
+ * @flags= see IPN_NODEFLAG_xxx
+ * @shutdown= SEND_SHUTDOWN/RCV_SHUTDOWN and OOBRCV_SHUTDOWN
+ * @descr=description of this port
+ * @portno=when connected: port of the network (<0 means unconnected)
+ * @msglock=mutex on the msg queue
+ * @totmsgcount=total # of pending msgs
+ * @oobmsgcount=# of pending oob msgs
+ * @msgqueue=queue of messages
+ * @oobmsgqueue=queue of messages
+ * @read_wait=waitqueue for reading
+ * @net=current network
+ * @dev=device (TAP or GRAB)
+ * @ipn=network we are connected to
+ * @pbp=temporary storage for parms that must be set prior to bind
+ * @proto_private=handle for protocol private data
+ */
+struct ipn_node {
+ struct list_head nodelist;
+ int protocol;
+ volatile unsigned char flags;
+ unsigned char shutdown;
+ char descr[IPN_DESCRLEN];
+ int portno;
+ spinlock_t msglock;
+ unsigned short totmsgcount;
+ unsigned short oobmsgcount;
+ struct list_head msgqueue;
+ struct list_head oobmsgqueue;
+ wait_queue_head_t read_wait;
+ struct net *net;
+ struct net_device *dev;
+ struct ipn_network *ipn;
+ struct pre_bind_parms *pbp;
+ void *proto_private;
+};
+#define IPN_NODEFLAG_BOUND 0x1 /* bind succeeded */
+#define IPN_NODEFLAG_INUSE 0x2 /* is currently "used" (0 for persistent, unbound interfaces) */
+#define IPN_NODEFLAG_PERSIST 0x4 /* if persist does not disappear on close (net interfaces) */
+#define IPN_NODEFLAG_TAP 0x10 /* This is a tap interface */
+#define IPN_NODEFLAG_GRAB 0x20 /* This is a grab of a real interface */
+#define IPN_NODEFLAG_DEVMASK 0x30 /* True if this is a device */
+#define IPN_NODEFLAG_OOB_NUMNODES 0x40 /* Node wants OOB for NNODES */
+
+/*
+ * ipn_sock
+ *
+ * unfortunately we must use a struct sock (most of the fields are useless) as
+ * this is the standard "agnostic" structure for socket implementation.
+ * This proves that it is not "agnostic" enough!
+ */
+
+struct ipn_sock {
+ struct sock sk;
+ struct ipn_node *node;
+};
+
+/*
+ * ipn_network network descriptor
+ *
+ * @hnode=hash to find this entry (looking for i-node)
+ * @unconnectqueue=queue of unconnected (bound) nodes
+ * @connectqueue=queue of connected nodes (faster for broadcasting)
+ * @refcnt=reference count (bound or connected sockets)
+ * @dentry/@mnt=to keep the file system descriptor in memory
+ * @ipnn_mutex=lock for protocol functions
+ * @protocol=kind of service
+ * @flags=flags (IPN_FLAG_LOSSLESS)
+ * @maxports=number of ports available in this network
+ * @msgpool_nelem=number of pending messages
+ * @msgpool_size=max number of pending messages *per net* when IPN_FLAG_LOSSLESS
+ * @msgpool_size=max number of pending messages *per port* when LOSSY
+ * @mtu=MTU
+ * @send_wait=wait queue waiting for a message in the msgpool (IPN_FLAG_LOSSLESS)
+ * @msgpool_cache=slab cache for msgpool items (sized by the network MTU)
+ * @proto_private=handle for protocol private data
+ * @connport=array of connected sockets
+ */
+struct ipn_network {
+ struct hlist_node hnode;
+ struct list_head unconnectqueue;
+ struct list_head connectqueue;
+ atomic_t refcnt;
+ struct dentry *dentry;
+ struct vfsmount *mnt;
+ struct semaphore ipnn_mutex;
+ int sunaddr_len;
+ struct sockaddr_un sunaddr;
+ unsigned int protocol;
+ unsigned int flags;
+ int numreaders;
+ int numwriters;
+ atomic_t msgpool_nelem;
+ unsigned short maxports;
+ unsigned short msgpool_size;
+ unsigned short mtu;
+ wait_queue_head_t send_wait;
+ struct kmem_cache *msgpool_cache;
+ void *proto_private;
+ struct ipn_node **connport;
+};
+
+/* struct msgpool_item
+ * the local copy of the message for dispatching
+ * @count refcount
+ * @len packet len
+ * @data payload
+ */
+struct msgpool_item {
+ atomic_t count;
+ int len;
+ unsigned char data[0];
+};
+
+struct msgpool_item *ipn_msgpool_alloc(struct ipn_network *ipnn);
+void ipn_msgpool_put(struct msgpool_item *old, struct ipn_network *ipnn);
+
+/*
+ * protocol service:
+ *
+ * @refcnt: number of networks using this protocol
+ * @newport=upcall for reporting a new port. returns the portno, -1=error
+ * @handlemsg=dispatch a message.
+ * should call ipn_proto_sendmsg for each destination
+ * can allocate other msgitems using ipn_msgpool_alloc to send
+ * different messages to different destinations;
+ * @delport=(may be null) reports the termination of a port
+ * @postnewport,@predelport: similar to newport/delport but during these calls
+ * the node is (still) connected. Useful when protocols need
+ * welcome and goodbye messages.
+ * @ipn_p_setsockopt
+ * @ipn_p_getsockopt
+ * @ipn_p_ioctl=(may be null) upcall to manage specific options or ctls.
+ */
+struct ipn_protocol {
+ int refcnt;
+ int (*ipn_p_newport)(struct ipn_node *newport);
+ int (*ipn_p_handlemsg)(struct ipn_node *from,struct msgpool_item *msgitem);
+ void (*ipn_p_delport)(struct ipn_node *oldport);
+ void (*ipn_p_postnewport)(struct ipn_node *newport);
+ void (*ipn_p_predelport)(struct ipn_node *oldport);
+ int (*ipn_p_newnet)(struct ipn_network *newnet);
+ int (*ipn_p_resizenet)(struct ipn_network *net,int oldsize,int newsize);
+ void (*ipn_p_delnet)(struct ipn_network *oldnet);
+ int (*ipn_p_setsockopt)(struct ipn_node *port,int optname,
+ char __user *optval, int optlen);
+ int (*ipn_p_getsockopt)(struct ipn_node *port,int optname,
+ char __user *optval, int *optlen);
+ int (*ipn_p_ioctl)(struct ipn_node *port,unsigned int request,
+ unsigned long arg);
+};
+
+int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service);
+int ipn_proto_deregister(int protocol);
+
+int ipn_proto_injectmsg(struct ipn_node *from, struct msgpool_item *msg);
+void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg);
+void ipn_proto_oobsendmsg(struct ipn_node *to, struct msgpool_item *msg);
+
+extern struct sk_buff *(*ipn_handle_frame_hook)(struct ipn_node *p,
+ struct sk_buff *skb);
+#endif
+#endif
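An illustrative aside, not part of the patch: the header above gives
struct msgpool_item a reference count and pairs ipn_msgpool_alloc with
ipn_msgpool_put. The userspace sketch below shows one way that lifecycle can
work: the sender holds the first reference, each queued copy takes one more,
and the item is freed when the last reference is dropped. All demo_* names are
hypothetical stand-ins, malloc/free replace the slab cache, and the live-item
counter is a plain int because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAXRECV 16

/* Userspace stand-in for struct msgpool_item (hypothetical name). */
struct demo_item {
	atomic_int count;      /* 1 for the sender + 1 per queued copy */
	int len;               /* payload length */
	unsigned char data[];  /* payload */
};

static int demo_live_items;    /* plays the role of msgpool_nelem */

/* Allocate an item with count == 1; returns NULL on allocation failure. */
struct demo_item *demo_alloc(const void *payload, int len)
{
	struct demo_item *it = malloc(sizeof(*it) + (size_t)len);

	if (!it)
		return NULL;
	atomic_init(&it->count, 1);
	it->len = len;
	memcpy(it->data, payload, (size_t)len);
	demo_live_items++;
	return it;
}

/* Take one more reference (one per receiver queue entry). */
void demo_hold(struct demo_item *it)
{
	atomic_fetch_add(&it->count, 1);
}

/* Drop one reference; returns 1 if this call freed the item. */
int demo_put(struct demo_item *it)
{
	if (atomic_fetch_sub(&it->count, 1) == 1) {
		free(it);
		demo_live_items--;
		return 1;
	}
	return 0;
}

/* Enqueue one message to nrecv receivers, then let each receiver consume
 * it. Returns the number of live items left, or -1 if allocation failed. */
int demo_broadcast(int nrecv)
{
	struct demo_item *queue[DEMO_MAXRECV];
	struct demo_item *it;
	int i;

	assert(nrecv >= 0 && nrecv <= DEMO_MAXRECV);
	it = demo_alloc("hello", 6);
	if (!it)
		return -1;
	for (i = 0; i < nrecv; i++) {
		demo_hold(it);          /* one reference per queued copy */
		queue[i] = it;
	}
	demo_put(it);                   /* sender drops its initial reference */
	for (i = 0; i < nrecv; i++)
		demo_put(queue[i]);     /* receiver consumed its copy */
	return demo_live_items;
}
```

Starting the count at 1 keeps the item alive while it is still being queued,
even if an early receiver consumes its copy before the loop has finished.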
diff -Naur linux-2.6.24-rc5/net/Kconfig linux-2.6.24-rc5-ipn/net/Kconfig
--- linux-2.6.24-rc5/net/Kconfig 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/Kconfig 2007-12-16 16:30:04.000000000 +0100
@@ -37,6 +37,7 @@

source "net/packet/Kconfig"
source "net/unix/Kconfig"
+source "net/ipn/Kconfig"
source "net/xfrm/Kconfig"
source "net/iucv/Kconfig"

diff -Naur linux-2.6.24-rc5/net/Makefile linux-2.6.24-rc5-ipn/net/Makefile
--- linux-2.6.24-rc5/net/Makefile 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/Makefile 2007-12-16 16:30:04.000000000 +0100
@@ -19,6 +19,7 @@
obj-$(CONFIG_INET) += ipv4/
obj-$(CONFIG_XFRM) += xfrm/
obj-$(CONFIG_UNIX) += unix/
+obj-$(CONFIG_IPN) += ipn/
ifneq ($(CONFIG_IPV6),)
obj-y += ipv6/
endif
diff -Naur linux-2.6.24-rc5/net/core/dev.c linux-2.6.24-rc5-ipn/net/core/dev.c
--- linux-2.6.24-rc5/net/core/dev.c 2007-12-11 04:48:43.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/core/dev.c 2007-12-16 16:30:04.000000000 +0100
@@ -1925,7 +1925,7 @@
int *ret,
struct net_device *orig_dev)
{
- if (skb->dev->macvlan_port == NULL)
+ if (!skb || skb->dev->macvlan_port == NULL)
return skb;

if (*pt_prev) {
@@ -1938,6 +1938,32 @@
#define handle_macvlan(skb, pt_prev, ret, orig_dev) (skb)
#endif

+#if defined(CONFIG_IPN) || defined(CONFIG_IPN_MODULE)
+struct sk_buff *(*ipn_handle_frame_hook)(struct ipn_node *port,
+ struct sk_buff *skb) __read_mostly;
+EXPORT_SYMBOL_GPL(ipn_handle_frame_hook);
+
+static inline struct sk_buff *handle_ipn(struct sk_buff *skb,
+ struct packet_type **pt_prev,
+ int *ret,
+ struct net_device *orig_dev)
+{
+ struct ipn_node *port;
+
+ if (!skb || skb->pkt_type == PACKET_LOOPBACK ||
+ (port = rcu_dereference(skb->dev->ipn_port)) == NULL)
+ return skb;
+
+ if (*pt_prev) {
+ *ret = deliver_skb(skb, *pt_prev, orig_dev);
+ *pt_prev = NULL;
+ }
+ return ipn_handle_frame_hook(port, skb);
+}
+#else
+#define handle_ipn(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
#ifdef CONFIG_NET_CLS_ACT
/* TODO: Maybe we should just force sch_ingress to be compiled in
* when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
@@ -2070,9 +2096,8 @@
#endif

skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
- if (!skb)
- goto out;
skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
+ skb = handle_ipn(skb, &pt_prev, &ret, orig_dev);
if (!skb)
goto out;

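An illustrative aside, not part of the patch: the dev.c hunks above make
handle_macvlan and handle_ipn accept a NULL skb, so the receive path can run
all three handlers back to back and test for NULL once at the end. A minimal
userspace sketch of that pass-through chain follows; the demo_* names and the
packet kinds are hypothetical, and "consumed" is modelled simply by returning
NULL.

```c
#include <assert.h>
#include <stddef.h>

enum demo_kind { DEMO_BRIDGE, DEMO_MACVLAN, DEMO_IPN, DEMO_OTHER };

struct demo_pkt {
	enum demo_kind kind;
};

static int demo_consumed[3];    /* per-stage count of consumed packets */

/* One handler stage: pass NULL and foreign packets through unchanged,
 * consume (return NULL for) packets that belong to this stage. */
static struct demo_pkt *demo_stage(struct demo_pkt *p, enum demo_kind mine)
{
	if (!p || p->kind != mine)
		return p;
	demo_consumed[mine]++;
	return NULL;
}

/* Returns 0 if some stage consumed the packet, 1 if it falls through
 * to normal delivery. */
int demo_receive(struct demo_pkt *p)
{
	assert(p != NULL);
	p = demo_stage(p, DEMO_BRIDGE);
	p = demo_stage(p, DEMO_MACVLAN);  /* tolerates NULL from the stage before */
	p = demo_stage(p, DEMO_IPN);      /* tolerates NULL as well */
	if (!p)
		return 0;                 /* single check replaces one per stage */
	return 1;
}
```

The cost of the single check is that every later stage is still called after a
packet has been consumed, which is why each stage must test for NULL first.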
diff -Naur linux-2.6.24-rc5/net/ipn/Kconfig linux-2.6.24-rc5-ipn/net/ipn/Kconfig
--- linux-2.6.24-rc5/net/ipn/Kconfig 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/ipn/Kconfig 2007-12-16 16:30:04.000000000 +0100
@@ -0,0 +1,21 @@
+#
+# IPN Domain Sockets
+#
+
+config IPN
+ tristate "IPN domain sockets (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ ---help---
+ If you say Y here, you will include support for IPN domain sockets.
+ Inter Process Networking sockets are similar to Unix sockets but
+ they support peer-to-peer, one-to-many and many-to-many communication
+ among processes.
+ Sub-Modules can be loaded to provide dispatching protocols.
+ This service includes the IPN_BROADCAST policy: all the messages get
+ sent to all the recipients (except the sender itself).
+
+ To compile this driver as a module, choose M here: the module will be
+ called ipn.
+
+ If unsure, say 'N'.
+
diff -Naur linux-2.6.24-rc5/net/ipn/Makefile linux-2.6.24-rc5-ipn/net/ipn/Makefile
--- linux-2.6.24-rc5/net/ipn/Makefile 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/ipn/Makefile 2007-12-16 16:30:04.000000000 +0100
@@ -0,0 +1,8 @@
+#
+## Makefile for the IPN (Inter Process Networking) domain socket layer.
+#
+
+obj-$(CONFIG_IPN) += ipn.o
+
+ipn-y := af_ipn.o ipn_netdev.o
+
diff -Naur linux-2.6.24-rc5/net/ipn/af_ipn.c linux-2.6.24-rc5-ipn/net/ipn/af_ipn.c
--- linux-2.6.24-rc5/net/ipn/af_ipn.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/ipn/af_ipn.c 2007-12-16 18:53:13.000000000 +0100
@@ -0,0 +1,1540 @@
+/*
+ * Main inter process networking (virtual distributed ethernet) module
+ * (part of the View-OS project: wiki.virtualsquare.org)
+ *
+ * Copyright (C) 2007 Renzo Davoli ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Due to this file being licensed under the GPL there is controversy over
+ * whether this permits you to write a module that #includes this file
+ * without placing your module under the GPL. Please consult a lawyer for
+ * advice before doing this.
+ *
+ * WARNING: THIS CODE IS STILL EXPERIMENTAL
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/socket.h>
+#include <linux/poll.h>
+#include <linux/un.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <net/sock.h>
+#include <net/af_ipn.h>
+#include "ipn_netdev.h"
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("VIEW-OS TEAM");
+MODULE_DESCRIPTION("IPN Kernel Module");
+
+#define IPN_MAX_PROTO 4
+
+/*extension of RCV_SHUTDOWN defined in include/net/sock.h
+ * when the bit is set recv fails */
+/* NO_OOB: do not send OOB */
+#define RCV_SHUTDOWN_NO_OOB 4
+/* EXTENDED MASK including OOB */
+#define SHUTDOWN_XMASK (SHUTDOWN_MASK | RCV_SHUTDOWN_NO_OOB)
+/* if XRCV_SHUTDOWN is all set recv fails */
+#define XRCV_SHUTDOWN (RCV_SHUTDOWN | RCV_SHUTDOWN_NO_OOB)
+
+/* Network table and hash */
+struct hlist_head ipn_network_table[IPN_HASH_SIZE + 1];
+DEFINE_SPINLOCK(ipn_table_lock);
+static struct kmem_cache *ipn_network_cache;
+static struct kmem_cache *ipn_node_cache;
+static struct kmem_cache *ipn_msgitem_cache;
+static DECLARE_MUTEX(ipn_glob_mutex);
+
+/* Protocol 1: HUB/Broadcast default protocol. Function Prototypes */
+static int ipn_bcast_newport(struct ipn_node *newport);
+static int ipn_bcast_handlemsg(struct ipn_node *from,
+ struct msgpool_item *msgitem);
+
+/* default protocol IPN_BROADCAST (slot 0 of the protocol table) */
+static struct ipn_protocol ipn_bcast = {
+ .refcnt=0,
+ .ipn_p_newport=ipn_bcast_newport,
+ .ipn_p_handlemsg=ipn_bcast_handlemsg};
+/* Protocol table */
+static struct ipn_protocol *ipn_protocol_table[IPN_MAX_PROTO]={&ipn_bcast};
+
+/* Socket call function prototypes */
+static int ipn_release(struct socket *);
+static int ipn_bind(struct socket *, struct sockaddr *, int);
+static int ipn_connect(struct socket *, struct sockaddr *,
+ int addr_len, int flags);
+static int ipn_getname(struct socket *, struct sockaddr *, int *, int);
+static unsigned int ipn_poll(struct file *, struct socket *, poll_table *);
+static int ipn_ioctl(struct socket *, unsigned int, unsigned long);
+static int ipn_shutdown(struct socket *, int);
+static int ipn_sendmsg(struct kiocb *, struct socket *,
+ struct msghdr *, size_t);
+static int ipn_recvmsg(struct kiocb *, struct socket *,
+ struct msghdr *, size_t, int);
+static int ipn_setsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int optlen);
+static int ipn_getsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int __user *optlen);
+
+/* Network table Management
+ * inode->ipn_network hash table */
+static inline void ipn_insert_network(struct hlist_head *list, struct ipn_network *ipnn)
+{
+ spin_lock(&ipn_table_lock);
+ hlist_add_head(&ipnn->hnode, list);
+ spin_unlock(&ipn_table_lock);
+}
+
+static inline void ipn_remove_network(struct ipn_network *ipnn)
+{
+ spin_lock(&ipn_table_lock);
+ hlist_del(&ipnn->hnode);
+ spin_unlock(&ipn_table_lock);
+}
+
+static struct ipn_network *ipn_find_network_byinode(struct inode *i)
+{
+ struct ipn_network *ipnn;
+ struct hlist_node *node;
+
+ spin_lock(&ipn_table_lock);
+ hlist_for_each_entry(ipnn, node,
+ &ipn_network_table[i->i_ino & (IPN_HASH_SIZE - 1)], hnode) {
+ struct dentry *dentry = ipnn->dentry;
+
+ if(atomic_read(&ipnn->refcnt) > 0 && dentry && dentry->d_inode == i)
+ goto found;
+ }
+ ipnn = NULL;
+found:
+ spin_unlock(&ipn_table_lock);
+ return ipnn;
+}
+
+/* msgpool management
+ * msgpool_item are ipn_network dependent (each net has its own MTU)
+ * for each message sent there is one msgpool_item and many struct msgitem,
+ * one for each recipient.
+ * msgitems are linked to the node's msgqueue or oobmsgqueue.
+ * when a message is delivered to a process the msgitem is deleted and
+ * the count of the msgpool_item is decreased.
+ * msgpool_item elements get deleted automatically when count is 0 */
+
+struct msgitem {
+ struct list_head list;
+ struct msgpool_item *msg;
+};
+
+/* alloc a fresh msgpool item. count is set to 1.
+ * the typical use is
+ * ipn_msgpool_alloc
+ * for each recipient
+ * enqueue messages to the process (using msgitem), ipn_msgpool_hold
+ * ipn_msgpool_put
+ * The message can be delivered concurrently. Initializing count to 1
+ * guarantees that it survives at least until it has been enqueued to all
+ * receivers */
+struct msgpool_item *ipn_msgpool_alloc(struct ipn_network *ipnn)
+{
+ struct msgpool_item *new=kmem_cache_alloc(ipnn->msgpool_cache,GFP_KERNEL);
+ if (!new) return NULL; /* callers must handle allocation failure */
+ atomic_set(&new->count,1);
+ atomic_inc(&ipnn->msgpool_nelem);
+ return new;
+}
+
+/* If the service is LOSSLESS, this msgpool call waits for an
+ * available msgpool item */
+static struct msgpool_item *ipn_msgpool_alloc_locking(struct ipn_network *ipnn)
+{
+ if (ipnn->flags & IPN_FLAG_LOSSLESS) {
+ while (atomic_read(&ipnn->msgpool_nelem) >= ipnn->msgpool_size) {
+ if (wait_event_interruptible_exclusive(ipnn->send_wait,
+ atomic_read(&ipnn->msgpool_nelem) < ipnn->msgpool_size))
+ return NULL;
+ }
+ }
+ return ipn_msgpool_alloc(ipnn);
+}
+
+static inline void ipn_msgpool_hold(struct msgpool_item *msg)
+{
+ atomic_inc(&msg->count);
+}
+
+/* decrease count and delete msgpool_item if count == 0 */
+void ipn_msgpool_put(struct msgpool_item *old,
+ struct ipn_network *ipnn)
+{
+ if (atomic_dec_and_test(&old->count)) {
+ kmem_cache_free(ipnn->msgpool_cache,old);
+ atomic_dec(&ipnn->msgpool_nelem);
+ if (ipnn->flags & IPN_FLAG_LOSSLESS) /* could be done anyway */
+ wake_up_interruptible(&ipnn->send_wait);
+ }
+}
+
+/* socket calls */
+static const struct proto_ops ipn_ops = {
+ .family = PF_IPN,
+ .owner = THIS_MODULE,
+ .release = ipn_release,
+ .bind = ipn_bind,
+ .connect = ipn_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = ipn_getname,
+ .poll = ipn_poll,
+ .ioctl = ipn_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = ipn_shutdown,
+ .setsockopt = ipn_setsockopt,
+ .getsockopt = ipn_getsockopt,
+ .sendmsg = ipn_sendmsg,
+ .recvmsg = ipn_recvmsg,
+ .mmap = sock_no_mmap,
+ .sendpage = sock_no_sendpage,
+};
+
+static struct proto ipn_proto = {
+ .name = "IPN",
+ .owner = THIS_MODULE,
+ .obj_size = sizeof(struct ipn_sock),
+};
+
+/* create a socket
+ * ipn_node is a separate structure, pointed by ipn_sock -> node
+ * when a node is "persistent", ipn_node survives while ipn_sock gets released*/
+static int ipn_create(struct net *net,struct socket *sock, int protocol)
+{
+ struct ipn_sock *ipn_sk;
+ struct ipn_node *ipn_node;
+
+ if (net != &init_net)
+ return -EAFNOSUPPORT;
+
+ if (sock->type != SOCK_RAW)
+ return -EPROTOTYPE;
+ if (protocol > 0)
+ protocol=protocol-1;
+ else
+ protocol=IPN_BROADCAST-1;
+ if (protocol < 0 || protocol >= IPN_MAX_PROTO ||
+ ipn_protocol_table[protocol] == NULL)
+ return -EPROTONOSUPPORT;
+ ipn_sk = (struct ipn_sock *) sk_alloc(net, PF_IPN, GFP_KERNEL, &ipn_proto);
+
+ if (!ipn_sk)
+ return -ENOMEM;
+ ipn_sk->node=ipn_node=kmem_cache_alloc(ipn_node_cache,GFP_KERNEL);
+ if (!ipn_node) {
+ sock_put((struct sock *) ipn_sk);
+ return -ENOMEM;
+ }
+ sock_init_data(sock,(struct sock *) ipn_sk);
+ sock->state = SS_UNCONNECTED;
+ sock->ops = &ipn_ops;
+ sock->sk=(struct sock *)ipn_sk;
+ INIT_LIST_HEAD(&ipn_node->nodelist);
+ ipn_node->protocol=protocol;
+ ipn_node->flags=IPN_NODEFLAG_INUSE;
+ ipn_node->shutdown=RCV_SHUTDOWN_NO_OOB;
+ ipn_node->descr[0]=0;
+ ipn_node->portno=IPN_PORTNO_ANY;
+ ipn_node->net=net;
+ ipn_node->dev=NULL;
+ ipn_node->proto_private=NULL;
+ ipn_node->totmsgcount=0;
+ ipn_node->oobmsgcount=0;
+ spin_lock_init(&ipn_node->msglock);
+ INIT_LIST_HEAD(&ipn_node->msgqueue);
+ INIT_LIST_HEAD(&ipn_node->oobmsgqueue);
+ ipn_node->ipn=NULL;
+ init_waitqueue_head(&ipn_node->read_wait);
+ ipn_node->pbp=NULL;
+ return 0;
+}
+
+/* update # of readers and # of writers counters for an ipn network.
+ * This function sends oob messages to nodes requesting the service */
+static void ipn_net_update_counters(struct ipn_network *ipnn,
+ int chg_readers, int chg_writers) {
+ ipnn->numreaders += chg_readers;
+ ipnn->numwriters += chg_writers;
+ if (ipnn->mtu >= sizeof(struct numnode_oob))
+ {
+ struct msgpool_item *ipn_msg=ipn_msgpool_alloc(ipnn);
+ if (ipn_msg) {
+ struct numnode_oob *oob_msg=(struct numnode_oob *)(ipn_msg->data);
+ struct ipn_node *ipn_node;
+ ipn_msg->len=sizeof(struct numnode_oob);
+ oob_msg->level=IPN_ANY;
+ oob_msg->tag=IPN_OOB_NUMNODE_TAG;
+ oob_msg->numreaders=ipnn->numreaders;
+ oob_msg->numwriters=ipnn->numwriters;
+ list_for_each_entry(ipn_node, &ipnn->connectqueue, nodelist) {
+ if (ipn_node->flags & IPN_NODEFLAG_OOB_NUMNODES)
+ ipn_proto_oobsendmsg(ipn_node,ipn_msg);
+ }
+ ipn_msgpool_put(ipn_msg,ipnn);
+ }
+ }
+}
+
+/* flush pending messages (for close and shutdown RCV) */
+static void ipn_flush_recvqueue(struct ipn_node *ipn_node)
+{
+ struct ipn_network *ipnn=ipn_node->ipn;
+ spin_lock(&ipn_node->msglock);
+ while (!list_empty(&ipn_node->msgqueue)) {
+ struct msgitem *msgitem=
+ list_first_entry(&ipn_node->msgqueue, struct msgitem, list);
+ list_del(&msgitem->list);
+ ipn_node->totmsgcount--;
+ ipn_msgpool_put(msgitem->msg,ipnn);
+ kmem_cache_free(ipn_msgitem_cache,msgitem);
+ }
+ spin_unlock(&ipn_node->msglock);
+}
+
+/* flush pending oob messages (for socket close) */
+static void ipn_flush_oobrecvqueue(struct ipn_node *ipn_node)
+{
+ struct ipn_network *ipnn=ipn_node->ipn;
+ spin_lock(&ipn_node->msglock);
+ while (!list_empty(&ipn_node->oobmsgqueue)) {
+ struct msgitem *msgitem=
+ list_first_entry(&ipn_node->oobmsgqueue, struct msgitem, list);
+ list_del(&msgitem->list);
+ ipn_node->totmsgcount--;
+ ipn_node->oobmsgcount--;
+ ipn_msgpool_put(msgitem->msg,ipnn);
+ kmem_cache_free(ipn_msgitem_cache,msgitem);
+ }
+ spin_unlock(&ipn_node->msglock);
+}
+
+/* Terminate node. The node is "logically" terminated. */
+static int ipn_terminate_node(struct ipn_node *ipn_node)
+{
+ struct ipn_network *ipnn=ipn_node->ipn;
+ if (ipnn) {
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ return -ERESTARTSYS;
+ if (ipn_node->portno >= 0) {
+ ipn_protocol_table[ipnn->protocol]->ipn_p_predelport(ipn_node);
+ ipnn->connport[ipn_node->portno]=NULL;
+ }
+ list_del(&ipn_node->nodelist);
+ ipn_flush_recvqueue(ipn_node);
+ ipn_flush_oobrecvqueue(ipn_node);
+ if (ipn_node->portno >= 0) {
+ ipn_protocol_table[ipnn->protocol]->ipn_p_delport(ipn_node);
+ ipn_net_update_counters(ipnn,
+ (ipn_node->shutdown & RCV_SHUTDOWN)?0:-1,
+ (ipn_node->shutdown & SEND_SHUTDOWN)?0:-1);
+ }
+ ipn_node->ipn=NULL;
+ up(&ipnn->ipnn_mutex); /* always release, even for bound-only nodes */
+ if (ipn_node->dev)
+ ipn_netdev_close(ipn_node);
+ /* No more network elements */
+ if (atomic_dec_and_test(&ipnn->refcnt))
+ {
+ ipn_protocol_table[ipnn->protocol]->ipn_p_delnet(ipnn);
+ ipn_remove_network(ipnn);
+ ipn_protocol_table[ipnn->protocol]->refcnt--;
+ if (ipnn->dentry) {
+ dput(ipnn->dentry);
+ mntput(ipnn->mnt);
+ }
+ module_put(THIS_MODULE);
+ if (ipnn->msgpool_cache)
+ kmem_cache_destroy(ipnn->msgpool_cache);
+ if (ipnn->connport)
+ kfree(ipnn->connport);
+ kmem_cache_free(ipn_network_cache, ipnn);
+ }
+ }
+ if (ipn_node->pbp) {
+ kfree(ipn_node->pbp);
+ ipn_node->pbp=NULL;
+ }
+ ipn_node->shutdown = SHUTDOWN_XMASK;
+ return 0;
+}
+
+/* release of a socket */
+static int ipn_release (struct socket *sock)
+{
+ struct ipn_sock *ipn_sk=(struct ipn_sock *)sock->sk;
+ struct ipn_node *ipn_node=ipn_sk->node;
+ int rv;
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (ipn_node->flags & IPN_NODEFLAG_PERSIST) {
+ ipn_node->flags &= ~IPN_NODEFLAG_INUSE;
+ rv=0;
+ } else {
+ rv=ipn_terminate_node(ipn_node);
+ if (rv==0)
+ kmem_cache_free(ipn_node_cache,ipn_node);
+ }
+ if (rv==0)
+ sock_put((struct sock *) ipn_sk);
+ up(&ipn_glob_mutex);
+ return rv;
+}
+
+/* _set persist, change the persistence of a node,
+ * when persistence gets cleared and the node is no longer used
+ * the node is terminated and freed.
+ * ipn_glob_mutex must be locked */
+static int _ipn_setpersist(struct ipn_node *ipn_node, int persist)
+{
+ int rv=0;
+ if (persist)
+ ipn_node->flags |= IPN_NODEFLAG_PERSIST;
+ else {
+ ipn_node->flags &= ~IPN_NODEFLAG_PERSIST;
+ if (!(ipn_node->flags & IPN_NODEFLAG_INUSE)) {
+ rv=ipn_terminate_node(ipn_node);
+ if (rv==0)
+ kmem_cache_free(ipn_node_cache,ipn_node);
+ }
+ }
+ return rv;
+}
+
+/* ipn_setpersist
+ * lock ipn_glob_mutex and call _ipn_setpersist above */
+static int ipn_setpersist(struct ipn_node *ipn_node, int persist)
+{
+ int rv=0;
+ if (ipn_node->dev == NULL)
+ return -ENODEV;
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ rv=_ipn_setpersist(ipn_node,persist);
+ up(&ipn_glob_mutex);
+ return rv;
+}
+
+/* several network parameters can be set by setsockopt prior to bind */
+/* struct pre_bind_parms is a temporary structure connected to ipn_node->pbp
+ * to keep the parameter values. */
+struct pre_bind_parms {
+ unsigned short maxports;
+ unsigned short flags;
+ unsigned short msgpoolsize;
+ unsigned short mtu;
+ unsigned short mode;
+};
+
+/* STD_PARMS: BITS_PER_LONG nodes, no flags, BITS_PER_BYTE pending msgs,
+ * Ethernet + VLAN MTU, mode 0777 (octal) */
+#define STD_BIND_PARMS {BITS_PER_LONG, 0, BITS_PER_BYTE, 1514, 0777};
+
+static int ipn_mkname(struct sockaddr_un * sunaddr, int len)
+{
+ if (len <= sizeof(short) || len > sizeof(*sunaddr))
+ return -EINVAL;
+ if (!sunaddr || sunaddr->sun_family != AF_IPN)
+ return -EINVAL;
+ /*
+ * This may look like an off by one error but it is a bit more
+ * subtle. 108 is the longest valid AF_IPN path for a binding.
+ * sun_path[108] doesn't as such exist. However in kernel space
+ * we are guaranteed that it is a valid memory location in our
+ * kernel address buffer.
+ */
+ ((char *)sunaddr)[len]=0;
+ len = strlen(sunaddr->sun_path)+1+sizeof(short);
+ return len;
+}
+
+
+/* IPN BIND */
+static int ipn_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+ struct sockaddr_un *sunaddr=(struct sockaddr_un *)uaddr;
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct nameidata nd;
+ struct ipn_network *ipnn;
+ struct dentry * dentry = NULL;
+ int err;
+ struct pre_bind_parms parms=STD_BIND_PARMS;
+
+ //printk("IPN bind\n");
+
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (sock->state != SS_UNCONNECTED ||
+ ipn_node->ipn != NULL) {
+ err= -EISCONN;
+ goto out;
+ }
+
+ if (ipn_node->protocol >= 0 &&
+ (ipn_node->protocol >= IPN_MAX_PROTO ||
+ ipn_protocol_table[ipn_node->protocol] == NULL)) {
+ err= -EPROTONOSUPPORT;
+ goto out;
+ }
+
+ addr_len = ipn_mkname(sunaddr, addr_len);
+ if (addr_len < 0) {
+ err=addr_len;
+ goto out;
+ }
+
+ /* check if there is already a socket with that name */
+ err = path_lookup(sunaddr->sun_path, LOOKUP_FOLLOW, &nd);
+ if (err) { /* it does not exist, NEW IPN socket! */
+ unsigned int mode;
+ /* Is everything okay with the parent? */
+ err = path_lookup(sunaddr->sun_path, LOOKUP_PARENT, &nd);
+ if (err)
+ goto out_mknod_parent;
+ /* Do I have the permission to create a file? */
+ dentry = lookup_create(&nd, 0);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out_mknod_unlock;
+ /*
+ * All right, let's create it.
+ */
+ if (ipn_node->pbp)
+ mode = ipn_node->pbp->mode;
+ else
+ mode = SOCK_INODE(sock)->i_mode;
+ mode = S_IFSOCK | (mode & ~current->fs->umask);
+ err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+ if (err)
+ goto out_mknod_dput;
+ mutex_unlock(&nd.dentry->d_inode->i_mutex);
+ dput(nd.dentry);
+ nd.dentry = dentry;
+ /* create a new ipn_network item */
+ if (ipn_node->pbp)
+ parms=*ipn_node->pbp;
+ ipnn=kmem_cache_zalloc(ipn_network_cache,GFP_KERNEL);
+ if (!ipnn) {
+ err=-ENOMEM;
+ goto out_mknod_dput_ipnn;
+ }
+ ipnn->connport=kzalloc(parms.maxports * sizeof(struct ipn_node *),GFP_KERNEL);
+ if (!ipnn->connport) {
+ err=-ENOMEM;
+ goto out_mknod_dput_ipnn2;
+ }
+
+ /* module refcnt is incremented for each network, thus
+ * rmmod is forbidden if there are persistent nodes */
+ if (!try_module_get(THIS_MODULE)) {
+ err = -EINVAL;
+ goto out_mknod_dput_ipnn2;
+ }
+ memcpy(&ipnn->sunaddr,sunaddr,addr_len);
+ ipnn->mtu=parms.mtu;
+ ipnn->msgpool_cache=kmem_cache_create(ipnn->sunaddr.sun_path,sizeof(struct msgpool_item)+ipnn->mtu,0,0,NULL);
+ if (!ipnn->msgpool_cache) {
+ err=-ENOMEM;
+ goto out_mknod_dput_putmodule;
+ }
+ INIT_LIST_HEAD(&ipnn->unconnectqueue);
+ INIT_LIST_HEAD(&ipnn->connectqueue);
+ atomic_set(&ipnn->refcnt,1);
+ ipnn->dentry=nd.dentry;
+ ipnn->mnt=nd.mnt;
+ init_MUTEX(&ipnn->ipnn_mutex);
+ ipnn->sunaddr_len=addr_len;
+ ipnn->protocol=ipn_node->protocol;
+ if (ipnn->protocol < 0) ipnn->protocol = 0;
+ ipn_protocol_table[ipnn->protocol]->refcnt++;
+ ipnn->flags=parms.flags;
+ ipnn->numreaders=0;
+ ipnn->numwriters=0;
+ ipnn->maxports=parms.maxports;
+ atomic_set(&ipnn->msgpool_nelem,0);
+ ipnn->msgpool_size=parms.msgpoolsize;
+ ipnn->proto_private=NULL;
+ init_waitqueue_head(&ipnn->send_wait);
+ err=ipn_protocol_table[ipnn->protocol]->ipn_p_newnet(ipnn);
+ if (err)
+ goto out_mknod_dput_putmodule;
+ ipn_insert_network(&ipn_network_table[nd.dentry->d_inode->i_ino & (IPN_HASH_SIZE-1)],ipnn);
+ } else {
+ /* join an existing network */
+ err = vfs_permission(&nd, MAY_EXEC);
+ if (err)
+ goto put_fail;
+ err = -ECONNREFUSED;
+ if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
+ goto put_fail;
+ ipnn=ipn_find_network_byinode(nd.dentry->d_inode);
+ if (!ipnn || (ipnn->flags & IPN_FLAG_TERMINATED))
+ goto put_fail;
+ list_add_tail(&ipn_node->nodelist,&ipnn->unconnectqueue);
+ atomic_inc(&ipnn->refcnt);
+ }
+ if (ipn_node->pbp) {
+ kfree(ipn_node->pbp);
+ ipn_node->pbp=NULL;
+ }
+ ipn_node->ipn=ipnn;
+ ipn_node->flags |= IPN_NODEFLAG_BOUND;
+ up(&ipn_glob_mutex);
+ return 0;
+
+put_fail:
+ path_release(&nd);
+out:
+ up(&ipn_glob_mutex);
+ return err;
+
+out_mknod_dput_putmodule:
+ module_put(THIS_MODULE);
+out_mknod_dput_ipnn2:
+ kfree(ipnn->connport);
+out_mknod_dput_ipnn:
+ kmem_cache_free(ipn_network_cache,ipnn);
+out_mknod_dput:
+ dput(dentry);
+out_mknod_unlock:
+ mutex_unlock(&nd.dentry->d_inode->i_mutex);
+ path_release(&nd);
+out_mknod_parent:
+ if (err==-EEXIST)
+ err=-EADDRINUSE;
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+/* IPN CONNECT */
+static int ipn_connect(struct socket *sock, struct sockaddr *addr,
+ int addr_len, int flags){
+ struct sockaddr_un *sunaddr=(struct sockaddr_un*)addr;
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct nameidata nd;
+ struct ipn_network *ipnn,*previousipnn;
+ int err=0;
+ int portno;
+
+ /* the socket cannot be connected twice */
+ if (sock->state != SS_UNCONNECTED)
+ return -EISCONN;
+
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+
+ if ((previousipnn=ipn_node->ipn) == NULL) { /* unbound */
+ unsigned char mustshutdown=0;
+ err = ipn_mkname(sunaddr, addr_len);
+ if (err < 0)
+ goto out;
+ addr_len=err;
+ err = path_lookup(sunaddr->sun_path, LOOKUP_FOLLOW, &nd);
+ if (err)
+ goto out;
+ err = vfs_permission(&nd, MAY_READ);
+ if (err) {
+ if (err == -EACCES || err == -EROFS)
+ mustshutdown|=RCV_SHUTDOWN;
+ else
+ goto put_fail;
+ }
+ err = vfs_permission(&nd, MAY_WRITE);
+ if (err) {
+ if (err == -EACCES)
+ mustshutdown|=SEND_SHUTDOWN;
+ else
+ goto put_fail;
+ }
+ mustshutdown |= ipn_node->shutdown;
+ /* if the combination of shutdown and permissions leaves
+ * no abilities, connect returns EACCES */
+ if (mustshutdown == SHUTDOWN_XMASK) {
+ err=-EACCES;
+ goto put_fail;
+ } else {
+ err=0;
+ ipn_node->shutdown=mustshutdown;
+ }
+ if (!S_ISSOCK(nd.dentry->d_inode->i_mode)) {
+ err = -ECONNREFUSED;
+ goto put_fail;
+ }
+ ipnn=ipn_find_network_byinode(nd.dentry->d_inode);
+ if (!ipnn || (ipnn->flags & IPN_FLAG_TERMINATED)) {
+ err = -ECONNREFUSED;
+ goto put_fail;
+ }
+ if (ipn_node->protocol == IPN_ANY)
+ ipn_node->protocol=ipnn->protocol;
+ else if (ipnn->protocol != ipn_node->protocol) {
+ err = -EPROTO;
+ goto put_fail;
+ }
+ path_release(&nd);
+ ipn_node->ipn=ipnn;
+ } else
+ ipnn=ipn_node->ipn;
+
+ if (down_interruptible(&ipnn->ipnn_mutex)) {
+ err=-ERESTARTSYS;
+ goto out;
+ }
+ portno = ipn_protocol_table[ipnn->protocol]->ipn_p_newport(ipn_node);
+ if (portno >= 0 && portno<ipnn->maxports) {
+ sock->state = SS_CONNECTED;
+ ipn_node->portno=portno;
+ ipnn->connport[portno]=ipn_node;
+ if (!(ipn_node->flags & IPN_NODEFLAG_BOUND))
+ atomic_inc(&ipnn->refcnt);
+ else /* a bound node may sit on unconnectqueue: unlink it first */
+ list_del(&ipn_node->nodelist);
+ list_add_tail(&ipn_node->nodelist,&ipnn->connectqueue);
+ ipn_net_update_counters(ipnn,
+ (ipn_node->shutdown & RCV_SHUTDOWN)?0:1,
+ (ipn_node->shutdown & SEND_SHUTDOWN)?0:1);
+ } else {
+ ipn_node->ipn=previousipnn; /* undo changes on ipn_node->ipn */
+ err=-EADDRNOTAVAIL;
+ }
+ up(&ipnn->ipnn_mutex);
+ up(&ipn_glob_mutex);
+ return err;
+
+put_fail:
+ path_release(&nd);
+out:
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+static int ipn_getname(struct socket *sock, struct sockaddr *uaddr,
+ int *uaddr_len, int peer) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ struct sockaddr_un *sunaddr=(struct sockaddr_un *)uaddr;
+ int err=0;
+
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (ipnn) {
+ *uaddr_len = ipnn->sunaddr_len;
+ memcpy(sunaddr,&ipnn->sunaddr,*uaddr_len);
+ } else
+ err = -ENOTCONN;
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+/* IPN POLL */
+static unsigned int ipn_poll(struct file *file, struct socket *sock,
+ poll_table *wait) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ unsigned int mask=0;
+
+ if (ipnn) {
+ poll_wait(file,&ipn_node->read_wait,wait);
+ if (ipnn->flags & IPN_FLAG_LOSSLESS)
+ poll_wait(file,&ipnn->send_wait,wait);
+ /* POLLIN if recv succeeds,
+ * POLL{PRI,RDNORM} if there are {oob,non-oob} messages */
+ if (ipn_node->totmsgcount > 0) mask |= POLLIN;
+ if (!(list_empty(&ipn_node->msgqueue))) mask |= POLLRDNORM;
+ if (!(list_empty(&ipn_node->oobmsgqueue))) mask |= POLLPRI;
+ if ((!(ipnn->flags & IPN_FLAG_LOSSLESS)) ||
+ (atomic_read(&ipnn->msgpool_nelem) < ipnn->msgpool_size))
+ mask |= POLLOUT | POLLWRNORM;
+ }
+ return mask;
+}
+
+/* connect netdev (from ioctl). connect a bound socket to a
+ * network device TAP or GRAB */
+static int ipn_connect_netdev(struct socket *sock,struct ifreq *ifr)
+{
+ int err=0;
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+ if (sock->state != SS_UNCONNECTED)
+ return -EISCONN;
+ if (!ipnn)
+ return -ENOTCONN; /* Maybe we need a different error for "NOT BOUND" */
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (down_interruptible(&ipnn->ipnn_mutex)) {
+ up(&ipn_glob_mutex);
+ return -ERESTARTSYS;
+ }
+ ipn_node->dev=ipn_netdev_alloc(ipn_node->net,ifr->ifr_flags,ifr->ifr_name,&err);
+ if (ipn_node->dev) {
+ int portno;
+ portno = ipn_protocol_table[ipnn->protocol]->ipn_p_newport(ipn_node);
+ if (portno >= 0 && portno<ipnn->maxports) {
+ sock->state = SS_CONNECTED;
+ ipn_node->portno=portno;
+ ipn_node->flags |= ifr->ifr_flags & IPN_NODEFLAG_DEVMASK;
+ ipnn->connport[portno]=ipn_node;
+ err=ipn_netdev_activate(ipn_node);
+ if (err) {
+ sock->state = SS_UNCONNECTED;
+ ipn_protocol_table[ipnn->protocol]->ipn_p_delport(ipn_node);
+ ipn_node->dev=NULL;
+ ipn_node->portno= -1;
+ ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
+ ipnn->connport[portno]=NULL;
+ } else {
+ ipn_protocol_table[ipnn->protocol]->ipn_p_postnewport(ipn_node);
+ list_del(&ipn_node->nodelist);
+ list_add_tail(&ipn_node->nodelist,&ipnn->connectqueue);
+ }
+ } else {
+ ipn_netdev_close(ipn_node);
+ err=-EADDRNOTAVAIL;
+ ipn_node->dev=NULL;
+ }
+ } else
+ err=-EINVAL;
+ up(&ipnn->ipnn_mutex);
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+/* join a netdev, a socket gets connected to a persistent node
+ * not connected to another socket */
+static int ipn_join_netdev(struct socket *sock,struct ifreq *ifr)
+{
+ int err=0;
+ struct net_device *dev;
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_node *ipn_joined;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ if (sock->state != SS_UNCONNECTED)
+ return -EISCONN;
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (down_interruptible(&ipnn->ipnn_mutex)) {
+ up(&ipn_glob_mutex);
+ return -ERESTARTSYS;
+ }
+ dev=__dev_get_by_name(ipn_node->net,ifr->ifr_name);
+ if (!dev)
+ dev=__dev_get_by_index(ipn_node->net,ifr->ifr_ifindex);
+ if (dev && (ipn_joined=ipn_netdev2node(dev)) != NULL) { /* the interface does exist */
+ int i;
+ for (i=0;i<ipnn->maxports && ipn_joined != ipnn->connport[i] ;i++)
+ ;
+ if (i < ipnn->maxports) { /* found */
+ /* ipn_joined replaces ipn_node */
+ ((struct ipn_sock *)sock->sk)->node=ipn_joined;
+ ipn_joined->flags |= IPN_NODEFLAG_INUSE;
+ atomic_dec(&ipnn->refcnt);
+ kmem_cache_free(ipn_node_cache,ipn_node);
+ } else
+ err=-EPERM;
+ } else
+ err=-EADDRNOTAVAIL;
+ up(&ipnn->ipnn_mutex);
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+/* set persistence of a node looking for it by interface name
+ * (it is for sysadm, to close network interfaces)*/
+static int ipn_setpersist_netdev(struct ifreq *ifr, int value)
+{
+ struct net_device *dev;
+ struct ipn_node *ipn_node;
+ int err=0;
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ dev=__dev_get_by_name(&init_net,ifr->ifr_name);
+ if (!dev)
+ dev=__dev_get_by_index(&init_net,ifr->ifr_ifindex);
+ if (dev && (ipn_node=ipn_netdev2node(dev)) != NULL)
+ _ipn_setpersist(ipn_node,value);
+ else
+ err=-EADDRNOTAVAIL;
+ up(&ipn_glob_mutex);
+ return err;
+}
+
+/* IPN IOCTL */
+static int ipn_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ void __user* argp = (void __user*)arg;
+ struct ifreq ifr;
+
+ if (ipn_node->shutdown == SHUTDOWN_XMASK)
+ return -ECONNRESET;
+
+ /* get arguments */
+ switch (cmd) {
+ case IPN_SETPERSIST_NETDEV:
+ case IPN_CLRPERSIST_NETDEV:
+ case IPN_CONN_NETDEV:
+ case IPN_JOIN_NETDEV:
+ case SIOCSIFHWADDR:
+ if (copy_from_user(&ifr, argp, sizeof ifr))
+ return -EFAULT;
+ ifr.ifr_name[IFNAMSIZ-1] = '\0';
+ }
+
+ /* actions for unconnected and unbound sockets */
+ switch (cmd) {
+ case IPN_SETPERSIST_NETDEV:
+ return ipn_setpersist_netdev(&ifr,1);
+ case IPN_CLRPERSIST_NETDEV:
+ return ipn_setpersist_netdev(&ifr,0);
+ case SIOCSIFHWADDR:
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+ if (ipn_node->dev && (ipn_node->flags &IPN_NODEFLAG_TAP))
+ return dev_set_mac_address(ipn_node->dev, &ifr.ifr_hwaddr);
+ else
+ return -EADDRNOTAVAIL;
+ }
+ if (ipnn == NULL || (ipnn->flags & IPN_FLAG_TERMINATED))
+ return -ENOTCONN;
+ /* actions for connected or bound sockets */
+ switch (cmd) {
+ case IPN_CONN_NETDEV:
+ return ipn_connect_netdev(sock,&ifr);
+ case IPN_JOIN_NETDEV:
+ return ipn_join_netdev(sock,&ifr);
+ case IPN_SETPERSIST:
+ return ipn_setpersist(ipn_node,arg);
+ default:
+ if (ipnn) {
+ int rv;
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ return -ERESTARTSYS;
+ rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_ioctl(ipn_node,cmd,arg);
+ up(&ipnn->ipnn_mutex);
+ return rv;
+ } else
+ return -EOPNOTSUPP;
+ }
+}
+
+/* shutdown: close socket for input or for output.
+ * shutdown can be called prior to connect and it is not reversible */
+static int ipn_shutdown(struct socket *sock, int mode) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ int oldshutdown=ipn_node->shutdown;
+ mode = (mode+1)&(RCV_SHUTDOWN|SEND_SHUTDOWN);
+
+ ipn_node->shutdown |= mode;
+
+ if(ipnn) {
+ if (down_interruptible(&ipnn->ipnn_mutex)) {
+ ipn_node->shutdown = oldshutdown;
+ return -ERESTARTSYS;
+ }
+ oldshutdown=ipn_node->shutdown-oldshutdown;
+ if (sock->state == SS_CONNECTED && oldshutdown) {
+ ipn_net_update_counters(ipnn,
+ (ipn_node->shutdown & RCV_SHUTDOWN)?0:-1,
+ (ipn_node->shutdown & SEND_SHUTDOWN)?0:-1);
+ }
+
+ /* if recv channel has been shut down, flush the recv queue */
+ if ((ipn_node->shutdown & RCV_SHUTDOWN))
+ ipn_flush_recvqueue(ipn_node);
+ up(&ipnn->ipnn_mutex);
+ }
+ return 0;
+}
+
+/* injectmsg: a new message is entering the ipn network.
+ * injectmsg gets called by send and by the grab/tap node */
+int ipn_proto_injectmsg(struct ipn_node *from, struct msgpool_item *msg)
+{
+ struct ipn_network *ipnn=from->ipn;
+ int err=0;
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ err=-ERESTARTSYS;
+ else {
+ ipn_protocol_table[ipnn->protocol]->ipn_p_handlemsg(from, msg);
+ up(&ipnn->ipnn_mutex);
+ }
+ return err;
+}
+
+/* SEND MSG */
+static int ipn_sendmsg(struct kiocb *kiocb, struct socket *sock,
+ struct msghdr *msg, size_t len) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ struct msgpool_item *newmsg;
+ int err=0;
+
+ if (unlikely(sock->state != SS_CONNECTED))
+ return -ENOTCONN;
+ if (unlikely(ipn_node->shutdown & SEND_SHUTDOWN)) {
+ if (ipn_node->shutdown == SHUTDOWN_XMASK)
+ return -ECONNRESET;
+ else
+ return -EPIPE;
+ }
+ if (len > ipnn->mtu)
+ return -EOVERFLOW;
+ newmsg=ipn_msgpool_alloc_locking(ipnn);
+ if (!newmsg)
+ return -ENOMEM;
+ newmsg->len=len;
+ err=memcpy_fromiovec(newmsg->data, msg->msg_iov, len);
+ if (!err)
+ err=ipn_proto_injectmsg(ipn_node, newmsg);
+ ipn_msgpool_put(newmsg,ipnn);
+ return err ? err : len;
+}
+
+/* enqueue an oob message. "to" is the destination */
+void ipn_proto_oobsendmsg(struct ipn_node *to, struct msgpool_item *msg)
+{
+ if (to) {
+ if (!to->dev) { /* no oob to netdev */
+ struct msgitem *msgitem;
+ struct ipn_network *ipnn=to->ipn;
+ spin_lock(&to->msglock);
+ if ((to->shutdown & RCV_SHUTDOWN_NO_OOB) == 0 &&
+ (ipnn->flags & IPN_FLAG_LOSSLESS ||
+ to->oobmsgcount < ipnn->msgpool_size)) {
+ if ((msgitem=kmem_cache_alloc(ipn_msgitem_cache,GFP_ATOMIC))!=NULL) {
+ msgitem->msg=msg;
+ to->totmsgcount++;
+ to->oobmsgcount++;
+ list_add_tail(&msgitem->list, &to->oobmsgqueue);
+ ipn_msgpool_hold(msg);
+ }
+ }
+ spin_unlock(&to->msglock);
+ wake_up_interruptible(&to->read_wait);
+ }
+ }
+}
+
+/* ipn_proto_sendmsg is called by a protocol implementation to enqueue a
+ * message for a destination (to). */
+void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg)
+{
+ if (to) {
+ if (to->dev) {
+ ipn_netdev_sendmsg(to,msg);
+ } else {
+ /* socket send */
+ struct msgitem *msgitem;
+ struct ipn_network *ipnn=to->ipn;
+ spin_lock(&to->msglock);
+ if ((ipnn->flags & IPN_FLAG_LOSSLESS ||
+ to->totmsgcount < ipnn->msgpool_size) &&
+ (to->shutdown & RCV_SHUTDOWN)==0) {
+ if ((msgitem=kmem_cache_alloc(ipn_msgitem_cache,GFP_ATOMIC))!=NULL) {
+ msgitem->msg=msg;
+ to->totmsgcount++;
+ list_add_tail(&msgitem->list, &to->msgqueue);
+ ipn_msgpool_hold(msg);
+ }
+ }
+ spin_unlock(&to->msglock);
+ wake_up_interruptible(&to->read_wait);
+ }
+ }
+}
+
+/* IPN RECV */
+static int ipn_recvmsg(struct kiocb *kiocb, struct socket *sock,
+ struct msghdr *msg, size_t len, int flags) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ struct msgitem *msgitem;
+ struct msgpool_item *currmsg;
+
+ if (unlikely(sock->state != SS_CONNECTED))
+ return -ENOTCONN;
+
+ if (unlikely((ipn_node->shutdown & XRCV_SHUTDOWN) == XRCV_SHUTDOWN)) {
+ if (ipn_node->shutdown == SHUTDOWN_XMASK) /*EOF, nothing can be read*/
+ return 0;
+ else
+ return -EPIPE; /*trying to read on a write only node */
+ }
+
+ /* wait for a message */
+ spin_lock(&ipn_node->msglock);
+ while (ipn_node->totmsgcount == 0) {
+ spin_unlock(&ipn_node->msglock);
+ if (wait_event_interruptible(ipn_node->read_wait,
+ !(ipn_node->totmsgcount == 0)))
+ return -ERESTARTSYS;
+ spin_lock(&ipn_node->msglock);
+ }
+ /* oob gets delivered first. oob are rare */
+ if (likely(list_empty(&ipn_node->oobmsgqueue)))
+ msgitem=list_first_entry(&ipn_node->msgqueue, struct msgitem, list);
+ else {
+ msgitem=list_first_entry(&ipn_node->oobmsgqueue, struct msgitem, list);
+ msg->msg_flags |= MSG_OOB;
+ ipn_node->oobmsgcount--;
+ }
+ list_del(&msgitem->list);
+ ipn_node->totmsgcount--;
+ spin_unlock(&ipn_node->msglock);
+ currmsg=msgitem->msg;
+ if (currmsg->len < len)
+ len=currmsg->len;
+ memcpy_toiovec(msg->msg_iov, currmsg->data, len);
+ ipn_msgpool_put(currmsg,ipnn);
+ kmem_cache_free(ipn_msgitem_cache,msgitem);
+
+ return len;
+}
+
+/* resize a network: change the # of communication ports (connport) */
+static int ipn_netresize(struct ipn_network *ipnn,int newsize)
+{
+ int oldsize,min;
+ struct ipn_node **newconnport;
+ struct ipn_node **oldconnport;
+ int err;
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ return -ERESTARTSYS;
+ oldsize=ipnn->maxports;
+ if (newsize == oldsize) {
+ up(&ipnn->ipnn_mutex);
+ return 0;
+ }
+ min=oldsize;
+ /* shrink a network. all the ports we are going to eliminate
+ * must be unused! */
+ if (newsize < oldsize) {
+ int i;
+ for (i=newsize; i<oldsize; i++)
+ if (ipnn->connport[i]) {
+ up(&ipnn->ipnn_mutex);
+ return -EADDRINUSE;
+ }
+ min=newsize;
+ }
+ oldconnport=ipnn->connport;
+ /* allocate the new connport array and copy the old one */
+ newconnport=kzalloc(newsize * sizeof(struct ipn_node *),GFP_KERNEL);
+ if (!newconnport) {
+ up(&ipnn->ipnn_mutex);
+ return -ENOMEM;
+ }
+ memcpy(newconnport,oldconnport,min * sizeof(struct ipn_node *));
+ ipnn->connport=newconnport;
+ ipnn->maxports=newsize;
+ /* notify the protocol that the network has been resized */
+ err=ipn_protocol_table[ipnn->protocol]->ipn_p_resizenet(ipnn,oldsize,newsize);
+ if (err) {
+ /* roll back if the resize operation failed for the protocol */
+ ipnn->connport=oldconnport;
+ ipnn->maxports=oldsize;
+ kfree(newconnport);
+ } else
+ /* successful mission, network resized */
+ kfree(oldconnport);
+ up(&ipnn->ipnn_mutex);
+ return err;
+}
+
+/* IPN SETSOCKOPT */
+static int ipn_setsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int optlen) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+
+ if (ipn_node->shutdown == SHUTDOWN_XMASK)
+ return -ECONNRESET;
+ if (level != 0 && level != ipn_node->protocol+1)
+ return -EPROTONOSUPPORT;
+ if (level > 0) {
+ /* protocol specific sockopt */
+ if (ipnn) {
+ int rv;
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ return -ERESTARTSYS;
+ rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_setsockopt(ipn_node,optname,optval,optlen);
+ up(&ipnn->ipnn_mutex);
+ return rv;
+ } else
+ return -EOPNOTSUPP;
+ } else {
+ if (optname == IPN_SO_DESCR) {
+ if (optlen <= 0 || optlen > IPN_DESCRLEN)
+ return -EINVAL;
+ else {
+ memset(ipn_node->descr,0,IPN_DESCRLEN);
+ if (copy_from_user(ipn_node->descr,optval,optlen)) return -EFAULT;
+ ipn_node->descr[optlen-1]=0;
+ return 0;
+ }
+ } else {
+ if (optlen < sizeof(int))
+ return -EINVAL;
+ else if ((optname & IPN_SO_PREBIND) && (ipnn != NULL))
+ return -EISCONN;
+ else {
+ int val;
+ if (get_user(val, (int __user *) optval)) return -EFAULT;
+ if ((optname & IPN_SO_PREBIND) && !ipn_node->pbp) {
+ struct pre_bind_parms std=STD_BIND_PARMS;
+ ipn_node->pbp=kzalloc(sizeof(struct pre_bind_parms),GFP_KERNEL);
+ if (!ipn_node->pbp)
+ return -ENOMEM;
+ *(ipn_node->pbp)=std;
+ }
+ switch (optname) {
+ case IPN_SO_PORT:
+ if (sock->state == SS_UNCONNECTED)
+ ipn_node->portno=val;
+ else
+ return -EISCONN;
+ break;
+ case IPN_SO_CHANGE_NUMNODES:
+ if ((ipn_node->flags & IPN_NODEFLAG_BOUND)!=0) {
+ if (val <= 0)
+ return -EINVAL;
+ else
+ return ipn_netresize(ipnn,val);
+ } else
+ return -ENOTCONN;
+ break;
+ case IPN_SO_WANT_OOB_NUMNODES:
+ if (val)
+ ipn_node->flags |= IPN_NODEFLAG_OOB_NUMNODES;
+ else
+ ipn_node->flags &= ~IPN_NODEFLAG_OOB_NUMNODES;
+ break;
+ case IPN_SO_HANDLE_OOB:
+ if (val)
+ ipn_node->shutdown &= ~RCV_SHUTDOWN_NO_OOB;
+ else
+ ipn_node->shutdown |= RCV_SHUTDOWN_NO_OOB;
+ break;
+ case IPN_SO_MTU:
+ if (val <= 0)
+ return -EINVAL;
+ else
+ ipn_node->pbp->mtu=val;
+ break;
+ case IPN_SO_NUMNODES:
+ if (val <= 0)
+ return -EINVAL;
+ else
+ ipn_node->pbp->maxports=val;
+ break;
+ case IPN_SO_MSGPOOLSIZE:
+ if (val <= 0)
+ return -EINVAL;
+ else
+ ipn_node->pbp->msgpoolsize=val;
+ break;
+ case IPN_SO_FLAGS:
+ ipn_node->pbp->flags=val;
+ break;
+ case IPN_SO_MODE:
+ ipn_node->pbp->mode=val;
+ break;
+ }
+ return 0;
+ }
+ }
+ }
+}
+
+/* IPN GETSOCKOPT */
+static int ipn_getsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int __user *optlen) {
+ struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
+ struct ipn_network *ipnn=ipn_node->ipn;
+ int len;
+
+ if (ipn_node->shutdown == SHUTDOWN_XMASK)
+ return -ECONNRESET;
+ if (level != 0 && level != ipn_node->protocol+1)
+ return -EPROTONOSUPPORT;
+ if (level > 0) {
+ if (ipnn) {
+ int rv;
+ /* protocol specific sockopt */
+ if (down_interruptible(&ipnn->ipnn_mutex))
+ return -ERESTARTSYS;
+ rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_getsockopt(ipn_node,optname,optval,optlen);
+ up(&ipnn->ipnn_mutex);
+ return rv;
+ } else
+ return -EOPNOTSUPP;
+ } else {
+ if (get_user(len, optlen))
+ return -EFAULT;
+ if (optname == IPN_SO_DESCR) {
+ if (len < IPN_DESCRLEN)
+ return -EINVAL;
+ else {
+ if (len > IPN_DESCRLEN)
+ len=IPN_DESCRLEN;
+ if(put_user(len, optlen))
+ return -EFAULT;
+ if(copy_to_user(optval,ipn_node->descr,len))
+ return -EFAULT;
+ return 0;
+ }
+ } else {
+ int val=-2;
+ switch (optname) {
+ case IPN_SO_PORT:
+ val=ipn_node->portno;
+ break;
+ case IPN_SO_MTU:
+ if (ipnn)
+ val=ipnn->mtu;
+ else if (ipn_node->pbp)
+ val=ipn_node->pbp->mtu;
+ break;
+ case IPN_SO_NUMNODES:
+ if (ipnn)
+ val=ipnn->maxports;
+ else if (ipn_node->pbp)
+ val=ipn_node->pbp->maxports;
+ break;
+ case IPN_SO_MSGPOOLSIZE:
+ if (ipnn)
+ val=ipnn->msgpool_size;
+ else if (ipn_node->pbp)
+ val=ipn_node->pbp->msgpoolsize;
+ break;
+ case IPN_SO_FLAGS:
+ if (ipnn)
+ val=ipnn->flags;
+ else if (ipn_node->pbp)
+ val=ipn_node->pbp->flags;
+ break;
+ case IPN_SO_MODE:
+ if (ipnn)
+ val=-1;
+ else if (ipn_node->pbp)
+ val=ipn_node->pbp->mode;
+ break;
+ }
+ if (val < -1)
+ return -EINVAL;
+ else {
+ if (len < sizeof(int))
+ return -EOVERFLOW;
+ else {
+ len = sizeof(int);
+ if(put_user(len, optlen))
+ return -EFAULT;
+ if(copy_to_user(optval,&val,len))
+ return -EFAULT;
+ return 0;
+ }
+ }
+ }
+ }
+}
+
+/* BROADCAST/HUB implementation */
+
+static int ipn_bcast_newport(struct ipn_node *newport) {
+ struct ipn_network *ipnn=newport->ipn;
+ int i;
+ for (i=0;i<ipnn->maxports;i++) {
+ if (ipnn->connport[i] == NULL)
+ return i;
+ }
+ return -1;
+}
+
+static int ipn_bcast_handlemsg(struct ipn_node *from,
+ struct msgpool_item *msgitem){
+ struct ipn_network *ipnn=from->ipn;
+
+ struct ipn_node *ipn_node;
+ list_for_each_entry(ipn_node, &ipnn->connectqueue, nodelist) {
+ if (ipn_node != from)
+ ipn_proto_sendmsg(ipn_node,msgitem);
+ }
+ return 0;
+}
+
+static void ipn_null_delport(struct ipn_node *oldport) {}
+static void ipn_null_postnewport(struct ipn_node *newport) {}
+static void ipn_null_predelport(struct ipn_node *oldport) {}
+static int ipn_null_newnet(struct ipn_network *newnet) {return 0;}
+static int ipn_null_resizenet(struct ipn_network *net,int oldsize,int newsize) {
+ return 0;}
+static void ipn_null_delnet(struct ipn_network *oldnet) {}
+static int ipn_null_setsockopt(struct ipn_node *port,int optname,
+ char __user *optval, int optlen) {return -EOPNOTSUPP;}
+static int ipn_null_getsockopt(struct ipn_node *port,int optname,
+ char __user *optval, int *optlen) {return -EOPNOTSUPP;}
+static int ipn_null_ioctl(struct ipn_node *port,unsigned int request,
+ unsigned long arg) {return -EOPNOTSUPP;}
+
+/* Protocol Registration/deregistration */
+
+void ipn_init_protocol(struct ipn_protocol *p)
+{
+ if (p->ipn_p_delport == NULL) p->ipn_p_delport=ipn_null_delport;
+ if (p->ipn_p_postnewport == NULL) p->ipn_p_postnewport=ipn_null_postnewport;
+ if (p->ipn_p_predelport == NULL) p->ipn_p_predelport=ipn_null_predelport;
+ if (p->ipn_p_newnet == NULL) p->ipn_p_newnet=ipn_null_newnet;
+ if (p->ipn_p_resizenet == NULL) p->ipn_p_resizenet=ipn_null_resizenet;
+ if (p->ipn_p_delnet == NULL) p->ipn_p_delnet=ipn_null_delnet;
+ if (p->ipn_p_setsockopt == NULL) p->ipn_p_setsockopt=ipn_null_setsockopt;
+ if (p->ipn_p_getsockopt == NULL) p->ipn_p_getsockopt=ipn_null_getsockopt;
+ if (p->ipn_p_ioctl == NULL) p->ipn_p_ioctl=ipn_null_ioctl;
+}
+
+int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service)
+{
+ int rv=0;
+ if (ipn_service->ipn_p_newport == NULL ||
+ ipn_service->ipn_p_handlemsg == NULL)
+ return -EINVAL;
+ ipn_init_protocol(ipn_service);
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (protocol > 1 && protocol <= IPN_MAX_PROTO) {
+ protocol--;
+ if (ipn_protocol_table[protocol])
+ rv= -EEXIST;
+ else {
+ ipn_service->refcnt=0;
+ ipn_protocol_table[protocol]=ipn_service;
+ printk(KERN_INFO "IPN: Registered protocol %d\n",protocol+1);
+ }
+ } else
+ rv= -EINVAL;
+ up(&ipn_glob_mutex);
+ return rv;
+}
+
+int ipn_proto_deregister(int protocol)
+{
+ int rv=0;
+ if (down_interruptible(&ipn_glob_mutex))
+ return -ERESTARTSYS;
+ if (protocol > 1 && protocol <= IPN_MAX_PROTO) {
+ protocol--;
+ if (ipn_protocol_table[protocol]) {
+ if (ipn_protocol_table[protocol]->refcnt == 0) {
+ ipn_protocol_table[protocol]=NULL;
+ printk(KERN_INFO "IPN: Unregistered protocol %d\n",protocol+1);
+ } else
+ rv=-EADDRINUSE;
+ } else
+ rv= -ENOENT;
+ } else
+ rv= -EINVAL;
+ up(&ipn_glob_mutex);
+ return rv;
+}
+
+/* MAIN SECTION */
+/* Module constructor/destructor */
+static struct net_proto_family ipn_family_ops = {
+ .family = PF_IPN,
+ .create = ipn_create,
+ .owner = THIS_MODULE,
+};
+
+/* IPN constructor */
+static int ipn_init(void)
+{
+ int rc;
+
+ ipn_init_protocol(&ipn_bcast);
+ ipn_network_cache=kmem_cache_create("ipn_network",sizeof(struct ipn_network),0,0,NULL);
+ if (!ipn_network_cache) {
+ printk(KERN_CRIT "%s: Cannot create ipn_network SLAB cache!\n",
+ __FUNCTION__);
+ rc=-ENOMEM;
+ goto out;
+ }
+
+ ipn_node_cache=kmem_cache_create("ipn_node",sizeof(struct ipn_node),0,0,NULL);
+ if (!ipn_node_cache) {
+ printk(KERN_CRIT "%s: Cannot create ipn_node SLAB cache!\n",
+ __FUNCTION__);
+ rc=-ENOMEM;
+ goto out_net;
+ }
+
+ ipn_msgitem_cache=kmem_cache_create("ipn_msgitem",sizeof(struct msgitem),0,0,NULL);
+ if (!ipn_msgitem_cache) {
+ printk(KERN_CRIT "%s: Cannot create ipn_msgitem SLAB cache!\n",
+ __FUNCTION__);
+ rc=-ENOMEM;
+ goto out_net_node;
+ }
+
+ rc=proto_register(&ipn_proto,1);
+ if (rc != 0) {
+ printk(KERN_CRIT "%s: Cannot register the protocol!\n",
+ __FUNCTION__);
+ goto out_net_node_msg;
+ }
+
+ sock_register(&ipn_family_ops);
+ ipn_netdev_init();
+ printk(KERN_INFO "IPN: Virtual Square Project, University of Bologna 2007\n");
+ return 0;
+
+out_net_node_msg:
+ kmem_cache_destroy(ipn_msgitem_cache);
+out_net_node:
+ kmem_cache_destroy(ipn_node_cache);
+out_net:
+ kmem_cache_destroy(ipn_network_cache);
+out:
+ return rc;
+}
+
+/* IPN destructor */
+static void ipn_exit(void)
+{
+ ipn_netdev_fini();
+ if (ipn_msgitem_cache)
+ kmem_cache_destroy(ipn_msgitem_cache);
+ if (ipn_node_cache)
+ kmem_cache_destroy(ipn_node_cache);
+ if (ipn_network_cache)
+ kmem_cache_destroy(ipn_network_cache);
+ sock_unregister(PF_IPN);
+ proto_unregister(&ipn_proto);
+ printk(KERN_INFO "IPN removed\n");
+}
+
+module_init(ipn_init);
+module_exit(ipn_exit);
+
+EXPORT_SYMBOL_GPL(ipn_proto_register);
+EXPORT_SYMBOL_GPL(ipn_proto_deregister);
+EXPORT_SYMBOL_GPL(ipn_proto_sendmsg);
+EXPORT_SYMBOL_GPL(ipn_proto_oobsendmsg);
+EXPORT_SYMBOL_GPL(ipn_msgpool_alloc);
+EXPORT_SYMBOL_GPL(ipn_msgpool_put);
diff -Naur linux-2.6.24-rc5/net/ipn/ipn_netdev.c linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.c
--- linux-2.6.24-rc5/net/ipn/ipn_netdev.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.c 2007-12-16 18:53:24.000000000 +0100
@@ -0,0 +1,276 @@
+/*
+ * Inter process networking (virtual distributed ethernet) module
+ * Net devices: tap and grab
+ * (part of the View-OS project: wiki.virtualsquare.org)
+ *
+ * Copyright (C) 2007 Renzo Davoli ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Due to this file being licensed under the GPL there is controversy over
+ * whether this permits you to write a module that #includes this file
+ * without placing your module under the GPL. Please consult a lawyer for
+ * advice before doing this.
+ *
+ * WARNING: THIS CODE IS STILL EXPERIMENTAL
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/socket.h>
+#include <linux/poll.h>
+#include <linux/un.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <net/sock.h>
+#include <net/af_ipn.h>
+
+#define DRV_NAME "ipn"
+#define DRV_VERSION "0.3"
+
+static const struct ethtool_ops ipn_ethtool_ops;
+
+struct ipntap {
+ struct ipn_node *ipn_node;
+ struct net_device_stats stats;
+};
+
+/* TAP Net device open. */
+static int ipntap_net_open(struct net_device *dev)
+{
+ netif_start_queue(dev);
+ return 0;
+}
+
+/* TAP Net device close. */
+static int ipntap_net_close(struct net_device *dev)
+{
+ netif_stop_queue(dev);
+ return 0;
+}
+
+static struct net_device_stats *ipntap_net_stats(struct net_device *dev)
+{
+ struct ipntap *ipntap = netdev_priv(dev);
+ return &ipntap->stats;
+}
+
+/* receive from a TAP */
+static int ipn_net_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct ipntap *ipntap = netdev_priv(dev);
+ struct ipn_node *ipn_node=ipntap->ipn_node;
+ struct msgpool_item *newmsg;
+ if (!ipn_node || !ipn_node->ipn || skb->len > ipn_node->ipn->mtu)
+ goto drop;
+ newmsg=ipn_msgpool_alloc(ipn_node->ipn);
+ if (!newmsg)
+ goto drop;
+ newmsg->len=skb->len;
+ memcpy(newmsg->data,skb->data,skb->len);
+ ipn_proto_injectmsg(ipntap->ipn_node,newmsg);
+ ipn_msgpool_put(newmsg,ipn_node->ipn);
+ ipntap->stats.tx_packets++;
+ ipntap->stats.tx_bytes += skb->len;
+ kfree_skb(skb);
+ return 0;
+
+drop:
+ ipntap->stats.tx_dropped++;
+ kfree_skb(skb);
+ return 0;
+}
+
+/* receive from a GRAB via interface hook */
+struct sk_buff *ipn_handle_hook(struct ipn_node *ipn_node, struct sk_buff *skb)
+{
+ char *data=(skb->data)-(skb->mac_len);
+ int len=skb->len+skb->mac_len;
+
+ if (ipn_node &&
+ ((ipn_node->flags & IPN_NODEFLAG_DEVMASK) == IPN_NODEFLAG_GRAB) &&
+ ipn_node->ipn && len<=ipn_node->ipn->mtu) {
+ struct msgpool_item *newmsg;
+ newmsg=ipn_msgpool_alloc(ipn_node->ipn);
+ if (newmsg) {
+ newmsg->len=len;
+ memcpy(newmsg->data,data,len);
+ ipn_proto_injectmsg(ipn_node,newmsg);
+ ipn_msgpool_put(newmsg,ipn_node->ipn);
+ }
+ }
+
+ return (skb);
+}
+
+static void ipntap_setup(struct net_device *dev)
+{
+ dev->open = ipntap_net_open;
+ dev->hard_start_xmit = ipn_net_xmit;
+ dev->stop = ipntap_net_close;
+ dev->get_stats = ipntap_net_stats;
+ dev->ethtool_ops = &ipn_ethtool_ops;
+}
+
+
+struct net_device *ipn_netdev_alloc(struct net *net,int type, char *name, int *err)
+{
+ struct net_device *dev=NULL;
+ *err=0;
+ if (!name || *name==0)
+ name="ipn%d";
+ switch (type) {
+ case IPN_NODEFLAG_TAP:
+ dev=alloc_netdev(sizeof(struct ipntap), name, ipntap_setup);
+ if (!dev) {
+ *err= -ENOMEM;
+ break;
+ }
+ ether_setup(dev);
+ /* tuntap assigns its MAC by hand (htons(0x00FF) plus 4 random bytes);
+ * why does tuntap not use random_ether_addr? */
+ random_ether_addr((u8 *)&dev->dev_addr);
+ break;
+ case IPN_NODEFLAG_GRAB:
+ dev=dev_get_by_name(net,name);
+ if (dev) {
+ if (dev->flags & IFF_LOOPBACK)
+ *err= -EINVAL;
+ else if (rcu_dereference(dev->ipn_port) != NULL)
+ *err= -EBUSY;
+ if (*err)
+ dev=NULL;
+ }
+ break;
+ }
+ return dev;
+}
+
+int ipn_netdev_activate(struct ipn_node *ipn_node)
+{
+ int rv=-EINVAL;
+ switch (ipn_node->flags & IPN_NODEFLAG_DEVMASK) {
+ case IPN_NODEFLAG_TAP:
+ {
+ struct ipntap *ipntap=netdev_priv(ipn_node->dev);
+ ipntap->ipn_node=ipn_node;
+ rtnl_lock();
+ if ((rv=register_netdevice(ipn_node->dev)) == 0)
+ rcu_assign_pointer(ipn_node->dev->ipn_port, ipn_node);
+ rtnl_unlock();
+ if (rv) {/* error! */
+ ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
+ free_netdev(ipn_node->dev);
+ }
+ }
+ break;
+ case IPN_NODEFLAG_GRAB:
+ rtnl_lock();
+ rcu_assign_pointer(ipn_node->dev->ipn_port, ipn_node);
+ dev_set_promiscuity(ipn_node->dev,1);
+ rtnl_unlock();
+ rv=0;
+ break;
+ }
+ return rv;
+}
+
+void ipn_netdev_close(struct ipn_node *ipn_node)
+{
+ switch (ipn_node->flags & IPN_NODEFLAG_DEVMASK) {
+ case IPN_NODEFLAG_TAP:
+ ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
+ rtnl_lock();
+ unregister_netdevice(ipn_node->dev);
+ rtnl_unlock();
+ free_netdev(ipn_node->dev);
+ break;
+ case IPN_NODEFLAG_GRAB:
+ ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
+ rtnl_lock();
+ rcu_assign_pointer(ipn_node->dev->ipn_port, NULL);
+ dev_set_promiscuity(ipn_node->dev,-1);
+ rtnl_unlock();
+ break;
+ }
+}
+
+void ipn_netdev_sendmsg(struct ipn_node *to,struct msgpool_item *msg)
+{
+ struct sk_buff *skb;
+ struct net_device *dev=to->dev;
+ struct ipntap *ipntap=netdev_priv(dev);
+
+ if (msg->len > dev->mtu)
+ return;
+ skb=alloc_skb(msg->len+NET_IP_ALIGN,GFP_KERNEL);
+ if (!skb) {
+ ipntap->stats.rx_dropped++;
+ return;
+ }
+ memcpy(skb_put(skb,msg->len),msg->data,msg->len);
+ switch (to->flags & IPN_NODEFLAG_DEVMASK) {
+ case IPN_NODEFLAG_TAP:
+ skb->protocol = eth_type_trans(skb, dev);
+ netif_rx(skb);
+ ipntap->stats.rx_packets++;
+ ipntap->stats.rx_bytes += msg->len;
+ break;
+ case IPN_NODEFLAG_GRAB:
+ skb->dev = dev;
+ dev_queue_xmit(skb);
+ break;
+ }
+}
+
+/* ethtool interface */
+
+static int ipn_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+ cmd->supported = 0;
+ cmd->advertising = 0;
+ cmd->speed = SPEED_10;
+ cmd->duplex = DUPLEX_FULL;
+ cmd->port = PORT_TP;
+ cmd->phy_address = 0;
+ cmd->transceiver = XCVR_INTERNAL;
+ cmd->autoneg = AUTONEG_DISABLE;
+ cmd->maxtxpkt = 0;
+ cmd->maxrxpkt = 0;
+ return 0;
+}
+
+static void ipn_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
+{
+ strcpy(info->driver, DRV_NAME);
+ strcpy(info->version, DRV_VERSION);
+ strcpy(info->fw_version, "N/A");
+}
+
+static const struct ethtool_ops ipn_ethtool_ops = {
+ .get_settings = ipn_get_settings,
+ .get_drvinfo = ipn_get_drvinfo,
+ /* not implemented (yet?)
+ .get_msglevel = ipn_get_msglevel,
+ .set_msglevel = ipn_set_msglevel,
+ .get_link = ipn_get_link,
+ .get_rx_csum = ipn_get_rx_csum,
+ .set_rx_csum = ipn_set_rx_csum */
+};
+
+int ipn_netdev_init(void)
+{
+ ipn_handle_frame_hook=ipn_handle_hook;
+ return 0;
+}
+
+void ipn_netdev_fini(void)
+{
+ ipn_handle_frame_hook=NULL;
+}
diff -Naur linux-2.6.24-rc5/net/ipn/ipn_netdev.h linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.h
--- linux-2.6.24-rc5/net/ipn/ipn_netdev.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.h 2007-12-16 16:30:04.000000000 +0100
@@ -0,0 +1,47 @@
+#ifndef _IPN_NETDEV_H
+#define _IPN_NETDEV_H
+/*
+ * Inter process networking (virtual distributed ethernet) module
+ * Net devices: tap and grab
+ * (part of the View-OS project: wiki.virtualsquare.org)
+ *
+ * Copyright (C) 2007 Renzo Davoli ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Due to this file being licensed under the GPL there is controversy over
+ * whether this permits you to write a module that #includes this file
+ * without placing your module under the GPL. Please consult a lawyer for
+ * advice before doing this.
+ *
+ * WARNING: THIS CODE IS STILL EXPERIMENTAL
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/socket.h>
+#include <linux/poll.h>
+#include <linux/un.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <linux/etherdevice.h>
+#include <linux/if_bridge.h>
+#include <net/sock.h>
+#include <net/af_ipn.h>
+
+struct net_device *ipn_netdev_alloc(struct net *net,int type, char *name, int *err);
+int ipn_netdev_activate(struct ipn_node *ipn_node);
+void ipn_netdev_close(struct ipn_node *ipn_node);
+void ipn_netdev_sendmsg(struct ipn_node *to,struct msgpool_item *msg);
+int ipn_netdev_init(void);
+void ipn_netdev_fini(void);
+
+static inline struct ipn_node *ipn_netdev2node(struct net_device *dev)
+{
+ return rcu_dereference(dev->ipn_port);
+}
+#endif

2007-12-17 10:21:36

by David Lang

Subject: Re: [PATCH 0/1] IPN: Inter Process Networking

On Mon, 17 Dec 2007, Renzo Davoli wrote:

> Inter Process Networking (PATCH):
>
> 1. WHAT IS IPN?
> ---------------
>
> IPN is a new address family designed for one-to-many, many-to-many and
> peer-to-peer communication among processes.
> Berkeley sockets have been designed for client-server or point-to-point
> communication; AF_UNIX does not support multicast/broadcast. AF_IPN
> does, in a simple, efficient but extensible way.
> IPN is an Inter Process Communication paradigm where all the processes
> appear as they were connected by a networking bus.
>
> On IPN, processes can interoperate using real networking protocols
> (e.g. ethernet) but also using application defined protocols (maybe
> just sending ascii strings, video or audio frames, etc).
> IPN provides networking (in the broaden definition you can imagine) to
> the processes. Processes can be ethernet nodes, run their own TCP-IP stacks
> if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.etc.
>
> IPN networks can be interconnected with real networks or IPN networks
> running on different computers can interoperate (can be connected by
> virtual cables).
>
> IPN is part of the Virtual Square Project (vde, lwipv6, view-os,
> umview/kmview, see wiki.virtualsquare.org).

other than the fact that this is bi-directional, how is this better than
using pipes and splice?

wouldn't it be better to just add the ability for multiple writers to send
to the same pipe, and then have all of them splice into the output of that
pipe? this would give the same data-agnostic communication that you are
looking for, and with the minor detail that software would have to filter
out messages that they send, would appear to meet all the goals you are
looking at, using existing kernel features that are designed to be very
high performance.

David Lang

2007-12-17 10:54:33

by Ludovico Gardenghi

Subject: Re: [PATCH 0/1] IPN: Inter Process Networking

On Mon, Dec 17, 2007 at 03:31:48AM -0800, [email protected] wrote:

> wouldn't it be better to just add the ability for multiple writers to send
> to the same pipe, and then have all of them splice into the output of that
> pipe? this would give the same data-agnostic communication that you are
> looking for, and with the minor detail that software would have to filter
> out messages that they send, would appear to meet all the goals you are
> looking at, using existing kernel features that are designed to be very
> high performance.

Being able to define both filtering policies (think of a virtual
ethernet layer 2 switch, for instance: we have situations where dozens
or hundreds of virtual cables are connected to the same switch, and it
would be much, much slower if you had to wake up all the user processes
for each single non-broadcast ethernet frame and send them useless data)
and delivery guarantees (lossless vs best-effort delivery) are not minor
details in our opinion.
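
As a rough illustration of the filtering argument, here is a minimal
userspace sketch (not code from the patch) that counts how many receivers
get woken up per frame under a hub policy and under a MAC-learning switch
policy. NPORTS, struct l2_switch, hub_forward() and switch_forward() are
names made up for this example; the only thing borrowed from ethernet is
the header layout, destination address first and source address second.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPORTS 8   /* hypothetical number of ports on the virtual bus */
#define MACLEN 6

/* Hypothetical forwarding table: at most one learned MAC address per port. */
struct l2_switch {
    uint8_t mac[NPORTS][MACLEN];
    int learned[NPORTS];
};

/* Hub policy: every port except the sender is woken up.
 * Sets deliver[i] to 1 for each receiver and returns how many there are. */
int hub_forward(int from, int deliver[NPORTS])
{
    int i, woken = 0;

    assert(from >= 0 && from < NPORTS);
    for (i = 0; i < NPORTS; i++) {
        deliver[i] = (i != from);
        woken += deliver[i];
    }
    return woken;
}

/* Switch policy: learn the source address, then deliver only to the port
 * that owns the destination address. Broadcast or unknown destinations
 * fall back to the hub behaviour. frame[0..5] is the destination MAC and
 * frame[6..11] the source MAC, as in an ethernet header. */
int switch_forward(struct l2_switch *sw, int from,
                   const uint8_t *frame, int deliver[NPORTS])
{
    static const uint8_t bcast[MACLEN] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    const uint8_t *dst = frame, *src = frame + MACLEN;
    int i;

    assert(from >= 0 && from < NPORTS);
    memcpy(sw->mac[from], src, MACLEN);
    sw->learned[from] = 1;

    if (memcmp(dst, bcast, MACLEN) != 0) {
        for (i = 0; i < NPORTS; i++) {
            if (sw->learned[i] && memcmp(sw->mac[i], dst, MACLEN) == 0) {
                memset(deliver, 0, NPORTS * sizeof(int));
                deliver[i] = (i != from);
                return deliver[i];
            }
        }
    }
    return hub_forward(from, deliver);
}
```

With NPORTS ports, the hub policy wakes NPORTS-1 receivers for every
frame, while the switch policy wakes a single receiver once the
destination address has been learned. The sketch ignores address ageing
and addresses that move between ports.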

We could have added a layer 2 virtual ethernet switch at kernel
level, but it seemed too specific. With a minor effort we have separated
the "dumb" bus (IPN) from the ability to process specific structured data
with specific policies (sub-modules such as kvde_switch).
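
The split between the bus and its policies can be pictured with a small
userspace model of a policy table (a sketch under assumptions, not the
kernel code). It follows the pattern visible in ipn_proto_register() and
ipn_init_protocol() in the patch: two hooks are mandatory, optional hooks
get a default, and protocol number 1 stays reserved for the built-in
broadcast policy. struct bus_policy, policy_register(), MAX_PROTO and the
demo_* functions are hypothetical names with simplified signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PROTO 4  /* assumption for this sketch; the patch uses IPN_MAX_PROTO */

/* Hypothetical, simplified policy table: two mandatory hooks, one optional. */
struct bus_policy {
    int (*newport)(int maxports);            /* mandatory: pick a port */
    int (*handlemsg)(int from, size_t len);  /* mandatory: route a message */
    int (*ioctl)(unsigned int cmd);          /* optional */
};

static struct bus_policy *policy_table[MAX_PROTO];

/* Default used when a policy does not provide its own ioctl hook. */
static int null_ioctl(unsigned int cmd)
{
    (void)cmd;
    return -EOPNOTSUPP;
}

/* Register a policy under protocol numbers 2..MAX_PROTO. Number 1 is left
 * to the built-in broadcast policy, mirroring the range check in the patch. */
int policy_register(int protocol, struct bus_policy *p)
{
    assert(p != NULL);
    if (p->newport == NULL || p->handlemsg == NULL)
        return -EINVAL;
    if (protocol <= 1 || protocol > MAX_PROTO)
        return -EINVAL;
    if (policy_table[protocol - 1] != NULL)
        return -EEXIST;
    if (p->ioctl == NULL)
        p->ioctl = null_ioctl;
    policy_table[protocol - 1] = p;
    return 0;
}

/* A toy policy: always hand out port 0 and accept every message. */
int demo_newport(int maxports)
{
    return maxports > 0 ? 0 : -1;
}

int demo_handlemsg(int from, size_t len)
{
    (void)from;
    (void)len;
    return 0;
}
```

A policy that only cares about routing fills in newport and handlemsg
and leaves the rest to the defaults, so the bus itself never needs to
know what kind of data it carries.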

We could surely adapt existing features (AF_UNIX, or pipes), but they
offer a well-established interface and semantics, and we think it would
be better to add a new family. This would avoid breaking what already
exists and leave more freedom in defining the new family according to
needs.

As for ptrace vs utrace: ptrace has been designed for debugging; trying
to bend it to fit virtualization is likely to end up in an intricate
interface and implementation. utrace has been designed in a
much more general way. You can implement ptrace over utrace, but you can
use utrace also for virtualization in a cleaner, simpler and more
efficient way. Why not?

Ludovico
--
<[email protected]> #acheronte (irc.freenode.net) ICQ: 64483080
GPG ID: 07F89BB8 Jabber: [email protected] Yahoo: gardenghelle
-- This is signature nr. 3556

2007-12-17 11:00:23

by David Lang

Subject: Re: [PATCH 0/1] IPN: Inter Process Networking

On Mon, 17 Dec 2007, Ludovico Gardenghi wrote:

> On Mon, Dec 17, 2007 at 03:31:48AM -0800, [email protected] wrote:
>
>> wouldn't it be better to just add the ability for multiple writers to send
>> to the same pipe, and then have all of them splice into the output of that
>> pipe? this would give the same data-agnostic communication that you are
>> looking for, and with the minor detail that software would have to filter
>> out messages that they send, would appear to meet all the goals you are
>> looking at, using existing kernel features that are designed to be very
>> high performance.
>
> Being able to define both filtering policies (think of a virtual
> ethernet layer 2 switch, for instance. We have situations where dozens
> or hundreds of virtual cables are connected to the same switch, it would
> be much, much slower if you had to awake all the user processes for each
> single non-broadcast ethernet frame, and send them useless data) and
> delivery guarantees (lossless vs best-effort delivery) are not minor
> details in our opinion.
>
> We might have added a level2 virtual ethernet switch at kernel
> level, but it seemed too specific. With a minor effort we have split the
> "dumb" bus (IPN) and the ability to process specific structured data
> with specific policies (sub-modules as kvde_switch).

it seems like you are mixing your use cases and arguing reasons for one
when answering questions about another.

if you are talking network connections between virtual systems, then the
existing tap interfaces would seem to do everything you are looking for.
you can add them to bridges, route between them, filter traffic between
them (at whatever layer you want with netfilter), use multicast, etc as
you would any real interface.

if, however, you are talking about non-network communications (your
example of sending raw video frames across the interface), and want
multiple processes to receive them, this sounds like exactly the thing
that splice was designed to do, distribute data to multiple recipients
simultaneously and efficiently.

I think you need to separate out these two use cases (and any others you
are advocating this for) and argue each one on its own.

> We surely may adapt existing features (AF_UNIX, or pipes) but they offer
> a quite established interface and semantics and we think it should be
> better to add a new family. This would prevent from breaking what
> already exists and leaving more freedom in defining the new family
> according to needs.

for a new family to be valuable, you need to show what it does that isn't
available in existing families.

> As for ptrace vs utrace: ptrace has been designed for debugging; trying
> to bend it to be fit for virtualization is likely to end up in an
> intricated interface and implementation. utrace has been designed in a
> much more general way. You can implement ptrace over utrace, but you can
> use utrace also for virtualization in a cleaner, simpler and more
> efficient way. Why not?

I'm not familiar enough with ptrace vs utrace to know this argument. but I
haven't heard any of the virtualization people complaining about the
existing interfaces. They seem to have been happily using them for a
number of years.

David Lang

2007-12-17 11:51:06

by Ludovico Gardenghi

Subject: Re: [PATCH 0/1] IPN: Inter Process Networking

On Mon, Dec 17, 2007 at 04:10:19AM -0800, [email protected] wrote:

> if you are talking network connections between virtual systems, then the
> existing tap interfaces would seem to do everything you are looking for. you
> can add them to bridges, route between them, filter traffic between them
> (at whatever layer you want with netfilter), use multicast, etc as you
> would any real interface.
>
> if, however, you are talking about non-network communications (your example
> of sending raw video frames across the interface), and want multiple
> processes to receive them, this sounds like exactly the thing that splice
> was designed to do, distribute data to multiple recipients simultaneously
> and efficiently.

I'll try to explain.

Our first interest was to be able to interconnect virtual, real, and partial
virtual machines. We developed VDE for this, it's a user-level L2
switch. Specific as it may be, it's quite popular as a simple but
flexible tool. It can interconnect UML, Qemu, UMView, slirp, everything that
can be connected to a tap interface, etc.

So, you say, it's a networking issue and we could live with tun/tap.
There's a major point here: at present, dealing with tun/tap, bridges,
routing is quite difficult if you are a *regular* user with *no*
capabilities at all. You have tun/tap persistency and association to a
specific user (or group, recently), at most. That's good - we don't want
regular users to mess with global networking rules and settings.

Think of a bunch of heterogeneous virtual machines, partial virtual
machines (i.e. VMs where only a subset of system calls may be
virtualized or not depending on the parameters - that's the case of
View-OS) that must be interconnected and that may or may not have a
connection to a real network interface (maybe via a tunnel towards a
different machine). There's no need for administrator intervention here.
Why should a user have to ask root to create lots of tap interfaces for
him, bind them in a bridge and set up filtering/routing rules? What
would the list of interfaces become when different users asked for the
same thing at the same time?

You could define a specific interconnecting bus, but we already have
one: ethernet. VDE helps here as it allows regular users to build
distributed ethernet networks.

VDE works fine, but at present often results in a bottleneck because of
the high number of user-processes involved and user-kernel-user switches
needed in order to transfer a single ethernet frame. Moving the core
inside the kernel would limit this problem and result in faster
communication with still no need for root intervention or global
namespace messing. (we're wondering whether something can be done with
containers or similar structures, both for networking and partial
virtualization, but that's another topic).

So we started thinking how to use existing kernel structures, and we
concluded that:

- no existing kernel structures appeared to be optimal for this work;
- if we had to design a new structure, it would be more useful to make
it as general as we could.

At present we're still focused on networking and other applications are
just examples, but we thought that adding a general extensible multipoint
IPC family is better than adding the most specific solution to our
current problem.

Maybe people with experience in other fields may tell us if there are
other problems that can be resolved, or optimized, or simply made
simpler, with IPN. Maybe our proposal is not the best as for interface
and semantics. But we feel that it may fill an "empty space" in the
available IPC mechanisms with a quite simple but powerful approach.

> for a new family to be valuable, you need to show what it does that isn't
> available in existing families.

Is it "more acceptable" to add a new address family or to add features
to existing ones? (my question is purely informative, I don't want to
sound sarcastic or whatever) For instance, someone proposed "let's just
add access control to the netlink family". It seems like tough work.

You proposed splice, others have proposed multicast or netlink. If I have
understood correctly, splice helps in copying data to different
destinations in a very fast way. But it needs a userspace program that
receives data, iterates on fds and splices the data "out", calling a
syscall for each destination. syscall calling may have become very fast
but we still notice slowdowns due to the reasons I've explained before.
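
For reference, the userspace fan-out loop described above can be sketched
roughly as follows. This is only an illustration, not code from VDE or
from the patch: it uses tee(2), the companion of splice(2) that
duplicates pipe data without consuming it, and the function name
fanout_once and its error handling are assumptions made for the example.
The point it shows is that the relay process is woken for every message
and issues one system call per receiver.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate up to len bytes pending in the pipe
 * read end src into every pipe write end in dst[0..ndst-1], then
 * discard those bytes from src.
 * Returns the number of bytes fanned out, 0 on EOF or when ndst is 0,
 * and -1 on error.
 */
ssize_t fanout_once(int src, const int *dst, size_t ndst, size_t len)
{
    char scratch[4096];
    ssize_t n = -1;

    assert(dst != NULL || ndst == 0);

    for (size_t i = 0; i < ndst; i++) {
        /* One system call per receiver: this is the cost being discussed. */
        ssize_t r = tee(src, dst[i], len, 0);
        if (r <= 0)
            return r;               /* error or EOF */
        if (n < 0) {
            n = r;                  /* first receiver fixes the message size */
            len = (size_t)r;
        } else if (r != n) {
            errno = EIO;            /* a receiver got a short copy; give up */
            return -1;
        }
    }
    if (n < 0)
        return 0;                   /* no receivers, nothing to do */

    /*
     * Consume the data from the source pipe. A plain read() keeps the
     * sketch short; it costs one more copy into userspace.
     */
    size_t left = (size_t)n;
    while (left > 0) {
        size_t chunk = left < sizeof scratch ? left : sizeof scratch;
        ssize_t r = read(src, scratch, chunk);
        if (r <= 0)
            return -1;
        left -= (size_t)r;
    }
    return n;
}
```

A relay would call fanout_once() in a loop, so with N receivers every
message costs a wakeup of the relay plus at least N+1 system calls.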

--- (the following is not related to IPN but i wanted to answer this too)

> I'm not familiar enough with ptrace vs utrace to know this argument. but I
> haven't heard any of the virtualization people complaining about the
> existing interfaces. They seem to have been happily using them for a
> number of years.

ptrace has a number of drawbacks that have been partially addressed by
adding flags and parameters for "cheating" and obtaining better
performance. It's *slow*, especially if you want to copy data to/from the
process' memory (you need a system call for each word of memory). It
cannot be used in an efficient way to trace only a subset of system
calls: it's all or none. It has problems with signal management.
User-Mode Linux works because it's a very specific and "simple" case of
virtualization. If you want to do coarse-grained virtualization it may
be ok, but as soon as you want to add fine tuning while keeping
efficiency it becomes hell.
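
To make the "one system call per word" cost concrete, here is a minimal
sketch of how a tracer reads a buffer from a traced process with
PTRACE_PEEKDATA. The helper names peek_bytes and peek_calls_needed are
hypothetical, and the sketch assumes that pid is already attached and
stopped; it is not taken from UMView or any other tool mentioned here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Number of PTRACE_PEEKDATA calls needed to read len bytes. */
size_t peek_calls_needed(size_t len)
{
    return (len + sizeof(long) - 1) / sizeof(long);
}

/*
 * Hypothetical helper: copy len bytes from address addr in the traced
 * process pid into buf, one machine word per ptrace() call.
 * Returns 0 on success, -1 on error (errno set by ptrace).
 */
int peek_bytes(pid_t pid, unsigned long addr, void *buf, size_t len)
{
    unsigned char *out = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        /* PEEKDATA returns the word itself, so -1 is ambiguous:
         * errno must be cleared first and checked afterwards. */
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
        if (word == -1 && errno != 0)
            return -1;

        size_t chunk = len < sizeof word ? len : sizeof word;
        memcpy(out, &word, chunk);   /* keep only the bytes requested */
        out += chunk;
        addr += chunk;
        len -= chunk;
    }
    return 0;
}
```

With 8-byte words, reading a single 4096-byte buffer this way takes 512
system calls, which is the overhead referred to above.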

We're developing tools intended to let a *regular* user (no root
intervention, again) "build" a personal view of the system resources
(network, filesystem, etc) starting from what "he can do/see" as that
user and adding, removing, changing things. We'd like to let a user
live in a "potentially virtual" world exactly identical to the "real"
one. When he wants to change something he can. Mounting remote
filesystems, creating new virtual network interfaces, editing global
configuration files. No security issues here, there are no privileged
processes running. Nothing gets really mounted. The virtualizing layer
takes care of "building" the view around the user's processes.

We've done this with ptrace. It works but it surely cannot be used as an
"everyday" shell around every user process. We've done this with utrace.
The syscall-capturing functions have shrunk by an order of magnitude and
everything is more efficient, functional and cleaner.

Ludovico
--
<[email protected]> #acheronte (irc.freenode.net) ICQ: 64483080
GPG ID: 07F89BB8 Jabber: [email protected] Yahoo: gardenghelle
-- This is signature nr. 3558

2007-12-17 16:27:58

by Paul E. McKenney

Subject: Re: [PATCH 1/1] IPN: Inter Process Networking

On Mon, Dec 17, 2007 at 10:27:47AM +0100, Renzo Davoli wrote:
> Inter Process Networking Patch.
>
> It applies to 2.6.24-rc5, include documentation, the new kernel option
> (experimental), kernel include file include/net/af_ipn.h and the
> protocol directory net/ipn.

Some RCU-related questions interspersed below. Summary:

o It is not clear to me that the updates (rcu_assign_pointer())
are consistently locked.

o I don't see any sign of RCU read-side primitives.

That said, I cannot claim much expertise on this area of the kernel,
so am very likely missing something.

Thanx, Paul

> renzo
>
> Signed-off-by: Renzo Davoli <[email protected]>
>
> diff -Naur linux-2.6.24-rc5/Documentation/networking/ipn.txt linux-2.6.24-rc5-ipn/Documentation/networking/ipn.txt
> --- linux-2.6.24-rc5/Documentation/networking/ipn.txt 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/Documentation/networking/ipn.txt 2007-12-16 16:30:01.000000000 +0100
> @@ -0,0 +1,326 @@
> +Inter Process Networking (IPN)
> +
> +IPN is an Inter Process Communication service. It uses the same programming
> +interface and protocols used for networking. Processes using IPN are connected
> +to a "network" (many to many communication). The messages or packets sent by a
> +process on an IPN network can be delivered to many other processes connected to
> +the same IPN network, potentially to all the other processes. Different
> +protocols can be defined on the IPN service. The basic one is the broadcast
> +(level 1) protocol: all the packets get received by all the processes but the
> +sender. It is also possible to define more sophisticated protocols. For example
> +it is possible to have IPN sockets dispatching packets using the Ethernet
> +protocol (like a Virtual Distributed Ethernet - VDE switch), or Internet
> +Protocol (like a layer 3 switch). These are just examples, several other
> +policies can be defined.
> +
> +Description:
> +------------
> +
> +The Berkeley socket Application Programming Interface (API) was designed for
> +client server applications and for point-to-point communications. There is no
> +support for broadcasting/multicasting domains.
> +
> +IPN updates the interface by introducing a new protocol family (PF_IPN or
> +AF_IPN). PF_IPN is similar to PF_UNIX but for IPN the Socket API calls have a
> +different (extended) behavior.
> +
> + #include <sys/socket.h>
> + #include <sys/un.h>
> + #include <sys/ipn.h>
> +
> + sockfd = socket(AF_IPN, int socket_type, int protocol);
> +
> +creates a communication socket. The only socket_type defined is SOCK_RAW, other
> +socket_types can be used for future extensions. A socket cannot be used to send
> +or receive data until it gets connected (using the "connect" call). The
> +protocol argument defines the policy used by the socket. Protocol IPN_BROADCAST
> +(1) is the basic policy: a packet is sent to all the recipients but the sender
> +itself. The policy IPN_ANY (0) can be used to connect or bind a pre-existing
> +IPN network regardless of the policy used. (2 will be IPN_VDESWITCH and 3
> +IPN_VDESWITCH_L3).
> +
> +The address format is the same as PF_UNIX (a.k.a PF_LOCAL), see unix(7) manual.
> +
> + int bind(int sockfd, const struct sockaddr *my_addr, socklen_t addrlen);
> +
> +This call creates an IPN network if it does not exist, or joins an existing
> +network (just for management) if it already exists. The policy of the network
> +must be consistent with the protocol argument of the "socket" call. A new
> +network has the policy defined for the socket. "bind" or "connect" operations
> +on existing networks fail if the policy of the socket is neither IPN_ANY nor
> +the same as that of the network. (A network should not be created by an IPN_ANY socket).
> +An IPN network appears in the file system as a unix socket. The execution
> +permission (x) on this file is required for "bind" to succeed (otherwise -EPERM
> +is returned). Similarly the read/write permissions (rw) permit the "connect"
> +operation for reading (receiving) or writing (sending) packets respectively.
> +When a socket is bound (but not connected) to a IPN network the process does
> +not receive or send any data but it can call "ioctl" or "setsockopt" to
> +configure the network.
> +
> + int connect(int sockfd, const struct sockaddr *serv_addr, socklen_t addrlen);
> +
> +This call connects a socket to an existing IPN network. The socket can be
> +already bound (through the "bind" call) or unbound. Unbound connected sockets
> +receive and send data but they cannot configure the network. The read or write
> +permission on the socket (rw) is required to "connect" the channel and
> +read/write respectively. When "connect" succeeds and provided the socket has
> +appropriate permissions, the process can send packets and receive all the
> +packets sent by other processes and delivered to it by the network policy. The
> +socket can receive data at any time (like a network interface) so the process
> +must be able to handle incoming data (using select/poll or multithreading).
> +Obviously higher level protocols can also prevent the reception of unexpected
> +messages by design. It is the case of networks used with exactly one
> +sender, all the other processes can simply receive the data and the sender will
> +never receive any packet. It is also possible to have sockets with different
> +roles assigning reading permission to some and writing permissions to others.
> +If data overrun occurs there can be data loss or the sender can be blocked
> +depending on the policy of the socket (LOSSY or LOSSLESS, see below). Bind must
> +be called before connect. The correct sequences are: socket+bind (just for
> +management), socket+bind+connect (management and communication), and
> +socket+connect (communication without management).
> +
> +The calls "accept" and "listen" are not defined for AF_IPN, as there is not any
> +server. All the communication takes place among peers.
> +
> +Data can be sent and received using read, write, send, recv, sendto, recvfrom, sendmsg, recvmsg.
> +
> +Socket options and flags.
> +-------------------------
> +
> +These options can be set by getsockopt and setsockopt.
> +
> +There are two different kinds of options: network options and node options. The
> +former define the structure of the network and must be set prior to bind. It
> +is not currently possible to change this flag of an existing network. When a
> +socket is bound and/or connected to an existing network getsockopt gives the
> +current value of the options. Node options define parameters of the node. These
> +must be set prior to connect.
> +
> +***Network Options (These options can be set prior to bind/connect)
> +
> +IPN_SO_FLAGS: This tag permits to set/get the network flags.
> +
> +IPN_FLAG_LOSSLESS: this flag defines the behavior in case of network
> +overloading or data overrun, i.e. when some processes are too slow in consuming
> +the packets for the network buffers. When the network is LOSSY (the flag is
> +cleared) packets get dropped in case of buffer overflow. A LOSSLESS (flag set)
> +IPN network blocks the sender if the buffer is full. LOSSY is the default
> +behavior.
> +
> +IPN_SO_NUMNODES: max number of connected sockets (default value 32)
> +
> +IPN_SO_MTU: maximum transfer unit: maximum size of packets (default value 1514,
> +Ethernet frame, including VLAN).
> +
> +IPN_SO_MSGPOOLSIZE: size of the buffer (#of pending packets, default value 8).
> +This option has two different meanings depending on the LOSSY/LOSSLESS behavior
> +of the network. For LOSSY networks, this is the maximum number of pending
> +packets of each node. For LOSSLESS network this is the global number of the
> +pending packets in the network. When the same packet is sent to many
> +destinations it is counted just once.
> +
> +IPN_SO_MODE: this option specifies the permission to use when the socket gets
> +created on the file system. It is modified by the process' umask in the usual
> +way. The created socket permissions are (mode & ~umask).
> +
> +***Network Options (Options for bound/connected sockets)
> +
> +IPN_SO_CHANGE_NUMNODES: (runtime) change of the number of ipn network ports.
> +
> +***Node Options
> +
> +IPN_SO_PORT: (default value IPN_PORTNO_ANY) This option specifies the port number
> +where the socket must be connected. When IPN_PORTNO_ANY the port number is
> +decided by the service. There can be network services where different ports
> +have different definitions (e.g. different VLANs for ports of virtual Ethernet
> +switches).
> +
> +IPN_SO_DESCR: This is the description of the node. It is a string, having
> +maxlength IPN_DESCRLEN. It is just used by debugging tools.
> +
> +IPN_SO_HANDLE_OOB: The node is able to manage Out Of Band protocol messages
> +
> +IPN_SO_WANT_OOB_NUMNODES: The socket wants OOB messages to notify the change
> +of #writers #readers (requires IPN_SO_HANDLE_OOB)
> +
> +TAP and GRAB nodes for IPN networks
> +-----------------------------------
> +
> +It is possible to connect IPN sockets to virtual and real network interfaces
> +using specific ioctl and provided the user has the permission to configure the
> +network (e.g. the CAP_NET_ADMIN Posix capability). A virtual interface
> +connected to an IPN network is similar to a tap interface (provided by the
> +tuntap module). A tap interface appears as an ethernet interface to the hosting
> +operating system, all the packets sent and received through the tap interface
> +get received and sent by the application which created the tap interface. IPN
> +virtual network interface appears in the same way but the packets are received
> +and sent through the IPN network and delivered consistently with the policy
> +(BROADCAST acts as a basic HUB for the connected processes). It is also
> +possible to *grab* a real interface. In this case the closest example is the
> +Linux kernel ethernet bridge. When a real interface is connected to a IPN all
> +the packets received from the real network are injected also into the IPN and
> +all the packets sent by the IPN through the real network 'port' get sent on the
> +real network.
> +
> +ioctl is used for creation or control of TAP or GRAB interfaces.
> +
> + int ioctl(int d, int request, .../* arg */);
> +
> +A list of the request values currently supported follows.
> +
> +IPN_CONN_NETDEV: (struct ifreq *arg). This call creates a TAP interface or
> +implements a GRAB on an existing interface and connects it to a bound IPN
> +socket. The field ifr_flags can be IPN_NODEFLAG_TAP for a TAP interface,
> +IPN_NODEFLAG_GRAB to grab an existing interface. The field ifr_name is the
> +desired name for the new TAP interface or is the name of the interface to grab
> +(e.g. eth0). For TAP interfaces, ifr_name can be an empty string. The interface
> +in this latter case is named ipn followed by a number (e.g. ipn0, ipn1, ...).
> +This ioctl must be used on a bound but unconnected socket. When the call
> +succeeds, the socket gets the connected status, but the packets are sent and
> +received through the interface. Persistence applies only to interface nodes (TAP
> +or GRAB).
> +
> +IPN_SETPERSIST (int arg). If (arg != 0) it gives the interface the persistent
> +status: the network interface survives and stays connected to the IPN network
> +when the socket is closed. When (arg == 0) the standard behavior is resumed:
> +the interface is deleted or the grabbing is terminated when the socket is
> +closed.
> +
> +IPN_JOIN_NETDEV: (struct ifreq *arg). This call reconnects a socket to an
> +existing persistent node. The interface can be defined either by name
> +(ifr_name) or by index (ifr_index). If there is already a socket controlling
> +the interface this call fails (EADDRNOTAVAIL).
> +
> +There are also some ioctl that can be used by a sysadm to give/clear
> +persistence on existing IPN interfaces. These calls apply to unbound sockets.
> +
> +IPN_SETPERSIST_NETDEV: (struct ifreq *arg). This call sets the persistence
> +status of an IPN interface. The interface can be defined either by name
> +(ifr_name) or by index (ifr_index).
> +
> +IPN_CLRPERSIST_NETDEV: (struct ifreq *arg). This call clears the persistence
> +status of an IPN interface. The interface is specified as in the opposite call
> +above. The interface is deleted (TAP) or the grabbing is terminated when the
> +socket is closed, or immediately if the interface is not controlled by a
> +socket. If the IPN network had the interface as its sole node, the IPN network
> +is terminated, too.
> +
> +When unloading the ipn kernel module, all the persistent flags of interfaces
> +are cleared.
> +
> +Related Work.
> +-------------
> +
> +IPN is able to give a unifying solution to several problems and creates new
> +opportunities for applications.
> +
> +Several existing tools can be implemented using IPN sockets:
> +
> + * VDE. Level 2 service implements a VDE switch in the kernel, providing a
> + considerable speedup.
> + * Tap (tuntap) networking for virtual machines
> + * Kernel ethernet bridge
> + * All the applications which need multicasting of data streams, like tee
> +
> +A continuous stream of data (like audio/video/midi etc) can be sent on an IPN
> +network and several applications can receive the broadcast just by joining the
> +channel.
> +
> +It is possible to write programs that forward packets between different IPN
> +networks running on the same or on different systems extending the IPN in the
> +same way as cables extend ethernet networks connecting switches or hubs
> +together. (VDE cables are examples of such a kind of programs).
> +
> +IPN interface to protocol modules
> +---------------------------------
> +
> +struct ipn_protocol {
> + int refcnt;
> + int (*ipn_p_newport)(struct ipn_node *newport);
> + int (*ipn_p_handlemsg)(struct ipn_node *from,struct msgpool_item *msgitem, int depth);
> + void (*ipn_p_delport)(struct ipn_node *oldport);
> + void (*ipn_p_postnewport)(struct ipn_node *newport);
> + void (*ipn_p_predelport)(struct ipn_node *oldport);
> + int (*ipn_p_newnet)(struct ipn_network *newnet);
> + int (*ipn_p_resizenet)(struct ipn_network *net,int oldsize,int newsize);
> + void (*ipn_p_delnet)(struct ipn_network *oldnet);
> + int (*ipn_p_setsockopt)(struct ipn_node *port,int optname,
> + char __user *optval, int optlen);
> + int (*ipn_p_getsockopt)(struct ipn_node *port,int optname,
> + char __user *optval, int *optlen);
> + int (*ipn_p_ioctl)(struct ipn_node *port,unsigned int request,
> + unsigned long arg);
> +};
> +
> +int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service);
> +int ipn_proto_deregister(int protocol);
> +
> +void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg, int depth);
> +
> +
> +A protocol (sub) module must define its own ipn_protocol structure (maybe a
> +global static variable).
> +
> +ipn_proto_register must be called in the module init to register the protocol
> +to the IPN core module. ipn_proto_deregister must be called in the destructor
> +of the module. It fails if there are already running networks based on this
> +protocol.
> +
> +Only two fields must be initialized in any case: ipn_p_newport and
> +ipn_p_handlemsg.
> +
> +ipn_p_newport is the new network node notification. The return value is the
> +port number of the new node. This call can be used to allocate and set private
> +data used by the protocol (the field proto_private of the struct ipn_node has
> +been defined for this purpose).
> +
> +ipn_p_handlemsg is the notification of a message that must be dispatched. This
> +function should call ipn_proto_sendmsg for each recipient. It is possible for
> +the protocol to change the message (provided the global length of the packet
> +does not exceed the MTU of the network). Depth is for loop control. Two IPN can
> +be interconnected by kernel cables (not implemented yet). Cycles of cables
> +would generate infinite loops of packets. After a pre-defined number of hops
> +the packet gets dropped (it is like EMLINK for symbolic links). Depth value
> +must be copied to all ipn_proto_sendmsg calls. Usually the handlemsg function
> +has the following structure:
> +
> +static int ipn_xxxxx_handlemsg(struct ipn_node *from, struct msgpool_item *msgitem, int depth)
> +{
> + /* compute the set of recipients */
> + for (/*each recipient "to"*/)
> + ipn_proto_sendmsg(to,msgitem,depth);
> +}
> +
> +It is also possible to send different packets to different recipients.
> +
> +struct msgpool_item *newitem=ipn_msgpool_alloc(from->ipn);
> +/* create a new contents for the packet by filling in newitem->len and newitem->data */
> +ipn_proto_sendmsg(recipient1,newitem,depth);
> +ipn_proto_sendmsg(recipient2,newitem,depth);
> +....
> +ipn_msgpool_put(newitem);
> +
> +(please remember to call ipn_msgpool_put after the sendmsg of packets allocated
> +by the protocol submodule).
> +
> +ipn_p_delport is used to deallocate port related data structures.
> +
> +ipn_p_postnewport and ipn_p_predelport are used to notify new nodes or deleted
> +nodes. newport and delport get called before activating the port and after
> +deactivating it respectively, therefore it is not possible to use the new port
> +or deleted port to signal the change on the net itself. ipn_p_postnewport and
> +ipn_p_predelport get called just after the activation and just before the
> +deactivation thus the protocols can already send packets on the network.
> +
> +ipn_p_newnet and ipn_p_delnet notify the creation/deletion of a IPN network
> +using the given protocol.
> +
> +ipn_p_resizenet notifies a number of ports change
> +
> +ipn_p_setsockopt and ipn_p_getsockopt can be used to provide specific socket
> +options.
> +
> +ipn_p_ioctl protocols can implement also specific ioctl services.
> +
> +Further documentation and examples can be found in the Virtual Square project
> +web site: wiki.virtualsquare.org
> diff -Naur linux-2.6.24-rc5/MAINTAINERS linux-2.6.24-rc5-ipn/MAINTAINERS
> --- linux-2.6.24-rc5/MAINTAINERS 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/MAINTAINERS 2007-12-16 16:30:01.000000000 +0100
> @@ -2094,6 +2094,15 @@
> W: http://openipmi.sourceforge.net/
> S: Supported
>
> +IPN INTER PROCESS NETWORKING
> +P: Renzo Davoli
> +M: [email protected]
> +P: Ludovico Gardenghi
> +M: [email protected]
> +L: [email protected]
> +W: http://wiki.virtualsquare.org
> +S: Maintained
> +
> IPX NETWORK LAYER
> P: Arnaldo Carvalho de Melo
> M: [email protected]
> diff -Naur linux-2.6.24-rc5/include/linux/net.h linux-2.6.24-rc5-ipn/include/linux/net.h
> --- linux-2.6.24-rc5/include/linux/net.h 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/include/linux/net.h 2007-12-16 16:30:03.000000000 +0100
> @@ -25,7 +25,7 @@
> struct inode;
> struct net;
>
> -#define NPROTO 34 /* should be enough for now.. */
> +#define NPROTO 35 /* should be enough for now.. */
>
> #define SYS_SOCKET 1 /* sys_socket(2) */
> #define SYS_BIND 2 /* sys_bind(2) */
> diff -Naur linux-2.6.24-rc5/include/linux/netdevice.h linux-2.6.24-rc5-ipn/include/linux/netdevice.h
> --- linux-2.6.24-rc5/include/linux/netdevice.h 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/include/linux/netdevice.h 2007-12-16 16:30:03.000000000 +0100
> @@ -705,6 +705,8 @@
> struct net_bridge_port *br_port;
> /* macvlan */
> struct macvlan_port *macvlan_port;
> + /* ipn */
> + struct ipn_node *ipn_port;
>
> /* class/net/name entry */
> struct device dev;
> diff -Naur linux-2.6.24-rc5/include/linux/socket.h linux-2.6.24-rc5-ipn/include/linux/socket.h
> --- linux-2.6.24-rc5/include/linux/socket.h 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/include/linux/socket.h 2007-12-16 16:30:03.000000000 +0100
> @@ -189,7 +189,8 @@
> #define AF_BLUETOOTH 31 /* Bluetooth sockets */
> #define AF_IUCV 32 /* IUCV sockets */
> #define AF_RXRPC 33 /* RxRPC sockets */
> -#define AF_MAX 34 /* For now.. */
> +#define AF_IPN 34 /* IPN sockets */
> +#define AF_MAX 35 /* For now.. */
>
> /* Protocol families, same as address families. */
> #define PF_UNSPEC AF_UNSPEC
> @@ -224,6 +225,7 @@
> #define PF_BLUETOOTH AF_BLUETOOTH
> #define PF_IUCV AF_IUCV
> #define PF_RXRPC AF_RXRPC
> +#define PF_IPN AF_IPN
> #define PF_MAX AF_MAX
>
> /* Maximum queue length specifiable by listen. */
> diff -Naur linux-2.6.24-rc5/include/net/af_ipn.h linux-2.6.24-rc5-ipn/include/net/af_ipn.h
> --- linux-2.6.24-rc5/include/net/af_ipn.h 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/include/net/af_ipn.h 2007-12-16 16:30:03.000000000 +0100
> @@ -0,0 +1,233 @@
> +#ifndef __LINUX_NET_AFIPN_H
> +#define __LINUX_NET_AFIPN_H
> +
> +#define IPN_ANY 0
> +#define IPN_BROADCAST 1
> +#define IPN_HUB 1
> +#define IPN_VDESWITCH 2
> +#define IPN_VDESWITCH_L3 3
> +
> +#define IPN_SO_PREBIND 0x80
> +#define IPN_SO_PORT 0
> +#define IPN_SO_DESCR 1
> +#define IPN_SO_CHANGE_NUMNODES 2
> +#define IPN_SO_HANDLE_OOB 3
> +#define IPN_SO_WANT_OOB_NUMNODES 4
> +#define IPN_SO_MTU (IPN_SO_PREBIND | 0)
> +#define IPN_SO_NUMNODES (IPN_SO_PREBIND | 1)
> +#define IPN_SO_MSGPOOLSIZE (IPN_SO_PREBIND | 2)
> +#define IPN_SO_FLAGS (IPN_SO_PREBIND | 3)
> +#define IPN_SO_MODE (IPN_SO_PREBIND | 4)
> +
> +#define IPN_PORTNO_ANY -1
> +
> +#define IPN_DESCRLEN 128
> +
> +#define IPN_FLAG_LOSSLESS 1
> +#define IPN_FLAG_TERMINATED 0x1000
> +
> +/* Ioctl defines */
> +#define IPN_SETPERSIST_NETDEV _IOW('I', 200, int)
> +#define IPN_CLRPERSIST_NETDEV _IOW('I', 201, int)
> +#define IPN_CONN_NETDEV _IOW('I', 202, int)
> +#define IPN_JOIN_NETDEV _IOW('I', 203, int)
> +#define IPN_SETPERSIST _IOW('I', 204, int)
> +
> +#define IPN_OOB_NUMNODE_TAG 0
> +
> +/* OOB message for change of numnodes
> + * Common fields for oob IPN signaling:
> + * @level=level of the service who generated the oob
> + * @tag=tag of the message
> + * Specific fields:
> + * @numreaders=number of readers
> + * @numwriters=number of writers
> + * */
> +struct numnode_oob {
> + int level;
> + int tag;
> + int numreaders;
> + int numwriters;
> +};
> +
> +#ifdef __KERNEL__
> +#include <linux/socket.h>
> +#include <linux/mutex.h>
> +#include <linux/un.h>
> +#include <net/sock.h>
> +#include <linux/netdevice.h>
> +
> +#define IPN_HASH_SIZE 256
> +
> +/* The AF_IPN socket */
> +struct msgpool_item;
> +struct ipn_network;
> +struct pre_bind_parms;
> +
> +/*
> + * ipn_node
> + *
> + * @nodelist=pointers for connectqueue or unconnectqueue (see network)
> + * @protocol=kind of service 0->standard broadcast
> + * @flags= see IPN_NODEFLAG_xxx
> + * @shutdown= SEND_SHUTDOWN/RCV_SHUTDOWN and OOBRCV_SHUTDOWN
> + * @descr=description of this port
> + * @portno=when connected: port of the network (<0 means unconnected)
> + * @msglock=mutex on the msg queue
> + * @totmsgcount=total # of pending msgs
> + * @oobmsgcount=# of pending oob msgs
> + * @msgqueue=queue of messages
> + * @oobmsgqueue=queue of messages
> + * @read_wait=waitqueue for reading
> + * @net=current network
> + * @dev=device (TAP or GRAB)
> + * @ipn=network we are connected to
> + * @pbp=temporary storage for parms that must be set prior to bind
> + * @proto_private=handle for protocol private data
> + */
> +struct ipn_node {
> + struct list_head nodelist;
> + int protocol;
> + volatile unsigned char flags;
> + unsigned char shutdown;
> + char descr[IPN_DESCRLEN];
> + int portno;
> + spinlock_t msglock;
> + unsigned short totmsgcount;
> + unsigned short oobmsgcount;
> + struct list_head msgqueue;
> + struct list_head oobmsgqueue;
> + wait_queue_head_t read_wait;
> + struct net *net;
> + struct net_device *dev;
> + struct ipn_network *ipn;
> + struct pre_bind_parms *pbp;
> + void *proto_private;
> +};
> +#define IPN_NODEFLAG_BOUND 0x1 /* bind succeeded */
> +#define IPN_NODEFLAG_INUSE 0x2 /* is currently "used" (0 for persistent, unbound interfaces) */
> +#define IPN_NODEFLAG_PERSIST 0x4 /* if persist does not disappear on close (net interfaces) */
> +#define IPN_NODEFLAG_TAP 0x10 /* This is a tap interface */
> +#define IPN_NODEFLAG_GRAB 0x20 /* This is a grab of a real interface */
> +#define IPN_NODEFLAG_DEVMASK 0x30 /* True if this is a device */
> +#define IPN_NODEFLAG_OOB_NUMNODES 0x40 /* Node wants OOB for NNODES */
> +
> +/*
> + * ipn_sock
> + *
> + * unfortunately we must use a struct sock (most of the fields are useless) as
> + * this is the standard "agnostic" structure for socket implementation.
> + * This proves that it is not "agnostic" enough!
> + */
> +
> +struct ipn_sock {
> + struct sock sk;
> + struct ipn_node *node;
> +};
> +
> +/*
> + * ipn_network network descriptor
> + *
> + * @hnode=hash to find this entry (looking for i-node)
> + * @unconnectqueue=queue of unconnected (bound) nodes
> + * @connectqueue=queue of connected nodes (faster for broadcasting)
> + * @refcnt=reference count (bound or connected sockets)
> + * @dentry/@mnt=to keep the file system descriptor into memory
> + * @ipnn_mutex=lock for protocol functions
> + * @protocol=kind of service
> + * @flags=flags (IPN_FLAG_LOSSLESS)
> + * @maxports=number of ports available in this network
> + * @msgpool_nelem=number of pending messages
> + * @msgpool_size=max number of pending messages *per net* when IPN_FLAG_LOSSLESS
> + * @msgpool_size=max number of pending messages *per port* when LOSSY
> + * @mtu=MTU
> + * @send_wait=wait queue waiting for a message in the msgpool (IPN_FLAG_LOSSLESS)
> + * @msgpool_cache=slab for msgpool (unused yet)
> + * @proto_private=handle for protocol private data
> + * @connport=array of connected sockets
> + */
> +struct ipn_network {
> + struct hlist_node hnode;
> + struct list_head unconnectqueue;
> + struct list_head connectqueue;
> + atomic_t refcnt;
> + struct dentry *dentry;
> + struct vfsmount *mnt;
> + struct semaphore ipnn_mutex;
> + int sunaddr_len;
> + struct sockaddr_un sunaddr;
> + unsigned int protocol;
> + unsigned int flags;
> + int numreaders;
> + int numwriters;
> + atomic_t msgpool_nelem;
> + unsigned short maxports;
> + unsigned short msgpool_size;
> + unsigned short mtu;
> + wait_queue_head_t send_wait;
> + struct kmem_cache *msgpool_cache;
> + void *proto_private;
> + struct ipn_node **connport;
> +};
> +
> +/* struct msgpool_item
> + * the local copy of the message for dispatching
> + * @count refcount
> + * @len packet len
> + * @data payload
> + */
> +struct msgpool_item {
> + atomic_t count;
> + int len;
> + unsigned char data[0];
> +};
> +
> +struct msgpool_item *ipn_msgpool_alloc(struct ipn_network *ipnn);
> +void ipn_msgpool_put(struct msgpool_item *old, struct ipn_network *ipnn);
> +
> +/*
> + * protocol service:
> + *
> + * @refcnt: number of networks using this protocol
> + * @newport=upcall for reporting a new port. returns the portno, -1=error
> + * @handlemsg=dispatch a message.
> + * should call ipn_proto_sendmsg for each destination
> + * can allocate other msgitems using ipn_msgpool_alloc to send
> + * different messages to different destinations;
> + * @delport=(may be null) reports the termination of a port
> + * @postnewport,@predelport: similar to newport/delport but during these calls
> + * the node is (still) connected. Useful when protocols need
> + * welcome and goodbye messages.
> + * @ipn_p_setsockopt
> + * @ipn_p_getsockopt
> + * @ipn_p_ioctl=(may be null) upcall to manage specific options or ctls.
> + */
> +struct ipn_protocol {
> + int refcnt;
> + int (*ipn_p_newport)(struct ipn_node *newport);
> + int (*ipn_p_handlemsg)(struct ipn_node *from,struct msgpool_item *msgitem);
> + void (*ipn_p_delport)(struct ipn_node *oldport);
> + void (*ipn_p_postnewport)(struct ipn_node *newport);
> + void (*ipn_p_predelport)(struct ipn_node *oldport);
> + int (*ipn_p_newnet)(struct ipn_network *newnet);
> + int (*ipn_p_resizenet)(struct ipn_network *net,int oldsize,int newsize);
> + void (*ipn_p_delnet)(struct ipn_network *oldnet);
> + int (*ipn_p_setsockopt)(struct ipn_node *port,int optname,
> + char __user *optval, int optlen);
> + int (*ipn_p_getsockopt)(struct ipn_node *port,int optname,
> + char __user *optval, int *optlen);
> + int (*ipn_p_ioctl)(struct ipn_node *port,unsigned int request,
> + unsigned long arg);
> +};
> +
> +int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service);
> +int ipn_proto_deregister(int protocol);
> +
> +int ipn_proto_injectmsg(struct ipn_node *from, struct msgpool_item *msg);
> +void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg);
> +void ipn_proto_oobsendmsg(struct ipn_node *to, struct msgpool_item *msg);
> +
> +extern struct sk_buff *(*ipn_handle_frame_hook)(struct ipn_node *p,
> + struct sk_buff *skb);
> +#endif
> +#endif
> diff -Naur linux-2.6.24-rc5/net/Kconfig linux-2.6.24-rc5-ipn/net/Kconfig
> --- linux-2.6.24-rc5/net/Kconfig 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/Kconfig 2007-12-16 16:30:04.000000000 +0100
> @@ -37,6 +37,7 @@
>
> source "net/packet/Kconfig"
> source "net/unix/Kconfig"
> +source "net/ipn/Kconfig"
> source "net/xfrm/Kconfig"
> source "net/iucv/Kconfig"
>
> diff -Naur linux-2.6.24-rc5/net/Makefile linux-2.6.24-rc5-ipn/net/Makefile
> --- linux-2.6.24-rc5/net/Makefile 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/Makefile 2007-12-16 16:30:04.000000000 +0100
> @@ -19,6 +19,7 @@
> obj-$(CONFIG_INET) += ipv4/
> obj-$(CONFIG_XFRM) += xfrm/
> obj-$(CONFIG_UNIX) += unix/
> +obj-$(CONFIG_IPN) += ipn/
> ifneq ($(CONFIG_IPV6),)
> obj-y += ipv6/
> endif
> diff -Naur linux-2.6.24-rc5/net/core/dev.c linux-2.6.24-rc5-ipn/net/core/dev.c
> --- linux-2.6.24-rc5/net/core/dev.c 2007-12-11 04:48:43.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/core/dev.c 2007-12-16 16:30:04.000000000 +0100
> @@ -1925,7 +1925,7 @@
> int *ret,
> struct net_device *orig_dev)
> {
> - if (skb->dev->macvlan_port == NULL)
> + if (!skb || skb->dev->macvlan_port == NULL)
> return skb;
>
> if (*pt_prev) {
> @@ -1938,6 +1938,32 @@
> #define handle_macvlan(skb, pt_prev, ret, orig_dev) (skb)
> #endif
>
> +#if defined(CONFIG_IPN) || defined(CONFIG_IPN_MODULE)
> +struct sk_buff *(*ipn_handle_frame_hook)(struct ipn_node *port,
> + struct sk_buff *skb) __read_mostly;
> +EXPORT_SYMBOL_GPL(ipn_handle_frame_hook);
> +
> +static inline struct sk_buff *handle_ipn(struct sk_buff *skb,
> + struct packet_type **pt_prev,
> + int *ret,
> + struct net_device *orig_dev)
> +{
> + struct ipn_node *port;
> +
> + if (!skb || skb->pkt_type == PACKET_LOOPBACK ||
> + (port = rcu_dereference(skb->dev->ipn_port)) == NULL)

Is this protected either by rcu_read_lock() or the update-side lock
(ipnn_mutex?)? One or the other is required.
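
For reference, the read-side half of that requirement can be sketched in
userspace C. This is only a single-threaded mock-up under stated
assumptions: every mock_* name below is hypothetical, and the kernel
counterparts would be rcu_read_lock(), rcu_dereference() and
rcu_read_unlock(). An acquire load stands in for the dependency ordering
that rcu_dereference() provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct mock_port { int portno; };
struct mock_dev { _Atomic(struct mock_port *) ipn_port; };

static int mock_read_depth;	/* read-side nesting depth */

static void mock_rcu_read_lock(void)   { mock_read_depth++; }
static void mock_rcu_read_unlock(void) { mock_read_depth--; }

/* Stand-in for rcu_dereference(): only legal inside a read-side section. */
static struct mock_port *mock_rcu_dereference(struct mock_dev *dev)
{
	assert(mock_read_depth > 0);
	return atomic_load_explicit(&dev->ipn_port, memory_order_acquire);
}

/* Returns the port number seen by the reader, or -1 if no port is attached. */
static int mock_handle_ipn(struct mock_dev *dev)
{
	struct mock_port *port;
	int portno = -1;

	mock_rcu_read_lock();
	port = mock_rcu_dereference(dev);
	if (port)
		portno = port->portno;	/* use the pointer only inside the section */
	mock_rcu_read_unlock();
	return portno;
}
```

The assert in mock_rcu_dereference() plays the role of a debugging check:
a dereference outside the read-side section aborts instead of silently
racing with an updater.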

> + return skb;
> +
> + if (*pt_prev) {
> + *ret = deliver_skb(skb, *pt_prev, orig_dev);
> + *pt_prev = NULL;
> + }
> + return ipn_handle_frame_hook(port, skb);
> +}
> +#else
> +#define handle_ipn(skb, pt_prev, ret, orig_dev) (skb)
> +#endif
> +
> #ifdef CONFIG_NET_CLS_ACT
> /* TODO: Maybe we should just force sch_ingress to be compiled in
> * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
> @@ -2070,9 +2096,8 @@
> #endif
>
> skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
> - if (!skb)
> - goto out;
> skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
> + skb = handle_ipn(skb, &pt_prev, &ret, orig_dev);

Same here -- is this protected either by rcu_read_lock() or by the
update-side mutex?

> if (!skb)
> goto out;
>
> diff -Naur linux-2.6.24-rc5/net/ipn/Kconfig linux-2.6.24-rc5-ipn/net/ipn/Kconfig
> --- linux-2.6.24-rc5/net/ipn/Kconfig 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/ipn/Kconfig 2007-12-16 16:30:04.000000000 +0100
> @@ -0,0 +1,21 @@
> +#
> +# IPN Domain Sockets
> +#
> +
> +config IPN
> + tristate "IPN domain sockets (EXPERIMENTAL)"
> + depends on EXPERIMENTAL
> + ---help---
> + If you say Y here, you will include support for IPN domain sockets.
> + Inter Process Networking sockets are similar to Unix sockets but
> + they support peer-to-peer, one-to-many and many-to-many communication
> + among processes.
> + Sub-Modules can be loaded to provide dispatching protocols.
> + This service includes the IPN_BROADCAST policy: all the messages get
> + sent to all the recipients (except the sender itself).
> +
> + To compile this driver as a module, choose M here: the module will be
> + called ipn.
> +
> + If unsure, say 'N'.
> +
> diff -Naur linux-2.6.24-rc5/net/ipn/Makefile linux-2.6.24-rc5-ipn/net/ipn/Makefile
> --- linux-2.6.24-rc5/net/ipn/Makefile 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/ipn/Makefile 2007-12-16 16:30:04.000000000 +0100
> @@ -0,0 +1,8 @@
> +#
> +## Makefile for the IPN (Inter Process Networking) domain socket layer.
> +#
> +
> +obj-$(CONFIG_IPN) += ipn.o
> +
> +ipn-y := af_ipn.o ipn_netdev.o
> +
> diff -Naur linux-2.6.24-rc5/net/ipn/af_ipn.c linux-2.6.24-rc5-ipn/net/ipn/af_ipn.c
> --- linux-2.6.24-rc5/net/ipn/af_ipn.c 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/ipn/af_ipn.c 2007-12-16 18:53:13.000000000 +0100
> @@ -0,0 +1,1540 @@
> +/*
> + * Main inter process networking (virtual distributed ethernet) module
> + * (part of the View-OS project: wiki.virtualsquare.org)
> + *
> + * Copyright (C) 2007 Renzo Davoli ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * Due to this file being licensed under the GPL there is controversy over
> + * whether this permits you to write a module that #includes this file
> + * without placing your module under the GPL. Please consult a lawyer for
> + * advice before doing this.
> + *
> + * WARNING: THIS CODE IS STILL EXPERIMENTAL
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/socket.h>
> +#include <linux/poll.h>
> +#include <linux/un.h>
> +#include <linux/list.h>
> +#include <linux/mount.h>
> +#include <net/sock.h>
> +#include <net/af_ipn.h>
> +#include "ipn_netdev.h"
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("VIEW-OS TEAM");
> +MODULE_DESCRIPTION("IPN Kernel Module");
> +
> +#define IPN_MAX_PROTO 4
> +
> +/*extension of RCV_SHUTDOWN defined in include/net/sock.h
> + * when the bit is set recv fails */
> +/* NO_OOB: do not send OOB */
> +#define RCV_SHUTDOWN_NO_OOB 4
> +/* EXTENDED MASK including OOB */
> +#define SHUTDOWN_XMASK (SHUTDOWN_MASK | RCV_SHUTDOWN_NO_OOB)
> +/* if XRCV_SHUTDOWN is all set recv fails */
> +#define XRCV_SHUTDOWN (RCV_SHUTDOWN | RCV_SHUTDOWN_NO_OOB)
> +
> +/* Network table and hash */
> +struct hlist_head ipn_network_table[IPN_HASH_SIZE + 1];
> +DEFINE_SPINLOCK(ipn_table_lock);
> +static struct kmem_cache *ipn_network_cache;
> +static struct kmem_cache *ipn_node_cache;
> +static struct kmem_cache *ipn_msgitem_cache;
> +static DECLARE_MUTEX(ipn_glob_mutex);
> +
> +/* Protocol 1: HUB/Broadcast default protocol. Function Prototypes */
> +static int ipn_bcast_newport(struct ipn_node *newport);
> +static int ipn_bcast_handlemsg(struct ipn_node *from,
> + struct msgpool_item *msgitem);
> +
> +/* default protocol IPN_BROADCAST (protocol 1, table index 0) */
> +static struct ipn_protocol ipn_bcast = {
> + .refcnt=0,
> + .ipn_p_newport=ipn_bcast_newport,
> + .ipn_p_handlemsg=ipn_bcast_handlemsg};
> +/* Protocol table */
> +static struct ipn_protocol *ipn_protocol_table[IPN_MAX_PROTO]={&ipn_bcast};
> +
> +/* Socket call function prototypes */
> +static int ipn_release(struct socket *);
> +static int ipn_bind(struct socket *, struct sockaddr *, int);
> +static int ipn_connect(struct socket *, struct sockaddr *,
> + int addr_len, int flags);
> +static int ipn_getname(struct socket *, struct sockaddr *, int *, int);
> +static unsigned int ipn_poll(struct file *, struct socket *, poll_table *);
> +static int ipn_ioctl(struct socket *, unsigned int, unsigned long);
> +static int ipn_shutdown(struct socket *, int);
> +static int ipn_sendmsg(struct kiocb *, struct socket *,
> + struct msghdr *, size_t);
> +static int ipn_recvmsg(struct kiocb *, struct socket *,
> + struct msghdr *, size_t, int);
> +static int ipn_setsockopt(struct socket *sock, int level, int optname,
> + char __user *optval, int optlen);
> +static int ipn_getsockopt(struct socket *sock, int level, int optname,
> + char __user *optval, int __user *optlen);
> +
> +/* Network table Management
> + * inode->ipn_network hash table */
> +static inline void ipn_insert_network(struct hlist_head *list, struct ipn_network *ipnn)
> +{
> + spin_lock(&ipn_table_lock);
> + hlist_add_head(&ipnn->hnode, list);
> + spin_unlock(&ipn_table_lock);
> +}
> +
> +static inline void ipn_remove_network(struct ipn_network *ipnn)
> +{
> + spin_lock(&ipn_table_lock);
> + hlist_del(&ipnn->hnode);
> + spin_unlock(&ipn_table_lock);
> +}
> +
> +static struct ipn_network *ipn_find_network_byinode(struct inode *i)
> +{
> + struct ipn_network *ipnn;
> + struct hlist_node *node;
> +
> + spin_lock(&ipn_table_lock);
> + hlist_for_each_entry(ipnn, node,
> + &ipn_network_table[i->i_ino & (IPN_HASH_SIZE - 1)], hnode) {
> + struct dentry *dentry = ipnn->dentry;
> +
> + if(atomic_read(&ipnn->refcnt) > 0 && dentry && dentry->d_inode == i)
> + goto found;
> + }
> + ipnn = NULL;
> +found:
> + spin_unlock(&ipn_table_lock);
> + return ipnn;
> +}
> +
> +/* msgpool management
> + * msgpool_item are ipn_network dependent (each net has its own MTU)
> + * for each message sent there is one msgpool_item and many struct msgitem
> + * one for each recipient.
> + * msgitem are connected to the node's msgqueue or oobmsgqueue.
> + * when a message is delivered to a process the msgitem is deleted and
> + * the count of the msgpool_item is decreased.
> + * msgpool_item elements get deleted automatically when count is 0 */
> +
> +struct msgitem {
> + struct list_head list;
> + struct msgpool_item *msg;
> +};
> +
> +/* alloc a fresh msgpool item. count is set to 1.
> + * the typical use is
> + * ipn_msgpool_alloc
> + * for each recipient
> + * enqueue messages to the process (using msgitem), ipn_msgpool_hold
> + * ipn_msgpool_put
> + * The message can be delivered concurrently. init count to 1 guarantees
> + * that it survives at least until it has been enqueued to all
> + * receivers */
> +struct msgpool_item *ipn_msgpool_alloc(struct ipn_network *ipnn)
> +{
> + struct msgpool_item *new;
> + new=kmem_cache_alloc(ipnn->msgpool_cache,GFP_KERNEL);
> + atomic_set(&new->count,1);
> + atomic_inc(&ipnn->msgpool_nelem);
> + return new;
> +}
> +
> +/* If the service is LOSSLESS, this msgpool call waits for an
> + * available msgpool item */
> +static struct msgpool_item *ipn_msgpool_alloc_locking(struct ipn_network *ipnn)
> +{
> + if (ipnn->flags & IPN_FLAG_LOSSLESS) {
> + while (atomic_read(&ipnn->msgpool_nelem) >= ipnn->msgpool_size) {
> + if (wait_event_interruptible_exclusive(ipnn->send_wait,
> + atomic_read(&ipnn->msgpool_nelem) < ipnn->msgpool_size))
> + return NULL;
> + }
> + }
> + return ipn_msgpool_alloc(ipnn);
> +}
> +
> +static inline void ipn_msgpool_hold(struct msgpool_item *msg)
> +{
> + atomic_inc(&msg->count);
> +}
> +
> +/* decrease count and delete msgpool_item if count == 0 */
> +void ipn_msgpool_put(struct msgpool_item *old,
> + struct ipn_network *ipnn)
> +{
> + if (atomic_dec_and_test(&old->count)) {
> + kmem_cache_free(ipnn->msgpool_cache,old);
> + atomic_dec(&ipnn->msgpool_nelem);
> + if (ipnn->flags & IPN_FLAG_LOSSLESS) /* could be done anyway */
> + wake_up_interruptible(&ipnn->send_wait);
> + }
> +}
> +
> +/* socket calls */
> +static const struct proto_ops ipn_ops = {
> + .family = PF_IPN,
> + .owner = THIS_MODULE,
> + .release = ipn_release,
> + .bind = ipn_bind,
> + .connect = ipn_connect,
> + .socketpair = sock_no_socketpair,
> + .accept = sock_no_accept,
> + .getname = ipn_getname,
> + .poll = ipn_poll,
> + .ioctl = ipn_ioctl,
> + .listen = sock_no_listen,
> + .shutdown = ipn_shutdown,
> + .setsockopt = ipn_setsockopt,
> + .getsockopt = ipn_getsockopt,
> + .sendmsg = ipn_sendmsg,
> + .recvmsg = ipn_recvmsg,
> + .mmap = sock_no_mmap,
> + .sendpage = sock_no_sendpage,
> +};
> +
> +static struct proto ipn_proto = {
> + .name = "IPN",
> + .owner = THIS_MODULE,
> + .obj_size = sizeof(struct ipn_sock),
> +};
> +
> +/* create a socket
> + * ipn_node is a separate structure, pointed by ipn_sock -> node
> + * when a node is "persistent", ipn_node survives while ipn_sock gets released*/
> +static int ipn_create(struct net *net,struct socket *sock, int protocol)
> +{
> + struct ipn_sock *ipn_sk;
> + struct ipn_node *ipn_node;
> +
> + if (net != &init_net)
> + return -EAFNOSUPPORT;
> +
> + if (sock->type != SOCK_RAW)
> + return -EPROTOTYPE;
> + if (protocol > 0)
> + protocol=protocol-1;
> + else
> + protocol=IPN_BROADCAST-1;
> + if (protocol < 0 || protocol >= IPN_MAX_PROTO ||
> + ipn_protocol_table[protocol] == NULL)
> + return -EPROTONOSUPPORT;
> + ipn_sk = (struct ipn_sock *) sk_alloc(net, PF_IPN, GFP_KERNEL, &ipn_proto);
> +
> + if (!ipn_sk)
> + return -ENOMEM;
> + ipn_sk->node=ipn_node=kmem_cache_alloc(ipn_node_cache,GFP_KERNEL);
> + if (!ipn_node) {
> + sock_put((struct sock *) ipn_sk);
> + return -ENOMEM;
> + }
> + sock_init_data(sock,(struct sock *) ipn_sk);
> + sock->state = SS_UNCONNECTED;
> + sock->ops = &ipn_ops;
> + sock->sk=(struct sock *)ipn_sk;
> + INIT_LIST_HEAD(&ipn_node->nodelist);
> + ipn_node->protocol=protocol;
> + ipn_node->flags=IPN_NODEFLAG_INUSE;
> + ipn_node->shutdown=RCV_SHUTDOWN_NO_OOB;
> + ipn_node->descr[0]=0;
> + ipn_node->portno=IPN_PORTNO_ANY;
> + ipn_node->net=net;
> + ipn_node->dev=NULL;
> + ipn_node->proto_private=NULL;
> + ipn_node->totmsgcount=0;
> + ipn_node->oobmsgcount=0;
> + spin_lock_init(&ipn_node->msglock);
> + INIT_LIST_HEAD(&ipn_node->msgqueue);
> + INIT_LIST_HEAD(&ipn_node->oobmsgqueue);
> + ipn_node->ipn=NULL;
> + init_waitqueue_head(&ipn_node->read_wait);
> + ipn_node->pbp=NULL;
> + return 0;
> +}
> +
> +/* update # of readers and # of writers counters for an ipn network.
> + * This function sends oob messages to nodes requesting the service */
> +static void ipn_net_update_counters(struct ipn_network *ipnn,
> + int chg_readers, int chg_writers) {
> + ipnn->numreaders += chg_readers;
> + ipnn->numwriters += chg_writers;
> + if (ipnn->mtu >= sizeof(struct numnode_oob))
> + {
> + struct msgpool_item *ipn_msg=ipn_msgpool_alloc(ipnn);
> + if (ipn_msg) {
> + struct numnode_oob *oob_msg=(struct numnode_oob *)(ipn_msg->data);
> + struct ipn_node *ipn_node;
> + ipn_msg->len=sizeof(struct numnode_oob);
> + oob_msg->level=IPN_ANY;
> + oob_msg->tag=IPN_OOB_NUMNODE_TAG;
> + oob_msg->numreaders=ipnn->numreaders;
> + oob_msg->numwriters=ipnn->numwriters;
> + list_for_each_entry(ipn_node, &ipnn->connectqueue, nodelist) {
> + if (ipn_node->flags & IPN_NODEFLAG_OOB_NUMNODES)
> + ipn_proto_oobsendmsg(ipn_node,ipn_msg);
> + }
> + ipn_msgpool_put(ipn_msg,ipnn);
> + }
> + }
> +}
> +
> +/* flush pending messages (for close and shutdown RCV) */
> +static void ipn_flush_recvqueue(struct ipn_node *ipn_node)
> +{
> + struct ipn_network *ipnn=ipn_node->ipn;
> + spin_lock(&ipn_node->msglock);
> + while (!list_empty(&ipn_node->msgqueue)) {
> + struct msgitem *msgitem=
> + list_first_entry(&ipn_node->msgqueue, struct msgitem, list);
> + list_del(&msgitem->list);
> + ipn_node->totmsgcount--;
> + ipn_msgpool_put(msgitem->msg,ipnn);
> + kmem_cache_free(ipn_msgitem_cache,msgitem);
> + }
> + spin_unlock(&ipn_node->msglock);
> +}
> +
> +/* flush pending oob messages (for socket close) */
> +static void ipn_flush_oobrecvqueue(struct ipn_node *ipn_node)
> +{
> + struct ipn_network *ipnn=ipn_node->ipn;
> + spin_lock(&ipn_node->msglock);
> + while (!list_empty(&ipn_node->oobmsgqueue)) {
> + struct msgitem *msgitem=
> + list_first_entry(&ipn_node->oobmsgqueue, struct msgitem, list);
> + list_del(&msgitem->list);
> + ipn_node->totmsgcount--;
> + ipn_node->oobmsgcount--;
> + ipn_msgpool_put(msgitem->msg,ipnn);
> + kmem_cache_free(ipn_msgitem_cache,msgitem);
> + }
> + spin_unlock(&ipn_node->msglock);
> +}
> +
> +/* Terminate node. The node is "logically" terminated. */
> +static int ipn_terminate_node(struct ipn_node *ipn_node)
> +{
> + struct ipn_network *ipnn=ipn_node->ipn;
> + if (ipnn) {
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + return -ERESTARTSYS;
> + if (ipn_node->portno >= 0) {
> + ipn_protocol_table[ipnn->protocol]->ipn_p_predelport(ipn_node);
> + ipnn->connport[ipn_node->portno]=NULL;
> + }
> + list_del(&ipn_node->nodelist);
> + ipn_flush_recvqueue(ipn_node);
> + ipn_flush_oobrecvqueue(ipn_node);
> + if (ipn_node->portno >= 0) {
> + ipn_protocol_table[ipnn->protocol]->ipn_p_delport(ipn_node);
> + ipn_node->ipn=NULL;
> + ipn_net_update_counters(ipnn,
> + (ipn_node->shutdown & RCV_SHUTDOWN)?0:-1,
> + (ipn_node->shutdown & SEND_SHUTDOWN)?0:-1);
> + up(&ipnn->ipnn_mutex);
> + if (ipn_node->dev)
> + ipn_netdev_close(ipn_node);

Is the rcu_assign_pointer() invoked by ipn_netdev_close() protected
by ipnn_mutex?
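
Note that in the quoted hunk up(&ipnn->ipnn_mutex) appears just before
the call to ipn_netdev_close(). A minimal single-threaded mock-up of the
update-side ordering in question, in userspace C, might look like the
following. All mock_* names are hypothetical; in the kernel the
corresponding pieces would be the update-side lock, rcu_assign_pointer()
and synchronize_rcu(), and whether ipn_netdev_close() is structured this
way is not visible in this part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct mock_port { int portno; };
struct mock_dev { _Atomic(struct mock_port *) ipn_port; };

static int mock_lock_held;	/* stand-in for the update-side mutex */
static int mock_grace_periods;	/* counts simulated grace periods */

static void mock_lock(void)   { assert(!mock_lock_held); mock_lock_held = 1; }
static void mock_unlock(void) { assert(mock_lock_held);  mock_lock_held = 0; }

/* Stand-in for rcu_assign_pointer(): updaters must hold the lock. */
static void mock_rcu_assign_pointer(struct mock_dev *dev, struct mock_port *p)
{
	assert(mock_lock_held);
	atomic_store_explicit(&dev->ipn_port, p, memory_order_release);
}

/* Stand-in for synchronize_rcu(): here it only counts the call. */
static void mock_synchronize_rcu(void) { mock_grace_periods++; }

/* Detach the port under the lock, wait for readers, then free it. */
static void mock_netdev_close(struct mock_dev *dev)
{
	struct mock_port *old;

	mock_lock();
	old = atomic_load_explicit(&dev->ipn_port, memory_order_relaxed);
	mock_rcu_assign_pointer(dev, NULL);
	mock_unlock();

	if (old) {
		mock_synchronize_rcu();	/* no reader can still hold 'old' */
		free(old);
	}
}
```

The point of the ordering is that the pointer is unpublished while the
lock is held, and the old object is freed only after a grace period.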

> + }
> + /* No more network elements */
> + if (atomic_dec_and_test(&ipnn->refcnt))
> + {
> + ipn_protocol_table[ipnn->protocol]->ipn_p_delnet(ipnn);
> + ipn_remove_network(ipnn);
> + ipn_protocol_table[ipnn->protocol]->refcnt--;
> + if (ipnn->dentry) {
> + dput(ipnn->dentry);
> + mntput(ipnn->mnt);
> + }
> + module_put(THIS_MODULE);
> + if (ipnn->msgpool_cache)
> + kmem_cache_destroy(ipnn->msgpool_cache);
> + if (ipnn->connport)
> + kfree(ipnn->connport);
> + kmem_cache_free(ipn_network_cache, ipnn);
> + }
> + }
> + if (ipn_node->pbp) {
> + kfree(ipn_node->pbp);
> + ipn_node->pbp=NULL;
> + }
> + ipn_node->shutdown = SHUTDOWN_XMASK;
> + return 0;
> +}
> +
> +/* release of a socket */
> +static int ipn_release (struct socket *sock)
> +{
> + struct ipn_sock *ipn_sk=(struct ipn_sock *)sock->sk;
> + struct ipn_node *ipn_node=ipn_sk->node;
> + int rv;
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (ipn_node->flags & IPN_NODEFLAG_PERSIST) {
> + ipn_node->flags &= ~IPN_NODEFLAG_INUSE;
> + rv=0;
> + } else {
> + rv=ipn_terminate_node(ipn_node);
> + if (rv==0)
> + kmem_cache_free(ipn_node_cache,ipn_node);
> + }
> + if (rv==0)
> + sock_put((struct sock *) ipn_sk);
> + up(&ipn_glob_mutex);
> + return rv;
> +}
> +
> +/* _set persist, change the persistence of a node,
> + * when persistence gets cleared and the node is no longer used
> + * the node is terminated and freed.
> + * ipn_glob_mutex must be locked */
> +static int _ipn_setpersist(struct ipn_node *ipn_node, int persist)
> +{
> + int rv=0;
> + if (persist)
> + ipn_node->flags |= IPN_NODEFLAG_PERSIST;
> + else {
> + ipn_node->flags &= ~IPN_NODEFLAG_PERSIST;
> + if (!(ipn_node->flags & IPN_NODEFLAG_INUSE)) {
> + rv=ipn_terminate_node(ipn_node);
> + if (rv==0)
> + kmem_cache_free(ipn_node_cache,ipn_node);
> + }
> + }
> + return rv;
> +}
> +
> +/* ipn_setpersist
> + * lock ipn_glob_mutex and call _ipn_setpersist above */
> +static int ipn_setpersist(struct ipn_node *ipn_node, int persist)
> +{
> + int rv=0;
> + if (ipn_node->dev == NULL)
> + return -ENODEV;
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + rv=_ipn_setpersist(ipn_node,persist);
> + up(&ipn_glob_mutex);
> + return rv;
> +}
> +
> +/* several network parameters can be set by setsockopt prior to bind */
> +/* struct pre_bind_parms is a temporary structure connected to ipn_node->pbp
> + * to keep the parameter values. */
> +struct pre_bind_parms {
> + unsigned short maxports;
> + unsigned short flags;
> + unsigned short msgpoolsize;
> + unsigned short mtu;
> + unsigned short mode;
> +};
> +
> +/* STD_PARMS: BITS_PER_LONG nodes, no flags, BITS_PER_BYTE pending msgs,
> + * Ethernet + VLAN MTU*/
> +#define STD_BIND_PARMS {BITS_PER_LONG, 0, BITS_PER_BYTE, 1514, 0x777};
> +
> +static int ipn_mkname(struct sockaddr_un * sunaddr, int len)
> +{
> + if (len <= sizeof(short) || len > sizeof(*sunaddr))
> + return -EINVAL;
> + if (!sunaddr || sunaddr->sun_family != AF_IPN)
> + return -EINVAL;
> + /*
> + * This may look like an off by one error but it is a bit more
> + * subtle. 108 is the longest valid AF_IPN path for a binding.
> + * sun_path[108] doesn't as such exist. However in kernel space
> + * we are guaranteed that it is a valid memory location in our
> + * kernel address buffer.
> + */
> + ((char *)sunaddr)[len]=0;
> + len = strlen(sunaddr->sun_path)+1+sizeof(short);
> + return len;
> +}
> +
> +
> +/* IPN BIND */
> +static int ipn_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
> +{
> + struct sockaddr_un *sunaddr=(struct sockaddr_un *)uaddr;
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct nameidata nd;
> + struct ipn_network *ipnn;
> + struct dentry * dentry = NULL;
> + int err;
> + struct pre_bind_parms parms=STD_BIND_PARMS;
> +
> + //printk("IPN bind\n");
> +
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (sock->state != SS_UNCONNECTED ||
> + ipn_node->ipn != NULL) {
> + err= -EISCONN;
> + goto out;
> + }
> +
> + if (ipn_node->protocol >= 0 &&
> + (ipn_node->protocol >= IPN_MAX_PROTO ||
> + ipn_protocol_table[ipn_node->protocol] == NULL)) {
> + err= -EPROTONOSUPPORT;
> + goto out;
> + }
> +
> + addr_len = ipn_mkname(sunaddr, addr_len);
> + if (addr_len < 0) {
> + err=addr_len;
> + goto out;
> + }
> +
> + /* check if there is already a socket with that name */
> + err = path_lookup(sunaddr->sun_path, LOOKUP_FOLLOW, &nd);
> + if (err) { /* it does not exist, NEW IPN socket! */
> + unsigned int mode;
> + /* Is everything okay with the parent? */
> + err = path_lookup(sunaddr->sun_path, LOOKUP_PARENT, &nd);
> + if (err)
> + goto out_mknod_parent;
> + /* Do I have the permission to create a file? */
> + dentry = lookup_create(&nd, 0);
> + err = PTR_ERR(dentry);
> + if (IS_ERR(dentry))
> + goto out_mknod_unlock;
> + /*
> + * All right, let's create it.
> + */
> + if (ipn_node->pbp)
> + mode = ipn_node->pbp->mode;
> + else
> + mode = SOCK_INODE(sock)->i_mode;
> + mode = S_IFSOCK | (mode & ~current->fs->umask);
> + err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
> + if (err)
> + goto out_mknod_dput;
> + mutex_unlock(&nd.dentry->d_inode->i_mutex);
> + dput(nd.dentry);
> + nd.dentry = dentry;
> + /* create a new ipn_network item */
> + if (ipn_node->pbp)
> + parms=*ipn_node->pbp;
> + ipnn=kmem_cache_zalloc(ipn_network_cache,GFP_KERNEL);
> + if (!ipnn) {
> + err=-ENOMEM;
> + goto out_mknod_dput_ipnn;
> + }
> + ipnn->connport=kzalloc(parms.maxports * sizeof(struct ipn_node *),GFP_KERNEL);
> + if (!ipnn->connport) {
> + err=-ENOMEM;
> + goto out_mknod_dput_ipnn2;
> + }
> +
> + /* module refcnt is incremented for each network, thus
> + * rmmod is forbidden if there are persistent nodes */
> + if (!try_module_get(THIS_MODULE)) {
> + err = -EINVAL;
> + goto out_mknod_dput_ipnn2;
> + }
> + memcpy(&ipnn->sunaddr,sunaddr,addr_len);
> + ipnn->mtu=parms.mtu;
> + ipnn->msgpool_cache=kmem_cache_create(ipnn->sunaddr.sun_path,sizeof(struct msgpool_item)+ipnn->mtu,0,0,NULL);
> + if (!ipnn->msgpool_cache) {
> + err=-ENOMEM;
> + goto out_mknod_dput_putmodule;
> + }
> + INIT_LIST_HEAD(&ipnn->unconnectqueue);
> + INIT_LIST_HEAD(&ipnn->connectqueue);
> + atomic_set(&ipnn->refcnt,1);
> + ipnn->dentry=nd.dentry;
> + ipnn->mnt=nd.mnt;
> + init_MUTEX(&ipnn->ipnn_mutex);
> + ipnn->sunaddr_len=addr_len;
> + ipnn->protocol=ipn_node->protocol;
> + if (ipnn->protocol < 0) ipnn->protocol = 0;
> + ipn_protocol_table[ipnn->protocol]->refcnt++;
> + ipnn->flags=parms.flags;
> + ipnn->numreaders=0;
> + ipnn->numwriters=0;
> + ipnn->maxports=parms.maxports;
> + atomic_set(&ipnn->msgpool_nelem,0);
> + ipnn->msgpool_size=parms.msgpoolsize;
> + ipnn->proto_private=NULL;
> + init_waitqueue_head(&ipnn->send_wait);
> + err=ipn_protocol_table[ipnn->protocol]->ipn_p_newnet(ipnn);
> + if (err)
> + goto out_mknod_dput_putmodule;
> + ipn_insert_network(&ipn_network_table[nd.dentry->d_inode->i_ino & (IPN_HASH_SIZE-1)],ipnn);
> + } else {
> + /* join an existing network */
> + err = vfs_permission(&nd, MAY_EXEC);
> + if (err)
> + goto put_fail;
> + err = -ECONNREFUSED;
> + if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
> + goto put_fail;
> + ipnn=ipn_find_network_byinode(nd.dentry->d_inode);
> + if (!ipnn || (ipnn->flags & IPN_FLAG_TERMINATED))
> + goto put_fail;
> + list_add_tail(&ipn_node->nodelist,&ipnn->unconnectqueue);
> + atomic_inc(&ipnn->refcnt);
> + }
> + if (ipn_node->pbp) {
> + kfree(ipn_node->pbp);
> + ipn_node->pbp=NULL;
> + }
> + ipn_node->ipn=ipnn;
> + ipn_node->flags |= IPN_NODEFLAG_BOUND;
> + up(&ipn_glob_mutex);
> + return 0;
> +
> +put_fail:
> + path_release(&nd);
> +out:
> + up(&ipn_glob_mutex);
> + return err;
> +
> +out_mknod_dput_putmodule:
> + module_put(THIS_MODULE);
> +out_mknod_dput_ipnn2:
> + kfree(ipnn->connport);
> +out_mknod_dput_ipnn:
> + kmem_cache_free(ipn_network_cache,ipnn);
> +out_mknod_dput:
> + dput(dentry);
> +out_mknod_unlock:
> + mutex_unlock(&nd.dentry->d_inode->i_mutex);
> + path_release(&nd);
> +out_mknod_parent:
> + if (err==-EEXIST)
> + err=-EADDRINUSE;
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +/* IPN CONNECT */
> +static int ipn_connect(struct socket *sock, struct sockaddr *addr,
> + int addr_len, int flags){
> + struct sockaddr_un *sunaddr=(struct sockaddr_un*)addr;
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct nameidata nd;
> + struct ipn_network *ipnn,*previousipnn;
> + int err=0;
> + int portno;
> +
> + /* the socket cannot be connected twice */
> + if (sock->state != SS_UNCONNECTED)
> + return -EISCONN;
> +
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> +
> + if ((previousipnn=ipn_node->ipn) == NULL) { /* unbound */
> + unsigned char mustshutdown=0;
> + err = ipn_mkname(sunaddr, addr_len);
> + if (err < 0)
> + goto out;
> + addr_len=err;
> + err = path_lookup(sunaddr->sun_path, LOOKUP_FOLLOW, &nd);
> + if (err)
> + goto out;
> + err = vfs_permission(&nd, MAY_READ);
> + if (err) {
> + if (err == -EACCES || err == -EROFS)
> + mustshutdown|=RCV_SHUTDOWN;
> + else
> + goto put_fail;
> + }
> + err = vfs_permission(&nd, MAY_WRITE);
> + if (err) {
> + if (err == -EACCES)
> + mustshutdown|=SEND_SHUTDOWN;
> + else
> + goto put_fail;
> + }
> + mustshutdown |= ipn_node->shutdown;
> + /* if the combination of shutdown and permissions leaves
> + * no abilities, connect returns EACCES */
> + if (mustshutdown == SHUTDOWN_XMASK) {
> + err=-EACCES;
> + goto put_fail;
> + } else {
> + err=0;
> + ipn_node->shutdown=mustshutdown;
> + }
> + if (!S_ISSOCK(nd.dentry->d_inode->i_mode)) {
> + err = -ECONNREFUSED;
> + goto put_fail;
> + }
> + ipnn=ipn_find_network_byinode(nd.dentry->d_inode);
> + if (!ipnn || (ipnn->flags & IPN_FLAG_TERMINATED)) {
> + err = -ECONNREFUSED;
> + goto put_fail;
> + }
> + if (ipn_node->protocol == IPN_ANY)
> + ipn_node->protocol=ipnn->protocol;
> + else if (ipnn->protocol != ipn_node->protocol) {
> + err = -EPROTO;
> + goto put_fail;
> + }
> + path_release(&nd);
> + ipn_node->ipn=ipnn;
> + } else
> + ipnn=ipn_node->ipn;
> +
> + if (down_interruptible(&ipnn->ipnn_mutex)) {
> + err=-ERESTARTSYS;
> + goto out;
> + }
> + portno = ipn_protocol_table[ipnn->protocol]->ipn_p_newport(ipn_node);
> + if (portno >= 0 && portno<ipnn->maxports) {
> + sock->state = SS_CONNECTED;
> + ipn_node->portno=portno;
> + ipnn->connport[portno]=ipn_node;
> + if (!(ipn_node->flags & IPN_NODEFLAG_BOUND)) {
> + atomic_inc(&ipnn->refcnt);
> + list_del(&ipn_node->nodelist);
> + }
> + list_add_tail(&ipn_node->nodelist,&ipnn->connectqueue);
> + ipn_net_update_counters(ipnn,
> + (ipn_node->shutdown & RCV_SHUTDOWN)?0:1,
> + (ipn_node->shutdown & SEND_SHUTDOWN)?0:1);
> + } else {
> + ipn_node->ipn=previousipnn; /* undo changes on ipn_node->ipn */
> + err=-EADDRNOTAVAIL;
> + }
> + up(&ipnn->ipnn_mutex);
> + up(&ipn_glob_mutex);
> + return err;
> +
> +put_fail:
> + path_release(&nd);
> +out:
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +static int ipn_getname(struct socket *sock, struct sockaddr *uaddr,
> + int *uaddr_len, int peer) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + struct sockaddr_un *sunaddr=(struct sockaddr_un *)uaddr;
> + int err=0;
> +
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (ipnn) {
> + *uaddr_len = ipnn->sunaddr_len;
> + memcpy(sunaddr,&ipnn->sunaddr,*uaddr_len);
> + } else
> + err = -ENOTCONN;
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +/* IPN POLL */
> +static unsigned int ipn_poll(struct file *file, struct socket *sock,
> + poll_table *wait) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + unsigned int mask=0;
> +
> + if (ipnn) {
> + poll_wait(file,&ipn_node->read_wait,wait);
> + if (ipnn->flags & IPN_FLAG_LOSSLESS)
> + poll_wait(file,&ipnn->send_wait,wait);
> + /* POLLIN if recv succeeds,
> + * POLL{PRI,RDNORM} if there are {oob,non-oob} messages */
> + if (ipn_node->totmsgcount > 0) mask |= POLLIN;
> + if (!(list_empty(&ipn_node->msgqueue))) mask |= POLLRDNORM;
> + if (!(list_empty(&ipn_node->oobmsgqueue))) mask |= POLLPRI;
> + if ((!(ipnn->flags & IPN_FLAG_LOSSLESS)) ||
> + (atomic_read(&ipnn->msgpool_nelem) < ipnn->msgpool_size))
> + mask |= POLLOUT | POLLWRNORM;
> + }
> + return mask;
> +}
> +
> +/* connect netdev (from ioctl). connect a bound socket to a
> + * network device TAP or GRAB */
> +static int ipn_connect_netdev(struct socket *sock,struct ifreq *ifr)
> +{
> + int err=0;
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> + if (sock->state != SS_UNCONNECTED)
> + return -EISCONN;
> + if (!ipnn)
> + return -ENOTCONN; /* Maybe we need a different error for "NOT BOUND" */
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (down_interruptible(&ipnn->ipnn_mutex)) {
> + up(&ipn_glob_mutex);
> + return -ERESTARTSYS;
> + }
> + ipn_node->dev=ipn_netdev_alloc(ipn_node->net,ifr->ifr_flags,ifr->ifr_name,&err);
> + if (ipn_node->dev) {
> + int portno;
> + portno = ipn_protocol_table[ipnn->protocol]->ipn_p_newport(ipn_node);
> + if (portno >= 0 && portno<ipnn->maxports) {
> + sock->state = SS_CONNECTED;
> + ipn_node->portno=portno;
> + ipn_node->flags |= ifr->ifr_flags & IPN_NODEFLAG_DEVMASK;
> + ipnn->connport[portno]=ipn_node;
> + err=ipn_netdev_activate(ipn_node);
> + if (err) {
> + sock->state = SS_UNCONNECTED;
> + ipn_protocol_table[ipnn->protocol]->ipn_p_delport(ipn_node);
> + ipn_node->dev=NULL;
> + ipn_node->portno= -1;
> + ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
> + ipnn->connport[portno]=NULL;
> + } else {
> + ipn_protocol_table[ipnn->protocol]->ipn_p_postnewport(ipn_node);
> + list_del(&ipn_node->nodelist);
> + list_add_tail(&ipn_node->nodelist,&ipnn->connectqueue);
> + }
> + } else {
> + ipn_netdev_close(ipn_node);

Again, is the rcu_assign_pointer() invoked by ipn_netdev_close() protected
by ipnn_mutex?

> + err=-EADDRNOTAVAIL;
> + ipn_node->dev=NULL;
> + }
> + } else
> + err=-EINVAL;
> + up(&ipnn->ipnn_mutex);
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +/* join a netdev, a socket gets connected to a persistent node
> + * not connected to another socket */
> +static int ipn_join_netdev(struct socket *sock,struct ifreq *ifr)
> +{
> + int err=0;
> + struct net_device *dev;
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_node *ipn_joined;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + if (sock->state != SS_UNCONNECTED)
> + return -EISCONN;
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (down_interruptible(&ipnn->ipnn_mutex)) {
> + up(&ipn_glob_mutex);
> + return -ERESTARTSYS;
> + }
> + dev=__dev_get_by_name(ipn_node->net,ifr->ifr_name);
> + if (!dev)
> + dev=__dev_get_by_index(ipn_node->net,ifr->ifr_ifindex);
> + if (dev && (ipn_joined=ipn_netdev2node(dev)) != NULL) { /* the interface does exist */
> + int i;
> + for (i=0;i<ipnn->maxports && ipn_joined != ipnn->connport[i] ;i++)
> + ;
> + if (i < ipnn->maxports) { /* found */
> + /* ipn_joined replaces ipn_node */
> + ((struct ipn_sock *)sock->sk)->node=ipn_joined;
> + ipn_joined->flags |= IPN_NODEFLAG_INUSE;
> + atomic_dec(&ipnn->refcnt);
> + kmem_cache_free(ipn_node_cache,ipn_node);
> + } else
> + err=-EPERM;
> + } else
> + err=-EADDRNOTAVAIL;
> + up(&ipnn->ipnn_mutex);
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +/* set persistence of a node looking for it by interface name
> + * (it is for sysadm, to close network interfaces)*/
> +static int ipn_setpersist_netdev(struct ifreq *ifr, int value)
> +{
> + struct net_device *dev;
> + struct ipn_node *ipn_node;
> + int err=0;
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + dev=__dev_get_by_name(&init_net,ifr->ifr_name);
> + if (!dev)
> + dev=__dev_get_by_index(&init_net,ifr->ifr_ifindex);
> + if (dev && (ipn_node=ipn_netdev2node(dev)) != NULL)
> + _ipn_setpersist(ipn_node,value);
> + else
> + err=-EADDRNOTAVAIL;
> + up(&ipn_glob_mutex);
> + return err;
> +}
> +
> +/* IPN IOCTL */
> +static int ipn_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + void __user* argp = (void __user*)arg;
> + struct ifreq ifr;
> +
> + if (ipn_node->shutdown == SHUTDOWN_XMASK)
> + return -ECONNRESET;
> +
> + /* get arguments */
> + switch (cmd) {
> + case IPN_SETPERSIST_NETDEV:
> + case IPN_CLRPERSIST_NETDEV:
> + case IPN_CONN_NETDEV:
> + case IPN_JOIN_NETDEV:
> + case SIOCSIFHWADDR:
> + if (copy_from_user(&ifr, argp, sizeof ifr))
> + return -EFAULT;
> + ifr.ifr_name[IFNAMSIZ-1] = '\0';
> + }
> +
> + /* actions for unconnected and unbound sockets */
> + switch (cmd) {
> + case IPN_SETPERSIST_NETDEV:
> + return ipn_setpersist_netdev(&ifr,1);
> + case IPN_CLRPERSIST_NETDEV:
> + return ipn_setpersist_netdev(&ifr,0);
> + case SIOCSIFHWADDR:
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> + if (ipn_node->dev && (ipn_node->flags &IPN_NODEFLAG_TAP))
> + return dev_set_mac_address(ipn_node->dev, &ifr.ifr_hwaddr);
> + else
> + return -EADDRNOTAVAIL;
> + }
> + if (ipnn == NULL || (ipnn->flags & IPN_FLAG_TERMINATED))
> + return -ENOTCONN;
> + /* actions for connected or bound sockets */
> + switch (cmd) {
> + case IPN_CONN_NETDEV:
> + return ipn_connect_netdev(sock,&ifr);
> + case IPN_JOIN_NETDEV:
> + return ipn_join_netdev(sock,&ifr);
> + case IPN_SETPERSIST:
> + return ipn_setpersist(ipn_node,arg);
> + default:
> + if (ipnn) {
> + int rv;
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + return -ERESTARTSYS;
> + rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_ioctl(ipn_node,cmd,arg);
> + up(&ipnn->ipnn_mutex);
> + return rv;
> + } else
> + return -EOPNOTSUPP;
> + }
> +}
> +
> +/* shutdown: close socket for input or for output.
> + * shutdown can be called prior to connect and it is not reversible */
> +static int ipn_shutdown(struct socket *sock, int mode) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + int oldshutdown=ipn_node->shutdown;
> + mode = (mode+1)&(RCV_SHUTDOWN|SEND_SHUTDOWN);
> +
> + ipn_node->shutdown |= mode;
> +
> + if(ipnn) {
> + if (down_interruptible(&ipnn->ipnn_mutex)) {
> + ipn_node->shutdown = oldshutdown;
> + return -ERESTARTSYS;
> + }
> + oldshutdown=ipn_node->shutdown-oldshutdown;
> + if (sock->state == SS_CONNECTED && oldshutdown) {
> + ipn_net_update_counters(ipnn,
> + (ipn_node->shutdown & RCV_SHUTDOWN)?0:-1,
> + (ipn_node->shutdown & SEND_SHUTDOWN)?0:-1);
> + }
> +
> + /* if recv channel has been shut down, flush the recv queue */
> + if ((ipn_node->shutdown & RCV_SHUTDOWN))
> + ipn_flush_recvqueue(ipn_node);
> + up(&ipnn->ipnn_mutex);
> + }
> + return 0;
> +}
> +
> +/* injectmsg: a new message is entering the ipn network.
> + * injectmsg gets called by send and by the grab/tap node */
> +int ipn_proto_injectmsg(struct ipn_node *from, struct msgpool_item *msg)
> +{
> + struct ipn_network *ipnn=from->ipn;
> + int err=0;
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + err=-ERESTARTSYS;
> + else {
> + ipn_protocol_table[ipnn->protocol]->ipn_p_handlemsg(from, msg);
> + up(&ipnn->ipnn_mutex);
> + }
> + return err;
> +}
> +
> +/* SEND MSG */
> +static int ipn_sendmsg(struct kiocb *kiocb, struct socket *sock,
> + struct msghdr *msg, size_t len) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + struct msgpool_item *newmsg;
> + int err=0;
> +
> + if (unlikely(sock->state != SS_CONNECTED))
> + return -ENOTCONN;
> + if (unlikely(ipn_node->shutdown & SEND_SHUTDOWN)) {
> + if (ipn_node->shutdown == SHUTDOWN_XMASK)
> + return -ECONNRESET;
> + else
> + return -EPIPE;
> + }
> + if (len > ipnn->mtu)
> + return -EOVERFLOW;
> + newmsg=ipn_msgpool_alloc_locking(ipnn);
> + if (!newmsg)
> + return -ENOMEM;
> + newmsg->len=len;
> + err=memcpy_fromiovec(newmsg->data, msg->msg_iov, len);
> + if (!err)
> + ipn_proto_injectmsg(ipn_node, newmsg);
> + ipn_msgpool_put(newmsg,ipnn);
> + return err;
> +}
> +
> +/* enqueue an oob message. "to" is the destination */
> +void ipn_proto_oobsendmsg(struct ipn_node *to, struct msgpool_item *msg)
> +{
> + if (to) {
> + if (!to->dev) { /* no oob to netdev */
> + struct msgitem *msgitem;
> + struct ipn_network *ipnn=to->ipn;
> + spin_lock(&to->msglock);
> + if ((to->shutdown & RCV_SHUTDOWN_NO_OOB) == 0 &&
> + (ipnn->flags & IPN_FLAG_LOSSLESS ||
> + to->oobmsgcount < ipnn->msgpool_size)) {
> + if ((msgitem=kmem_cache_alloc(ipn_msgitem_cache,GFP_ATOMIC))!=NULL) {
> + msgitem->msg=msg;
> + to->totmsgcount++;
> + to->oobmsgcount++;
> + list_add_tail(&msgitem->list, &to->oobmsgqueue);
> + ipn_msgpool_hold(msg);
> + }
> + }
> + spin_unlock(&to->msglock);
> + wake_up_interruptible(&to->read_wait);
> + }
> + }
> +}
> +
> +/* ipn_proto_sendmsg is called by protocol implementation to enqueue a
> + * message for a destination (to).*/
> +void ipn_proto_sendmsg(struct ipn_node *to, struct msgpool_item *msg)
> +{
> + if (to) {
> + if (to->dev) {
> + ipn_netdev_sendmsg(to,msg);
> + } else {
> + /* socket send */
> + struct msgitem *msgitem;
> + struct ipn_network *ipnn=to->ipn;
> + spin_lock(&to->msglock);
> + if ((ipnn->flags & IPN_FLAG_LOSSLESS ||
> + to->totmsgcount < ipnn->msgpool_size) &&
> + (to->shutdown & RCV_SHUTDOWN)==0) {
> + if ((msgitem=kmem_cache_alloc(ipn_msgitem_cache,GFP_ATOMIC))!=NULL) {
> + msgitem->msg=msg;
> + to->totmsgcount++;
> + list_add_tail(&msgitem->list, &to->msgqueue);
> + ipn_msgpool_hold(msg);
> + }
> + }
> + spin_unlock(&to->msglock);
> + wake_up_interruptible(&to->read_wait);
> + }
> + }
> +}
> +
> +/* IPN RECV */
> +static int ipn_recvmsg(struct kiocb *kiocb, struct socket *sock,
> + struct msghdr *msg, size_t len, int flags) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + struct msgitem *msgitem;
> + struct msgpool_item *currmsg;
> +
> + if (unlikely(sock->state != SS_CONNECTED))
> + return -ENOTCONN;
> +
> + if (unlikely((ipn_node->shutdown & XRCV_SHUTDOWN) == XRCV_SHUTDOWN)) {
> + if (ipn_node->shutdown == SHUTDOWN_XMASK) /*EOF, nothing can be read*/
> + return 0;
> + else
> + return -EPIPE; /*trying to read on a write only node */
> + }
> +
> + /* wait for a message */
> + spin_lock(&ipn_node->msglock);
> + while (ipn_node->totmsgcount == 0) {
> + spin_unlock(&ipn_node->msglock);
> + if (wait_event_interruptible(ipn_node->read_wait,
> + !(ipn_node->totmsgcount == 0)))
> + return -ERESTARTSYS;
> + spin_lock(&ipn_node->msglock);
> + }
> + /* oob gets delivered first. oob are rare */
> + if (likely(list_empty(&ipn_node->oobmsgqueue)))
> + msgitem=list_first_entry(&ipn_node->msgqueue, struct msgitem, list);
> + else {
> + msgitem=list_first_entry(&ipn_node->oobmsgqueue, struct msgitem, list);
> + msg->msg_flags |= MSG_OOB;
> + ipn_node->oobmsgcount--;
> + }
> + list_del(&msgitem->list);
> + ipn_node->totmsgcount--;
> + spin_unlock(&ipn_node->msglock);
> + currmsg=msgitem->msg;
> + if (currmsg->len < len)
> + len=currmsg->len;
> + memcpy_toiovec(msg->msg_iov, currmsg->data, len);
> + ipn_msgpool_put(currmsg,ipnn);
> + kmem_cache_free(ipn_msgitem_cache,msgitem);
> +
> + return len;
> +}
> +
> +/* resize a network: change the # of communication ports (connport) */
> +static int ipn_netresize(struct ipn_network *ipnn,int newsize)
> +{
> + int oldsize,min;
> + struct ipn_node **newconnport;
> + struct ipn_node **oldconnport;
> + int err;
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + return -ERESTARTSYS;
> + oldsize=ipnn->maxports;
> + if (newsize == oldsize) {
> + up(&ipnn->ipnn_mutex);
> + return 0;
> + }
> + min=oldsize;
> + /* shrink a network. all the ports we are going to eliminate
> + * must be unused! */
> + if (newsize < oldsize) {
> + int i;
> + for (i=newsize; i<oldsize; i++)
> + if (ipnn->connport[i]) {
> + up(&ipnn->ipnn_mutex);
> + return -EADDRINUSE;
> + }
> + min=newsize;
> + }
> + oldconnport=ipnn->connport;
> + /* allocate the new connport array and copy the old one */
> + newconnport=kzalloc(newsize * sizeof(struct ipn_node *),GFP_KERNEL);
> + if (!newconnport) {
> + up(&ipnn->ipnn_mutex);
> + return -ENOMEM;
> + }
> + memcpy(newconnport,oldconnport,min * sizeof(struct ipn_node *));
> + ipnn->connport=newconnport;
> + ipnn->maxports=newsize;
> + /* notify the protocol that the network has been resized */
> + err=ipn_protocol_table[ipnn->protocol]->ipn_p_resizenet(ipnn,oldsize,newsize);
> + if (err) {
> + /* roll back if the resize operation failed for the protocol */
> + ipnn->connport=oldconnport;
> + ipnn->maxports=oldsize;
> + kfree(newconnport);
> + } else
> + /* successful mission, network resized */
> + kfree(oldconnport);
> + up(&ipnn->ipnn_mutex);
> + return err;
> +}
> +
> +/* IPN SETSOCKOPT */
> +static int ipn_setsockopt(struct socket *sock, int level, int optname,
> + char __user *optval, int optlen) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> +
> + if (ipn_node->shutdown == SHUTDOWN_XMASK)
> + return -ECONNRESET;
> + if (level != 0 && level != ipn_node->protocol+1)
> + return -EPROTONOSUPPORT;
> + if (level > 0) {
> + /* protocol specific sockopt */
> + if (ipnn) {
> + int rv;
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + return -ERESTARTSYS;
> + rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_setsockopt(ipn_node,optname,optval,optlen);
> + up(&ipnn->ipnn_mutex);
> + return rv;
> + } else
> + return -EOPNOTSUPP;
> + } else {
> + if (optname == IPN_SO_DESCR) {
> + if (optlen > IPN_DESCRLEN)
> + return -EINVAL;
> + else {
> + memset(ipn_node->descr,0,IPN_DESCRLEN);
> + copy_from_user(ipn_node->descr,optval,optlen);
> + ipn_node->descr[optlen-1]=0;
> + return 0;
> + }
> + } else {
> + if (optlen < sizeof(int))
> + return -EINVAL;
> + else if ((optname & IPN_SO_PREBIND) && (ipnn != NULL))
> + return -EISCONN;
> + else {
> + int val;
> + get_user(val, (int __user *) optval);
> + if ((optname & IPN_SO_PREBIND) && !ipn_node->pbp) {
> + struct pre_bind_parms std=STD_BIND_PARMS;
> + ipn_node->pbp=kzalloc(sizeof(struct pre_bind_parms),GFP_KERNEL);
> + if (!ipn_node->pbp)
> + return -ENOMEM;
> + *(ipn_node->pbp)=std;
> + }
> + switch (optname) {
> + case IPN_SO_PORT:
> + if (sock->state == SS_UNCONNECTED)
> + ipn_node->portno=val;
> + else
> + return -EISCONN;
> + break;
> + case IPN_SO_CHANGE_NUMNODES:
> + if ((ipn_node->flags & IPN_NODEFLAG_BOUND)!=0) {
> + if (val <= 0)
> + return -EINVAL;
> + else
> + return ipn_netresize(ipnn,val);
> + } else
> + return -ENOTCONN;
> + break;
> + case IPN_SO_WANT_OOB_NUMNODES:
> + if (val)
> + ipn_node->flags |= IPN_NODEFLAG_OOB_NUMNODES;
> + else
> + ipn_node->flags &= ~IPN_NODEFLAG_OOB_NUMNODES;
> + break;
> + case IPN_SO_HANDLE_OOB:
> + if (val)
> + ipn_node->shutdown &= ~RCV_SHUTDOWN_NO_OOB;
> + else
> + ipn_node->shutdown |= RCV_SHUTDOWN_NO_OOB;
> + break;
> + case IPN_SO_MTU:
> + if (val <= 0)
> + return -EINVAL;
> + else
> + ipn_node->pbp->mtu=val;
> + break;
> + case IPN_SO_NUMNODES:
> + if (val <= 0)
> + return -EINVAL;
> + else
> + ipn_node->pbp->maxports=val;
> + break;
> + case IPN_SO_MSGPOOLSIZE:
> + if (val <= 0)
> + return -EINVAL;
> + else
> + ipn_node->pbp->msgpoolsize=val;
> + break;
> + case IPN_SO_FLAGS:
> + ipn_node->pbp->flags=val;
> + break;
> + case IPN_SO_MODE:
> + ipn_node->pbp->mode=val;
> + break;
> + }
> + return 0;
> + }
> + }
> + }
> +}
> +
> +/* IPN GETSOCKOPT */
> +static int ipn_getsockopt(struct socket *sock, int level, int optname,
> + char __user *optval, int __user *optlen) {
> + struct ipn_node *ipn_node=((struct ipn_sock *)sock->sk)->node;
> + struct ipn_network *ipnn=ipn_node->ipn;
> + int len;
> +
> + if (ipn_node->shutdown == SHUTDOWN_XMASK)
> + return -ECONNRESET;
> + if (level != 0 && level != ipn_node->protocol+1)
> + return -EPROTONOSUPPORT;
> + if (level > 0) {
> + if (ipnn) {
> + int rv;
> + /* protocol specific sockopt */
> + if (down_interruptible(&ipnn->ipnn_mutex))
> + return -ERESTARTSYS;
> + rv=ipn_protocol_table[ipn_node->protocol]->ipn_p_getsockopt(ipn_node,optname,optval,optlen);
> + up(&ipnn->ipnn_mutex);
> + return rv;
> + } else
> + return -EOPNOTSUPP;
> + } else {
> + if (get_user(len, optlen))
> + return -EFAULT;
> + if (optname == IPN_SO_DESCR) {
> + if (len < IPN_DESCRLEN)
> + return -EINVAL;
> + else {
> + if (len > IPN_DESCRLEN)
> + len=IPN_DESCRLEN;
> + if(put_user(len, optlen))
> + return -EFAULT;
> + if(copy_to_user(optval,ipn_node->descr,len))
> + return -EFAULT;
> + return 0;
> + }
> + } else {
> + int val=-2;
> + switch (optname) {
> + case IPN_SO_PORT:
> + val=ipn_node->portno;
> + break;
> + case IPN_SO_MTU:
> + if (ipnn)
> + val=ipnn->mtu;
> + else if (ipn_node->pbp)
> + val=ipn_node->pbp->mtu;
> + break;
> + case IPN_SO_NUMNODES:
> + if (ipnn)
> + val=ipnn->maxports;
> + else if (ipn_node->pbp)
> + val=ipn_node->pbp->maxports;
> + break;
> + case IPN_SO_MSGPOOLSIZE:
> + if (ipnn)
> + val=ipnn->msgpool_size;
> + else if (ipn_node->pbp)
> + val=ipn_node->pbp->msgpoolsize;
> + break;
> + case IPN_SO_FLAGS:
> + if (ipnn)
> + val=ipnn->flags;
> + else if (ipn_node->pbp)
> + val=ipn_node->pbp->flags;
> + break;
> + case IPN_SO_MODE:
> + if (ipnn)
> + val=-1;
> + else if (ipn_node->pbp)
> + val=ipn_node->pbp->mode;
> + break;
> + }
> + if (val < -1)
> + return -EINVAL;
> + else {
> + if (len < sizeof(int))
> + return -EOVERFLOW;
> + else {
> + len = sizeof(int);
> + if(put_user(len, optlen))
> + return -EFAULT;
> + if(copy_to_user(optval,&val,len))
> + return -EFAULT;
> + return 0;
> + }
> + }
> + }
> + }
> +}
> +
> +/* BROADCAST/HUB implementation */
> +
> +static int ipn_bcast_newport(struct ipn_node *newport) {
> + struct ipn_network *ipnn=newport->ipn;
> + int i;
> + for (i=0;i<ipnn->maxports;i++) {
> + if (ipnn->connport[i] == NULL)
> + return i;
> + }
> + return -1;
> +}
> +
> +static int ipn_bcast_handlemsg(struct ipn_node *from,
> + struct msgpool_item *msgitem){
> + struct ipn_network *ipnn=from->ipn;
> +
> + struct ipn_node *ipn_node;
> + list_for_each_entry(ipn_node, &ipnn->connectqueue, nodelist) {
> + if (ipn_node != from)
> + ipn_proto_sendmsg(ipn_node,msgitem);
> + }
> + return 0;
> +}
> +
> +static void ipn_null_delport(struct ipn_node *oldport) {}
> +static void ipn_null_postnewport(struct ipn_node *newport) {}
> +static void ipn_null_predelport(struct ipn_node *oldport) {}
> +static int ipn_null_newnet(struct ipn_network *newnet) {return 0;}
> +static int ipn_null_resizenet(struct ipn_network *net,int oldsize,int newsize) {
> + return 0;}
> +static void ipn_null_delnet(struct ipn_network *oldnet) {}
> +static int ipn_null_setsockopt(struct ipn_node *port,int optname,
> + char __user *optval, int optlen) {return -EOPNOTSUPP;}
> +static int ipn_null_getsockopt(struct ipn_node *port,int optname,
> + char __user *optval, int *optlen) {return -EOPNOTSUPP;}
> +static int ipn_null_ioctl(struct ipn_node *port,unsigned int request,
> + unsigned long arg) {return -EOPNOTSUPP;}
> +
> +/* Protocol Registration/deregistration */
> +
> +void ipn_init_protocol(struct ipn_protocol *p)
> +{
> + if (p->ipn_p_delport == NULL) p->ipn_p_delport=ipn_null_delport;
> + if (p->ipn_p_postnewport == NULL) p->ipn_p_postnewport=ipn_null_postnewport;
> + if (p->ipn_p_predelport == NULL) p->ipn_p_predelport=ipn_null_predelport;
> + if (p->ipn_p_newnet == NULL) p->ipn_p_newnet=ipn_null_newnet;
> + if (p->ipn_p_resizenet == NULL) p->ipn_p_resizenet=ipn_null_resizenet;
> + if (p->ipn_p_delnet == NULL) p->ipn_p_delnet=ipn_null_delnet;
> + if (p->ipn_p_setsockopt == NULL) p->ipn_p_setsockopt=ipn_null_setsockopt;
> + if (p->ipn_p_getsockopt == NULL) p->ipn_p_getsockopt=ipn_null_getsockopt;
> + if (p->ipn_p_ioctl == NULL) p->ipn_p_ioctl=ipn_null_ioctl;
> +}
> +
> +int ipn_proto_register(int protocol,struct ipn_protocol *ipn_service)
> +{
> + int rv=0;
> + if (ipn_service->ipn_p_newport == NULL ||
> + ipn_service->ipn_p_handlemsg == NULL)
> + return -EINVAL;
> + ipn_init_protocol(ipn_service);
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (protocol > 1 && protocol <= IPN_MAX_PROTO) {
> + protocol--;
> + if (ipn_protocol_table[protocol])
> + rv= -EEXIST;
> + else {
> + ipn_service->refcnt=0;
> + ipn_protocol_table[protocol]=ipn_service;
> + printk(KERN_INFO "IPN: Registered protocol %d\n",protocol+1);
> + }
> + } else
> + rv= -EINVAL;
> + up(&ipn_glob_mutex);
> + return rv;
> +}
> +
> +int ipn_proto_deregister(int protocol)
> +{
> + int rv=0;
> + if (down_interruptible(&ipn_glob_mutex))
> + return -ERESTARTSYS;
> + if (protocol > 1 && protocol <= IPN_MAX_PROTO) {
> + protocol--;
> + if (ipn_protocol_table[protocol]) {
> + if (ipn_protocol_table[protocol]->refcnt == 0) {
> + ipn_protocol_table[protocol]=NULL;
> + printk(KERN_INFO "IPN: Unregistered protocol %d\n",protocol+1);
> + } else
> + rv=-EADDRINUSE;
> + } else
> + rv= -ENOENT;
> + } else
> + rv= -EINVAL;
> + up(&ipn_glob_mutex);
> + return rv;
> +}
> +
> +/* MAIN SECTION */
> +/* Module constructor/destructor */
> +static struct net_proto_family ipn_family_ops = {
> + .family = PF_IPN,
> + .create = ipn_create,
> + .owner = THIS_MODULE,
> +};
> +
> +/* IPN constructor */
> +static int ipn_init(void)
> +{
> + int rc;
> +
> + ipn_init_protocol(&ipn_bcast);
> + ipn_network_cache=kmem_cache_create("ipn_network",sizeof(struct ipn_network),0,0,NULL);
> + if (!ipn_network_cache) {
> + printk(KERN_CRIT "%s: Cannot create ipn_network SLAB cache!\n",
> + __FUNCTION__);
> + rc=-ENOMEM;
> + goto out;
> + }
> +
> + ipn_node_cache=kmem_cache_create("ipn_node",sizeof(struct ipn_node),0,0,NULL);
> + if (!ipn_node_cache) {
> + printk(KERN_CRIT "%s: Cannot create ipn_node SLAB cache!\n",
> + __FUNCTION__);
> + rc=-ENOMEM;
> + goto out_net;
> + }
> +
> + ipn_msgitem_cache=kmem_cache_create("ipn_msgitem",sizeof(struct msgitem),0,0,NULL);
> + if (!ipn_msgitem_cache) {
> + printk(KERN_CRIT "%s: Cannot create ipn_msgitem SLAB cache!\n",
> + __FUNCTION__);
> + rc=-ENOMEM;
> + goto out_net_node;
> + }
> +
> + rc=proto_register(&ipn_proto,1);
> + if (rc != 0) {
> + printk(KERN_CRIT "%s: Cannot register the protocol!\n",
> + __FUNCTION__);
> + goto out_net_node_msg;
> + }
> +
> + sock_register(&ipn_family_ops);
> + ipn_netdev_init();
> + printk(KERN_INFO "IPN: Virtual Square Project, University of Bologna 2007\n");
> + return 0;
> +
> +out_net_node_msg:
> + kmem_cache_destroy(ipn_msgitem_cache);
> +out_net_node:
> + kmem_cache_destroy(ipn_node_cache);
> +out_net:
> + kmem_cache_destroy(ipn_network_cache);
> +out:
> + return rc;
> +}
> +
> +/* IPN destructor */
> +static void ipn_exit(void)
> +{
> + ipn_netdev_fini();
> + if (ipn_msgitem_cache)
> + kmem_cache_destroy(ipn_msgitem_cache);
> + if (ipn_node_cache)
> + kmem_cache_destroy(ipn_node_cache);
> + if (ipn_network_cache)
> + kmem_cache_destroy(ipn_network_cache);
> + sock_unregister(PF_IPN);
> + proto_unregister(&ipn_proto);
> + printk(KERN_INFO "IPN removed\n");
> +}
> +
> +module_init(ipn_init);
> +module_exit(ipn_exit);
> +
> +EXPORT_SYMBOL_GPL(ipn_proto_register);
> +EXPORT_SYMBOL_GPL(ipn_proto_deregister);
> +EXPORT_SYMBOL_GPL(ipn_proto_sendmsg);
> +EXPORT_SYMBOL_GPL(ipn_proto_oobsendmsg);
> +EXPORT_SYMBOL_GPL(ipn_msgpool_alloc);
> +EXPORT_SYMBOL_GPL(ipn_msgpool_put);
> diff -Naur linux-2.6.24-rc5/net/ipn/ipn_netdev.c linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.c
> --- linux-2.6.24-rc5/net/ipn/ipn_netdev.c 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.c 2007-12-16 18:53:24.000000000 +0100
> @@ -0,0 +1,276 @@
> +/*
> + * Inter process networking (virtual distributed ethernet) module
> + * Net devices: tap and grab
> + * (part of the View-OS project: wiki.virtualsquare.org)
> + *
> + * Copyright (C) 2007 Renzo Davoli ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * Due to this file being licensed under the GPL there is controversy over
> + * whether this permits you to write a module that #includes this file
> + * without placing your module under the GPL. Please consult a lawyer for
> + * advice before doing this.
> + *
> + * WARNING: THIS CODE IS STILL EXPERIMENTAL
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/socket.h>
> +#include <linux/poll.h>
> +#include <linux/un.h>
> +#include <linux/list.h>
> +#include <linux/mount.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ethtool.h>
> +#include <net/sock.h>
> +#include <net/af_ipn.h>
> +
> +#define DRV_NAME "ipn"
> +#define DRV_VERSION "0.3"
> +
> +static const struct ethtool_ops ipn_ethtool_ops;
> +
> +struct ipntap {
> + struct ipn_node *ipn_node;
> + struct net_device_stats stats;
> +};
> +
> +/* TAP Net device open. */
> +static int ipntap_net_open(struct net_device *dev)
> +{
> + netif_start_queue(dev);
> + return 0;
> +}
> +
> +/* TAP Net device close. */
> +static int ipntap_net_close(struct net_device *dev)
> +{
> + netif_stop_queue(dev);
> + return 0;
> +}
> +
> +static struct net_device_stats *ipntap_net_stats(struct net_device *dev)
> +{
> + struct ipntap *ipntap = netdev_priv(dev);
> + return &ipntap->stats;
> +}
> +
> +/* receive from a TAP */
> +static int ipn_net_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct ipntap *ipntap = netdev_priv(dev);
> + struct ipn_node *ipn_node=ipntap->ipn_node;
> + struct msgpool_item *newmsg;
> + if (!ipn_node || !ipn_node->ipn || skb->len > ipn_node->ipn->mtu)
> + goto drop;
> + newmsg=ipn_msgpool_alloc(ipn_node->ipn);
> + if (!newmsg)
> + goto drop;
> + newmsg->len=skb->len;
> + memcpy(newmsg->data,skb->data,skb->len);
> + ipn_proto_injectmsg(ipntap->ipn_node,newmsg);
> + ipn_msgpool_put(newmsg,ipn_node->ipn);
> + ipntap->stats.tx_packets++;
> + ipntap->stats.tx_bytes += skb->len;
> + kfree_skb(skb);
> + return 0;
> +
> +drop:
> + ipntap->stats.tx_dropped++;
> + kfree_skb(skb);
> + return 0;
> +}
> +
> +/* receive from a GRAB via interface hook */
> +struct sk_buff *ipn_handle_hook(struct ipn_node *ipn_node, struct sk_buff *skb)
> +{
> + char *data=(skb->data)-(skb->mac_len);
> + int len=skb->len+skb->mac_len;
> +
> + if (ipn_node &&
> + ((ipn_node->flags & IPN_NODEFLAG_DEVMASK) == IPN_NODEFLAG_GRAB) &&
> + ipn_node->ipn && len<=ipn_node->ipn->mtu) {
> + struct msgpool_item *newmsg;
> + newmsg=ipn_msgpool_alloc(ipn_node->ipn);
> + if (newmsg) {
> + newmsg->len=len;
> + memcpy(newmsg->data,data,len);
> + ipn_proto_injectmsg(ipn_node,newmsg);
> + ipn_msgpool_put(newmsg,ipn_node->ipn);
> + }
> + }
> +
> + return (skb);
> +}
> +
> +static void ipntap_setup(struct net_device *dev)
> +{
> + dev->open = ipntap_net_open;
> + dev->hard_start_xmit = ipn_net_xmit;
> + dev->stop = ipntap_net_close;
> + dev->get_stats = ipntap_net_stats;
> + dev->ethtool_ops = &ipn_ethtool_ops;
> +}
> +
> +
> +struct net_device *ipn_netdev_alloc(struct net *net,int type, char *name, int *err)
> +{
> + struct net_device *dev=NULL;
> + *err=0;
> + if (!name || *name==0)
> + name="ipn%d";
> + switch (type) {
> + case IPN_NODEFLAG_TAP:
> + dev=alloc_netdev(sizeof(struct ipntap), name, ipntap_setup);
> + if (!dev)
> + *err= -ENOMEM;
> + ether_setup(dev);
> + /* this commented code is similar to tuntap MAC assignment.
> + * why tuntap does not use the random_ether_addr?
> + *(u16 *)dev->dev_addr = htons(0x00FF);
> + get_random_bytes(dev->dev_addr + sizeof(u16), 4);*/
> + random_ether_addr((u8 *)&dev->dev_addr);
> + break;
> + case IPN_NODEFLAG_GRAB:
> + dev=dev_get_by_name(net,name);
> + if (dev) {
> + if (dev->flags & IFF_LOOPBACK)
> + *err= -EINVAL;
> + else if (rcu_dereference(dev->ipn_port) != NULL)

This one requires either rcu_read_lock() or the update-side lock. In
theory, you could omit rcu_dereference() given that you are only comparing
to NULL, but readability is greatly enhanced by marking the access anyway.

That is, assuming that you are actually using RCU here (I don't see any
sign of rcu_read_lock() or similar primitive, so I have doubts).
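
For illustration, here is a minimal userspace sketch of the reader-side
pattern described above. It is not kernel code: every model_* name is a
hypothetical stand-in, C11 atomics take the place of the real RCU
primitives, and the model is single-threaded. It only shows the shape of
the access, with the dereference bracketed by a read-side critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-ins, NOT the kernel primitives.
 * All model_* names are hypothetical. */
static int model_rcu_nesting;	/* depth of the modelled read-side section */

static void model_rcu_read_lock(void)
{
	model_rcu_nesting++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_nesting > 0);
	model_rcu_nesting--;
}

struct ipn_node { int portno; };
struct net_device { _Atomic(struct ipn_node *) ipn_port; };

/* Models rcu_dereference(): only legal inside a read-side section. */
static struct ipn_node *model_rcu_dereference(struct net_device *dev)
{
	assert(model_rcu_nesting > 0);
	return atomic_load_explicit(&dev->ipn_port, memory_order_acquire);
}

/* Models rcu_assign_pointer(): publish the pointer with release ordering. */
static void model_rcu_assign_pointer(struct net_device *dev,
				     struct ipn_node *node)
{
	atomic_store_explicit(&dev->ipn_port, node, memory_order_release);
}

/* The "already grabbed?" test, with the access marked and bracketed. */
static int model_dev_is_grabbed(struct net_device *dev)
{
	int busy;

	model_rcu_read_lock();
	busy = model_rcu_dereference(dev) != NULL;
	model_rcu_read_unlock();
	return busy;
}
```

In the kernel the same test could instead be made under the update-side
lock, in which case the read-side bracket is unnecessary.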

> + *err= -EBUSY;
> + if (*err)
> + dev=NULL;
> + }
> + break;
> + }
> + return dev;
> +}
> +
> +int ipn_netdev_activate(struct ipn_node *ipn_node)
> +{
> + int rv=-EINVAL;
> + switch (ipn_node->flags & IPN_NODEFLAG_DEVMASK) {
> + case IPN_NODEFLAG_TAP:
> + {
> + struct ipntap *ipntap=netdev_priv(ipn_node->dev);
> + ipntap->ipn_node=ipn_node;
> + rtnl_lock();
> + if ((rv=register_netdevice(ipn_node->dev)) == 0)
> + rcu_assign_pointer(ipn_node->dev->ipn_port, ipn_node);

Does rtnl_lock() imply the ipnn_mutex? If not, does the caller acquire
ipnn_mutex? Or do the other rcu_assign_pointer() calls that assign to
ipn_port also hold RTNL?

> + rtnl_unlock();
> + if (rv) {/* error! */
> + ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
> + free_netdev(ipn_node->dev);
> + }
> + }
> + break;
> + case IPN_NODEFLAG_GRAB:
> + rtnl_lock();
> + rcu_assign_pointer(ipn_node->dev->ipn_port, ipn_node);

Ditto.

> + dev_set_promiscuity(ipn_node->dev,1);
> + rtnl_unlock();
> + rv=0;
> + break;
> + }
> + return rv;
> +}
> +
> +void ipn_netdev_close(struct ipn_node *ipn_node)
> +{
> + switch (ipn_node->flags & IPN_NODEFLAG_DEVMASK) {
> + case IPN_NODEFLAG_TAP:
> + ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
> + rtnl_lock();
> + unregister_netdevice(ipn_node->dev);
> + rtnl_unlock();
> + free_netdev(ipn_node->dev);
> + break;
> + case IPN_NODEFLAG_GRAB:
> + ipn_node->flags &= ~IPN_NODEFLAG_DEVMASK;
> + rtnl_lock();
> + rcu_assign_pointer(ipn_node->dev->ipn_port, NULL);

Ditto.

> + dev_set_promiscuity(ipn_node->dev,-1);
> + rtnl_unlock();
> + break;
> + }
> +}
> +
> +void ipn_netdev_sendmsg(struct ipn_node *to,struct msgpool_item *msg)
> +{
> + struct sk_buff *skb;
> + struct net_device *dev=to->dev;
> + struct ipntap *ipntap=netdev_priv(dev);
> +
> + if (msg->len > dev->mtu)
> + return;
> + skb=alloc_skb(msg->len+NET_IP_ALIGN,GFP_KERNEL);
> + if (!skb) {
> + ipntap->stats.rx_dropped++;
> + return;
> + }
> + memcpy(skb_put(skb,msg->len),msg->data,msg->len);
> + switch (to->flags & IPN_NODEFLAG_DEVMASK) {
> + case IPN_NODEFLAG_TAP:
> + skb->protocol = eth_type_trans(skb, dev);
> + netif_rx(skb);
> + ipntap->stats.rx_packets++;
> + ipntap->stats.rx_bytes += msg->len;
> + break;
> + case IPN_NODEFLAG_GRAB:
> + skb->dev = dev;
> + dev_queue_xmit(skb);
> + break;
> + }
> +}
> +
> +/* ethtool interface */
> +
> +static int ipn_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
> +{
> + cmd->supported = 0;
> + cmd->advertising = 0;
> + cmd->speed = SPEED_10;
> + cmd->duplex = DUPLEX_FULL;
> + cmd->port = PORT_TP;
> + cmd->phy_address = 0;
> + cmd->transceiver = XCVR_INTERNAL;
> + cmd->autoneg = AUTONEG_DISABLE;
> + cmd->maxtxpkt = 0;
> + cmd->maxrxpkt = 0;
> + return 0;
> +}
> +
> +static void ipn_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
> +{
> + strcpy(info->driver, DRV_NAME);
> + strcpy(info->version, DRV_VERSION);
> + strcpy(info->fw_version, "N/A");
> +}
> +
> +static const struct ethtool_ops ipn_ethtool_ops = {
> + .get_settings = ipn_get_settings,
> + .get_drvinfo = ipn_get_drvinfo,
> + /* not implemented (yet?)
> + .get_msglevel = ipn_get_msglevel,
> + .set_msglevel = ipn_set_msglevel,
> + .get_link = ipn_get_link,
> + .get_rx_csum = ipn_get_rx_csum,
> + .set_rx_csum = ipn_set_rx_csum */
> +};
> +
> +int ipn_netdev_init(void)
> +{
> + ipn_handle_frame_hook=ipn_handle_hook;
> + return 0;
> +}
> +
> +void ipn_netdev_fini(void)
> +{
> + ipn_handle_frame_hook=NULL;
> +}
> diff -Naur linux-2.6.24-rc5/net/ipn/ipn_netdev.h linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.h
> --- linux-2.6.24-rc5/net/ipn/ipn_netdev.h 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc5-ipn/net/ipn/ipn_netdev.h 2007-12-16 16:30:04.000000000 +0100
> @@ -0,0 +1,47 @@
> +#ifndef _IPN_NETDEV_H
> +#define _IPN_NETDEV_H
> +/*
> + * Inter process networking (virtual distributed ethernet) module
> + * Net devices: tap and grab
> + * (part of the View-OS project: wiki.virtualsquare.org)
> + *
> + * Copyright (C) 2007 Renzo Davoli ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * Due to this file being licensed under the GPL there is controversy over
> + * whether this permits you to write a module that #includes this file
> + * without placing your module under the GPL. Please consult a lawyer for
> + * advice before doing this.
> + *
> + * WARNING: THIS CODE IS ALREADY EXPERIMENTAL
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/socket.h>
> +#include <linux/poll.h>
> +#include <linux/un.h>
> +#include <linux/list.h>
> +#include <linux/mount.h>
> +#include <linux/etherdevice.h>
> +#include <linux/if_bridge.h>
> +#include <net/sock.h>
> +#include <net/af_ipn.h>
> +
> +struct net_device *ipn_netdev_alloc(struct net *net,int type, char *name, int *err);
> +int ipn_netdev_activate(struct ipn_node *ipn_node);
> +void ipn_netdev_close(struct ipn_node *ipn_node);
> +void ipn_netdev_sendmsg(struct ipn_node *to,struct msgpool_item *msg);
> +int ipn_netdev_init(void);
> +void ipn_netdev_fini(void);
> +
> +inline struct ipn_node *ipn_netdev2node(struct net_device *dev)
> +{
> + return rcu_dereference(dev->ipn_port);

This call seems to always be protected by ipnn_mutex. So the
rcu_dereference() is OK, but not absolutely required.

> +}
> +#endif

2007-12-17 20:17:52

by David Lang

Subject: Re: [PATCH 0/1] IPN: Inter Process Networking

On Mon, 17 Dec 2007, Ludovico Gardenghi wrote:

> On Mon, Dec 17, 2007 at 04:10:19AM -0800, [email protected] wrote:
>
>> if you are talking network connections between virtual systems, then the
>> existing tap interfaces would seem to do everything you are looking for. you
>> can add them to bridges, route between them, filter traffic between them
>> (at whatever layer you want with netfilter), use multicast, etc as you
>> would any real interface.
>>
>> if, however, you are talking about non-network communications (your example
>> of sending raw video frames across the interface), and want multiple
>> processes to receive them, this sounds like exactly the thing that splice
>> was designed to do, distribute data to multiple recipients simultaneously
>> and efficiently.
>
> I'll try to explain.
>
> Our first interest was to be able to interconnect virtual, real, and partial
> virtual machines. We developed VDE for this, it's a user-level L2
> switch. Specific as it may be, it's quite popular as a simple but
> flexible tool. It can interconnect UML, Qemu, UMView, slirp, everything that
> can be connected to a tap interface, etc.
>
> So, you say, it's a networking issue and we could live with tun/tap.
> There's a major point here: at present, dealing with tun/tap, bridges,
> routing is quite difficult if you are a *regular* user with *no*
> capabilities at all. You have tun/tap persistency and association to a
> specific user (or group, recently), at most. That's good - we don't want
> regular users to mess with global networking rules and settings.
>
> Think of a bunch of heterogeneous virtual machines, partial virtual
> machines (i.e. VMs where only a subset of system calls may be
> virtualized or not depending on the parameters - that's the case of
> View-OS) that must be interconnected and that may or may not have a
> connection to a real network interface (maybe via a tunnel towards a
> different machine). There's no need for administrator intervention here.
> Why should a user have to ask root to create lots of tap interfaces for
> him, bind them in a bridge and set up filtering/routing rules? What
> would the list of interfaces become when different users asked for the
> same thing at the same time?
>
> You could define a specific interconnecting bus, but we already have
> it: ethernet. VDE helps here, as it allows regular users to build
> distributed ethernet networks.
>
> VDE works fine, but at present often results in a bottleneck because of
> the high number of user-processes involved and user-kernel-user switches
> needed in order to transfer a single ethernet frame. Moving the core
> inside the kernel would limit this problem and result in faster
> communication with still no need for root intervention or global
> namespace messing. (we're thinking if something can be done working with
> containers or similar structures, both for networking and partial
> virtualization, but that's another topic).

so it sounds like the real issue you are trying to deal with is that only
root is allowed to make changes to the networking configuration, and you
want to allow non-root users to make changes.

in doing this you started by duplicating the kernel networking
functionality into userspace (your userspace L2 switch) and are running
into performance problems, so you are trying to push this into the kernel
to reduce context switches.

besides your approach I see two other options on their way into the
kernel.

1. no changes, run your switch in a VM and your users (with their group
permissions) connect their VM interfaces to the interfaces of the VM
running the switch/filtering. this allows them 'root' inside the VM where
they can make all these changes.

this may have the same performance problems as your current userspace
switch.

2. networking virtualization. there is work being done to be able to have
what would be essentially multiple networking stacks on a machine to allow
a VM/container to control some things without having to go through the
tun/tap interface. This would allow a user to change the filtering rules
without the changes being global.

however, note that if the VM's are more than just a test-bed and actually
need to talk to the outside world, at some point they will need to connect
to the real interfaces, and making that connection should still require
superuser privileges on the master kernel.

besides, using the standard networking stack has the advantage that if
you end up needing to spread your VM's across multiple machines the
support is already there, where adding a new IPC mechanism will require
figuring out how to extend that mechanism across machines.

It also doesn't require the applications to be coded specifically for your
mechanism. they just use standard networking API's and the virtual
connections happen for them.

> So we started thinking how to use existing kernel structures, and we
> concluded that:
>
> - no existing kernel structures appeared to be optimal for this work;
> - if we had to design a new structure, it would be more useful to
> make it as general as we could.
>
> At present we're still focused on networking and other applications are
> just examples, but we thought that adding a general extensible multipoint
> IPC family is quite better than adding the most specific solution to our
> current problem.
>
> Maybe people with experience in other fields may tell us if there are
> other problems that can be resolved, or optimized, or simply made
> simpler, with IPN. Maybe our proposal is not the best as for interface
> and semantics. But we feel that it may fill an "empty space" in the
> available IPC mechanisms with a quite simple but powerful approach.

I'm not the person to approve or disapprove adding this to the kernel, but
when something similar was proposed a week or so ago, the reaction from
many kernel developers was basically the same as I articulated. While the
explanation this time is far more complete, I don't think it presents a
compelling case for why this needs to be added instead of using existing
mechanisms.

the fact that the answer to "why not use splice" has been "it doesn't
work for networking", while the answer to "why not use tun/tap" has been
"we may want to use it for non-networking things in the future", makes it
seem like you are playing both ends against the middle.

in this message you have articulated a new issue (the fact that tun/tap
will do everything you want, you just want to be able to do this without
needing to have superuser privileges). This is a valid issue, but there
are other options on the way that may address this.

>> for a new family to be valuable, you need to show what it does that isn't
>> available in existing families.
>
> Is it "more acceptable" to add a new address family or to add features
> to existing ones? (my question is purely informative, I don't want to
> sound sarcastic or whatever) For instance, someone proposed "let's just
> add access control to the netlink family". It seems like tough work.

usually it seems to be to add features to an existing family, especially
if the features that need to be added are virtualization ones.

> You proposed splice, other have proposed multicast or netlink.

and the different answers have been to different problems.

if you need IP type routing and filtering then multicast over tun/tap is
probably a far better answer.

if you need unstructured communication between processes then
netlink or splice are the right answers (which one is right for your
application depends on what you are trying to do)

> If I have
> understood correctly, splice helps in copying data to different
> destinations in a very fast way. But it needs a userspace program that
> receives data, iterates on fds and splices the data "out", calling a
> syscall for each destination. syscall calling may have become very fast
> but we still notice slowdowns due to the reasons I've explained before.

but receiving data into a program from any source (pipe, kernel-space
networking, userspace networking) requires system calls to transfer the
data. so saying that receiving data from a pipe involves overhead on all
receiving processes is a non-issue.
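
As an illustration of the per-destination system calls discussed above,
here is a minimal userspace sketch. It is not taken from the thread or
from VDE. It assumes a current Linux kernel where tee(2) and pipe-to-pipe
splice(2) are available, and the helper name fan_out() is hypothetical.
tee() duplicates data between pipes without consuming it and the final
splice() consumes it, so one chunk reaches n destinations with n system
calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the next chunk (at most len bytes) queued in
 * the pipe `in` to each of the n pipes in out[].
 *
 * tee() duplicates pipe data without consuming it, so it serves the first
 * n-1 destinations; the last destination uses splice(), which consumes the
 * data from `in`. Returns the number of bytes delivered to every
 * destination, or -1 on error or on a short copy (a real distributor
 * would retry instead of giving up).
 */
static ssize_t fan_out(int in, const int *out, int n, size_t len)
{
    ssize_t chunk = -1;

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        ssize_t moved;

        if (i < n - 1)
            moved = tee(in, out[i], len, 0);
        else
            moved = splice(in, NULL, out[i], NULL, len, 0);
        if (moved <= 0)
            return -1;
        if (chunk < 0) {
            /* First destination fixes the chunk size for the rest. */
            chunk = moved;
            len = (size_t)moved;
        } else if (moved != chunk) {
            return -1;
        }
    }
    return chunk;
}
```

A caller would create the pipes with pipe(2), write one frame into the
input pipe, and call fan_out(in_rd, outs, n, len). Each receiver still
needs its own read() to pick the data up, which is the cost discussed
above.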

> --- (the following is not related to IPN but i wanted to answer this too)
<snipped description of ptrace/utrace>
thanks for the explanation.

David Lang