Le 29/09/2014 20:43, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>> <[email protected]> wrote:
>>>>> Nicolas Dichtel <[email protected]> writes:
>>>>>
>>>>>> The goal of this serie is to be able to multicast netlink messages with an
>>>>>> attribute that identify a peer netns.
>>>>>> This is needed by the userland to interpret some informations contained in
>>>>>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>>>>>> of x-netns netdevice (see also
>>>>>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>>>>>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>>>>
>>>>> I want say that the problem addressed by patch 3/5 of this series is a
>>>>> fundamentally valid problem. We have network objects spanning network
>>>>> namespaces and it would be very nice to be able to talk about them in
>>>>> netlink, and file descriptors are too local and argubably too heavy
>>>>> weight for netlink quires and especially for netlink broadcast messages.
>>>>>
>>>>> Furthermore the concept of ineternal concept of peernet2id seems valid.
>>>>>
>>>>> However what you do not address is a way for CRIU (aka process
>>>>> migration) to be able to restore these ids after process migration.
>>>>> Going farther it looks like you are actively breaking process migration
>>>>> at this time, making this set of patches a no-go.
>> Ok, I will look more deeply into CRIU.
>>
>>>>>
>>>>> When adding a new form of namespace id CRIU patches are just about
>>>>> as necessary as iproute patches.
>> Noted.
>
>
>
>>>>> That does not describe what you have actually implemented in the
>>>>> patches.
>>>>>
>>>>> I see two ways to go with this.
>>>>>
>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>> network namespaces. The table would need to be populated manually by
>>>>> the likes of ip netns add.
>>>>>
>>>>> That flips the order of assignment and makes this idea solid.
>> I have a preference for this solution, because it allows to have a full
>> broadcast messages. When you have a lot of network interfaces (> 10k),
>> it saves a lot of time to avoid another request to get all informations.
>
> My practical question is how often does it happen that we care?
In fact, I don't think that scenarii with a lot of netns have a full mesh of
x-netns interfaces. It will be more one "link" netns with the physical
interface and all other with one interface with the link part in this "link"
netns. Hence, only one nsid is needing in each netns.
>
>>>>> Unfortunately in the case of a fully referencing mesh of N network
>>>>> namespaces such a mesh winds up taking O(N^2) space, which seems
>>>>> undesirable.
>> Memory consumption vs performances ;-)
>> In fact, when you have a lot of netns, you already should have some memory
>> available (at least N lo interfaces + N interfaces (veth or a x-netns
>> interface)). I'm not convinced that this is really an obstacle.
>
> I would have to see how it all fits together. O(N^2) grows a lot faster
> that N. So after a point it isn't in the same ballpark of memory
> consumption.
>
>>> broadcast message business, and only care about the remote namespace for
>>> unicast messages. Putting the work in an infrequently used slow path
>>> instead of a comparitively common path gives us much more freedom in
>>> the implementation.
>> I think it's better to have a full netlink messages, instead a partial one.
>> There is already a lot of attributes added for each rtnl interface messages to
>> be sure to describe all parameters of these interfaces.
>> And if the user don't care about ids (user has not set any id with iproute2),
>> we can just add the same attribute with id 0 (let's say it's a reserved id) to
>> indicate that the link part of this interface is in another netns.
>
> I imagine an id like that is something we would want ip netns add to
> set, and probably set in all existing network namespaces as well.
>
>> The great benefit of your first proposal is that the ids are set by the
>> userspace and thus it allows a high flexibility.
>>
>> Would you accept a patch that implements this first solution?
>
> I would not fundamentally reject it. I would really like to make
> certain we think through how it will be used and what the practical
> benefits are. Depending on how it is used the data structure could
> be a killer or it could be a case where we see how to manage it and
> simply don't care.
I will send a v3, so we can talk about it.
Thank you,
Nicolas
This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In fact, it allows to have a symetry between
get and set rtnl messages.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
net/core/rtnetlink.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1b9329512496..57959a85ed2c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_NUM_RX_QUEUES] = { .type = NLA_U32 },
[IFLA_PHYS_PORT_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
[IFLA_CARRIER_CHANGES] = { .type = NLA_U32 }, /* ignored */
+ [IFLA_LINK_NETNSID] = { .type = NLA_S32 },
};
static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1983,7 +1984,7 @@ replay:
struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 0];
struct nlattr **data = NULL;
struct nlattr **slave_data = NULL;
- struct net *dest_net;
+ struct net *dest_net, *link_net = NULL;
if (ops) {
if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
@@ -2089,7 +2090,18 @@ replay:
if (IS_ERR(dest_net))
return PTR_ERR(dest_net);
- dev = rtnl_create_link(dest_net, ifname, name_assign_type, ops, tb);
+ if (tb[IFLA_LINK_NETNSID]) {
+ int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
+
+ link_net = get_net_ns_by_id(dest_net, id);
+ if (link_net == NULL) {
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ dev = rtnl_create_link(link_net ? : dest_net, ifname,
+ name_assign_type, ops, tb);
if (IS_ERR(dev)) {
err = PTR_ERR(dev);
goto out;
@@ -2117,9 +2129,16 @@ replay:
}
}
err = rtnl_configure_link(dev, ifm);
- if (err < 0)
+ if (err < 0) {
unregister_netdevice(dev);
+ goto out;
+ }
+
+ if (link_net)
+ err = dev_change_net_namespace(dev, dest_net, ifname);
out:
+ if (link_net)
+ put_net(link_net);
put_net(dest_net);
return err;
}
--
2.1.0
This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels). When there is no id,
we put NETNSA_NSINDEX_UNKNOWN into this attribute to indicate to userland that
the link netns is different from the interface netns. Hence, userland knows that
some information like IFLA_LINK are not interpretable.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
include/net/rtnetlink.h | 2 ++
include/uapi/linux/if_link.h | 1 +
net/core/rtnetlink.c | 13 +++++++++++++
3 files changed, 16 insertions(+)
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index e21b9f9653c0..6c6d5393fc34 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -46,6 +46,7 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
* to create when creating a new device.
* @get_num_rx_queues: Function to determine number of receive queues
* to create when creating a new device.
+ * @get_link_net: Function to get the i/o netns of the device
*/
struct rtnl_link_ops {
struct list_head list;
@@ -93,6 +94,7 @@ struct rtnl_link_ops {
int (*fill_slave_info)(struct sk_buff *skb,
const struct net_device *dev,
const struct net_device *slave_dev);
+ struct net *(*get_link_net)(const struct net_device *dev);
};
int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0bdb77e16875..938c0c02ed2e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
IFLA_CARRIER,
IFLA_PHYS_PORT_ID,
IFLA_CARRIER_CHANGES,
+ IFLA_LINK_NETNSID,
__IFLA_MAX
};
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6882686ca3a..1b9329512496 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -862,6 +862,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
+ nla_total_size(1) /* IFLA_OPERSTATE */
+ nla_total_size(1) /* IFLA_LINKMODE */
+ nla_total_size(4) /* IFLA_CARRIER_CHANGES */
+ + nla_total_size(4) /* IFLA_LINK_NETNSID */
+ nla_total_size(ext_filter_mask
& RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
+ rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
@@ -1134,6 +1135,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
goto nla_put_failure;
}
+ if (dev->rtnl_link_ops &&
+ dev->rtnl_link_ops->get_link_net) {
+ struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+ if (!net_eq(dev_net(dev), link_net)) {
+ int id = peernet2id(dev_net(dev), link_net);
+
+ if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+ goto nla_put_failure;
+ }
+ }
+
if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
goto nla_put_failure;
--
2.1.0
The goal of this serie is to be able to multicast netlink messages with an
attribute that identify a peer netns.
This is needed by the userland to interpret some informations contained in
netlink messages (like IFLA_LINK value, but also some other attributes in case
of x-netns netdevice (see also
http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
Ids of peer netns are set by userland via a new genl messages. These ids are
stored per netns and are local (ie only valid in the netns where they are set).
To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
the id of a peer netns.
Patch 1/4 introduces the netlink API mechanism to set and get these ids.
Patch 2/4 and 3/4 shows an example of how to use these ids in rtnetlink
messages. And patch 4/4 shows that the netlink messages can be symetric between
a GET and a SET.
iproute2 patches are available, I can send them on demand.
Here is a small screenshot to show how it can be used by userland:
$ ip netns add foo
$ ip netns del foo
$ ip netns
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns add foo
$ ip netns exec foo ip netns set init_net 0
$ ip netns
foo
init_net
$ ip netns exec foo ip netns
foo
init_net (id: 0)
$ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
$ ip netns exec foo ip l ls ipip1
6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default
link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0
The parameter link-netnsid shows us where the interface sends and receives
packets (and thus we know where encapsulated addresses are set).
RFCv2 -> RFCv3:
ids are now defined by userland (via netlink). Ids are stored in each netns
(and they are local to this netns).
add get_link_net support for ip6 tunnels
netnsid is now a s32 instead of a u32
RFCv1 -> RFCv2:
remove useless ()
ids are now stored in the user ns. It's possible to get an id for a peer netns
only if the current netns and the peer netns have the same user ns parent.
MAINTAINERS | 1 +
include/net/ip6_tunnel.h | 1 +
include/net/ip_tunnels.h | 1 +
include/net/net_namespace.h | 5 ++
include/net/rtnetlink.h | 2 +
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/if_link.h | 1 +
include/uapi/linux/netns.h | 31 +++++++
net/core/net_namespace.c | 195 +++++++++++++++++++++++++++++++++++++++++++
net/core/rtnetlink.c | 38 ++++++++-
net/ipv4/ip_gre.c | 2 +
net/ipv4/ip_tunnel.c | 8 ++
net/ipv4/ip_vti.c | 1 +
net/ipv4/ipip.c | 1 +
net/ipv6/ip6_gre.c | 1 +
net/ipv6/ip6_tunnel.c | 9 ++
net/ipv6/ip6_vti.c | 1 +
net/ipv6/sit.c | 1 +
net/netlink/genetlink.c | 4 +
19 files changed, 301 insertions(+), 3 deletions(-)
Comments are welcome.
Regards,
Nicolas
With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to netns (ie valid only into one netns).
This will be useful for netlink messages when a x-netns interface is dumped.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
MAINTAINERS | 1 +
include/net/net_namespace.h | 5 ++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/netns.h | 31 +++++++
net/core/net_namespace.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
net/netlink/genetlink.c | 4 +
6 files changed, 237 insertions(+)
create mode 100644 include/uapi/linux/netns.h
diff --git a/MAINTAINERS b/MAINTAINERS
index f8db3c3acc67..8e7f5d668e6a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6278,6 +6278,7 @@ F: include/linux/netdevice.h
F: include/uapi/linux/in.h
F: include/uapi/linux/net.h
F: include/uapi/linux/netdevice.h
+F: include/uapi/linux/netns.h
F: tools/net/
F: tools/testing/selftests/net/
F: lib/random32.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 361d26077196..d8847d978b59 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -59,6 +59,7 @@ struct net {
struct list_head exit_list; /* Use only net_mutex */
struct user_namespace *user_ns; /* Owning user namespace */
+ struct idr netns_ids;
unsigned int proc_inum;
@@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
#define __net_initconst __initconst
#endif
+int peernet2id(struct net *net, struct net *peer);
+struct net *get_net_ns_by_id(struct net *net, int id);
+int netns_genl_register(void);
+
struct pernet_operations {
struct list_head list;
int (*init)(struct net *net);
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70e150ebc6c9..33a0bbfe4736 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -276,6 +276,7 @@ header-y += netfilter_decnet.h
header-y += netfilter_ipv4.h
header-y += netfilter_ipv6.h
header-y += netlink.h
+header-y += netns.h
header-y += netrom.h
header-y += nfc.h
header-y += nfs.h
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 000000000000..8ebb08885795
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,31 @@
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME "netns"
+#define NETNS_GENL_VERSION 0x1
+
+/* Commands */
+enum {
+ NETNS_CMD_UNSPEC,
+ NETNS_CMD_NEWID,
+ NETNS_CMD_GETID,
+ __NETNS_CMD_MAX,
+};
+
+#define NETNS_CMD_MAX (__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+ NETNSA_NONE,
+#define NETNSA_NSINDEX_UNKNOWN -1
+ NETNSA_NSID,
+ NETNSA_PID,
+ NETNSA_FD,
+ __NETNSA_MAX,
+};
+
+#define NETNSA_MAX (__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7f155175bba8..4a5680ed42fb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -15,6 +15,8 @@
#include <linux/file.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
@@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
}
}
+/* This function is used by idr_for_each(). If net is equal to peer, the
+ * function returns the id so that idr_for_each() stops. Because we cannot
+ * returns the id 0 (idr_for_each() will not stop), we return the magic value
+ * -1 for it.
+ */
+static int net_eq_idr(int id, void *net, void *peer)
+{
+ if (net_eq(net, peer))
+ return id ? : -1;
+ return 0;
+}
+
+/* returns NETNSA_NSINDEX_UNKNOWN if not found */
+int peernet2id(struct net *net, struct net *peer)
+{
+ int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
+
+ ASSERT_RTNL();
+
+ /* Magic value for id 0. */
+ if (id == -1)
+ return 0;
+ if (id == 0)
+ return NETNSA_NSINDEX_UNKNOWN;
+
+ return id;
+}
+
+struct net *get_net_ns_by_id(struct net *net, int id)
+{
+ struct net *peer;
+
+ if (id < 0)
+ return NULL;
+
+ rcu_read_lock();
+ peer = idr_find(&net->netns_ids, id);
+ if (peer)
+ get_net(peer);
+ rcu_read_unlock();
+
+ return peer;
+}
+
/*
* setup_net runs the initializers for the network namespace object.
*/
@@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
atomic_set(&net->passive, 1);
net->dev_base_seq = 1;
net->user_ns = user_ns;
+ idr_init(&net->netns_ids);
#ifdef NETNS_REFCNT_DEBUG
atomic_set(&net->use_count, 0);
@@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry(net, &net_kill_list, cleanup_list) {
list_del_rcu(&net->list);
list_add_tail(&net->exit_list, &net_exit_list);
+ for_each_net(tmp) {
+ int id = peernet2id(tmp, net);
+
+ if (id >= 0)
+ idr_remove(&tmp->netns_ids, id);
+ }
+ idr_destroy(&net->netns_ids);
+
}
rtnl_unlock();
@@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
.exit = net_ns_net_exit,
};
+static struct genl_family netns_genl_family = {
+ .id = GENL_ID_GENERATE,
+ .name = NETNS_GENL_NAME,
+ .version = NETNS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = NETNSA_MAX,
+ .netnsok = true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+ [NETNSA_NONE] = { .type = NLA_UNSPEC },
+ [NETNSA_NSID] = { .type = NLA_S32 },
+ [NETNSA_PID] = { .type = NLA_U32 },
+ [NETNSA_FD] = { .type = NLA_U32 },
+};
+
+static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
+{
+ struct net *net = genl_info_net(info);
+ struct net *peer;
+ int nsid, err;
+
+ if (!info->attrs[NETNSA_NSID])
+ return -EINVAL;
+ nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
+ if (nsid < 0)
+ return -EINVAL;
+
+ if (info->attrs[NETNSA_PID])
+ peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+ else if (info->attrs[NETNSA_FD])
+ peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+ else
+ return -EINVAL;
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ rtnl_lock();
+ if (peernet2id(net, peer) >= 0) {
+ err = -EEXIST;
+ goto out;
+ }
+
+ err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
+ if (err >= 0)
+ err = 0;
+out:
+ rtnl_unlock();
+ put_net(peer);
+ return err;
+}
+
+static int netns_nl_get_size(void)
+{
+ return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+ ;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+ int cmd, struct net *net, struct net *peer)
+{
+ void *hdr;
+ int id;
+
+ hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
+ if (!hdr)
+ return -EMSGSIZE;
+
+ rtnl_lock();
+ id = peernet2id(net, peer);
+ rtnl_unlock();
+ if (nla_put_s32(skb, NETNSA_NSID, id))
+ goto nla_put_failure;
+
+ return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+ genlmsg_cancel(skb, hdr);
+ return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
+{
+ struct net *net = genl_info_net(info);
+ struct sk_buff *msg;
+ int err = -ENOBUFS;
+ struct net *peer;
+
+ if (info->attrs[NETNSA_PID])
+ peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+ else if (info->attrs[NETNSA_FD])
+ peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+ else
+ return -EINVAL;
+
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+ if (!msg) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+ NLM_F_ACK, NETNS_CMD_GETID, net, peer);
+ if (err < 0)
+ goto err_out;
+
+ err = genlmsg_unicast(net, msg, info->snd_portid);
+ goto out;
+
+err_out:
+ nlmsg_free(msg);
+out:
+ put_net(peer);
+ return err;
+}
+
+static struct genl_ops netns_genl_ops[] = {
+ {
+ .cmd = NETNS_CMD_NEWID,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_newid,
+ .flags = GENL_ADMIN_PERM,
+ },
+ {
+ .cmd = NETNS_CMD_GETID,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_getid,
+ .flags = GENL_ADMIN_PERM,
+ },
+};
+
+int netns_genl_register(void)
+{
+ return genl_register_family_with_ops(&netns_genl_family,
+ netns_genl_ops);
+}
+
static int __init net_ns_init(void)
{
struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 76393f2f4b22..c6f39e40c9f3 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1029,6 +1029,10 @@ static int __init genl_init(void)
if (err)
goto problem;
+ err = netns_genl_register();
+ if (err < 0)
+ goto problem;
+
return 0;
problem:
--
2.1.0
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
include/net/ip6_tunnel.h | 1 +
include/net/ip_tunnels.h | 1 +
net/ipv4/ip_gre.c | 2 ++
net/ipv4/ip_tunnel.c | 8 ++++++++
net/ipv4/ip_vti.c | 1 +
net/ipv4/ipip.c | 1 +
net/ipv6/ip6_gre.c | 1 +
net/ipv6/ip6_tunnel.c | 9 +++++++++
net/ipv6/ip6_vti.c | 1 +
net/ipv6/sit.c | 1 +
10 files changed, 26 insertions(+)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593dab6af7..8648519f4555 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
__u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
__u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
+struct net *ip6_tnl_get_link_net(const struct net_device *dev);
static inline void ip6tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 7f538ba6e267..c92a99b5b77e 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -119,6 +119,7 @@ struct ip_tunnel_net {
int ip_tunnel_init(struct net_device *dev);
void ip_tunnel_uninit(struct net_device *dev);
void ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev);
int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
struct rtnl_link_ops *ops, char *devname);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0485ef18d254..c75974986053 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -827,6 +827,7 @@ static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipgre_get_size,
.fill_info = ipgre_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
@@ -841,6 +842,7 @@ static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipgre_get_size,
.fill_info = ipgre_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static int __net_init ipgre_tap_init_net(struct net *net)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index b75b47b0a223..a8ab238d0df4 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -954,6 +954,14 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
}
EXPORT_SYMBOL_GPL(ip_tunnel_dellink);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev)
+{
+ struct ip_tunnel *tunnel = netdev_priv(dev);
+
+ return tunnel->net;
+}
+EXPORT_SYMBOL(ip_tunnel_get_link_net);
+
int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
struct rtnl_link_ops *ops, char *devname)
{
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index e453cb724a95..93862411669c 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -530,6 +530,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
.changelink = vti_changelink,
.get_size = vti_get_size,
.fill_info = vti_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static int __init vti_init(void)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index ea88ab3102a8..406910d04b1b 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -498,6 +498,7 @@ static struct rtnl_link_ops ipip_link_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipip_get_size,
.fill_info = ipip_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct xfrm_tunnel ipip_handler __read_mostly = {
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 9a0a1aafe727..10981f568250 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1659,6 +1659,7 @@ static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
.dellink = ip6gre_dellink,
.get_size = ip6gre_get_size,
.fill_info = ip6gre_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index e01bd0399297..b86d9f4ea5ec 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1699,6 +1699,14 @@ nla_put_failure:
return -EMSGSIZE;
}
+struct net *ip6_tnl_get_link_net(const struct net_device *dev)
+{
+ struct ip6_tnl *tunnel = netdev_priv(dev);
+
+ return tunnel->net;
+}
+EXPORT_SYMBOL(ip6_tnl_get_link_net);
+
static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
[IFLA_IPTUN_LINK] = { .type = NLA_U32 },
[IFLA_IPTUN_LOCAL] = { .len = sizeof(struct in6_addr) },
@@ -1722,6 +1730,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
.dellink = ip6_tnl_dellink,
.get_size = ip6_tnl_get_size,
.fill_info = ip6_tnl_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static struct xfrm6_tunnel ip4ip6_handler __read_mostly = {
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 7f52fd9fa7b0..88e8aadcfac1 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -988,6 +988,7 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
.changelink = vti6_changelink,
.get_size = vti6_get_size,
.fill_info = vti6_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0d4e27466f82..02ef387811be 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1765,6 +1765,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
.get_size = ipip6_get_size,
.fill_info = ipip6_fill_info,
.dellink = ipip6_dellink,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct xfrm_tunnel sit_handler __read_mostly = {
--
2.1.0
Nicolas Dichtel <[email protected]> writes:
> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>> Nicolas Dichtel <[email protected]> writes:
>>
>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>> Andy Lutomirski <[email protected]> writes:
>>>>
>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>> <[email protected]> wrote:
>>>>>> I see two ways to go with this.
>>>>>>
>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>> network namespaces. The table would need to be populated manually by
>>>>>> the likes of ip netns add.
>>>>>>
>>>>>> That flips the order of assignment and makes this idea solid.
>>> I have a preference for this solution, because it allows to have a full
>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>> it saves a lot of time to avoid another request to get all informations.
>>
>> My practical question is how often does it happen that we care?
> In fact, I don't think that scenarii with a lot of netns have a full mesh of
> x-netns interfaces. It will be more one "link" netns with the physical
> interface and all other with one interface with the link part in this "link"
> netns. Hence, only one nsid is needing in each netns.
I will buy that a full mesh is unlikely.
For people doing simulations anything physical has a limited number of
links.
For people wanting all to all connectivity setting up an internal
macvlan (or the equivalent) is likely much simpler and more efficient
that a full mesh.
So the question in my mind is how do we create these identifiers at need
(when we create the cross network namespace links) instead of at network
namespace creation time. I don't see an answer to that in your patches,
and perhaps it obvious.
Eric
On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
<[email protected]> wrote:
> Nicolas Dichtel <[email protected]> writes:
>
>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <[email protected]> writes:
>>>
>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>
>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>> <[email protected]> wrote:
>>>>>>> I see two ways to go with this.
>>>>>>>
>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>> network namespaces. The table would need to be populated manually by
>>>>>>> the likes of ip netns add.
>>>>>>>
>>>>>>> That flips the order of assignment and makes this idea solid.
>>>> I have a preference for this solution, because it allows to have a full
>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>> it saves a lot of time to avoid another request to get all informations.
>>>
>>> My practical question is how often does it happen that we care?
>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>> x-netns interfaces. It will be more one "link" netns with the physical
>> interface and all other with one interface with the link part in this "link"
>> netns. Hence, only one nsid is needing in each netns.
>
> I will buy that a full mesh is unlikely.
>
> For people doing simulations anything physical has a limited number of
> links.
>
> For people wanting all to all connectivity setting up an internal
> macvlan (or the equivalent) is likely much simpler and more efficient
> that a full mesh.
>
> So the question in my mind is how do we create these identifiers at need
> (when we create the cross network namespace links) instead of at network
> namespace creation time. I don't see an answer to that in your patches,
> and perhaps it obvious.
>
I wonder whether part of the problem is that we're thinking about
scoping wrong. What if we made the hierarchy more explicit?
For example, we could give each netns an admin-assigned identifier
(e.g. a 64-bit number, maybe required to be unique, maybe not)
relative to its containing userns. Then we could come up with a way
to identify user namespaces (i.e. inode number relative to containing
user ns, if that's well-defined).
>From user code's perspective, netnses that are in the requester's
userns or its descendents are identified by a path through a (possibly
zero-length) sequence of userns ids followed by a netns id. Netnses
outside the requester's userns hierarchy cannot be named at all.
Would this make sense? It should keep the asymptotic complexity of
everything under control and, for users of very large numbers of
network namespaces with complex routing, it doesn't require a
correspondingly large number of fds. It would have the added benefit
of allowing the same scheme to be used for all the other namespace
types, although it could be a bit odd for pid namespaces, which really
do have their own hierarchy.
--Andy
Nicolas Dichtel <[email protected]> writes:
> With this patch, a user can define an id for a peer netns by providing a FD or a
> PID. These ids are local to netns (ie valid only into one netns).
>
> This will be useful for netlink messages when a x-netns interface is
> dumped.
You have a "id -> struct net *" table but you don't have a
"struct net * -> id" table which looks like it will impact the
performance of peernet2id at scale.
Eric
Andy Lutomirski <[email protected]> writes:
> On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Nicolas Dichtel <[email protected]> writes:
>>
>>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>>> Nicolas Dichtel <[email protected]> writes:
>>>>
>>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>
>>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>>> <[email protected]> wrote:
>>>>>>>> I see two ways to go with this.
>>>>>>>>
>>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>> network namespaces. The table would need to be populated manually by
>>>>>>>> the likes of ip netns add.
>>>>>>>>
>>>>>>>> That flips the order of assignment and makes this idea solid.
>>>>> I have a preference for this solution, because it allows to have a full
>>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>>> it saves a lot of time to avoid another request to get all informations.
>>>>
>>>> My practical question is how often does it happen that we care?
>>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>>> x-netns interfaces. It will be more one "link" netns with the physical
>>> interface and all other with one interface with the link part in this "link"
>>> netns. Hence, only one nsid is needing in each netns.
>>
>> I will buy that a full mesh is unlikely.
>>
>> For people doing simulations anything physical has a limited number of
>> links.
>>
>> For people wanting all to all connectivity setting up an internal
>> macvlan (or the equivalent) is likely much simpler and more efficient
>> that a full mesh.
>>
>> So the question in my mind is how do we create these identifiers at need
>> (when we create the cross network namespace links) instead of at network
>> namespace creation time. I don't see an answer to that in your patches,
>> and perhaps it obvious.
>>
>
> I wonder whether part of the problem is that we're thinking about
> scoping wrong. What if we made the hierarchy more explicit?
>
> For example, we could give each netns an admin-assigned identifier
> (e.g. a 64-bit number, maybe required to be unique, maybe not)
> relative to its containing userns. Then we could come up with a way
> to identify user namespaces (i.e. inode number relative to containing
> user ns, if that's well-defined).
If as suggested we only assign ids when a tunnel (or equivalent) is
created between two network namespaces the space cost is a non-issue.
The ids become at worst a constant factor addition to the cost of the
tunnel.
To keep things simple we may want to assign a free id (if one does not
exist) when we connect a tunnel to a network namespace.
> From user code's perspective, netnses that are in the requester's
> userns or its descendents are identified by a path through a (possibly
> zero-length) sequence of userns ids followed by a netns id. Netnses
> outside the requester's userns hierarchy cannot be named at all.
>
> Would this make sense?
Nope. What happens if I migrate 2 of the 4 network namespaces in a user
namespace? The migration potentially fails. Application migration does
not require user namespace migration.
Eric
On Thu, Oct 2, 2014 at 12:45 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Nicolas Dichtel <[email protected]> writes:
>>>
>>>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>>>> Nicolas Dichtel <[email protected]> writes:
>>>>>
>>>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>>
>>>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I see two ways to go with this.
>>>>>>>>>
>>>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>>> network namespaces. The table would need to be populated manually by
>>>>>>>>> the likes of ip netns add.
>>>>>>>>>
>>>>>>>>> That flips the order of assignment and makes this idea solid.
>>>>>> I have a preference for this solution, because it allows to have a full
>>>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>>>> it saves a lot of time to avoid another request to get all informations.
>>>>>
>>>>> My practical question is how often does it happen that we care?
>>>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>>>> x-netns interfaces. It will be more one "link" netns with the physical
>>>> interface and all other with one interface with the link part in this "link"
>>>> netns. Hence, only one nsid is needing in each netns.
>>>
>>> I will buy that a full mesh is unlikely.
>>>
>>> For people doing simulations anything physical has a limited number of
>>> links.
>>>
>>> For people wanting all to all connectivity setting up an internal
>>> macvlan (or the equivalent) is likely much simpler and more efficient
>>> that a full mesh.
>>>
>>> So the question in my mind is how do we create these identifiers at need
>>> (when we create the cross network namespace links) instead of at network
>>> namespace creation time. I don't see an answer to that in your patches,
>>> and perhaps it obvious.
>>>
>>
>> I wonder whether part of the problem is that we're thinking about
>> scoping wrong. What if we made the hierarchy more explicit?
>>
>> For example, we could give each netns an admin-assigned identifier
>> (e.g. a 64-bit number, maybe required to be unique, maybe not)
>> relative to its containing userns. Then we could come up with a way
>> to identify user namespaces (i.e. inode number relative to containing
>> user ns, if that's well-defined).
>
> If as suggested we only assign ids when a tunnel (or equivalent) is
> created between two network namespaces the space cost is a non-issue.
> The ids become at worst a constant factor addition to the cost of the
> tunnel.
>
> To keep things simple we may want to assign a free id (if one does not
> exist) when we connect a tunnel to a network namespace.
>
>> From user code's perspective, netnses that are in the requester's
>> userns or its descendents are identified by a path through a (possibly
>> zero-length) sequence of userns ids followed by a netns id. Netnses
>> outside the requester's userns hierarchy cannot be named at all.
>>
>> Would this make sense?
>
> Nope. What happens if I migrate 2 of the 4 network namespaces in a user
> namespace? The migration potentially fails. Application migration does
> not require user namespace migration.
Hmm. I guess that, as long as those network namespaces aren't
connected to anything else, migrating like that makes sense and ought
to work. Fair enough.
--Andy
Le 02/10/2014 21:33, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> With this patch, a user can define an id for a peer netns by providing a FD or a
>> PID. These ids are local to netns (ie valid only into one netns).
>>
>> This will be useful for netlink messages when a x-netns interface is
>> dumped.
>
> You have a "id -> struct net *" table but you don't have a
> "struct net * -> id" table which looks like it will impact the
> performance of peernet2id at scale.
It is indirectly stores in 'struct idr'. It can be optimized later, with a
proper algorithm to find quickly this 'struct net *' (hash table? something
else?). A basic algorithm will not be more scalable than the current
idr_for_each().
Le 02/10/2014 21:20, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <[email protected]> writes:
>>>
>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>
>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>> <[email protected]> wrote:
>>>>>>> I see two ways to go with this.
>>>>>>>
>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>> network namespaces. The table would need to be populated manually by
>>>>>>> the likes of ip netns add.
>>>>>>>
>>>>>>> That flips the order of assignment and makes this idea solid.
>>>> I have a preference for this solution, because it allows to have a full
>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>> it saves a lot of time to avoid another request to get all informations.
>>>
>>> My practical question is how often does it happen that we care?
>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>> x-netns interfaces. It will be more one "link" netns with the physical
>> interface and all other with one interface with the link part in this "link"
>> netns. Hence, only one nsid is needing in each netns.
>
> I will buy that a full mesh is unlikely.
>
> For people doing simulations anything physical has a limited number of
> links.
>
> For people wanting all to all connectivity setting up an internal
> macvlan (or the equivalent) is likely much simpler and more efficient
> that a full mesh.
>
> So the question in my mind is how do we create these identifiers at need
> (when we create the cross network namespace links) instead of at network
> namespace creation time. I don't see an answer to that in your patches,
> and perhaps it obvious.
For me, it is the responsability of the user who creates the netns. He should
know what will be done with this new netns, hence he may or may not define an
id. It's also possible to delegate this to the user who will create the tunnel.
In other words, it's part of the configuration.
With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to netns (ie valid only into one netns).
This will be useful for netlink messages when a x-netns interface is dumped.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
MAINTAINERS | 1 +
include/net/net_namespace.h | 5 ++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/netns.h | 38 +++++++++
net/core/net_namespace.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
net/netlink/genetlink.c | 4 +
6 files changed, 244 insertions(+)
create mode 100644 include/uapi/linux/netns.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 43898b1a8a2d..de7e6fcbd5c2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6382,6 +6382,7 @@ F: include/linux/netdevice.h
F: include/uapi/linux/in.h
F: include/uapi/linux/net.h
F: include/uapi/linux/netdevice.h
+F: include/uapi/linux/netns.h
F: tools/net/
F: tools/testing/selftests/net/
F: lib/random32.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index e0d64667a4b3..0f1367a71b81 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -59,6 +59,7 @@ struct net {
struct list_head exit_list; /* Use only net_mutex */
struct user_namespace *user_ns; /* Owning user namespace */
+ struct idr netns_ids;
unsigned int proc_inum;
@@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
#define __net_initconst __initconst
#endif
+int peernet2id(struct net *net, struct net *peer);
+struct net *get_net_ns_by_id(struct net *net, int id);
+int netns_genl_register(void);
+
struct pernet_operations {
struct list_head list;
int (*init)(struct net *net);
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6cad97485bad..d7f49c69585a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -277,6 +277,7 @@ header-y += netfilter_decnet.h
header-y += netfilter_ipv4.h
header-y += netfilter_ipv6.h
header-y += netlink.h
+header-y += netns.h
header-y += netrom.h
header-y += nfc.h
header-y += nfs.h
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 000000000000..2edf129377de
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,38 @@
+/* Copyright (c) 2014 6WIND S.A.
+ * Author: Nicolas Dichtel <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME "netns"
+#define NETNS_GENL_VERSION 0x1
+
+/* Commands */
+enum {
+ NETNS_CMD_UNSPEC,
+ NETNS_CMD_NEWID,
+ NETNS_CMD_GETID,
+ __NETNS_CMD_MAX,
+};
+
+#define NETNS_CMD_MAX (__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+ NETNSA_NONE,
+#define NETNSA_NSINDEX_UNKNOWN -1
+ NETNSA_NSID,
+ NETNSA_PID,
+ NETNSA_FD,
+ __NETNSA_MAX,
+};
+
+#define NETNSA_MAX (__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7f155175bba8..4a5680ed42fb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -15,6 +15,8 @@
#include <linux/file.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
@@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
}
}
+/* This function is used by idr_for_each(). If net is equal to peer, the
+ * function returns the id so that idr_for_each() stops. Because we cannot
+ * returns the id 0 (idr_for_each() will not stop), we return the magic value
+ * -1 for it.
+ */
+static int net_eq_idr(int id, void *net, void *peer)
+{
+ if (net_eq(net, peer))
+ return id ? : -1;
+ return 0;
+}
+
+/* returns NETNSA_NSINDEX_UNKNOWN if not found */
+int peernet2id(struct net *net, struct net *peer)
+{
+ int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
+
+ ASSERT_RTNL();
+
+ /* Magic value for id 0. */
+ if (id == -1)
+ return 0;
+ if (id == 0)
+ return NETNSA_NSINDEX_UNKNOWN;
+
+ return id;
+}
+
+struct net *get_net_ns_by_id(struct net *net, int id)
+{
+ struct net *peer;
+
+ if (id < 0)
+ return NULL;
+
+ rcu_read_lock();
+ peer = idr_find(&net->netns_ids, id);
+ if (peer)
+ get_net(peer);
+ rcu_read_unlock();
+
+ return peer;
+}
+
/*
* setup_net runs the initializers for the network namespace object.
*/
@@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
atomic_set(&net->passive, 1);
net->dev_base_seq = 1;
net->user_ns = user_ns;
+ idr_init(&net->netns_ids);
#ifdef NETNS_REFCNT_DEBUG
atomic_set(&net->use_count, 0);
@@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry(net, &net_kill_list, cleanup_list) {
list_del_rcu(&net->list);
list_add_tail(&net->exit_list, &net_exit_list);
+ for_each_net(tmp) {
+ int id = peernet2id(tmp, net);
+
+ if (id >= 0)
+ idr_remove(&tmp->netns_ids, id);
+ }
+ idr_destroy(&net->netns_ids);
+
}
rtnl_unlock();
@@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
.exit = net_ns_net_exit,
};
+static struct genl_family netns_genl_family = {
+ .id = GENL_ID_GENERATE,
+ .name = NETNS_GENL_NAME,
+ .version = NETNS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = NETNSA_MAX,
+ .netnsok = true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+ [NETNSA_NONE] = { .type = NLA_UNSPEC },
+ [NETNSA_NSID] = { .type = NLA_S32 },
+ [NETNSA_PID] = { .type = NLA_U32 },
+ [NETNSA_FD] = { .type = NLA_U32 },
+};
+
+static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
+{
+ struct net *net = genl_info_net(info);
+ struct net *peer;
+ int nsid, err;
+
+ if (!info->attrs[NETNSA_NSID])
+ return -EINVAL;
+ nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
+ if (nsid < 0)
+ return -EINVAL;
+
+ if (info->attrs[NETNSA_PID])
+ peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+ else if (info->attrs[NETNSA_FD])
+ peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+ else
+ return -EINVAL;
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ rtnl_lock();
+ if (peernet2id(net, peer) >= 0) {
+ err = -EEXIST;
+ goto out;
+ }
+
+ err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
+ if (err >= 0)
+ err = 0;
+out:
+ rtnl_unlock();
+ put_net(peer);
+ return err;
+}
+
+static int netns_nl_get_size(void)
+{
+ return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+ ;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+ int cmd, struct net *net, struct net *peer)
+{
+ void *hdr;
+ int id;
+
+ hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
+ if (!hdr)
+ return -EMSGSIZE;
+
+ rtnl_lock();
+ id = peernet2id(net, peer);
+ rtnl_unlock();
+ if (nla_put_s32(skb, NETNSA_NSID, id))
+ goto nla_put_failure;
+
+ return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+ genlmsg_cancel(skb, hdr);
+ return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
+{
+ struct net *net = genl_info_net(info);
+ struct sk_buff *msg;
+ int err = -ENOBUFS;
+ struct net *peer;
+
+ if (info->attrs[NETNSA_PID])
+ peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+ else if (info->attrs[NETNSA_FD])
+ peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+ else
+ return -EINVAL;
+
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+ if (!msg) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+ NLM_F_ACK, NETNS_CMD_GETID, net, peer);
+ if (err < 0)
+ goto err_out;
+
+ err = genlmsg_unicast(net, msg, info->snd_portid);
+ goto out;
+
+err_out:
+ nlmsg_free(msg);
+out:
+ put_net(peer);
+ return err;
+}
+
+static struct genl_ops netns_genl_ops[] = {
+ {
+ .cmd = NETNS_CMD_NEWID,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_newid,
+ .flags = GENL_ADMIN_PERM,
+ },
+ {
+ .cmd = NETNS_CMD_GETID,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_getid,
+ .flags = GENL_ADMIN_PERM,
+ },
+};
+
+int netns_genl_register(void)
+{
+ return genl_register_family_with_ops(&netns_genl_family,
+ netns_genl_ops);
+}
+
static int __init net_ns_init(void)
{
struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 76393f2f4b22..c6f39e40c9f3 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1029,6 +1029,10 @@ static int __init genl_init(void)
if (err)
goto problem;
+ err = netns_genl_register();
+ if (err < 0)
+ goto problem;
+
return 0;
problem:
--
2.1.0
The goal of this serie is to be able to multicast netlink messages with an
attribute that identify a peer netns.
This is needed by the userland to interpret some informations contained in
netlink messages (like IFLA_LINK value, but also some other attributes in case
of x-netns netdevice (see also
http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
Ids of peer netns are set by userland via a new genl messages. These ids are
stored per netns and are local (ie only valid in the netns where they are set).
To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
the id of a peer netns. Note that it will be possible to add a table (struct net
-> id) later to optimize this lookup if needed.
Patch 1/4 introduces the netlink API mechanism to set and get these ids.
Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
messages. And patch 4/4 shows that the netlink messages can be symetric between
a GET and a SET.
iproute2 patches are available, I can send them on demand.
Here is a small screenshot to show how it can be used by userland.
First, setup netns and required ids:
$ ip netns add foo
$ ip netns del foo
$ ip netns
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns add foo
$ ip netns exec foo ip netns set init_net 0
$ ip netns
foo
init_net
$ ip netns exec foo ip netns
foo
init_net (id: 0)
Now, add and display an ipip tunnel, with its link part in init_net (id 0 in
netns foo) and the netdevice in foo:
$ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
$ ip netns exec foo ip l ls ipip1
6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default
link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0
The parameter link-netnsid shows us where the interface sends and receives
packets (and thus we know where encapsulated addresses are set).
RFCv3 -> v4:
rebase on net-next
add copyright text in the new netns.h file
RFCv2 -> RFCv3:
ids are now defined by userland (via netlink). Ids are stored in each netns
(and they are local to this netns).
add get_link_net support for ip6 tunnels
netnsid is now a s32 instead of a u32
RFCv1 -> RFCv2:
remove useless ()
ids are now stored in the user ns. It's possible to get an id for a peer netns
only if the current netns and the peer netns have the same user ns parent.
MAINTAINERS | 1 +
include/net/ip6_tunnel.h | 1 +
include/net/ip_tunnels.h | 1 +
include/net/net_namespace.h | 5 ++
include/net/rtnetlink.h | 2 +
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/if_link.h | 1 +
include/uapi/linux/netns.h | 38 +++++++++
net/core/net_namespace.c | 195 +++++++++++++++++++++++++++++++++++++++++++
net/core/rtnetlink.c | 38 ++++++++-
net/ipv4/ip_gre.c | 2 +
net/ipv4/ip_tunnel.c | 8 ++
net/ipv4/ip_vti.c | 1 +
net/ipv4/ipip.c | 1 +
net/ipv6/ip6_gre.c | 1 +
net/ipv6/ip6_tunnel.c | 9 ++
net/ipv6/ip6_vti.c | 1 +
net/ipv6/sit.c | 1 +
net/netlink/genetlink.c | 4 +
19 files changed, 308 insertions(+), 3 deletions(-)
Comments are welcome.
Regards,
Nicolas
This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In fact, it allows to have a symetry between
get and set rtnl messages.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
net/core/rtnetlink.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1b9329512496..57959a85ed2c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_NUM_RX_QUEUES] = { .type = NLA_U32 },
[IFLA_PHYS_PORT_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
[IFLA_CARRIER_CHANGES] = { .type = NLA_U32 }, /* ignored */
+ [IFLA_LINK_NETNSID] = { .type = NLA_S32 },
};
static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1983,7 +1984,7 @@ replay:
struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 0];
struct nlattr **data = NULL;
struct nlattr **slave_data = NULL;
- struct net *dest_net;
+ struct net *dest_net, *link_net = NULL;
if (ops) {
if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
@@ -2089,7 +2090,18 @@ replay:
if (IS_ERR(dest_net))
return PTR_ERR(dest_net);
- dev = rtnl_create_link(dest_net, ifname, name_assign_type, ops, tb);
+ if (tb[IFLA_LINK_NETNSID]) {
+ int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
+
+ link_net = get_net_ns_by_id(dest_net, id);
+ if (link_net == NULL) {
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ dev = rtnl_create_link(link_net ? : dest_net, ifname,
+ name_assign_type, ops, tb);
if (IS_ERR(dev)) {
err = PTR_ERR(dev);
goto out;
@@ -2117,9 +2129,16 @@ replay:
}
}
err = rtnl_configure_link(dev, ifm);
- if (err < 0)
+ if (err < 0) {
unregister_netdevice(dev);
+ goto out;
+ }
+
+ if (link_net)
+ err = dev_change_net_namespace(dev, dest_net, ifname);
out:
+ if (link_net)
+ put_net(link_net);
put_net(dest_net);
return err;
}
--
2.1.0
This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels). When there is no id,
we put NETNSA_NSINDEX_UNKNOWN into this attribute to indicate to userland that
the link netns is different from the interface netns. Hence, userland knows that
some information like IFLA_LINK are not interpretable.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
include/net/rtnetlink.h | 2 ++
include/uapi/linux/if_link.h | 1 +
net/core/rtnetlink.c | 13 +++++++++++++
3 files changed, 16 insertions(+)
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index e21b9f9653c0..6c6d5393fc34 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -46,6 +46,7 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
* to create when creating a new device.
* @get_num_rx_queues: Function to determine number of receive queues
* to create when creating a new device.
+ * @get_link_net: Function to get the i/o netns of the device
*/
struct rtnl_link_ops {
struct list_head list;
@@ -93,6 +94,7 @@ struct rtnl_link_ops {
int (*fill_slave_info)(struct sk_buff *skb,
const struct net_device *dev,
const struct net_device *slave_dev);
+ struct net *(*get_link_net)(const struct net_device *dev);
};
int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 7072d8325016..d2729f63cf01 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
IFLA_CARRIER,
IFLA_PHYS_PORT_ID,
IFLA_CARRIER_CHANGES,
+ IFLA_LINK_NETNSID,
__IFLA_MAX
};
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6882686ca3a..1b9329512496 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -862,6 +862,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
+ nla_total_size(1) /* IFLA_OPERSTATE */
+ nla_total_size(1) /* IFLA_LINKMODE */
+ nla_total_size(4) /* IFLA_CARRIER_CHANGES */
+ + nla_total_size(4) /* IFLA_LINK_NETNSID */
+ nla_total_size(ext_filter_mask
& RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
+ rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
@@ -1134,6 +1135,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
goto nla_put_failure;
}
+ if (dev->rtnl_link_ops &&
+ dev->rtnl_link_ops->get_link_net) {
+ struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+ if (!net_eq(dev_net(dev), link_net)) {
+ int id = peernet2id(dev_net(dev), link_net);
+
+ if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+ goto nla_put_failure;
+ }
+ }
+
if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
goto nla_put_failure;
--
2.1.0
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.
Signed-off-by: Nicolas Dichtel <[email protected]>
---
include/net/ip6_tunnel.h | 1 +
include/net/ip_tunnels.h | 1 +
net/ipv4/ip_gre.c | 2 ++
net/ipv4/ip_tunnel.c | 8 ++++++++
net/ipv4/ip_vti.c | 1 +
net/ipv4/ipip.c | 1 +
net/ipv6/ip6_gre.c | 1 +
net/ipv6/ip6_tunnel.c | 9 +++++++++
net/ipv6/ip6_vti.c | 1 +
net/ipv6/sit.c | 1 +
10 files changed, 26 insertions(+)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593dab6af7..8648519f4555 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
__u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
__u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
+struct net *ip6_tnl_get_link_net(const struct net_device *dev);
static inline void ip6tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 5bc6edeb7143..ce4ff6161fab 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -122,6 +122,7 @@ struct ip_tunnel_net {
int ip_tunnel_init(struct net_device *dev);
void ip_tunnel_uninit(struct net_device *dev);
void ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev);
int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
struct rtnl_link_ops *ops, char *devname);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 12055fdbe716..9e2e29a8c989 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -827,6 +827,7 @@ static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipgre_get_size,
.fill_info = ipgre_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
@@ -841,6 +842,7 @@ static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipgre_get_size,
.fill_info = ipgre_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static int __net_init ipgre_tap_init_net(struct net *net)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 0bb8e141eacc..3e1edd544b27 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -972,6 +972,14 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
}
EXPORT_SYMBOL_GPL(ip_tunnel_dellink);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev)
+{
+ struct ip_tunnel *tunnel = netdev_priv(dev);
+
+ return tunnel->net;
+}
+EXPORT_SYMBOL(ip_tunnel_get_link_net);
+
int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
struct rtnl_link_ops *ops, char *devname)
{
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 3e861011e4a3..f0fab26e4ddc 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -530,6 +530,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
.changelink = vti_changelink,
.get_size = vti_get_size,
.fill_info = vti_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static int __init vti_init(void)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 37096d64730e..e7a183baba0a 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -498,6 +498,7 @@ static struct rtnl_link_ops ipip_link_ops __read_mostly = {
.dellink = ip_tunnel_dellink,
.get_size = ipip_get_size,
.fill_info = ipip_fill_info,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct xfrm_tunnel ipip_handler __read_mostly = {
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 12c3c8ef3849..5165ac7fde22 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1661,6 +1661,7 @@ static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
.dellink = ip6gre_dellink,
.get_size = ip6gre_get_size,
.fill_info = ip6gre_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 9409887fb664..6b2534ea9c54 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1703,6 +1703,14 @@ nla_put_failure:
return -EMSGSIZE;
}
+struct net *ip6_tnl_get_link_net(const struct net_device *dev)
+{
+ struct ip6_tnl *tunnel = netdev_priv(dev);
+
+ return tunnel->net;
+}
+EXPORT_SYMBOL(ip6_tnl_get_link_net);
+
static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
[IFLA_IPTUN_LINK] = { .type = NLA_U32 },
[IFLA_IPTUN_LOCAL] = { .len = sizeof(struct in6_addr) },
@@ -1726,6 +1734,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
.dellink = ip6_tnl_dellink,
.get_size = ip6_tnl_get_size,
.fill_info = ip6_tnl_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static struct xfrm6_tunnel ip4ip6_handler __read_mostly = {
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d440bb585524..43966dcc9603 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -992,6 +992,7 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
.changelink = vti6_changelink,
.get_size = vti6_get_size,
.fill_info = vti6_fill_info,
+ .get_link_net = ip6_tnl_get_link_net,
};
static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 58e5b4710127..c858d0eb267a 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1765,6 +1765,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
.get_size = ipip6_get_size,
.fill_info = ipip6_fill_info,
.dellink = ipip6_dellink,
+ .get_link_net = ip_tunnel_get_link_net,
};
static struct xfrm_tunnel sit_handler __read_mostly = {
--
2.1.0
Nicolas Dichtel <[email protected]> writes:
> With this patch, a user can define an id for a peer netns by providing a FD or a
> PID. These ids are local to netns (ie valid only into one netns).
Scratches head. Do you actually find value in using the pid instead of
a file descriptor?
Doing things by pid was an early attempt to make things work, and has
been a bit clutsy. If you don't find value in it I would recommend just
supporting getting/setting the network namespace by file descriptor.
Eric
> This will be useful for netlink messages when a x-netns interface is dumped.
>
> Signed-off-by: Nicolas Dichtel <[email protected]>
> ---
> MAINTAINERS | 1 +
> include/net/net_namespace.h | 5 ++
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/netns.h | 38 +++++++++
> net/core/net_namespace.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
> net/netlink/genetlink.c | 4 +
> 6 files changed, 244 insertions(+)
> create mode 100644 include/uapi/linux/netns.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 43898b1a8a2d..de7e6fcbd5c2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6382,6 +6382,7 @@ F: include/linux/netdevice.h
> F: include/uapi/linux/in.h
> F: include/uapi/linux/net.h
> F: include/uapi/linux/netdevice.h
> +F: include/uapi/linux/netns.h
> F: tools/net/
> F: tools/testing/selftests/net/
> F: lib/random32.c
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index e0d64667a4b3..0f1367a71b81 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -59,6 +59,7 @@ struct net {
> struct list_head exit_list; /* Use only net_mutex */
>
> struct user_namespace *user_ns; /* Owning user namespace */
> + struct idr netns_ids;
>
> unsigned int proc_inum;
>
> @@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
> #define __net_initconst __initconst
> #endif
>
> +int peernet2id(struct net *net, struct net *peer);
> +struct net *get_net_ns_by_id(struct net *net, int id);
> +int netns_genl_register(void);
> +
> struct pernet_operations {
> struct list_head list;
> int (*init)(struct net *net);
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 6cad97485bad..d7f49c69585a 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -277,6 +277,7 @@ header-y += netfilter_decnet.h
> header-y += netfilter_ipv4.h
> header-y += netfilter_ipv6.h
> header-y += netlink.h
> +header-y += netns.h
> header-y += netrom.h
> header-y += nfc.h
> header-y += nfs.h
> diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
> new file mode 100644
> index 000000000000..2edf129377de
> --- /dev/null
> +++ b/include/uapi/linux/netns.h
> @@ -0,0 +1,38 @@
> +/* Copyright (c) 2014 6WIND S.A.
> + * Author: Nicolas Dichtel <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + */
> +#ifndef _UAPI_LINUX_NETNS_H_
> +#define _UAPI_LINUX_NETNS_H_
> +
> +/* Generic netlink messages */
> +
> +#define NETNS_GENL_NAME "netns"
> +#define NETNS_GENL_VERSION 0x1
> +
> +/* Commands */
> +enum {
> + NETNS_CMD_UNSPEC,
> + NETNS_CMD_NEWID,
> + NETNS_CMD_GETID,
> + __NETNS_CMD_MAX,
> +};
> +
> +#define NETNS_CMD_MAX (__NETNS_CMD_MAX - 1)
> +
> +/* Attributes */
> +enum {
> + NETNSA_NONE,
> +#define NETNSA_NSINDEX_UNKNOWN -1
> + NETNSA_NSID,
> + NETNSA_PID,
> + NETNSA_FD,
> + __NETNSA_MAX,
> +};
> +
> +#define NETNSA_MAX (__NETNSA_MAX - 1)
> +
> +#endif /* _UAPI_LINUX_NETNS_H_ */
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 7f155175bba8..4a5680ed42fb 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -15,6 +15,8 @@
> #include <linux/file.h>
> #include <linux/export.h>
> #include <linux/user_namespace.h>
> +#include <linux/netns.h>
> +#include <net/genetlink.h>
> #include <net/net_namespace.h>
> #include <net/netns/generic.h>
>
> @@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
> }
> }
>
> +/* This function is used by idr_for_each(). If net is equal to peer, the
> + * function returns the id so that idr_for_each() stops. Because we cannot
> + * returns the id 0 (idr_for_each() will not stop), we return the magic value
> + * -1 for it.
> + */
> +static int net_eq_idr(int id, void *net, void *peer)
> +{
> + if (net_eq(net, peer))
> + return id ? : -1;
> + return 0;
> +}
> +
> +/* returns NETNSA_NSINDEX_UNKNOWN if not found */
> +int peernet2id(struct net *net, struct net *peer)
> +{
> + int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
> +
> + ASSERT_RTNL();
> +
> + /* Magic value for id 0. */
> + if (id == -1)
> + return 0;
> + if (id == 0)
> + return NETNSA_NSINDEX_UNKNOWN;
> +
> + return id;
> +}
> +
> +struct net *get_net_ns_by_id(struct net *net, int id)
> +{
> + struct net *peer;
> +
> + if (id < 0)
> + return NULL;
> +
> + rcu_read_lock();
> + peer = idr_find(&net->netns_ids, id);
> + if (peer)
> + get_net(peer);
> + rcu_read_unlock();
> +
> + return peer;
> +}
> +
> /*
> * setup_net runs the initializers for the network namespace object.
> */
> @@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
> atomic_set(&net->passive, 1);
> net->dev_base_seq = 1;
> net->user_ns = user_ns;
> + idr_init(&net->netns_ids);
>
> #ifdef NETNS_REFCNT_DEBUG
> atomic_set(&net->use_count, 0);
> @@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
> list_for_each_entry(net, &net_kill_list, cleanup_list) {
> list_del_rcu(&net->list);
> list_add_tail(&net->exit_list, &net_exit_list);
> + for_each_net(tmp) {
> + int id = peernet2id(tmp, net);
> +
> + if (id >= 0)
> + idr_remove(&tmp->netns_ids, id);
> + }
> + idr_destroy(&net->netns_ids);
> +
> }
> rtnl_unlock();
>
> @@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
> .exit = net_ns_net_exit,
> };
>
> +static struct genl_family netns_genl_family = {
> + .id = GENL_ID_GENERATE,
> + .name = NETNS_GENL_NAME,
> + .version = NETNS_GENL_VERSION,
> + .hdrsize = 0,
> + .maxattr = NETNSA_MAX,
> + .netnsok = true,
> +};
> +
> +static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
> + [NETNSA_NONE] = { .type = NLA_UNSPEC },
> + [NETNSA_NSID] = { .type = NLA_S32 },
> + [NETNSA_PID] = { .type = NLA_U32 },
> + [NETNSA_FD] = { .type = NLA_U32 },
> +};
> +
> +static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
> +{
> + struct net *net = genl_info_net(info);
> + struct net *peer;
> + int nsid, err;
> +
> + if (!info->attrs[NETNSA_NSID])
> + return -EINVAL;
> + nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
> + if (nsid < 0)
> + return -EINVAL;
> +
> + if (info->attrs[NETNSA_PID])
> + peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
> + else if (info->attrs[NETNSA_FD])
> + peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
> + else
> + return -EINVAL;
> + if (IS_ERR(peer))
> + return PTR_ERR(peer);
> +
> + rtnl_lock();
> + if (peernet2id(net, peer) >= 0) {
> + err = -EEXIST;
> + goto out;
> + }
> +
> + err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
> + if (err >= 0)
> + err = 0;
> +out:
> + rtnl_unlock();
> + put_net(peer);
> + return err;
> +}
> +
> +static int netns_nl_get_size(void)
> +{
> + return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
> + ;
> +}
> +
> +static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
> + int cmd, struct net *net, struct net *peer)
> +{
> + void *hdr;
> + int id;
> +
> + hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
> + if (!hdr)
> + return -EMSGSIZE;
> +
> + rtnl_lock();
> + id = peernet2id(net, peer);
> + rtnl_unlock();
> + if (nla_put_s32(skb, NETNSA_NSID, id))
> + goto nla_put_failure;
> +
> + return genlmsg_end(skb, hdr);
> +
> +nla_put_failure:
> + genlmsg_cancel(skb, hdr);
> + return -EMSGSIZE;
> +}
> +
> +static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
> +{
> + struct net *net = genl_info_net(info);
> + struct sk_buff *msg;
> + int err = -ENOBUFS;
> + struct net *peer;
> +
> + if (info->attrs[NETNSA_PID])
> + peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
> + else if (info->attrs[NETNSA_FD])
> + peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
> + else
> + return -EINVAL;
> +
> + if (IS_ERR(peer))
> + return PTR_ERR(peer);
> +
> + msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
> + if (!msg) {
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
> + NLM_F_ACK, NETNS_CMD_GETID, net, peer);
> + if (err < 0)
> + goto err_out;
> +
> + err = genlmsg_unicast(net, msg, info->snd_portid);
> + goto out;
> +
> +err_out:
> + nlmsg_free(msg);
> +out:
> + put_net(peer);
> + return err;
> +}
> +
> +static struct genl_ops netns_genl_ops[] = {
> + {
> + .cmd = NETNS_CMD_NEWID,
> + .policy = netns_nl_policy,
> + .doit = netns_nl_cmd_newid,
> + .flags = GENL_ADMIN_PERM,
> + },
> + {
> + .cmd = NETNS_CMD_GETID,
> + .policy = netns_nl_policy,
> + .doit = netns_nl_cmd_getid,
> + .flags = GENL_ADMIN_PERM,
> + },
> +};
> +
> +int netns_genl_register(void)
> +{
> + return genl_register_family_with_ops(&netns_genl_family,
> + netns_genl_ops);
> +}
> +
> static int __init net_ns_init(void)
> {
> struct net_generic *ng;
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 76393f2f4b22..c6f39e40c9f3 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -1029,6 +1029,10 @@ static int __init genl_init(void)
> if (err)
> goto problem;
>
> + err = netns_genl_register();
> + if (err < 0)
> + goto problem;
> +
> return 0;
>
> problem:
Nicolas Dichtel <[email protected]> writes:
> The goal of this serie is to be able to multicast netlink messages with an
> attribute that identify a peer netns.
> This is needed by the userland to interpret some informations contained in
> netlink messages (like IFLA_LINK value, but also some other attributes in case
> of x-netns netdevice (see also
> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>
> Ids of peer netns are set by userland via a new genl messages. These ids are
> stored per netns and are local (ie only valid in the netns where they are set).
> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
> the id of a peer netns. Note that it will be possible to add a table (struct net
> -> id) later to optimize this lookup if needed.
>
> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
> messages. And patch 4/4 shows that the netlink messages can be symetric between
> a GET and a SET.
>
> iproute2 patches are available, I can send them on demand.
A quick reply. I think this patchset is in the right general direction.
There are some oddball details that seem odd/awkward to me such as using
genetlink instead of rtnetlink to get and set the ids, and not having
ids if they are not set (that feels like a maintenance/usability challenge).
I would like to give your patches a deep review, but I won't be able to
do that for a couple of weeks. I am deep in the process of moving,
and will be mostly offline until about the Nov 11th.
Eric
> Here is a small screenshot to show how it can be used by userland.
>
> First, setup netns and required ids:
> $ ip netns add foo
> $ ip netns del foo
> $ ip netns
> $ touch /var/run/netns/init_net
> $ mount --bind /proc/1/ns/net /var/run/netns/init_net
> $ ip netns add foo
> $ ip netns exec foo ip netns set init_net 0
> $ ip netns
> foo
> init_net
> $ ip netns exec foo ip netns
> foo
> init_net (id: 0)
>
> Now, add and display an ipip tunnel, with its link part in init_net (id 0 in
> netns foo) and the netdevice in foo:
> $ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
> $ ip netns exec foo ip l ls ipip1
> 6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default
> link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0
>
> The parameter link-netnsid shows us where the interface sends and receives
> packets (and thus we know where encapsulated addresses are set).
>
> RFCv3 -> v4:
> rebase on net-next
> add copyright text in the new netns.h file
>
> RFCv2 -> RFCv3:
> ids are now defined by userland (via netlink). Ids are stored in each netns
> (and they are local to this netns).
> add get_link_net support for ip6 tunnels
> netnsid is now a s32 instead of a u32
>
> RFCv1 -> RFCv2:
> remove useless ()
> ids are now stored in the user ns. It's possible to get an id for a peer netns
> only if the current netns and the peer netns have the same user ns parent.
>
> MAINTAINERS | 1 +
> include/net/ip6_tunnel.h | 1 +
> include/net/ip_tunnels.h | 1 +
> include/net/net_namespace.h | 5 ++
> include/net/rtnetlink.h | 2 +
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/if_link.h | 1 +
> include/uapi/linux/netns.h | 38 +++++++++
> net/core/net_namespace.c | 195 +++++++++++++++++++++++++++++++++++++++++++
> net/core/rtnetlink.c | 38 ++++++++-
> net/ipv4/ip_gre.c | 2 +
> net/ipv4/ip_tunnel.c | 8 ++
> net/ipv4/ip_vti.c | 1 +
> net/ipv4/ipip.c | 1 +
> net/ipv6/ip6_gre.c | 1 +
> net/ipv6/ip6_tunnel.c | 9 ++
> net/ipv6/ip6_vti.c | 1 +
> net/ipv6/sit.c | 1 +
> net/netlink/genetlink.c | 4 +
> 19 files changed, 308 insertions(+), 3 deletions(-)
>
> Comments are welcome.
>
> Regards,
> Nicolas
Le 30/10/2014 19:35, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> With this patch, a user can define an id for a peer netns by providing a FD or a
>> PID. These ids are local to netns (ie valid only into one netns).
>
> Scratches head. Do you actually find value in using the pid instead of
> a file descriptor?
I copied the mechanism from rtnl_link_get_net():
First check if the user provides a PID, if not, check for a FD.
>
> Doing things by pid was an early attempt to make things work, and has
> been a bit clutsy. If you don't find value in it I would recommend just
> supporting getting/setting the network namespace by file descriptor.
Hmm, if I understand well, it's what is done in the patch:
[snip]
>> +static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
>> +{
[snip]
>> + if (info->attrs[NETNSA_PID])
>> + peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
>> + else if (info->attrs[NETNSA_FD])
>> + peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
>> + else
>> + return -EINVAL;
>> + if (IS_ERR(peer))
>> + return PTR_ERR(peer);
Am I right?
Regards,
Nicolas
Le 30/10/2014 19:41, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> The goal of this serie is to be able to multicast netlink messages with an
>> attribute that identify a peer netns.
>> This is needed by the userland to interpret some informations contained in
>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>> of x-netns netdevice (see also
>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>
>> Ids of peer netns are set by userland via a new genl messages. These ids are
>> stored per netns and are local (ie only valid in the netns where they are set).
>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>> the id of a peer netns. Note that it will be possible to add a table (struct net
>> -> id) later to optimize this lookup if needed.
>>
>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>> a GET and a SET.
>>
>> iproute2 patches are available, I can send them on demand.
>
> A quick reply. I think this patchset is in the right general direction.
> There are some oddball details that seem odd/awkward to me such as using
> genetlink instead of rtnetlink to get and set the ids, and not having
> ids if they are not set (that feels like a maintenance/usability challenge).
No problem to use rtnetlink, in fact, I hesitated.
For the second point, I'm not sure to follow you: how to have an id, which will
not break migration, without asking the user to set it?
Note that if the user does not provide an id, you still have a magic value to
say "it's a peer netns but we don't know which one".
>
> I would like to give your patches a deep review, but I won't be able to
> do that for a couple of weeks. I am deep in the process of moving,
> and will be mostly offline until about the Nov 11th.
No problem, I will wait.
I would be great to get a final version for the 3.19 ;-)
Thank you,
Nicolas
Nicolas Dichtel <[email protected]> writes:
> Le 30/10/2014 19:41, Eric W. Biederman a écrit :
>> Nicolas Dichtel <[email protected]> writes:
>>
>>> The goal of this serie is to be able to multicast netlink messages with an
>>> attribute that identify a peer netns.
>>> This is needed by the userland to interpret some informations contained in
>>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>>> of x-netns netdevice (see also
>>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>>
>>> Ids of peer netns are set by userland via a new genl messages. These ids are
>>> stored per netns and are local (ie only valid in the netns where they are set).
>>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>>> the id of a peer netns. Note that it will be possible to add a table (struct net
>>> -> id) later to optimize this lookup if needed.
>>>
>>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>>> a GET and a SET.
>>>
>>> iproute2 patches are available, I can send them on demand.
>>
>> A quick reply. I think this patchset is in the right general direction.
>> There are some oddball details that seem odd/awkward to me such as using
>> genetlink instead of rtnetlink to get and set the ids, and not having
>> ids if they are not set (that feels like a maintenance/usability challenge).
> No problem to use rtnetlink, in fact, I hesitated.
>
> For the second point, I'm not sure to follow you: how to have an id, which will
> not break migration, without asking the user to set it?
We have that situtation with ifindex already. Basically the thought is
to allow an id to be set, but also allow an id to be auto-generated if
we use an namespace without an id being set.
My gut says if we can figure that out we will have an interface with
much more utility.
> Note that if the user does not provide an id, you still have a magic value to
> say "it's a peer netns but we don't know which one".
That is certainly an improvement in clarity over where we are today.
>> I would like to give your patches a deep review, but I won't be able to
>> do that for a couple of weeks. I am deep in the process of moving,
>> and will be mostly offline until about the Nov 11th.
>
> No problem, I will wait.
> I would be great to get a final version for the 3.19 ;-)
Eric
From: [email protected] (Eric W. Biederman)
Date: Thu, 30 Oct 2014 11:41:03 -0700
> Nicolas Dichtel <[email protected]> writes:
>
>> The goal of this serie is to be able to multicast netlink messages with an
>> attribute that identify a peer netns.
>> This is needed by the userland to interpret some informations contained in
>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>> of x-netns netdevice (see also
>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>
>> Ids of peer netns are set by userland via a new genl messages. These ids are
>> stored per netns and are local (ie only valid in the netns where they are set).
>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>> the id of a peer netns. Note that it will be possible to add a table (struct net
>> -> id) later to optimize this lookup if needed.
>>
>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>> a GET and a SET.
>>
>> iproute2 patches are available, I can send them on demand.
>
> A quick reply. I think this patchset is in the right general direction.
> There are some oddball details that seem odd/awkward to me such as using
> genetlink instead of rtnetlink to get and set the ids, and not having
> ids if they are not set (that feels like a maintenance/usability challenge).
>
> I would like to give your patches a deep review, but I won't be able to
> do that for a couple of weeks. I am deep in the process of moving,
> and will be mostly offline until about the Nov 11th.
I'm going to mark this patch set 'deferred' in patchwork until things
move forward.
Thanks.
Le 31/10/2014 20:14, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> Le 30/10/2014 19:41, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <[email protected]> writes:
>>>
>>>> The goal of this serie is to be able to multicast netlink messages with an
>>>> attribute that identify a peer netns.
>>>> This is needed by the userland to interpret some informations contained in
>>>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>>>> of x-netns netdevice (see also
>>>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>>>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>>>
>>>> Ids of peer netns are set by userland via a new genl messages. These ids are
>>>> stored per netns and are local (ie only valid in the netns where they are set).
>>>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>>>> the id of a peer netns. Note that it will be possible to add a table (struct net
>>>> -> id) later to optimize this lookup if needed.
>>>>
>>>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>>>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>>>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>>>> a GET and a SET.
>>>>
>>>> iproute2 patches are available, I can send them on demand.
>>>
>>> A quick reply. I think this patchset is in the right general direction.
>>> There are some oddball details that seem odd/awkward to me such as using
>>> genetlink instead of rtnetlink to get and set the ids, and not having
>>> ids if they are not set (that feels like a maintenance/usability challenge).
>> No problem to use rtnetlink, in fact, I hesitated.
>>
>> For the second point, I'm not sure to follow you: how to have an id, which will
>> not break migration, without asking the user to set it?
>
> We have that situtation with ifindex already. Basically the thought is
> to allow an id to be set, but also allow an id to be auto-generated if
> we use an namespace without an id being set.
If my understanding is correct, the difference is that we want to hide some
netns.
Do you think we can generate an id for each netns that does not have one and
relying on the fact that this id has no meaning unless you have a netns file
descriptor that allow you to get the id of this netns?
Regards,
Nicolas
Le 30/10/2014 19:41, Eric W. Biederman a écrit :
> Nicolas Dichtel <[email protected]> writes:
>
>> The goal of this serie is to be able to multicast netlink messages with an
>> attribute that identify a peer netns.
>> This is needed by the userland to interpret some informations contained in
>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>> of x-netns netdevice (see also
>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>
>> Ids of peer netns are set by userland via a new genl messages. These ids are
>> stored per netns and are local (ie only valid in the netns where they are set).
>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>> the id of a peer netns. Note that it will be possible to add a table (struct net
>> -> id) later to optimize this lookup if needed.
>>
>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>> a GET and a SET.
>>
>> iproute2 patches are available, I can send them on demand.
>
> A quick reply. I think this patchset is in the right general direction.
> There are some oddball details that seem odd/awkward to me such as using
> genetlink instead of rtnetlink to get and set the ids, and not having
> ids if they are not set (that feels like a maintenance/usability challenge).
>
> I would like to give your patches a deep review, but I won't be able to
> do that for a couple of weeks. I am deep in the process of moving,
> and will be mostly offline until about the Nov 11th.
Eric, did you have a chance to look at this?
Regards,
Nicolas
Le 05/11/2014 15:23, Nicolas Dichtel a écrit :
> Le 31/10/2014 20:14, Eric W. Biederman a écrit :
>> Nicolas Dichtel <[email protected]> writes:
>>
>>> Le 30/10/2014 19:41, Eric W. Biederman a écrit :
>>>> Nicolas Dichtel <[email protected]> writes:
>>>>
>>>>> The goal of this serie is to be able to multicast netlink messages with an
>>>>> attribute that identify a peer netns.
>>>>> This is needed by the userland to interpret some informations contained in
>>>>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>>>>> of x-netns netdevice (see also
>>>>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>>>>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>>>>
>>>>> Ids of peer netns are set by userland via a new genl messages. These ids are
>>>>> stored per netns and are local (ie only valid in the netns where they are
>>>>> set).
>>>>> To avoid allocating an int for each peer netns, I use idr_for_each() to
>>>>> retrieve
>>>>> the id of a peer netns. Note that it will be possible to add a table
>>>>> (struct net
>>>>> -> id) later to optimize this lookup if needed.
>>>>>
>>>>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>>>>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>>>>> messages. And patch 4/4 shows that the netlink messages can be symetric
>>>>> between
>>>>> a GET and a SET.
>>>>>
>>>>> iproute2 patches are available, I can send them on demand.
>>>>
>>>> A quick reply. I think this patchset is in the right general direction.
>>>> There are some oddball details that seem odd/awkward to me such as using
>>>> genetlink instead of rtnetlink to get and set the ids, and not having
>>>> ids if they are not set (that feels like a maintenance/usability challenge).
>>> No problem to use rtnetlink, in fact, I hesitated.
>>>
>>> For the second point, I'm not sure to follow you: how to have an id, which will
>>> not break migration, without asking the user to set it?
>>
>> We have that situtation with ifindex already. Basically the thought is
>> to allow an id to be set, but also allow an id to be auto-generated if
>> we use an namespace without an id being set.
> If my understanding is correct, the difference is that we want to hide some
> netns.
> Do you think we can generate an id for each netns that does not have one and
> relying on the fact that this id has no meaning unless you have a netns file
> descriptor that allow you to get the id of this netns?
Any comment Eric ?
Thank you,
Nicolas