2018-04-24 20:46:09

by Christian Brauner

Subject: [PATCH net-next 0/2 v2] netns: uevent performance tweaks

Hey everyone,

This is v2 of "netns: uevent performance tweaks", which contains *no
functional changes*, just a minor indentation fix as requested by David.

Like Eric requested, I did extensive testing that proves significant
performance improvements when using per-netns uevent sequence numbers
with decoupled locks. The results and test descriptions were added to
the commit message of
[PATCH 2/2 v1] netns: isolate seqnums to use per-netns locks.

This series deals with a bunch of performance improvements when sending
out uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
initial user namespace in the global uevent socket list.
Eric, this is the exact patch we agreed upon in
https://lkml.org/lkml/2018/4/10/592.
A very detailed rationale is present in the commit message for
[PATCH 1/2] netns: restrict uevents
- Decouple the locking for network namespaces in the global uevent
socket list from the locking for network namespaces not in the global
uevent socket list.
A very detailed rationale including performance test results is
present in the commit message for
[PATCH 2/2] netns: isolate seqnums to use per-netns locks

Thanks!
Christian

Christian Brauner (2):
netns: restrict uevents
netns: isolate seqnums to use per-netns locks

include/linux/kobject.h     |   2 +
include/net/net_namespace.h |   3 +
kernel/ksysfs.c             |  11 +++-
lib/kobject_uevent.c        | 122 ++++++++++++++++++++++++++++--------
net/core/net_namespace.c    |  14 +++++
5 files changed, 126 insertions(+), 26 deletions(-)

--
2.17.0



2018-04-24 20:45:55

by Christian Brauner

Subject: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.
Testing revealed significant performance improvements. For details see
"Testing" below.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the initial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
processes that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of
lib/kobject_uevent.c, so the mutex *can be* and *is* only ever accessed in
that file, where it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace. In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.
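The lock-selection rule described above can be sketched in plain C. The
struct and field names here are simplified stand-ins for illustration, not
the kernel's real definitions (those live in include/net/net_namespace.h
and lib/kobject_uevent.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures involved. */
struct user_namespace { int unused; };
static struct user_namespace init_user_ns;

struct net {
	struct user_namespace *user_ns;   /* owning user namespace */
	unsigned long long uevent_seqnum; /* per-netns counter added by this patch */
};

/*
 * Mirrors the rule above: network namespaces owned by init_user_ns share
 * the global uevent_sock_mutex; every other network namespace uses the
 * sk_mutex embedded in its own struct uevent_sock.
 */
static bool uses_global_uevent_lock(const struct net *net)
{
	return net->user_ns == &init_user_ns;
}
```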

Testing:
Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
with decoupled locking and one without. To ensure that testing made sense
both kernels carried the patch to remove network namespaces not owned by
the initial user namespace from the uevent socket list.
Three tests were constructed. All of them showed significant performance
improvements with per-netns uevent sequence numbers and decoupled locking.

# Testcase 1:
Only injecting uevents into network namespaces not owned by the initial
user namespace.
- created 1000 new user namespace + network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times
meaning 10,000,000 (10 million) uevents were injected. (The high
number of uevent injections should get rid of a lot of jitter.)
The injection was done by fork()ing 1000 uevent injectors in a simple
for-loop to ensure that uevents were injected in parallel.
- mean transaction time was calculated:
- *without* uevent sequence number namespacing: 67 μs
- *with* uevent sequence number namespacing: 55 μs
- makes a difference of: 12 μs
- a t-test was performed on the two data vectors, which revealed
significant performance improvements:
Welch Two Sample t-test
data: x1 and y1
t = 405.16, df = 18883000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
12.14949 12.26761
sample estimates:
mean of x mean of y
68.48594 56.27739

# Testcase 2:
Injecting uevents into network namespaces not owned by the initial user
namespace and network namespaces owned by the initial user namespace.
- created 500 new user namespace + network namespace pairs
- created 500 new network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times
meaning 10,000,000 (10 million) uevents were injected. (The high
number of uevent injections should get rid of a lot of jitter.)
The injection was done by fork()ing 1000 uevent injectors in a simple
for-loop to ensure that uevents were injected in parallel.
- mean transaction time was calculated:
- *without* uevent sequence number namespacing: 572 μs
- *with* uevent sequence number namespacing: 514 μs
- makes a difference of: 58 μs
- a t-test was performed on the two data vectors, which revealed
significant performance improvements:
Welch Two Sample t-test
data: x2 and y2
t = 38.685, df = 19682000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
55.10630 60.98815
sample estimates:
mean of x mean of y
572.9684 514.9211

# Testcase 3:
- created 500 new user namespace + network namespace pairs *without uevent
listeners*
- created 500 new network namespace pairs *without uevent listeners*
- injected uevents into each of those network namespaces 10,000 times
meaning 10,000,000 (10 million) uevents were injected. (The high number
of uevent injections should get rid of a lot of jitter.)
The injection was done by fork()ing 1000 uevent injectors in a simple
for-loop to ensure that uevents were injected in parallel.
- mean transaction time was calculated:
- *without* uevent sequence number namespacing: 206 μs
- *with* uevent sequence number namespacing: 163 μs
- makes a difference of: 43 μs
- a t-test was performed on the two data vectors, which revealed
significant performance improvements:
Welch Two Sample t-test
data: x3 and y3
t = 58.37, df = 17711000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
41.77860 44.68178
sample estimates:
mean of x mean of y
207.2632 164.0330
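The shape of the injection harness used in all three test cases — fork()
1000 injectors in a simple for-loop so they run in parallel, then reap
them — looks roughly like this. inject_uevent() is a placeholder for the
actual injection step (a sendmsg() on the uevent netlink socket of the
target network namespace), which is elided here:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder: the real injector writes to a uevent netlink socket
 * opened in the target network namespace. */
static int inject_uevent(int injector_idx)
{
	(void)injector_idx;
	return 0;
}

/*
 * fork() n injectors in a simple for-loop to ensure the injections run
 * in parallel, then reap all children and report how many failed.
 */
static int run_injectors(int n)
{
	int failures = 0, status;

	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0)
			_exit(inject_uevent(i) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
	}

	while (wait(&status) > 0)
		if (!WIFEXITED(status) || WEXITSTATUS(status) != EXIT_SUCCESS)
			failures++;

	return failures;
}
```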

Signed-off-by: Christian Brauner <[email protected]>
---
Changelog v1->v2:
* non-functional change: fix indentation for C directives in
kernel/ksysfs.c
Changelog v0->v1:
* add detailed test results to the commit message
* account for kernels compiled without CONFIG_NET
---
include/linux/kobject.h     |   2 +
include/net/net_namespace.h |   3 ++
kernel/ksysfs.c             |  11 +++-
lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
net/core/net_namespace.c    |  14 +++++
5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..4e608968907f 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,8 +36,10 @@
extern char uevent_helper[];
#endif

+#ifndef CONFIG_NET
/* counter to tag the uevent, read only except for the kobject core */
extern u64 uevent_seqnum;
+#endif

/*
* The actions here must match the index to the string array
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
struct sock *genl_sock;

struct uevent_sock *uevent_sock; /* uevent socket */
+ /* counter to tag the uevent, read only except for the kobject core */
+ u64 uevent_seqnum;

struct list_head dev_base_head;
struct hlist_head *dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;

struct net *get_net_ns_by_pid(pid_t pid);
struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);

#ifdef CONFIG_SYSCTL
void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..38b70b90a21f 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/capability.h>
#include <linux/compiler.h>
+#include <net/net_namespace.h>

#include <linux/rcupdate.h> /* rcu_expedited and rcu_normal */

@@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
static ssize_t uevent_seqnum_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+ u64 seqnum;
+
+#ifdef CONFIG_NET
+ seqnum = get_ns_uevent_seqnum_by_vpid();
+#else
+ seqnum = uevent_seqnum;
+#endif
+
+ return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
}
KERNEL_ATTR_RO(uevent_seqnum);

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..5da20def556d 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,42 @@
#include <net/net_namespace.h>


+#ifndef CONFIG_NET
u64 uevent_seqnum;
+#endif
+
#ifdef CONFIG_UEVENT_HELPER
char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
#endif

+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
struct uevent_sock {
struct list_head list;
struct sock *sk;
+ /*
+ * This mutex protects uevent sockets and the uevent counter of
+ * network namespaces *not* owned by init_user_ns.
+ * For network namespaces owned by init_user_ns this lock is *not*
+ * valid instead the global uevent_sock_mutex must be used!
+ */
+ struct mutex sk_mutex;
};

#ifdef CONFIG_NET
static LIST_HEAD(uevent_sock_list);
#endif

-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
static DEFINE_MUTEX(uevent_sock_mutex);

/* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)

return 0;
}
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+ if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+ WARN(1, KERN_ERR "Failed to append sequence number. "
+ "Too many uevent variables\n");
+ return false;
+ }
+
+ if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+ WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+ return false;
+ }
+
+ return true;
+}
#endif

#ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,

/* send netlink message */
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+ /* bump sequence number */
+ u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
struct sock *uevent_sock = ue_sk->sk;
+ char buf[SEQNUM_BUFSIZE];

if (!netlink_has_listeners(uevent_sock, 1))
continue;

if (!skb) {
- /* allocate message with the maximum possible size */
+ /* calculate header length */
size_t len = strlen(action_string) + strlen(devpath) + 2;
char *scratch;

+ /* allocate message with the maximum possible size */
retval = -ENOMEM;
- skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+ skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
if (!skb)
continue;

@@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
scratch = skb_put(skb, len);
sprintf(scratch, "%s@%s", action_string, devpath);

+ /* add env */
skb_put_data(skb, env->buf, env->buflen);

NETLINK_CB(skb).dst_group = 1;
}

+ /* prepare netns seqnum */
+ retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+ if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+ continue;
+ retval++;
+
+ if (!can_hold_seqnum(env, retval))
+ continue;
+
+ /* append netns seqnum */
+ skb_put_data(skb, buf, retval);
+
retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
0, 1, GFP_KERNEL,
kobj_bcast_filter,
@@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
/* ENOBUFS should be handled in userspace */
if (retval == -ENOBUFS || retval == -ESRCH)
retval = 0;
+
+ /* remove netns seqnum */
+ skb_trim(skb, env->buflen);
}
consume_skb(skb);
+#else
+ uevent_seqnum++;
#endif
return retval;
}
@@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
}

mutex_lock(&uevent_sock_mutex);
- /* we will send an event, so request a new sequence number */
- retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
- if (retval) {
- mutex_unlock(&uevent_sock_mutex);
- goto exit;
- }
- retval = kobject_uevent_net_broadcast(kobj, env, action_string,
- devpath);
+ retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
mutex_unlock(&uevent_sock_mutex);

#ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
EXPORT_SYMBOL_GPL(add_uevent_var);

#if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
struct netlink_ext_ack *extack)
{
- /* u64 to chars: 2^64 - 1 = 21 chars */
- char buf[sizeof("SEQNUM=") + 21];
+ struct sock *usk = ue_sk->sk;
+ char buf[SEQNUM_BUFSIZE];
struct sk_buff *skbc;
int ret;

/* bump and prepare sequence number */
- ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
- if (ret < 0 || (size_t)ret >= sizeof(buf))
+ ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+ ++sock_net(ue_sk->sk)->uevent_seqnum);
+ if (ret < 0 || ret >= SEQNUM_BUFSIZE)
return -ENOMEM;
ret++;

@@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
return -EPERM;
}

- mutex_lock(&uevent_sock_mutex);
- ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
- mutex_unlock(&uevent_sock_mutex);
+ if (net->user_ns == &init_user_ns)
+ mutex_lock(&uevent_sock_mutex);
+ else
+ mutex_lock(&net->uevent_sock->sk_mutex);
+ ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+ if (net->user_ns == &init_user_ns)
+ mutex_unlock(&uevent_sock_mutex);
+ else
+ mutex_unlock(&net->uevent_sock->sk_mutex);

return ret;
}
@@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
mutex_lock(&uevent_sock_mutex);
list_add_tail(&ue_sk->list, &uevent_sock_list);
mutex_unlock(&uevent_sock_mutex);
+ } else {
+ /*
+ * Uevent sockets and counters for network namespaces
+ * not owned by the initial user namespace have their
+ * own mutex.
+ */
+ mutex_init(&ue_sk->sk_mutex);
}

return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..8894638f5150 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
}
EXPORT_SYMBOL_GPL(get_net_ns_by_pid);

+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+ pid_t cur_pid;
+ struct net *net;
+
+ cur_pid = task_pid_vnr(current);
+ net = get_net_ns_by_pid(cur_pid);
+ if (IS_ERR(net))
+ return 0;
+
+ return net->uevent_seqnum;
+}
+EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
+
static __net_init int net_ns_net_init(struct net *net)
{
#ifdef CONFIG_NET_NS
--
2.17.0


2018-04-24 20:46:28

by Christian Brauner

Subject: [PATCH net-next 1/2 v2] netns: restrict uevents

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discussed on the mailing list [1].
- Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless to
them, mainly because non-initial user namespaces cannot easily manage
devices unless they have a privileged host-process helping them out. This
means that there will be a thundering herd of activity when there
shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning user
namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway. I have
a) verified that this is true for every udev implementation out there and
b) that this behavior has been present in all udev implementations from the
very beginning.
- Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So every time a uevent is generated the mutex is taken on the
list. The mutex is held *from the creation of the uevent (memory
allocation, string creation, etc.) until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket that
has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a lot
of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we significantly
increase performance in this codepath.
- Injecting uevents:
There's a valid argument that containers might be interested in receiving
device events especially if they are delegated to them by a privileged
userspace process. One prime example are SR-IOV enabled devices that are
explicitly designed to be handed off to other users such as VMs or
containers.
This use-case can now be correctly handled since
commit 692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can let
a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
the network namespace of the netlink socket) userspace process make a
decision what uevents should be sent. This removes the need to blindly
broadcast uevents into all user namespaces and provides a performant and
safe solution to this problem.
- Filtering logic:
This patch filters by *owning user namespace of the network namespace a
given task resides in* and not by user namespace of the task per se. This
means if the user namespace of a given task is unshared but the network
namespace is kept and is owned by the initial user namespace a listener
that is opening the uevent socket in that network namespace can still
listen to uevents.
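The userspace-side filtering the second bullet relies on can be sketched as
follows. The check itself (accept only global root) is what all udev
implementations do; the hard-coded 65534 is the kernel's default overflow
uid, shown here for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Uids without a mapping in the receiving user namespace appear as the
 * overflow uid ("nobody"); 65534 is the kernel default. */
#define OVERFLOW_UID 65534

/*
 * udev implementations accept uevents only from global root (uid 0).
 * Inside a user namespace that has no mapping for global root, the
 * sender shows up as OVERFLOW_UID and the message is dropped.
 */
static bool udev_would_accept(uid_t sender_uid)
{
	return sender_uid == 0;
}
```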

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <[email protected]>
---
Changelog v1->v2:
* patch unchanged
Changelog v0->v1:
* patch unchanged
---
lib/kobject_uevent.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)

net->uevent_sock = ue_sk;

- mutex_lock(&uevent_sock_mutex);
- list_add_tail(&ue_sk->list, &uevent_sock_list);
- mutex_unlock(&uevent_sock_mutex);
+ /* Restrict uevents to initial user namespace. */
+ if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+ mutex_lock(&uevent_sock_mutex);
+ list_add_tail(&ue_sk->list, &uevent_sock_list);
+ mutex_unlock(&uevent_sock_mutex);
+ }
+
return 0;
}

@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
{
struct uevent_sock *ue_sk = net->uevent_sock;

- mutex_lock(&uevent_sock_mutex);
- list_del(&ue_sk->list);
- mutex_unlock(&uevent_sock_mutex);
+ if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+ mutex_lock(&uevent_sock_mutex);
+ list_del(&ue_sk->list);
+ mutex_unlock(&uevent_sock_mutex);
+ }

netlink_kernel_release(ue_sk->sk);
kfree(ue_sk);
--
2.17.0


2018-04-24 21:55:35

by Eric W. Biederman

Subject: Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

Christian Brauner <[email protected]> writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking everything.
> Testing revealed significant performance improvements. For details see
> "Testing" below.

Maybe. Your locking is wrong, and a few other things are wrong. See
below.

> Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> owned by the initial user namespace can be sent uevents from a sufficiently
> privileged userspace process.
> In order to send a uevent into a network namespace not owned by the initial
> user namespace we currently still need to take the *global mutex* that
> locks the uevent socket list even though the list *only contains network
> namespaces owned by the initial user namespace*. This needs to be done
> because the uevent counter is a global variable. Taking the global lock is
> performance sensitive since a user on the host can spawn a pool of n
> processes that each create their own new user and network namespaces and then
> go on to inject uevents in parallel into the network namespace of all of
> these processes. This can have a significant performance impact for the
> host's udevd since it means that there can be a lot of delay between a
> device being added and the corresponding uevent being sent out and
> available for processing by udevd. It also means that each network
> namespace not owned by the initial user namespace which userspace has sent
> a uevent to will need to wait until the lock becomes available.
>
> Implementation:
> This patch gives each network namespace its own uevent sequence number.
> Each network namespace not owned by the initial user namespace receives its
> own mutex. The struct uevent_sock is opaque to callers outside of
> lib/kobject_uevent.c, so the mutex *can be* and *is* only ever accessed in
> that file, where it is clearly documented which lock has to be taken. All network
> namespaces owned by the initial user namespace will still share the same
> lock since they are all served sequentially via the uevent socket list.
> This decouples the locking and ensures that the host retrieves uevents as
> fast as possible even if there are a lot of uevents injected into network
> namespaces not owned by the initial user namespace. In addition, each
> network namespace not owned by the initial user namespace does not have to
> wait on any other network namespace not sharing the same user namespace.
>
> Testing:
> Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> with decoupled locking and one without. To ensure that testing made sense
> both kernels carried the patch to remove network namespaces not owned by
> the initial user namespace from the uevent socket list.
> Three tests were constructed. All of them showed significant performance
> improvements with per-netns uevent sequence numbers and decoupled locking.
>
> # Testcase 1:
> Only Injecting Uevents into network namespaces not owned by the initial
> user namespace.
> - created 1000 new user namespace + network namespace pairs
> - opened a uevent listener in each of those namespace pairs
> - injected uevents into each of those network namespaces 10,000 times
> meaning 10,000,000 (10 million) uevents were injected. (The high
> number of uevent injections should get rid of a lot of jitter.)
> The injection was done by fork()ing 1000 uevent injectors in a simple
> for-loop to ensure that uevents were injected in parallel.
> - mean transaction time was calculated:
> - *without* uevent sequence number namespacing: 67 μs
> - *with* uevent sequence number namespacing: 55 μs
> - makes a difference of: 12 μs
> - a t-test was performed on the two data vectors, which revealed
> significant performance improvements:
> Welch Two Sample t-test
> data: x1 and y1
> t = 405.16, df = 18883000, p-value < 2.2e-16
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> 12.14949 12.26761
> sample estimates:
> mean of x mean of y
> 68.48594 56.27739
>
> # Testcase 2:
> Injecting Uevents into network namespaces not owned by the initial user
> namespace and network namespaces owned by the initial user namespace.
> - created 500 new user namespace + network namespace pairs
> - created 500 new network namespace pairs
> - opened a uevent listener in each of those namespace pairs
> - injected uevents into each of those network namespaces 10,000 times
> meaning 10,000,000 (10 million) uevents were injected. (The high
> number of uevent injections should get rid of a lot of jitter.)
> The injection was done by fork()ing 1000 uevent injectors in a simple
> for-loop to ensure that uevents were injected in parallel.
> - mean transaction time was calculated:
> - *without* uevent sequence number namespacing: 572 μs
> - *with* uevent sequence number namespacing: 514 μs
> - makes a difference of: 58 μs
> - a t-test was performed on the two data vectors, which revealed
> significant performance improvements:
> Welch Two Sample t-test
> data: x2 and y2
> t = 38.685, df = 19682000, p-value < 2.2e-16
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> 55.10630 60.98815
> sample estimates:
> mean of x mean of y
> 572.9684 514.9211
>
> # Testcase 3:
> - created 500 new user namespace + network namespace pairs *without uevent
> listeners*
> - created 500 new network namespace pairs *without uevent listeners*
> - injected uevents into each of those network namespaces 10,000 times
> meaning 10,000,000 (10 million) uevents were injected. (The high number
> of uevent injections should get rid of a lot of jitter.)
> The injection was done by fork()ing 1000 uevent injectors in a simple
> for-loop to ensure that uevents were injected in parallel.
> - mean transaction time was calculated:
> - *without* uevent sequence number namespacing: 206 μs
> - *with* uevent sequence number namespacing: 163 μs
> - makes a difference of: 43 μs
> - a t-test was performed on the two data vectors, which revealed
> significant performance improvements:
> Welch Two Sample t-test
> data: x3 and y3
> t = 58.37, df = 17711000, p-value < 2.2e-16
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> 41.77860 44.68178
> sample estimates:
> mean of x mean of y
> 207.2632 164.0330
>
> Signed-off-by: Christian Brauner <[email protected]>
> ---
> Changelog v1->v2:
> * non-functional change: fix indentation for C directives in
> kernel/ksysfs.c
> Changelog v0->v1:
> * add detailed test results to the commit message
> * account for kernels compiled without CONFIG_NET
> ---
> include/linux/kobject.h | 2 +
> include/net/net_namespace.h | 3 ++
> kernel/ksysfs.c | 11 +++-
> lib/kobject_uevent.c | 104 +++++++++++++++++++++++++++++-------
> net/core/net_namespace.c | 14 +++++
> 5 files changed, 114 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> index 7f6f93c3df9c..4e608968907f 100644
> --- a/include/linux/kobject.h
> +++ b/include/linux/kobject.h
> @@ -36,8 +36,10 @@
> extern char uevent_helper[];
> #endif
>
> +#ifndef CONFIG_NET
> /* counter to tag the uevent, read only except for the kobject core */
> extern u64 uevent_seqnum;
> +#endif

That smells like an implementation bug somewhere.

> /*
> * The actions here must match the index to the string array
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 47e35cce3b64..e4e171b1ba69 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -85,6 +85,8 @@ struct net {
> struct sock *genl_sock;
>
> struct uevent_sock *uevent_sock; /* uevent socket */
> + /* counter to tag the uevent, read only except for the kobject core */
> + u64 uevent_seqnum;
>
> struct list_head dev_base_head;
> struct hlist_head *dev_name_head;
> @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
>
> struct net *get_net_ns_by_pid(pid_t pid);
> struct net *get_net_ns_by_fd(int fd);
> +u64 get_ns_uevent_seqnum_by_vpid(void);
>
> #ifdef CONFIG_SYSCTL
> void ipx_register_sysctl(void);
> diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
> index 46ba853656f6..38b70b90a21f 100644
> --- a/kernel/ksysfs.c
> +++ b/kernel/ksysfs.c
> @@ -19,6 +19,7 @@
> #include <linux/sched.h>
> #include <linux/capability.h>
> #include <linux/compiler.h>
> +#include <net/net_namespace.h>
>
> #include <linux/rcupdate.h> /* rcu_expedited and rcu_normal */
>
> @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
> static ssize_t uevent_seqnum_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
> + u64 seqnum;
> +
> +#ifdef CONFIG_NET
> + seqnum = get_ns_uevent_seqnum_by_vpid();
> +#else
> + seqnum = uevent_seqnum;
> +#endif

This can be simplified to be just:
seqnum = current->nsproxy->net_ns->uevent_seqnum;

Except that is not correct either. As every instance of sysfs has a
network namespace associated with it, and you are not fetching that
network namespace.

Typically this would call for making this file per network namespace
so you would have this information available. Sigh. I don't know if
there is an easy way to do that for this file.

> +
> + return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
> }
> KERNEL_ATTR_RO(uevent_seqnum);
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index f5f5038787ac..5da20def556d 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -29,21 +29,42 @@
> #include <net/net_namespace.h>
>
>
> +#ifndef CONFIG_NET
> u64 uevent_seqnum;
> +#endif
> +
> #ifdef CONFIG_UEVENT_HELPER
> char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
> #endif
>
> +/*
> + * Size a buffer needs to be in order to hold the largest possible sequence
> > + * number stored in a u64: 2^64 - 1 needs 20 chars, plus 1 for the \0 byte.
> + */
> +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
> struct uevent_sock {
> struct list_head list;
> struct sock *sk;
> + /*
> + * This mutex protects uevent sockets and the uevent counter of
> + * network namespaces *not* owned by init_user_ns.
> + * For network namespaces owned by init_user_ns this lock is *not*
> + * valid instead the global uevent_sock_mutex must be used!
> + */
> + struct mutex sk_mutex;
> };
>
> #ifdef CONFIG_NET
> static LIST_HEAD(uevent_sock_list);
> #endif
>
> -/* This lock protects uevent_seqnum and uevent_sock_list */
> +/*
> + * This mutex protects uevent sockets and the uevent counter of network
> + * namespaces owned by init_user_ns.
> + * For network namespaces not owned by init_user_ns this lock is *not*
> + * valid instead the network namespace specific sk_mutex in struct
> + * uevent_sock must be used!
> + */
> static DEFINE_MUTEX(uevent_sock_mutex);
>
> /* the strings here must match the enum in include/linux/kobject.h */
> @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>
> return 0;
> }
> +
> +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
> +{
> + if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
> + WARN(1, KERN_ERR "Failed to append sequence number. "
> + "Too many uevent variables\n");
> + return false;
> + }
> +
> + if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
> + WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
> + return false;
> + }
> +
> + return true;
> +}
> #endif
>
> #ifdef CONFIG_UEVENT_HELPER
> @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>
> /* send netlink message */
> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> + /* bump sequence number */
> + u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
> struct sock *uevent_sock = ue_sk->sk;
> + char buf[SEQNUM_BUFSIZE];
>
> if (!netlink_has_listeners(uevent_sock, 1))
> continue;
>
> if (!skb) {
> - /* allocate message with the maximum possible size */
> + /* calculate header length */
> size_t len = strlen(action_string) + strlen(devpath) + 2;
> char *scratch;
>
> + /* allocate message with the maximum possible size */
> retval = -ENOMEM;
> - skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> + skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
> if (!skb)
> continue;
>
> @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> scratch = skb_put(skb, len);
> sprintf(scratch, "%s@%s", action_string, devpath);
>
> + /* add env */
> skb_put_data(skb, env->buf, env->buflen);
>
> NETLINK_CB(skb).dst_group = 1;
> }
>
> + /* prepare netns seqnum */
> + retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
> + if (retval < 0 || retval >= SEQNUM_BUFSIZE)
> + continue;
> + retval++;
> +
> + if (!can_hold_seqnum(env, retval))
> + continue;

You have allocated enough space in the skb, so why does can_hold_seqnum
make sense?

Do you need to back seqnum out of the env later for this to work twice
in a row?

> +
> + /* append netns seqnum */
> + skb_put_data(skb, buf, retval);
> +
> retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
> 0, 1, GFP_KERNEL,
> kobj_bcast_filter,
> @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> /* ENOBUFS should be handled in userspace */
> if (retval == -ENOBUFS || retval == -ESRCH)
> retval = 0;
> +
> + /* remove netns seqnum */
> + skb_trim(skb, env->buflen);

Have you checked to see if the seqnum actually makes it to userspace?
> }
> consume_skb(skb);
> +#else
> + uevent_seqnum++;
> #endif
> return retval;
> }
> @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> }
>
> mutex_lock(&uevent_sock_mutex);
> - /* we will send an event, so request a new sequence number */
> - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
> - if (retval) {
> - mutex_unlock(&uevent_sock_mutex);
> - goto exit;
> - }
> - retval = kobject_uevent_net_broadcast(kobj, env, action_string,
> - devpath);
> + retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
> mutex_unlock(&uevent_sock_mutex);

How does all of this work with events for network devices that are not
in the initial network namespace? This looks to me like this code fails
to take the sk_mutex.

> #ifdef CONFIG_UEVENT_HELPER
> @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> EXPORT_SYMBOL_GPL(add_uevent_var);
>
> #if defined(CONFIG_NET)
> -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
> struct netlink_ext_ack *extack)
> {
> - /* u64 to chars: 2^64 - 1 = 21 chars */
> - char buf[sizeof("SEQNUM=") + 21];
> + struct sock *usk = ue_sk->sk;
> + char buf[SEQNUM_BUFSIZE];
> struct sk_buff *skbc;
> int ret;
>
> /* bump and prepare sequence number */
> - ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
> - if (ret < 0 || (size_t)ret >= sizeof(buf))
> + ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
> + ++sock_net(ue_sk->sk)->uevent_seqnum);
> + if (ret < 0 || ret >= SEQNUM_BUFSIZE)
> return -ENOMEM;
> ret++;
>
> @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
> return -EPERM;
> }
>
> - mutex_lock(&uevent_sock_mutex);
> - ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
> - mutex_unlock(&uevent_sock_mutex);
> + if (net->user_ns == &init_user_ns)
> + mutex_lock(&uevent_sock_mutex);
> + else
> + mutex_lock(&net->uevent_sock->sk_mutex);
> + ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
> + if (net->user_ns == &init_user_ns)
> + mutex_unlock(&uevent_sock_mutex);
> + else
> + mutex_unlock(&net->uevent_sock->sk_mutex);
>
> return ret;
> }
> @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
> mutex_lock(&uevent_sock_mutex);
> list_add_tail(&ue_sk->list, &uevent_sock_list);
> mutex_unlock(&uevent_sock_mutex);
> + } else {
> + /*
> + * Uevent sockets and counters for network namespaces
> + * not owned by the initial user namespace have their
> + * own mutex.
> + */
> + mutex_init(&ue_sk->sk_mutex);
> }
>
> return 0;
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a11e03f920d3..8894638f5150 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
> }
> EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>
> +u64 get_ns_uevent_seqnum_by_vpid(void)
> +{
> + pid_t cur_pid;
> + struct net *net;
> +
> + cur_pid = task_pid_vnr(current);
> + net = get_net_ns_by_pid(cur_pid);
> + if (IS_ERR(net))
> + return 0;
> +
> + return net->uevent_seqnum;
> +}
> +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);

I just have to say this function is completely crazy.
You go from the tsk to the pid back to the tsk.
And you leak a struct net pointer.

It is much simpler and less racy to say:

current->nsproxy->net_ns->uevent_seqnum;

Because you are accessing current->nsproxy, nsproxy can't change under
you. The rcu_read_lock etc. that get_net_ns_by_pid does is there for
accessing non-current tasks.



> static __net_init int net_ns_net_init(struct net *net)
> {
> #ifdef CONFIG_NET_NS

2018-04-24 21:57:48

by Eric W. Biederman

Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents


We already do this in practice in userspace. It doesn't make much
sense to perform this delivery. So we might as well make this optimization.

Christian Brauner <[email protected]> writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
> Broadcasting uevents into all network namespaces introduces significant
> overhead.
> All processes that listen to uevents running in non-initial user
> namespaces will end up responding to uevents that will be meaningless to
> them. Mainly, because non-initial user namespaces cannot easily manage
> devices unless they have a privileged host-process helping them out. This
> means that there will be a thundering herd of activity when there
> shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
> Uevents are filtered by userspace in a user namespace because the
> received uid != 0. Instead the uid associated with the event will be
> 65534 == "nobody" because the global root uid is not mapped.
> This means we can safely and without introducing regressions modify the
> kernel to not send uevents into all network namespaces whose owning user
> namespace is not the initial user namespace because we know that
> userspace will ignore the message because of the uid anyway. I have
> a) verified that this is true for every udev implementation out there b)
> that this behavior has been present in all udev implementations from the
> very beginning.
> - Removing needless overhead/Increasing performance:
> Currently, the uevent socket for each network namespace is added to the
> global variable uevent_sock_list. The list itself needs to be protected
> by a mutex. So everytime a uevent is generated the mutex is taken on the
> list. The mutex is held *from the creation of the uevent (memory
> allocation, string creation etc. until all uevent sockets have been
> handled*. This is aggravated by the fact that for each uevent socket that
> has listeners the mc_list must be walked as well which means we're
> talking O(n^2) here. Given that a standard Linux workload usually has
> quite a lot of network namespaces and - in the face of containers - a lot
> of user namespaces this quickly becomes a performance problem (see
> "Thundering herd" above). By just recording uevent sockets of network
> namespaces that are owned by the initial user namespace we significantly
> increase performance in this codepath.
> - Injecting uevents:
> There's a valid argument that containers might be interested in receiving
> device events especially if they are delegated to them by a privileged
> userspace process. One prime example are SR-IOV enabled devices that are
> explicitly designed to be handed off to other users such as VMs or
> containers.
> This use-case can now be correctly handled since
> commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> introduced the ability to send uevents from userspace. As such we can let
> a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> the network namespace of the netlink socket) userspace process make a
> decision what uevents should be sent. This removes the need to blindly
> broadcast uevents into all user namespaces and provides a performant and
> safe solution to this problem.
> - Filtering logic:
> This patch filters by *owning user namespace of the network namespace a
> given task resides in* and not by user namespace of the task per se. This
> means if the user namespace of a given task is unshared but the network
> namespace is kept and is owned by the initial user namespace a listener
> that is opening the uevent socket in that network namespace can still
> listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>

> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
> lib/kobject_uevent.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..f5f5038787ac 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
>
> net->uevent_sock = ue_sk;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_add_tail(&ue_sk->list, &uevent_sock_list);
> - mutex_unlock(&uevent_sock_mutex);
> + /* Restrict uevents to initial user namespace. */
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_add_tail(&ue_sk->list, &uevent_sock_list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
> +
> return 0;
> }
>
> @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> {
> struct uevent_sock *ue_sk = net->uevent_sock;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_del(&ue_sk->list);
> - mutex_unlock(&uevent_sock_mutex);
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_del(&ue_sk->list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
>
> netlink_kernel_release(ue_sk->sk);
> kfree(ue_sk);

2018-04-24 22:24:10

by Christian Brauner

Subject: Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking everything.
> > Testing revealed significant performance improvements. For details see
> > "Testing" below.
>
> Maybe. Your locking is wrong, and a few other things are wrong. see
> below.

Thanks for the review! Happy to rework this until it's in a mergeable shape.

>
> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> > owned by the initial user namespace can be sent uevents from a sufficiently
> > privileged userspace process.
> > In order to send a uevent into a network namespace not owned by the initial
> > user namespace we currently still need to take the *global mutex* that
> > locks the uevent socket list even though the list *only contains network
> > namespaces owned by the initial user namespace*. This needs to be done
> > because the uevent counter is a global variable. Taking the global lock is
> > performance sensitive since a user on the host can spawn a pool of n
> > processes that each create their own new user and network namespaces and then
> > go on to inject uevents in parallel into the network namespace of all of
> > these processes. This can have a significant performance impact for the
> > host's udevd since it means that there can be a lot of delay between a
> > device being added and the corresponding uevent being sent out and
> > available for processing by udevd. It also means that each network
> > namespace not owned by the initial user namespace which userspace has sent
> > a uevent to will need to wait until the lock becomes available.
> >
> > Implementation:
> > This patch gives each network namespace its own uevent sequence number.
> > Each network namespace not owned by the initial user namespace receives its
> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> > file it is clearly documented which lock has to be taken. All network
> > namespaces owned by the initial user namespace will still share the same
> > lock since they are all served sequentially via the uevent socket list.
> > This decouples the locking and ensures that the host retrieves uevents as
> > fast as possible even if there are a lot of uevents injected into network
> > namespaces not owned by the initial user namespace. In addition, each
> > network namespace not owned by the initial user namespace does not have to
> > wait on any other network namespace not sharing the same user namespace.
> >
> > Testing:
> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> > with decoupled locking and one without. To ensure that testing made sense
> > both kernels carried the patch to remove network namespaces not owned by
> > the initial user namespace from the uevent socket list.
> > Three tests were constructed. All of them showed significant performance
> > improvements with per-netns uevent sequence numbers and decoupled locking.
> >
> > # Testcase 1:
> > Only Injecting Uevents into network namespaces not owned by the initial
> > user namespace.
> > - created 1000 new user namespace + network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times
> > meaning 10,000,000 (10 million) uevents were injected. (The high
> > number of uevent injections should get rid of a lot of jitter.)
> > The injection was done by fork()ing 1000 uevent injectors in a simple
> > for-loop to ensure that uevents were injected in parallel.
> > - mean transaction time was calculated:
> > - *without* uevent sequence number namespacing: 67 μs
> > - *with* uevent sequence number namespacing: 55 μs
> > - makes a difference of: 12 μs
> > - a t-test was performed on the two data vectors which revealed
> > significant performance improvements:
> > Welch Two Sample t-test
> > data: x1 and y1
> > t = 405.16, df = 18883000, p-value < 2.2e-16
> > alternative hypothesis: true difference in means is not equal to 0
> > 95 percent confidence interval:
> > 12.14949 12.26761
> > sample estimates:
> > mean of x mean of y
> > 68.48594 56.27739
> >
> > # Testcase 2:
> > Injecting Uevents into network namespaces not owned by the initial user
> > namespace and network namespaces owned by the initial user namespace.
> > - created 500 new user namespace + network namespace pairs
> > - created 500 new network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times
> > meaning 10,000,000 (10 million) uevents were injected. (The high
> > number of uevent injections should get rid of a lot of jitter.)
> > The injection was done by fork()ing 1000 uevent injectors in a simple
> > for-loop to ensure that uevents were injected in parallel.
> > - mean transaction time was calculated:
> > - *without* uevent sequence number namespacing: 572 μs
> > - *with* uevent sequence number namespacing: 514 μs
> > - makes a difference of: 58 μs
> > - a t-test was performed on the two data vectors which revealed
> > significant performance improvements:
> > Welch Two Sample t-test
> > data: x2 and y2
> > t = 38.685, df = 19682000, p-value < 2.2e-16
> > alternative hypothesis: true difference in means is not equal to 0
> > 95 percent confidence interval:
> > 55.10630 60.98815
> > sample estimates:
> > mean of x mean of y
> > 572.9684 514.9211
> >
> > # Testcase 3:
> > Created 500 new user namespace + network namespace pairs *without uevent
> > listeners*
> > - created 500 new network namespace pairs *without uevent listeners*
> > - injected uevents into each of those network namespaces 10,000 times
> > meaning 10,000,000 (10 million) uevents were injected. (The high number
> > of uevent injections should get rid of a lot of jitter.)
> > The injection was done by fork()ing 1000 uevent injectors in a simple
> > for-loop to ensure that uevents were injected in parallel.
> > - mean transaction time was calculated:
> > - *without* uevent sequence number namespacing: 206 μs
> > - *with* uevent sequence number namespacing: 163 μs
> > - makes a difference of: 43 μs
> > - a t-test was performed on the two data vectors which revealed
> > significant performance improvements:
> > Welch Two Sample t-test
> > data: x3 and y3
> > t = 58.37, df = 17711000, p-value < 2.2e-16
> > alternative hypothesis: true difference in means is not equal to 0
> > 95 percent confidence interval:
> > 41.77860 44.68178
> > sample estimates:
> > mean of x mean of y
> > 207.2632 164.0330
> >
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
> > Changelog v1->v2:
> > * non-functional change: fix indendation for C directives in
> > kernel/ksysfs.c
> > Changelog v0->v1:
> > * add detailed test results to the commit message
> > * account for kernels compiled without CONFIG_NET
> > ---
> > include/linux/kobject.h | 2 +
> > include/net/net_namespace.h | 3 ++
> > kernel/ksysfs.c | 11 +++-
> > lib/kobject_uevent.c | 104 +++++++++++++++++++++++++++++-------
> > net/core/net_namespace.c | 14 +++++
> > 5 files changed, 114 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> > index 7f6f93c3df9c..4e608968907f 100644
> > --- a/include/linux/kobject.h
> > +++ b/include/linux/kobject.h
> > @@ -36,8 +36,10 @@
> > extern char uevent_helper[];
> > #endif
> >
> > +#ifndef CONFIG_NET
> > /* counter to tag the uevent, read only except for the kobject core */
> > extern u64 uevent_seqnum;
> > +#endif
>
> That smells like an implementation bug somewhere.

Sorry, I'm not following. If I'm not mistaken there won't be any struct
net when CONFIG_NET=n. This was reported by the kbuild robot on alpha
with CONFIG_NET=n.

>
> > /*
> > * The actions here must match the index to the string array
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 47e35cce3b64..e4e171b1ba69 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -85,6 +85,8 @@ struct net {
> > struct sock *genl_sock;
> >
> > struct uevent_sock *uevent_sock; /* uevent socket */
> > + /* counter to tag the uevent, read only except for the kobject core */
> > + u64 uevent_seqnum;
> >
> > struct list_head dev_base_head;
> > struct hlist_head *dev_name_head;
> > @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
> >
> > struct net *get_net_ns_by_pid(pid_t pid);
> > struct net *get_net_ns_by_fd(int fd);
> > +u64 get_ns_uevent_seqnum_by_vpid(void);
> >
> > #ifdef CONFIG_SYSCTL
> > void ipx_register_sysctl(void);
> > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
> > index 46ba853656f6..38b70b90a21f 100644
> > --- a/kernel/ksysfs.c
> > +++ b/kernel/ksysfs.c
> > @@ -19,6 +19,7 @@
> > #include <linux/sched.h>
> > #include <linux/capability.h>
> > #include <linux/compiler.h>
> > +#include <net/net_namespace.h>
> >
> > #include <linux/rcupdate.h> /* rcu_expedited and rcu_normal */
> >
> > @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
> > static ssize_t uevent_seqnum_show(struct kobject *kobj,
> > struct kobj_attribute *attr, char *buf)
> > {
> > - return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
> > + u64 seqnum;
> > +
> > +#ifdef CONFIG_NET
> > + seqnum = get_ns_uevent_seqnum_by_vpid();
> > +#else
> > + seqnum = uevent_seqnum;
> > +#endif
>
> This can be simplified to be just:
> seqnum = current->nsproxy->net_ns->uevent_seqnum;

Does that work even if CONFIG_NET=n?

>
> Except that is not correct either. As every instance of sysfs has a
> network namespace associated with it, and you are not fetching that
> network namespace.

I'm not yet familiar with all aspects of sysfs so thanks for pointing
that out. Then I'll try to come up with a way to fetch the network
namespace associated with sysfs. Unless you already know exactly how to
do this and can point it out.
This would also let us drop get_ns_uevent_seqnum_by_vpid().

>
> Typically this would call for making this file per network namespace
> so you would have this information available. Sigh. I don't know if
> there is an easy way to do that for this file.
>
> > +
> > + return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
> > }
> > KERNEL_ATTR_RO(uevent_seqnum);
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index f5f5038787ac..5da20def556d 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -29,21 +29,42 @@
> > #include <net/net_namespace.h>
> >
> >
> > +#ifndef CONFIG_NET
> > u64 uevent_seqnum;
> > +#endif
> > +
> > #ifdef CONFIG_UEVENT_HELPER
> > char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
> > #endif
> >
> > +/*
> > + * Size a buffer needs to be in order to hold the largest possible sequence
> > > + * number stored in a u64: 2^64 - 1 needs 20 chars, plus 1 for the \0 byte.
> > + */
> > +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
> > struct uevent_sock {
> > struct list_head list;
> > struct sock *sk;
> > + /*
> > + * This mutex protects uevent sockets and the uevent counter of
> > + * network namespaces *not* owned by init_user_ns.
> > + * For network namespaces owned by init_user_ns this lock is *not*
> > + * valid instead the global uevent_sock_mutex must be used!
> > + */
> > + struct mutex sk_mutex;
> > };
> >
> > #ifdef CONFIG_NET
> > static LIST_HEAD(uevent_sock_list);
> > #endif
> >
> > -/* This lock protects uevent_seqnum and uevent_sock_list */
> > +/*
> > + * This mutex protects uevent sockets and the uevent counter of network
> > + * namespaces owned by init_user_ns.
> > + * For network namespaces not owned by init_user_ns this lock is *not*
> > + * valid instead the network namespace specific sk_mutex in struct
> > + * uevent_sock must be used!
> > + */
> > static DEFINE_MUTEX(uevent_sock_mutex);
> >
> > /* the strings here must match the enum in include/linux/kobject.h */
> > @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >
> > return 0;
> > }
> > +
> > +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
> > +{
> > + if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
> > + WARN(1, KERN_ERR "Failed to append sequence number. "
> > + "Too many uevent variables\n");
> > + return false;
> > + }
> > +
> > + if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
> > + WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > #endif
> >
> > #ifdef CONFIG_UEVENT_HELPER
> > @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >
> > /* send netlink message */
> > list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> > + /* bump sequence number */
> > + u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
> > struct sock *uevent_sock = ue_sk->sk;
> > + char buf[SEQNUM_BUFSIZE];
> >
> > if (!netlink_has_listeners(uevent_sock, 1))
> > continue;
> >
> > if (!skb) {
> > - /* allocate message with the maximum possible size */
> > + /* calculate header length */
> > size_t len = strlen(action_string) + strlen(devpath) + 2;
> > char *scratch;
> >
> > + /* allocate message with the maximum possible size */
> > retval = -ENOMEM;
> > - skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> > + skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
> > if (!skb)
> > continue;
> >
> > @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> > scratch = skb_put(skb, len);
> > sprintf(scratch, "%s@%s", action_string, devpath);
> >
> > + /* add env */
> > skb_put_data(skb, env->buf, env->buflen);
> >
> > NETLINK_CB(skb).dst_group = 1;
> > }
> >
> > + /* prepare netns seqnum */
> > + retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
> > + if (retval < 0 || retval >= SEQNUM_BUFSIZE)
> > + continue;
> > + retval++;
> > +
> > + if (!can_hold_seqnum(env, retval))
> > + continue;
>
> You have allocated enough space in the skb, so why does can_hold_seqnum
> make sense?

Because it doesn't check whether the socket buffer can hold the sequence
number but checks whether the uevent buffer size in "env" can hold it.
uevents are only delivered if the env buffer is large enough to hold all
of the info including the sequence number. That's independent of the
socket buffer.

>
> Do you need to back seqnum out of the env later for this to work twice
> in a row?

I guess I can just overwrite it. It just felt cleaner to trim it.

>
> > +
> > + /* append netns seqnum */
> > + skb_put_data(skb, buf, retval);
> > +
> > retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
> > 0, 1, GFP_KERNEL,
> > kobj_bcast_filter,
> > @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> > /* ENOBUFS should be handled in userspace */
> > if (retval == -ENOBUFS || retval == -ESRCH)
> > retval = 0;
> > +
> > + /* remove netns seqnum */
> > + skb_trim(skb, env->buflen);
>
> Have you checked to see if the seqnum actually makes it to userspace?

Yes, I did. I also wonder why it wouldn't make it. Any specific reason
why you suspect this?

> > }
> > consume_skb(skb);
> > +#else
> > + uevent_seqnum++;
> > #endif
> > return retval;
> > }
> > @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> > }
> >
> > mutex_lock(&uevent_sock_mutex);
> > - /* we will send an event, so request a new sequence number */
> > - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
> > - if (retval) {
> > - mutex_unlock(&uevent_sock_mutex);
> > - goto exit;
> > - }
> > - retval = kobject_uevent_net_broadcast(kobj, env, action_string,
> > - devpath);
> > + retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
> > mutex_unlock(&uevent_sock_mutex);
>
> How does all of this work with events for network devices that are not
> in the initial network namespace? This looks to me like this code fails
> to take the sk_mutex.

But this list only records network namespaces that are owned by the
initial user namespace, and for those uevent_sock_mutex has to be
taken. Am I missing something?

>
> > #ifdef CONFIG_UEVENT_HELPER
> > @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> > EXPORT_SYMBOL_GPL(add_uevent_var);
> >
> > #if defined(CONFIG_NET)
> > -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> > +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
> > struct netlink_ext_ack *extack)
> > {
> > - /* u64 to chars: 2^64 - 1 = 21 chars */
> > - char buf[sizeof("SEQNUM=") + 21];
> > + struct sock *usk = ue_sk->sk;
> > + char buf[SEQNUM_BUFSIZE];
> > struct sk_buff *skbc;
> > int ret;
> >
> > /* bump and prepare sequence number */
> > - ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
> > - if (ret < 0 || (size_t)ret >= sizeof(buf))
> > + ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
> > + ++sock_net(ue_sk->sk)->uevent_seqnum);
> > + if (ret < 0 || ret >= SEQNUM_BUFSIZE)
> > return -ENOMEM;
> > ret++;
> >
> > @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
> > return -EPERM;
> > }
> >
> > - mutex_lock(&uevent_sock_mutex);
> > - ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
> > - mutex_unlock(&uevent_sock_mutex);
> > + if (net->user_ns == &init_user_ns)
> > + mutex_lock(&uevent_sock_mutex);
> > + else
> > + mutex_lock(&net->uevent_sock->sk_mutex);
> > + ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
> > + if (net->user_ns == &init_user_ns)
> > + mutex_unlock(&uevent_sock_mutex);
> > + else
> > + mutex_unlock(&net->uevent_sock->sk_mutex);
> >
> > return ret;
> > }
> > @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
> > mutex_lock(&uevent_sock_mutex);
> > list_add_tail(&ue_sk->list, &uevent_sock_list);
> > mutex_unlock(&uevent_sock_mutex);
> > + } else {
> > + /*
> > + * Uevent sockets and counters for network namespaces
> > + * not owned by the initial user namespace have their
> > + * own mutex.
> > + */
> > + mutex_init(&ue_sk->sk_mutex);
> > }
> >
> > return 0;
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index a11e03f920d3..8894638f5150 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
> > }
> > EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
> >
> > +u64 get_ns_uevent_seqnum_by_vpid(void)
> > +{
> > + pid_t cur_pid;
> > + struct net *net;
> > +
> > + cur_pid = task_pid_vnr(current);
> > + net = get_net_ns_by_pid(cur_pid);
> > + if (IS_ERR(net))
> > + return 0;
> > +
> > + return net->uevent_seqnum;
> > +}
> > +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
>
> I just have to say this function is completely crazy.
> You go from the tsk to the pid back to the tsk.
> And you leak a struct net pointer.
>
> It is much simpler and less racy to say:
>
> current->nsproxy->net_ns->uevent_seqnum;
>
> That you are accessing current->nsproxy means nsproxy can't
> change. The rcu_read_lock etc that get_net_ns_by_pid does
> is there for accessing non-current tasks.
>
>
>
> > static __net_init int net_ns_net_init(struct net *net)
> > {
> > #ifdef CONFIG_NET_NS

2018-04-24 22:43:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents


Bah. This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace? The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is present in kobj_bcast_filter. That logic will almost
certainly need to be turned inside out. Sigh, not as easy as I would
have hoped.

Eric

Christian Brauner <[email protected]> writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
> Broadcasting uevents into all network namespaces introduces significant
> overhead.
> All processes that listen to uevents running in non-initial user
> namespaces will end up responding to uevents that will be meaningless to
> them. Mainly, because non-initial user namespaces cannot easily manage
> devices unless they have a privileged host-process helping them out. This
> means that there will be a thundering herd of activity when there
> shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
> Uevents are filtered by userspace in a user namespace because the
> received uid != 0. Instead the uid associated with the event will be
> 65534 == "nobody" because the global root uid is not mapped.
> This means we can safely and without introducing regressions modify the
> kernel to not send uevents into all network namespaces whose owning user
> namespace is not the initial user namespace because we know that
> userspace will ignore the message because of the uid anyway. I have
> a) verified that this is true for every udev implementation out there b)
> that this behavior has been present in all udev implementations from the
> very beginning.
> - Removing needless overhead/Increasing performance:
> Currently, the uevent socket for each network namespace is added to the
> global variable uevent_sock_list. The list itself needs to be protected
> by a mutex. So every time a uevent is generated the mutex is taken on the
> list. The mutex is held *from the creation of the uevent (memory
> allocation, string creation etc. until all uevent sockets have been
> handled*. This is aggravated by the fact that for each uevent socket that
> has listeners the mc_list must be walked as well which means we're
> talking O(n^2) here. Given that a standard Linux workload usually has
> quite a lot of network namespaces and - in the face of containers - a lot
> of user namespaces this quickly becomes a performance problem (see
> "Thundering herd" above). By just recording uevent sockets of network
> namespaces that are owned by the initial user namespace we significantly
> increase performance in this codepath.
> - Injecting uevents:
> There's a valid argument that containers might be interested in receiving
> device events especially if they are delegated to them by a privileged
> userspace process. One prime example are SR-IOV enabled devices that are
> explicitly designed to be handed off to other users such as VMs or
> containers.
> This use-case can now be correctly handled since
> commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> introduced the ability to send uevents from userspace. As such we can let
> a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> the network namespace of the netlink socket) userspace process make a
> decision what uevents should be sent. This removes the need to blindly
> broadcast uevents into all user namespaces and provides a performant and
> safe solution to this problem.
> - Filtering logic:
> This patch filters by *owning user namespace of the network namespace a
> given task resides in* and not by user namespace of the task per se. This
> means if the user namespace of a given task is unshared but the network
> namespace is kept and is owned by the initial user namespace a listener
> that is opening the uevent socket in that network namespace can still
> listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner <[email protected]>
> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
> lib/kobject_uevent.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..f5f5038787ac 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
>
> net->uevent_sock = ue_sk;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_add_tail(&ue_sk->list, &uevent_sock_list);
> - mutex_unlock(&uevent_sock_mutex);
> + /* Restrict uevents to initial user namespace. */
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_add_tail(&ue_sk->list, &uevent_sock_list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
> +
> return 0;
> }
>
> @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> {
> struct uevent_sock *ue_sk = net->uevent_sock;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_del(&ue_sk->list);
> - mutex_unlock(&uevent_sock_mutex);
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_del(&ue_sk->list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
>
> netlink_kernel_release(ue_sk->sk);
> kfree(ue_sk);

2018-04-24 22:55:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

Christian Brauner <[email protected]> writes:

> On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > Now that it's possible to have a different set of uevents in different
>> > network namespaces, per-network namespace uevent sequence numbers are
>> > introduced. This increases performance as locking is now restricted to the
>> > network namespace affected by the uevent rather than locking everything.
>> > Testing revealed significant performance improvements. For details see
>> > "Testing" below.
>>
>> Maybe. Your locking is wrong, and a few other things are wrong. see
>> below.
>
> Thanks for the review! Happy to rework this until it's in a mergeable shape.
>
>>
>> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
>> > owned by the initial user namespace can be sent uevents from a sufficiently
>> > privileged userspace process.
>> > In order to send a uevent into a network namespace not owned by the initial
>> > user namespace we currently still need to take the *global mutex* that
>> > locks the uevent socket list even though the list *only contains network
>> > namespaces owned by the initial user namespace*. This needs to be done
>> > because the uevent counter is a global variable. Taking the global lock is
>> > performance sensitive since a user on the host can spawn a pool of n
>> > process that each create their own new user and network namespaces and then
>> > go on to inject uevents in parallel into the network namespace of all of
>> > these processes. This can have a significant performance impact for the
>> > host's udevd since it means that there can be a lot of delay between a
>> > device being added and the corresponding uevent being sent out and
>> > available for processing by udevd. It also means that each network
>> > namespace not owned by the initial user namespace which userspace has sent
>> > a uevent to will need to wait until the lock becomes available.
>> >
>> > Implementation:
>> > This patch gives each network namespace its own uevent sequence number.
>> > Each network namespace not owned by the initial user namespace receives its
>> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
>> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
>> > file it is clearly documented which lock has to be taken. All network
>> > namespaces owned by the initial user namespace will still share the same
>> > lock since they are all served sequentially via the uevent socket list.
>> > This decouples the locking and ensures that the host retrieves uevents as
>> > fast as possible even if there are a lot of uevents injected into network
>> > namespaces not owned by the initial user namespace. In addition, each
>> > network namespace not owned by the initial user namespace does not have to
>> > wait on any other network namespace not sharing the same user namespace.
>> >
>> > Testing:
>> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
>> > with decoupled locking and one without. To ensure that testing made sense
>> > both kernels carried the patch to remove network namespaces not owned by
>> > the initial user namespace from the uevent socket list.
>> > Three tests were constructed. All of them showed significant performance
>> > improvements with per-netns uevent sequence numbers and decoupled locking.
>> >
>> > # Testcase 1:
>> > Only Injecting Uevents into network namespaces not owned by the initial
>> > user namespace.
>> > - created 1000 new user namespace + network namespace pairs
>> > - opened a uevent listener in each of those namespace pairs
>> > - injected uevents into each of those network namespaces 10,000 times
>> > meaning 10,000,000 (10 million) uevents were injected. (The high
>> > number of uevent injections should get rid of a lot of jitter.)
>> > The injection was done by fork()ing 1000 uevent injectors in a simple
>> > for-loop to ensure that uevents were injected in parallel.
>> > - mean transaction time was calculated:
>> > - *without* uevent sequence number namespacing: 67 μs
>> > - *with* uevent sequence number namespacing: 55 μs
>> > - makes a difference of: 12 μs
>> > - a t-test was performed on the two data vectors which revealed
>> > significant performance improvements:
>> > Welch Two Sample t-test
>> > data: x1 and y1
>> > t = 405.16, df = 18883000, p-value < 2.2e-16
>> > alternative hypothesis: true difference in means is not equal to 0
>> > 95 percent confidence interval:
>> > 12.14949 12.26761
>> > sample estimates:
>> > mean of x mean of y
>> > 68.48594 56.27739
>> >
>> > # Testcase 2:
>> > Injecting Uevents into network namespaces not owned by the initial user
>> > namespace and network namespaces owned by the initial user namespace.
>> > - created 500 new user namespace + network namespace pairs
>> > - created 500 new network namespace pairs
>> > - opened a uevent listener in each of those namespace pairs
>> > - injected uevents into each of those network namespaces 10,000 times
>> > meaning 10,000,000 (10 million) uevents were injected. (The high
>> > number of uevent injections should get rid of a lot of jitter.)
>> > The injection was done by fork()ing 1000 uevent injectors in a simple
>> > for-loop to ensure that uevents were injected in parallel.
>> > - mean transaction time was calculated:
>> > - *without* uevent sequence number namespacing: 572 μs
>> > - *with* uevent sequence number namespacing: 514 μs
>> > - makes a difference of: 58 μs
>> > - a t-test was performed on the two data vectors which revealed
>> > significant performance improvements:
>> > Welch Two Sample t-test
>> > data: x2 and y2
>> > t = 38.685, df = 19682000, p-value < 2.2e-16
>> > alternative hypothesis: true difference in means is not equal to 0
>> > 95 percent confidence interval:
>> > 55.10630 60.98815
>> > sample estimates:
>> > mean of x mean of y
>> > 572.9684 514.9211
>> >
>> > # Testcase 3:
>> > Created 500 new user namespace + network namespace pairs *without uevent
>> > listeners*
>> > - created 500 new network namespace pairs *without uevent listeners*
>> > - injected uevents into each of those network namespaces 10,000 times
>> > meaning 10,000,000 (10 million) uevents were injected. (The high number
>> > of uevent injections should get rid of a lot of jitter.)
>> > The injection was done by fork()ing 1000 uevent injectors in a simple
>> > for-loop to ensure that uevents were injected in parallel.
>> > - mean transaction time was calculated:
>> > - *without* uevent sequence number namespacing: 206 μs
>> > - *with* uevent sequence number namespacing: 163 μs
>> > - makes a difference of: 43 μs
>> > - a t-test was performed on the two data vectors which revealed
>> > significant performance improvements:
>> > Welch Two Sample t-test
>> > data: x3 and y3
>> > t = 58.37, df = 17711000, p-value < 2.2e-16
>> > alternative hypothesis: true difference in means is not equal to 0
>> > 95 percent confidence interval:
>> > 41.77860 44.68178
>> > sample estimates:
>> > mean of x mean of y
>> > 207.2632 164.0330
>> >
>> > Signed-off-by: Christian Brauner <[email protected]>
>> > ---
>> > Changelog v1->v2:
>> > * non-functional change: fix indentation for C directives in
>> > kernel/ksysfs.c
>> > Changelog v0->v1:
>> > * add detailed test results to the commit message
>> > * account for kernels compiled without CONFIG_NET
>> > ---
>> > include/linux/kobject.h | 2 +
>> > include/net/net_namespace.h | 3 ++
>> > kernel/ksysfs.c | 11 +++-
>> > lib/kobject_uevent.c | 104 +++++++++++++++++++++++++++++-------
>> > net/core/net_namespace.c | 14 +++++
>> > 5 files changed, 114 insertions(+), 20 deletions(-)
>> >
>> > diff --git a/include/linux/kobject.h b/include/linux/kobject.h
>> > index 7f6f93c3df9c..4e608968907f 100644
>> > --- a/include/linux/kobject.h
>> > +++ b/include/linux/kobject.h
>> > @@ -36,8 +36,10 @@
>> > extern char uevent_helper[];
>> > #endif
>> >
>> > +#ifndef CONFIG_NET
>> > /* counter to tag the uevent, read only except for the kobject core */
>> > extern u64 uevent_seqnum;
>> > +#endif
>>
>> That smells like an implementation bug somewhere.
>
> Sorry, I'm not following. If I'm not mistaken there won't be any struct
> net when CONFIG_NET=n. This has been reported by the kbuild robot with
> alpha and CONFIG_NET=n.

The Kbuild robot isn't wrong. I am just saying this likely comes up
because the code is not clean.

If uevent_seqnum is tied to a network namespace then it doesn't make
sense to exist without network namespaces.

At the very least this feels like an untested special case. Untested
special cases tend to break when people are not paying attention which
can be dangerous.

>> > /*
>> > * The actions here must match the index to the string array
>> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>> > index 47e35cce3b64..e4e171b1ba69 100644
>> > --- a/include/net/net_namespace.h
>> > +++ b/include/net/net_namespace.h
>> > @@ -85,6 +85,8 @@ struct net {
>> > struct sock *genl_sock;
>> >
>> > struct uevent_sock *uevent_sock; /* uevent socket */
>> > + /* counter to tag the uevent, read only except for the kobject core */
>> > + u64 uevent_seqnum;
>> >
>> > struct list_head dev_base_head;
>> > struct hlist_head *dev_name_head;
>> > @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
>> >
>> > struct net *get_net_ns_by_pid(pid_t pid);
>> > struct net *get_net_ns_by_fd(int fd);
>> > +u64 get_ns_uevent_seqnum_by_vpid(void);
>> >
>> > #ifdef CONFIG_SYSCTL
>> > void ipx_register_sysctl(void);
>> > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
>> > index 46ba853656f6..38b70b90a21f 100644
>> > --- a/kernel/ksysfs.c
>> > +++ b/kernel/ksysfs.c
>> > @@ -19,6 +19,7 @@
>> > #include <linux/sched.h>
>> > #include <linux/capability.h>
>> > #include <linux/compiler.h>
>> > +#include <net/net_namespace.h>
>> >
>> > #include <linux/rcupdate.h> /* rcu_expedited and rcu_normal */
>> >
>> > @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
>> > static ssize_t uevent_seqnum_show(struct kobject *kobj,
>> > struct kobj_attribute *attr, char *buf)
>> > {
>> > - return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
>> > + u64 seqnum;
>> > +
>> > +#ifdef CONFIG_NET
>> > + seqnum = get_ns_uevent_seqnum_by_vpid();
>> > +#else
>> > + seqnum = uevent_seqnum;
>> > +#endif
>>
>> This can be simplified to be just:
>> seqnum = current->nsproxy->net_ns->uevent_seqnum;
>
> Does that work even if CONFIG_NET=n?

No. It is a NULL pointer dereference but the net_ns member continues
to exist.

>> Except that is not correct either. As every instance of sysfs has a
>> network namespace associated with it, and you are not fetching that
>> network namespace.
>
> I'm not yet familiar with all aspects of sysfs so thanks for pointing
> that out. Then I'll try to come up with a way to fetch the network
> namespace associated with sysfs. Unless you already know exactly how to
> do this and can point it out.
> This would also let us drop get_ns_uevent_seqnum_by_vpid().

Everywhere else the distinction has been at a directory level, with the
set of directory entries being different depending on which network
namespace sysfs was mounted in. For /sys/kernel/seqnum that looks like
fresh work.

Plus I expect there are some embedded systems without networking that
may still need uevents via /sbin/hotplug. Which makes this doubly
tricky.

I honestly don't know the answer to this one.

>> Typically this would call for making this file per network namespace
>> so you would have this information available. Sigh. I don't know if
>> there is an easy way to do that for this file.
>>
>> > +
>> > + return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
>> > }
>> > KERNEL_ATTR_RO(uevent_seqnum);
>> >
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index f5f5038787ac..5da20def556d 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -29,21 +29,42 @@
>> > #include <net/net_namespace.h>
>> >
>> >
>> > +#ifndef CONFIG_NET
>> > u64 uevent_seqnum;
>> > +#endif
>> > +
>> > #ifdef CONFIG_UEVENT_HELPER
>> > char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
>> > #endif
>> >
>> > +/*
>> > + * Size a buffer needs to be in order to hold the largest possible sequence
>> > + * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
>> > + */
>> > +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
>> > struct uevent_sock {
>> > struct list_head list;
>> > struct sock *sk;
>> > + /*
>> > + * This mutex protects uevent sockets and the uevent counter of
>> > + * network namespaces *not* owned by init_user_ns.
>> > + * For network namespaces owned by init_user_ns this lock is *not*
>> > + * valid instead the global uevent_sock_mutex must be used!
>> > + */
>> > + struct mutex sk_mutex;
>> > };
>> >
>> > #ifdef CONFIG_NET
>> > static LIST_HEAD(uevent_sock_list);
>> > #endif
>> >
>> > -/* This lock protects uevent_seqnum and uevent_sock_list */
>> > +/*
>> > + * This mutex protects uevent sockets and the uevent counter of network
>> > + * namespaces owned by init_user_ns.
>> > + * For network namespaces not owned by init_user_ns this lock is *not*
>> > + * valid instead the network namespace specific sk_mutex in struct
>> > + * uevent_sock must be used!
>> > + */
>> > static DEFINE_MUTEX(uevent_sock_mutex);
>> >
>> > /* the strings here must match the enum in include/linux/kobject.h */
>> > @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >
>> > return 0;
>> > }
>> > +
>> > +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
>> > +{
>> > + if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
>> > + WARN(1, KERN_ERR "Failed to append sequence number. "
>> > + "Too many uevent variables\n");
>> > + return false;
>> > + }
>> > +
>> > + if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
>> > + WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
>> > + return false;
>> > + }
>> > +
>> > + return true;
>> > +}
>> > #endif
>> >
>> > #ifdef CONFIG_UEVENT_HELPER
>> > @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> >
>> > /* send netlink message */
>> > list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>> > + /* bump sequence number */
>> > + u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
>> > struct sock *uevent_sock = ue_sk->sk;
>> > + char buf[SEQNUM_BUFSIZE];
>> >
>> > if (!netlink_has_listeners(uevent_sock, 1))
>> > continue;
>> >
>> > if (!skb) {
>> > - /* allocate message with the maximum possible size */
>> > + /* calculate header length */
>> > size_t len = strlen(action_string) + strlen(devpath) + 2;
>> > char *scratch;
>> >
>> > + /* allocate message with the maximum possible size */
>> > retval = -ENOMEM;
>> > - skb = alloc_skb(len + env->buflen, GFP_KERNEL);
>> > + skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
>> > if (!skb)
>> > continue;
>> >
>> > @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> > scratch = skb_put(skb, len);
>> > sprintf(scratch, "%s@%s", action_string, devpath);
>> >
>> > + /* add env */
>> > skb_put_data(skb, env->buf, env->buflen);
>> >
>> > NETLINK_CB(skb).dst_group = 1;
>> > }
>> >
>> > + /* prepare netns seqnum */
>> > + retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
>> > + if (retval < 0 || retval >= SEQNUM_BUFSIZE)
>> > + continue;
>> > + retval++;
>> > +
>> > + if (!can_hold_seqnum(env, retval))
>> > + continue;
>>
>> You have allocated enough space in the skb why does can_hold_seqnum make
>> sense?
>
> Because it doesn't check whether the socket buffer can hold the sequence
> number but checks whether the uevent buffer size in "env" can hold it.
> uevents are only delivered if the env buffer is large enough to hold all
> of the info including the sequence number. That's independent of the
> socket buffer.

The practical question is why does that get called every time through
the loop?

>>
>> Do you need to back seqnum out of the env later for this to work twice
>> in a row?
>
> I guess I can just overwrite it. It just felt cleaner to trim it.

You trim it from the socket, but what about the environment?

>
>>
>> > +
>> > + /* append netns seqnum */
>> > + skb_put_data(skb, buf, retval);
>> > +
>> > retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
>> > 0, 1, GFP_KERNEL,
>> > kobj_bcast_filter,
>> > @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> > /* ENOBUFS should be handled in userspace */
>> > if (retval == -ENOBUFS || retval == -ESRCH)
>> > retval = 0;
>> > +
>> > + /* remove netns seqnum */
>> > + skb_trim(skb, env->buflen);
>>
>> Have you checked to see if the seqnum actually makes it to userspace.
>
> Yes, I did. I also wonder why it wouldn't make it. Any specific reason
> why you suspect this?

trim? I am not really certain what that does, and whether, when you
deliver to multiple network namespaces, it might not cause the wrong
seqnum to be delivered. You are using the same initial buffer.

I haven't stared at the socket buffer code enough to remember off hand
what the semantics are. Perhaps whatever netlink_broadcast does
is sufficient to prevent issues.

>
>> > }
>> > consume_skb(skb);
>> > +#else
>> > + uevent_seqnum++;
>> > #endif
>> > return retval;
>> > }
>> > @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>> > }
>> >
>> > mutex_lock(&uevent_sock_mutex);
>> > - /* we will send an event, so request a new sequence number */
>> > - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
>> > - if (retval) {
>> > - mutex_unlock(&uevent_sock_mutex);
>> > - goto exit;
>> > - }
>> > - retval = kobject_uevent_net_broadcast(kobj, env, action_string,
>> > - devpath);
>> > + retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
>> > mutex_unlock(&uevent_sock_mutex);
>>
>> How does all of this work with events for network devices that are not
>> in the initial network namespace. This looks to me like this code fails
>> to take the sk_mutex.
>
> But in this list only non-initial network namespaces that are owned by
> the initial user namespace are recorded and for these uevent_sock_mutex
> has to be taken. Am I missing something?

The bug I missed in your first patch. We still have to deliver uevents
to secondary network namespaces for their network devices.
At which point you need your new sk_mutex.


>> > #ifdef CONFIG_UEVENT_HELPER
>> > @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>> > EXPORT_SYMBOL_GPL(add_uevent_var);
>> >
>> > #if defined(CONFIG_NET)
>> > -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
>> > +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
>> > struct netlink_ext_ack *extack)
>> > {
>> > - /* u64 to chars: 2^64 - 1 = 21 chars */
>> > - char buf[sizeof("SEQNUM=") + 21];
>> > + struct sock *usk = ue_sk->sk;
>> > + char buf[SEQNUM_BUFSIZE];
>> > struct sk_buff *skbc;
>> > int ret;
>> >
>> > /* bump and prepare sequence number */
>> > - ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
>> > - if (ret < 0 || (size_t)ret >= sizeof(buf))
>> > + ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
>> > + ++sock_net(ue_sk->sk)->uevent_seqnum);
>> > + if (ret < 0 || ret >= SEQNUM_BUFSIZE)
>> > return -ENOMEM;
>> > ret++;
>> >
>> > @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
>> > return -EPERM;
>> > }
>> >
>> > - mutex_lock(&uevent_sock_mutex);
>> > - ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
>> > - mutex_unlock(&uevent_sock_mutex);
>> > + if (net->user_ns == &init_user_ns)
>> > + mutex_lock(&uevent_sock_mutex);
>> > + else
>> > + mutex_lock(&net->uevent_sock->sk_mutex);
>> > + ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
>> > + if (net->user_ns == &init_user_ns)
>> > + mutex_unlock(&uevent_sock_mutex);
>> > + else
>> > + mutex_unlock(&net->uevent_sock->sk_mutex);
>> >
>> > return ret;
>> > }
>> > @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
>> > mutex_lock(&uevent_sock_mutex);
>> > list_add_tail(&ue_sk->list, &uevent_sock_list);
>> > mutex_unlock(&uevent_sock_mutex);
>> > + } else {
>> > + /*
>> > + * Uevent sockets and counters for network namespaces
>> > + * not owned by the initial user namespace have their
>> > + * own mutex.
>> > + */
>> > + mutex_init(&ue_sk->sk_mutex);
>> > }
>> >
>> > return 0;
>> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> > index a11e03f920d3..8894638f5150 100644
>> > --- a/net/core/net_namespace.c
>> > +++ b/net/core/net_namespace.c
>> > @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
>> > }
>> > EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>> >
>> > +u64 get_ns_uevent_seqnum_by_vpid(void)
>> > +{
>> > + pid_t cur_pid;
>> > + struct net *net;
>> > +
>> > + cur_pid = task_pid_vnr(current);
>> > + net = get_net_ns_by_pid(cur_pid);
>> > + if (IS_ERR(net))
>> > + return 0;
>> > +
>> > + return net->uevent_seqnum;
>> > +}
>> > +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
>>
>> I just have to say this function is completely crazy.
>> You go from the tsk to the pid back to the tsk.
>> And you leak a struct net pointer.
>>
>> It is much simpler and less racy to say:
>>
>> current->nsproxy->net_ns->uevent_seqnum;
>>
>> That you are accessing current->nsproxy means nsproxy can't
>> change. The rcu_read_lock etc that get_net_ns_by_pid does
>> is there for accessing non-current tasks.
>>
>>
>>
>> > static __net_init int net_ns_net_init(struct net *net)
>> > {
>> > #ifdef CONFIG_NET_NS

2018-04-24 22:58:07

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
>
> Bah. This code is obviously correct and probably wrong.
>
> How do we deliver uevents for network devices that are outside of the
> initial user namespace? The kernel still needs to deliver those.
>
> The logic to figure out which network namespace a device needs to be
> delivered to is present in kobj_bcast_filter. That logic will almost
> certainly need to be turned inside out. Sigh, not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

>
> Eric
>
> Christian Brauner <[email protected]> writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example is network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discussed on the mailing list [1].
> > - Thundering herd:
> > Broadcasting uevents into all network namespaces introduces significant
> > overhead.
> > All processes that listen to uevents running in non-initial user
> > namespaces will end up responding to uevents that will be meaningless to
> > them. Mainly, because non-initial user namespaces cannot easily manage
> > devices unless they have a privileged host-process helping them out. This
> > means that there will be a thundering herd of activity when there
> > shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> > Uevents are filtered by userspace in a user namespace because the
> > received uid != 0. Instead the uid associated with the event will be
> > 65534 == "nobody" because the global root uid is not mapped.
> > This means we can safely and without introducing regressions modify the
> > kernel to not send uevents into all network namespaces whose owning user
> > namespace is not the initial user namespace because we know that
> > userspace will ignore the message because of the uid anyway. I have
> > a) verified that this is true for every udev implementation out there and b)
> > that this behavior has been present in all udev implementations from the
> > very beginning.
> > - Removing needless overhead/Increasing performance:
> > Currently, the uevent socket for each network namespace is added to the
> > global variable uevent_sock_list. The list itself needs to be protected
> > by a mutex. So every time a uevent is generated the mutex is taken on the
> > list. The mutex is held *from the creation of the uevent (memory
> > allocation, string creation, etc.) until all uevent sockets have been
> > handled*. This is aggravated by the fact that for each uevent socket that
> > has listeners the mc_list must be walked as well which means we're
> > talking O(n^2) here. Given that a standard Linux workload usually has
> > quite a lot of network namespaces and - in the face of containers - a lot
> > of user namespaces this quickly becomes a performance problem (see
> > "Thundering herd" above). By just recording uevent sockets of network
> > namespaces that are owned by the initial user namespace we significantly
> > increase performance in this codepath.
> > - Injecting uevents:
> > There's a valid argument that containers might be interested in receiving
> > device events especially if they are delegated to them by a privileged
> > userspace process. One prime example are SR-IOV enabled devices that are
> > explicitly designed to be handed off to other users such as VMs or
> > containers.
> > This use-case can now be correctly handled since
> > commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> > introduced the ability to send uevents from userspace. As such we can let
> > a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> > the network namespace of the netlink socket) userspace process make a
> > decision what uevents should be sent. This removes the need to blindly
> > broadcast uevents into all user namespaces and provides a performant and
> > safe solution to this problem.
> > - Filtering logic:
> > This patch filters by *owning user namespace of the network namespace a
> > given task resides in* and not by user namespace of the task per se. This
> > means if the user namespace of a given task is unshared but the network
> > namespace is kept and is owned by the initial user namespace a listener
> > that is opening the uevent socket in that network namespace can still
> > listen to uevents.
> >
> > [1]: https://lkml.org/lkml/2018/4/4/739
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
> > Changelog v1->v2:
> > * patch unchanged
> > Changelog v0->v1:
> > * patch unchanged
> > ---
> > lib/kobject_uevent.c | 18 ++++++++++++------
> > 1 file changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..f5f5038787ac 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
> >
> > net->uevent_sock = ue_sk;
> >
> > - mutex_lock(&uevent_sock_mutex);
> > - list_add_tail(&ue_sk->list, &uevent_sock_list);
> > - mutex_unlock(&uevent_sock_mutex);
> > + /* Restrict uevents to initial user namespace. */
> > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > + mutex_lock(&uevent_sock_mutex);
> > + list_add_tail(&ue_sk->list, &uevent_sock_list);
> > + mutex_unlock(&uevent_sock_mutex);
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> > {
> > struct uevent_sock *ue_sk = net->uevent_sock;
> >
> > - mutex_lock(&uevent_sock_mutex);
> > - list_del(&ue_sk->list);
> > - mutex_unlock(&uevent_sock_mutex);
> > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > + mutex_lock(&uevent_sock_mutex);
> > + list_del(&ue_sk->list);
> > + mutex_unlock(&uevent_sock_mutex);
> > + }
> >
> > netlink_kernel_release(ue_sk->sk);
> > kfree(ue_sk);

2018-04-24 23:03:52

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

Christian Brauner <[email protected]> writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
>
> Bah. This code is obviously correct and probably wrong.
>
> How do we deliver uevents for network devices that are outside of the
> initial user namespace? The kernel still needs to deliver those.
>
> The logic to figure out which network namespace a device needs to be
> delivered to is present in kobj_bcast_filter. That logic will almost
> certainly need to be turned inside out. Sigh, not as easy as I would
> have hoped.
>
> My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have misunderstood.

I heard and am still hearing additional filtering to reduce the places
the packet is delivered.

I am saying something needs to change to increase the number of places
the packet is delivered.

The special class of devices that kobj_bcast_filter would apply to
needs to be delivered to network namespaces that are no longer on
uevent_sock_list.

So the code fundamentally needs to split into two paths. Ordinary
devices that use uevent_sock_list. Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem. We
will unfortunately drop those uevents today because their uids are not
valid. But they are not delivered anywhere else so to allow them to be
received we need to fix them.

Eric

>
> Christian Brauner <[email protected]> writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example is network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discussed on the mailing list [1].
> > - Thundering herd:
> > Broadcasting uevents into all network namespaces introduces significant
> > overhead.
> > All processes that listen to uevents running in non-initial user
> > namespaces will end up responding to uevents that will be meaningless to
> > them. Mainly, because non-initial user namespaces cannot easily manage
> > devices unless they have a privileged host-process helping them out. This
> > means that there will be a thundering herd of activity when there
> > shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> > Uevents are filtered by userspace in a user namespace because the
> > received uid != 0. Instead the uid associated with the event will be
> > 65534 == "nobody" because the global root uid is not mapped.
> > This means we can safely and without introducing regressions modify the
> > kernel to not send uevents into all network namespaces whose owning user
> > namespace is not the initial user namespace because we know that
> > userspace will ignore the message because of the uid anyway. I have
> > a) verified that this is true for every udev implementation out there and b)
> > that this behavior has been present in all udev implementations from the
> > very beginning.
> > - Removing needless overhead/Increasing performance:
> > Currently, the uevent socket for each network namespace is added to the
> > global variable uevent_sock_list. The list itself needs to be protected
> > by a mutex. So every time a uevent is generated the mutex is taken on the
> > list. The mutex is held *from the creation of the uevent (memory
> > allocation, string creation, etc.) until all uevent sockets have been
> > handled*. This is aggravated by the fact that for each uevent socket that
> > has listeners the mc_list must be walked as well which means we're
> > talking O(n^2) here. Given that a standard Linux workload usually has
> > quite a lot of network namespaces and - in the face of containers - a lot
> > of user namespaces this quickly becomes a performance problem (see
> > "Thundering herd" above). By just recording uevent sockets of network
> > namespaces that are owned by the initial user namespace we significantly
> > increase performance in this codepath.
> > - Injecting uevents:
> > There's a valid argument that containers might be interested in receiving
> > device events especially if they are delegated to them by a privileged
> > userspace process. One prime example are SR-IOV enabled devices that are
> > explicitly designed to be handed off to other users such as VMs or
> > containers.
> > This use-case can now be correctly handled since
> > commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> > introduced the ability to send uevents from userspace. As such we can let
> > a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> > the network namespace of the netlink socket) userspace process make a
> > decision what uevents should be sent. This removes the need to blindly
> > broadcast uevents into all user namespaces and provides a performant and
> > safe solution to this problem.
> > - Filtering logic:
> > This patch filters by *owning user namespace of the network namespace a
> > given task resides in* and not by user namespace of the task per se. This
> > means if the user namespace of a given task is unshared but the network
> > namespace is kept and is owned by the initial user namespace a listener
> > that is opening the uevent socket in that network namespace can still
> > listen to uevents.
> >
> > [1]: https://lkml.org/lkml/2018/4/4/739
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
> > Changelog v1->v2:
> > * patch unchanged
> > Changelog v0->v1:
> > * patch unchanged
> > ---
> > lib/kobject_uevent.c | 18 ++++++++++++------
> > 1 file changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..f5f5038787ac 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
> >
> > net->uevent_sock = ue_sk;
> >
> > - mutex_lock(&uevent_sock_mutex);
> > - list_add_tail(&ue_sk->list, &uevent_sock_list);
> > - mutex_unlock(&uevent_sock_mutex);
> > + /* Restrict uevents to initial user namespace. */
> > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > + mutex_lock(&uevent_sock_mutex);
> > + list_add_tail(&ue_sk->list, &uevent_sock_list);
> > + mutex_unlock(&uevent_sock_mutex);
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> > {
> > struct uevent_sock *ue_sk = net->uevent_sock;
> >
> > - mutex_lock(&uevent_sock_mutex);
> > - list_del(&ue_sk->list);
> > - mutex_unlock(&uevent_sock_mutex);
> > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > + mutex_lock(&uevent_sock_mutex);
> > + list_del(&ue_sk->list);
> > + mutex_unlock(&uevent_sock_mutex);
> > + }
> >
> > netlink_kernel_release(ue_sk->sk);
> > kfree(ue_sk);

2018-04-26 16:15:29

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
> >
> > Bah. This code is obviously correct and probably wrong.
> >
> > How do we deliver uevents for network devices that are outside of the
> > initial user namespace? The kernel still needs to deliver those.
> >
> > The logic to figure out which network namespace a device needs to be
> > delivered to is present in kobj_bcast_filter. That logic will almost
> > certainly need to be turned inside out. Sigh, not as easy as I would
> > have hoped.
> >
> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> > out and come up with a new patch.
>
> I may have misunderstood.
>
> I heard and am still hearing additional filtering to reduce the places
> the packet is delivered.
>
> I am saying something needs to change to increase the number of places
> the packet is delivered.
>
> The special class of devices that kobj_bcast_filter would apply to
> needs to be delivered to network namespaces that are no longer on
> uevent_sock_list.
>
> So the code fundamentally needs to split into two paths. Ordinary
> devices that use uevent_sock_list. Network devices that are just
> delivered in their own network namespace.
>
> netlink_broadcast_filtered gets to go away completely.

The split *might* make sense but I think you're wrong about removing the
kobj_bcast_filter. The current filter doesn't operate on the uevent
socket in uevent_sock_list itself it rather operates on the sockets in
mc_list. And if socket in mc_list can have a different network namespace
then the uevent_socket itself then your way won't work. That's why my
original patch added additional filtering in there. The way I see it we
need something like:

init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);

The question that remains is whether we can rely on the network
namespace information we can gather from the kobject_ns_type_operations
to decide where we want to broadcast that event to. So something *like*:

ops = kobj_ns_ops(kobj);
if (!ops && kobj->kset) {
	struct kobject *ksobj = &kobj->kset->kobj;

	if (ksobj->parent != NULL)
		ops = kobj_ns_ops(ksobj->parent);
}

if (ops && ops->netlink_ns && kobj->ktype->namespace)
	if (ops->type == KOBJ_NS_TYPE_NET)
		net = kobj->ktype->namespace(kobj);

if (!net || net->user_ns == &init_user_ns)
	ret = init_user_ns_broadcast(env, action_string, devpath);
else
	ret = user_ns_broadcast(net->uevent_sock->sk, env,
				action_string, devpath);

Christian

> The logic of figuring out the network namespace though becomes trickier.
>
> Now it may make sense to have all of that as an additional patch on top
> of this one or perhaps a precursor patch that addresses the problem. We
> will unfortunately drop those uevents today because their uids are not
> valid. But they are not delivered anywhere else so to allow them to be
> received we need to fix them.
>
> Eric
>
> >
> > Christian Brauner <[email protected]> writes:
> > > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> > >
> > > enabled sending hotplug events into all network namespaces back in 2010.
> > > Over time the set of uevents that get sent into all network namespaces has
> > > shrunk a little. We have now reached the point where hotplug events for all
> > > devices that carry a namespace tag are filtered according to that
> > > namespace. Specifically, they are filtered whenever the namespace tag of
> > > the kobject does not match the namespace tag of the netlink socket. One
> > > example is network devices. Uevents for network devices only show up in
> > > the network namespaces these devices are moved to or created in.
> > >
> > > However, any uevent for a kobject that does not have a namespace tag
> > > associated with it will not be filtered and we will broadcast it into all
> > > network namespaces. This behavior stopped making sense when user namespaces
> > > were introduced.
> > >
> > > This patch restricts uevents to the initial user namespace for a couple of
> > > reasons that have been extensively discussed on the mailing list [1].
> > > - Thundering herd:
> > > Broadcasting uevents into all network namespaces introduces significant
> > > overhead.
> > > All processes that listen to uevents running in non-initial user
> > > namespaces will end up responding to uevents that will be meaningless to
> > > them. Mainly, because non-initial user namespaces cannot easily manage
> > > devices unless they have a privileged host-process helping them out. This
> > > means that there will be a thundering herd of activity when there
> > > shouldn't be any.
> > > - Uevents from non-root users are already filtered in userspace:
> > > Uevents are filtered by userspace in a user namespace because the
> > > received uid != 0. Instead the uid associated with the event will be
> > > 65534 == "nobody" because the global root uid is not mapped.
> > > This means we can safely and without introducing regressions modify the
> > > kernel to not send uevents into all network namespaces whose owning user
> > > namespace is not the initial user namespace because we know that
> > > userspace will ignore the message because of the uid anyway. I have
> > > a) verified that this is true for every udev implementation out there and b)
> > > that this behavior has been present in all udev implementations from the
> > > very beginning.
> > > - Removing needless overhead/Increasing performance:
> > > Currently, the uevent socket for each network namespace is added to the
> > > global variable uevent_sock_list. The list itself needs to be protected
> > > by a mutex. So every time a uevent is generated the mutex is taken on the
> > > list. The mutex is held *from the creation of the uevent (memory
> > > allocation, string creation, etc.) until all uevent sockets have been
> > > handled*. This is aggravated by the fact that for each uevent socket that
> > > has listeners the mc_list must be walked as well which means we're
> > > talking O(n^2) here. Given that a standard Linux workload usually has
> > > quite a lot of network namespaces and - in the face of containers - a lot
> > > of user namespaces this quickly becomes a performance problem (see
> > > "Thundering herd" above). By just recording uevent sockets of network
> > > namespaces that are owned by the initial user namespace we significantly
> > > increase performance in this codepath.
> > > - Injecting uevents:
> > > There's a valid argument that containers might be interested in receiving
> > > device events especially if they are delegated to them by a privileged
> > > userspace process. One prime example are SR-IOV enabled devices that are
> > > explicitly designed to be handed off to other users such as VMs or
> > > containers.
> > > This use-case can now be correctly handled since
> > > commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> > > introduced the ability to send uevents from userspace. As such we can let
> > > a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> > > the network namespace of the netlink socket) userspace process make a
> > > decision what uevents should be sent. This removes the need to blindly
> > > broadcast uevents into all user namespaces and provides a performant and
> > > safe solution to this problem.
> > > - Filtering logic:
> > > This patch filters by *owning user namespace of the network namespace a
> > > given task resides in* and not by user namespace of the task per se. This
> > > means if the user namespace of a given task is unshared but the network
> > > namespace is kept and is owned by the initial user namespace a listener
> > > that is opening the uevent socket in that network namespace can still
> > > listen to uevents.
> > >
> > > [1]: https://lkml.org/lkml/2018/4/4/739
> > > Signed-off-by: Christian Brauner <[email protected]>
> > > ---
> > > Changelog v1->v2:
> > > * patch unchanged
> > > Changelog v0->v1:
> > > * patch unchanged
> > > ---
> > > lib/kobject_uevent.c | 18 ++++++++++++------
> > > 1 file changed, 12 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > > index 15ea216a67ce..f5f5038787ac 100644
> > > --- a/lib/kobject_uevent.c
> > > +++ b/lib/kobject_uevent.c
> > > @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
> > >
> > > net->uevent_sock = ue_sk;
> > >
> > > - mutex_lock(&uevent_sock_mutex);
> > > - list_add_tail(&ue_sk->list, &uevent_sock_list);
> > > - mutex_unlock(&uevent_sock_mutex);
> > > + /* Restrict uevents to initial user namespace. */
> > > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > > + mutex_lock(&uevent_sock_mutex);
> > > + list_add_tail(&ue_sk->list, &uevent_sock_list);
> > > + mutex_unlock(&uevent_sock_mutex);
> > > + }
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> > > {
> > > struct uevent_sock *ue_sk = net->uevent_sock;
> > >
> > > - mutex_lock(&uevent_sock_mutex);
> > > - list_del(&ue_sk->list);
> > > - mutex_unlock(&uevent_sock_mutex);
> > > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > > + mutex_lock(&uevent_sock_mutex);
> > > + list_del(&ue_sk->list);
> > > + mutex_unlock(&uevent_sock_mutex);
> > > + }
> > >
> > > netlink_kernel_release(ue_sk->sk);
> > > kfree(ue_sk);

2018-04-26 16:50:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

Christian Brauner <[email protected]> writes:

> On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
>> >
>> > Bah. This code is obviously correct and probably wrong.
>> >
>> > How do we deliver uevents for network devices that are outside of the
>> > initial user namespace? The kernel still needs to deliver those.
>> >
>> > The logic to figure out which network namespace a device needs to be
>> > delivered to is present in kobj_bcast_filter. That logic will almost
>> > certainly need to be turned inside out. Sigh, not as easy as I would
>> > have hoped.
>> >
>> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
>> > out and come up with a new patch.
>>
>> I may have misunderstood.
>>
>> I heard and am still hearing additional filtering to reduce the places
>> the packet is delivered.
>>
>> I am saying something needs to change to increase the number of places
>> the packet is delivered.
>>
>> The special class of devices that kobj_bcast_filter would apply to
>> needs to be delivered to network namespaces that are no longer on
>> uevent_sock_list.
>>
>> So the code fundamentally needs to split into two paths. Ordinary
>> devices that use uevent_sock_list. Network devices that are just
>> delivered in their own network namespace.
>>
>> netlink_broadcast_filtered gets to go away completely.
>
> The split *might* make sense but I think you're wrong about removing the
> kobj_bcast_filter. The current filter doesn't operate on the uevent
> socket in uevent_sock_list itself; rather, it operates on the sockets in
> mc_list. And if a socket in mc_list can have a different network namespace
> than the uevent socket itself, then your way won't work. That's why my
> original patch added additional filtering in there. The way I see it we
> need something like:

We already filter the sockets in the mc_list by network namespace.

When a packet is transmitted with netlink_broadcast it is only
transmitted within a single network namespace.

Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
with its source network namespace so no confusion will result, and the
permission checks have been done to make it safe. So you can safely
ignore that case. Please ignore that case. It only needs to be
considered if refactoring af_netlink.c

When I added netlink_broadcast_filtered I imagined that we would need
code that worked across network namespaces and for different kinds of
namespaces. So it looked like we would need the level of granularity
that you can get with netlink_broadcast_filtered. It turns out we don't
and that it was a case of over design. As the only split we care about
is per network namespace there is no need for
netlink_broadcast_filtered.

> init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>
> The question that remains is whether we can rely on the network
> namespace information we can gather from the kobject_ns_type_operations
> to decide where we want to broadcast that event to. So something
> *like*:

We can. We already do. That is what kobj_bcast_filter implements.

> ops = kobj_ns_ops(kobj);
> if (!ops && kobj->kset) {
> 	struct kobject *ksobj = &kobj->kset->kobj;
> 	if (ksobj->parent != NULL)
> 		ops = kobj_ns_ops(ksobj->parent);
> }
>
> if (ops && ops->netlink_ns && kobj->ktype->namespace)
> 	if (ops->type == KOBJ_NS_TYPE_NET)
> 		net = kobj->ktype->namespace(kobj);

Please note the only entry in the kobj_ns_type enumeration other than
KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the
check for ops->type in this case is redundant.

That is something else that could be simplified. At the time it was
necessary to get the sysfs changes merged.

> if (!net || net->user_ns == &init_user_ns)
> 	ret = init_user_ns_broadcast(env, action_string, devpath);
> else
> 	ret = user_ns_broadcast(net->uevent_sock->sk, env,
> 				action_string, devpath);

Almost.

if (!net)
	kobject_uevent_net_broadcast(kobj, env, action_string,
				     dev_path);
else
	netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);


I am handwaving to get the skb in the netlink_broadcast case but that
should be enough for you to see what I am thinking.

My only concern with the above is that we almost certainly need to fix
the credentials on the skb: today a packet sent to a non-initial network
namespace carries credentials that will cause userspace to drop it.

But it should be straightforward to look at net->user_ns to fix the
credentials.

Eric

2018-04-26 17:04:58

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
> >> >
> >> > Bah. This code is obviously correct and probably wrong.
> >> >
> >> > How do we deliver uevents for network devices that are outside of the
> >> > initial user namespace? The kernel still needs to deliver those.
> >> >
> >> > The logic to figure out which network namespace a device needs to be
> >> > delivered to is present in kobj_bcast_filter. That logic will almost
> >> > certainly need to be turned inside out. Sigh, not as easy as I would
> >> > have hoped.
> >> >
> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> >> > out and come up with a new patch.
> >>
> >> I may have misunderstood.
> >>
> >> I heard and am still hearing additional filtering to reduce the places
> >> the packet is delivered.
> >>
> >> I am saying something needs to change to increase the number of places
> >> the packet is delivered.
> >>
> >> The special class of devices that kobj_bcast_filter would apply to
> >> needs to be delivered to network namespaces that are no longer on
> >> uevent_sock_list.
> >>
> >> So the code fundamentally needs to split into two paths. Ordinary
> >> devices that use uevent_sock_list. Network devices that are just
> >> delivered in their own network namespace.
> >>
> >> netlink_broadcast_filtered gets to go away completely.
> >
> > The split *might* make sense but I think you're wrong about removing the
> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> > socket in uevent_sock_list itself; rather, it operates on the sockets in
> > mc_list. And if a socket in mc_list can have a different network namespace
> > than the uevent socket itself, then your way won't work. That's why my
> > original patch added additional filtering in there. The way I see it we
> > need something like:
>
> We already filter the sockets in the mc_list by network namespace.

Oh really? That's good to know. I haven't found where in the code this
actually happens. I thought that when netlink_bind() is called anyone
could register themselves in mc_list.

>
> When a packet is transmitted with netlink_broadcast it is only
> transmitted within a single network namespace.
>
> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> with its source network namespace so no confusion will result, and the
> permission checks have been done to make it safe. So you can safely
> ignore that case. Please ignore that case. It only needs to be
> considered if refactoring af_netlink.c
>
> When I added netlink_broadcast_filtered I imagined that we would need
> code that worked across network namespaces and for different kinds of
> namespaces. So it looked like we would need the level of granularity
> that you can get with netlink_broadcast_filtered. It turns out we don't
> and that it was a case of over design. As the only split we care about
> is per network namespace there is no need for
> netlink_broadcast_filtered.
>
> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >
> > The question that remains is whether we can rely on the network
> > namespace information we can gather from the kobject_ns_type_operations
> > to decide where we want to broadcast that event to. So something
> > *like*:
>
> We can. We already do. That is what kobj_bcast_filter implements.
>
> > ops = kobj_ns_ops(kobj);
> > if (!ops && kobj->kset) {
> >         struct kobject *ksobj = &kobj->kset->kobj;
> >         if (ksobj->parent != NULL)
> >                 ops = kobj_ns_ops(ksobj->parent);
> > }
> >
> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
> >         if (ops->type == KOBJ_NS_TYPE_NET)
> >                 net = kobj->ktype->namespace(kobj);
>
> Please note the only entry in the kobj_ns_type enumeration other than
> KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the
> check for ops->type in this case is redundant.

Yes, I know the reason for doing it explicitly is to block the case
where kobjects get tagged with other namespaces. So we'd need to be
vigilant should that ever happen but fine.

>
> That is something else that could be simplified. At the time it was
> necessary to get the sysfs changes merged.
>
> > if (!net || net->user_ns == &init_user_ns)
> >         ret = init_user_ns_broadcast(env, action_string, devpath);
> > else
> >         ret = user_ns_broadcast(net->uevent_sock->sk, env,
> >                                 action_string, devpath);
>
> Almost.
>
> if (!net)
>         kobject_uevent_net_broadcast(kobj, env, action_string,
>                                      dev_path);
> else
>         netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
>
>
> I am handwaving to get the skb in the netlink_broadcast case but that
> should be enough for you to see what I am thinking.

I have added a helper alloc_uevent_skb() that can be used in both cases.

static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
                                        const char *action_string,
                                        const char *devpath)
{
        struct sk_buff *skb = NULL;
        char *scratch;
        size_t len;

        /* allocate message with maximum possible size */
        len = strlen(action_string) + strlen(devpath) + 2;
        skb = alloc_skb(len + env->buflen, GFP_KERNEL);
        if (!skb)
                return NULL;

        /* add header */
        scratch = skb_put(skb, len);
        sprintf(scratch, "%s@%s", action_string, devpath);

        skb_put_data(skb, env->buf, env->buflen);

        NETLINK_CB(skb).dst_group = 1;

        return skb;
}

>
> My only concern with the above is that we almost certainly need to fix
> the credentials on the skb so that userspace does not drop the packet
> sent to a network namespace because it has the credentials that will
> cause userspace to drop the packet today.
>
> But it should be straight forward to look at net->user_ns, to fix the
> credentials.

Yes, afaict, the only thing that needs to be updated is the uid.

Thanks Eric!
Christian

2018-04-26 17:13:58

by Eric W. Biederman

Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

Christian Brauner <[email protected]> writes:

> On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <[email protected]> writes:
>> >>
>> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
>> >> >
>> >> > Bah. This code is obviously correct and probably wrong.
>> >> >
>> >> > How do we deliver uevents for network devices that are outside of the
>> >> > initial user namespace? The kernel still needs to deliver those.
>> >> >
>> >> > The logic to figure out which network namespace a device needs to be
>> >> > delivered to is present in kobj_bcast_filter. That logic will almost
>> >> > certainly need to be turned inside out. Sigh, not as easy as I would
>> >> > have hoped.
>> >> >
>> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
>> >> > out and come up with a new patch.
>> >>
>> >> I may have mis-understood.
>> >>
>> >> I heard and am still hearing additional filtering to reduce the places
>> >> the packet is delivered.
>> >>
>> >> I am saying something needs to change to increase the number of places
>> >> the packet is delivered.
>> >>
>> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> those need to be delivered to network namespaces that are no longer on
>> >> uevent_sock_list.
>> >>
>> >> So the code fundamentally needs to split into two paths. Ordinary
>> >> devices that use uevent_sock_list. Network devices that are just
>> >> delivered in their own network namespace.
>> >>
>> >> netlink_broadcast_filtered gets to go away completely.
>> >
>> > The split *might* make sense but I think you're wrong about removing the
>> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> > socket in uevent_sock_list itself; it rather operates on the sockets in
>> > mc_list. And if a socket in mc_list can have a different network namespace
>> > than the uevent_socket itself then your way won't work. That's why my
>> > original patch added additional filtering in there. The way I see it we
>> > need something like:
>>
>> We already filter the sockets in the mc_list by network namespace.
>
> Oh really? That's good to know. I haven't found where in the code this
> actually happens. I thought that when netlink_bind() is called anyone
> could register themselves in mc_list.

The code in af_netlink.c does:
> static void do_one_broadcast(struct sock *sk,
> struct netlink_broadcast_data *p)
> {
> struct netlink_sock *nlk = nlk_sk(sk);
> int val;
>
> if (p->exclude_sk == sk)
> return;
>
> if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> !test_bit(p->group - 1, nlk->groups))
> return;
>
> if (!net_eq(sock_net(sk), p->net)) {
^^^^^^^^^^^^ Here
> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> return;
^^^^^^^^^^^ Here
>
> if (!peernet_has_id(sock_net(sk), p->net))
> return;
>
> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> CAP_NET_BROADCAST))
> return;
> }

Which, if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket, filters
you out if you are in the wrong network namespace.


>> When a packet is transmitted with netlink_broadcast it is only
>> transmitted within a single network namespace.
>>
>> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> with its source network namespace so no confusion will result, and the
>> permission checks have been done to make it safe. So you can safely
>> ignore that case. Please ignore that case. It only needs to be
>> considered if refactoring af_netlink.c
>>
>> When I added netlink_broadcast_filtered I imagined that we would need
>> code that worked across network namespaces that worked for different
>> namespaces. So it looked like we would need the level of granularity
>> that you can get with netlink_broadcast_filtered. It turns out we don't
>> and that it was a case of over design. As the only split we care about
>> is per network namespace there is no need for
>> netlink_broadcast_filtered.
>>
>> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>> >
>> > The question that remains is whether we can rely on the network
>> > namespace information we can gather from the kobject_ns_type_operations
>> > to decide where we want to broadcast that event to. So something
>> > *like*:
>>
>> We can. We already do. That is what kobj_bcast_filter implements.
>>
>> > ops = kobj_ns_ops(kobj);
>> > if (!ops && kobj->kset) {
>> > struct kobject *ksobj = &kobj->kset->kobj;
>> > if (ksobj->parent != NULL)
>> > ops = kobj_ns_ops(ksobj->parent);
>> > }
>> >
>> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
>> > if (ops->type == KOBJ_NS_TYPE_NET)
>> > net = kobj->ktype->namespace(kobj);
>>
>> Please note the only entry in the kobj_ns_type enumeration other than
>> KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the
>> check for ops->type in this case is redundant.
>
> Yes, I know the reason for doing it explicitly is to block the case
> where kobjects get tagged with other namespaces. So we'd need to be
> vigilant should that ever happen but fine.

It is fine to keep the check.

I was intending to point out that it is much more likely that we remove
the enumeration and remove some of the extra abstraction than that another
namespace is implemented there.

>> That is something else that could be simplified. At the time it was
>> necessary to get the sysfs changes merged.
>>
>> > if (!net || net->user_ns == &init_user_ns)
>> > ret = init_user_ns_broadcast(env, action_string, devpath);
>> > else
>> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
>> > action_string, devpath);
>>
>> Almost.
>>
>> if (!net)
>> kobject_uevent_net_broadcast(kobj, env, action_string,
>> dev_path);
>> else
>> netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
>>
>>
>> I am handwaving to get the skb in the netlink_broadcast case but that
>> should be enough for you to see what I am thinking.
>
> I have added a helper alloc_uevent_skb() that can be used in both cases.
>
> static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
> const char *action_string,
> const char *devpath)
> {
> struct sk_buff *skb = NULL;
> char *scratch;
> size_t len;
>
> /* allocate message with maximum possible size */
> len = strlen(action_string) + strlen(devpath) + 2;
> skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> if (!skb)
> return NULL;
>
> /* add header */
> scratch = skb_put(skb, len);
> sprintf(scratch, "%s@%s", action_string, devpath);
>
> skb_put_data(skb, env->buf, env->buflen);
>
> NETLINK_CB(skb).dst_group = 1;
>
> return skb;
> }
>
>>
>> My only concern with the above is that we almost certainly need to fix
>> the credentials on the skb so that userspace does not drop the packet
>> sent to a network namespace because it has the credentials that will
>> cause userspace to drop the packet today.
>>
>> But it should be straight forward to look at net->user_ns, to fix the
>> credentials.
>
> Yes, afaict, the only thing that needs to be updated is the uid.

I suspect there may also be a gid.

Eric

2018-04-26 21:30:02

by Christian Brauner

Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <[email protected]> writes:
> >> >>
> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
> >> >> >
> >> >> > Bah. This code is obviously correct and probably wrong.
> >> >> >
> >> >> > How do we deliver uevents for network devices that are outside of the
> >> >> > initial user namespace? The kernel still needs to deliver those.
> >> >> >
> >> >> > The logic to figure out which network namespace a device needs to be
> >> >> > delivered to is present in kobj_bcast_filter. That logic will almost
> >> >> > certainly need to be turned inside out. Sigh, not as easy as I would
> >> >> > have hoped.
> >> >> >
> >> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> >> >> > out and come up with a new patch.
> >> >>
> >> >> I may have mis-understood.
> >> >>
> >> >> I heard and am still hearing additional filtering to reduce the places
> >> >> the packet is delivered.
> >> >>
> >> >> I am saying something needs to change to increase the number of places
> >> >> the packet is delivered.
> >> >>
> >> >> For the special class of devices that kobj_bcast_filter would apply to
> >> >> those need to be delivered to network namespaces that are no longer on
> >> >> uevent_sock_list.
> >> >>
> >> >> So the code fundamentally needs to split into two paths. Ordinary
> >> >> devices that use uevent_sock_list. Network devices that are just
> >> >> delivered in their own network namespace.
> >> >>
> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >
> >> > The split *might* make sense but I think you're wrong about removing the
> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> > socket in uevent_sock_list itself; it rather operates on the sockets in
> >> > mc_list. And if a socket in mc_list can have a different network namespace
> >> > than the uevent_socket itself then your way won't work. That's why my
> >> > original patch added additional filtering in there. The way I see it we
> >> > need something like:
> >>
> >> We already filter the sockets in the mc_list by network namespace.
> >
> > Oh really? That's good to know. I haven't found where in the code this
> > actually happens. I thought that when netlink_bind() is called anyone
> > could register themselves in mc_list.
>
> The code in af_netlink.c does:
> > static void do_one_broadcast(struct sock *sk,
> > struct netlink_broadcast_data *p)
> > {
> > struct netlink_sock *nlk = nlk_sk(sk);
> > int val;
> >
> > if (p->exclude_sk == sk)
> > return;
> >
> > if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> > !test_bit(p->group - 1, nlk->groups))
> > return;
> >
> > if (!net_eq(sock_net(sk), p->net)) {
> ^^^^^^^^^^^^ Here
> > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> > return;
> ^^^^^^^^^^^ Here
> >
> > if (!peernet_has_id(sock_net(sk), p->net))
> > return;
> >
> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> > CAP_NET_BROADCAST))
> > return;
> > }
>
> Which, if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket, filters
> you out if you are in the wrong network namespace.
>
>
> >> When a packet is transmitted with netlink_broadcast it is only
> >> transmitted within a single network namespace.
> >>
> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> with its source network namespace so no confusion will result, and the
> >> permission checks have been done to make it safe. So you can safely
> >> ignore that case. Please ignore that case. It only needs to be
> >> considered if refactoring af_netlink.c
> >>
> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> code that worked across network namespaces that worked for different
> >> namespaces. So it looked like we would need the level of granularity
> >> that you can get with netlink_broadcast_filtered. It turns out we don't
> >> and that it was a case of over design. As the only split we care about
> >> is per network namespace there is no need for
> >> netlink_broadcast_filtered.
> >>
> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >> >
> >> > The question that remains is whether we can rely on the network
> >> > namespace information we can gather from the kobject_ns_type_operations
> >> > to decide where we want to broadcast that event to. So something
> >> > *like*:
> >>
> >> We can. We already do. That is what kobj_bcast_filter implements.
> >>
> >> > ops = kobj_ns_ops(kobj);
> >> > if (!ops && kobj->kset) {
> >> > struct kobject *ksobj = &kobj->kset->kobj;
> >> > if (ksobj->parent != NULL)
> >> > ops = kobj_ns_ops(ksobj->parent);
> >> > }
> >> >
> >> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
> >> > if (ops->type == KOBJ_NS_TYPE_NET)
> >> > net = kobj->ktype->namespace(kobj);
> >>
> >> Please note the only entry in the kobj_ns_type enumeration other than
> >> KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the
> >> check for ops->type in this case is redundant.
> >
> > Yes, I know the reason for doing it explicitly is to block the case
> > where kobjects get tagged with other namespaces. So we'd need to be
> > vigilant should that ever happen but fine.
>
> It is fine to keep the check.
>
> I was intending to point out that it is much more likely that we remove
> the enumeration and remove some of the extra abstraction than that another
> namespace is implemented there.
>
> >> That is something else that could be simplified. At the time it was
> >> necessary to get the sysfs changes merged.
> >>
> >> > if (!net || net->user_ns == &init_user_ns)
> >> > ret = init_user_ns_broadcast(env, action_string, devpath);
> >> > else
> >> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
> >> > action_string, devpath);
> >>
> >> Almost.
> >>
> >> if (!net)
> >> kobject_uevent_net_broadcast(kobj, env, action_string,
> >> dev_path);
> >> else
> >> netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
> >>
> >>
> >> I am handwaving to get the skb in the netlink_broadcast case but that
> >> should be enough for you to see what I am thinking.
> >
> > I have added a helper alloc_uevent_skb() that can be used in both cases.
> >
> > static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
> > const char *action_string,
> > const char *devpath)
> > {
> > struct sk_buff *skb = NULL;
> > char *scratch;
> > size_t len;
> >
> > /* allocate message with maximum possible size */
> > len = strlen(action_string) + strlen(devpath) + 2;
> > skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> > if (!skb)
> > return NULL;
> >
> > /* add header */
> > scratch = skb_put(skb, len);
> > sprintf(scratch, "%s@%s", action_string, devpath);
> >
> > skb_put_data(skb, env->buf, env->buflen);
> >
> > NETLINK_CB(skb).dst_group = 1;
> >
> > return skb;
> > }
> >
> >>
> >> My only concern with the above is that we almost certainly need to fix
> >> the credentials on the skb so that userspace does not drop the packet

I guess we simply want:
if (user_ns != &init_user_ns) {
        NETLINK_CB(skb).creds.uid = (kuid_t)0;
        NETLINK_CB(skb).creds.gid = (kgid_t)0;
}

instead of the more complicated and - imho wrong:

if (user_ns != &init_user_ns) {
        /* fix credentials for udev running in user namespace */
        kuid_t uid = NETLINK_CB(skb).creds.uid;
        kgid_t gid = NETLINK_CB(skb).creds.gid;
        NETLINK_CB(skb).creds.uid = from_kuid_munged(user_ns, uid);
        NETLINK_CB(skb).creds.gid = from_kgid_munged(user_ns, gid);
}

Christian

> >> sent to a network namespace because it has the credentials that will
> >> cause userspace to drop the packet today.
> >>
> >> But it should be straight forward to look at net->user_ns, to fix the
> >> credentials.
> >
> > Yes, afaict, the only thing that needs to be updated is the uid.
>
> I suspect there may also be a gid.
>
> Eric

2018-04-27 00:37:32

by Eric W. Biederman

Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

Christian Brauner <[email protected]> writes:

> On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <[email protected]> writes:
>> >>
>> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner <[email protected]> writes:
>> >> >>
>> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
>> >> >> >
>> >> >> > Bah. This code is obviously correct and probably wrong.
>> >> >> >
>> >> >> > How do we deliver uevents for network devices that are outside of the
>> >> >> > initial user namespace? The kernel still needs to deliver those.
>> >> >> >
>> >> >> > The logic to figure out which network namespace a device needs to be
>> >> >> > delivered to is present in kobj_bcast_filter. That logic will almost
>> >> >> > certainly need to be turned inside out. Sigh, not as easy as I would
>> >> >> > have hoped.
>> >> >> >
>> >> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
>> >> >> > out and come up with a new patch.
>> >> >>
>> >> >> I may have mis-understood.
>> >> >>
>> >> >> I heard and am still hearing additional filtering to reduce the places
>> >> >> the packet is delivered.
>> >> >>
>> >> >> I am saying something needs to change to increase the number of places
>> >> >> the packet is delivered.
>> >> >>
>> >> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> >> those need to be delivered to network namespaces that are no longer on
>> >> >> uevent_sock_list.
>> >> >>
>> >> >> So the code fundamentally needs to split into two paths. Ordinary
>> >> >> devices that use uevent_sock_list. Network devices that are just
>> >> >> delivered in their own network namespace.
>> >> >>
>> >> >> netlink_broadcast_filtered gets to go away completely.
>> >> >
>> >> > The split *might* make sense but I think you're wrong about removing the
>> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> >> > socket in uevent_sock_list itself; it rather operates on the sockets in
>> >> > mc_list. And if a socket in mc_list can have a different network namespace
>> >> > than the uevent_socket itself then your way won't work. That's why my
>> >> > original patch added additional filtering in there. The way I see it we
>> >> > need something like:
>> >>
>> >> We already filter the sockets in the mc_list by network namespace.
>> >
>> > Oh really? That's good to know. I haven't found where in the code this
>> > actually happens. I thought that when netlink_bind() is called anyone
>> > could register themselves in mc_list.
>>
>> The code in af_netlink.c does:
>> > static void do_one_broadcast(struct sock *sk,
>> > struct netlink_broadcast_data *p)
>> > {
>> > struct netlink_sock *nlk = nlk_sk(sk);
>> > int val;
>> >
>> > if (p->exclude_sk == sk)
>> > return;
>> >
>> > if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
>> > !test_bit(p->group - 1, nlk->groups))
>> > return;
>> >
>> > if (!net_eq(sock_net(sk), p->net)) {
>> ^^^^^^^^^^^^ Here
>> > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> > return;
>> ^^^^^^^^^^^ Here
>> >
>> > if (!peernet_has_id(sock_net(sk), p->net))
>> > return;
>> >
>> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> > CAP_NET_BROADCAST))
>> > return;
>> > }
>>
>> Which, if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket, filters
>> you out if you are in the wrong network namespace.
>>
>>
>> >> When a packet is transmitted with netlink_broadcast it is only
>> >> transmitted within a single network namespace.
>> >>
>> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> >> with its source network namespace so no confusion will result, and the
>> >> permission checks have been done to make it safe. So you can safely
>> >> ignore that case. Please ignore that case. It only needs to be
>> >> considered if refactoring af_netlink.c
>> >>
>> >> When I added netlink_broadcast_filtered I imagined that we would need
>> >> code that worked across network namespaces that worked for different
>> >> namespaces. So it looked like we would need the level of granularity
>> >> that you can get with netlink_broadcast_filtered. It turns out we don't
>> >> and that it was a case of over design. As the only split we care about
>> >> is per network namespace there is no need for
>> >> netlink_broadcast_filtered.
>> >>
>> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>> >> >
>> >> > The question that remains is whether we can rely on the network
>> >> > namespace information we can gather from the kobject_ns_type_operations
>> >> > to decide where we want to broadcast that event to. So something
>> >> > *like*:
>> >>
>> >> We can. We already do. That is what kobj_bcast_filter implements.
>> >>
>> >> > ops = kobj_ns_ops(kobj);
>> >> > if (!ops && kobj->kset) {
>> >> > struct kobject *ksobj = &kobj->kset->kobj;
>> >> > if (ksobj->parent != NULL)
>> >> > ops = kobj_ns_ops(ksobj->parent);
>> >> > }
>> >> >
>> >> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
>> >> > if (ops->type == KOBJ_NS_TYPE_NET)
>> >> > net = kobj->ktype->namespace(kobj);
>> >>
>> >> Please note the only entry in the kobj_ns_type enumeration other than
>> >> KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the
>> >> check for ops->type in this case is redundant.
>> >
>> > Yes, I know the reason for doing it explicitly is to block the case
>> > where kobjects get tagged with other namespaces. So we'd need to be
>> > vigilant should that ever happen but fine.
>>
>> It is fine to keep the check.
>>
>> I was intending to point out that it is much more likely that we remove
>> the enumeration and remove some of the extra abstraction than that another
>> namespace is implemented there.
>>
>> >> That is something else that could be simplified. At the time it was
>> >> necessary to get the sysfs changes merged.
>> >>
>> >> > if (!net || net->user_ns == &init_user_ns)
>> >> > ret = init_user_ns_broadcast(env, action_string, devpath);
>> >> > else
>> >> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
>> >> > action_string, devpath);
>> >>
>> >> Almost.
>> >>
>> >> if (!net)
>> >> kobject_uevent_net_broadcast(kobj, env, action_string,
>> >> dev_path);
>> >> else
>> >> netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
>> >>
>> >>
>> >> I am handwaving to get the skb in the netlink_broadcast case but that
>> >> should be enough for you to see what I am thinking.
>> >
>> > I have added a helper alloc_uevent_skb() that can be used in both cases.
>> >
>> > static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
>> > const char *action_string,
>> > const char *devpath)
>> > {
>> > struct sk_buff *skb = NULL;
>> > char *scratch;
>> > size_t len;
>> >
>> > /* allocate message with maximum possible size */
>> > len = strlen(action_string) + strlen(devpath) + 2;
>> > skb = alloc_skb(len + env->buflen, GFP_KERNEL);
>> > if (!skb)
>> > return NULL;
>> >
>> > /* add header */
>> > scratch = skb_put(skb, len);
>> > sprintf(scratch, "%s@%s", action_string, devpath);
>> >
>> > skb_put_data(skb, env->buf, env->buflen);
>> >
>> > NETLINK_CB(skb).dst_group = 1;
>> >
>> > return skb;
>> > }
>> >
>> >>
>> >> My only concern with the above is that we almost certainly need to fix
>> >> the credentials on the skb so that userspace does not drop the packet
>
> I guess we simply want:
> if (user_ns != &init_user_ns) {
> NETLINK_CB(skb).creds.uid = (kuid_t)0;
> NETLINK_CB(skb).creds.gid = (kgid_t)0;
> }

I believe the above is what we already have.

> instead of the more complicated and - imho wrong:
>
> if (user_ns != &init_user_ns) {
> /* fix credentials for udev running in user namespace */
> kuid_t uid = NETLINK_CB(skb).creds.uid;
> kgid_t gid = NETLINK_CB(skb).creds.gid;
> NETLINK_CB(skb).creds.uid = from_kuid_munged(user_ns, uid);
> NETLINK_CB(skb).creds.gid = from_kgid_munged(user_ns, gid);
> }

The above is most definitely wrong as we store kuids and kgids in
"NETLINK_CB(skb).creds".

I am pretty certain what we want is:
kuid_t root_uid = make_kuid(net->user_ns, 0);
kgid_t root_gid = make_kgid(net->user_ns, 0);
if (!uid_valid(root_uid))
        root_uid = GLOBAL_ROOT_UID;
if (!gid_valid(root_gid))
        root_gid = GLOBAL_ROOT_GID;
NETLINK_CB(skb).creds.uid = root_uid;
NETLINK_CB(skb).creds.gid = root_gid;

Let's be careful and only fix this for the networking uevents please.
We want the other ones to just go away.

The networking uevents we have to fix or they will be gone completely.

Eric

2018-04-27 08:43:34

by Christian Brauner

Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents

On Thu, Apr 26, 2018 at 07:35:47PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <[email protected]> writes:
> >> >>
> >> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner <[email protected]> writes:
> >> >> >>
> >> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <[email protected]> wrote:
> >> >> >> >
> >> >> >> > Bah. This code is obviously correct and probably wrong.
> >> >> >> >
> >> >> >> > How do we deliver uevents for network devices that are outside of the
> >> >> >> > initial user namespace? The kernel still needs to deliver those.
> >> >> >> >
> >> >> >> > The logic to figure out which network namespace a device needs to be
> >> >> >> > delivered to is present in kobj_bcast_filter. That logic will almost
> >> >> >> > certainly need to be turned inside out. Sigh, not as easy as I would
> >> >> >> > have hoped.
> >> >> >> >
> >> >> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> >> >> >> > out and come up with a new patch.
> >> >> >>
> >> >> >> I may have mis-understood.
> >> >> >>
> >> >> >> I heard and am still hearing additional filtering to reduce the places
> >> >> >> the packet is delivered.
> >> >> >>
> >> >> >> I am saying something needs to change to increase the number of places
> >> >> >> the packet is delivered.
> >> >> >>
> >> >> >> For the special class of devices that kobj_bcast_filter would apply to
> >> >> >> those need to be delivered to network namespaces that are no longer on
> >> >> >> uevent_sock_list.
> >> >> >>
> >> >> >> So the code fundamentally needs to split into two paths. Ordinary
> >> >> >> devices that use uevent_sock_list. Network devices that are just
> >> >> >> delivered in their own network namespace.
> >> >> >>
> >> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >> >
> >> >> > The split *might* make sense but I think you're wrong about removing the
> >> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> >> > socket in uevent_sock_list itself; it rather operates on the sockets in
> >> >> > mc_list. And if a socket in mc_list can have a different network namespace
> >> >> > than the uevent_socket itself then your way won't work. That's why my
> >> >> > original patch added additional filtering in there. The way I see it we
> >> >> > need something like:
> >> >>
> >> >> We already filter the sockets in the mc_list by network namespace.
> >> >
> >> > Oh really? That's good to know. I haven't found where in the code this
> >> > actually happens. I thought that when netlink_bind() is called anyone
> >> > could register themselves in mc_list.
> >>
> >> The code in af_netlink.c does:
> >> > static void do_one_broadcast(struct sock *sk,
> >> > struct netlink_broadcast_data *p)
> >> > {
> >> > struct netlink_sock *nlk = nlk_sk(sk);
> >> > int val;
> >> >
> >> > if (p->exclude_sk == sk)
> >> > return;
> >> >
> >> > if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> >> > !test_bit(p->group - 1, nlk->groups))
> >> > return;
> >> >
> >> > if (!net_eq(sock_net(sk), p->net)) {
> >> ^^^^^^^^^^^^ Here
> >> > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> > return;
> >> ^^^^^^^^^^^ Here
> >> >
> >> > if (!peernet_has_id(sock_net(sk), p->net))
> >> > return;
> >> >
> >> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> > CAP_NET_BROADCAST))
> >> > return;
> >> > }
> >>
> >> Which, if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket, filters
> >> you out if you are in the wrong network namespace.
> >>
> >>
> >> >> When a packet is transmitted with netlink_broadcast it is only
> >> >> transmitted within a single network namespace.
> >> >>
> >> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> >> with its source network namespace so no confusion will result, and the
> >> >> permission checks have been done to make it safe. So you can safely
> >> >> ignore that case. Please ignore that case. It only needs to be
> >> >> considered if refactoring af_netlink.c
> >> >>
> >> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> >> code that worked across different network namespaces. So it looked
> >> >> like we would need the level of granularity that you can get with
> >> >> netlink_broadcast_filtered. It turns out we don't and that it was a
> >> >> case of over-design. As the only split we care about is per network
> >> >> namespace, there is no need for netlink_broadcast_filtered.
> >> >>
> >> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> >> > user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> >> >
> >> >> > The question that remains is whether we can rely on the network
> >> >> > namespace information we can gather from the kobject_ns_type_operations
> >> >> > to decide where we want to broadcast that event to. So something
> >> >> > *like*:
> >> >>
> >> >> We can. We already do. That is what kobj_bcast_filter implements.
> >> >>
> >> >> > ops = kobj_ns_ops(kobj);
> >> >> > if (!ops && kobj->kset) {
> >> >> > struct kobject *ksobj = &kobj->kset->kobj;
> >> >> > if (ksobj->parent != NULL)
> >> >> > ops = kobj_ns_ops(ksobj->parent);
> >> >> > }
> >> >> >
> >> >> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
> >> >> > if (ops->type == KOBJ_NS_TYPE_NET)
> >> >> > net = kobj->ktype->namespace(kobj);
> >> >>
> >> >> Please note the only entry in the kobj_ns_type enumeration other
> >> >> than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET. So the check for
> >> >> ops->type in this case is redundant.
> >> >
> >> > Yes, I know. The reason for doing it explicitly is to block the case
> >> > where kobjects get tagged with other namespaces. So we'd need to be
> >> > vigilant should that ever happen, but fine.
> >>
> >> It is fine to keep the check.
> >>
> >> I was intending to point out that it is much more likely that we remove
> >> the enumeration and remove some of the extra abstraction, than another
> >> namespace is implemented there.
> >>
> >> >> That is something else that could be simplified. At the time it was
> >> >> necessary to get the sysfs changes merged.
> >> >>
> >> >> > if (!net || net->user_ns == &init_user_ns)
> >> >> > ret = init_user_ns_broadcast(env, action_string, devpath);
> >> >> > else
> >> >> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
> >> >> > action_string, devpath);
> >> >>
> >> >> Almost.
> >> >>
> >> >> if (!net)
> >> >> kobject_uevent_net_broadcast(kobj, env, action_string,
> >> >> dev_path);
> >> >> else
> >> >> netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
> >> >>
> >> >>
> >> >> I am handwaving to get the skb in the netlink_broadcast case but that
> >> >> should be enough for you to see what I am thinking.
> >> >
> >> > I have added a helper alloc_uevent_skb() that can be used in both cases.
> >> >
> >> > static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
> >> > const char *action_string,
> >> > const char *devpath)
> >> > {
> >> > struct sk_buff *skb = NULL;
> >> > char *scratch;
> >> > size_t len;
> >> >
> >> > /* allocate message with maximum possible size */
> >> > len = strlen(action_string) + strlen(devpath) + 2;
> >> > skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> >> > if (!skb)
> >> > return NULL;
> >> >
> >> > /* add header */
> >> > scratch = skb_put(skb, len);
> >> > sprintf(scratch, "%s@%s", action_string, devpath);
> >> >
> >> > skb_put_data(skb, env->buf, env->buflen);
> >> >
> >> > NETLINK_CB(skb).dst_group = 1;
> >> >
> >> > return skb;
> >> > }
> >> >
> >> >>
> >> >> My only concern with the above is that we almost certainly need to fix
> >> >> the credentials on the skb so that userspace does not drop the packet
> >
> > I guess we simply want:
> > if (user_ns != &init_user_ns) {
> > NETLINK_CB(skb).creds.uid = (kuid_t)0;
> > NETLINK_CB(skb).creds.gid = (kgid_t)0;
> > }
>
> I believe the above is what we already have.
>
> > instead of the more complicated and - imho wrong:
> >
> > if (user_ns != &init_user_ns) {
> > /* fix credentials for udev running in user namespace */
> > kuid_t uid = NETLINK_CB(skb).creds.uid;
> > kgid_t gid = NETLINK_CB(skb).creds.gid;
> > NETLINK_CB(skb).creds.uid = from_kuid_munged(user_ns, uid);
> > NETLINK_CB(skb).creds.gid = from_kgid_munged(user_ns, gid);
> > }
>
> The above is most definitely wrong as we store kuids and kgids in
> "NETLINK_CB(skb).creds".
>
> I am pretty certain what we want is:
> kuid_t root_uid = make_kuid(net->user_ns, 0);
> kgid_t root_gid = make_kgid(net->user_ns, 0);

Thanks! I looked at user_namespace.c, which contains map_id_down(); that
is the function I wanted and remembered from a prior patchset of mine,
but it isn't exported. I didn't spot make_k{g,u}id(), which wrap it.
These are the droids^H^H^H^H^H^Hfunctions I was looking for!

> if (!uid_valid(root_uid))
> root_uid = GLOBAL_ROOT_UID;
> if (!gid_valid(root_gid))
> root_gid = GLOBAL_ROOT_GID;
> NETLINK_CB(skb).creds.uid = root_uid;
> NETLINK_CB(skb).creds.gid = root_gid;
>
> Let's be careful and only fix this for the networking uevents please.
> We want the other ones to just go away.

This is already handled by the

if (!net)
handle_untagged_uevents()
else
handle_tagged_uevents()

The else branch will only ever contain network devices since, to my
knowledge, no other kernel devices are currently tagged.
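[Annotation: the make_kuid() fallback Eric sketches above can be modeled
in plain userspace C. The mock_user_ns structure and mock_make_kuid()
are hypothetical simplifications; the kernel's struct uid_gid_map
supports multiple extents, but the map-or-fall-back-to-global-root logic
is the same.]

```c
#include <assert.h>

#define GLOBAL_ROOT_UID 0
#define INVALID_UID (-1)

/* Hypothetical single-extent uid map for a user namespace: ids in
 * [first, first + count) map to [lower, lower + count) in the initial
 * namespace; everything else is unmapped. */
struct mock_user_ns {
	int first; /* first in-namespace id that is mapped */
	int lower; /* global id it maps to */
	int count; /* number of mapped ids; 0 means no mapping */
};

/* Model of make_kuid(): translate an in-namespace uid to a global
 * (kernel) uid, or INVALID_UID when the id is unmapped. */
static int mock_make_kuid(const struct mock_user_ns *ns, int uid)
{
	if (uid < ns->first || uid >= ns->first + ns->count)
		return INVALID_UID;
	return ns->lower + (uid - ns->first);
}

/* The fixup discussed above: use the owning namespace's idea of root
 * for the skb credentials, falling back to global root when uid 0 has
 * no mapping in that namespace. */
static int uevent_root_uid(const struct mock_user_ns *ns)
{
	int root_uid = mock_make_kuid(ns, 0);

	if (root_uid == INVALID_UID)
		root_uid = GLOBAL_ROOT_UID;
	return root_uid;
}
```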

Thanks!
Christian

>
> The networking uevents we have to fix or they will be gone completely.
>
> Eric