This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
to allow sending uevent messages into the network namespace the socket
belongs to.
Currently non-initial network namespaces are already isolated and don't
receive uevents. There are a number of cases where it is beneficial for a
sufficiently privileged userspace process to send a uevent into a network
namespace.
One such use case would be debugging and fuzzing of a piece of software
which listens and reacts to uevents. By running a copy of that software
inside a network namespace, specific uevents could then be presented to it.
More concretely, this would allow for easy testing of udevd/ueventd.
This will also allow some piece of software to run components inside a
separate network namespace and then effectively filter what that software
can receive. Some examples of software that do directly listen to uevents
and that we have in the past attempted to run inside a network namespace
are rbd (CEPH client) or the X server.
Implementation:
The implementation has been kept as simple as possible from the kernel's
perspective. Specifically, a simple input method uevent_net_rcv() is added
to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
af_netlink infrastructure and does neither add an additional netlink family
nor requires any user-visible changes.
For example, by using netlink_rcv_skb() we can make use of existing netlink
infrastructure to report back informative error messages to userspace.
Furthermore, this implementation does not introduce any overhead for
existing uevent generating codepaths. The struct netns gets a new uevent
socket member that records the uevent socket associated with that network
namespace. Since we record the uevent socket for each network namespace in
struct net we don't have to walk the whole uevent socket list.
Instead we can directly retrieve the relevant uevent socket and send the
message. This keeps the codepath very performant without introducing
needless overhead.
Uevent sequence numbers are kept global. When a uevent message is sent to
another network namespace the implementation will simply increment the
global uevent sequence number and append it to the received uevent. This
has the advantage that the kernel will never need to parse the received
uevent message to replace any existing uevent sequence numbers. Instead it
is up to the userspace process to remove any existing uevent sequence
numbers in case the uevent message to be sent contains any.
Security:
In order for a caller to send uevent messages to a target network namespace
the caller must have CAP_SYS_ADMIN in the owning user namespace of the
target network namespace. Additionally, any received uevent message is
verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
needed to append the uevent sequence number.
Testing:
This patch has been tested and verified to work with the following udev
implementations:
1. CentOS 6 with udevd version 147
2. Debian Sid with systemd-udevd version 237
3. Android 7.1.1 with ueventd
Signed-off-by: Christian Brauner <[email protected]>
---
include/net/net_namespace.h | 1 +
lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f306b2aa15a4..467bde763a9b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -78,6 +78,7 @@ struct net {
struct sock *rtnl; /* rtnetlink socket */
struct sock *genl_sock;
+ struct sock *uevent_sock; /* uevent socket */
struct list_head dev_base_head;
struct hlist_head *dev_name_head;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 9fe6ec8fda28..10b2144b9fc3 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -25,6 +25,7 @@
#include <linux/uuid.h>
#include <linux/ctype.h>
#include <net/sock.h>
+#include <net/netlink.h>
#include <net/net_namespace.h>
@@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
EXPORT_SYMBOL_GPL(add_uevent_var);
#if defined(CONFIG_NET)
+static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
+ struct netlink_ext_ack *extack)
+{
+ int ret;
+ u64 seqnum;
+ /* u64 to chars: 2^64 - 1 = 21 chars */
+ char buf[sizeof("SEQNUM=") + 21];
+ struct sk_buff *skbc;
+
+ /* bump sequence number */
+ mutex_lock(&uevent_sock_mutex);
+ seqnum = ++uevent_seqnum;
+ mutex_unlock(&uevent_sock_mutex);
+
+ /* prepare sequence number */
+ ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", seqnum);
+ if (ret < 0 || (size_t)ret >= sizeof(buf))
+ return -ENOMEM;
+ ret++;
+
+ /* verify message does not overflow */
+ if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
+ NL_SET_ERR_MSG(extack, "uevent message too big");
+ return -EINVAL;
+ }
+
+ /* copy skb and extend to accommodate sequence number */
+ skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
+ if (!skbc)
+ return -ENOMEM;
+
+ /* append sequence number */
+ skb_put_data(skbc, buf, ret);
+
+ /* remove msg header */
+ skb_pull(skbc, NLMSG_HDRLEN);
+
+ /* set portid 0 to inform userspace message comes from kernel */
+ NETLINK_CB(skbc).portid = 0;
+ NETLINK_CB(skbc).dst_group = 1;
+
+ ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
+ /* ENOBUFS should be handled in userspace */
+ if (ret == -ENOBUFS || ret == -ESRCH)
+ ret = 0;
+
+ return ret;
+}
+
+static int uevent_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack)
+{
+ void *msg;
+ struct net *target_net;
+ struct sock *target_sk;
+
+ msg = nlmsg_data(nlh);
+ if (!msg)
+ return -EINVAL;
+
+ target_net = sock_net(NETLINK_CB(skb).sk);
+ target_sk = target_net->uevent_sock;
+
+ /*
+ * Verify that we are allowed to send messages to the target network
+ * namespace. The caller must have CAP_SYS_ADMIN in the owning user
+ * namespace of the target network namespace.
+ */
+ if (!netlink_ns_capable(skb, target_net->user_ns, CAP_SYS_ADMIN)) {
+ NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
+ return -EPERM;
+ }
+
+ return uevent_net_send(target_sk, skb, extack);
+}
+
+static void uevent_net_rcv(struct sk_buff *skb)
+{
+ netlink_rcv_skb(skb, &uevent_rcv_msg);
+}
+
static int uevent_net_init(struct net *net)
{
struct uevent_sock *ue_sk;
struct netlink_kernel_cfg cfg = {
.groups = 1,
- .flags = NL_CFG_F_NONROOT_RECV,
+ .input = uevent_net_rcv,
+ .flags = NL_CFG_F_NONROOT_RECV
};
ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
@@ -621,6 +704,9 @@ static int uevent_net_init(struct net *net)
kfree(ue_sk);
return -ENODEV;
}
+
+ net->uevent_sock = ue_sk->sk;
+
mutex_lock(&uevent_sock_mutex);
list_add_tail(&ue_sk->list, &uevent_sock_list);
mutex_unlock(&uevent_sock_mutex);
--
2.15.1
CC Andrey Vagin
On 15.03.2018 03:12, Christian Brauner wrote:
> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> to allow sending uevent messages into the network namespace the socket
> belongs to.
>
> Currently non-initial network namespaces are already isolated and don't
> receive uevents. There are a number of cases where it is beneficial for a
> sufficiently privileged userspace process to send a uevent into a network
> namespace.
>
> One such use case would be debugging and fuzzing of a piece of software
> which listens and reacts to uevents. By running a copy of that software
> inside a network namespace, specific uevents could then be presented to it.
> More concretely, this would allow for easy testing of udevd/ueventd.
>
> This will also allow some piece of software to run components inside a
> separate network namespace and then effectively filter what that software
> can receive. Some examples of software that do directly listen to uevents
> and that we have in the past attempted to run inside a network namespace
> are rbd (CEPH client) or the X server.
>
> Implementation:
> The implementation has been kept as simple as possible from the kernel's
> perspective. Specifically, a simple input method uevent_net_rcv() is added
> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
> af_netlink infrastructure and does neither add an additional netlink family
> nor requires any user-visible changes.
>
> For example, by using netlink_rcv_skb() we can make use of existing netlink
> infrastructure to report back informative error messages to userspace.
>
> Furthermore, this implementation does not introduce any overhead for
> existing uevent generating codepaths. The struct netns gets a new uevent
> socket member that records the uevent socket associated with that network
> namespace. Since we record the uevent socket for each network namespace in
> struct net we don't have to walk the whole uevent socket list.
> Instead we can directly retrieve the relevant uevent socket and send the
> message. This keeps the codepath very performant without introducing
> needless overhead.
>
> Uevent sequence numbers are kept global. When a uevent message is sent to
> another network namespace the implementation will simply increment the
> global uevent sequence number and append it to the received uevent. This
> has the advantage that the kernel will never need to parse the received
> uevent message to replace any existing uevent sequence numbers. Instead it
> is up to the userspace process to remove any existing uevent sequence
> numbers in case the uevent message to be sent contains any.
>
> Security:
> In order for a caller to send uevent messages to a target network namespace
> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
> target network namespace. Additionally, any received uevent message is
> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
> needed to append the uevent sequence number.
>
> Testing:
> This patch has been tested and verified to work with the following udev
> implementations:
> 1. CentOS 6 with udevd version 147
> 2. Debian Sid with systemd-udevd version 237
> 3. Android 7.1.1 with ueventd
>
> Signed-off-by: Christian Brauner <[email protected]>
> ---
> include/net/net_namespace.h | 1 +
> lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index f306b2aa15a4..467bde763a9b 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -78,6 +78,7 @@ struct net {
>
> struct sock *rtnl; /* rtnetlink socket */
> struct sock *genl_sock;
> + struct sock *uevent_sock; /* uevent socket */
Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
become to look confusing. There are:
mutex_lock(&uevent_sock_mutex);
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
if (sock_net(ue_sk->sk) == net)
goto found;
}
Can't we make a small cleanup in lib/kobject_uevent.c after this change
and before the main part of the patch goes?
>
> struct list_head dev_base_head;
> struct hlist_head *dev_name_head;
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 9fe6ec8fda28..10b2144b9fc3 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -25,6 +25,7 @@
> #include <linux/uuid.h>
> #include <linux/ctype.h>
> #include <net/sock.h>
> +#include <net/netlink.h>
> #include <net/net_namespace.h>
>
>
> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> EXPORT_SYMBOL_GPL(add_uevent_var);
>
> #if defined(CONFIG_NET)
> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
> + struct netlink_ext_ack *extack)
> +{
> + int ret;
> + u64 seqnum;
> + /* u64 to chars: 2^64 - 1 = 21 chars */
18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
we forgot +1 in buf declaration.
> + char buf[sizeof("SEQNUM=") + 21];
> + struct sk_buff *skbc;
> +
> + /* bump sequence number */
> + mutex_lock(&uevent_sock_mutex);
> + seqnum = ++uevent_seqnum;
> + mutex_unlock(&uevent_sock_mutex);
Commit 7b60a18da393 from Andrew Vagin says:
uevent: send events in correct order according to seqnum (v3)
The queue handling in the udev daemon assumes that the events are
ordered.
Before this patch uevent_seqnum is incremented under sequence_lock,
than an event is send uner uevent_sock_mutex. I want to say that code
contained a window between incrementing seqnum and sending an event.
This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
Signed-off-by: Andrew Vagin <[email protected]>
After this change the order will be lost. Also the rest of namespaces
will have holes in uevent numbers, as they won't receive a number sent
to specific namespace.
It seems we should made uevent_seqnum per-net to solve this problem.
> + /* prepare sequence number */
> + ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", seqnum);
> + if (ret < 0 || (size_t)ret >= sizeof(buf))
> + return -ENOMEM;> + ret++;
> +
> + /* verify message does not overflow */
> + if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
> + NL_SET_ERR_MSG(extack, "uevent message too big");
> + return -EINVAL;
> + }
> +
> + /* copy skb and extend to accommodate sequence number */
> + skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
> + if (!skbc)
> + return -ENOMEM;
> +
> + /* append sequence number */
> + skb_put_data(skbc, buf, ret);
> +
> + /* remove msg header */
> + skb_pull(skbc, NLMSG_HDRLEN);
> +
> + /* set portid 0 to inform userspace message comes from kernel */
> + NETLINK_CB(skbc).portid = 0;
> + NETLINK_CB(skbc).dst_group = 1;
> +
> + ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
> + /* ENOBUFS should be handled in userspace */
> + if (ret == -ENOBUFS || ret == -ESRCH)
> + ret = 0;
> +
> + return ret;
> +}
> +
> +static int uevent_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
> + struct netlink_ext_ack *extack)
> +{
> + void *msg;
> + struct net *target_net;
> + struct sock *target_sk;
> +
> + msg = nlmsg_data(nlh);
> + if (!msg)
> + return -EINVAL;
> +
> + target_net = sock_net(NETLINK_CB(skb).sk);
> + target_sk = target_net->uevent_sock;
> +
> + /*
> + * Verify that we are allowed to send messages to the target network
> + * namespace. The caller must have CAP_SYS_ADMIN in the owning user
> + * namespace of the target network namespace.
> + */
> + if (!netlink_ns_capable(skb, target_net->user_ns, CAP_SYS_ADMIN)) {
> + NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
> + return -EPERM;
> + }
> +
> + return uevent_net_send(target_sk, skb, extack);
> +}
> +
> +static void uevent_net_rcv(struct sk_buff *skb)
> +{
> + netlink_rcv_skb(skb, &uevent_rcv_msg);
> +}
> +
> static int uevent_net_init(struct net *net)
> {
> struct uevent_sock *ue_sk;
> struct netlink_kernel_cfg cfg = {
> .groups = 1,
> - .flags = NL_CFG_F_NONROOT_RECV,
> + .input = uevent_net_rcv,
> + .flags = NL_CFG_F_NONROOT_RECV
> };
>
> ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
> @@ -621,6 +704,9 @@ static int uevent_net_init(struct net *net)
> kfree(ue_sk);
> return -ENODEV;
> }
> +
> + net->uevent_sock = ue_sk->sk;
> +
> mutex_lock(&uevent_sock_mutex);
> list_add_tail(&ue_sk->list, &uevent_sock_list);
> mutex_unlock(&uevent_sock_mutex);
Kirill
On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
> CC Andrey Vagin
Hey Kirill,
Thanks for CCing Andrey.
>
> On 15.03.2018 03:12, Christian Brauner wrote:
> > This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> > to allow sending uevent messages into the network namespace the socket
> > belongs to.
> >
> > Currently non-initial network namespaces are already isolated and don't
> > receive uevents. There are a number of cases where it is beneficial for a
> > sufficiently privileged userspace process to send a uevent into a network
> > namespace.
> >
> > One such use case would be debugging and fuzzing of a piece of software
> > which listens and reacts to uevents. By running a copy of that software
> > inside a network namespace, specific uevents could then be presented to it.
> > More concretely, this would allow for easy testing of udevd/ueventd.
> >
> > This will also allow some piece of software to run components inside a
> > separate network namespace and then effectively filter what that software
> > can receive. Some examples of software that do directly listen to uevents
> > and that we have in the past attempted to run inside a network namespace
> > are rbd (CEPH client) or the X server.
> >
> > Implementation:
> > The implementation has been kept as simple as possible from the kernel's
> > perspective. Specifically, a simple input method uevent_net_rcv() is added
> > to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
> > af_netlink infrastructure and does neither add an additional netlink family
> > nor requires any user-visible changes.
> >
> > For example, by using netlink_rcv_skb() we can make use of existing netlink
> > infrastructure to report back informative error messages to userspace.
> >
> > Furthermore, this implementation does not introduce any overhead for
> > existing uevent generating codepaths. The struct netns gets a new uevent
> > socket member that records the uevent socket associated with that network
> > namespace. Since we record the uevent socket for each network namespace in
> > struct net we don't have to walk the whole uevent socket list.
> > Instead we can directly retrieve the relevant uevent socket and send the
> > message. This keeps the codepath very performant without introducing
> > needless overhead.
> >
> > Uevent sequence numbers are kept global. When a uevent message is sent to
> > another network namespace the implementation will simply increment the
> > global uevent sequence number and append it to the received uevent. This
> > has the advantage that the kernel will never need to parse the received
> > uevent message to replace any existing uevent sequence numbers. Instead it
> > is up to the userspace process to remove any existing uevent sequence
> > numbers in case the uevent message to be sent contains any.
> >
> > Security:
> > In order for a caller to send uevent messages to a target network namespace
> > the caller must have CAP_SYS_ADMIN in the owning user namespace of the
> > target network namespace. Additionally, any received uevent message is
> > verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
> > needed to append the uevent sequence number.
> >
> > Testing:
> > This patch has been tested and verified to work with the following udev
> > implementations:
> > 1. CentOS 6 with udevd version 147
> > 2. Debian Sid with systemd-udevd version 237
> > 3. Android 7.1.1 with ueventd
> >
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
> > include/net/net_namespace.h | 1 +
> > lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 88 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index f306b2aa15a4..467bde763a9b 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -78,6 +78,7 @@ struct net {
> >
> > struct sock *rtnl; /* rtnetlink socket */
> > struct sock *genl_sock;
> > + struct sock *uevent_sock; /* uevent socket */
>
> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
> become to look confusing. There are:
>
> mutex_lock(&uevent_sock_mutex);
> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> if (sock_net(ue_sk->sk) == net)
> goto found;
> }
>
> Can't we make a small cleanup in lib/kobject_uevent.c after this change
> and before the main part of the patch goes?
Hm, not sure. It seems it makes sense to maintain them in a separate
list. Looks like this lets us keep the locking simpler. Otherwise we
would have to do something like for_each_net() and it seems that this
would force us to use rntl_{un}lock().
>
> >
> > struct list_head dev_base_head;
> > struct hlist_head *dev_name_head;
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 9fe6ec8fda28..10b2144b9fc3 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -25,6 +25,7 @@
> > #include <linux/uuid.h>
> > #include <linux/ctype.h>
> > #include <net/sock.h>
> > +#include <net/netlink.h>
> > #include <net/net_namespace.h>
> >
> >
> > @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> > EXPORT_SYMBOL_GPL(add_uevent_var);
> >
> > #if defined(CONFIG_NET)
> > +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
> > + struct netlink_ext_ack *extack)
> > +{
> > + int ret;
> > + u64 seqnum;
> > + /* u64 to chars: 2^64 - 1 = 21 chars */
>
> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
> we forgot +1 in buf declaration.
sizeof("SEQNUM=") will include the '\0' pointer in contrast to
strlen("SEQNUM=") so this is correct if I'm not completely mistaken.
>
> > + char buf[sizeof("SEQNUM=") + 21];
> > + struct sk_buff *skbc;
> > +
> > + /* bump sequence number */
> > + mutex_lock(&uevent_sock_mutex);
> > + seqnum = ++uevent_seqnum;
> > + mutex_unlock(&uevent_sock_mutex);
>
> Commit 7b60a18da393 from Andrew Vagin says:
>
> uevent: send events in correct order according to seqnum (v3)
>
> The queue handling in the udev daemon assumes that the events are
> ordered.
>
> Before this patch uevent_seqnum is incremented under sequence_lock,
> than an event is send uner uevent_sock_mutex. I want to say that code
> contained a window between incrementing seqnum and sending an event.
>
> This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
> Signed-off-by: Andrew Vagin <[email protected]>
>
> After this change the order will be lost. Also the rest of namespaces
> will have holes in uevent numbers, as they won't receive a number sent
> to specific namespace.
Afaict from looking into udevd when I wrote the patch it only cares
about numbers being ordered (which is also what Andrey's patch states)
not that they are sequential so holes should be fine. udevd will use
the DEVPATH to determine whether the sequence number of the current
uevent should be used as "even->delaying_seqnum" number. All that
matters is that it is greater than the previous one. About the ordering,
if that's an issue then we should simply do what Andrey has been doing
for kobject_uevent_env() and extend the lock being held until after we
sent out the uevent. Since we're not serving all listeners but only
ever one this is also way more lightweight then kobject_uevent_env().
>
> It seems we should made uevent_seqnum per-net to solve this problem.
Yes, Eric and I have been discussing this already. The idea was to do
this in a follow-up patch to keep this patch as simple as possible. If
we agree that it would make sense right away then I will dig out the
corresponding patch.
It would basically just involve giving each struct net a uevent_seqnum
member.
Christian
>
> > + /* prepare sequence number */
> > + ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", seqnum);
> > + if (ret < 0 || (size_t)ret >= sizeof(buf))
> > + return -ENOMEM;> + ret++;
> > +
> > + /* verify message does not overflow */
> > + if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
> > + NL_SET_ERR_MSG(extack, "uevent message too big");
> > + return -EINVAL;
> > + }
> > +
> > + /* copy skb and extend to accommodate sequence number */
> > + skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
> > + if (!skbc)
> > + return -ENOMEM;
> > +
> > + /* append sequence number */
> > + skb_put_data(skbc, buf, ret);
> > +
> > + /* remove msg header */
> > + skb_pull(skbc, NLMSG_HDRLEN);
> > +
> > + /* set portid 0 to inform userspace message comes from kernel */
> > + NETLINK_CB(skbc).portid = 0;
> > + NETLINK_CB(skbc).dst_group = 1;
> > +
> > + ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
> > + /* ENOBUFS should be handled in userspace */
> > + if (ret == -ENOBUFS || ret == -ESRCH)
> > + ret = 0;
> > +
> > + return ret;
> > +}
> > +
> > +static int uevent_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
> > + struct netlink_ext_ack *extack)
> > +{
> > + void *msg;
> > + struct net *target_net;
> > + struct sock *target_sk;
> > +
> > + msg = nlmsg_data(nlh);
> > + if (!msg)
> > + return -EINVAL;
> > +
> > + target_net = sock_net(NETLINK_CB(skb).sk);
> > + target_sk = target_net->uevent_sock;
> > +
> > + /*
> > + * Verify that we are allowed to send messages to the target network
> > + * namespace. The caller must have CAP_SYS_ADMIN in the owning user
> > + * namespace of the target network namespace.
> > + */
> > + if (!netlink_ns_capable(skb, target_net->user_ns, CAP_SYS_ADMIN)) {
> > + NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
> > + return -EPERM;
> > + }
> > +
> > + return uevent_net_send(target_sk, skb, extack);
> > +}
> > +
> > +static void uevent_net_rcv(struct sk_buff *skb)
> > +{
> > + netlink_rcv_skb(skb, &uevent_rcv_msg);
> > +}
> > +
> > static int uevent_net_init(struct net *net)
> > {
> > struct uevent_sock *ue_sk;
> > struct netlink_kernel_cfg cfg = {
> > .groups = 1,
> > - .flags = NL_CFG_F_NONROOT_RECV,
> > + .input = uevent_net_rcv,
> > + .flags = NL_CFG_F_NONROOT_RECV
> > };
> >
> > ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
> > @@ -621,6 +704,9 @@ static int uevent_net_init(struct net *net)
> > kfree(ue_sk);
> > return -ENODEV;
> > }
> > +
> > + net->uevent_sock = ue_sk->sk;
> > +
> > mutex_lock(&uevent_sock_mutex);
> > list_add_tail(&ue_sk->list, &uevent_sock_list);
> > mutex_unlock(&uevent_sock_mutex);
>
> Kirill
On 15.03.2018 16:39, Christian Brauner wrote:
> On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
>> CC Andrey Vagin
>
> Hey Kirill,
>
> Thanks for CCing Andrey.
>
>>
>> On 15.03.2018 03:12, Christian Brauner wrote:
>>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
>>> to allow sending uevent messages into the network namespace the socket
>>> belongs to.
>>>
>>> Currently non-initial network namespaces are already isolated and don't
>>> receive uevents. There are a number of cases where it is beneficial for a
>>> sufficiently privileged userspace process to send a uevent into a network
>>> namespace.
>>>
>>> One such use case would be debugging and fuzzing of a piece of software
>>> which listens and reacts to uevents. By running a copy of that software
>>> inside a network namespace, specific uevents could then be presented to it.
>>> More concretely, this would allow for easy testing of udevd/ueventd.
>>>
>>> This will also allow some piece of software to run components inside a
>>> separate network namespace and then effectively filter what that software
>>> can receive. Some examples of software that do directly listen to uevents
>>> and that we have in the past attempted to run inside a network namespace
>>> are rbd (CEPH client) or the X server.
>>>
>>> Implementation:
>>> The implementation has been kept as simple as possible from the kernel's
>>> perspective. Specifically, a simple input method uevent_net_rcv() is added
>>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
>>> af_netlink infrastructure and does neither add an additional netlink family
>>> nor requires any user-visible changes.
>>>
>>> For example, by using netlink_rcv_skb() we can make use of existing netlink
>>> infrastructure to report back informative error messages to userspace.
>>>
>>> Furthermore, this implementation does not introduce any overhead for
>>> existing uevent generating codepaths. The struct netns gets a new uevent
>>> socket member that records the uevent socket associated with that network
>>> namespace. Since we record the uevent socket for each network namespace in
>>> struct net we don't have to walk the whole uevent socket list.
>>> Instead we can directly retrieve the relevant uevent socket and send the
>>> message. This keeps the codepath very performant without introducing
>>> needless overhead.
>>>
>>> Uevent sequence numbers are kept global. When a uevent message is sent to
>>> another network namespace the implementation will simply increment the
>>> global uevent sequence number and append it to the received uevent. This
>>> has the advantage that the kernel will never need to parse the received
>>> uevent message to replace any existing uevent sequence numbers. Instead it
>>> is up to the userspace process to remove any existing uevent sequence
>>> numbers in case the uevent message to be sent contains any.
>>>
>>> Security:
>>> In order for a caller to send uevent messages to a target network namespace
>>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
>>> target network namespace. Additionally, any received uevent message is
>>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
>>> needed to append the uevent sequence number.
>>>
>>> Testing:
>>> This patch has been tested and verified to work with the following udev
>>> implementations:
>>> 1. CentOS 6 with udevd version 147
>>> 2. Debian Sid with systemd-udevd version 237
>>> 3. Android 7.1.1 with ueventd
>>>
>>> Signed-off-by: Christian Brauner <[email protected]>
>>> ---
>>> include/net/net_namespace.h | 1 +
>>> lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 88 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>>> index f306b2aa15a4..467bde763a9b 100644
>>> --- a/include/net/net_namespace.h
>>> +++ b/include/net/net_namespace.h
>>> @@ -78,6 +78,7 @@ struct net {
>>>
>>> struct sock *rtnl; /* rtnetlink socket */
>>> struct sock *genl_sock;
>>> + struct sock *uevent_sock; /* uevent socket */
>>
>> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
>> become to look confusing. There are:
>>
>> mutex_lock(&uevent_sock_mutex);
>> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>> if (sock_net(ue_sk->sk) == net)
>> goto found;
>> }
>>
>> Can't we make a small cleanup in lib/kobject_uevent.c after this change
>> and before the main part of the patch goes?
>
> Hm, not sure. It seems it makes sense to maintain them in a separate
> list. Looks like this lets us keep the locking simpler. Otherwise we
> would have to do something like for_each_net() and it seems that this
> would force us to use rntl_{un}lock().
I'm about:
mutex_lock();
list_del(net->ue_sk->list);
mutex_unlock();
kfree();
Thus we avoid iterations in uevent_net_exit().
>>
>>>
>>> struct list_head dev_base_head;
>>> struct hlist_head *dev_name_head;
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 9fe6ec8fda28..10b2144b9fc3 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -25,6 +25,7 @@
>>> #include <linux/uuid.h>
>>> #include <linux/ctype.h>
>>> #include <net/sock.h>
>>> +#include <net/netlink.h>
>>> #include <net/net_namespace.h>
>>>
>>>
>>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>>> EXPORT_SYMBOL_GPL(add_uevent_var);
>>>
>>> #if defined(CONFIG_NET)
>>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
>>> + struct netlink_ext_ack *extack)
>>> +{
>>> + int ret;
>>> + u64 seqnum;
>>> + /* u64 to chars: 2^64 - 1 = 21 chars */
>>
>> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
>> we forgot +1 in buf declaration.
>
> sizeof("SEQNUM=") will include the '\0' pointer in contrast to
> strlen("SEQNUM=") so this is correct if I'm not completely mistaken.
The code is OK, I'm worrying about comment. But I've missed this sizeof().
So, there is only 1 bytes excess allocated as 0xFFFFFFFFFFFFFFFF=18446744073709551615
Not so important...
>>
>>> + char buf[sizeof("SEQNUM=") + 21];
>>> + struct sk_buff *skbc;
>>> +
>>> + /* bump sequence number */
>>> + mutex_lock(&uevent_sock_mutex);
>>> + seqnum = ++uevent_seqnum;
>>> + mutex_unlock(&uevent_sock_mutex);
>>
>> Commit 7b60a18da393 from Andrew Vagin says:
>>
>> uevent: send events in correct order according to seqnum (v3)
>>
>> The queue handling in the udev daemon assumes that the events are
>> ordered.
>>
>> Before this patch uevent_seqnum is incremented under sequence_lock,
>> than an event is send uner uevent_sock_mutex. I want to say that code
>> contained a window between incrementing seqnum and sending an event.
>>
>> This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
>> Signed-off-by: Andrew Vagin <[email protected]>
>>
>> After this change the order will be lost. Also the rest of namespaces
>> will have holes in uevent numbers, as they won't receive a number sent
>> to specific namespace.
>
> Afaict from looking into udevd when I wrote the patch it only cares
> about numbers being ordered (which is also what Andrey's patch states)
> not that they are sequential so holes should be fine. udevd will use
> the DEVPATH to determine whether the sequence number of the current
> uevent should be used as "even->delaying_seqnum" number. All that
> matters is that it is greater than the previous one. About the ordering,
> if that's an issue then we should simply do what Andrey has been doing
> for kobject_uevent_env() and extend the lock being held until after we
> sent out the uevent. Since we're not serving all listeners but only
> ever one this is also way more lightweight then kobject_uevent_env().
Yes, extending the lock to netlink_broadcast() should fix the problem.
>>
>> It seems we should made uevent_seqnum per-net to solve this problem.
>
> Yes, Eric and I have been discussing this already. The idea was to do
> this in a follow-up patch to keep this patch as simple as possible. If
> we agree that it would make sense right away then I will dig out the
> corresponding patch.
> It would basically just involve giving each struct net a uevent_seqnum
> member.
pernet seqnum may have a sense if we introduce per-net locks to protect it.
If there is single mutex, it does not matter either they are pernet or not.
Per-net mutex may be useful only if we have many single-net messages like
you introduce in this patch, or if we are going to filter net in existing
kobject_uevent_net_broadcast() (by user_ns?!) in the future.
Kirill
On Thu, Mar 15, 2018 at 05:14:13PM +0300, Kirill Tkhai wrote:
> On 15.03.2018 16:39, Christian Brauner wrote:
> > On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
> >> CC Andrey Vagin
> >
> > Hey Kirill,
> >
> > Thanks for CCing Andrey.
> >
> >>
> >> On 15.03.2018 03:12, Christian Brauner wrote:
> >>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> >>> to allow sending uevent messages into the network namespace the socket
> >>> belongs to.
> >>>
> >>> Currently non-initial network namespaces are already isolated and don't
> >>> receive uevents. There are a number of cases where it is beneficial for a
> >>> sufficiently privileged userspace process to send a uevent into a network
> >>> namespace.
> >>>
> >>> One such use case would be debugging and fuzzing of a piece of software
> >>> which listens and reacts to uevents. By running a copy of that software
> >>> inside a network namespace, specific uevents could then be presented to it.
> >>> More concretely, this would allow for easy testing of udevd/ueventd.
> >>>
> >>> This will also allow some piece of software to run components inside a
> >>> separate network namespace and then effectively filter what that software
> >>> can receive. Some examples of software that do directly listen to uevents
> >>> and that we have in the past attempted to run inside a network namespace
> >>> are rbd (CEPH client) or the X server.
> >>>
> >>> Implementation:
> >>> The implementation has been kept as simple as possible from the kernel's
> >>> perspective. Specifically, a simple input method uevent_net_rcv() is added
> >>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
> >>> af_netlink infrastructure and does neither add an additional netlink family
> >>> nor requires any user-visible changes.
> >>>
> >>> For example, by using netlink_rcv_skb() we can make use of existing netlink
> >>> infrastructure to report back informative error messages to userspace.
> >>>
> >>> Furthermore, this implementation does not introduce any overhead for
> >>> existing uevent generating codepaths. The struct netns gets a new uevent
> >>> socket member that records the uevent socket associated with that network
> >>> namespace. Since we record the uevent socket for each network namespace in
> >>> struct net we don't have to walk the whole uevent socket list.
> >>> Instead we can directly retrieve the relevant uevent socket and send the
> >>> message. This keeps the codepath very performant without introducing
> >>> needless overhead.
> >>>
> >>> Uevent sequence numbers are kept global. When a uevent message is sent to
> >>> another network namespace the implementation will simply increment the
> >>> global uevent sequence number and append it to the received uevent. This
> >>> has the advantage that the kernel will never need to parse the received
> >>> uevent message to replace any existing uevent sequence numbers. Instead it
> >>> is up to the userspace process to remove any existing uevent sequence
> >>> numbers in case the uevent message to be sent contains any.
> >>>
> >>> Security:
> >>> In order for a caller to send uevent messages to a target network namespace
> >>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
> >>> target network namespace. Additionally, any received uevent message is
> >>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
> >>> needed to append the uevent sequence number.
> >>>
> >>> Testing:
> >>> This patch has been tested and verified to work with the following udev
> >>> implementations:
> >>> 1. CentOS 6 with udevd version 147
> >>> 2. Debian Sid with systemd-udevd version 237
> >>> 3. Android 7.1.1 with ueventd
> >>>
> >>> Signed-off-by: Christian Brauner <[email protected]>
> >>> ---
> >>> include/net/net_namespace.h | 1 +
> >>> lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
> >>> 2 files changed, 88 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> >>> index f306b2aa15a4..467bde763a9b 100644
> >>> --- a/include/net/net_namespace.h
> >>> +++ b/include/net/net_namespace.h
> >>> @@ -78,6 +78,7 @@ struct net {
> >>>
> >>> struct sock *rtnl; /* rtnetlink socket */
> >>> struct sock *genl_sock;
> >>> + struct sock *uevent_sock; /* uevent socket */
> >>
> >> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
> >> become to look confusing. There are:
> >>
> >> mutex_lock(&uevent_sock_mutex);
> >> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> >> if (sock_net(ue_sk->sk) == net)
> >> goto found;
> >> }
> >>
> >> Can't we make a small cleanup in lib/kobject_uevent.c after this change
> >> and before the main part of the patch goes?
> >
> > Hm, not sure. It seems it makes sense to maintain them in a separate
> > list. Looks like this lets us keep the locking simpler. Otherwise we
> > would have to do something like for_each_net() and it seems that this
> > would force us to use rntl_{un}lock().
>
> I'm about:
>
> mutex_lock();
> list_del(net->ue_sk->list);
> mutex_unlock();
> kfree();
>
> Thus we avoid iterations in uevent_net_exit().
Ah right, but I only added struct sock *uevent_sock to struct net, not
struct uevent_sock. Thus there's no list member in there. I mainly only
added struct sock to keep it clean an aligned with the other sock
member. We can revisit this later though.
>
> >>
> >>>
> >>> struct list_head dev_base_head;
> >>> struct hlist_head *dev_name_head;
> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >>> index 9fe6ec8fda28..10b2144b9fc3 100644
> >>> --- a/lib/kobject_uevent.c
> >>> +++ b/lib/kobject_uevent.c
> >>> @@ -25,6 +25,7 @@
> >>> #include <linux/uuid.h>
> >>> #include <linux/ctype.h>
> >>> #include <net/sock.h>
> >>> +#include <net/netlink.h>
> >>> #include <net/net_namespace.h>
> >>>
> >>>
> >>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> >>> EXPORT_SYMBOL_GPL(add_uevent_var);
> >>>
> >>> #if defined(CONFIG_NET)
> >>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
> >>> + struct netlink_ext_ack *extack)
> >>> +{
> >>> + int ret;
> >>> + u64 seqnum;
> >>> + /* u64 to chars: 2^64 - 1 = 21 chars */
> >>
> >> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
> >> we forgot +1 in buf declaration.
> >
> > sizeof("SEQNUM=") will include the '\0' pointer in contrast to
> > strlen("SEQNUM=") so this is correct if I'm not completely mistaken.
>
> The code is OK, I'm worrying about comment. But I've missed this sizeof().
> So, there is only 1 bytes excess allocated as 0xFFFFFFFFFFFFFFFF=18446744073709551615
> Not so important...
>
> >>
> >>> + char buf[sizeof("SEQNUM=") + 21];
> >>> + struct sk_buff *skbc;
> >>> +
> >>> + /* bump sequence number */
> >>> + mutex_lock(&uevent_sock_mutex);
> >>> + seqnum = ++uevent_seqnum;
> >>> + mutex_unlock(&uevent_sock_mutex);
> >>
> >> Commit 7b60a18da393 from Andrew Vagin says:
> >>
> >> uevent: send events in correct order according to seqnum (v3)
> >>
> >> The queue handling in the udev daemon assumes that the events are
> >> ordered.
> >>
> >> Before this patch uevent_seqnum is incremented under sequence_lock,
> >> than an event is send uner uevent_sock_mutex. I want to say that code
> >> contained a window between incrementing seqnum and sending an event.
> >>
> >> This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
> >> Signed-off-by: Andrew Vagin <[email protected]>
> >>
> >> After this change the order will be lost. Also the rest of namespaces
> >> will have holes in uevent numbers, as they won't receive a number sent
> >> to specific namespace.
> >
> > Afaict from looking into udevd when I wrote the patch it only cares
> > about numbers being ordered (which is also what Andrey's patch states)
> > not that they are sequential so holes should be fine. udevd will use
> > the DEVPATH to determine whether the sequence number of the current
> > uevent should be used as "even->delaying_seqnum" number. All that
> > matters is that it is greater than the previous one. About the ordering,
> > if that's an issue then we should simply do what Andrey has been doing
> > for kobject_uevent_env() and extend the lock being held until after we
> > sent out the uevent. Since we're not serving all listeners but only
> > ever one this is also way more lightweight then kobject_uevent_env().
>
> Yes, extending the lock to netlink_broadcast() should fix the problem.
Yep, send another version of the patch but forgot to CC you, sorry:
https://lkml.org/lkml/2018/3/15/1098
Thanks!
Christian
>
> >>
> >> It seems we should made uevent_seqnum per-net to solve this problem.
> >
> > Yes, Eric and I have been discussing this already. The idea was to do
> > this in a follow-up patch to keep this patch as simple as possible. If
> > we agree that it would make sense right away then I will dig out the
> > corresponding patch.
> > It would basically just involve giving each struct net a uevent_seqnum
> > member.
>
> pernet seqnum may have a sense if we introduce per-net locks to protect it.
> If there is single mutex, it does not matter either they are pernet or not.
> Per-net mutex may be useful only if we have many single-net messages like
> you introduce in this patch, or if we are going to filter net in existing
> kobject_uevent_net_broadcast() (by user_ns?!) in the future.
>
> Kirill
On 16.03.2018 02:46, Christian Brauner wrote:
> On Thu, Mar 15, 2018 at 05:14:13PM +0300, Kirill Tkhai wrote:
>> On 15.03.2018 16:39, Christian Brauner wrote:
>>> On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
>>>> CC Andrey Vagin
>>>
>>> Hey Kirill,
>>>
>>> Thanks for CCing Andrey.
>>>
>>>>
>>>> On 15.03.2018 03:12, Christian Brauner wrote:
>>>>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
>>>>> to allow sending uevent messages into the network namespace the socket
>>>>> belongs to.
>>>>>
>>>>> Currently non-initial network namespaces are already isolated and don't
>>>>> receive uevents. There are a number of cases where it is beneficial for a
>>>>> sufficiently privileged userspace process to send a uevent into a network
>>>>> namespace.
>>>>>
>>>>> One such use case would be debugging and fuzzing of a piece of software
>>>>> which listens and reacts to uevents. By running a copy of that software
>>>>> inside a network namespace, specific uevents could then be presented to it.
>>>>> More concretely, this would allow for easy testing of udevd/ueventd.
>>>>>
>>>>> This will also allow some piece of software to run components inside a
>>>>> separate network namespace and then effectively filter what that software
>>>>> can receive. Some examples of software that do directly listen to uevents
>>>>> and that we have in the past attempted to run inside a network namespace
>>>>> are rbd (CEPH client) or the X server.
>>>>>
>>>>> Implementation:
>>>>> The implementation has been kept as simple as possible from the kernel's
>>>>> perspective. Specifically, a simple input method uevent_net_rcv() is added
>>>>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
>>>>> af_netlink infrastructure and does neither add an additional netlink family
>>>>> nor requires any user-visible changes.
>>>>>
>>>>> For example, by using netlink_rcv_skb() we can make use of existing netlink
>>>>> infrastructure to report back informative error messages to userspace.
>>>>>
>>>>> Furthermore, this implementation does not introduce any overhead for
>>>>> existing uevent generating codepaths. The struct netns gets a new uevent
>>>>> socket member that records the uevent socket associated with that network
>>>>> namespace. Since we record the uevent socket for each network namespace in
>>>>> struct net we don't have to walk the whole uevent socket list.
>>>>> Instead we can directly retrieve the relevant uevent socket and send the
>>>>> message. This keeps the codepath very performant without introducing
>>>>> needless overhead.
>>>>>
>>>>> Uevent sequence numbers are kept global. When a uevent message is sent to
>>>>> another network namespace the implementation will simply increment the
>>>>> global uevent sequence number and append it to the received uevent. This
>>>>> has the advantage that the kernel will never need to parse the received
>>>>> uevent message to replace any existing uevent sequence numbers. Instead it
>>>>> is up to the userspace process to remove any existing uevent sequence
>>>>> numbers in case the uevent message to be sent contains any.
>>>>>
>>>>> Security:
>>>>> In order for a caller to send uevent messages to a target network namespace
>>>>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
>>>>> target network namespace. Additionally, any received uevent message is
>>>>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
>>>>> needed to append the uevent sequence number.
>>>>>
>>>>> Testing:
>>>>> This patch has been tested and verified to work with the following udev
>>>>> implementations:
>>>>> 1. CentOS 6 with udevd version 147
>>>>> 2. Debian Sid with systemd-udevd version 237
>>>>> 3. Android 7.1.1 with ueventd
>>>>>
>>>>> Signed-off-by: Christian Brauner <[email protected]>
>>>>> ---
>>>>> include/net/net_namespace.h | 1 +
>>>>> lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>> 2 files changed, 88 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>>>>> index f306b2aa15a4..467bde763a9b 100644
>>>>> --- a/include/net/net_namespace.h
>>>>> +++ b/include/net/net_namespace.h
>>>>> @@ -78,6 +78,7 @@ struct net {
>>>>>
>>>>> struct sock *rtnl; /* rtnetlink socket */
>>>>> struct sock *genl_sock;
>>>>> + struct sock *uevent_sock; /* uevent socket */
>>>>
>>>> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
>>>> become to look confusing. There are:
>>>>
>>>> mutex_lock(&uevent_sock_mutex);
>>>> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>>>> if (sock_net(ue_sk->sk) == net)
>>>> goto found;
>>>> }
>>>>
>>>> Can't we make a small cleanup in lib/kobject_uevent.c after this change
>>>> and before the main part of the patch goes?
>>>
>>> Hm, not sure. It seems it makes sense to maintain them in a separate
>>> list. Looks like this lets us keep the locking simpler. Otherwise we
>>> would have to do something like for_each_net() and it seems that this
>>> would force us to use rntl_{un}lock().
>>
>> I'm about:
>>
>> mutex_lock();
>> list_del(net->ue_sk->list);
>> mutex_unlock();
>> kfree();
>>
>> Thus we avoid iterations in uevent_net_exit().
>
> Ah right, but I only added struct sock *uevent_sock to struct net, not
> struct uevent_sock. Thus there's no list member in there. I mainly only
> added struct sock to keep it clean an aligned with the other sock
> member. We can revisit this later though.
In my opinion, we shouldn't leave the code in inconsistent state just because
of it's not interesting for this patch. There are two logical changes:
1)add fast-access pointer to struct net
2)implement single-net uevent
It's easy to do 1), and this may help further reader not to be confused
by the inconsistency, so I don't see a reason why we shouldn't do that.
>>
>>>>
>>>>>
>>>>> struct list_head dev_base_head;
>>>>> struct hlist_head *dev_name_head;
>>>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>>>> index 9fe6ec8fda28..10b2144b9fc3 100644
>>>>> --- a/lib/kobject_uevent.c
>>>>> +++ b/lib/kobject_uevent.c
>>>>> @@ -25,6 +25,7 @@
>>>>> #include <linux/uuid.h>
>>>>> #include <linux/ctype.h>
>>>>> #include <net/sock.h>
>>>>> +#include <net/netlink.h>
>>>>> #include <net/net_namespace.h>
>>>>>
>>>>>
>>>>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>>>>> EXPORT_SYMBOL_GPL(add_uevent_var);
>>>>>
>>>>> #if defined(CONFIG_NET)
>>>>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
>>>>> + struct netlink_ext_ack *extack)
>>>>> +{
>>>>> + int ret;
>>>>> + u64 seqnum;
>>>>> + /* u64 to chars: 2^64 - 1 = 21 chars */
>>>>
>>>> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
>>>> we forgot +1 in buf declaration.
>>>
>>> sizeof("SEQNUM=") will include the '\0' pointer in contrast to
>>> strlen("SEQNUM=") so this is correct if I'm not completely mistaken.
>>
>> The code is OK, I'm worrying about comment. But I've missed this sizeof().
>> So, there is only 1 bytes excess allocated as 0xFFFFFFFFFFFFFFFF=18446744073709551615
>> Not so important...
>>
>>>>
>>>>> + char buf[sizeof("SEQNUM=") + 21];
>>>>> + struct sk_buff *skbc;
>>>>> +
>>>>> + /* bump sequence number */
>>>>> + mutex_lock(&uevent_sock_mutex);
>>>>> + seqnum = ++uevent_seqnum;
>>>>> + mutex_unlock(&uevent_sock_mutex);
>>>>
>>>> Commit 7b60a18da393 from Andrew Vagin says:
>>>>
>>>> uevent: send events in correct order according to seqnum (v3)
>>>>
>>>> The queue handling in the udev daemon assumes that the events are
>>>> ordered.
>>>>
>>>> Before this patch uevent_seqnum is incremented under sequence_lock,
>>>> than an event is send uner uevent_sock_mutex. I want to say that code
>>>> contained a window between incrementing seqnum and sending an event.
>>>>
>>>> This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
>>>> Signed-off-by: Andrew Vagin <[email protected]>
>>>>
>>>> After this change the order will be lost. Also the rest of namespaces
>>>> will have holes in uevent numbers, as they won't receive a number sent
>>>> to specific namespace.
>>>
>>> Afaict from looking into udevd when I wrote the patch it only cares
>>> about numbers being ordered (which is also what Andrey's patch states)
>>> not that they are sequential so holes should be fine. udevd will use
>>> the DEVPATH to determine whether the sequence number of the current
>>> uevent should be used as "even->delaying_seqnum" number. All that
>>> matters is that it is greater than the previous one. About the ordering,
>>> if that's an issue then we should simply do what Andrey has been doing
>>> for kobject_uevent_env() and extend the lock being held until after we
>>> sent out the uevent. Since we're not serving all listeners but only
>>> ever one this is also way more lightweight then kobject_uevent_env().
>>
>> Yes, extending the lock to netlink_broadcast() should fix the problem.
>
> Yep, send another version of the patch but forgot to CC you, sorry:
> https://lkml.org/lkml/2018/3/15/1098
>
> Thanks!
> Christian
>
>>
>>>>
>>>> It seems we should made uevent_seqnum per-net to solve this problem.
>>>
>>> Yes, Eric and I have been discussing this already. The idea was to do
>>> this in a follow-up patch to keep this patch as simple as possible. If
>>> we agree that it would make sense right away then I will dig out the
>>> corresponding patch.
>>> It would basically just involve giving each struct net a uevent_seqnum
>>> member.
>>
>> pernet seqnum may have a sense if we introduce per-net locks to protect it.
>> If there is single mutex, it does not matter either they are pernet or not.
>> Per-net mutex may be useful only if we have many single-net messages like
>> you introduce in this patch, or if we are going to filter net in existing
>> kobject_uevent_net_broadcast() (by user_ns?!) in the future.
>>
>> Kirill