Received: by 10.213.65.68 with SMTP id h4csp4030310imn; Tue, 10 Apr 2018 08:13:18 -0700 (PDT) X-Google-Smtp-Source: AIpwx49VYB8RzbA+0uRe1ktbKIQMVmNwe+wBKFzSVqS3Hgq+cJflWJl+x8oGmtOJ8Vd1cHM8Flpw X-Received: by 2002:a17:902:ab81:: with SMTP id f1-v6mr856691plr.5.1523373198212; Tue, 10 Apr 2018 08:13:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1523373198; cv=none; d=google.com; s=arc-20160816; b=HVlhnDt5P4hOF/rX2GFFqQcZJL3akYETQU8h3PPBfo7SFPaImcB7Q5eXGQdAPexOHd g0Gzvo7HKt85rzyeMia9QCe3O/i+qMbL+L/CniclJdxEUcKf5x/2vyTpz2pK+gdwGBY6 DJr3Nx7FyXO+wzZE+27kXqycWDl2Wnl6pt1zFIv0ElRCTImtgbYZRjiIW6mqobi4eANT lMt20xFpEf9VDEaA3a/tntKD+TwvZaOvjEQohz94kVmt9uQVkgIGcuGh22hukoRU1M3c iqMrl3A7/nkVA0AMy889GUqfx85i/qoB6s6ovwJIj14WGCDU5IAiJlOsTX3Lqz51GZDn bRUA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:subject:mime-version:user-agent :message-id:in-reply-to:date:references:cc:to:from :arc-authentication-results; bh=eTlt0ihXSP6AqzOj2p8G5BzLXFXlhU4U7F8T15PZY30=; b=UswBQYMvRf4w1LAh+NiFpDwImgJtD55P5mdUMPseChJtbxTa5z7PFQdAsr+lw9qN3B axcgLP/vJW7TXt3cNDPInWVl9czlFcZK0V6p3tS4PEzKJ5OLmCqFcLK8irx86SJ8Z6J0 aE/kXDaO/1JnMECvX7U6C9IzQmfXUh5pltjZ7CdbjRmtp/mGN7sHQgoeykxfx4Je/vjM B0HSQ+d2VTcTvD/ONnPgRNbXzdYlu7uUWdrXaLiGVdXf+nSq2fc3+PvDa15rhX0isZ45 qB+LIoIKwRI7YnwwhPh2zmU8e1TpTLBOLsDfcpvUUTp/0uYKePlIHbgFL6oJgSlXWgFv 6fpw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 65-v6si2774628plb.573.2018.04.10.08.12.39; Tue, 10 Apr 2018 08:13:18 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754269AbeDJPGJ (ORCPT + 99 others); Tue, 10 Apr 2018 11:06:09 -0400 Received: from out03.mta.xmission.com ([166.70.13.233]:56267 "EHLO out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753118AbeDJPGH (ORCPT ); Tue, 10 Apr 2018 11:06:07 -0400 Received: from in02.mta.xmission.com ([166.70.13.52]) by out03.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87) (envelope-from ) id 1f5uqQ-0004KC-9N; Tue, 10 Apr 2018 09:06:02 -0600 Received: from [97.119.140.30] (helo=x220.xmission.com) by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87) (envelope-from ) id 1f5uqO-0003tE-Uc; Tue, 10 Apr 2018 09:06:02 -0600 From: ebiederm@xmission.com (Eric W. Biederman) To: Christian Brauner Cc: Kirill Tkhai , davem@davemloft.net, gregkh@linuxfoundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, avagin@virtuozzo.com, serge@hallyn.com References: <20180404194857.29375-1-christian.brauner@ubuntu.com> <442e89b8-e947-6eeb-1bcb-fa28f22a25f0@virtuozzo.com> <20180405140709.GA1697@gmail.com> <941de2b9-332f-75fc-f8ac-4059a9b5426f@virtuozzo.com> <20180405144130.GB26043@gmail.com> <87in953ryi.fsf@xmission.com> <20180409154627.GA15157@gmail.com> <878t9wx8xw.fsf@xmission.com> <20180410143515.GA14186@gmail.com> Date: Tue, 10 Apr 2018 10:04:46 -0500 In-Reply-To: <20180410143515.GA14186@gmail.com> (Christian Brauner's message of "Tue, 10 Apr 2018 16:35:16 +0200") Message-ID: <87in8zumpd.fsf@xmission.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1f5uqO-0003tE-Uc;;;mid=<87in8zumpd.fsf@xmission.com>;;;hst=in02.mta.xmission.com;;;ip=97.119.140.30;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX18HuIUW57eNcl5aA9j4Ab63mjtOgLtPWQA= X-SA-Exim-Connect-IP: 97.119.140.30 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on sa04.xmission.com X-Spam-Level: X-Spam-Status: No, score=0.8 required=8.0 tests=ALL_TRUSTED,BAYES_50, DCC_CHECK_NEGATIVE,T_TM2_M_HEADER_IN_MSG,T_XMDrugObfuBody_04 autolearn=disabled version=3.4.1 X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.5000] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa04 1397; Body=1 Fuz1=1 Fuz2=1] * 1.0 T_XMDrugObfuBody_04 obfuscated drug references X-Spam-DCC: XMission; sa04 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: ;Christian Brauner X-Spam-Relay-Country: X-Spam-Timing: total 777 ms - load_scoreonly_sql: 0.03 (0.0%), signal_user_changed: 4.4 (0.6%), b_tie_ro: 3.4 (0.4%), parse: 1.97 (0.3%), extract_message_metadata: 41 (5.3%), get_uri_detail_list: 15 (2.0%), tests_pri_-1000: 12 (1.5%), tests_pri_-950: 1.61 (0.2%), tests_pri_-900: 1.27 (0.2%), tests_pri_-400: 71 (9.2%), check_bayes: 69 (8.9%), b_tokenize: 29 (3.7%), b_tok_get_all: 23 (2.9%), b_comp_prob: 6 (0.8%), b_tok_touch_all: 8 (1.0%), b_finish: 0.71 (0.1%), tests_pri_0: 628 (80.9%), check_dkim_signature: 0.62 (0.1%), check_dkim_adsp: 3.8 (0.5%), tests_pri_500: 9 (1.2%), rewrite_mail: 0.00 (0.0%) Subject: Re: [PATCH net-next] netns: filter uevents correctly X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Christian Brauner writes: > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote: >> Christian Brauner writes: >> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote: >> >> Christian Brauner writes: >> >> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote: >> >> >> On 05.04.2018 17:07, Christian Brauner wrote: >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote: >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote: >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") >> >> >> >>> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010. >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices >> >> >> >>> that carry a namespace tag are filtered according to that namespace. >> >> >> >>> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject >> >> >> >>> does not match the namespace tag of the netlink socket. One example are >> >> >> >>> network devices. Uevents for network devices only show up in the network >> >> >> >>> namespaces these devices are moved to or created in. >> >> >> >>> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it >> >> >> >>> into all network namespaces. >> >> >> >>> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became >> >> >> >>> partially isolated as they were filtered by user namespaces: >> >> >> >>> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast() >> >> >> >>> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) { >> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) >> >> >> >>> return; >> >> >> >>> >> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net)) >> >> >> >>> return; >> >> >> >>> >> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, >> >> >> >>> CAP_NET_BROADCAST)) >> >> >> >>> j return; >> >> >> >>> } >> >> >> >>> >> >> >> >>> The file_ns_capable() check will check whether the caller had >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a >> >> >> >>> namespace tag should never be sent into another user namespace. This has >> >> >> >>> been the intention all along. But there's one case where this breaks, >> >> >> >>> namely if a new user namespace is created by root on the host and an >> >> >> >>> identity mapping is established between root on the host and root in the >> >> >> >>> new user namespace. Here's a reproducer: >> >> >> >>> >> >> >> >>> sudo unshare -U --map-root >> >> >> >>> udevadm monitor -k >> >> >> >>> # Now change to initial user namespace and e.g. do >> >> >> >>> modprobe kvm >> >> >> >>> # or >> >> >> >>> rmmod kvm >> >> >> >>> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the >> >> >> >>> host. This seems very anecdotal given that in the general case user >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful >> >> >> >>> with them. >> >> >> >>> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process >> >> >> >>> make a decision what uevents should be sent. >> >> >> >>> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace >> >> >> >>> in kobj_bcast_filter(). Specifically: >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the >> >> >> >>> event will always be filtered. >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the >> >> >> >>> initial user namespace but was opened from a non-initial user namespace >> >> >> >>> the event will be filtered as well. >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now >> >> >> >>> always only sent to the initial user namespace. The regression potential >> >> >> >>> for this is near to non-existent since user namespaces can't really do >> >> >> >>> anything with interesting devices. >> >> >> >>> >> >> >> >>> Signed-off-by: Christian Brauner >> >> >> >>> --- >> >> >> >>> lib/kobject_uevent.c | 10 +++++++++- >> >> >> >>> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> >> >>> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644 >> >> >> >>> --- a/lib/kobject_uevent.c >> >> >> >>> +++ b/lib/kobject_uevent.c >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data) >> >> >> >>> return sock_ns != ns; >> >> >> >>> } >> >> >> >>> >> >> >> >>> - return 0; >> >> >> >>> + /* >> >> >> >>> + * The kobject does not carry a namespace tag so filter by user >> >> >> >>> + * namespace below. >> >> >> >>> + */ >> >> >> >>> + if (sock_net(dsk)->user_ns != &init_user_ns) >> >> >> >>> + return 1; >> >> >> >>> + >> >> >> >>> + /* Check if socket was opened from non-initial user namespace. */ >> >> >> >>> + return sk_user_ns(dsk) != &init_user_ns; >> >> >> >>> } >> >> >> >>> #endif >> >> >> >> >> >> >> >> So, this prohibits to listen events of all devices except network-related >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not >> >> >> > >> >> >> > No, this is not correct: As it is right now *without my patch* no >> >> >> > non-initial user namespace is receiving *any uevents* but those >> >> >> > specifically namespaced such as those for network devices. This patch >> >> >> > doesn't change that at all. The commit message outlines this in detail >> >> >> > how this comes about. >> >> >> > There is only one case where this currently breaks and this is as I >> >> >> > outlined explicitly in my commit message when you create a new user >> >> >> > namespace and map container(0) -> host(0). This patch fixes this. >> >> >> >> >> >> Could you please point the place, where non-initial user namespaces are filtered? >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted". >> >> >> Now it will return 1 sometimes. >> >> > >> >> > Oh sure, it's in the commit message though. The callchain is >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() -> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() -> >> >> > net/netlink/af_netlink.c:do_one_broadcast(): >> >> > >> >> > This codepiece will check whether the openened socket holds >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace >> >> > which it won't because we don't have device namespaces and all devices >> >> > belong to the initial set of namespaces. >> >> > >> >> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, >> >> > CAP_NET_BROADCAST)) >> >> > j return; >> >> > >> >> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID >> >> on their socket and has had someone with the appropriate privileges >> >> assign a peerrnetid. >> >> >> >> All of which is to say that file_ns_capable is not nearly as applicable >> >> as it might be, and if you can pass the other two checks I think it is >> >> pointless (because the peernet attributes are not generated for >> >> kobj_uevents) but valid to receive events from outside your network >> >> namespace. >> >> >> >> >> >> I might be missing something but I don't see anything excluding network >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent >> >> logic. >> >> >> >> The uevent_sock_list has one entry per network namespace. And >> >> kobject_uevent_net_broacast appears to walk each one. >> >> >> >> I had a memory of filtering uevent messages and I had a memory >> >> that the netlink_has_listeners had a per network namespace effect. >> >> Neither seems true from my inspection of the code tonight. >> >> >> >> If we are not filtering ordinary uevents at least at the user namespace >> >> level that does seem to be at least a nuisance. >> >> >> >> >> >> Christian can you dig a little deeper into this. I have a feeling that >> >> there are some real efficiency improvements that we could make to >> >> kobject_uevent_net_broadcast if nothing else. >> >> >> >> Perhaps you could see where uevents are broadcast by poking >> >> the sysfs uevent of an existing device, and triggering a retransmit. >> > >> > @Eric, so I did some intensive testing over the weekend and forget everything I >> > said before. Uevents are not filtered by the kernel at the moment. This is >> > currently - apart from network devices - a pure userspace thing. Specifically, >> > anyone on the host can listen to all uevents from everywhere. It's neither >> > filtered by user nor by network namespace. The fact that I used >> > >> > udevadm --debug monitor >> > >> > to test my prior hypothesis was what led me to believe that uevents are already >> > correctly filtered. >> > Instead, what is actually happening is that every udev implementation out there >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA. >> > Specifically, >> >> Yes. I remember something of the sort. I believe udev also filters to >> ensure that the netlink port id == 0 to ensure the message came from >> the kernel and was not spoofed in any way. >> >> > - legacy standalone udevd: >> > https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz >> > - eudevd >> > https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645 >> > - systemd-udevd >> > https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656 >> > - ueventd (Android) >> > https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81 >> > >> > For all of those I could trace this behavior back to the first released >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd >> > I could trace it back to the first version that is still available on >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is >> > trivially true that it has the same behavior from the beginning.) >> > Android filters uevents in the same way but removed that check on January 8 >> > 2018 for what I think is invalid reasoning. The good news is that this is only >> > in their master branch. It hasn't even made it into an rc release for Android 8 >> > yet. I filed a bug against Android and offered them a fix if they agree. >> > >> > In any case, userspace udev is not making use of uevents at all right now since >> > any uid != 0 events are **explicitly** discarded. >> > The fact that you receive uevents for >> > >> > sudo unshare -U --map-root -n >> > udevadm --debug monitor >> > >> > is simply explained by the fact that container(0) <=> host(0) at which point >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not >> > discard it. >> > The use case for receiving uevents in containers/user namespaces is definitely >> > there but that's what the uevent injection patch series was for that we merged. >> > This is a much safer and saner solution. >> > The fact that all udev implementations filter uevents by uid != 0 very much >> > seems like a security mechanism in userspace that we probably should provide by >> > isolating uevents based on user and/or network namespaces. >> >> So in summary. Uevents are filtered in a user namespace (by userspace) >> because the received uid != 0. It instead == 65534 == "nobody" because >> the global root uid is not mapped. > > Exactly. > >> >> Which means that we can modify the kernel to not send to all network >> namespaces whose user_ns != &init_user_ns because we know that userspace >> will ignore the message because of the uid anyway. Which means when > > Yes. > >> net-next reopens you can send that patch. But please base it on just >> not including network namespaces in the list, as that is much more >> efficient than adding more conditions to the filter. > > I'll send a patch out once net-next reopens. I'll also make sure to > inform all udev userspace implementations of the change. It won't affect > them but it is nice for them to know that they're safer now. The real danger is in a user namespace or in a container really is too many daemons responding to events will generate a thundering hurd of activity when there is really nothing to do. > Something like this (Proper commit message and so on will be added once > I sent this out.): Exactly. I would make the comment say something like: "ignore all but the initial user namespace". Eric > From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001 > From: Christian Brauner > Date: Tue, 10 Apr 2018 11:56:49 +0200 > Subject: [PATCH] netns: restrict uevents to initial user namespace > > /* Here'll be a useful commit message describing in detail what's > * happening once I sent it to net-next. */ > > Signed-off-by: Christian Brauner > --- > lib/kobject_uevent.c | 21 +++++++++++++++------ > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c > index 15ea216a67ce..22a2c1a98b8f 100644 > --- a/lib/kobject_uevent.c > +++ b/lib/kobject_uevent.c > @@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net) > > net->uevent_sock = ue_sk; > > - mutex_lock(&uevent_sock_mutex); > - list_add_tail(&ue_sk->list, &uevent_sock_list); > - mutex_unlock(&uevent_sock_mutex); > + /* > + * Only sent uevents to uevent sockets that are located in network > + * namespaces owned by the initial user namespace. > + */ > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) { > + mutex_lock(&uevent_sock_mutex); > + list_add_tail(&ue_sk->list, &uevent_sock_list); > + mutex_unlock(&uevent_sock_mutex); > + } > + > return 0; > } > > @@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net) > { > struct uevent_sock *ue_sk = net->uevent_sock; > > - mutex_lock(&uevent_sock_mutex); > - list_del(&ue_sk->list); > - mutex_unlock(&uevent_sock_mutex); > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) { > + mutex_lock(&uevent_sock_mutex); > + list_del(&ue_sk->list); > + mutex_unlock(&uevent_sock_mutex); > + } > > netlink_kernel_release(ue_sk->sk); > kfree(ue_sk);