From: Christian Brauner
Date: Thu, 5 Apr 2018 16:07:10 +0200
To: Kirill Tkhai
Cc: ebiederm@xmission.com, davem@davemloft.net, gregkh@linuxfoundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, avagin@virtuozzo.com, serge@hallyn.com
Subject: Re: [PATCH net-next] netns: filter uevents correctly
Message-ID: <20180405140709.GA1697@gmail.com>
References: <20180404194857.29375-1-christian.brauner@ubuntu.com> <442e89b8-e947-6eeb-1bcb-fa28f22a25f0@virtuozzo.com>
In-Reply-To: <442e89b8-e947-6eeb-1bcb-fa28f22a25f0@virtuozzo.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> On 04.04.2018 22:48, Christian Brauner wrote:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk. We have now reached the point where hotplug events for all devices
> > that carry a namespace tag are filtered according to that namespace.
> >
> > Specifically, they are filtered whenever the namespace tag of the kobject
> > does not match the namespace tag of the netlink socket. One example are
> > network devices. Uevents for network devices only show up in the network
> > namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will *try* to broadcast it
> > into all network namespaces.
> >
> > The original patchset was written in 2010 before user namespaces were a
> > thing. With the introduction of user namespaces sending out uevents became
> > partially isolated as they were filtered by user namespaces:
> >
> > net/netlink/af_netlink.c:do_one_broadcast()
> >
> > if (!net_eq(sock_net(sk), p->net)) {
> > 	if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> > 		return;
> >
> > 	if (!peernet_has_id(sock_net(sk), p->net))
> > 		return;
> >
> > 	if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> > 			     CAP_NET_BROADCAST))
> > 		return;
> > }
> >
> > The file_ns_capable() check will check whether the caller had
> > CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> > namespace of interest. This check is fine in general but seems insufficient
> > to me when paired with uevents. The reason is that devices always belong to
> > the initial user namespace so uevents for kobjects that do not carry a
> > namespace tag should never be sent into another user namespace. This has
> > been the intention all along. But there's one case where this breaks,
> > namely if a new user namespace is created by root on the host and an
> > identity mapping is established between root on the host and root in the
> > new user namespace. Here's a reproducer:
> >
> > 	sudo unshare -U --map-root
> > 	udevadm monitor -k
> > 	# Now change to initial user namespace and e.g. do
> > 	modprobe kvm
> > 	# or
> > 	rmmod kvm
> >
> > will allow the non-initial user namespace to retrieve all uevents from the
> > host. This seems very anecdotal given that in the general case user
> > namespaces do not see any uevents and also can't really do anything useful
> > with them.
> >
> > Additionally, it is now possible to send uevents from userspace. As such we
> > can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> > namespace of the network namespace of the netlink socket) userspace process
> > make a decision what uevents should be sent.
> >
> > This makes me think that we should simply ensure that uevents for kobjects
> > that do not carry a namespace tag are *always* filtered by user namespace
> > in kobj_bcast_filter(). Specifically:
> > - If the owning user namespace of the uevent socket is not init_user_ns the
> >   event will always be filtered.
> > - If the network namespace the uevent socket belongs to was created in the
> >   initial user namespace but was opened from a non-initial user namespace
> >   the event will be filtered as well.
> > Put another way, uevents for kobjects not carrying a namespace tag are now
> > always only sent to the initial user namespace. The regression potential
> > for this is near to non-existent since user namespaces can't really do
> > anything with interesting devices.
> >
> > Signed-off-by: Christian Brauner
> > ---
> >  lib/kobject_uevent.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..cb98cddb6e3b 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  		return sock_ns != ns;
> >  	}
> >
> > -	return 0;
> > +	/*
> > +	 * The kobject does not carry a namespace tag so filter by user
> > +	 * namespace below.
> > +	 */
> > +	if (sock_net(dsk)->user_ns != &init_user_ns)
> > +		return 1;
> > +
> > +	/* Check if socket was opened from non-initial user namespace. */
> > +	return sk_user_ns(dsk) != &init_user_ns;
> >  }
> >  #endif
>
> So, this prohibits to listen events of all devices except network-related
> in containers? If it's so, I don't think it's a good solution. Uevents is not

No, this is not correct: as it is right now, *without my patch*, no non-initial
user namespace receives *any uevents* except those that are specifically
namespaced, such as those for network devices. This patch doesn't change that
at all. The commit message outlines in detail how this comes about.
There is only one case where this currently breaks, and that is, as I outlined
explicitly in my commit message, when you create a new user namespace and map
container(0) -> host(0). This patch fixes this.

> net-devices-only related interface and it's used for all devices in system.
> People may want to delegate block devices to nested user_ns, for example.

That's fine, but that's why I added uevent injection in a previous patch
series. I repeat: no non-initial user namespace will receive uevents by
default.

Thanks!
Christian
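
For illustration only: the decision the quoted hunk adds to kobj_bcast_filter() for kobjects without a namespace tag can be modelled in user space roughly as below. This is a sketch, not kernel code. The names user_ns_model, init_user_ns_model, uevent_sock_model, filter_untagged_uevent and check_filter_model are hypothetical stand-ins; only the two comparisons against the initial user namespace are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct user_namespace; not a kernel type. */
struct user_ns_model {
	const char *name;
};

/* Models the kernel's init_user_ns (the initial user namespace). */
struct user_ns_model init_user_ns_model = { "init" };

/* Hypothetical view of a uevent netlink socket. */
struct uevent_sock_model {
	const struct user_ns_model *netns_owner; /* models sock_net(dsk)->user_ns */
	const struct user_ns_model *opener;      /* models sk_user_ns(dsk) */
};

/*
 * Returns true when a uevent for a kobject without a namespace tag
 * would be filtered, i.e. not delivered to this socket.
 */
bool filter_untagged_uevent(const struct uevent_sock_model *s)
{
	/* Network namespace is owned by a non-initial user namespace. */
	if (s->netns_owner != &init_user_ns_model)
		return true;

	/* Socket was opened from a non-initial user namespace. */
	return s->opener != &init_user_ns_model;
}

/* Walks through the three cases discussed in the thread. */
void check_filter_model(void)
{
	struct user_ns_model container = { "container" };
	struct uevent_sock_model host = { &init_user_ns_model, &init_user_ns_model };
	struct uevent_sock_model owned = { &container, &container };
	struct uevent_sock_model mapped = { &init_user_ns_model, &container };

	assert(!filter_untagged_uevent(&host));  /* delivered on the host */
	assert(filter_untagged_uevent(&owned));  /* netns owned by container userns */
	assert(filter_untagged_uevent(&mapped)); /* the "unshare -U" reproducer case */
}
```

Calling check_filter_model() from a small main() exercises all three cases. The "mapped" case corresponds to the reproducer in the commit message: the socket lives in the initial network namespace but was opened from a new user namespace, which is the one situation the patch changes.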