From: Christian Brauner
To: ebiederm@xmission.com, davem@davemloft.net, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org
Cc: avagin@virtuozzo.com, ktkhai@virtuozzo.com, serge@hallyn.com,
    gregkh@linuxfoundation.org, Christian Brauner
Subject: [PATCH net-next 1/2 v1] netns: restrict uevents
Date: Mon, 23 Apr 2018 12:24:42 +0200
Message-Id: <20180423102443.16627-2-christian.brauner@ubuntu.com>
In-Reply-To: <20180423102443.16627-1-christian.brauner@ubuntu.com>
References: <20180423102443.16627-1-christian.brauner@ubuntu.com>
X-Mailer: git-send-email 2.17.0
X-Mailing-List: linux-kernel@vger.kernel.org

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket. Network devices are
one example: uevents for network devices only show up in the network
namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to network namespaces owned by the initial
user namespace for a couple of reasons that have been extensively discussed
on the mailing list [1].

- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead. All processes that listen to uevents while running in
  non-initial user namespaces end up responding to uevents that are
  meaningless to them, mainly because non-initial user namespaces cannot
  easily manage devices unless a privileged host process helps them out.
  This means that there will be a thundering herd of activity when there
  shouldn't be any.

- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into any network namespace whose owning user
  namespace is not the initial user namespace, because we know that
  userspace will ignore the message because of the uid anyway.
  I have verified a) that this is true for every udev implementation out
  there and b) that this behavior has been present in all udev
  implementations from the very beginning.

- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So every time a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc.) until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well, which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above).
  By just recording uevent sockets of network namespaces that are owned by
  the initial user namespace we significantly increase performance in this
  codepath.

- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events, especially if the devices are delegated to them by a
  privileged userspace process. One prime example is SR-IOV enabled devices,
  which are explicitly designed to be handed off to other users such as VMs
  or containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process decide
  which uevents should be sent. This removes the need to blindly broadcast
  uevents into all user namespaces and provides a performant and safe
  solution to this problem.

- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means that if the user namespace of a given task is unshared but the
  network namespace is kept and is owned by the initial user namespace, a
  listener that opens the uevent socket in that network namespace can
  still listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739

Signed-off-by: Christian Brauner
---
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0
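
The "Filtering logic" point of the commit message can be illustrated with a
small userspace model. This is only a sketch and not kernel code: the types
model_user_ns, model_net and model_task, the object model_init_user_ns and
both functions are hypothetical names made up for this example. The only
thing taken from the patch is the shape of the check, namely comparing the
owning user namespace of the network namespace against the initial user
namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's user and network namespaces. */
struct model_user_ns {
	const char *name;
};

struct model_net {
	const struct model_user_ns *user_ns;	/* owning user namespace */
};

struct model_task {
	const struct model_user_ns *user_ns;	/* user namespace of the task */
	const struct model_net *net;		/* network namespace of the task */
};

/* Plays the role of the kernel's init_user_ns in this model. */
static const struct model_user_ns model_init_user_ns = { "init" };

/*
 * Same shape as the check added by the patch: only the owner of the
 * network namespace is compared against the initial user namespace.
 */
bool netns_gets_broadcast_uevents(const struct model_net *net)
{
	assert(net != NULL);
	return net->user_ns == &model_init_user_ns;
}

/* The task's own user namespace is deliberately not consulted. */
bool task_can_hear_broadcast_uevents(const struct model_task *task)
{
	assert(task != NULL && task->net != NULL);
	return netns_gets_broadcast_uevents(task->net);
}
```

In this model a task that unshared only its user namespace, but still sits
in a network namespace owned by the initial user namespace, keeps receiving
broadcast uevents. A task whose network namespace is owned by a non-initial
user namespace does not, and would have to rely on uevents injected by a
privileged process as described under "Injecting uevents".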