From: Djalal Harouni
To: Kees Cook, Alexey Gladkov, Andy Lutomirski, Andrew Morton,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    kernel-hardening@lists.openwall.com,
    linux-security-module@vger.kernel.org, linux-api@vger.kernel.org
Cc: Greg Kroah-Hartman, Alexander Viro, Akinobu Mita, me@tobin.cc,
    Oleg Nesterov, Jeff Layton, Ingo Molnar, Alexey Dobriyan,
    ebiederm@xmission.com, Linus Torvalds, Daniel Micay, Jonathan Corbet,
    bfields@fieldses.org, Stephen Rothwell, solar@openwall.com,
    Djalal Harouni
Subject: [PATCH RFC v3 0/7] proc: modernize proc to support multiple private instances
Date: Thu, 9 Nov 2017 17:13:59 +0100
Message-Id: <1510244046-3256-1-git-send-email-tixxdz@gmail.com>

Hi list,

Preface:
--------

This is RFC v3 to modernize procfs and make it able to support multiple
private instances inside the same pid namespace. I have been working on
this with Alexey Gladkov and Andy Lutomirski.

RFC v1 is here: https://lkml.org/lkml/2017/3/30/670
RFC v2 is here: https://lkml.org/lkml/2017/4/25/282

This RFC v3 can be applied on top of next-20171109.

This RFC was tested on Ubuntu/Debian, and Alexey tested it on altlinux.
It does not work on Fedora due to a bug during boot with dracut. I did
not have time to investigate it further, and I will make sure to fix it
in the next iteration. We decided to send it now to get more feedback on
the direction, and we will continue to improve it.

RFC v3 handles all previous comments from Andy Lutomirski, thank you
for all the feedback.

Procfs modernization:
---------------------

Historically procfs was always tied to pid namespaces: during pid
namespace creation we internally create a procfs mount for it. However,
this has the effect that all new procfs mounts are just a mirror of the
internal one. Any change, any mount option update, any future addition
will propagate to all other procfs mounts that are in the same pid
namespace.

This may have solved several use cases at that time. However, today we
face new requirements, and making procfs able to support new private
instances inside the same pid namespace seems a major point. If we want
to introduce new features and security mechanisms we first have to make
sure that we do not break existing use cases. Supporting private procfs
instances will allow us to support new features and behaviour without
propagating them to all other procfs mounts.

Today procfs is more of a burden, especially for some Embedded, IoT,
sandbox and container use cases. In user space we are over-mounting null
or inaccessible files on top to hide files and information. If we want
to hide pids we have to create PID namespaces, otherwise mount options
propagate to all other proc mounts: changing a mount option value in one
mount will propagate to all other proc mounts. If we want to introduce
new features, then they will propagate to all other mounts too,
resulting either in new useful functionality or in breaking stuff.
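
A minimal sketch for observing the mirroring described above on a
running system, assuming a Linux /proc/self/mountinfo in the format
documented in proc(5). The function name list_proc_mounts is made up for
this illustration. It prints the device number, mount point, per-mount
options and superblock options of every procfs mount. On a kernel
without this series, procfs mounts in the same pid namespace are
expected to report the same device number and the same superblock
options, since they all mirror one internal instance.

```shell
# Illustrative helper (not part of the series): list every procfs mount
# found in a mountinfo file.
# Output columns: major:minor, mount point, mount options, super options.
list_proc_mounts() {
    awk '{
        # Fields 7.. are optional fields, terminated by a single "-".
        for (i = 7; i <= NF; i++)
            if ($i == "-")
                break
        # After the separator: filesystem type, source, super options.
        if ($(i + 1) == "proc")
            print $3, $5, $6, $(i + 3)
    }' "${1:-/proc/self/mountinfo}"
}
```

Calling list_proc_mounts with no argument reads /proc/self/mountinfo;
a saved copy of that file can be passed as the first argument instead.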

We also have to note that userspace should not have to work around
procfs; the kernel should just provide a sane, simple interface.

In this regard several developers and maintainers pointed out that there
are problems with procfs and that it has to be modernized:

"Here's another one: split up and modernize /proc."
by Andy Lutomirski [1]

Discussion about kernel pointer leaks:

"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself."
By Linus Torvalds [2]

A lot of other areas in the kernel and filesystems have been updated to
be able to support private instances, devpts is one major example [3].

The aim here is to modernize procfs without breaking userspace and
without affecting the shared procfs mount. Later, new features will
apply to the private instances, and after more testing, months maybe,
private instances can be made the default, especially for IoT.

We want the possibility to do:

    mount -t proc -onewinstance,newfeature none /proc

newfeature: we are planning new features later for procfs, for now in
this RFC we only introduce the "pids=all|ptraceable" mount option.

This allows us to absorb changes and make improvements without breaking
use cases.

Which will be used for:

1) Embedded systems and IoT: usually we have one supervisor for apps,
   and we have some lightweight sandbox support. However, if we create
   pid namespaces we have to manage all the processes inside too,
   whereas our goal is to be able to run a bunch of apps, each one
   inside its own mount namespace, maybe use network namespaces for
   vlan setups, but right now we only want mount namespaces, without
   all the other complexity. We want procfs to behave more like a real
   file system and block access to inodes that belong to other users.
   'hidepid=' will not work since it is a shared mount option.

2) Containers, sandboxes and private instances of file systems - devpts
   case.
   Historically, the view of a lot of file systems inside the Linux
   kernel, when instantiated, was just a mirror of an already created
   and mounted filesystem. This was the case for the devpts filesystem;
   it seems that at that time the requirements were to optimize things
   and reuse the same memory, etc. This design used to work, but not
   anymore with today's containers, IoT, hostile environments and all
   the privacy challenges that Linux faces.

   In that regard, devpts was updated so that each new mount is a
   totally independent file system by the following patches:

   "devpts: Make each mount of devpts an independent filesystem" by
   Eric W. Biederman [3] [4]

3) Linux Security Modules have multiple ptrace paths inside some
   subsystems. However, inside procfs the implementation does not
   guarantee that the ptrace() check which triggers the
   security_ptrace_access_check() hook will always run. We have the
   'hidepid' mount option that can be used to force the
   ptrace_may_access() check inside has_pid_permissions() to run. The
   problem is that 'hidepid' is per pid namespace and not attached to
   the mount point: any remount or modification of 'hidepid' will
   propagate to all other procfs mounts.

   This also does not allow us to support the Yama LSM easily in desktop
   and user sessions. The Yama ptrace scope, which restricts ptrace and
   some other syscalls to be allowed only on inferiors, can be updated
   to have a per-task context, where the context will be inherited
   during fork(), clone() and preserved across execve(). If we support
   multiple private procfs instances, then we may force the
   ptrace_may_access() check on /proc// to always run inside those new
   procfs instances. This will allow us to specify for user sessions
   whether we should populate procfs only with pids that the user can
   ptrace. By using the Yama ptrace scope, some restricted users will
   only be able to see inferiors inside /proc; they won't even be able
   to see their other processes.
   Some software like Chromium, Firefox's crash handler, Wine and others
   are already using Yama to restrict which processes can be ptraceable.
   This change gives the possibility to restrict /proc//, but more
   importantly it will give desktop users a generic and usable way to
   specify which users should see all processes and which users cannot.

   Side notes:
   * This covers the lack of seccomp where it is not able to parse
     arguments: it is easy to install a seccomp filter on direct
     syscalls that operate on pids, however /proc// is a Linux ABI
     using filesystem syscalls. With this change all LSMs should be
     able to analyze open/read/write/close... on /proc//

4) This will allow new features to be implemented either in the kernel
   or in userspace without having to worry about procfs.
   In containers, sandboxes, etc. we have workarounds to hide some /proc
   inodes. This should be supported natively without doing extra complex
   work; the kernel should be able to support sane options that work
   with today's and future Linux use cases.

   Alexey Gladkov has on top a patch [7] that allows non-pid inodes to
   be hidden from procfs. We are improving that patch, and with the
   'newinstance' option it can be used in containers and sandboxes, as
   these are already trying to hide and block access to procfs inodes
   anyway.
   [7] https://github.com/legionus/linux/commit/993a2a5b9af95b0ac901ff41d32124b72ed676e3

Introduced changes:
-------------------

This series adds two new mount options:

* 'newinstance' mount option, it was also suggested by Andy
  Lutomirski [5]. When this option is passed we automatically create a
  private procfs instance. This is not the default behaviour since we
  do not want to break userspace and we do not want to provide
  different device IDs by default when stat()ing inodes; I am not sure
  about all the use cases there [6].

* 'pids' mount option, as discussed with Andy Lutomirski. If 'pids=' is
  passed without 'newinstance' then it has no effect.
  If 'newinstance,pids=all' is passed then processes will be shown
  inside the proc mount.
  If 'newinstance,pids=ptraceable' is passed then only ptraceable
  processes will be shown.

  This allows us to support lightweight sandboxes in Embedded Linux, and
  it also solves the case for LSMs: with this mount option we make sure
  that they have a ptrace path in procfs.

Use cases of the 'newinstance' mount option:

* We create a private procfs instance that is disconnected from the
  shared or other procfs instances.

* "hidepid": instead of changing all other mirrored procfs mounts, it
  will now work only on the new private instance.

* "gid": instead of changing all other mirrored procfs mounts, it will
  now work only on the new private instance.

* "pids=ptraceable" mount option, which takes precedence over
  "hidepid", will only work when 'newinstance' is set. Otherwise it is
  ignored.

This should later allow, after real testing, a smooth transition to a
procfs with default private instances.

How to test:

$ sudo mount -t proc -onewinstance,pids=ptraceable none /test

Note for userspace that should be documented:
If you are over-mounting /proc, then make sure you are in a new mount
namespace where propagation to master is disconnected. This will avoid
pinning that new /proc mount.

References:
-----------
[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357

# Changes since RFC v2:
*) Renamed mount options to 'newinstance' and 'pids='
   Suggested-by: Andy Lutomirski
*) Fixed order of commits, Suggested-by: Andy Lutomirski
*) Many bug fixes.

# Changes since RFC v1:
*) Removed 'unshared' mount option and replaced it with 'limit_pids'
   which is attached to the current procfs mount.
   Suggested-by: Andy Lutomirski
*) Do not fill dcache with pid entries that we can not ptrace.
*) Many bug fixes.

Djalal Harouni (7):
  [PATCH 1/7] proc: add proc_fs_info struct to store proc information
  [PATCH 2/7] proc: move /proc/{self|thread-self} dentries to proc_fs_info
  [PATCH 3/7] proc: add helpers to set and get proc hidepid and gid mount options
  [PATCH 4/7] proc: support mounting private procfs instances inside same pid namespace
  [PATCH 5/7] proc: move hidepid definitions to proc files
  [PATCH 6/7] proc: support new 'pids=all|ptraceable' mount option
  [PATCH 7/7] proc: flush dcache entries from all procfs instances

 fs/locks.c         |   6 +-
 fs/proc/base.c     | 103 ++++++++++++++++-------
 fs/proc/inode.c    |  34 ++++++--
 fs/proc/internal.h |   2 +-
 fs/proc/root.c     | 188 +++++++++++
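
For anyone trying the series, the "How to test" command and the note
about over-mounting /proc could be combined roughly as follows. This is
only a sketch under assumptions: the function name try_private_proc and
the DRY_RUN switch are made up for this illustration, the 'newinstance'
and 'pids=' options exist only on a kernel with this RFC applied, and
unshare(1) from util-linux is assumed to be available. The mount is done
inside a throwaway mount namespace with private propagation, so the new
/proc mount does not propagate back and is not pinned.

```shell
# Illustrative helper (not part of the series): over-mount /proc with a
# private procfs instance inside a new mount namespace and list what is
# visible there. The first argument selects pids=all or pids=ptraceable.
# With DRY_RUN set, the command is printed (unquoted) instead of being run.
try_private_proc() {
    opts="newinstance,pids=${1:-ptraceable}"
    # Replace the positional parameters with the command to run.
    set -- unshare --mount --propagation private \
        sh -c "mount -t proc -o $opts none /proc && ls /proc"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        sudo "$@"
    fi
}
```

Running "DRY_RUN=1 try_private_proc all" only prints the command. Note
that when the listing is done as root most processes remain ptraceable,
so the effect of pids=ptraceable is easier to see from an unprivileged
process looking at the private mount.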