Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751295AbdCZHNo (ORCPT ); Sun, 26 Mar 2017 03:13:44 -0400 Received: from mail-qt0-f194.google.com ([209.85.216.194]:33039 "EHLO mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751142AbdCZHNm (ORCPT ); Sun, 26 Mar 2017 03:13:42 -0400 MIME-Version: 1.0 In-Reply-To: <20170323220712.GH4554@comp-core-i7-2640m-0182e6> References: <20170312021257.GP29622@ZenIV.linux.org.uk> <20170320125855.GG4554@comp-core-i7-2640m-0182e6> <20170323220712.GH4554@comp-core-i7-2640m-0182e6> From: Djalal Harouni Date: Sun, 26 Mar 2017 09:03:33 +0200 Message-ID: Subject: Re: [RFC] Add option to mount only a pids subset To: Alexey Gladkov Cc: Linux Kernel Mailing List , Linux API , "Kirill A. Shutemov" , Vasiliy Kulikov , Al Viro , "Eric W. Biederman" , Oleg Nesterov , Pavel Emelyanov , James Bottomley , "Dmitry V. Levin" , Jann Horn , Kees Cook , Andy Lutomirski , Miklos Szeredi Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3650 Lines: 98 (Cc'ed Kees and Jann for the procfs stacking issue) On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov wrote: > On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote: >> Hi Alexey, >> >> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov >> wrote: >> > >> > >> > Al Viro, this patch looks better ? >> > >> > == Overview == >> > >> > Some of the container virtualization systems are mounted /proc inside >> > the container. This is done in most cases to operate with information >> > about the processes. Knowing that /proc filesystem is not fully >> > virtualized they are mounted on top of dangerous places empty files or >> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). >> > >> > The structure of this filesystem is dynamic and any module can create a >> > new object which will not necessarily be virtualized. There are >> > proprietary modules that aren't in the mainline whose work we can not >> > verify. >> > >> > This opens up a potential threat to the system. The developers of the >> > virtualization system can't predict all dangerous places in /proc by >> > definition. >> > >> > A more effective solution would be to mount into the container only what >> > is necessary and ignore the rest. >> > >> > Right now there is the opportunity to pass in the container any port of >> > the /proc filesystem using mount --bind expect the pids. >> > >> > This patch allows to mount only the part of /proc related to pids without >> > rest objects. Since this is an option for /proc, flags applied to /proc >> > have an effect on this subset of filesystem. >> >> I just sent a patch that also has to deal with proc hidepid here: >> https://lkml.org/lkml/2017/3/23/505 > > I completely agree with you that it looks wrong when options of one > mountpoint affect the others mountpoints. > >> I'm not sure if that's the right approach, it is still buggy, however >> seems that your patch also stores the mount option inside the >> pid_namespace which may get propagated to all mounts inside same pidns >> ? > > I don't store 'pidonly' option in my current patch. I mean in: > https://lkml.org/lkml/2017/3/20/363 > > I parse options twice at first mount of procfs. It happens before > the mounting /proc in userspace. > > I know it's bad, but I couldn't find place to store this option. Ok, then maybe that approach of having a procfs struct holder instead of using pid namespace may help! >> I didn't have enough time but maybe if they are related we can work it >> out together ? > > I don't have enough experience in kenrel hacking, but I would happily do > my best :) OK, let's sync on this then. > I also tring to mention it in every patch, as my changes almost completely > useless without the ability to use the overlayfs. > > Now if you remove the restriction: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497 This protection was introduced in e54ad7f1ee263 That's a security fix, ecryptfs seems to schedule some code through kthreads which always run in the initial namespaces with full capabilities, bypassing all security checks. I don't know if overlayfs has something similar, if not then maybe if it's possible in ecryptfs to check the lower mount and the fs type and move this procfs there... Cc'ed Jann in case he may comment. > and mount procfs as lowerdir in overlayfs then you get NULL pointer > dereference at: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891 > > I got it when I tried to do `ls -la /overlay/proc/self/`. > > -- > Rgrds, legion