Date: Sun, 19 Oct 2014 21:20:34 +0100
From: Al Viro
To: Andy Lutomirski
Cc: David Drysdale, "Eric W. Biederman", Meredydd Luff, "linux-kernel@vger.kernel.org", Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
Subject: Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
Message-ID: <20141019202034.GH7996@ZenIV.linux.org.uk>
References: <1401975635-6162-1-git-send-email-drysdale@google.com>

On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:

> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh.  Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered.  If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.

The question I hadn't seen really answered through all of that was how to
deal with #!...  "Just use d_path()" isn't particularly appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.

Frankly, I wonder if it would make sense to provide something like dupfs.
We can't mount it by default on /dev/fd (more's the pity), but it might be
a good thing to have.

What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it.  Directory contents depend on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there.  One series is 0, 1, ... - opening one of those gives
dup().  IOW, it's *not* giving you a new struct file; it gives you a new
reference to the existing one, complete with sharing IO position, etc.
Another is 0ctl, 1ctl, ... - those are read-only and reading from them gives
pretty much a combination of our /proc/self/fdinfo/n with readlink of
/proc/self/fd/n.

It's actually a better match for what one would expect at /dev/fd than what
we do.  Example:

	; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
	; (echo 'line 1';echo 'line 2') >a
	; cat a|sh a.sh
	line 2
	The first line was line 1
	; sh a.sh

[...] atomic_open() calling conventions.  Instead of having it return 0 or
error, let's switch to returning NULL, ERR_PTR(error) *or* an extra
reference to a preexisting struct file.  Same as we did for ->lookup(), and
for a similar reason.

Right now we have 8 instances of ->atomic_open() and one place calling that
method.  Changing the API like that would be trivial (and it's a trivial
conversion - replace return ret; with return ERR_PTR(ret); through all
instances, so any out-of-tree filesystems could follow easily).  We
certainly can't do anything of that sort with ->open() - there would be
thousands of instances to convert.  ->atomic_open(), OTOH, is still new
enough for that to be feasible.

What we get from that conversion is an ability to do dup-style semantics
easily.

	* give root directory an ->atomic_open() instance that would be
	  handling opens.
	* make lookups in there fail with ENOENT if you don't have such a
	  descriptor at the moment.  Otherwise bind all of them to the same
	  inode.
	  The only method it needs is ->getattr(), and that would look into
	  your descriptor table for descriptor with number derived from
	  dentry (stashed in ->d_fsdata at lookup time) and do what fstat()
	  would.
	* have those dentries always fail ->d_revalidate(), to force
	  everything towards ->atomic_open().
	* for ...ctl names, ->atomic_open() would act in normal fashion;
	  again, only one inode is needed.  ->read() would pick descriptor
	  number from ->d_fsdata and report on whatever you have with that
	  number at the time.

I'll try to put a prototype of that together; I think it's at least
interesting to try.  And that ought to be safe to mount even in very
restricted environments, making arguments along the lines of "but we can't
get the path by opened file without the big bad wol^Wprocfs and we can't
have that in our environment" much weaker...

Comments?
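
To make the dup()-versus-new-struct-file distinction above concrete, here is
a minimal userspace sketch in C.  It is not dupfs and not part of any
proposed patch; make_sample() and offset_seen() are hypothetical helper
names, and the temporary path under /tmp is an assumption.  It reads the
first line of a two-line file through one descriptor, then checks where a
second descriptor is positioned: one obtained with dup() shares the IO
position, one obtained by opening the pathname again starts from 0.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a temp file holding "line 1\nline 2\n" and
 * return a descriptor positioned at offset 0.  The generated name is
 * written back into path_template by mkstemp(). */
static int make_sample(char *path_template)
{
	static const char data[] = "line 1\nline 2\n";
	const ssize_t len = (ssize_t)(sizeof(data) - 1);
	int fd = mkstemp(path_template);

	if (fd < 0)
		return -1;
	if (write(fd, data, (size_t)len) != len ||
	    lseek(fd, 0, SEEK_SET) < 0) {
		close(fd);
		unlink(path_template);
		return -1;
	}
	return fd;
}

/* Consume the first line (7 bytes) through one descriptor, then report the
 * file offset seen through a second descriptor.  With use_dup != 0 the
 * second descriptor comes from dup() and refers to the same open file;
 * otherwise it comes from a fresh open() of the same pathname.
 * Returns -1 on any failure. */
static off_t offset_seen(int use_dup)
{
	char path[] = "/tmp/dupfs-demo-XXXXXX";
	char buf[7];			/* strlen("line 1\n") == 7 */
	off_t pos = -1;
	int fd = make_sample(path);

	if (fd < 0)
		return -1;
	if (read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
		int fd2 = use_dup ? dup(fd) : open(path, O_RDONLY);

		if (fd2 >= 0) {
			pos = lseek(fd2, 0, SEEK_CUR);
			close(fd2);
		}
	}
	close(fd);
	unlink(path);
	return pos;
}
```

Calling offset_seen(1) and offset_seen(0) from a small main() shows the two
behaviours side by side.  The dup() case corresponds to what opening one of
the 0, 1, ... names is described as doing; the open() case is what you get
from any interface that hands back a new struct file for a regular file.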