Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933631AbXFGUMa (ORCPT ); Thu, 7 Jun 2007 16:12:30 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761368AbXFGUMA (ORCPT ); Thu, 7 Jun 2007 16:12:00 -0400 Received: from smtp2.linux-foundation.org ([207.189.120.14]:59273 "EHLO smtp2.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752341AbXFGUL7 (ORCPT ); Thu, 7 Jun 2007 16:11:59 -0400 Date: Thu, 7 Jun 2007 13:10:23 -0700 (PDT) From: Linus Torvalds To: Alan Cox cc: Davide Libenzi , Linux Kernel Mailing List , Andrew Morton , Ulrich Drepper , Ingo Molnar , Eric Dumazet Subject: Re: [patch 7/8] fdmap v2 - implement sys_socket2 In-Reply-To: <20070606235906.72439d16@the-village.bc.nu> Message-ID: References: <20070606235906.72439d16@the-village.bc.nu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4436 Lines: 108 On Wed, 6 Jun 2007, Alan Cox wrote: > > This still all seems really really ugly. I do agree that it's ugly. That many new system calls with new prototypes and new glibc support is just nasty. So I don't think this is viable. > Is there anything wrong with throwing all these extra cases out and > replacing the entire lot with > > prctl(PR_SPARSEFD, 1); > > to turn on sparse fd allocation for a process ? Yes. We really don't want to set global state that affects any random library thing that runs after it. HOWEVER. I think we could introduce a *single* new system call, which does basically a "run the specified system call with the following flags". The flags would literally be local to that *one* system call, and one of the flags could be the semantics for FD allocation. [ There are a few other cases where such an indirect system call might be interesting: temporarily unmasking a signal for just the duration of a single system call is the reason for things like 'pselect()' and 'sigtimedwait()', and similarly the 'access()' system call is basically a "temporarily run with my real UID, rather than the effective UID thing, and quite frankly, it might be perfectly valid to want to do an 'open()' with that rule too, because "access()+open()" is racy! ] So maybe the proper solution to this mess is *not* to add fifteen new system calls, but to add *one*, which takes a "flags" value to set certain things: - FD_NONSEQ: "allocate any new fd's nonsequentially" - FD_CLOEXEC: "allocate any new fd's as close-on-exec" Rationale: allow people to open any fd with the flags set a certain way, regardless of the system call. - LOOKUP_REALUID/GID: "make the fsuid/fsgid temporarily be my _real_ uid/gid for this single system call" Rationale: avoid the inevitable races that the fundamentally broken "access()" system call has! - LOOKUP_NOFOLLOW: "do not follow any symlink at the end of the path" LOOKUP_NOATIME: "don't update atime" Rationale: "open()" already has O_NOFOLLOW/O_NOATIME, and "stat()" has "lstat()", but a lot of other path-handlign system calls cannot do the same thing. - LOOKUP_NOSYMLINKS: "do not allow any symlink traversal at *all*" LOOKUP_NODOTDOT: "don't traverse a .. upwards" LOOKUP_NOMOUNT: "don't traverse a mount point" Rationale: for security-conscious things, quite often it's not the _last_ symlink you want to avoid, it's any symlinks at all, and sometimes it's things like guaranteeing that you stay in a certain directory structure - which means not going outside with ".." or some magic mount-point. People currently literally end up traversing things one path component at a time, doing a "lstat()" on it, and checking. Even if 99% of the time you probably don't actually ever hit the problem case. (Eg Apache at some point used to do something like this if you asked for security, I'm not sure if it still does). - signal mask for temporarily blocking/unblocking during a single system call. - something else? The above are things that I know I _personally_ have occasionally cursed not having had. What do people think about that kind of approach? It has the advantage that it does *not* involve multiple kernel entries (just a single entry to a small wrapper that sets some process state temporarily), and that it doesn't have any sticky state that might confuse a library (or a signal handler: even if you end up doing "prctrl(ON) ; syscall(); prctrl(OFF)", a signal handler that happens in between the prctrl's would see unexpected behaviour). It has the disadvantage that it would need some per-architecture setup to load the actual real arguments from memory: the system call would probably look something like syscall_indirect(unsigned long flags, sigset_t *, int syscall, unsigned long args[6]); and the rule would be that it would just load the six system call registers from that "args[]" array. Always load the full six registers, to make it simpler and faster, and not having any confusion or ever needing any wrappers that depend on the number of system calls. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/