Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753934Ab2HBG4G (ORCPT ); Thu, 2 Aug 2012 02:56:06 -0400
Received: from zeniv.linux.org.uk ([195.92.253.2]:56829 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752037Ab2HBG4D (ORCPT ); Thu, 2 Aug 2012 02:56:03 -0400
Date: Thu, 2 Aug 2012 07:55:57 +0100
From: Al Viro
To: "H. Peter Anvin"
Cc: Meredydd Luff , linux-kernel@vger.kernel.org, Kees Cook , Ingo Molnar , Jeff Dike , Richard Weinberger , Andrew Morton , linux-arch@vger.kernel.org
Subject: Re: [PATCH] [RFC] syscalls,x86: Add execveat() system call (v2)
Message-ID: <20120802065557.GI6481@ZenIV.linux.org.uk>
References: <1343859049-3632-1-git-send-email-meredydd@senatehouse.org> <5019B36A.4030604@zytor.com> <5019BC0E.4010109@zytor.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5019BC0E.4010109@zytor.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3570
Lines: 73

On Wed, Aug 01, 2012 at 04:30:22PM -0700, H. Peter Anvin wrote:
> On 08/01/2012 04:09 PM, Meredydd Luff wrote:
> >>> #
> >>> # x32-specific system call numbers start at 512 to avoid cache impact
> >>
> >> I think that should be common, not 64 (as should kcmp be).
> >
> > I copied the original execve, which is 64.
> >
>
> Sorry, you're right.  The argument vector needs compatibility support.
>
> This means you need an x32 version of the function -- execve
> unfortunately is one of the few system calls which require a special x32
> version (although it's a simple wrapper around sys32_execve).  See
> sys_x32_execve.

I *really* strongly object to doing that thing before we sanitize the
situation with sys_execve().  As it is, the damn thing is defined separately
on each architecture, with spectacularly ugly kludges used in these
implementations.
Adding a parallel pile of kludges (and due to their nature, they'll need
to be changed in non-trivial ways in a lot of cases) is simply wrong.

The thing is, there's essentially no reason to have more than one
implementation.  What they are (badly) doing is "we need to find pt_regs to
pass to do_execve(), the thing we are after has to be near our stack frame,
so let's try to get to it that way".  With a really ugly set of kludges
trying to do just that.

What we should use instead is task_pt_regs(); maybe introduce
current_pt_regs(), defaulting to task_pt_regs(current) and letting
architectures that can do it better (on some it's simply available in a
dedicated register, on some it's better to work from current_thread_info(),
etc.) override the default.

With that we have a fairly good chance to merge most of those guys; probably
not all of them, due to e.g. mips weirdness, but enough to make it worth
doing.

The obstacle is in lazy kernel_execve() implementations; ones that simply
issue a trap/whatever is used to enter the system call.  Directly from
kernel space.  It doesn't have to be done that way; see what e.g. arm does
there.  Note that doing it without a syscall instruction avoids another
headache; namely, we don't have to worry about returning from *failed*
execve (i.e. return to kernel mode) through the codepath that is normally
taken only when returning to userland.

FWIW, I would try to pull the asm tail of arm kernel_execve() into something
that would look to the C side as
	ret_from_kernel_exec(&regs);	/* never returns */
and start converting architectures to that primitive.  It should copy the
provided pt_regs to the normal location (keeping in mind that there really
might be an overlap), set registers (including stack pointer) for the normal
return to user path and jump there.  Essentially, that's the real
arch-dependent part of kernel_execve() - transition from kernel thread to
userland process.
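
To make the two pieces above concrete, here is a rough userspace model, not
kernel code.  Everything in it is an assumption made for illustration: the
struct pt_regs and struct task_struct layouts are toys, the "kernel stack" is
just an array, and copy_regs_to_normal_location() is a hypothetical name for
only the copy step of ret_from_kernel_exec().  It shows current_pt_regs()
falling back to task_pt_regs(current) unless something has already defined
it, and the copy done with memmove() because source and destination may
overlap.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in; the real struct pt_regs is per-architecture. */
struct pt_regs {
	unsigned long ip;
	unsigned long sp;
	unsigned long args[4];
};

#define FAKE_STACK_WORDS 64

/* Toy task: just a fake kernel stack with pt_regs kept at its top. */
struct task_struct {
	unsigned long stack[FAKE_STACK_WORDS];
};

static struct task_struct fake_task;
#define current (&fake_task)

/* Generic helper: the "normal location" of the saved user registers. */
static struct pt_regs *task_pt_regs(struct task_struct *t)
{
	unsigned long *top = t->stack + FAKE_STACK_WORDS;

	return (struct pt_regs *)(top - sizeof(struct pt_regs) / sizeof(*top));
}

/* Default; an architecture with a cheaper way would define its own first. */
#ifndef current_pt_regs
#define current_pt_regs() task_pt_regs(current)
#endif

/*
 * Hypothetical helper: only the copy step of ret_from_kernel_exec().
 * memmove() rather than memcpy(), since regs may overlap the destination.
 */
static struct pt_regs *copy_regs_to_normal_location(const struct pt_regs *regs)
{
	struct pt_regs *dst = current_pt_regs();

	memmove(dst, regs, sizeof(*dst));
	return dst;
}
```

The real primitive would go on to load the stack pointer and jump to the
return-to-user path, which cannot be shown in portable C.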

It can be done architecture-by-architecture; there's no need to make it
a flagday conversion.  Once an arch is handled, we define something like
__ARCH_HAS_RET_FROM_KERNEL_EXEC and get the common implementations of
kernel_execve() and sys_execve() for that - those could simply live in
fs/exec.c under the matching ifdef.  Along with your sys_execveat().

I can probably throw alpha, arm and x86 conversions into the pile, but it
really needs to be handled on linux-arch, with arch maintainers at least
agreeing in principle with that scheme.
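
As a rough picture of what the ifdef-gated common code might look like, here
is a userspace mock, again a sketch under stated assumptions rather than a
proposed patch.  The struct pt_regs layout, the do_execve() stub, the
made-up entry addresses, the user_regs_slot variable and the
returned_to_user flag are all invented for the example; the real
ret_from_kernel_exec() would never return, while the mock only records that
it was reached.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in; the real struct pt_regs is per-architecture. */
struct pt_regs {
	unsigned long ip;
	unsigned long sp;
};

/* Assumption: a converted architecture would define this. */
#define __ARCH_HAS_RET_FROM_KERNEL_EXEC

/* Model of the "normal location" of the saved user registers. */
static struct pt_regs user_regs_slot;
#define current_pt_regs() (&user_regs_slot)

/* Set by the mock below so a caller can observe the hand-off. */
static int returned_to_user;

/* Toy do_execve(): fails on an empty name, else fills in entry state. */
static int do_execve(const char *filename, const char *const argv[],
		     const char *const envp[], struct pt_regs *regs)
{
	(void)argv;
	(void)envp;
	if (!filename || !*filename)
		return -ENOENT;
	regs->ip = 0x400000UL;	/* made-up entry point */
	regs->sp = 0x7ff000UL;	/* made-up user stack */
	return 0;
}

/* Mock of the arch primitive: copies regs and records the hand-off. */
static void ret_from_kernel_exec(struct pt_regs *regs)
{
	memmove(current_pt_regs(), regs, sizeof(*regs));
	returned_to_user = 1;
}

#ifdef __ARCH_HAS_RET_FROM_KERNEL_EXEC
/* Common syscall entry: no per-arch hunting for pt_regs. */
static long sys_execve(const char *filename, const char *const argv[],
		       const char *const envp[])
{
	return do_execve(filename, argv, envp, current_pt_regs());
}

/* Common in-kernel exec: a plain call, no trap from kernel space. */
static int kernel_execve(const char *filename, const char *const argv[],
			 const char *const envp[])
{
	struct pt_regs regs;
	int ret;

	memset(&regs, 0, sizeof(regs));
	ret = do_execve(filename, argv, envp, &regs);
	if (ret < 0)
		return ret;	/* failed exec: ordinary return to the caller */
	ret_from_kernel_exec(&regs);
	return 0;		/* reached only in this mock */
}
#endif
```

The point of the shape is that a failed exec returns to its kernel caller
like any other function call, and only the success path touches the
arch-specific hand-off.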