Subject: Re: execve(NULL, argv, envp) for nommu?
To: Oleg Nesterov
Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias@libc.org,
 "linux-kernel@vger.kernel.org"
References: <324c00d9-06a6-1fc5-83fe-5bd36d874501@landley.net>
 <20170905142436.262ed118@alans-desktop>
 <20170911151526.GA4126@redhat.com>
From: Rob Landley
Date: Tue, 12 Sep 2017 05:48:47 -0500
In-Reply-To: <20170911151526.GA4126@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
>
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
>
>	dentry_open(current->mm->exe_file->path, O_PATH)
>
> and returns fd make more sense.
>
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose
way of achieving the same outcome. (Of course now you've got a
filehandle you can read xattrs and such through from otherwise jailed
contexts letting you do things you couldn't necessarily do before, but I
assume you know the security implications of that more than I do.
I tried to suggest something that _didn't_ create new capabilities, just
let nommu do a thing that mmu could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
>
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses
are physical addresses. This means two processes can't see different
things at the same address: either they see the same thing or one of
them can't see that address (due to a range register masking it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a
page, the old page gets copied by the fault handler. The problem isn't
the copies (that's just slow), the problem is two processes seeing
different things at the same address. That requires an MMU with a TLB
loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have
different addresses. But any pointers you copied will point to the _old_
addresses. Finding and adjusting all those pointers to point to the new
addresses instead is basically the same problem as doing garbage
collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about
them. A long variable containing a value that _could_ point into one of
these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
it is breakage. Tracking them all down and fixing up just the right ones
without missing any or changing data you shouldn't is REALLY HARD.

The vfork() system call is what you use on nommu instead: it creates a
child process that uses its parent's memory mappings.
The parent process is stopped until the child calls _exit() or exec(),
either of which means it stops using those mappings and the parent can
go back to using them without the two stomping on each other.

(Usually they even share the same stack, so the child shouldn't return
from the function that called vfork() or it'll corrupt the stack for the
parent process. And be careful about changing local variables, the
parent might see the changes when it resumes. Some vfork()
implementations provide a small new stack, ala signal handlers or kernel
interrupts, so you can't guarantee your parent will see your local
variable changes, but you still can't return from the function that
called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for
there to be two independent processes running at the same time. Until
then, the parent is stopped.

The real problem with implementing full fork() isn't the expense of
copying the data (although if you fork and exec from a mozilla style pig
process, you could copy hundreds of megabytes of data and then
immediately discard it again; that's why fork() doesn't usually do that;
oh and on nommu systems you need _contiguous_ memory blocks for the data
because it can't collect disparate pages together into a longer mapping,
so this is actually a largeish real-world issue on those systems, not
merely slow and expensive). The hard problem is translating the pointers
so the new mapping doesn't read/write objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from
the child wouldn't unblock the parent. Your mappings are still
overcommitted, only _exit() or execve() releases the child process's use
of those mappings.
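
To make the vfork() rules above concrete, here's a minimal sketch of the
usual pattern. The helper name spawn_and_wait() is made up for this
example, and exiting with 127 when exec fails is just the shell's
convention, not anything vfork() requires. The child touches nothing
except the exec() and _exit() calls, and never returns from the function
that called vfork():

```c
#define _DEFAULT_SOURCE         /* ask glibc to declare vfork() */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run "path" with "argv" in a vfork()ed child and
 * return its exit status, or -1 if it could not be started or waited for.
 */
int spawn_and_wait(const char *path, char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;              /* vfork() itself failed */

    if (pid == 0) {
        /*
         * Child: still running on the parent's mappings (and usually
         * its stack), so do nothing here except exec or _exit.
         */
        execv(path, argv);
        _exit(127);             /* exec failed; assumed shell-style code */
    }

    /* Parent: only resumes once the child has called exec or _exit. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with "/bin/sh" and an argv of {"sh", "-c", "exit 3", NULL} this
should return 3. Note the failure path uses _exit() rather than exit(),
so the child doesn't run atexit handlers or flush stdio buffers that
live in the parent's memory.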

You can create threads on nommu because they're designed to share the
same mappings. In that case you're guaranteed a new stack, and not
stomping the parent's data is your problem. But if you exec() from a
thread, posix says it kills all the other threads:

  http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork but add concurrency"
territory. Your threads don't have their own independent mappings,
they're sharing and stomping each other's data unless you add locking
and write your program to know about the other threads.

To get two independent process contexts running the same executable but
with different mappings (I.E. the goal we started with), you still need
the child to exec. And the start of this thread was "exec what"?

> Oleg.

Rob