Right now, Linux isn't all that friendly to JIT emulators.
Here are the problems and suggestions to improve the situation.
There is an SE Linux execmem restriction that enforces W^X.
Assuming you don't wish to just disable SE Linux, there are
two ugly ways around the problem. You can mmap a file twice,
or you can abuse SysV shared memory. The mmap method requires
that you know of a filesystem mounted rw,exec where you can
write a very large temporary file. This arbitrary filesystem,
rather than swap space, will be the backing store. The SysV
shared memory method requires an undocumented flag and is
subject to some annoying size limits. Both methods create
objects that will fail to be deleted if the program dies
before marking the objects for deletion.
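A minimal sketch of the "mmap a file twice" workaround described above, in C. The /tmp/jit-XXXXXX template and the name map_twice are assumptions made for illustration; a real JIT would have to find a filesystem mounted rw,exec by itself. Unlinking right after creation narrows the window in which a crash leaves debris, but does not close it.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one unlinked temporary file twice: *rw is read/write, *alias gets
 * `alias_prot` (PROT_READ | PROT_EXEC for a JIT code view).
 * Returns 0 on success, -1 on failure. */
int map_twice(size_t len, int alias_prot, void **rw, void **alias)
{
    char path[] = "/tmp/jit-XXXXXX";   /* assumed location, not from the thread */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* dying before this line leaks the file */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *alias = mmap(NULL, len, alias_prot, MAP_SHARED, fd, 0);
    close(fd);                         /* the mappings keep the file alive */
    if (*rw == MAP_FAILED || *alias == MAP_FAILED) {
        if (*rw != MAP_FAILED)
            munmap(*rw, len);
        if (*alias != MAP_FAILED)
            munmap(*alias, len);
        return -1;
    }
    return 0;
}
```

A JIT would pass PROT_READ | PROT_EXEC as alias_prot, write generated code through *rw, and jump to the same offset in *alias. Whether the executable mapping is allowed still depends on the mount options and on the SE Linux policy in force.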
Processors often have annoying limits on the immediate values
in instructions. An x86 or x86_64 JIT can go a bit faster if
all allocations are kept to the low 2 GB of address space.
There are also reasons for a 32bit-to-x86_64 JIT to choose
a nearly arbitrary 2 GB region that lies above 4 GB.
Other archs have other limits, such as 32 MB or 256 MB.
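To illustrate the x86_64 case: near call and jmp instructions carry a signed 32-bit displacement measured from the end of the instruction, so a target is reachable directly only if it lies within about 2 GB. A small sketch of that check follows; the function name fits_rel32 is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if a rel32 branch whose following instruction starts at
 * `next_insn` can reach `target`, i.e. if the displacement fits in a
 * signed 32-bit field; return 0 otherwise. */
int fits_rel32(uint64_t next_insn, uint64_t target)
{
    /* The subtraction wraps modulo 2^64; reading it as signed gives the
     * distance with its direction. */
    int64_t disp = (int64_t)(target - next_insn);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

When code and data all sit in one 2 GB window, every such check succeeds and the JIT never needs the longer indirect sequences.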
Sometimes it is very helpful to have the read/write mapping
be a fixed offset from the read/exec mapping. A power of 2
can be especially desirable.
Emulators often need a cheap way to change page permissions.
One VMA per page is no good. Besides taking up space and making
many things generally slower, having one VMA per page causes
a huge performance loss for snapshot roll-back operations.
Just tearing down all those VMAs takes a good while.
Additions to better support JIT emulators:
a. sysctl to set IPC_RMID by default
b. shmget() flag to set IPC_RMID by default
c. open() flag to unlink a file before returning the fd
d. mremap() flag to always keep the old mapping
e. mremap() flag to get a read/write mapping of a read/exec one
f. mremap() flag to get a read/exec mapping of a read/write one
g. mremap() flag to make the 5th arg (new addr) be the upper limit
h. 6-bit wide mremap() "flag" to set the upper limit above given base
i. support the prot argument to remap_file_pages
j. a documented way (madvise?) to punch same-VMA zero-page holes
Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
>
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.
>
> Processors often have annoying limits on the immediate values
> in instructions. An x86 or x86_64 JIT can go a bit faster if
> all allocations are kept to the low 2 GB of address space.
> There are also reasons for a 32bit-to-x86_64 JIT to choose
> a nearly arbitrary 2 GB region that lies above 4 GB.
> Other archs have other limits, such as 32 MB or 256 MB.
>
> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.
>
> Emulators often need a cheap way to change page permissions.
> One VMA per page is no good. Besides taking up space and making
> many things generally slower, having one VMA per page causes
> a huge performance loss for snapshot roll-back operations.
> Just tearing down all those VMAs takes a good while.
>
> Additions to better support JIT emulators:
>
> a. sysctl to set IPC_RMID by default
Not very good, this will break some apps.
> b. shmget() flag to set IPC_RMID by default
This is better :)
> c. open() flag to unlink a file before returning the fd
Well, I assume you would like fd = open("/path/somefile", O_RDWR | O_CREAT |
O_UNLINK, 0644)
(i.e. allocate a file handle but no name?)
Quite difficult to implement this atomically with the current VFS; maybe a new
syscall would be better. (Linus will kill me for that :) )
(We don't need to insert "somefile" in a directory and then unlink it; we only
need to allocate an unnamed inode to get some backing store.)
This is a generalization of anonymous inodes (fs/anon_inodes.c).
Alan Cox wrote:
> There is an SE Linux execmem restriction that enforces W^X.
This depends on whatever SELinux rulesets you are running. It's just a
good rule to have present that most programs shouldn't be self patching,
and then label those that do differently.
> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.
mmap MAP_FIXED can do this but you need to know a lot about the memory
layout of the system so it gets a bit platform specific.
> Emulators often need a cheap way to change page permissions.
mprotect(, range) rather than a page at a time. The kernel will do
merging.
> a. sysctl to set IPC_RMID by default
> b. shmget() flag to set IPC_RMID by default
Use POSIX shared memory
> c. open() flag to unlink a file before returning the fd
Is it really that costly to create a blank file? Why do you need to do it
a lot in a JIT?
> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one
> g. mremap() flag to make the 5th arg (new addr) be the upper limit
This is all mprotect and munmap.
> h. 6-bit wide mremap() "flag" to set the upper limit above given base
> i. support the prot argument to remap_file_pages
> j. a documented way (madvise?) to punch same-VMA zero-page holes
mmap (although you get more VMAs from that) so memset() is probably
genuinely cheaper if the permissions are not changing.
On Fri, 2007-06-08 at 12:10 +0100, Alan Cox wrote:
> > e. mremap() flag to get a read/write mapping of a read/exec one
> > f. mremap() flag to get a read/exec mapping of a read/write one
> > g. mremap() flag to make the 5th arg (new addr) be the upper limit
>
> This is all mprotect and munmap.
I think he's asking for a way to copy an existing mapping, which does
sound genuinely useful. (i.e. mremap(ptr, size, size, MREMAP_COPY), with
no need to mess with files to get multiple mappings of the same region)
--
Nicholas Miell
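For shared mappings, something close to the MREMAP_COPY idea can be sketched with the existing call: the mremap(2) man page notes that an old_size of zero on a shareable (MAP_SHARED) mapping creates a new mapping of the same pages. The helper name alias_shared is made up here, and this does not work for private anonymous memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Create a second mapping of a MAP_SHARED region by passing old_size == 0
 * to mremap(). Returns the alias, or MAP_FAILED on error. */
void *alias_shared(void *addr, size_t len)
{
    return mremap(addr, 0, len, MREMAP_MAYMOVE);
}
```

The alias starts with the same protections as the original, so on its own it does not provide the "born executable" or "born writable" mapping requested in items e and f.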
On 6/8/07, Eric Dumazet wrote:
> Albert Cahalan wrote:
> > Additions to better support JIT emulators:
> >
> > a. sysctl to set IPC_RMID by default
>
> Not very good, this will break some apps.
As a sysctl, the admin gets to choose between
compatibility and sanity.
I can see such a sysctl also being really helpful for a
shared computer used for an Operating Systems or
System Programming course.
> > b. shmget() flag to set IPC_RMID by default
>
> This is better :)
Both are good. This one requires that all apps using
SysV shared memory be modified to use the flag.
The other requires that a very few apps be modified
to tolerate a behavior change.
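For comparison, here is a sketch of what a program can do without either change: attach the segment and mark it for removal immediately afterwards. The function name is made up for illustration. The gap between shmget() and shmctl() is exactly the window in which a dying program leaves the segment behind.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Allocate a private SysV segment, attach it, and mark it for removal
 * right away. Returns the attached address, or NULL on failure. */
void *shm_alloc_self_cleaning(size_t len)
{
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;
    void *p = shmat(id, NULL, 0);   /* dying before the next call leaks the segment */
    shmctl(id, IPC_RMID, NULL);     /* destroyed once the last attachment goes away */
    return p == (void *)-1 ? NULL : p;
}
```

A flag on shmget(), as in item b, would fold the last step into creation and remove that window.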
> > c. open() flag to unlink a file before returning the fd
>
>
> Well, I assume you would like fd = open("/path/somefile", O_RDWR | O_CREAT |
> O_UNLINK, 0644)
>
> (i.e. allocate a file handle but no name?)
Yes.
> Quite difficult to implement this atomically with the current VFS; maybe a new
> syscall would be better. (Linus will kill me for that :) )
>
> (We don't need to insert "somefile" in a directory and then unlink it; we only
> need to allocate an unnamed inode to get some backing store.)
I suspect that SMB/CIFS has a native call for this. There is
some sort of tmpfile flag defined over in that world.
On 6/8/07, Alan Cox wrote:
> > There is an SE Linux execmem restriction that enforces W^X.
>
> This depends on whatever SELinux rulesets you are running. It's just a
> good rule to have present that most programs shouldn't be self patching,
> and then label those that do differently.
A marking in the executable would have made more sense.
It is really broken having an unprivileged user being able to
create whole new executables but unable to lift this restriction
on those executables.
In any case, the restriction is common and troublesome.
> > Sometimes it is very helpful to have the read/write mapping
> > be a fixed offset from the read/exec mapping. A power of 2
> > can be especially desirable.
>
> mmap MAP_FIXED can do this but you need to know a lot about the memory
> layout of the system so it gets a bit platform specific.
Yes. There are unportable programs, and UNPORTABLE ones.
Memory layout can vary between vendor kernels, between normal
and 32-on-64 situations, between two different C libraries...
> > Emulators often need a cheap way to change page permissions.
>
> mprotect(, range) rather than a page at a time. The kernel will do
> merging.
Nope. This can happen rapidly and repeatedly to pages
that are essentially random. The median length of a range
will be a page or two. Merging won't do very much at all.
> > a. sysctl to set IPC_RMID by default
> > b. shmget() flag to set IPC_RMID by default
>
> Use POSIX shared memory
That appears to have the exact same problem.
> > c. open() flag to unlink a file before returning the fd
>
> Is it really that costly to create a blank file? Why do you need to do it
> a lot in a JIT?
This part isn't about cost. It's about not leaving around
debris when the JIT crashes.
> > e. mremap() flag to get a read/write mapping of a read/exec one
> > f. mremap() flag to get a read/exec mapping of a read/write one
> > g. mremap() flag to make the 5th arg (new addr) be the upper limit
>
> This is all mprotect and munmap.
That won't get me a second mapping. Supposing that I had
a second mapping, SE Linux would deny the mprotect.
I'm looking for a mapping that is born executable or a mapping
that is born writable, as needed, so that no transition is needed.
> > h. 6-bit wide mremap() "flag" to set the upper limit above given base
> > i. support the prot argument to remap_file_pages
> > j. a documented way (madvise?) to punch same-VMA zero-page holes
>
> mmap (although you get more VMAs from that) so memset() is probably
> genuinely cheaper if the permissions are not changing.
Well cost is the problem here. I sure can find some way to
get the operation done, but it isn't cheap. For some usages,
the current setup is costly enough that one must consider
abandoning the hardware MMU in favor of a software one
emitted as part of the JIT. :-(
Albert Cahalan wrote:
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem.
This should be fixed in SELinux, or more accurately the SELinux profile.
There is absolutely no other sane option.
Of course, you generally don't need a page to be writable and executable
at the same time, but the overhead of switching can be enormous.
-hpa
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.
If the policy forbidding self-modifying code lacks a method of
exempting programs such as JIT interpreters (which I doubt) then
it's a problem. I'm with Alan on this one.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Processors often have annoying limits on the immediate values
> in instructions. An x86 or x86_64 JIT can go a bit faster if
> all allocations are kept to the low 2 GB of address space.
> There are also reasons for a 32bit-to-x86_64 JIT to choose
> a nearly arbitrary 2 GB region that lies above 4 GB.
> Other archs have other limits, such as 32 MB or 256 MB.
This sort of logic might be appropriate for a sort of parametrized
and specialized vma allocator setting the policy in /proc/ along
with various sorts of limits. There are limits to such and at some
point things will have to manually manage their own process address
spaces in a platform-specific fashion. If kernel assistance here is
rejected they may have to do so in all cases.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.
As far as the kernel is concerned they're unrelated, so this will
likely need MAP_FIXED barring a staggering array of fresh system
calls to act on tuples of memory ranges in lockstep.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Emulators often need a cheap way to change page permissions.
> One VMA per page is no good. Besides taking up space and making
> many things generally slower, having one VMA per page causes
> a huge performance loss for snapshot roll-back operations.
> Just tearing down all those VMAs takes a good while.
remap_file_pages_prot() is reputedly waiting in the wings somewhere
for this.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Additions to better support JIT emulators:
> a. sysctl to set IPC_RMID by default
This is a bad idea. The standard semantics are needed for programs
relying upon them.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> b. shmget() flag to set IPC_RMID by default
This is relatively innocuous.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> c. open() flag to unlink a file before returning the fd
You probably want a tmpfile(3) -like affair which never has a pathname
to begin with. It could be useful for security purposes more generally.
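A userspace approximation is tmpfile(3) itself. The sketch below, whose helper names are illustrative and not from the thread, obtains a descriptor for a file with no remaining pathname and reports its link count. It does not let the caller choose the filesystem, so it does not help with the rw,exec mount requirement.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return a descriptor for a file that has no remaining pathname, or -1. */
int unnamed_fd(void)
{
    FILE *f = tmpfile();       /* created and unlinked by the C library */
    if (f == NULL)
        return -1;
    int fd = dup(fileno(f));   /* keep a descriptor past fclose() */
    fclose(f);
    return fd;
}

/* Link count of an open file; 0 means no directory entry refers to it. */
long link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}
```

The descriptor can then be sized with ftruncate() and passed to mmap() like any other file.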
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> d. mremap() flag to always keep the old mapping
This sounds vaguely like another syscall, like mdup(). This is
particularly meaningful in the context of anonymous memory, for
which there is no method of replicating mappings within a single
process address space.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one
Presumably to be used in conjunction with keeping the old mapping.
A composite mdup()/mremap() and mprotect(), presumably saving a TLB
flush or other sorts of overhead, may make some sort of sense here.
Odds are it'll get rejected as the sequence of syscalls is a rather
precise equivalent, though it would optimize things (as would other
composite syscalls, e.g. ones combining fork() and execve() etc.).
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> g. mremap() flag to make the 5th arg (new addr) be the upper limit
> h. 6-bit wide mremap() "flag" to set the upper limit above given base
Essentially more placement support for mremap()/mdup(). It's not clear
to me those particular semantics are the ideal ones. A target range
for placement should do, if not manual address space management.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> i. support the prot argument to remap_file_pages
This is probably going to happen anyway.
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> j. a documented way (madvise?) to punch same-VMA zero-page holes
This is MADV_REMOVE, though most filesystems don't support it. Do you
need it for more than tmpfs?
-- wli
On 6/19/07, William Lee Irwin III wrote:
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> Right now, Linux isn't all that friendly to JIT emulators.
>> Here are the problems and suggestions to improve the situation.
>> There is an SE Linux execmem restriction that enforces W^X.
>> Assuming you don't wish to just disable SE Linux, there are
>> two ugly ways around the problem. You can mmap a file twice,
>> or you can abuse SysV shared memory. The mmap method requires
>> that you know of a filesystem mounted rw,exec where you can
>> write a very large temporary file. This arbitrary filesystem,
>> rather than swap space, will be the backing store. The SysV
>> shared memory method requires an undocumented flag and is
>> subject to some annoying size limits. Both methods create
>> objects that will fail to be deleted if the program dies
>> before marking the objects for deletion.
>
> If the policy forbidding self-modifying code lacks a method of
> exempting programs such as JIT interpreters (which I doubt) then
> it's a problem. I'm with Alan on this one.
It does and it doesn't. There is not a reasonable way for a
user to mark an app as needing full self-modifying ability.
It's not like the executable stack, which can be set via the
ELF note markings on the executable. (ELF note markings are
ideal because they can not be used via a ret-to-libc attack)
With admin privs, one can change SE Linux settings. Mark the
executable, disable the protection system-wide, generate a
completely new SE Linux policy, or just turn SE Linux off.
Normally we don't expect/require admin privs to install an
executable in one's own ~/bin directory. This is broken.
It ought to be easier to get a JIT working well without
enabling arbitrary mprotect. This would allow a JIT to
partially benefit from the recent security enhancements.
(think of all the buggy browser-based JIT things!)
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> Processors often have annoying limits on the immediate values
>> in instructions. An x86 or x86_64 JIT can go a bit faster if
>> all allocations are kept to the low 2 GB of address space.
>> There are also reasons for a 32bit-to-x86_64 JIT to choose
>> a nearly arbitrary 2 GB region that lies above 4 GB.
>> Other archs have other limits, such as 32 MB or 256 MB.
>
> This sort of logic might be appropriate for a sort of parametrized
> and specialized vma allocator setting the policy in /proc/ along
> with various sorts of limits. There are limits to such and at some
> point things will have to manually manage their own process address
> spaces in a platform-specific fashion. If kernel assistance here is
> rejected they may have to do so in all cases.
I prefer ELF notes (for start-up allocations) and prctl,
plus a mmap flag for per-allocation behavior.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> Additions to better support JIT emulators:
>> a. sysctl to set IPC_RMID by default
>
> This is a bad idea. The standard semantics are needed for programs
> relying upon them.
I didn't mean that the default default :-) setting would change.
I meant that people could change the behavior from a boot script.
Things that break are really foul and nasty anyway, probably with
serious problems that ought to get fixed.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> c. open() flag to unlink a file before returning the fd
>
> You probably want a tmpfile(3) -like affair which never has a pathname
> to begin with. It could be useful for security purposes more generally.
Yes, exactly. I think there are some possible optimizations
available too, particularly with the cifs filesystem.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> d. mremap() flag to always keep the old mapping
>
> This sounds vaguely like another syscall, like mdup(). This is
> particularly meaningful in the context of anonymous memory, for
> which there is no method of replicating mappings within a single
> process address space.
Yes, mdup() and probably mdup2(). It could be mremap flags or not.
JIT emulators generally need a second mapping so that they can
have both read/write and execute for the same physical memory.
It is somewhat tolerable to have SE Linux enforce that the second
mapping be randomized. (it helps security greatly, but slows the
emulator by a tiny bit)
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> e. mremap() flag to get a read/write mapping of a read/exec one
>> f. mremap() flag to get a read/exec mapping of a read/write one
>
> Presumably to be used in conjunction with keeping the old mapping.
> A composite mdup()/mremap() and mprotect(), presumably saving a TLB
> flush or other sorts of overhead, may make some sort of sense here.
> Odds are it'll get rejected as the sequence of syscalls is a rather
> precise equivalent, though it would optimize things (as would other
> composite syscalls, e.g. ones combining fork() and execve() etc.).
A few mremap flags ought to do the job I think.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> g. mremap() flag to make the 5th arg (new addr) be the upper limit
>> h. 6-bit wide mremap() "flag" to set the upper limit above given base
>
> Essentially more placement support for mremap()/mdup(). It's not clear
> to me those particular semantics are the ideal ones. A target range
> for placement should do, if not manual address space management.
Yes. I'm looking for the change that will help JIT emulators
the most while hurting security the least.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> i. support the prot argument to remap_file_pages
>
> This is probably going to happen anyway.
Great.
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> j. a documented way (madvise?) to punch same-VMA zero-page holes
>
> This is MADV_REMOVE, though most filesystems don't support it. Do you
> need it for more than tmpfs?
Yes and no. It's painful to be restricted to one backing store.
Covering MAP_ANONYMOUS and SysV shared mem is most critical.
I suppose that other filesystems may require multiple flags to
deal with the desire to (not) punch a hole on disk and what to
do if that isn't possible.
On 6/19/07, William Lee Irwin III wrote:
>> If the policy forbidding self-modifying code lacks a method of
>> exempting programs such as JIT interpreters (which I doubt) then
>> it's a problem. I'm with Alan on this one.
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> It does and it doesn't. There is not a reasonable way for a
> user to mark an app as needing full self-modifying ability.
> It's not like the executable stack, which can be set via the
> ELF note markings on the executable. (ELF note markings are
> ideal because they can not be used via a ret-to-libc attack)
> With admin privs, one can change SE Linux settings. Mark the
> executable, disable the protection system-wide, generate a
> completely new SE Linux policy, or just turn SE Linux off.
> Normally we don't expect/require admin privs to install an
> executable in one's own ~/bin directory. This is broken.
> It ought to be easier to get a JIT working well without
> enabling arbitrary mprotect. This would allow a JIT to
> partially benefit from the recent security enhancements.
> (think of all the buggy browser-based JIT things!)
I presumed an ELF note or extended filesystem attributes were already
in place for this sort of affair. It may be that the model implemented
is so restrictive that users are forbidden to create new executables,
in which case using a different model is certainly in order. Otherwise
the ELF note or attributes need to be implemented.
On 6/19/07, William Lee Irwin III wrote:
>> This sort of logic might be appropriate for a sort of parametrized
>> and specialized vma allocator setting the policy in /proc/ along
>> with various sorts of limits. There are limits to such and at some
>> point things will have to manually manage their own process address
>> spaces in a platform-specific fashion. If kernel assistance here is
>> rejected they may have to do so in all cases.
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I prefer ELF notes (for start-up allocations) and prctl,
> plus a mmap flag for per-allocation behavior.
Beware that the kernel (upstream of me) will likely refuse to support
exotic mmap() placement policies. At that point userspace will have
to implement them itself with a front-end to mmap().
Userspace can actually live without kernel placement support for
everything but the executable itself, which is already implemented via
ELF loading standards. This is not to downplay the tremendous amounts
of pain involved for moving the stack, getting ld.so to land in the
right place, and so on. Actually I'm less sure about .interp placement.
In any event, exotic virtualspace allocation policies are largely yet
another "simple matter of programming" implementable entirely in
userspace.
On 6/19/07, William Lee Irwin III wrote:
>> This is a bad idea. The standard semantics are needed for programs
>> relying upon them.
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I didn't mean that the default default :-) setting would change.
> I meant that people could change the behavior from a boot script.
> Things that break are really foul and nasty anyway, probably with
> serious problems that ought to get fixed.
It's actually not a good idea to make it the default even via sysctl.
People won't realize something will break until it does, and what will
break is likely to be a database responsible for data integrity. The
IPC_RMID creation flag should suffice.
On 6/19/07, William Lee Irwin III wrote:
>> You probably want a tmpfile(3) -like affair which never has a pathname
>> to begin with. It could be useful for security purposes more generally.
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes, exactly. I think there are some possible optimizations
> available too, particularly with the cifs filesystem.
I doubt this will be controversial, but it's not clear to me that there
is any convenient way to obtain an anonymous inode on anything but tmpfs,
in which case it's not really anonymous, but not visible to userspace on
account of the default kern_mount(). Essentially it's possible to hoist
the tmpfile name generation in-kernel to where it's in a disconnected
namespace not visible to any userspace whatsoever, and kernel threads
can cooperatively ensure safety via access discipline. Alternatively,
one could kern_mount() a fresh tmpfs filesystem for some concurrency
domain, e.g. per-uid, per-process, or per-thread.
On 6/19/07, William Lee Irwin III wrote:
>> This sounds vaguely like another syscall, like mdup(). This is
>> particularly meaningful in the context of anonymous memory, for
>> which there is no method of replicating mappings within a single
>> process address space.
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes, mdup() and probably mdup2(). It could be mremap flags or not.
> JIT emulators generally need a second mapping so that they can
> have both read/write and execute for the same physical memory.
> It is somewhat tolerable to have SE Linux enforce that the second
> mapping be randomized. (it helps security greatly, but slows the
> emulator by a tiny bit)
I think this may be doable via an mremap() flag barring needing to
break it up into multiple syscalls so it's implementable on all
architectures. That itself will be so difficult to get merged that the
duplication may have to stand on its own as an mremap() flag.
On 6/19/07, William Lee Irwin III wrote:
>> Presumably to be used in conjunction with keeping the old mapping.
>> A composite mdup()/mremap() and mprotect(), presumably saving a TLB
>> flush or other sorts of overhead, may make some sort of sense here.
>> Odds are it'll get rejected as the sequence of syscalls is a rather
>> precise equivalent, though it would optimize things (as would other
>> composite syscalls, e.g. ones combining fork() and execve() etc.).
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> A few mremap flags ought to do the job I think.
mremap() already has so many arguments this is going to be difficult
to get merged. Breaking it up into multiple syscalls will not be easy
to get past people, and there are architectures that can't implement
syscalls with too many arguments.
On 6/19/07, William Lee Irwin III wrote:
>> This is MADV_REMOVE, though most filesystems don't support it. Do you
>> need it for more than tmpfs?
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes and no. It's painful to be restricted to one backing store.
> Covering MAP_ANONYMOUS and SysV shared mem is most critical.
> I suppose that other filesystems may require multiple flags to
> deal with the desire to (not) punch a hole on disk and what to
> do if that isn't possible.
If those two are the bare necessities, they're already in place.
-- wli
William Lee Irwin III wrote:
>
> I presumed an ELF note or extended filesystem attributes were already
> in place for this sort of affair. It may be that the model implemented
> is so restrictive that users are forbidden to create new executables,
> in which case using a different model is certainly in order. Otherwise
> the ELF note or attributes need to be implemented.
>
Another thing to keep in mind, since we're talking about security
policies in the first place, is that anything like this *MUST* be
"opt-in" on the part of the security policy, because what we're talking
about is circumventing an explicit security policy just based on a
user-provided binary saying, in effect, "don't worry, I know what I'm
doing."
Changing the meaning of an established explicit security policy is not
acceptable.
-hpa
William Lee Irwin III wrote:
>> I presumed an ELF note or extended filesystem attributes were already
>> in place for this sort of affair. It may be that the model implemented
>> is so restrictive that users are forbidden to create new executables,
>> in which case using a different model is certainly in order. Otherwise
>> the ELF note or attributes need to be implemented.
On Wed, Jun 20, 2007 at 09:37:31AM -0700, H. Peter Anvin wrote:
> Another thing to keep in mind, since we're talking about security
> policies in the first place, is that anything like this *MUST* be
> "opt-in" on the part of the security policy, because what we're talking
> about is circumventing an explicit security policy just based on a
> user-provided binary saying, in effect, "don't worry, I know what I'm
> doing."
> Changing the meaning of an established explicit security policy is not
> acceptable.
This is what I had in mind with the commentary on the intentions of the
policy. Thank you for correcting my hamhanded attempt to describe it.
-- wli
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> I presumed an ELF note or extended filesystem attributes were already
>>> in place for this sort of affair. It may be that the model implemented
>>> is so restrictive that users are forbidden to create new executables,
>>> in which case using a different model is certainly in order. Otherwise
>>> the ELF note or attributes need to be implemented.
>
> On Wed, Jun 20, 2007 at 09:37:31AM -0700, H. Peter Anvin wrote:
>> Another thing to keep in mind, since we're talking about security
>> policies in the first place, is that anything like this *MUST* be
>> "opt-in" on the part of the security policy, because what we're talking
>> about is circumventing an explicit security policy just based on a
>> user-provided binary saying, in effect, "don't worry, I know what I'm
>> doing."
>> Changing the meaning of an established explicit security policy is not
>> acceptable.
>
> This is what I had in mind with the commentary on the intentions of the
> policy. Thank you for correcting my hamhanded attempt to describe it.
>
Right. It's important to notice that it's actually more of an issue if
the user can create executables, but the policy doesn't want to allow
them to run bypassing the policy.
-hpa
On 6/20/07, H. Peter Anvin wrote:
> William Lee Irwin III wrote:
> > I presumed an ELF note or extended filesystem attributes were already
> > in place for this sort of affair. It may be that the model implemented
> > is so restrictive that users are forbidden to create new executables,
> > in which case using a different model is certainly in order. Otherwise
> > the ELF note or attributes need to be implemented.
>
> Another thing to keep in mind, since we're talking about security
> policies in the first place, is that anything like this *MUST* be
> "opt-in" on the part of the security policy, because what we're talking
> about is circumventing an explicit security policy just based on a
> user-provided binary saying, in effect, "don't worry, I know what I'm
> doing."
>
> Changing the meaning of an established explicit security policy is not
> acceptable.
Not in this case. If an attacker can CHANGE THE BINARY then
it's already game over.
Putting this into the security policy was an error born of
laziness to begin with. Abuse of the security mechanism
was easier than hacking the toolchain, ELF loader, etc.
Either a binary needs self-modification, or it doesn't. This is
determined by the author of the code. If you don't trust an
executable that needs this ability, then you simply can not
run it in a useful way.
On 6/20/07, William Lee Irwin III wrote:
> On 6/19/07, William Lee Irwin III wrote:
>>> If the policy forbidding self-modifying code lacks a method of
>>> exempting programs such as JIT interpreters (which I doubt) then
>>> it's a problem. I'm with Alan on this one.
>
> On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
>> It does and it doesn't. There is not a reasonable way for a
>> user to mark an app as needing full self-modifying ability.
>> It's not like the executable stack, which can be set via the
>> ELF note markings on the executable. (ELF note markings are
>> ideal because they can not be used via a ret-to-libc attack)
>> With admin privs, one can change SE Linux settings. Mark the
>> executable, disable the protection system-wide, generate a
>> completely new SE Linux policy, or just turn SE Linux off.
>> Normally we don't expect/require admin privs to install an
>> executable in one's own ~/bin directory. This is broken.
>> It ought to be easier to get a JIT working well without
>> enabling arbitrary mprotect. This would allow a JIT to
>> partially benefit from the recent security enhancements.
>> (think of all the buggy browser-based JIT things!)
>
> I presumed an ELF note or extended filesystem attributes were already
> in place for this sort of affair. It may be that the model implemented
> is so restrictive that users are forbidden to create new executables,
> in which case using a different model is certainly in order. Otherwise
> the ELF note or attributes need to be implemented.
Users can create executables. Some will be non-functional
unless specially marked by an admin.
What is the goal here? I see no reasonable goal that would
result in such a policy.
> On 6/19/07, William Lee Irwin III wrote:
>>> This sort of logic might be appropriate for a sort of parametrized
>>> and specialized vma allocator setting the policy in /proc/ along
>>> with various sorts of limits. There are limits to such and at some
>>> point things will have to manually manage their own process address
>>> spaces in a platform-specific fashion. If kernel assistance here is
>>> rejected they may have to do so in all cases.
>
> On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
>> I prefer ELF notes (for start-up allocations) and prctl,
>> plus a mmap flag for per-allocation behavior.
>
> Beware that the kernel (upstream of me) will likely refuse to support
> exotic mmap() placement policies. At that point userspace will have
> to implement them itself with a front-end to mmap().
>
> Userspace can actually live without kernel placement support for
> everything but the executable itself, which is already implemented via
> ELF loading standards. This is not to downplay the tremendous amounts
> of pain involved for moving the stack, getting ld.so to land in the
> right place, and so on. Actually I'm less sure about .interp placement.
> In any event, exotic virtualspace allocation policies are largely yet
> another "simple matter of programming" implementable entirely in
> userspace.
When you go that route, you may need to abandon libc. I've done exactly
that for one emulator. It was not easy. Nearly nobody will want to go
down that path.
Things improve a bit if MAP_ANONYMOUS and SysV shared mem allocations
can be made to ignore the available memory checking. If I could allocate
a 2 GB chunk on a system with 1 GB total swap+RAM, then I could use
that as an area in which to perform MAP_FIXED allocations. As of now
this would require either adding the swap space or disabling the
available memory checking system-wide via sysctl.
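A minimal sketch of the reserve-then-MAP_FIXED idea described above,
assuming a Linux mmap() that accepts MAP_NORESERVE. The helper names
reserve_region and commit_at are hypothetical. Whether a PROT_NONE
reservation really escapes the available memory check depends on the
kernel version and the vm.overcommit_memory setting, so treat that
part as something to verify, not a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Reserve `len` bytes of address space without asking for usable
 * memory. PROT_NONE plus MAP_NORESERVE requests that the range not be
 * charged against swap+RAM (assumption: the running kernel honors it).
 * Returns NULL on failure. */
static void *reserve_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Replace part of the reservation with real read/write memory.
 * `off` and `len` must be multiples of the page size.
 * Returns the mapped address, or NULL on failure. */
static void *commit_at(void *base, size_t off, size_t len)
{
    void *want = (char *)base + off;
    void *p;

    assert(base != NULL && len > 0);
    p = mmap(want, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would reserve the large region once at start-up and then
carve page-aligned pieces out of it with commit_at() as needed.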
> On 6/19/07, William Lee Irwin III wrote:
>>> This is a bad idea. The standard semantics are needed for programs
>>> relying upon them.
>
> On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
>> I didn't mean that the default default :-) setting would change.
>> I meant that people could change the behavior from a boot script.
>> Things that break are really foul and nasty anyway, probably with
>> serious problems that ought to get fixed.
>
> It's actually not a good idea to make it the default even via sysctl.
> People won't realize something will break until it does, and what will
> break is likely to be a database responsible for data integrity. The
> IPC_RMID creation flag should suffice.
It's highly unlikely that such breakage would cause corruption.
Most likely it would cause the database to exit with an error
about failing to attach to a SysV shared memory segment.
I believe that a major cause of reboots is that admins are
unaware of SysV shared memory cruft left behind by apps that
crashed at the wrong moment or had other bugs. If something
is eating memory and you don't know what it is, you reboot.
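For reference, the usual way to limit that cruft today is to mark the
segment for removal as soon as it is attached. This is only a sketch:
shm_scratch is a hypothetical helper name, and it relies on the Linux
behavior that a segment marked with IPC_RMID stays usable until the
last detach. A crash between shmget() and shmctl() still leaks the
segment, which is the window the proposed creation flag would close.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private SysV segment, attach it, and mark it for removal
 * right away. Returns the attached address, or NULL on failure. */
static void *shm_scratch(size_t len)
{
    int id;
    void *p;

    assert(len > 0);
    id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    /* Mark for removal whether or not the attach worked, so the id
     * does not outlive this process. */
    shmctl(id, IPC_RMID, NULL);
    return p == (void *)-1 ? NULL : p;
}
```

The memory is released by shmdt() or by process exit, with no
separate clean-up step for an admin to discover later.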
> On 6/19/07, William Lee Irwin III wrote:
>>> This is MADV_REMOVE, though most filesystems don't support it. Do you
>>> need it for more than tmpfs?
>
> On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
>> Yes and no. It's painful to be restricted to one backing store.
>> Covering MAP_ANONYMOUS and SysV shared mem is most critical.
>> I suppose that other filesystems may require multiple flags to
>> deal with the desire to (not) punch a hole on disk and what to
>> do if that isn't possible.
>
> If those two are the bare necessities, they're already in place.
Well NONE of this stuff is absolutely required to run a JIT,
and one doesn't even need a JIT if one likes pure emulation.
All of this is about optimization and failure clean-up.
MAP_ANONYMOUS and SysV shared mem are good for transient things.
Sometimes a JIT author wants to keep a persistent image on disk.
In this case, it is much better to use the disk as backing store.
Also, sometimes one prefers to use a specific filesystem because
swap may be slower, smaller, or of unknown quality.
BTW, a mdup2 is great for DSP algorithms as well. It can allow
for wrap-around arrays, greatly simplifying and speeding up
things like filters.
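The wrap-around array can be sketched with existing calls. mdup2 is
only a proposal in this thread, so the sketch below instead maps an
unlinked temporary file twice, which is the "mmap a file twice"
workaround. The name ring_alloc and the /tmp location are
assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `size` bytes of an unlinked temporary file twice, back to back,
 * so that buf[i] and buf[i + size] name the same byte. `size` must be
 * a multiple of the page size. Returns NULL on failure. */
static void *ring_alloc(size_t size)
{
    char path[] = "/tmp/ringXXXXXX";    /* hypothetical location */
    void *base, *lo, *hi;
    int fd;

    assert(size > 0 && size % (size_t)sysconf(_SC_PAGESIZE) == 0);

    fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                        /* no name left to clean up */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    /* Reserve 2*size of contiguous address space, then overlay it. */
    base = mmap(NULL, 2 * size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    lo = mmap(base, size, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0);
    hi = mmap((char *)base + size, size, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);                           /* the mappings keep the file */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return NULL;
    }
    return base;
}
```

A filter can then read taps at buf[pos] through buf[pos + ntaps - 1]
for any pos below size without a modulo on every access, as long as
ntaps does not exceed size.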
Albert Cahalan wrote:
> Putting this into the security policy was an error born of
> laziness to begin with. Abuse of the security mechanism
> was easier than hacking the toolchain, ELF loader, etc.
>
> Either a binary needs self-modification, or it doesn't. This is
> determined by the author of the code. If you don't trust an
> executable that needs this ability, then you simply can not
> run it in a useful way.
That's fine. That's a policy decision. That's what a security policy
*is*. The owner of the system has decided, by security policy, that
that is not allowed. Bypassing that is not acceptable.
-hpa
On 6/20/07, H. Peter Anvin wrote:
> Albert Cahalan wrote:
> > Putting this into the security policy was an error born of
> > laziness to begin with. Abuse of the security mechanism
> > was easier than hacking the toolchain, ELF loader, etc.
> >
> > Either a binary needs self-modification, or it doesn't. This is
> > determined by the author of the code. If you don't trust an
> > executable that needs this ability, then you simply can not
> > run it in a useful way.
>
> That's fine. That's a policy decision. That's what a security policy
> *is*. The owner of the system has decided, by security policy, that
> that is not allowed. Bypassing that is not acceptable.
Fixing a bug should be acceptable.
Look, let's back up a bit here. At a high level, what exactly do
you imagine that this behavior was intended for? I suggest you
list some examples of the attacks that are blocked.
Can you come up with a reasonable argument that the current behavior
is the least painful restriction required to block those attacks?
Does the current behavior block any attack that the proposed behavior
would not? (list the attacks please)
Albert Cahalan wrote:
>>
>> That's fine. That's a policy decision. That's what a security policy
>> *is*. The owner of the system has decided, by security policy, that
>> that is not allowed. Bypassing that is not acceptable.
>
> Fixing a bug should be acceptable.
>
That's not what you're trying to do, though. You're trying to change
the behaviour underneath the security policy. If there is a bug, it's
in the security policy and that's where it needs to be changed.
> Look, let's back up a bit here. At a high level, what exactly do
> you imagine that this behavior was intended for? I suggest you
> list some examples of the attacks that are blocked.
>
> Can you come up with a reasonable argument that the current behavior
> is the least painful restriction required to block those attacks?
> Does the current behavior block any attack that the proposed behavior
> would not? (list the attacks please)
See above.
-hpa
On 6/20/07, H. Peter Anvin wrote:
> Albert Cahalan wrote:
> > Look, let's back up a bit here. At a high level, what exactly do
> > you imagine that this behavior was intended for? I suggest you
> > list some examples of the attacks that are blocked.
> >
> > Can you come up with a reasonable argument that the current behavior
> > is the least painful restriction required to block those attacks?
> > Does the current behavior block any attack that the proposed behavior
> > would not? (list the attacks please)
>
> See above.
Nope. I asked you to justify the existing behavior. Apparently you
are unable to do so. This should be a hint.
Albert Cahalan wrote:
> On 6/19/07, William Lee Irwin III wrote:
>> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>>> Right now, Linux isn't all that friendly to JIT emulators.
>>> Here are the problems and suggestions to improve the situation.
>>> There is an SE Linux execmem restriction that enforces W^X.
>>> Assuming you don't wish to just disable SE Linux, there are
>>> two ugly ways around the problem. You can mmap a file twice,
>>> or you can abuse SysV shared memory. The mmap method requires
>>> that you know of a filesystem mounted rw,exec where you can
>>> write a very large temporary file. This arbitrary filesystem,
>>> rather than swap space, will be the backing store. The SysV
>>> shared memory method requires an undocumented flag and is
>>> subject to some annoying size limits. Both methods create
>>> objects that will fail to be deleted if the program dies
>>> before marking the objects for deletion.
>>
>> If the policy forbidding self-modifying code lacks a method of
>> exempting programs such as JIT interpreters (which I doubt) then
>> it's a problem. I'm with Alan on this one.
>
> It does and it doesn't. There is not a reasonable way for a
> user to mark an app as needing full self-modifying ability.
> It's not like the executable stack, which can be set via the
> ELF note markings on the executable. (ELF note markings are
> ideal because they can not be used via a ret-to-libc attack)
>
> With admin privs, one can change SE Linux settings. Mark the
> executable, disable the protection system-wide, generate a
> completely new SE Linux policy, or just turn SE Linux off.
According to the documents I found about SELinux, you can also
- create a this-app-needs-selfmodification type
- allow users to change the context type of their files to this type
- configure a domain to allow self-modification
- configure the domain transition
Brave words from someone who did not yet successfully find the magic in
order to install the refpolicy on debilian (after finding their refpolicy-foo
to be incomplete and their refpolicy-src to not compile).
On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
>
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.
and these methods also destroy yourself on any machine with a looser
cache coherency between I and D-cache....
for all but x86 you pretty much have to do the mprotect() between the
two states to deal with the cache flushing properly...
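The mprotect() approach mentioned here can be sketched as follows.
jit_alloc and jit_publish are hypothetical names. Note that this
RW-to-RX switch on anonymous memory is the kind of transition the
SE Linux execmem check may deny, so the return value has to be
checked. Whether the permission change alone keeps the instruction
cache coherent is architecture-specific and is not shown here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Allocate `len` bytes of read/write memory for code generation.
 * Returns NULL on failure. */
static void *jit_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Copy code in while the buffer is writable, then switch the whole
 * buffer to read/exec. Returns 0 on success, or -1 with errno set if
 * mprotect() refused (for example under a policy that forbids it). */
static int jit_publish(void *buf, size_t cap, const void *code, size_t len)
{
    assert(len <= cap);
    memcpy(buf, code, len);
    return mprotect(buf, cap, PROT_READ | PROT_EXEC);
}
```

The buffer is never writable and executable at the same time, at the
cost of one mprotect() call per state change.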
On 6/21/07, Arjan van de Ven wrote:
> On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> > Right now, Linux isn't all that friendly to JIT emulators.
> > Here are the problems and suggestions to improve the situation.
> >
> > There is an SE Linux execmem restriction that enforces W^X.
> > Assuming you don't wish to just disable SE Linux, there are
> > two ugly ways around the problem. You can mmap a file twice,
> > or you can abuse SysV shared memory. The mmap method requires
> > that you know of a filesystem mounted rw,exec where you can
> > write a very large temporary file. This arbitrary filesystem,
> > rather than swap space, will be the backing store. The SysV
> > shared memory method requires an undocumented flag and is
> > subject to some annoying size limits. Both methods create
> > objects that will fail to be deleted if the program dies
> > before marking the objects for deletion.
>
> and these methods also destroy yourself on any machine with a looser
> cache coherency between I and D-cache....
>
> for all but x86 you pretty much have to do the mprotect() between the
> two states to deal with the cache flushing properly...
If the instructions to force data write-back and/or to
invalidate the instruction cache are privileged, yes.
AFAIK, only ARM is that lame.
For example, PowerPC lets unprivileged code run
the required instructions.
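One portable way to request that write-back and invalidate from C is
the GCC/Clang builtin __builtin___clear_cache, which leaves the
per-architecture details to the compiler and its runtime. The sketch
below is an illustration, not code from any emulator in this thread,
and emit_code is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly generated machine code into `dst`, then ask the
 * compiler runtime to make the range visible to instruction fetch.
 * `dst` must have room for at least `len` bytes. */
static void emit_code(void *dst, size_t cap, const void *code, size_t len)
{
    assert(len <= cap);
    memcpy(dst, code, len);
    /* The arguments are the start and one-past-the-end addresses. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

The call goes after the last write to the range and before the first
jump into it.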
On Fri, 2007-06-22 at 01:56 -0400, Albert Cahalan wrote:
> On 6/21/07, Arjan van de Ven wrote:
> > On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> > > Right now, Linux isn't all that friendly to JIT emulators.
> > > Here are the problems and suggestions to improve the situation.
> > >
> > > There is an SE Linux execmem restriction that enforces W^X.
> > > Assuming you don't wish to just disable SE Linux, there are
> > > two ugly ways around the problem. You can mmap a file twice,
> > > or you can abuse SysV shared memory. The mmap method requires
> > > that you know of a filesystem mounted rw,exec where you can
> > > write a very large temporary file. This arbitrary filesystem,
> > > rather than swap space, will be the backing store. The SysV
> > > shared memory method requires an undocumented flag and is
> > > subject to some annoying size limits. Both methods create
> > > objects that will fail to be deleted if the program dies
> > > before marking the objects for deletion.
> >
> > and these methods also destroy yourself on any machine with a looser
> > cache coherency between I and D-cache....
> >
> > for all but x86 you pretty much have to do the mprotect() between the
> > two states to deal with the cache flushing properly...
>
> If the instructions to force data write-back and/or to
> invalidate the instruction cache are privileged, yes.
> AFAIK, only ARM is that lame.
and your program executes this on all the cpus in the system?
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
On 6/22/07, Arjan van de Ven wrote:
> On Fri, 2007-06-22 at 01:56 -0400, Albert Cahalan wrote:
> > On 6/21/07, Arjan van de Ven wrote:
> > > On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> > > > Right now, Linux isn't all that friendly to JIT emulators.
> > > > Here are the problems and suggestions to improve the situation.
> > > >
> > > > There is an SE Linux execmem restriction that enforces W^X.
> > > > Assuming you don't wish to just disable SE Linux, there are
> > > > two ugly ways around the problem. You can mmap a file twice,
> > > > or you can abuse SysV shared memory. The mmap method requires
> > > > that you know of a filesystem mounted rw,exec where you can
> > > > write a very large temporary file. This arbitrary filesystem,
> > > > rather than swap space, will be the backing store. The SysV
> > > > shared memory method requires an undocumented flag and is
> > > > subject to some annoying size limits. Both methods create
> > > > objects that will fail to be deleted if the program dies
> > > > before marking the objects for deletion.
> > >
> > > and these methods also destroy yourself on any machine with a looser
> > > cache coherency between I and D-cache....
> > >
> > > for all but x86 you pretty much have to do the mprotect() between the
> > > two states to deal with the cache flushing properly...
> >
> > If the instructions to force data write-back and/or to
> > invalidate the instruction cache are privileged, yes.
> > AFAIK, only ARM is that lame.
>
> and your program executes this on all the cpus in the system?
I'll remember that if I ever run a JIT on the SMP ARM box.
(there's like one, at the manufacturer, right?)
I don't recall seeing such code in the libgcc trampoline
setup for PowerPC. Either it's not required, or this is
a rather popular bug.
Perhaps ARM needs syscalls for this, or emulation for
the privileged instructions. This may already exist; it
sure is required. So this would be another need for
properly supporting JIT emulators.
> > > > and these methods also destroy yourself on any machine with a looser
> > > > cache coherency between I and D-cache....
> > > >
> > > > for all but x86 you pretty much have to do the mprotect() between the
> > > > two states to deal with the cache flushing properly...
> > >
> > > If the instructions to force data write-back and/or to
> > > invalidate the instruction cache are privileged, yes.
> > > AFAIK, only ARM is that lame.
> >
> > and your program executes this on all the cpus in the system?
no I meant that you had to call your userspace instruction on all cpus,
so on all-but-arm (from the Intel side I know IA64 needs such a flush,
but I'm pretty sure PPC does too)
> I don't recall seeing such code in the libgcc trampoline
> setup for PowerPC. Either it's not required, or this is
> a rather popular bug.
I suspect it'll be playing under the assumption that going from "no
code" to "code" is fine since the icache is cold.
On 6/22/07, Arjan van de Ven wrote:
> > > > > and these methods also destroy yourself on any machine with a looser
> > > > > cache coherency between I and D-cache....
> > > > >
> > > > > for all but x86 you pretty much have to do the mprotect() between the
> > > > > two states to deal with the cache flushing properly...
> > > >
> > > > If the instructions to force data write-back and/or to
> > > > invalidate the instruction cache are privileged, yes.
> > > > AFAIK, only ARM is that lame.
> > >
> > > and your program executes this on all the cpus in the system?
>
> no I meant that you had to call your userspace instruction on all cpus,
> so on all-but-arm (from the Intel side I know IA64 needs such a flush,
> but I'm pretty sure PPC does too)
I understood.
AFAIK, it is common to propagate this via a special
bus cycle. Section 5.1.5.2.1 of the PowerPC manual
states that this is so. Section 5.1.5.2 lists the requirements
for both uniprocessor and multiprocessor. Note that
Linux uses the coherent memory model for PowerPC SMP.
See also the "icbi" instruction description, where the use
of an address-only broadcast is mentioned.
> > I don't recall seeing such code in the libgcc trampoline
> > setup for PowerPC. Either it's not required, or this is
> > a rather popular bug.
>
> I suspect it'll be playing under the assumption that going from "no
> code" to "code" is fine since the icache is cold.
A previous trampoline would ruin that.
Fortunately, PowerPC is not as brain-dead as ARM and IA64.
(not that I'm writing code for any of these)
On Jun 19, 2007, at 11:08:24, William Lee Irwin III wrote:
> On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>> c. open() flag to unlink a file before returning the fd
>
> You probably want a tmpfile(3) -like affair which never has a
> pathname to begin with. It could be useful for security purposes
> more generally.
maybe this: open("/some/dir", O_TMPFILE);
and this? open("/some/dir", O_TMPFILE|O_DIRECTORY);
The former would return a filehandle to a new anonymous file
somewhere on whatever filesystem backs the specified path. The
latter would do the same, except create an anonymous directory where
you could use "openat()" or something. Presumably "lsof" and "/proc"
should show either type of handle as referring to either "/some/
filesystem/" or "/some/filesystem/ (anonymous temp file)" or something.
Cheers,
Kyle Moffett
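For reference, Linux later gained an O_TMPFILE flag for open(2)
(kernel 3.11) that behaves much like the first form proposed above.
I am not aware of a counterpart for the anonymous-directory form.
The sketch below assumes glibc exposes O_TMPFILE under _GNU_SOURCE
and falls back to mkstemp() plus unlink() otherwise. The name
anon_file is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return a read/write fd for a nameless file on the filesystem that
 * holds `dir`, or -1 on failure. */
static int anon_file(const char *dir)
{
    char path[PATH_MAX];
    int fd = -1;

    assert(dir != NULL);
#ifdef O_TMPFILE
    /* The file never has a pathname, so nothing is left on a crash. */
    fd = open(dir, O_TMPFILE | O_RDWR, 0600);
    if (fd >= 0)
        return fd;
#endif
    /* Fallback: create a named file and unlink it right away. A crash
     * between the two calls would strand the file. */
    if (snprintf(path, sizeof path, "%s/anonXXXXXX", dir) >= (int)sizeof path)
        return -1;
    fd = mkstemp(path);
    if (fd >= 0)
        unlink(path);
    return fd;
}
```

The returned fd can be sized with ftruncate() and passed to mmap()
as the backing store on whichever filesystem the caller picked.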
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>>> c. open() flag to unlink a file before returning the fd
On Jun 19, 2007, at 11:08:24, William Lee Irwin III wrote:
>> You probably want a tmpfile(3) -like affair which never has a
>> pathname to begin with. It could be useful for security purposes
>> more generally.
On Fri, Jun 22, 2007 at 11:52:12PM -0400, Kyle Moffett wrote:
> maybe this: open("/some/dir", O_TMPFILE);
> and this? open("/some/dir", O_TMPFILE|O_DIRECTORY);
> The former would return a filehandle to a new anonymous file
> somewhere on whatever filesystem backs the specified path. The
> latter would do the same, except create an anonymous directory where
> you could use "openat()" or something. Presumably "lsof" and "/proc"
> should show either type of handle as referring to either "/some/
> filesystem/" or "/some/filesystem/ (anonymous temp file)" or something.
This is plausible (and I did indeed consider the file variant),
though it may require more infrastructure than for tmpfs only.
It may be worth clarifying that I have no concrete plans to work on
the JIT emulator issues myself. I'm only disseminating ideas I think
will pass review. I expect others to take up the issue(s) perhaps with
some inspiration from what I described. I may review some, but I have
a large review backlog as things now stand.
-- wli