2005-12-15 22:49:39

by Ulrich Drepper

[permalink] [raw]
Subject: [PATCH 0/3] *at syscalls: Intro

Here is a series of patches which introduce in total 11 new system calls
which take a file descriptor/filename pair instead of a single file name.
These functions, openat etc, have been discussed on numerous occasions.
They are needed to implement race-free filesystem traversal, they are
necessary to implement a virtual per-thread current working directory
(think multi-threaded backup software), etc.

We have in glibc today implementations of the interfaces which use the
/proc/self/fd magic. But this code is rather expensive. Here are some
results (similar to what Jim Meyering posted before):

The test creates a deep directory hierarchy on a tmpfs filesystem.
Then rm -fr is used to remove all directories. Without syscall support
I get this:

real 0m31.921s
user 0m0.688s
sys 0m31.234s

With syscall support the results are much better:

real 0m20.699s
user 0m0.536s
sys 0m20.149s


The implemenation is really small:

arch/i386/kernel/syscall_table.S | 11 ++
arch/x86_64/ia32/ia32entry.S | 11 ++
fs/compat.c | 48 +++++++++--
fs/exec.c | 2
fs/namei.c | 167 ++++++++++++++++++++++++++++++---------
fs/open.c | 60 +++++++++++---
fs/stat.c | 54 ++++++++++--
include/asm-i386/unistd.h | 13 ++-
include/asm-x86_64/ia32_unistd.h | 13 ++-
include/asm-x86_64/unistd.h | 24 +++++
include/linux/fcntl.h | 7 +
include/linux/fs.h | 7 +
include/linux/namei.h | 7 -
include/linux/time.h | 2
14 files changed, 355 insertions(+), 71 deletions(-)

I've split the patch in three parts. The first part is the actual
code change. It mostly consists of passing down an additional parameter
with a file descriptor and add wrapper functions which pass down the
default parameter AT_FDCWD. Three new constants are defined in
<linux/fcntl.h> which must correspond to the values already in use
in glibc. In a few cases I've modified some code which would not
necesarily need changing but the change makes it a bit more efficient
in presence of the wrapper functions.

The real change needed is the additional else-clause in what is now
do_path_lookup. That's it.

The other two patches contain the syscall definitions for x86 and x86-64.
I've tested the code on x86-64, including the ia32 compat code. Because
there is no architecture specific change all should work well on other
archs once the syscalls are added.


2005-12-16 00:32:19

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

Ulrich Drepper wrote:
> Here is a series of patches which introduce in total 11 new system calls
> which take a file descriptor/filename pair instead of a single file name.
> These functions, openat etc, have been discussed on numerous occasions.
> They are needed to implement race-free filesystem traversal, they are
> necessary to implement a virtual per-thread current working directory
> (think multi-threaded backup software), etc.
>
> We have in glibc today implementations of the interfaces which use the
> /proc/self/fd magic. But this code is rather expensive. Here are some
> results (similar to what Jim Meyering posted before):
>
> The test creates a deep directory hierarchy on a tmpfs filesystem.
> Then rm -fr is used to remove all directories. Without syscall support
> I get this:
>
> real 0m31.921s
> user 0m0.688s
> sys 0m31.234s
>
> With syscall support the results are much better:
>
> real 0m20.699s
> user 0m0.536s
> sys 0m20.149s
>
>
> The implemenation is really small:
>
> arch/i386/kernel/syscall_table.S | 11 ++
> arch/x86_64/ia32/ia32entry.S | 11 ++
> fs/compat.c | 48 +++++++++--
> fs/exec.c | 2
> fs/namei.c | 167 ++++++++++++++++++++++++++++++---------
> fs/open.c | 60 +++++++++++---
> fs/stat.c | 54 ++++++++++--
> include/asm-i386/unistd.h | 13 ++-
> include/asm-x86_64/ia32_unistd.h | 13 ++-
> include/asm-x86_64/unistd.h | 24 +++++
> include/linux/fcntl.h | 7 +
> include/linux/fs.h | 7 +
> include/linux/namei.h | 7 -
> include/linux/time.h | 2
> 14 files changed, 355 insertions(+), 71 deletions(-)
>
> I've split the patch in three parts. The first part is the actual
> code change. It mostly consists of passing down an additional parameter
> with a file descriptor and add wrapper functions which pass down the
> default parameter AT_FDCWD. Three new constants are defined in
> <linux/fcntl.h> which must correspond to the values already in use
> in glibc. In a few cases I've modified some code which would not
> necesarily need changing but the change makes it a bit more efficient
> in presence of the wrapper functions.
>
> The real change needed is the additional else-clause in what is now
> do_path_lookup. That's it.

Definitely gets my ACK, for my own motivations: I want to create
race-free cp(1)/mv(1)/rm(1) utilities for my posixutils project.

It would be nice to add the additional argument to path_lookup(), rather
than calling do_path_lookup() everywhere.

Jeff



2005-12-16 01:14:03

by Nicholas Miell

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

On Thu, 2005-12-15 at 17:49 -0500, Ulrich Drepper wrote:
> Here is a series of patches which introduce in total 11 new system calls
> which take a file descriptor/filename pair instead of a single file name.
> These functions, openat etc, have been discussed on numerous occasions.
> They are needed to implement race-free filesystem traversal, they are
> necessary to implement a virtual per-thread current working directory
> (think multi-threaded backup software), etc.
>

Actually, that last part is false (or maybe just misleading). You can
create threads without CLONE_FS to get a per-thread cwd/chroot/umask, no
"virtual" required.

Don't take this as an objection to implementation of the *at() syscalls
in Linux, though; rather, look at is as a request for the addition of
int pthread_attr_setfssharing_np(pthread_attr_t *attr, int share) and
int pthread_attr_getfssharing_np(pthread_attr_t *attr) to glibc.

--
Nicholas Miell <[email protected]>

2005-12-16 01:30:31

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

Jeff Garzik wrote:
> It would be nice to add the additional argument to path_lookup(), rather
> than calling do_path_lookup() everywhere.

path_lookup is unfortunately exported. If breaking the ABI is no issue
I'll change it.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2005-12-16 01:40:05

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

Ulrich Drepper wrote:
> Jeff Garzik wrote:
>
>> It would be nice to add the additional argument to path_lookup(),
>> rather than calling do_path_lookup() everywhere.
>
>
> path_lookup is unfortunately exported. If breaking the ABI is no issue
> I'll change it.

Yep, I saw it was exported. I would prefer to change the API, but my
preference doesn't carry much weight: I have no idea if there are any
important out-of-tree users or not.

Jeff


2005-12-16 11:32:27

by Jim Meyering

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

FYI, the rm in coreutils-cvs is finally thread-safe and race-free,
when using openat et al.

On 12/16/05, Jeff Garzik <[email protected]> wrote:
> Definitely gets my ACK, for my own motivations: I want to create
> race-free cp(1)/mv(1)/rm(1) utilities for my posixutils project.

2005-12-16 16:27:09

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

Jim Meyering wrote:
> FYI, the rm in coreutils-cvs is finally thread-safe and race-free,
> when using openat et al.

Actually, Jim, I doubt it. There is one more race which cannot be
solved with the existing interfaces. I want to tackle this next, after
these changes are decided on.

The problem is directory creation and then populating it. As in cp -r
and any backup tool. You currently have to use (at best)

mkdirat(fd, "some-dir", 0666);
dfd = openat(fd, "some-dir", O_RDONLY);

What is needed is a way to create a new directory and return a file
descriptor for it.

I was thinking about using

dfd = openat(fd, "some-dir", O_RDONLY|O_DIRECTORY|O_CREAT, S_IFDIR|0666)

where the combination of using O_DIRECTORY, O_RDONLY, O_CREAT, and the
S_IFDIR flag can be recognized. This is a configuration which cannot be
used successfully in current code. Should probably also work with open().

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2005-12-16 16:36:27

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

On Fri, Dec 16, 2005 at 08:24:53AM -0800, Ulrich Drepper wrote:
> Jim Meyering wrote:
> >FYI, the rm in coreutils-cvs is finally thread-safe and race-free,
> >when using openat et al.
>
> Actually, Jim, I doubt it. There is one more race which cannot be
> solved with the existing interfaces. I want to tackle this next, after
> these changes are decided on.
>
> The problem is directory creation and then populating it. As in cp -r
> and any backup tool. You currently have to use (at best)
>
> mkdirat(fd, "some-dir", 0666);
> dfd = openat(fd, "some-dir", O_RDONLY);
>
> What is needed is a way to create a new directory and return a file
> descriptor for it.
>
> I was thinking about using
>
> dfd = openat(fd, "some-dir", O_RDONLY|O_DIRECTORY|O_CREAT, S_IFDIR|0666)
>
> where the combination of using O_DIRECTORY, O_RDONLY, O_CREAT, and the
> S_IFDIR flag can be recognized. This is a configuration which cannot be
> used successfully in current code. Should probably also work with open().

At least for prelink I'd also like to be able to atomically
read/modify/write a file, but the Solaris renameat doesn't allow that.
What I do is open the original file, read it etc., then create a temporary
file in the same dir and as the final step I need to rename it over
the original file, but I'd like to do that only if nobody replaced
the original file in the mean time.
So a renameat variant that would have
olddirfd, oldname, oldfd, newdirfd, newname, newfd
arguments would be handy. It would do the same thing as
renameat (olddirfd, oldname, newdirfd, newname),
but would fail if oldfd resp. newfd no longer corresponds to
the olddirfd/oldname resp. newdirfd/newname files.

Jakub

2005-12-16 17:51:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro



On Thu, 15 Dec 2005, Nicholas Miell wrote:
>
> Don't take this as an objection to implementation of the *at() syscalls
> in Linux, though; rather, look at is as a request for the addition of
> int pthread_attr_setfssharing_np(pthread_attr_t *attr, int share) and
> int pthread_attr_getfssharing_np(pthread_attr_t *attr) to glibc.

I don't think separate fs/files makes sense with pthreads.

Why? Signals.

Signals are shared in a pthread environment, so signal handlers can happen
in any thread. If you don't share the filesystem thing, your signal
handlers had better not do any file operations. You'd better not try to
use the old "send a byte down a pipe" trick to wake somebody else up,
because the thread you may take the signal in may not _have_ that pipe fd
open.

Now, opening files in a signal handler may be unusual enough that
having separate cwd/root (!CLONE_FS as opposed to !CLONE_FILES) might be
more commonly usable, but it's still potentially asking for some really
funky behaviour.

Linus

2005-12-16 19:59:52

by Jim Meyering

[permalink] [raw]
Subject: Re: [PATCH 0/3] *at syscalls: Intro

On 12/16/05, Ulrich Drepper <[email protected]> wrote:
> Jim Meyering wrote:
> > FYI, the rm in coreutils-cvs is finally thread-safe and race-free,
> > when using openat et al.
>
> Actually, Jim, I doubt it.

You seem to have misread. I mentioned only `rm'.
I know full well that the other dir-traversing tools are not as robust.