2002-08-31 16:28:19

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH] Initial support for struct vfs_cred [0/1]


Hi Linus,

I hope this addresses all the issues that you raised (minus
user_struct, as I explained). Changes w.r.t. previous patches are as
follows:

1) Dropped current_fsuid(), current_fsgid(), current_ngroups().
=> had to merge patches [1/3] & [3/3]. Makes for a rather large
initial patch (see diffstat output below) 8-(

2) 'struct ucred' renamed to 'struct vfs_cred'. Renamed the
helper routines to reflect the new struct name.

3) Moved #ifdef __KERNEL__

4) Renamed the remaining current_* routines so as to drop the
'current_' prefix where appropriate (e.g. setfsuid() instead of
current_setfsuid(), ...).
Hopefully the naming scheme will be seen as an improvement.

5) Dropped patch [2/3]. There is no longer a namespace conflict
to justify renaming the networking code's 'struct ucred'.

Cheers,
Trond

arch/s390x/kernel/linux32.c | 9
arch/sparc64/kernel/sys_sparc32.c | 9
drivers/hotplug/pci_hotplug_core.c | 4
drivers/isdn/capi/capifs.c | 4
drivers/usb/core/inode.c | 4
fs/affs/inode.c | 4
fs/attr.c | 6
fs/bfs/dir.c | 4
fs/coda/coda_linux.c | 6
fs/devpts/inode.c | 4
fs/dquot.c | 2
fs/driverfs/inode.c | 4
fs/exec.c | 6
fs/ext2/balloc.c | 2
fs/ext2/ialloc.c | 4
fs/ext2/ioctl.c | 4
fs/ext3/balloc.c | 2
fs/ext3/ialloc.c | 4
fs/ext3/ioctl.c | 4
fs/file_table.c | 8
fs/hpfs/namei.c | 24 -
fs/intermezzo/file.c | 13 -
fs/intermezzo/journal.c | 54 ++--
fs/intermezzo/kml_reint.c | 12
fs/intermezzo/upcall.c | 2
fs/intermezzo/vfs.c | 4
fs/jffs/inode-v23.c | 28 +-
fs/jffs2/fs.c | 4
fs/jfs/jfs_inode.c | 4
fs/lockd/host.c | 7
fs/locks.c | 2
fs/minix/bitmap.c | 4
fs/namei.c | 8
fs/nfsd/auth.c | 11
fs/nfsd/vfs.c | 4
fs/open.c | 17 -
fs/pipe.c | 4
fs/proc/array.c | 11
fs/proc/base.c | 4
fs/ramfs/inode.c | 4
fs/reiserfs/namei.c | 4
fs/sysv/ialloc.c | 4
fs/udf/ialloc.c | 4
fs/udf/namei.c | 2
fs/ufs/ialloc.c | 4
fs/umsdos/namei.c | 8
include/linux/cred.h | 88 ++++++
include/linux/init_task.h | 1
include/linux/sched.h | 8
init/main.c | 1
kernel/Makefile | 5
kernel/cred.c | 468 +++++++++++++++++++++++++++++++++++++
kernel/fork.c | 17 +
kernel/kmod.c | 6
kernel/sys.c | 65 ++---
kernel/uid16.c | 9
mm/shmem.c | 8
net/socket.c | 4
net/sunrpc/auth_unix.c | 55 ++--
net/sunrpc/sched.c | 2
security/capability.c | 4
security/dummy.c | 5
62 files changed, 835 insertions(+), 252 deletions(-)


2002-08-31 18:52:59

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

But aren't credential changes supposed to be infrequent?

If so, isn't it faster to put the fields directly in task_struct, keep a
separate shared structure and copy them on kernel entry?

Here is some pseudocode to better illustrate the idea.

vfs_shared_cred::lock can be removed if it's not necessary to do atomic
modifications of multiple cred fields (otherwise you could copy and then
exchange a pointer).

vfs_shared_cred::tasklist_lock can be replaced with RCU.

struct vfs_cred_groups
{
int ngroups;
gid_t groups[];
};

struct vfs_cred
{
uid_t uid, gid;
struct vfs_cred_groups* groups;
};

struct vfs_shared_cred
{
atomic_t count;
spinlock_t lock;
spinlock_t tasklist_lock;
vfs_cred cred;
};

struct task_struct
{
struct vfs_cred cred;
struct vfs_cred* shared_cred;
list_t shared_cred_list;
};

struct thread_info
{
unsigned propagate_cred;
};

propagate_cred()
{
current_thread_info()->propagate_cred = 0;

spin_lock(&current->shared_cred->lock);
current->cred = current->shared_cred->cred;
spin_unlock(&current->shared_cred->lock);
}

kernel_entry() /* asm for every architecture :( */
{
if(unlikely(current_thread_info()->propagate_cred))
propagate_cred();
}

change_credentials()
{
list_t list;

spin_lock(&current->shared_cred->lock);
frob(current->shared_cred);
spin_unlock(&current->shared_cred->lock);

wmb();

spin_lock(&current->shared_cred->tasklist_lock);
list_for_each(list, &current->shared_cred_list)
{
struct task_struct* task = list_entry(list, struct task_struct, shared_cred_list);
task->thread_info->propagate_cred = 1;
}
spin_unlock(&current->shared_cred->tasklist_lock);
}

clone()
{
spin_lock(&current->shared_cred->lock);
new->cred = current->shared_cred->cred); /* not current->cred! */
spin_unlock(&current->shared_cred->lock);

if(flags & CLONE_CRED)
{
new->shared_cred = get_shared_cred(current->shared_cred);
spin_lock(&current->shared_cred->tasklist_lock);
list_add(new, &current->shared_cred_list);
spin_unlock(&current->shared_cred->tasklist_lock);
}
}

static void release_task(struct task_struct * p)
{
spin_lock(&current->shared_cred->tasklist_lock);
__list_del(current->shared_cred_list);
spin_unlock(&current->shared_cred->tasklist_lock);

put_shared_cred(current->shared_cred);
}


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-08-31 19:24:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]


On 31 Aug 2002, Luca Barbieri wrote:
>
> But aren't credential changes supposed to be infrequent?
>
> If so, isn't it faster to put the fields directly in task_struct, keep a
> separate shared structure and copy them on kernel entry?

But that makes us copy them every time, even though they practically never
change.. Much better to only copy them in the extremely rare cases when
they do change.

Linus

2002-08-31 19:34:32

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sat, 2002-08-31 at 21:36, Linus Torvalds wrote:
>
> On 31 Aug 2002, Luca Barbieri wrote:
> >
> > But aren't credential changes supposed to be infrequent?
> >
> > If so, isn't it faster to put the fields directly in task_struct, keep a
> > separate shared structure and copy them on kernel entry?
>
> But that makes us copy them every time, even though they practically never
> change.. Much better to only copy them in the extremely rare cases when
> they do change.
Sorry, I have explained myself incorrectly.
When credentials are changed, the changing task walks the list of tasks
sharing credentials with him and sets the propagate_cred flag in their
thread_info's.
The assembly code at entry checks this.
It's just two instructions, one memory read:
cmpl $0, propagate_cred(%ebx)
jnz do_propagate_cred

We could use the flags field, but we need atomic and/or and we still
don't have them for all architectures.
We could even merge this with the syscall trace check (but that brings
more complexity to avoid races).

Then the rest of the code doesn't need to know at all that credentials
are shared and is simpler and faster.
We have however a larger penalty on credential change but, as you say,
that's extremely rare (well, perhaps not necessarily extremely, but
still rare).


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-08-31 19:47:19

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

Forgot to mention that, of course, the vfs_cred_groups struct needs to
be recreated if modified and we must keep an atomic reference count in
it.
This allows to avoid having to copy the whole groups array for each task
on modification.
uid and gid instead are placed directly in task_struct since checking
them is faster and thus needs more optimization.
Same reasoning applies for eventual ACLs or similar structures.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-08-31 22:26:07

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

> Then the rest of the code doesn't need to know at all that
> credentials are shared and is simpler and faster. We have
> however a larger penalty on credential change but, as you say,
> that's extremely rare (well, perhaps not necessarily extremely,
> but still rare).

What if I, in a fit of madness/perversion, decide to use CLONE_CRED
between 2 kernel threads (i.e. no 'kernel entry')?


Leaving CLONE_CRED aside, please do not forget that most of the
motivation for vfs_cred is the need to *cache* credentials.
This is something which we already do today in several filesystems:
Coda, Intermezzo, NFS, to name but the most obvious.
The result of the lack of a VFS-sanctioned credential is that we have
to use 'struct file' as a vehicle for passing credentials in, for
instance, the address_space_operations, and that each filesystem ends
up having to keep its own private copies of those credentials in
file->private_data.

Cheers,
Trond

2002-08-31 23:09:39

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 00:30, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> > Then the rest of the code doesn't need to know at all that
> > credentials are shared and is simpler and faster. We have
> > however a larger penalty on credential change but, as you say,
> > that's extremely rare (well, perhaps not necessarily extremely,
> > but still rare).
>
> What if I, in a fit of madness/perversion, decide to use CLONE_CRED
> between 2 kernel threads (i.e. no 'kernel entry')?
You don't or you manually patch the task_struct of the other threads.
This isn't a serious concern.

> Leaving CLONE_CRED aside, please do not forget that most of the
> motivation for vfs_cred is the need to *cache* credentials.
> This is something which we already do today in several filesystems:
> Coda, Intermezzo, NFS, to name but the most obvious.
> The result of the lack of a VFS-sanctioned credential is that we have
> to use 'struct file' as a vehicle for passing credentials in, for
> instance, the address_space_operations, and that each filesystem ends
> up having to keep its own private copies of those credentials in
> file->private_data.
I'm not proposing to not use vfs_cred (the pseudocode I written includes
it).
I'm just suggesting a way to speed up access to credentials.
Your code does get_current_vfscred()/put_vfscred() in normal paths and
accesses credentials with an indirect pointer.
Furthermore, setting the uid and gid are very expensive since they need
to break copy-on-write.
If we instead only allow changes to credentials on kernel mode entry, we
don't have to worry about changing credentials and the code becomes
simpler and faster.
It also seems better to only copy-on-write groups and just copy uid and
gid.


For example, rather than this;
- uid_t saved_fsuid = current->fsuid;
+ struct vfs_cred *saved_cred = get_current_vfscred();
kernel_cap_t saved_cap = current->cap_effective;

/* Create RPC socket as root user so we get a priv port
*/
- current->fsuid = 0;
+ setfsuid(0);
cap_raise (current->cap_effective,
CAP_NET_BIND_SERVICE);
xprt = xprt_create_proto(host->h_proto, &host->h_addr,
NULL);
- current->fsuid = saved_fsuid;
+ set_current_vfscred(saved_cred);
+ put_vfscred(saved_cred);

you can just do this:
- uid_t saved_fsuid = current->fsuid;
+ uid_t saved_fsuid = current->fscred.uid;
kernel_cap_t saved_cap = current->cap_effective;

/* Create RPC socket as root user so we get a priv port
*/
- current->fsuid = 0;
+ current->fscred.uid = 0;
cap_raise (current->cap_effective,
CAP_NET_BIND_SERVICE);
xprt = xprt_create_proto(host->h_proto, &host->h_addr,
NULL);
- current->fsuid = saved_fsuid;
+ current->fscred.uid = saved_fsuid;

(here the fscred is the one called cred in the pseudocode: fscred is a
better name).

This code uses vfs_cred but is as fast as the old one while the one in
your patch is significantly slower (e.g. compare your setfsuid with
current->fscred.uid = xxx).
If you then need to keep the vfs_cred data for longer than a single
syscall, just copy it like you would do in the propagation code.
That means copying the structure and then incrementing the reference
count on groups.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 12:58:44

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

> For example, rather than this;
<snip>

> you can just do this:
> - uid_t saved_fsuid = current->fsuid;
> + uid_t saved_fsuid = current->fscred.uid;
> kernel_cap_t saved_cap =
> current->cap_effective;

But I don't want to have to do that at all. Why should I change the
actual task's privileges in order to call down into a single VFS
function?
The point of VFS support for credentials is to eliminate these hacks,
and cut down on all this gratuitous changing of privilege. That's what
we want the API changes for.

Who cares if changing fsuid/fsgid is more expensive? The only place we
should actually be doing that is in sys_fsuid(), sys_fsgid(), and
possibly daemonize(), where adequate security checks can be made.

Cheers,
Trond

2002-09-01 14:05:49

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Trond Myklebust <[email protected]> writes:

>>>>> " " == Luca Barbieri <[email protected]> writes:
>> For example, rather than this;
> <snip>

>> you can just do this:
>> - uid_t saved_fsuid = current->fsuid;
>> + uid_t saved_fsuid = current->fscred.uid;
>> kernel_cap_t saved_cap =
current-> cap_effective;

> But I don't want to have to do that at all. Why should I change

Just to follow up on why the above 'optimization' is just plain wrong.

You are forgetting that the fscred might simultaneously be referenced
by an open struct file. Are you saying that this file should suddenly
see its credential change?
The alternative without copy on write is to make a full copy of the
fscred every time we open a file or schedule some form of asynchronous
I/O, and hence need to cache the current VFS credentials.

The copy-on-write rule is there in order to *minimize* the need to
copy the cred. It works because changing the cred's entries is
supposed to be a *rare* occurrence, whereas taking references and
reading are common.

Cheers,
Trond

2002-09-01 14:16:04

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 16:10, Trond Myklebust wrote:
> >>>>> " " == Trond Myklebust <[email protected]> writes:
>
> >>>>> " " == Luca Barbieri <[email protected]> writes:
> >> For example, rather than this;
> > <snip>
>
> >> you can just do this:
> >> - uid_t saved_fsuid = current->fsuid;
> >> + uid_t saved_fsuid = current->fscred.uid;
> >> kernel_cap_t saved_cap =
> current-> cap_effective;
>
> > But I don't want to have to do that at all. Why should I change
>
> Just to follow up on why the above 'optimization' is just plain wrong.
>
> You are forgetting that the fscred might simultaneously be referenced
> by an open struct file. Are you saying that this file should suddenly
> see its credential change?
No, it cannot be referenced by an open struct file because you copy the
structure, not pointers to it.

> The alternative without copy on write is to make a full copy of the
> fscred every time we open a file or schedule some form of asynchronous
> I/O, and hence need to cache the current VFS credentials.
Yes.
Note however that the structure is like this:
struct vfs_cred
{
uid_t uid, gid;
vfs_cred_groups* groups;
}

vfs_cred_copy(struct vfs_cred* dest, struct vfs_cred* src)
{
dest->uid = src->uid;
dest->gid = src->gid;
dest->groups = src->groups;
atomic_inc(&src->groups->count);
}

Of course if you don't need groups you can avoid copying them.
This is efficient if you usually either check the uid and gid or copy
only them (omitting groups).
If instead copying the whole structure is more frequent than the
operations described above, then use reference counting and
copy-on-write for the whole structure, but I don't think that this is
the case.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 14:29:21

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 15:03, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> > For example, rather than this;
> <snip>
>
> > you can just do this:
> > - uid_t saved_fsuid = current->fsuid;
> > + uid_t saved_fsuid = current->fscred.uid;
> > kernel_cap_t saved_cap =
> > current->cap_effective;
>
> But I don't want to have to do that at all. Why should I change the
> actual task's privileges in order to call down into a single VFS
> function?
> The point of VFS support for credentials is to eliminate these hacks,
> and cut down on all this gratuitous changing of privilege. That's what
> we want the API changes for.
That's what the code did, unless I misunderstood it.
Anyway if you want to give a different fsuid to a filesystem function,
you either pass credentials as a parameter (that means that you change
all the functions in the call chain to do that) or you modify a
structure present or referenced from the task_struct (or you disable
preemption and use a per-cpu variable, but this would be a very bad idea
here).

Furthermore if no one except the current task can access/modify the task
credentials the "gratuitous changing of privileges" is only seen by the
VFS function you call, so it is fine (it may not be conceptually
elegant, but that should not be preferred over being fast and minimally
intrusive).


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 15:18:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sunday 01 September 2002 01:13, Luca Barbieri wrote:
> On Sun, 2002-09-01 at 00:30, Trond Myklebust wrote:
> > >>>>> " " == Luca Barbieri <[email protected]> writes:
> >
> > > Then the rest of the code doesn't need to know at all that
> > > credentials are shared and is simpler and faster. We have
> > > however a larger penalty on credential change but, as you say,
> > > that's extremely rare (well, perhaps not necessarily extremely,
> > > but still rare).
> >
> > What if I, in a fit of madness/perversion, decide to use CLONE_CRED
> > between 2 kernel threads (i.e. no 'kernel entry')?
> You don't or you manually patch the task_struct of the other threads.
> This isn't a serious concern.

It is a serious concern. Inventing new, subtle behavior differences
between user and kernel threads is, in a word, gross. It's certain
to bite people in the future.

--
Daniel

2002-09-01 15:30:44

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

> It is a serious concern. Inventing new, subtle behavior differences
> between user and kernel threads is, in a word, gross. It's certain
> to bite people in the future.
So you are suggesting that it's better to slow down *all* threads so
that it's possible to have kernel threads with automatically shared
credentials?
BTW, signals and rescheduling (unless PREEMPT=y) don't work
automatically for the same reason.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 16:33:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

> That's what the code did, unless I misunderstood it. Anyway if
> you want to give a different fsuid to a filesystem function,
> you either pass credentials as a parameter (that means that you
> change all the functions in the call chain to do that) or you

If you read through the beginning of the thread, you will see that
this is *exactly* what I'm proposing to do. Just not in this one
already very large patch.

Cheers,
Trond

2002-09-01 16:36:34

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

>> You are forgetting that the fscred might simultaneously be
>> referenced by an open struct file. Are you saying that this
>> file should suddenly see its credential change?
> No, it cannot be referenced by an open struct file because you
> copy the structure, not pointers to it.

So you are proposing to optimize for the rare case of setuid(),
instead of the more common case of file open()?

Cheers,
Trond

2002-09-01 18:38:37

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 18:38, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> > That's what the code did, unless I misunderstood it. Anyway if
> > you want to give a different fsuid to a filesystem function,
> > you either pass credentials as a parameter (that means that you
> > change all the functions in the call chain to do that) or you
>
> If you read through the beginning of the thread, you will see that
> this is *exactly* what I'm proposing to do. Just not in this one
> already very large patch.
But you'll need to modify the declaration of the various function
pointers whose implementations might need credentials and modify all
functions that call them and deal with permissions.
Instead with my proposal the credentials are automatically immutable
across the syscall without needing to worry at all about locks, counts
and sharing.

And remember that passing parameters has a cost since you need to push
them and the stack will be larger, occupy more cachelines, and thus
cause more cache misses.
All this without any particular reason to do since normally the
credentials are always the same.

Of course, if you have reason to think that we'll need to call VFS
functions with a lot of differential credential then you are right, but
otherwise you would add just a lot of useless overhead and code
modifications.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 18:50:30

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 18:40, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> >> You are forgetting that the fscred might simultaneously be
> >> referenced by an open struct file. Are you saying that this
> >> file should suddenly see its credential change?
> > No, it cannot be referenced by an open struct file because you
> > copy the structure, not pointers to it.
>
> So you are proposing to optimize for the rare case of setuid(),
> instead of the more common case of file open()?
No, this does not optimize for a setuid. It allows to easily make
temporary modifications to the credentials but that's not the main
focus.
Permanent modifications, if credentials are shared, would need to walk
the shared-cred tasklist and set the propagation flag on all tasks on it
so I'm surely not proposing to optimize for them.

And, in the common case of open, why do you need to copy the structure
to check permissions?
I think that open should just check the current values.
open might want to copy credentials in case you want to do the inode
lookup asynchronously but then it doesn't make sense to optimize for
this since you already have the huge disk read penalty.
BTW, the 2.5.32 open does the check in vfs_permission without copying
anything.
Anyway it's just a 3 long copy plus an atomic inc vs. 1 long copy and
atomic inc.
And if you don't need the groups array, it's just a 2 longs copy that on
some architectures with very slow atomic operations (e.g. sparc) is much
better.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 19:21:20

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

> But you'll need to modify the declaration of the various
> function pointers whose implementations might need credentials
> and modify all functions that call them and deal with
> permissions. Instead with my proposal the credentials are

Yes... And this is a useful activity in itself, as the existence of
all these hacks that temporarily change uid/gid/whatever... show.

> automatically immutable across the syscall without needing to
> worry at all about locks, counts and sharing.

I still have no opinion about your proposal for implementing CLONE_CRED.

What I fail to see is why you appear to insist it would be
incompatible with the idea of copy-on-write VFS credentials (which I
explained are interesting for other purposes).

I also fail to understand why you insist that we need to drop the idea
of copy-on-write credentials in order to optimize for this fringe case
in which somebody calls sys_access() or exec with euid != fsuid.

Now repeat after me

"changing fsuid/fsgid is *not* the common case that needs optimization."

Trond

2002-09-01 19:36:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

> And, in the common case of open, why do you need to copy the
> structure to check permissions? I think that open should just
> check the current values. open might want to copy credentials

Because, as has been explained to you, we have things like Coda,
Intermezzo, NFS, for which this is insufficient.

> in case you want to do the inode lookup asynchronously but then
> it doesn't make sense to optimize for this since you already
> have the huge disk read penalty. BTW, the 2.5.32 open does the
> check in vfs_permission without copying anything. Anyway it's
> just a 3 long copy plus an atomic inc vs. 1 long copy and
> atomic inc. And if you don't need the groups array, it's just
> a 2 longs copy that on some architectures with very slow atomic
> operations (e.g. sparc) is much better.

But we we do need to check the groups array in the VFS. And as Linus
pointed out, there is a good case for passing info from the
user_struct too (crypto), etc...

Cheers,
Trond

2002-09-01 21:29:59

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 21:40, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> > And, in the common case of open, why do you need to copy the
> > structure to check permissions? I think that open should just
> > check the current values. open might want to copy credentials
>
> Because, as has been explained to you, we have things like Coda,
> Intermezzo, NFS, for which this is insufficient.
If they only need them at open, and the open is synchronous, you don't
need to copy them.

Otherwise, if you need them in other syscall, or in another context, you
probably need to copy them, that only costs an additional 2 long copies.

> > in case you want to do the inode lookup asynchronously but then
> > it doesn't make sense to optimize for this since you already
> > have the huge disk read penalty. BTW, the 2.5.32 open does the
> > check in vfs_permission without copying anything. Anyway it's
> > just a 3 long copy plus an atomic inc vs. 1 long copy and
> > atomic inc. And if you don't need the groups array, it's just
> > a 2 longs copy that on some architectures with very slow atomic
> > operations (e.g. sparc) is much better.
>
> But we we do need to check the groups array in the VFS. And as Linus
> pointed out, there is a good case for passing info from the
> user_struct too (crypto), etc...
There is copy-on-write for the groups array.
The idea is that anything that takes little time to use should be in the
copy-always structure, while things that need a long time to access
(e.g. groups, lsm data) and/or are big, stay in the copy-on-write
structure referenced from the copy-always one.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 21:31:57

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 21:25, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> > But you'll need to modify the declaration of the various
> > function pointers whose implementations might need credentials
> > and modify all functions that call them and deal with
> > permissions. Instead with my proposal the credentials are
>
> Yes... And this is a useful activity in itself, as the existence of
> all these hacks that temporarily change uid/gid/whatever... show.
I disagree.
It reduces performance, and results in huge patches.
Furthermore you can't predict whether the target of a function pointer
will need credentials or not.

> > automatically immutable across the syscall without needing to
> > worry at all about locks, counts and sharing.
>
> I still have no opinion about your proposal for implementing CLONE_CRED.
>
> What I fail to see is why you appear to insist it would be
> incompatible with the idea of copy-on-write VFS credentials (which I
> explained are interesting for other purposes).
I don't insist on that (see below).
> I also fail to understand why you insist that we need to drop the idea
> of copy-on-write credentials in order to optimize for this fringe case
> in which somebody calls sys_access() or exec with euid != fsuid.
I do not propose to remove the idea of copy-on-write credentials, just
to use copy-on-write only for groups and "copy-always" for uid and gid
so that access to uid and gid doesn't need to go through the credentials
pointer.

You have code like this:
+ cred = get_current_vfscred();
fdata->fd_do_lml = 0;
- fdata->fd_fsuid = current->fsuid;
- fdata->fd_fsgid = current->fsgid;
+ fdata->fd_fsuid = cred->uid;
+ fdata->fd_fsgid = cred->gid;
fdata->fd_mode = file->f_dentry->d_inode->i_mode;
- fdata->fd_ngroups = current->ngroups;
- for (i=0 ; i<current->ngroups ; i++)
- fdata->fd_groups[i] = current->groups[i];
+ fdata->fd_ngroups = cred->ngroups;
+ for (i=0 ; i<cred->ngroups ; i++)
+ fdata->fd_groups[i] = cred->groups[i];
fdata->fd_bytes_written = 0; /*when open,written data is zero*/
file->private_data = fdata;
+ put_vfscred(cred);

The get/put here only make sense if something else can modify the
credentials without it.
Otherwise, you can skip them since the reference count will always be >
0 because nothing else can change the current->vfscred pointer, and that
will contribute to the reference count.

What I'm trying to do is to design credentials in such a way that, even
with shared credentials, the get/put won't be necessary.
I'm also trying to put such credentials directly in task_struct so that
the code doesn't need to get a pointer from task_struct and then fetch
the credentials from the structure it points to.

The order of likelyhood of credential operations is assumed to be the
following:
1. Use/copy/local temporary modification of uid and gid
2. Use/copy/local temporary modification of groups
3. Permanent modification of something
* Local temporary modification means that the modification only has
effect on the current task and is revert before the end of the syscall

To optimize for this I propose to:
1. Put uid and gid directly in task_struct (optimize for 1)
2. Use copy-on-write on groups so that we avoid copying an big array
(optimize for 2.copy)
3. Change credentials only when not performing a syscall (optimize for 1
and 2)

> "changing fsuid/fsgid is *not* the common case that needs optimization."
I completely agree with this, that's the whole point!


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-01 21:51:39

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

>>>>> " " == Luca Barbieri <[email protected]> writes:

>> Because, as has been explained to you, we have things like
>> Coda, Intermezzo, NFS, for which this is insufficient.
> If they only need them at open, and the open is synchronous,
> you don't need to copy them.

Bullshit. You clearly haven't got a clue what you are talking about.

For all these 3 systems credentials need to be cached from file open
to file close.

YES EVEN NOW, WITH NO CLONE_CRED AND WITH SYNCHRONOUS OPEN !!!!

On something like NFS or Coda, the server needs to receive
authentication information for each and every RPC call we send to
it. There's no state. The server does not know that we have opened a
file.

Currently this is done by the NFS client hiding information in the
file's private_data field. This means that other people that want to
do write-through-caching etc are in trouble 'cos they have to cope
with the fact that NFS has its private field, Coda has its private
field,... And they are all doing the same thing in different ways.

Now stop trolling

Trond

2002-09-01 22:46:08

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Sun, 2002-09-01 at 23:56, Trond Myklebust wrote:
> >>>>> " " == Luca Barbieri <[email protected]> writes:
>
> >> Because, as has been explained to you, we have things like
> >> Coda, Intermezzo, NFS, for which this is insufficient.
> > If they only need them at open, and the open is synchronous,
> > you don't need to copy them.
>
> Bullshit. You clearly haven't got a clue what you are talking about.
> For all these 3 systems credentials need to be cached from file open
> to file close.
>
> YES EVEN NOW, WITH NO CLONE_CRED AND WITH SYNCHRONOUS OPEN !!!!
>
> On something like NFS or Coda, the server needs to receive
> authentication information for each and every RPC call we send to
> it. There's no state. The server does not know that we have opened a
> file.
But then in the _open_ syscall you don't need to send them from a copy.
However, since you need them in the later syscalls, you need to copy
them to the file structure for the later syscalls in open, but you don't
need to use copied credentials for the operation of opening a file
(assuming it's done synchronously within sys_open).
Anyway this is only relevant to decide whether to always copy uid and
gid or to use copy-write on them and access them with an extra memory
read (to read the pointer to the copy-on-write structure), which is not
the main issue.

BTW, imho a correctly designed network filesystem should have a single
stateful encrypted connection (or a pool of equally authenticated ones)
and credentials (i.e. passwords) should only be passed when the user
makes the first filesystem access. After that the server should do
authentication with the OR of all credentials received and the client
kernel should further decide whether it can give access to a particular
user.
This is off-topic here, though.

> Currently this is done by the NFS client hiding information in the
> file's private_data field. This means that other people that want to
> do write-through-caching etc are in trouble 'cos they have to cope
> with the fact that NFS has its private field, Coda has its private
> field,... And they are all doing the same thing in different ways.
Yes, I agree with the need to provide copy-on-write.
I just disagree with the idea that code should be written to handle
changing credentials and that credentials should be passed as parameters
and I suggest that always copying uid and gid might be better since
accessing uid and gid is more frequent and happens in faster paths than
copying credentials.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-08 22:01:03

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

> Now if this is a multithreaded application that does this in one thread
> and another thread happens to open a completely unrelated file around
> the time that the first thread drops this application's setuid
> permissions. If we don't use a copy during the open upcall, but copy it
> after the fact, we don't have the correct permissions around the time
> the file is closed.
You would copy them during the filesystem open.
My point was that in the generic vfs open there is no need to use copied
credentials so credentials copying can be restricted to network
filesystems with not-very-good designs.

> > BTW, imho a correctly designed network filesystem should have a single
> > stateful encrypted connection (or a pool of equally authenticated ones)
> > and credentials (i.e. passwords) should only be passed when the user
> > makes the first filesystem access. After that the server should do
> > authentication with the OR of all credentials received and the client
> > kernel should further decide whether it can give access to a particular
> > user.
>
> Right, which is pretty close to what Coda does.
This is in contradiction with your statement about credentials sent
during close.
The advantage of the model I described is that, with the exception of
connection management, it works exactly like a normal filesystem with
the exception of some totally inaccessible files due to server access
denies.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-09-09 06:17:42

by Jan Harkes

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Mon, Sep 09, 2002 at 12:04:03AM +0200, Luca Barbieri wrote:
> > Now if this is a multithreaded application that does this in one thread
> > and another thread happens to open a completely unrelated file around
> > the time that the first thread drops this application's setuid
> > permissions. If we don't use a copy during the open upcall, but copy it
> > after the fact, we don't have the correct permissions around the time
> > the file is closed.
> You would copy them during the filesystem open.
> My point was that in the generic vfs open there is no need to use copied
> credentials so credentials copying can be restricted to network
> filesystems with not-very-good designs.
>
> > > BTW, imho a correctly designed network filesystem should have a single
> > > stateful encrypted connection (or a pool of equally authenticated ones)
> > > and credentials (i.e. passwords) should only be passed when the user
> > > makes the first filesystem access. After that the server should do
> > > authentication with the OR of all credentials received and the client
> > > kernel should further decide whether it can give access to a particular
> > > user.
> >
> > Right, which is pretty close to what Coda does.
> This is in contradiction with your statement about credentials sent
> during close.

How so, Coda has a pool of authenticated connections per user. But the
userspace cachemanager still needs to know which user performed the
operation in order to pick the right connection to send the message.

If user A opens a file and then user B writes/closes it, should it send
the write/close message over the authenticated connection of user B?
Ofcourse not, so we need to store the credentials of the opener with the
file.

This behaviour is the same for any filesystem that performs/requires
additional permission checking after a file is opened, i.e. probably all
network filesystems.

> The advantage of the model I described is that, with the exception of
> connection management, it works exactly like a normal filesystem with
> the exception of some totally inaccessible files due to server access
> denies.

Filedescriptors being used by processes with different user credentials
compared to the opener are more common than you might think. One common
example is stdout/stderr redirection with setuid apps in a shellscript,
another is due to fd passing through Unix domain sockets, or apps that
open files and then drop their priveledges. And it happens even more
with network sockets and pipes, although those possibly don't affect the
VFS (haven't checked if they use generic VFS paths, pipes probably do,
network sockets might).

Jan

2002-09-09 11:13:08

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Initial support for struct vfs_cred [0/1]

On Mon, 2002-09-09 at 08:22, Jan Harkes wrote:
> On Mon, Sep 09, 2002 at 12:04:03AM +0200, Luca Barbieri wrote:
> > > Now if this is a multithreaded application that does this in one thread
> > > and another thread happens to open a completely unrelated file around
> > > the time that the first thread drops this application's setuid
> > > permissions. If we don't use a copy during the open upcall, but copy it
> > > after the fact, we don't have the correct permissions around the time
> > > the file is closed.
> > You would copy them during the filesystem open.
> > My point was that in the generic vfs open there is no need to use copied
> > credentials so credentials copying can be restricted to network
> > filesystems with not-very-good designs.
> >
> > > > BTW, imho a correctly designed network filesystem should have a single
> > > > stateful encrypted connection (or a pool of equally authenticated ones)
> > > > and credentials (i.e. passwords) should only be passed when the user
> > > > makes the first filesystem access. After that the server should do
> > > > authentication with the OR of all credentials received and the client
> > > > kernel should further decide whether it can give access to a particular
> > > > user.
> > >
> > > Right, which is pretty close to what Coda does.
> > This is in contradiction with your statement about credentials sent
> > during close.
>
> How so, Coda has a pool of authenticated connections per user. But the
> userspace cachemanager still needs to know which user performed the
> operation in order to pick the right connection to send the message.
>
> If user A opens a file and then user B writes/closes it, should it send
> the write/close message over the authenticated connection of user B?
> Ofcourse not, so we need to store the credentials of the opener with the
> file.
My point is that all users should be authenticated on all connections.
You may have to remember which connection was used to open the file, but
it's probably much better to design the protocol so that this isn't
necessary: this would allow efficient load balancing with per-cpu or
per-nic connections.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part