2007-06-04 16:40:38

by Jeff Dike

Subject: Syslets, signals, and security

Syslets seem like a fundamentally good idea to me, but the current
implementation, using CLONE_THREAD threads, seems like a basic
problem.

First, there are signals. If the app has an interval timer enabled,
every thread will inherit it and you will have 32 threads getting
alarms, which seems surprising and wasteful. I can imagine other
signals (like SIGWINCH and SIGHUP) which might be surprising to
receive multiple times in an ostensibly single-threaded process.

Second, security. What happens if a well-written server starts life
as root, does some (async) I/O, and setuids to a non-root uid? There
will be a bunch of async threads still running as root, with the
result that async operations (and the main thread) will more than
occasionally regain root privs.
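
To make the hazard concrete, here's a sketch (the submission call
below is a hypothetical stand-in, not the real syslet interface):

#include <fcntl.h>
#include <unistd.h>

extern void submit_async_read(int fd);	/* hypothetical */

int main(void)
{
	/* Succeeds because euid is still 0. */
	int fd = open("/etc/shadow", O_RDONLY);

	submit_async_read(fd);	/* kernel clones a uid-0 worker */

	setuid(65534);		/* main thread drops privileges... */

	/* ...but the parked worker's task_struct still carries uid 0,
	   so whatever it services next runs with root credentials. */
	submit_async_read(fd);
	return 0;
}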

You can fix this by going around to all the threads in the thread
group and changing their ->uid, but that seems somewhat kludgy. You
could also fix this by killing off the async threads on a setuid, and
make the app recreate the async threads under its new uid. However,
the current interface has no way to give the userspace stacks back to
the process, so this would force a memory leak on the process.

There's also ptrace, which (as I think Zach already mentioned) gets
along badly with syslets.

Offhand, it seems to me that the underlying problem is that the
threads all have their own task_structs and userspaces.

If only the main thread can receive signals and exit to userspace, the
signal surprises go away.

Similarly, if they all share a task structure, the setuid problem goes
away.

There are also warts in the current interface due to all of the
threads possibly being able to return to userspace:
 - they all need to be provided with userspace stacks
 - they need to be given a user IP to execute
 - there is a new system call for them to call when they land in
   userspace

Since the basic schedulable unit is currently a task_struct, and
syslets would prefer to have multiple schedulable units per
task_struct, this would imply some surgery on the task_struct and the
scheduler. What I think we'd want is for the basic schedulable unit
to be not much more than a kernel stack and a register set. A
(non-kernel-thread) task_struct would still be associated 1-1 with a
userspace, but could have multiple schedulable units, one of which is
allowed to exit to userspace.
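
Something like this, purely illustrative (none of these names exist
in the kernel today):

struct sched_unit {			/* the schedulable unit */
	void		*kernel_stack;
	struct pt_regs	regs;		/* saved register set */
	/* scheduler linkage: state, run queue hooks, ... */
};

struct task_struct {
	/* ...all the shared per-process state: ->mm, credentials,
	   signal handling, files, ... */
	struct list_head  sched_units;	/* many units per task */
	struct sched_unit *user_unit;	/* the only one allowed to
					   exit to userspace */
};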

Jeff

--
Work email - jdike at linux dot intel dot com


2007-06-04 17:17:10

by Ulrich Drepper

Subject: Re: Syslets, signals, and security

On 6/4/07, Jeff Dike <[email protected]> wrote:
> First, there are signals. If the app has an interval timer enabled,
> every thread will inherit it and you will have 32 threads getting
> alarms, which seems surprising and wasteful.

Not only that. IIRC the current code does nothing about blocking
signals while the thread is not yet a real thread. If these worker
threads were to receive a signal sent to the process (process group
in kernel-speak), this could severely irritate the application. The
signal mask of a new thread is inherited from the parent thread; if
the parent has a signal unblocked, the child will have it unblocked
by default. Until and unless the aio-created thread is used to
deliver a notification because SIGEV_THREAD is selected, it must not
appear to be doing any work for the process.

This should be easy to fix: when creating a thread for aio, the kernel
should block all signals (probably including SIGSTOP etc). If and
when the thread is seen at userlevel the glibc code can then set a
reasonable signal mask if the thread is to be reused.
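
In kernel terms the fix is roughly one line at worker-creation time
(illustrative, not a real patch):

	/* No process-directed signal gets routed to this thread;
	   SIGKILL/SIGSTOP are special-cased by the kernel anyway. */
	sigfillset(&worker->blocked);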


> I can imagine other
> signals (like SIGWINCH and SIGHUP) which might be surprising to
> receive multiple times in an ostensibly single-threaded process.

Normal signals are sent to only one thread in the process. If a
thread cannot handle the signal it has to block it. The kernel will
then choose a different thread or keep it pending. Special signals
are delivered to all threads. SIGSTOP, for instance. There is a lot
of special code in the kernel to handle these cases and they should be
fine.
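
The userlevel side of that rule is just per-thread masking:

#include <pthread.h>
#include <signal.h>

static void *worker(void *arg)
{
	sigset_t set;

	/* Block everything blockable in this thread; a signal sent
	   to the process is then delivered to another eligible
	   thread, or stays pending if all threads block it. */
	sigfillset(&set);
	pthread_sigmask(SIG_BLOCK, &set, NULL);

	/* ... do the actual work ... */
	return NULL;
}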


> Second, security. What happens if a well-written server starts life
> as root, does some (async) I/O, and setuids to a non-root uid? There
> will be a bunch of async threads still running as root, with the
> result that async operations (and the main thread) will more than
> occasionally regain root privs.

This is indeed a problem and the reason is that we are forced to
implement setuid etc at userlevel. The code currently works by
sending a special signal to all threads in the process and the signal
handler issues the appropriate setuid etc call.
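
Schematically the scheme looks like this (heavily simplified; the
iterator and wait helpers are made up, and the internal signal is not
exported to applications; the real NPTL code differs in detail):

static volatile uid_t pending_uid;	/* shared with the handler */

/* Every thread's handler for the internal signal re-issues the
   syscall so its own task_struct picks up the new credentials. */
static void setxid_handler(int sig)
{
	syscall(SYS_setuid, pending_uid);
	/* ...then tell the initiating thread we're done. */
}

int setuid(uid_t uid)
{
	pending_uid = uid;
	for_each_thread(t)	/* made-up iterator over all threads */
		syscall(SYS_tgkill, getpid(), t->tid, SIGSETXID);
	wait_for_all_threads();	/* made-up: block until all applied it */
	return syscall(SYS_setuid, uid);	/* finally, ourselves */
}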

We would need to extend this implementation to all threads created via
AIO. This requires that we can send a signal to the syslet before it
has left the kernel and maybe before it's done with the request. The
reason is that the thread calling setuid etc has to wait until all
threads have been successfully changed. We cannot indefinitely delay
the thread just because of outstanding AIO requests.

My preferred solution would still be to implement setuid etc in the
kernel. I.e., have a new syscall setuid_pgrp or so.


> You can fix this by going around to all the threads in the thread
> group and changing their ->uid, but that seems somewhat kludgy.

That's what we do already and I couldn't agree more with you.


> You
> could also fix this by killing off the async threads on a setuid, and
> make the app recreate the async threads under its new uid.

This cannot work reliably; some of the AIO might already have happened.


> There's also ptrace, which (as I think Zach already mentioned) gets
> along badly with syslets.

syslets should be invisible to ptrace until they enter userlevel.

2007-06-04 17:46:15

by Zach Brown

Subject: Re: Syslets, signals, and security

On Mon, Jun 04, 2007 at 12:31:45PM -0400, Jeff Dike wrote:
> Syslets seem like a fundamentally good idea to me, but the current
> implementation, using CLONE_THREAD threads, seems like a basic
> problem.

It has remaining problems that need to be addressed, yes.

> First, there are signals. If the app has an interval timer enabled,
> every thread will inherit it and you will have 32 threads getting
> alarms, which seems surprising and wasteful. I can imagine other
> signals (like SIGWINCH and SIGHUP) which might be surprising to
> receive multiple times in an ostensibly single-threaded process.

Yeah. I'm hoping someone who knows more about signals than I do would
be excited about looking into the best way to fix this. I'm sure I
could plod through it in time, but, bleh. SIGXFSZ is the one in the
back of my mind.

> Second, security. What happens if a well-written server starts life
> as root, does some (async) I/O, and setuids to a non-root uid? There
> will be a bunch of async threads still running as root, with the
> result that async operations (and the main thread) will more than
> occasionally regain root privs.

Yeah. I wanted to experiment with this in the fs/aio.c syslet
management code. The most obvious fix is to just tear down all the
waiting threads before each operation. We have to have a worst case to
measure :). More incentive to get thread creation and destruction even
cheaper might not be such a bad thing, too.

One can imagine all sorts of crazy schemes which let us only shoot down
waiting async threads which were cloned before state in the submitting
task_struct was changed. Maybe we could swallow the cost of bumping a
counter in task_struct each time we change one of these sensitive
fields (fsuid, etc.), but I bet the maintenance burden of anything
more fine-grained than that would get pretty excessive.
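
A coarse version of that might look like (field and helper names
invented for illustration):

	/* In setuid(), setfsuid(), etc.: bump a per-task generation
	   number whenever a sensitive field changes. */
	current->cred_seq++;

	/* A worker snapshots the submitter's cred_seq when it is
	   cloned; a stale snapshot means it must not be reused. */
	if (worker->cred_seq_snapshot != submitter->cred_seq)
		tear_down(worker);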

> You can fix this by going around to all the threads in the thread
> group and changing their ->uid, but that seems somewhat kludgy.

No doubt racy and unsafe, too.

> There's also ptrace, which (as I think Zach already mentioned) gets
> along badly with syslets.

Yeah, and I'm blissfully ignorant of ptrace. Imagine me skipping
through a field of daisies with some ptrace wolves hiding in the bushes
at the edge of the meadow. La la la.

> Offhand, it seems to me that the underlying problem is that the
> threads all have their own task_structs and userspaces.

Well, don't conflate the two issues.

Each execution context having its own task struct is intentional, very
much so. Remember, this started with the fibrils patch series which
indeed shared a single task struct amongst a set of running stacks and
thread infos, etc. This approach is invasive because it changes the
sleeping entity in the scheduler from a task struct to some new
construct which has a many to one mapping to the task struct. It
touches vast amounts of code. This approach is also risky because it
introduces concurrent access to the task struct. That's a pretty big
auditing burden. So the syslet approach is a more conservative
alternative. I think we'll need a pretty strong demonstration of why we
*can't* fix the problems you're seeing with the syslets approach before
taking on the amount of effort needed to divorce the notion of tasks
from scheduling entities.

Each thread being able to return to userspace is an interface design
decision that I'm not prepared to defend. My current preference for a
user-facing API would be a lot stupider than what Ingo has built around
the atoms, and such, but I haven't taken up that battle as I'm focusing
on the fs/aio.c use of syslets.

My intention with the fs/aio.c syslet use is to only ever have one of
the worker threads returning to userspace. That's not quite implemented
now. I think cachemiss threads servicing fs/aio.c could get quite
confused if they exit their scheduling loop when a signal is raised, but
that's a bug -- not a feature :).

> Since the basic schedulable unit is currently a task_struct, and
> syslets would prefer to have multiple schedulable units per
> task_struct, this would imply some surgery on the task_struct and the
> scheduler. What I think we'd want is for the basic schedulable unit
> to be not much more than a kernel stack and a register set. A
> (non-kernel-thread) task_struct would still be associated 1-1 with a
> userspace, but could have multiple schedulable units, one of which is
> allowed to exit to userspace.

Have you looked at how the fibrils stuff did it? It's a lot more work
than it seems like it should be, on first glance. You start to modify
things and every few days you trip across another fundamental kernel
construct which needs to be modified :/.

http://lwn.net/Articles/219954/

So can we dig a little deeper into what it will take to fix these
problems with the syslet approach?

- z

2007-06-04 19:22:21

by Jeff Dike

Subject: Re: Syslets, signals, and security

On Mon, Jun 04, 2007 at 10:45:42AM -0700, Zach Brown wrote:
> > Second, security. What happens if a well-written server starts life
> > as root, does some (async) I/O, and setuids to a non-root uid? There
> > will be a bunch of async threads still running as root, with the
> > result that async operations (and the main thread) will more than
> > occasionally regain root privs.

> One can imagine all sorts of crazy schemes which let us only shoot down
> waiting async threads which were cloned before state in the submitting
> task_struct was changed. Maybe we could swallow increasing a counter in
> task_struct each time we change one of these sensitive fields (fsuid,
> etc), but I bet the maintenance burden of anything more fine grained
> than that would get pretty excessive.

How about splitting the credentials out of the task_struct and making
them shareable a la ->mm et al.? You change the uid there and it
changes for everyone. It will make fork slightly more expensive,
though.
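
A sketch of the split (names illustrative, nothing like this exists
today):

struct shared_creds {
	atomic_t count;			/* shared a la ->mm */
	uid_t	 uid, euid, suid, fsuid;
	gid_t	 gid, egid, sgid, fsgid;
	/* capabilities, keys, ... */
};

struct task_struct {
	/* ... */
	struct shared_creds *creds;	/* all threads, async workers
					   included, point at the
					   same object */
};

setuid() then updates *current->creds once, and every thread sees the
change immediately.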

> Yeah, and I'm blissfully ignorant of ptrace. Imagine me skipping
> through a field of daisies with some ptrace wolves hiding in the bushes
> at the edge of the meadow. La la la.

Heh, I'm somewhat less ignorant of ptrace, so I'll see if I can help
out there.

> Each execution context having its own task struct is intentional, very
> much so. Remember, this started with the fibrils patch series which
> indeed shared a single task struct amongst a set of running stacks and
> thread infos, etc. This approach is invasive because it changes the
> sleeping entity in the scheduler from a task struct to some new
> construct which has a many to one mapping to the task struct. It
> touches vast amounts of code. This approach is also risky because it
> introduces concurrent access to the task struct. That's a pretty big
> auditing burden.

Yeah, I realize that, but have no idea how much code that requires
looking at.

> Have you looked at how the fibrils stuff did it? It's a lot more work
> than it seems like it should be, on first glance. You start to modify
> things and every few days you trip across another fundamental kernel
> construct which needs to be modified :/.
>
> http://lwn.net/Articles/219954/

Ah, I somehow missed this. Since you seem to have explored that area
of the solution space already and found it wanting, I agree that it
makes sense to see if syslets can be made to work.

Jeff
--
Work email - jdike at linux dot intel dot com

2007-06-04 20:01:17

by Andi Kleen

Subject: Re: Syslets, signals, and security

Jeff Dike <[email protected]> writes:

> How about splitting the credentials out of the task_struct and making
> them shareable a la ->mm et al.? You change the uid there and it
> changes for everyone. It will make fork slightly more expensive,
> though.

Strictly speaking, that's required by POSIX anyway. But it's a real mess.

The problem is that you would need to reference-count or lock them in
every syscall or ioctl or similar. Otherwise another thread could
change them in the middle of a syscall, which wouldn't be good. Doing
this full reference counting would probably be somewhat expensive,
with more locked cycles, and also a lot of work to implement. You
would need to audit large parts of the source tree.
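
Concretely, every syscall would need something like this (helpers
illustrative, building on the shared_creds sketch above):

long sys_example(void)
{
	struct shared_creds *c = get_creds(current);	/* atomic_inc */
	long err;

	/* Permission checks use the snapshot, so they can't be
	   changed out from under us by a concurrent setuid(). */
	err = do_something_as(c);

	put_creds(c);		/* atomic_dec, free on last ref */
	return err;
}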

I don't think it's a good idea.

-Andi

2007-06-04 22:40:17

by Alan

Subject: Re: Syslets, signals, and security

> The problem is that you would need to reference-count or lock them in
> every syscall or ioctl or similar. Otherwise another thread could
> change them in the middle of a syscall, which wouldn't be good. Doing
> this full reference counting would probably be somewhat expensive,
> with more locked cycles, and also a lot of work to implement. You
> would need to audit large parts of the source tree.

It also mucks up the rather more powerful existing behaviour unless we
make the changing optional, in which case it gets *really ugly*.

> I don't think it's a good idea.

Ditto (and glibc handles it for the userspace POSIX APIs).

2007-06-05 11:11:25

by Andi Kleen

Subject: Re: Syslets, signals, and security

> > I don't think it's a good idea.
>
> Ditto (and glibc handles it for userspace posix APIs)

How?

-Andi