2007-05-01 17:30:45

by Bill Irwin

Subject: Re: per-thread rusage

On Mon, Apr 09, 2007 at 04:53:15PM -0700, Andrew Morton wrote:
>> Seems sane. Could we please get it tested and get a full description in
>> place? Something which provides enough detail for the manpage maintainers.
>> Also, a quick comparison between Linux's RUSAGE_THREAD and $other-os's
>> implementations would reduce the possibility of silly, cast-in-stone
>> incompatibilities.

On Mon, Apr 09, 2007 at 05:42:01PM -0700, William Lee Irwin III wrote:
> The latter is the more serious of the two. I'll go about investigating
> that as the primary task here. Testing and a more verbose patch
> description are clearly very little work.
> General maintenance-relevant commentary: This patch arose from an
> observation of a lacuna in the API. There are no bugs or apps broken
> awaiting this as a fix, so it's not needed by 2.6.22 or otherwise
> urgently. My use for it is report generation in VM (and possibly other)
> testcases. The ack-in-concept is good enough for me to go about
> sweeping up the OS/standards compatibility, testing, and documentation
> issues in the near future prior to resubmission.

A sort of note for me to refer back to when I get the rest of the way
here. AIX does this with getrusage(RUSAGE_THREAD,...), Solaris with
getrusage(RUSAGE_LWP,...), Tru64 and HP-UX seem to lack any obvious
way to do this at all, likewise for MacOS X and the open-source BSDs.
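
For readers following along, here is a minimal sketch of what the getrusage() flag under discussion looks like from userspace. It assumes a Linux system whose headers define RUSAGE_THREAD (glibc exposes it under _GNU_SOURCE); the helper names current_thread_rusage() and print_thread_rusage() are illustrative only and are not part of any patch in this thread.

```c
#define _GNU_SOURCE            /* glibc exposes RUSAGE_THREAD only with this */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1        /* Linux value; fallback if the headers hide it */
#endif

/* Fill *out with the calling thread's usage. Returns 0 on success, -1 on error. */
int current_thread_rusage(struct rusage *out)
{
    return getrusage(RUSAGE_THREAD, out);
}

/* Print the user and system CPU time consumed by the calling thread. */
int print_thread_rusage(void)
{
    struct rusage ru;

    if (current_thread_rusage(&ru) != 0) {
        perror("getrusage(RUSAGE_THREAD)");
        return -1;
    }
    printf("user %ld.%06lds sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}
```

Note that this form can only report on the calling thread, which is one of the limitations raised later in the thread.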


-- wli


2007-05-01 18:39:48

by Ulrich Drepper

Subject: Re: per-thread rusage

On 5/1/07, Bill Irwin wrote:
> A sort of note for me to refer back to when I get the rest of the way
> here. AIX does this with getrusage(RUSAGE_THREAD,...), Solaris with
> getrusage(RUSAGE_LWP,...),

RUSAGE_LWP is a remnant of Solaris' M-on-N thread library days. No
reason to go there. Use RUSAGE_THREAD. Even though the kernel calls
them process and process group, in userland these are threads and
processes.

2007-05-01 20:25:36

by Bill Irwin

Subject: Re: per-thread rusage

On 5/1/07, Bill Irwin wrote:
>> A sort of note for me to refer back to when I get the rest of the way
>> here. AIX does this with getrusage(RUSAGE_THREAD,...), Solaris with
>> getrusage(RUSAGE_LWP,...),

On Tue, May 01, 2007 at 11:39:46AM -0700, Ulrich Drepper wrote:
> RUSAGE_LWP is a remnant of Solaris' M-on-N thread library days. No
> reason to go there. Use RUSAGE_THREAD. Even though the kernel calls
> them process and process group, in userland these are threads and
> processes.

Well, this is just the part where I'm surveying how other OS's report
getrusage-like info on a per-thread basis. As far as this is concerned,
RUSAGE_LWP is just Solaris nomenclature for it. The implementation
details (e.g. M:N thread disambiguation) are not so important for this.
What I think it means is "Solaris and AIX pass a nonstandard flag to
getrusage() for the same purpose."

And actually, you're a good person to ask about all this. How do other
kernels/OS's report per-thread analogues of rusage information? Googling
for getrusage() manpages is probably not going to find ones that, say,
force userspace to fish it out of /proc/ analogues and so on. The basic
idea is to try to do it similarly to how everyone else does so userspace
(I suppose this would include glibc) doesn't have to bend over backward to
accommodate it. Or basically to do what everyone expects.


- wli

2007-05-01 22:10:42

by Ulrich Drepper

Subject: Re: per-thread rusage

On 5/1/07, Bill Irwin wrote:
> The basic
> idea is to try to do it similarly to how everyone else does so userspace
> (I suppose this would include glibc) doesn't have to bend over backward to
> accommodate it. Or basically to do what everyone expects.

I think besides RUSAGE_THREAD you'll find no precedent. It's all new,
you have to tread the path. The RUSAGE_THREAD interface is not
sufficient, actually. First, if a thread terminates we don't have to
keep it around until a wait call can be issued. We terminate
threads right away and the synchronization with waiters is done
independently. Second, the thread ID (aka kernel process ID) is not
exported, nor should it be. This is easy to solve, though: introduce a
pthread_getrusage interface.

To solve the first problem the terminating thread should write out the
data before it is gone. Automatically. After registration. So, you
could have a syscall to register a structure in the user address space
which is filled with the data. If the data structure is the same as
rusage you're done. If you use a different data structure you need to
introduce a getrusage-equivalent syscall.

With this infrastructure in place we could have

int pthread_getrusage(pthread_t, struct rusage *);

and

int pthread_join4(pthread_t, void ** valueptr, struct rusage *);

pthread_join4 is a joke, we need a better name, but you get the drift.
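
To make the "write out the data before it is gone" idea concrete, here is a rough userspace approximation, not the kernel-side registration proposed above. It assumes a Linux RUSAGE_THREAD is available, and every name in it (struct rusage_thread, rusage_thread_create(), rusage_thread_join(), spin_worker()) is hypothetical. The thread records its own usage in a cleanup handler just before exiting, and a join wrapper in the spirit of the pthread_join4 prototype hands it back.

```c
#define _GNU_SOURCE            /* for RUSAGE_THREAD on glibc */
#include <errno.h>
#include <pthread.h>
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1        /* Linux value; fallback if the headers hide it */
#endif

/* Hypothetical per-thread record; the creator keeps it alive until join. */
struct rusage_thread {
    pthread_t tid;
    void *(*fn)(void *);
    void *arg;
    struct rusage usage;       /* written by the thread itself on exit */
    int usage_ok;
};

/* Cleanup handler: runs in the exiting thread, so RUSAGE_THREAD means "me". */
static void record_usage(void *p)
{
    struct rusage_thread *rt = p;
    rt->usage_ok = (getrusage(RUSAGE_THREAD, &rt->usage) == 0);
}

static void *rusage_trampoline(void *p)
{
    struct rusage_thread *rt = p;
    void *ret;

    /* The handler also fires on pthread_exit() and cancellation. */
    pthread_cleanup_push(record_usage, rt);
    ret = rt->fn(rt->arg);
    pthread_cleanup_pop(1);    /* run it on a normal return as well */
    return ret;
}

/* Start fn(arg) in a new thread whose usage is captured at exit. */
int rusage_thread_create(struct rusage_thread *rt,
                         void *(*fn)(void *), void *arg)
{
    rt->fn = fn;
    rt->arg = arg;
    rt->usage_ok = 0;
    return pthread_create(&rt->tid, NULL, rusage_trampoline, rt);
}

/* Stand-in for the "pthread_join4" idea: join, then copy out the usage. */
int rusage_thread_join(struct rusage_thread *rt, void **valueptr,
                       struct rusage *out)
{
    int err = pthread_join(rt->tid, valueptr);

    if (err != 0)
        return err;
    if (!rt->usage_ok)
        return EIO;            /* the thread could not record its usage */
    *out = rt->usage;
    return 0;
}

/* Example worker: burns a little CPU, then returns its argument. */
void *spin_worker(void *arg)
{
    volatile unsigned long x = 0;

    for (unsigned long i = 0; i < 5000000UL; i++)
        x += i;
    return arg;
}
```

Build with -pthread. Unlike the proposal, this costs an extra getrusage() call in every thread and only works for threads started through the wrapper, which is part of why a registered, kernel-filled structure was suggested.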

2007-05-01 22:27:55

by Theodore Ts'o

Subject: Re: per-thread rusage

On Tue, May 01, 2007 at 03:10:40PM -0700, Ulrich Drepper wrote:
> I think besides RUSAGE_THREAD you'll find no precedent. It's all new,
> you have to tread the path. The RUSAGE_THREAD interface is not
> sufficient, actually. First, if a thread terminates we don't have to
> keep it around until a wait call can be issued. We terminate
> threads right away and the synchronization with waiters is done
> independently. Second, the thread ID (aka kernel process ID) is not
> exported, nor should it be. This is easy to solve, though: introduce a
> pthread_getrusage interface.

Hey Ulrich,

It turns out this could be useful implementing something
called "Cost Enforcement" in the Real Time Specification for Java,
which is an optional part of the specification, but which some
customers have wanted.

The basic idea is that the thread tells the JVM how much time
(either CPU or wall clock) it will consume, and if it takes more than
the specified amount of time, the assumption is that the thread has
malfunctioned or there has been some programming error, and the thread
should get the Java equivalent of a SIGXCPU.

There are two ways of implementing this. One is to have the
JVM periodically poll using a pthread_getrusage() interface. A better
choice might be some kind of per-thread CPU limit, that would result
in a thread-specific SIGXCPU signal. But there are no interfaces
today that do anything like this.

Do you have any thoughts or preferences about how this might
be done, if we tried to go about doing something like a per-thread
SIGXCPU functionality? If not, pthread_getrusage() might be
sufficient, if not the most efficient way of doing things.

Regards,

- Ted

2007-05-01 22:30:08

by Bill Irwin

Subject: Re: per-thread rusage

On 5/1/07, Bill Irwin wrote:
>> The basic
>> idea is to try to do it similarly to how everyone else does so userspace
>> (I suppose this would include glibc) doesn't have to bend over backward to
>> accommodate it. Or basically to do what everyone expects.

On Tue, May 01, 2007 at 03:10:40PM -0700, Ulrich Drepper wrote:
> I think besides RUSAGE_THREAD you'll find no precedent. It's all new,
> you have to tread the path. The RUSAGE_THREAD interface is not
> sufficient, actually. First, if a thread terminates we don't have to
> keep it around until a wait call can be issued. We terminate
> threads right away and the synchronization with waiters is done
> independently. Second, the thread ID (aka kernel process ID) is not
> exported, nor should it be. This is easy to solve, though: introduce a
> pthread_getrusage interface.

Sounds reasonable enough. I can follow directions. I'd not be concerned
if you happen to write it yourself, though I'll get around to it if you
don't.


On Tue, May 01, 2007 at 03:10:40PM -0700, Ulrich Drepper wrote:
> To solve the first problem the terminating thread should write out the
> data before it is gone. Automatically. After registration. So, you
> could have a syscall to register a structure in the user address space
> which is filled with the data. If the data structure is the same as
> rusage you're done. If you use a different data structure you need to
> introduce a getrusage-equivalent syscall.
> With this infrastructure in place we could have
> int pthread_getrusage(pthread_t, struct rusage *);
> and
> int pthread_join4(pthread_t, void ** valueptr, struct rusage *);
> pthread_join4 is a joke, we need a better name, but you get the drift.

This addresses further missing pieces in the API beyond what the
getrusage() flag did. I'll follow this precisely for any patch posted
for merging, and Cc: you on such.


-- wli

2007-05-01 22:35:24

by Bill Irwin

Subject: Re: per-thread rusage

On Tue, May 01, 2007 at 06:27:24PM -0400, Theodore Tso wrote:
> There are two ways of implementing this. One is to have the
> JVM periodically poll using a pthread_getrusage() interface. A better
> choice might be some kind of per-thread CPU limit, that would result
> in a thread-specific SIGXCPU signal. But there are no interfaces
> today that do anything like this.
> Do you have any thoughts or preferences about how this might
> be done, if we tried to go about doing something like a per-thread
> SIGXCPU functionality? If not, pthread_getrusage() might be
> sufficient, if not the most efficient way of doing things.

I just so happen to think we should implement a variety of CPU resource
limits beyond what we now do, so this, too, interests me.


-- wli

2007-05-01 23:02:35

by Alan

Subject: Re: per-thread rusage

> I just so happen to think we should implement a variety of CPU resource
> limits beyond what we now do, so this, too, interests me.

Agreed - and make them all 64bit while doing the cleanup. One thing
several Unixen have that we don't for 32bit boxes is a proper set of 64bit
resource handling for memory/file etc.

We could also start using the CPU facilities to enforce some of
the really interesting real time process ones (like main memory
bandwidth) that at the moment we have no control over and can lead to
very unfair behaviour.

Alan

2007-05-01 23:18:07

by Bill Irwin

Subject: Re: per-thread rusage

At some point in the past, I wrote:
>> I just so happen to think we should implement a variety of CPU resource
>> limits beyond what we now do, so this, too, interests me.

On Wed, May 02, 2007 at 12:04:58AM +0100, Alan Cox wrote:
> Agreed - and make them all 64bit while doing the cleanup. One thing
> several Unixen have that we don't for 32bit boxes is a proper set of 64bit
> resource handling for memory/file etc.
> We could also start using the CPU facilities to enforce some of
> the really interesting real time process ones (like main memory
> bandwidth) that at the moment we have no control over and can lead to
> very unfair behaviour.

That would be very useful, though I'm unsure of how broad a variety of
architectures implement performance counters useful for such. Simple
caps on %cpu would be a good start in my view.


-- wli

2007-05-02 00:17:32

by Ulrich Drepper

Subject: Re: per-thread rusage

On 5/1/07, Theodore Tso wrote:
> There are two ways of implementing this. One is to have the
> JVM periodically poll using a pthread_getrusage() interface.

Not a good idea.

> A better
> choice might be some kind of per-thread CPU limit, that would result
> in a thread-specific SIGXCPU signal. But there are no interfaces
> today that do anything like this.

We have, in principle: setrlimit. We jump through hoops at the moment
to make RLIMIT_CPU a per-process facility. This is all nice. All you
need to do is to add resources RLIMIT_*_THREAD (e.g.,
RLIMIT_CPU_THREAD) and additionally do accounting on a per-thread
basis.

The only issue which has to be decided is what the action is when the
limit is exceeded. An unhandled signal kills the process, not just
the thread. And you cannot just terminate a thread in the kernel
since there might be userlevel cleanup to do. The thread library also
cannot simply hijack the SIGXCPU signal; the application may want to use
it. The thread cancellation must appear like any other cancellation,
perhaps with a special status value (PTHREAD_CANCELED_XCPU instead of
PTHREAD_CANCELED). But that's a userlevel detail.

So what would be additionally needed is a method to specify what
signal to send. The default might just as well be SIGXCPU but this
must be changeable.

2007-05-02 04:31:38

by Theodore Ts'o

Subject: Re: per-thread rusage

On Tue, May 01, 2007 at 05:17:28PM -0700, Ulrich Drepper wrote:
> We have, in principle: setrlimit. We jump through hoops at the moment
> to make RLIMIT_CPU a per-process facility. This is all nice. All you
> need to do is to add resources RLIMIT_*_THREAD (e.g.,
> RLIMIT_CPU_THREAD) and additionally do accounting on a per-thread
> basis.

Indeed; in fact it would be easier to do per-thread accounting than
our current per-process accounting, as you note.

> The thread library also cannot simply hijack the SIGXCPU signal;
> the application may want to use it.... So what would be additionally
> needed is a method to specify what signal to send. The default
> might just as well be SIGXCPU but this must be changeable.


The question is whether we should use setrlimit() to set the per-thread CPU
limit, given that we would need some separate interface to set the signal
that should be sent.

Is there any reason why we should have the interface specify whether
the signal should be directed to a specified process or kernel
thread-id, perhaps using the si_pid field in the siginfo_t to specify
which thread had exceeded its CPU limit? Or would this be overkill?

> The thread cancellation must appear like any other cancellation,
> perhaps with a special status value (PTHREAD_CANCELED_XCPU instead of
> PTHREAD_CANCELED). But that's a userlevel detail.

Yep, I agree that thread cancellation is the right thing to happen at
the Posix Threads level.

Do you think this is something that we could get standardized into an
upcoming Posix/Posix Threads standard?

- Ted

2007-05-02 04:57:27

by Ulrich Drepper

Subject: Re: per-thread rusage

On 5/1/07, Theodore Tso wrote:
> The question is whether we should use setrlimit() to set the per-thread CPU
> limit, given that we would need some separate interface to set the signal
> that should be sent.
>
> Is there any reason why we should have the interface specify whether
> the signal should be directed to a specified process or kernel
> thread-id, perhaps using the si_pid field in the siginfo_t to specify
> which thread had exceeded its CPU limit? Or would this be overkill?

The more I think about it the more complex it gets. There is a
problem with delivering the signal to the receiving process itself: it
is out of time and cannot perform the cleanup operation anymore. You
could grant it a grace period but how long should that be? Some of
the cleanup handlers might take a long time. If you don't enforce the
CPU limit then it doesn't have to be in the kernel and you might as
well use CLOCK_THREAD_CPUTIME_ID and create a timer. This should
already work today. If not it must be fixed.

Delivering the timeout signal to another thread isn't really possible
either since the cleanup code might access thread-local data which
wouldn't work since it's not the canceled thread's data which is
accessed.

I don't have a good answer right now whether enforced CPU limits can
be implemented at all. But it seems for your purposes a timer with
the CPU clock might be sufficient.


> Do you think this is something that we could get standardized into an
> upcoming Posix/Posix Threads standard?

Regardless of whether a solution can be found, it's too late for the
next revision. The deadline for new features is long past.

2007-05-02 05:05:20

by Balbir Singh

Subject: Re: per-thread rusage

Alan Cox wrote:
>> I just so happen to think we should implement a variety of CPU resource
>> limits beyond what we now do, so this, too, interests me.
>
> Agreed - and make them all 64bit while doing the cleanup. One thing
> several Unixen have that we don't for 32bit boxes is a proper set of 64bit
> resource handling for memory/file etc.
>
> We could also start using the CPU facilities to enforce some of
> the really interesting real time process ones (like main memory
> bandwidth) that at the moment we have no control over and can lead to
> very unfair behaviour.
>
> Alan

Hi, Alan,

Thanks for bringing this up. There are a couple of patches posted to
lkml for RSS control (unmapped page cache controller under development).

http://lwn.net/Articles/223829/

and the new enhanced version by Pavel at

http://www.opensubscriber.com/message/[email protected]/6456480.html

We would appreciate any feedback to help us move the work forward and
make the code ready for acceptance.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL