Here is a set of patches that adds per-task delay accounting to
Linux. In this context, a delay is the time spent by a task
waiting for some resource to become available. Currently the patches
record (or make available) the following delays (a rough sketch of the
corresponding per-task fields follows the list):
CPU delay: time spent on runqueue waiting for a CPU to run on
Block I/O delay: waiting for block I/O to complete (including any
wait for queueing the request)
Page fault delay: waiting for page faults (major & minor) to complete
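For illustration, the per-task state behind these delays could look roughly
like the sketch below; the struct and field names are hypothetical, not the
actual definitions from the patches.

#include <linux/types.h>

/* Illustrative only: roughly the kind of per-task state such accounting
 * keeps.  Names are hypothetical, not taken from the patches.  The CPU
 * delay is not listed here since, as noted below, it comes from the
 * existing schedstats counters. */
struct task_delay_info {
	/* accumulated delays, in nanoseconds */
	__u64	blkio_delay;	/* waiting for block I/O to complete */
	__u64	pgflt_delay;	/* waiting for page faults to complete */

	/* number of intervals folded into each total */
	__u32	blkio_count;
	__u32	pgflt_count;

	/* start timestamp of the interval currently being measured */
	__u64	blkio_start;
	__u64	pgflt_start;
};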
Having this information allows one to adjust the priorities (cpu, io)
and rss limits of a task. For example, if task A is spending too much time
waiting for block I/O to complete compared to task B, bumping up
A's I/O priority relative to that of B might help. This isn't particularly
useful if one always wants A to get more I/O bandwidth than B. But if one
is interested in dynamically adjusting priorities, delay statistics
complete the feedback loop.
The statistics are collected by simple timestamping and recording of
intervals in the task_struct. The cpu stats are already being collected
by schedstats, so no additional code is needed in the hot path of a
context switch.
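The interval recording itself is just a timestamp before the wait and an
accumulate afterwards. A hedged sketch of the pattern (helper names are
hypothetical; sched_clock() is used here as one possible nanosecond time
source):

#include <linux/sched.h>

/* Hypothetical helpers showing the timestamp-and-accumulate pattern. */
static inline void delay_start(u64 *start)
{
	*start = sched_clock();
}

static inline void delay_end(u64 start, u64 *total, u32 *count)
{
	*total += sched_clock() - start;
	(*count)++;
}

/*
 * Usage around a block I/O wait, e.g. in io_schedule(), assuming the
 * hypothetical task_delay_info from the earlier sketch hangs off
 * current->delays:
 *
 *	delay_start(&current->delays->blkio_start);
 *	schedule();
 *	delay_end(current->delays->blkio_start,
 *		  &current->delays->blkio_delay,
 *		  &current->delays->blkio_count);
 */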
They are made available through a connector interface which allows
- stats for a given <pid> to be obtained in response to a command
which specifies the <pid>. The need for dynamically obtaining delay
stats is the reason why piggybacking delay stats onto BSD process
accounting wasn't considered.
- stats for exiting tasks to be sent to userspace listeners. This can
be useful for collecting statistics by any kind of grouping done by
the userspace agent. Such groupings (banks/process aggregates/classes)
have been proposed by different projects.
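To illustrate the command half of this interface, here is a hedged userspace
sketch. The connector idx/val, the command payload (just a pid) and the reply
format are assumptions made for illustration, not the actual ABI of these
patches.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>

#define DELAYACCT_CN_IDX 0xA	/* hypothetical connector index */
#define DELAYACCT_CN_VAL 0x1	/* hypothetical connector value */

int main(int argc, char **argv)
{
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK,
				  .nl_groups = DELAYACCT_CN_IDX };
	union {
		struct nlmsghdr nlh;
		char buf[NLMSG_SPACE(sizeof(struct cn_msg) + sizeof(pid_t))];
	} req;
	struct cn_msg *cn;
	pid_t pid;
	int sk;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);

	sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (sk < 0 || bind(sk, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("netlink");
		return 1;
	}

	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len  = NLMSG_LENGTH(sizeof(*cn) + sizeof(pid));
	req.nlh.nlmsg_type = NLMSG_DONE;
	req.nlh.nlmsg_pid  = getpid();

	cn = NLMSG_DATA(&req.nlh);
	cn->id.idx = DELAYACCT_CN_IDX;
	cn->id.val = DELAYACCT_CN_VAL;
	cn->len    = sizeof(pid);
	memcpy(cn->data, &pid, sizeof(pid));	/* command payload: the <pid> */

	/* Ask for the delay stats of <pid>; the reply (and any exit
	 * notifications from other tasks) would be read back with recv(). */
	if (send(sk, &req, req.nlh.nlmsg_len, 0) < 0)
		perror("send");

	close(sk);
	return 0;
}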
Comments on the patches are requested.
--Shailabh
Series
delayacct-init.patch
delayacct-blkio.patch
delayacct-pgflt.patch
delayacct-connector.patch
Shailabh Nagar <[email protected]> wrote:
>
> They are made available through a connector interface which allows
> - stats for a given <pid> to be obtained in response to a command
> which specifies the <pid>. The need for dynamically obtaining delay
> stats is the reason why piggybacking delay stats onto BSD process
> accounting wasn't considered.
I think this is the first time that anyone has come out with real code
which does per-task accounting via connector.
Which makes one wonder where this will end up. If numerous different
people add numerous different accounting messages, presumably via different
connector channels then it may all end up a bit of a mess. Given the way
kernel development happens, that's pretty likely.
For example, should the next developer create a new message type, or should
he tack his desired fields onto the back of yours? If the former, we'll
end up with quite a lot of semi-duplicated code and a lot more messages and
resources than we strictly need. If the latter, then perhaps the
versioning you have in there will suffice - I'm not sure.
I wonder if at this stage we should take a shot at some overarching "how do we
do per-task accounting messages over connector" design which can at least
incorporate the various things which people have been talking about
recently?
On Mon, Nov 14, 2005 at 08:17:41PM -0800, Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
> >
> > They are made available through a connector interface which allows
> > - stats for a given <pid> to be obtained in response to a command
> > which specifies the <pid>. The need for dynamically obtaining delay
> > stats is the reason why piggybacking delay stats onto BSD process
> > accounting wasn't considered.
>
> I think this is the first time that anyone has come out with real code
> which does per-task accounting via connector.
>
> Which makes one wonder where this will end up. If numerous different
> people add numerous different accounting messages, presumably via different
> connector channels then it may all end up a bit of a mess. Given the way
> kernel development happens, that's pretty likely.
>
> For example, should the next developer create a new message type, or should
> he tack his desired fields onto the back of yours? If the former, we'll
> end up with quite a lot of semi-duplicated code and a lot more messages and
> resources than we strictly need. If the latter, then perhaps the
> versioning you have in there will suffice - I'm not sure.
>
> I wonder if at this stage we should take a shot at some overarching "how do we
> do per-task accounting messages over connector" design which can at least
> incorporate the various things which people have been talking about
> recently?
Another point to take into consideration is that SystemTap should make
it possible to add such instrumentation on the fly.
That means you don't have to maintain such statistics code (which is likely
to change often due to user needs) in the mainline kernel.
The burden moves to userspace, where the kernel hooks are compiled and
inserted.
OTOH, when you think of the kernel's fast rate of change, that might not be
a very good option.
Just my two cents.
Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
>
>>They are made available through a connector interface which allows
>> - stats for a given <pid> to be obtained in response to a command
>> which specifies the <pid>. The need for dynamically obtaining delay
>> stats is the reason why piggybacking delay stats onto BSD process
>> accounting wasn't considered.
>
>
> I think this is the first time that anyone has come out with real code
> which does per-task accounting via connector.
>
> Which makes one wonder where this will end up. If numerous different
> people add numerous different accounting messages, presumably via different
> connector channels then it may all end up a bit of a mess. Given the way
> kernel development happens, that's pretty likely.
Yes, that's quite likely. While doing this, it was pretty clear that the
connector interface part of the patch was relatively independent of the
collection of these specific delays. This is already apparent from the exporting
of per-task cpu data that wasn't collected by this patch, and there's no reason
why other data (currently available from /proc/<tgid>/stats) shouldn't be
exported too if it's needed, with less overhead than /proc offers.
> For example, should the next developer create a new message type, or should
> he tack his desired fields onto the back of yours? If the former, we'll
> end up with quite a lot of semi-duplicated code and a lot more messages and
> resources than we strictly need. If the latter, then perhaps the
> versioning you have in there will suffice - I'm not sure.
The design I have assumes the latter. Hence the versioning.
Another bit of code that's already being duplicated is the registration/deregistration
of userspace listeners. Currently this is duplicated with Matt Helsley's process event
connector, and it'll only get worse later.
> I wonder if at this stage we should take a shot at some overarching "how do we
> do per-task accounting messages over connector" design which can at least
> incorporate the various things which people have been talking about
> recently?
That's a good thought, and I'd been sorely tempted to do that as part of this connector
patch but held back since it can quickly become quite elaborate.
The assumption here is that there will be a growing and varied set of per-task data
that userspace needs to get through a more efficient interface than /proc. We can limit
the discussion to accounting data since it's logically related.
Here's a strawman:
Have userspace specify an "interest set" for the per-task stats that are available -
a bitmap of two u64s should suffice to cover them all.
The interest set can be specified at registration time (when the first client starts
listening) so that any future task exits send the desired stats down to userspace.
Clients could also be allowed to override the default interest set whenever they send a
command to get data on a specific task.
The returned data would now need to be variable-sized to avoid unnecessary overhead.
So something like:
	header, including a bitmap of the returned data
	data 1
	data 2
	....
That way we avoid versioning until we run out of bits in the bitmap.
If more per-task stats get added, they can be assigned bits and the bitmap also defines
the implicit order in which the data gets sent.
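As a rough illustration of this strawman (all names, bit assignments and
layouts below are hypothetical):

#include <linux/types.h>

/* One bit per exported per-task statistic; two u64s of room, as suggested
 * above.  Bit positions are purely illustrative. */
#define TASKSTAT_CPU_DELAY	0
#define TASKSTAT_BLKIO_DELAY	1
#define TASKSTAT_PGFLT_DELAY	2
/* ... new stats claim the next free bit ... */

/* Interest set sent by a listener at registration time, or alongside a
 * per-pid command to override the default. */
struct taskstat_interest {
	__u64	want[2];		/* bitmap of requested stats */
};

/* Variable-sized reply: a fixed header whose bitmap says which items
 * follow, then one item per set bit in ascending bit order. */
struct taskstat_reply_hdr {
	__u32	pid;
	__u64	present[2];		/* bitmap of the data items that follow */
	/* __u64 values (or per-stat structs) follow here, in bit order */
};

On exit (or in response to a command), the kernel would fill 'present' with
the intersection of the registered interest set and what it can actually
supply, so existing clients keep working as new bits get defined and
versioning is only needed once the 128 bits run out.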