2006-12-18 23:51:31

by David Wragg

Subject: [PATCH] procfs: export context switch counts in /proc/*/stat

The kernel already maintains context switch counts for each task, and
exposes them through getrusage(2). These counters can also be used
more generally to track which processes on the system are active
(i.e. getting scheduled to run), but getrusage is too constrained to
use it in that way.

This patch (against 2.6.19/2.6.19.1) adds the four context switch
values (voluntary context switches, involuntary context switches, and
the same values accumulated from terminated child processes) to the
end of /proc/*/stat, similarly to min_flt, maj_flt and the time used
values.

Signed-off-by: David Wragg <[email protected]>

diff -uprN --exclude='*.o' --exclude='*~' --exclude='.*' linux-2.6.19.1/fs/proc/array.c linux-2.6.19.1.build/fs/proc/array.c
--- linux-2.6.19.1/fs/proc/array.c 2006-12-18 14:35:36.000000000 +0000
+++ linux-2.6.19.1.build/fs/proc/array.c 2006-12-18 14:43:21.000000000 +0000
@@ -327,6 +327,8 @@ static int do_task_stat(struct task_stru
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long min_flt = 0, maj_flt = 0;
 	cputime_t cutime, cstime, utime, stime;
+	unsigned long cnvcsw = 0, cnivcsw = 0;
+	unsigned long nvcsw = 0, nivcsw = 0;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
@@ -369,6 +371,8 @@ static int do_task_stat(struct task_stru
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
 		cstime = sig->cstime;
+		cnvcsw = sig->cnvcsw;
+		cnivcsw = sig->cnivcsw;
 		rsslim = sig->rlim[RLIMIT_RSS].rlim_cur;

 		/* add up live thread stats at the group level */
@@ -379,6 +383,8 @@ static int do_task_stat(struct task_stru
 				maj_flt += t->maj_flt;
 				utime = cputime_add(utime, t->utime);
 				stime = cputime_add(stime, t->stime);
+				nvcsw += t->nvcsw;
+				nivcsw += t->nivcsw;
 				t = next_thread(t);
 			} while (t != task);

@@ -386,6 +392,8 @@ static int do_task_stat(struct task_stru
 			maj_flt += sig->maj_flt;
 			utime = cputime_add(utime, sig->utime);
 			stime = cputime_add(stime, sig->stime);
+			nvcsw += sig->nvcsw;
+			nivcsw += sig->nivcsw;
 		}

 		sid = sig->session;
@@ -404,6 +412,8 @@ static int do_task_stat(struct task_stru
 		maj_flt = task->maj_flt;
 		utime = task->utime;
 		stime = task->stime;
+		nvcsw = task->nvcsw;
+		nivcsw = task->nivcsw;
 	}

 	/* scale priority and nice values from timeslices to -20..20 */
@@ -420,7 +430,7 @@ static int do_task_stat(struct task_stru

 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu %lu %lu %lu %lu\n",
 		task->pid,
 		tcomm,
 		state,
@@ -465,7 +475,12 @@ static int do_task_stat(struct task_stru
 		task_cpu(task),
 		task->rt_priority,
 		task->policy,
-		(unsigned long long)delayacct_blkio_ticks(task));
+		(unsigned long long)delayacct_blkio_ticks(task),
+		nvcsw,
+		cnvcsw,
+		nivcsw,
+		cnivcsw);
+
 	if(mm)
 		mmput(mm);
 	return res;


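For readers who want to see how a userspace tool might consume the
fields this patch appends, here is a minimal sketch. It assumes a
kernel with this patch applied, so that the last four fields of
/proc/<pid>/stat are nvcsw, cnvcsw, nivcsw and cnivcsw in that order
(the order of the sprintf arguments in the patch). The struct
ctxsw_counts type and the parse_ctxsw_tail() helper are hypothetical
names invented for this illustration; they are not part of procps or
the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the four values appended by the patch. */
struct ctxsw_counts {
    unsigned long nvcsw, cnvcsw, nivcsw, cnivcsw;
};

/*
 * Extract the last four space-separated unsigned values from one
 * /proc/<pid>/stat line. ASSUMPTION: the line comes from a kernel
 * carrying this patch, so those four fields are the context switch
 * counts. Returns 0 on success, -1 on a malformed line.
 */
int parse_ctxsw_tail(const char *line, struct ctxsw_counts *out)
{
    unsigned long v[4];
    const char *start, *end;
    int i;

    /* The comm field may itself contain spaces or ')', so only the
     * text after the last ')' is treated as numeric fields. */
    start = strrchr(line, ')');
    if (!start)
        return -1;
    start++;

    /* Ignore a trailing newline or trailing spaces. */
    end = line + strlen(line);
    while (end > start && (end[-1] == '\n' || end[-1] == ' '))
        end--;

    /* Walk backwards, one field per iteration. */
    for (i = 3; i >= 0; i--) {
        const char *p = end;
        char *stop;

        while (p > start && p[-1] != ' ')
            p--;
        if (p == end || p == start)
            return -1;          /* empty field or ran out of fields */

        v[i] = strtoul(p, &stop, 10);
        if (stop != end)
            return -1;          /* field was not purely numeric */
        end = p - 1;            /* step back over the separating space */
    }

    out->nvcsw = v[0];
    out->cnvcsw = v[1];
    out->nivcsw = v[2];
    out->cnivcsw = v[3];
    return 0;
}
```

Reading from the end of the line rather than counting fields from the
front keeps the sketch independent of how many fields precede the new
ones, but it also means it would silently misread a line from an
unpatched kernel.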

2006-12-19 06:46:17

by Benjamin LaHaise

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
> This patch (against 2.6.19/2.6.19.1) adds the four context switch
> values (voluntary context switches, involuntary context switches, and
> the same values accumulated from terminated child processes) to the
> end of /proc/*/stat, similarly to min_flt, maj_flt and the time used
> values.

Please put these into new files, as the stat files in /proc are
horribly overloaded and have always been somewhat problematic
when it comes to changing how things are reported due to internal
changes to the kernel. Cheers,

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-12-19 11:48:29

by David Wragg

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

Benjamin LaHaise <[email protected]> writes:
> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
>> This patch (against 2.6.19/2.6.19.1) adds the four context switch
>> values (voluntary context switches, involuntary context switches, and
>> the same values accumulated from terminated child processes) to the
>> end of /proc/*/stat, similarly to min_flt, maj_flt and the time used
>> values.
>
> Please put these into new files, as the stat files in /proc are
> horribly overloaded and have always been somewhat problematic
> when it comes to changing how things are reported due to internal
> changes to the kernel. Cheers,

The delay accounting value was added to the end of /proc/pid/stat back
in July without discussion, so I assumed this approach was still
considered satisfactory.

Putting just these four values into a new file would seem a little
odd, since they have a lot in common with the other getrusage values
that are already in /proc/pid/stat. One possibility is to add
/proc/pid/rusage, mirroring the full struct rusage in text form, since
struct rusage is already part of the kernel ABI (though Linux doesn't
fill in half of the values).

Or perhaps it makes sense to reorganize all the values from
/proc/pid/stat and its siblings into a sysfs-like one-value-per-file
structure, though that might introduce atomicity and efficiency issues
(calculating some of the values involves iterating over the threads in
the process; with everything in one file, these loops are folded
together).

Any thoughts?

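As a rough illustration of what "struct rusage in text form" could look
like, here is a userspace sketch that formats a few struct rusage
fields as name/value lines. The function names format_rusage() and
format_self_rusage() and the output layout are assumptions made up for
this example; no /proc/pid/rusage file is defined by the patch above.
Note that getrusage(2) can only report on the calling process, its
threads or its children, which is the constraint the proposal is
trying to get around.

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Format a subset of struct rusage as "name value" lines.
 * The layout is hypothetical; it only illustrates the idea.
 * Returns the snprintf() result.
 */
int format_rusage(const struct rusage *ru, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "utime %ld.%06ld\n"
                    "stime %ld.%06ld\n"
                    "minflt %ld\n"
                    "majflt %ld\n"
                    "nvcsw %ld\n"
                    "nivcsw %ld\n",
                    (long)ru->ru_utime.tv_sec, (long)ru->ru_utime.tv_usec,
                    (long)ru->ru_stime.tv_sec, (long)ru->ru_stime.tv_usec,
                    ru->ru_minflt, ru->ru_majflt,
                    ru->ru_nvcsw, ru->ru_nivcsw);
}

/* Convenience wrapper for the calling process only. */
int format_self_rusage(char *buf, size_t len)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return format_rusage(&ru, buf, len);
}
```

Passing RUSAGE_CHILDREN instead of RUSAGE_SELF would give the values
accumulated from terminated, waited-for children, which correspond to
the cnvcsw and cnivcsw counters in the patch.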

David

2006-12-20 05:41:00

by Albert Cahalan

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

David Wragg writes:
> Benjamin LaHaise <[email protected]> writes:
>> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:

>>> This patch (against 2.6.19/2.6.19.1) adds the four context
>>> switch values (voluntary context switches, involuntary
>>> context switches, and the same values accumulated from
>>> terminated child processes) to the end of /proc/*/stat,
>>> similarly to min_flt, maj_flt and the time used values.

Hmmm, OK, do people have a use for these values?

>> Please put these into new files, as the stat files in /proc are
>> horribly overloaded and have always been somewhat problematic
>> when it comes to changing how things are reported due to internal
>> changes to the kernel. Cheers,

No thanks. Yours truly, the maintainer of "ps", "top", "vmstat", etc.

> The delay accounting value was added to the end of /proc/pid/stat back
> in July without discussion, so I assumed this approach was still
> considered satisfactory.

/proc/*/stat is the very best place in /proc for any per-process
data that will be commonly needed. Unlike /proc/*/status, few
people are tempted to screw with the formatting and/or spelling.
Unlike the /sys crap, it doesn't take 3 syscalls PER VALUE to
get at the data.

The things to ask are of course: will this really be used, and
does it really belong in /proc at all?

> Putting just these four values into a new file would seem a little
> odd, since they have a lot in common with the other getrusage values
> that are already in /proc/pid/stat. One possibility is to add
> /proc/pid/rusage, mirroring the full struct rusage in text form, since
> struct rusage is already part of the kernel ABI (though Linux doesn't
> fill in half of the values).

Since we already have a struct defined and all...

sys_get_rusage(int pid)

> Or perhaps it makes sense to reorganize all the values from
> /proc/pid/stat and its siblings into a sysfs-like one-value-per-file
> structure, though that might introduce atomicity and efficiency issues
> (calculating some of the values involves iterating over the threads in
> the process; with everything in one file, these loops are folded
> together).

Yeah, big time. Things are quite bad in /proc, but /sys is a joke.

2006-12-20 13:21:32

by David Wragg

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

"Albert Cahalan" <[email protected]> writes:
> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
>> This patch (against 2.6.19/2.6.19.1) adds the four context
>> switch values (voluntary context switches, involuntary
>> context switches, and the same values accumulated from
>> terminated child processes) to the end of /proc/*/stat,
>> similarly to min_flt, maj_flt and the time used values.
>
> Hmmm, OK, do people have a use for these values?

My reason for writing the patch was to track which processes are
active (i.e. got scheduled to run) by polling these context switch
values. The time used values are not a reliable way to detect process
activity on fast machines. So for example, when sorting by %CPU, top
often shows many processes using 0% CPU, despite the fact that these
processes are running occasionally. If top sorted by (%CPU, context
switch count delta), it might give a more useful display of which
processes are active on the system.

More generally, it seems perverse to track these context switch values
but only expose them through the constrained getrusage interface. If
they are worth having, why aren't they worth exposing in the same way
as all other process info?

> [...]
>> Putting just these four values into a new file would seem a little
>> odd, since they have a lot in common with the other getrusage values
>> that are already in /proc/pid/stat. One possibility is to add
>> /proc/pid/rusage, mirroring the full struct rusage in text form, since
>> struct rusage is already part of the kernel ABI (though Linux doesn't
>> fill in half of the values).
>
> Since we already have a struct defined and all...
>
> sys_get_rusage(int pid)

That would be a much more useful system call than getrusage. But why
have two ways of retrieving process info, /proc and a sys_get_rusage,
exposing differing subsets of process information?

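The sort order suggested above, (%CPU, context switch count delta),
can be sketched as a qsort() comparator. This is only an illustration
under stated assumptions: struct proc_sample and its field names are
hypothetical, and top does not necessarily store its samples this way.

```c
#include <stdlib.h>

/* Hypothetical per-process sample taken by a polling tool. */
struct proc_sample {
    int pid;
    unsigned int pcpu_tenths;   /* %CPU over the interval, in tenths */
    unsigned long ctxsw_prev;   /* nvcsw + nivcsw at the previous poll */
    unsigned long ctxsw_now;    /* nvcsw + nivcsw at this poll */
};

static unsigned long ctxsw_delta(const struct proc_sample *p)
{
    /* Unsigned subtraction stays correct if the counter wraps once. */
    return p->ctxsw_now - p->ctxsw_prev;
}

/* Order by %CPU descending, then by context switch delta descending. */
int cmp_activity(const void *a, const void *b)
{
    const struct proc_sample *x = a;
    const struct proc_sample *y = b;
    unsigned long dx, dy;

    if (x->pcpu_tenths != y->pcpu_tenths)
        return x->pcpu_tenths > y->pcpu_tenths ? -1 : 1;

    dx = ctxsw_delta(x);
    dy = ctxsw_delta(y);
    if (dx != dy)
        return dx > dy ? -1 : 1;
    return 0;
}

void sort_by_activity(struct proc_sample *v, size_t n)
{
    qsort(v, n, sizeof(*v), cmp_activity);
}
```

With this ordering, two processes that both round to 0% CPU are no
longer tied: the one that was scheduled more often during the interval
sorts first.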

David

2006-12-20 13:48:13

by Arjan van de Ven

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

On Wed, 2006-12-20 at 13:20 +0000, David Wragg wrote:
> "Albert Cahalan" <[email protected]> writes:
> > On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
> >> This patch (against 2.6.19/2.6.19.1) adds the four context
> >> switch values (voluntary context switches, involuntary
> >> context switches, and the same values accumulated from
> >> terminated child processes) to the end of /proc/*/stat,
> >> similarly to min_flt, maj_flt and the time used values.
> >
> > Hmmm, OK, do people have a use for these values?
>
> My reason for writing the patch was to track which processes are
> active (i.e. got scheduled to run) by polling these context switch
> values. The time used values are not a reliable way to detect process
> activity on fast machines. So for example, when sorting by %CPU, top
> often shows many processes using 0% CPU, despite the fact that these
> processes are running occasionally. If top sorted by (%CPU, context
> switch count delta), it might give a more useful display of which
> processes are active on the system.


if all you care about is the number of context switches, you can use the
following systemtap script as well:

http://www.fenrus.org/cstop.stp


--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-12-20 14:39:27

by David Wragg

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

Arjan van de Ven <[email protected]> writes:
> if all you care about is the number of context switches, you can use the
> following systemtap script as well:
>
> http://www.fenrus.org/cstop.stp

Thanks, something similar to that might well have solved my original
problem.

(When I try the script, stap complains about the lack of the kernel
debuginfo package, which of course doesn't exist for my self-built
kernel. After hunting around on the web for 10 minutes, I'm still no
closer to resolving this. But I look forward to playing with
systemtap once I get past that problem.)

Nonetheless, while systemtap might provide an objection to adding
per-task context switch counters to the kernel, it doesn't answer the
question: since we do have these counters, why not expose them in the
normal way?


David

2006-12-20 14:51:34

by Arjan van de Ven

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

On Wed, 2006-12-20 at 14:38 +0000, David Wragg wrote:
> Arjan van de Ven <[email protected]> writes:
> > if all you care about is the number of context switches, you can use the
> > following systemtap script as well:
> >
> > http://www.fenrus.org/cstop.stp
>
> Thanks, something similar to that might well have solved my original
> problem.
>
> (When I try the script, stap complains about the lack of the kernel
> debuginfo package, which of course doesn't exist for my self-built
> kernel. After hunting around on the web for 10 minutes, I'm still no
> closer to resolving this. But I look forward to playing with
> systemtap once I get past that problem.)

what worked for me is copying the "vmlinux" file to /boot as
/boot/vmlinux-`uname -r`

(strace the stap program to see what it tries to load)



--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-12-20 15:14:34

by David Wragg

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

Arjan van de Ven <[email protected]> writes:
> On Wed, 2006-12-20 at 14:38 +0000, David Wragg wrote:
>> (When I try the script, stap complains about the lack of the kernel
>> debuginfo package, which of course doesn't exist for my self-built
>> kernel. After hunting around on the web for 10 minutes, I'm still no
>> closer to resolving this. But I look forward to playing with
>> systemtap once I get past that problem.)
>
> what worked for me is copying the "vmlinux" file to /boot as
> /boot/vmlinux-`uname -r`

Thanks, that's got it working.

2006-12-20 17:36:26

by Albert Cahalan

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

On 12/20/06, David Wragg <[email protected]> wrote:
> "Albert Cahalan" <[email protected]> writes:
> > On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
> >> This patch (against 2.6.19/2.6.19.1) adds the four context
> >> switch values (voluntary context switches, involuntary
> >> context switches, and the same values accumulated from
> >> terminated child processes) to the end of /proc/*/stat,
> >> similarly to min_flt, maj_flt and the time used values.
> >
> > Hmmm, OK, do people have a use for these values?
>
> My reason for writing the patch was to track which processes are
> active (i.e. got scheduled to run) by polling these context switch
> values. The time used values are not a reliable way to detect process
> activity on fast machines. So for example, when sorting by %CPU, top
> often shows many processes using 0% CPU, despite the fact that these
> processes are running occasionally. If top sorted by (%CPU, context
> switch count delta), it might give a more useful display of which
> processes are active on the system.

Oh, that'd be great.

The cumulative ones are still not justified though, and I fear they
may be 64-bit even on i386. It turns out that an i386 procps spends
much of its time doing 64-bit division to parse the damn ASCII crap.
I suppose I could just skip those fields, but generating them isn't
too cheap and probably I'd get stuck parsing them for some other
reason -- having them separate is probably a good idea.

2006-12-21 06:01:08

by Al Boldi

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

Albert Cahalan wrote:
> On 12/20/06, David Wragg <[email protected]> wrote:
> > "Albert Cahalan" <[email protected]> writes:
> > > On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
> > >> This patch (against 2.6.19/2.6.19.1) adds the four context
> > >> switch values (voluntary context switches, involuntary
> > >> context switches, and the same values accumulated from
> > >> terminated child processes) to the end of /proc/*/stat,
> > >> similarly to min_flt, maj_flt and the time used values.
> > >
> > > Hmmm, OK, do people have a use for these values?
> >
> > My reason for writing the patch was to track which processes are
> > active (i.e. got scheduled to run) by polling these context switch
> > values. The time used values are not a reliable way to detect process
> > activity on fast machines. So for example, when sorting by %CPU, top
> > often shows many processes using 0% CPU, despite the fact that these
> > processes are running occasionally. If top sorted by (%CPU, context
> > switch count delta), it might give a more useful display of which
> > processes are active on the system.
>
> Oh, that'd be great.

It may be great, but it's really only a workaround. The real fix is in
changing the current probed proc-timing to an inlined one.

> The cumulative ones are still not justified though, and I fear they
> may be 64-bit even on i386. It turns out that an i386 procps spends
> much of its time doing 64-bit division to parse the damn ASCII crap.
> I suppose I could just skip those fields, but generating them isn't
> too cheap and probably I'd get stuck parsing them for some other
> reason -- having them separate is probably a good idea.

Agreed. It may also be advisable to add a top3 line in /proc/stat, to
circumvent parsing /proc/*/stat, when only checking who is eating CPU most.


Thanks!

--
Al

2006-12-24 01:42:13

by David Wragg

Subject: Re: [PATCH] procfs: export context switch counts in /proc/*/stat

"Albert Cahalan" <[email protected]> writes:
> The cumulative ones are still not justified though, and I fear they
> may be 64-bit even on i386.

All the context switch counts are unsigned long.

> It turns out that an i386 procps spends
> much of its time doing 64-bit division to parse the damn ASCII crap.
> I suppose I could just skip those fields, but generating them isn't
> too cheap and probably I'd get stuck parsing them for some other
> reason -- having them separate is probably a good idea.

I can't think of a compelling justification for the cumulative context
switch counts. But I suggest that if the cost of exposing these
values is low enough, they should be exposed anyway, just for the sake
of uniformity (these would be the only two getrusage values not
present in /proc/pid/stat).

If the decimal representation of values in /proc/pid/stat has such
unpleasant overheads, then I wonder if that is something worth fixing,
whether the context switch counts are added or not? It occurs to me
that it would be easy to add support for a hex version of
/proc/pid/stat with very little additional code, by using an alternate
sprintf format string in fs/proc/array.c:do_task_stat(). I assume
that procps could be adapted quite easily to take advantage of this?


David
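
The hex idea in the last paragraph can be sketched in userspace as
follows. This is an assumption-laden illustration, not kernel code:
format_counts() and parse_counts() are hypothetical names, only two
values are shown rather than the full stat line, and whether hex
parsing is measurably cheaper for procps is not established here.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Write two counters in decimal or hex, selected by a flag, in the
 * spirit of choosing between two sprintf format strings.
 * Returns the snprintf() result.
 */
int format_counts(char *buf, size_t len, int hex,
                  unsigned long nvcsw, unsigned long nivcsw)
{
    if (hex)
        return snprintf(buf, len, "%lx %lx\n", nvcsw, nivcsw);
    return snprintf(buf, len, "%lu %lu\n", nvcsw, nivcsw);
}

/*
 * Parse the two counters back. The reader must know which base the
 * writer used, because the hex form carries no "0x" prefix.
 * Returns 0 on success, -1 if either value is missing.
 */
int parse_counts(const char *buf, int hex,
                 unsigned long *nvcsw, unsigned long *nivcsw)
{
    int base = hex ? 16 : 10;
    const char *p = buf;
    char *end;

    *nvcsw = strtoul(p, &end, base);
    if (end == p)
        return -1;

    p = end;
    *nivcsw = strtoul(p, &end, base);
    if (end == p)
        return -1;
    return 0;
}
```

Because the two encodings are indistinguishable for some values (for
example "10"), a hex variant would presumably have to live under a
separate file name rather than change the existing /proc/pid/stat.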