I have noticed that when the scheduling policy of a process is SCHED_FIFO
or SCHED_RR, proc_pid_stat() in fs/proc/array.c still uses the counter field
in the task structure to calculate the priority in /proc/<pid>/stat.
For such processes, the counter field is ignored by the scheduler in favour
of the rt_priority field. Thus, even though the actual priority is available
via sched_getparam(2), the priority in /proc/<pid>/stat -- and, consequently,
in the output of ps(1) and top(1) -- seems to be, for SCHED_FIFO/SCHED_RR
processes, a value that is not representative at all of how the process is
scheduled.
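
For comparison, the value the scheduler actually honours can be read from user space through the POSIX calls mentioned above. The following is only a minimal sketch: the helper name query_sched() is hypothetical, and it assumes a Linux system where sched_getscheduler(2) and sched_getparam(2) are available.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: fetch the scheduling policy and the POSIX
 * priority of `pid` (0 means the calling process).
 * Returns 0 on success, -1 if either system call fails. */
int query_sched(pid_t pid, int *policy, int *priority)
{
    struct sched_param sp;
    int pol = sched_getscheduler(pid);

    if (pol < 0)
        return -1;
    if (sched_getparam(pid, &sp) < 0)
        return -1;

    *policy = pol;
    *priority = sp.sched_priority;  /* 0 for SCHED_OTHER, 1..99 for FIFO/RR */
    return 0;
}
```

Calling query_sched(0, &policy, &priority) from a SCHED_FIFO or SCHED_RR process should yield the rt_priority value, which is the number that /proc/<pid>/stat currently fails to reflect.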
I have thrown together a patch to address this, but I can't say I feel
entirely comfortable about scaling from 1..99 to -20..20.
Any comments would be appreciated.
Lev Makhlis
-------------------------------CUT HERE---------------------------------
--- linux-2.4.12/fs/proc/array.c Sat Oct 13 13:09:29 2001
+++ linux/fs/proc/array.c Sat Oct 13 13:21:05 2001
@@ -334,8 +334,13 @@
/* scale priority and nice values from timeslices to -20..20 */
/* to make it look like a "normal" Unix priority/nice value */
- priority = task->counter;
- priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+ if ((task->policy & ~SCHED_YIELD) == SCHED_OTHER) {
+ priority = task->counter;
+ priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+ } else {
+ priority = task->rt_priority;
+ priority = 20 - (priority * 40 + 50) / 100;
+ }
nice = task->nice;
read_lock(&tasklist_lock);
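
To illustrate the discomfort with the scaling, here is a stand-alone sketch of the arithmetic in the RT branch of the patch above. The function name scale_rt_priority() is made up for the example; only the expression itself comes from the patch.

```c
/* Same integer arithmetic as the RT branch of the patch:
 * squeezes rt_priority 1..99 into the -20..20 range. */
int scale_rt_priority(int rt_priority)
{
    return 20 - (rt_priority * 40 + 50) / 100;
}
```

Because 99 input values are folded into 41 output values, distinct RT priorities collide: rt_priority 2 and 3 both come out as 19, so the reported number cannot be mapped back to the real one.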
Lev Makhlis writes:
> I have noticed that when the scheduling policy of a process is SCHED_FIFO
> or SCHED_RR, proc_pid_stat() in fs/proc/array.c still uses the counter field
> in the task structure to calculate the priority in /proc/<pid>/stat.
> For such processes, the counter field is ignored by the scheduler in favour
> of the rt_priority field. Thus, even though the actual priority is available
> via sched_getparam(2), the priority in /proc/<pid>/stat -- and, consequently,
> in the output of ps(1) and top(1) -- seems to be, for SCHED_FIFO/SCHED_RR
> processes, a value that is not representative at all of how the process is
> scheduled.
>
> I have thrown together a patch to address this, but I can't say I feel
> entirely comfortable about scaling from 1..99 to -20..20.
Do not do this. Just supply the raw value for ps(1) and top(1) to use.
Also supply the scheduling policy type. You can tack this on the end
of /proc/<pid>/stat and tell me when Linus accepts it so that I can
make ps(1) and top(1) support the new info.
On Monday 19 November 2001 01:01 am, Albert D. Cahalan wrote:
> Lev Makhlis writes:
> > I have noticed that when the scheduling policy of a process is SCHED_FIFO
> > or SCHED_RR, proc_pid_stat() in fs/proc/array.c still uses the counter
> > field in the task structure to calculate the priority in
> > /proc/<pid>/stat. For such processes, the counter field is ignored by the
> > scheduler in favour of the rt_priority field. Thus, even though the
> > actual priority is available via sched_getparam(2), the priority in
> > /proc/<pid>/stat -- and, consequently, in the output of ps(1) and top(1)
> > -- seems to be, for SCHED_FIFO/SCHED_RR processes, a value that is not
> > representative at all of how the process is scheduled.
> >
> > I have thrown together a patch to address this, but I can't say I feel
> > entirely comfortable about scaling from 1..99 to -20..20.
>
> Do not do this. Just supply the raw value for ps(1) and top(1) to use.
> Also supply the scheduling policy type. You can tack this on the end
> of /proc/<pid>/stat and tell me when Linus accepts it so that I can
> make ps(1) and top(1) support the new info.
I agree scaling from 1..99 to 20..-20 wasn't a good idea, but I don't think
supplying the raw (1..99) value without any transformation at all would be
right either -- I think we need to reverse its sign, for the following
reason:
If you look at what happens on other Unix platforms, the "direction"
of priority values can vary: usually, higher values mean lower priority,
but, for example, on Solaris, higher values mean higher priority.
But on any specific platform, the "direction" is consistent across
the different scheduling policies. On Linux, it's "higher value = lower
priority" for the default timesharing policy, and therefore I think it should
be the same for the RT priorities.
I think the Right Thing would be to use a f(x) = c - x transformation,
where c could be 100, or 0, or -20, or -100, or something else.
-20 or -100 have the advantage of preserving the order relationship
between priorities across the scheduling policies.
The patch below uses c=-100 -- as an example.
-------------------------CUT------------------------------
--- linux-2.4.12/fs/proc/array.c Sat Oct 13 13:09:29 2001
+++ linux/fs/proc/array.c Mon Nov 19 01:33:20 2001
@@ -334,8 +334,12 @@
/* scale priority and nice values from timeslices to -20..20 */
/* to make it look like a "normal" Unix priority/nice value */
- priority = task->counter;
- priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+ if ((task->policy & ~SCHED_YIELD) == SCHED_OTHER) {
+ priority = task->counter;
+ priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+ } else {
+ priority = -100 - task->rt_priority;
+ }
nice = task->nice;
read_lock(&tasklist_lock);
@@ -343,7 +347,7 @@
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d\n",
task->pid,
task->comm,
state,
@@ -386,7 +390,8 @@
task->nswap,
task->cnswap,
task->exit_signal,
- task->processor);
+ task->processor,
+ task->policy & ~SCHED_YIELD);
if(mm)
mmput(mm);
return res;
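
As a rough illustration of the f(x) = c - x idea, the sketch below shows that the mapping is trivially reversible and that, with c = -100 or c = -20, every RT value lands below -20, the highest value on the existing SCHED_OTHER scale. The function and macro names are hypothetical and exist only for this example.

```c
/* Highest priority on the existing -20..20 SCHED_OTHER scale. */
#define HIGHEST_TS_PRIORITY (-20)

/* Proposed remapping: reverse the sign and shift by a constant c. */
int remap_rt_priority(int rt_priority, int c)
{
    return c - rt_priority;
}

/* The mapping is its own inverse, so a tool can recover the raw value. */
int undo_remap(int reported, int c)
{
    return c - reported;
}
```

With c = -100, rt_priority 1..99 becomes -101..-199, so a smaller number still means a higher priority across all three policies.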
Lev Makhlis writes:
> On Monday 19 November 2001 01:01 am, Albert D. Cahalan wrote:
>> Do not do this. Just supply the raw value for ps(1) and top(1) to use.
>> Also supply the scheduling policy type. You can tack this on the end
>> of /proc/<pid>/stat and tell me when Linus accepts it so that I can
>> make ps(1) and top(1) support the new info.
>
> I agree scaling from 1..99 to 20..-20 wasn't a good idea, but I don't think
> supplying the raw (1..99) value without any transformation at all would be
> right either -- I think we need to reverse its sign, for the following
> reason:
>
> If you look at what happens on other Unix platforms, the "direction"
> of priority values can vary: usually, higher values mean lower priority,
> but, for example, on Solaris, higher values mean higher priority.
> But on any specific platform, the "direction" is consistent across
> the different scheduling policies. On Linux, it's "higher value = lower
> priority" for the default timesharing policy, and therefore I think it should
> be the same for the RT priorities.
>
> I think the Right Thing would be to use a f(x) = c - x transformation,
> where c could be 100, or 0, or -20, or -100, or something else.
> -20 or -100 have the advantage of preserving the order relationship
> between priorities across the scheduling policies.
>
> The patch below uses c=-100 -- as an example.
I can tell you what procps will do. The very first thing is
to undo your transformation. Don't bother having the kernel
muck with the numbers. The procps code will transform the
numbers as needed to match UNIX convention and/or the tools
which users run to set these values.
On Monday 19 November 2001 06:31 am, Albert D. Cahalan wrote:
> > On Monday 19 November 2001 01:01 am, Albert D. Cahalan wrote:
> >> Do not do this. Just supply the raw value for ps(1) and top(1) to use.
> >> Also supply the scheduling policy type. You can tack this on the end
> >> of /proc/<pid>/stat and tell me when Linus accepts it so that I can
> >> make ps(1) and top(1) support the new info.
> >
> > [snip]
> > I think the Right Thing would be to use a f(x) = c - x transformation,
> > where c could be 100, or 0, or -20, or -100, or something else.
> > -20 or -100 have the advantage of preserving the order relationship
> > between priorities across the scheduling policies.
> >
> > The patch below uses c=-100 -- as an example.
>
> I can tell you what procps will do. The very first thing is
> to undo your transformation. Don't bother having the kernel
> muck with the numbers. The procps code will transform the
> numbers as needed to match UNIX convention and/or the tools
> which users run to set these values.
Hmm, how would you explain that the kernel mucks with the numbers
for SCHED_OTHER, but not for SCHED_FIFO/SCHED_RR?
IIRC, procps does not attempt to undo the f(x) = 20 - (10x + 5) / 10
(assuming HZ=100) transformation currently used for SCHED_OTHER.
Granted, procps can do the transformation itself, but procps does not
have a monopoly on using procfs data -- any other performance-monitoring
application would have to duplicate the transformation, if it is to be
consistent with the standard (procps) tools. I thought it would be
nice if the kernel provided a consistent interface through procfs to
begin with.
Lev Makhlis writes:
> On Monday 19 November 2001 06:31 am, Albert D. Cahalan wrote:
>> I can tell you what procps will do. The very first thing is
>> to undo your transformation. Don't bother having the kernel
>> muck with the numbers. The procps code will transform the
>> numbers as needed to match UNIX convention and/or the tools
>> which users run to set these values.
>
> Hmm, how would you explain that the kernel mucks with the numbers
> for SCHED_OTHER, but not for SCHED_FIFO/SCHED_RR?
Long ago, the kernel didn't muck with the numbers. It has to
do that now because the numbers used inside the kernel are
different than they used to be. If the interface were being
designed today, it could supply the raw numbers.
> IIRC, procps does not attempt to undo the f(x) = 20 - (10x + 5) / 10
> (assuming HZ=100) transformation currently used for SCHED_OTHER.
Yes it does, more-or-less. This depends on what is being
supplied to the user. You can have the data UNIX-style,
SunOS-style, and traditional Linux-style. Like this:
ps -eo pri,opri,priority
> Granted, procps can do the transformation itself, but procps does not
> have a monopoly on using procfs data -- any other performance-monitoring
> application would have to duplicate the transformation, if it is to be
> consistent with the standard (procps) tools. I thought it would be
> nice if the kernel provided a consistent interface through procfs to
> begin with.
Maybe you should consider why, if true, the kernel internals
are not consistent with the API. Perhaps this isn't a performance
advantage. I also wonder why RT tasks have a separate priority
in the task struct when they leave the regular one unused and
regular tasks leave the RT one unused. If these could be the same
data type, then there isn't even any need for a union.
For compatibility with the rest of the world, procps needs to
display the scheduling policy ("RR", "TS", etc.) and remap RT
priority values in several different ways. Having the kernel
remap values just obfuscates what the data really means, making
more work for every app developer and wasting kernel CPU time.
On Tuesday 20 November 2001 12:51 am, Albert D. Cahalan wrote:
> Lev Makhlis writes:
> > On Monday 19 November 2001 06:31 am, Albert D. Cahalan wrote:
> >> I can tell you what procps will do. The very first thing is
> >> to undo your transformation. Don't bother having the kernel
> >> muck with the numbers. The procps code will transform the
> >> numbers as needed to match UNIX convention and/or the tools
> >> which users run to set these values.
> >
> > Hmm, how would you explain that the kernel mucks with the numbers
> > for SCHED_OTHER, but not for SCHED_FIFO/SCHED_RR?
>
> Long ago, the kernel didn't muck with the numbers. It has to
> do that now because the numbers used inside the kernel are
> different than they used to be. If the interface were being
> designed today, it could supply the raw numbers.
Yes, and then there would be no question about supplying the raw
numbers for FIFO/RR as well. I'm not sure, though, that it would be
such a good idea, because the raw numbers for SCHED_OTHER
do not have a rigid scale -- it changes with the definition of HZ.
It wouldn't be trivial for an app programmer to find out just how high
on the priority scale a particular process is.
>
> > IIRC, procps does not attempt to undo the f(x) = 20 - (10x + 5) / 10
> > (assuming HZ=100) transformation currently used for SCHED_OTHER.
>
> Yes it does, more-or-less. This depends on what is being
> supplied to the user. You can have the data UNIX-style,
> SunOS-style, and traditional Linux-style. Like this:
>
> ps -eo pri,opri,priority
But you can't have the raw task->counter value, can you?
I don't think it's possible, since the mapping is not 1-1.
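
A small sketch of why the raw counter cannot be recovered in general. It reuses the SCHED_OTHER expression from proc_pid_stat() but takes DEF_COUNTER as a parameter; the assumption here (not stated in this thread) is that DEF_COUNTER is 10*HZ/100, so it would be 10 for HZ=100 and 102 for HZ=1024. The function name is hypothetical.

```c
/* SCHED_OTHER scaling from proc_pid_stat(), with DEF_COUNTER passed in
 * so that different HZ settings can be compared.
 * Assumption: def_counter == 10 * HZ / 100. */
int scale_counter(int counter, int def_counter)
{
    return 20 - (counter * 10 + def_counter / 2) / def_counter;
}
```

With def_counter = 10 the expression reduces to 20 - counter, which is reversible. With def_counter = 102, counter values 0 through 5 all report 20, so the original value is lost.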
>
> > Granted, procps can do the transformation itself, but procps does not
> > have a monopoly on using procfs data -- any other performance-monitoring
> > application would have to duplicate the transformation, if it is to be
> > consistent with the standard (procps) tools. I thought it would be
> > nice if the kernel provided a consistent interface through procfs to
> > begin with.
>
> Maybe you should consider why, if true, the kernel internals
> are not consistent with the API.
I can only speculate that it's for the same reason that /proc/partitions
shows sizes in units of 1024 regardless of the actual block size, and
that /proc/stat shows CPU times in units of 10ms even if HZ is redefined
to something other than 100 -- so that the API remains backward-compatible
as the kernel internals continue to evolve.
> Perhaps this isn't a performance
> advantage. I also wonder why RT tasks have a separate priority
> in the task struct when they leave the regular one unused and
> regular tasks leave the RT one unused. If these could be the same
> data type, then there isn't even any need for a union.
I do not claim to understand all of the scheduler code, but it appears
to me that the regular priority field still has some limited use for
RT processes, especially for RR.
>
> For compatibility with the rest of the world, procps needs to
> display the scheduling policy ("RR", "TS", etc.) and remap RT
> priority values in several different ways. Having the kernel
> remap values just obfuscates what the data really means, making
> more work for every app developer and wasting kernel CPU time.
Frankly, as an app developer, I can't see the benefit of having
the raw values, as long as I get a 1-1 mapping. For example, when
I am programming on Solaris, I know that FIFO/RR priorities can range
from 0 (lowest) to 59 (highest) when I look at them via /proc, or
from 100 (lowest) to 159 (highest) when using the POSIX interface
(<sched.h>). On HP-UX, I know that they range from -32 (highest)
to -1 (lowest) when using pstat(), or from 0 (lowest) to 31 (highest)
when using sched_*(). And on Tru64 Unix, they can be between 0
(highest) and 63 (lowest) when using the mach interface, or between
0 (lowest) and 63 (highest) when using sched_*(). In each case,
there is a 1-1 mapping between the POSIX values and the "native"
values that are used by ps/top by default. In each case, I don't know
which values the kernel uses internally -- could be the POSIX ones,
the "native" ones, or neither. I simply fail to see what additional
benefit I would have if I knew, for example, that Solaris really uses
values from 300 to 359 (just an off the wall example) under the hood.
Unless, perhaps, I wanted to bypass the API and read the priority
straight from process tables in /dev/kmem or something.
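
To make the 1-1 point concrete, the ranges quoted above are consistent with simple linear mappings such as the ones sketched below. This is only an inference from the endpoints given here, not something taken from vendor documentation, and the function names are invented for the example.

```c
/* Assumed linear mappings from POSIX (sched_*) priorities to the
 * "native" values, derived only from the range endpoints quoted above. */

/* Solaris: POSIX 100..159 -> /proc 0..59, same direction. */
int solaris_proc_from_posix(int posix_prio)
{
    return posix_prio - 100;
}

/* HP-UX: POSIX 0..31 -> pstat() -1..-32, direction reversed. */
int hpux_pstat_from_posix(int posix_prio)
{
    return -1 - posix_prio;
}

/* Tru64: POSIX 0..63 -> mach 63..0, direction reversed. */
int tru64_mach_from_posix(int posix_prio)
{
    return 63 - posix_prio;
}
```

Each of these is reversible, which is all an application needs; what the kernel stores internally never enters into it.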
On Linux, I know the RT priorities range from 1 (lowest) to 99 (highest)
in the POSIX interface, and I happen to know, thanks to source
availability, that 1..99 is used internally. Would I see anything wrong
with Linux, unlike the other platforms, providing exactly the same
numbers through its "native" (procfs) interface? Not at all. My only
objection is that it would be inconsistent with the mapping for
SCHED_OTHER that's already in place, which is from -20 (highest)
to 20 (lowest). I am saying the scales should either both go upwards,
or both go downwards. I suggested a reversal of the RT scale
because I doubt a reversal of the TS scale would be readily accepted
at this stage, but maybe I'm wrong...