2005-05-03 14:27:37

by Carlos Carvalho

[permalink] [raw]
Subject: problem with nice values and cpu consumption in 2.6.11-5

Look at this cpu usage in a two-processor machine:

893 user1 39 19 7212 5892 492 R 99.7 1.1 3694:29 mi41
1118 user2 25 0 155m 61m 624 R 50.0 12.3 857:54.18 b170-se.x
1186 user3 25 0 155m 62m 640 R 50.2 12.3 103:25.22 b170-se.x

The job with nice 19 seems to be using 100% of cpu time while the
other two nice 0 jobs share a single processor with 50% only. This is
persistent, not a transient. I did a kill -STOP to the nice 19 job and
a kill -CONT, and for a while it decreased the cpu usage but later
returned to the above.

This is with kernel 2.6.11-5 and top 3.2.5. What's the reason for this
(apparent??) mis-behavior and how can I correct it? This is important
because the machine is used for number-crunching and users get really
upset when they don't get the expected share of cpu time...


2005-05-04 11:53:21

by Kirill Korotaev

[permalink] [raw]
Subject: Re: problem with nice values and cpu consumption in 2.6.11-5

This is a real problem with O(1)-scheduler in 2.6... :(
The only workaround for you right now is to run 2.4 or move to some type
of virtualization solutions with fair cpu scheduler...

Kirill

> Look at this cpu usage in a two-processor machine:
>
> 893 user1 39 19 7212 5892 492 R 99.7 1.1 3694:29 mi41
> 1118 user2 25 0 155m 61m 624 R 50.0 12.3 857:54.18 b170-se.x
> 1186 user3 25 0 155m 62m 640 R 50.2 12.3 103:25.22 b170-se.x
>
> The job with nice 19 seems to be using 100% of cpu time while the
> other two nice 0 jobs share a single processor with 50% only. This is
> persistent, not a transient. I did a kill -STOP to the nice 19 job and
> a kill -CONT, and for a while it decreased the cpu usage but later
> returned to the above.
>
> This is with kernel 2.6.11-5 and top 3.2.5. What's the reason for this
> (apparent??) mis-behavior and how can I correct it? This is important
> because the machine is used for number-crunching and users get really
> upset when they don't get the expected share of cpu time...
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


2005-05-04 12:12:37

by Con Kolivas

[permalink] [raw]
Subject: Re: problem with nice values and cpu consumption in 2.6.11-5

On Wed, 4 May 2005 00:24, Carlos Carvalho wrote:
> Look at this cpu usage in a two-processor machine:
>
> 893 user1 39 19 7212 5892 492 R 99.7 1.1 3694:29 mi41
> 1118 user2 25 0 155m 61m 624 R 50.0 12.3 857:54.18 b170-se.x
> 1186 user3 25 0 155m 62m 640 R 50.2 12.3 103:25.22 b170-se.x
>
> The job with nice 19 seems to be using 100% of cpu time while the
> other two nice 0 jobs share a single processor with 50% only. This is
> persistent, not a transient. I did a kill -STOP to the nice 19 job and
> a kill -CONT, and for a while it decreased the cpu usage but later
> returned to the above.
>
> This is with kernel 2.6.11-5 and top 3.2.5. What's the reason for this
> (apparent??) mis-behavior and how can I correct it? This is important
> because the machine is used for number-crunching and users get really
> upset when they don't get the expected share of cpu time...

We currently do not have "nice" aware SMP balancing. The balancing is purely
designed with throughput in mind, and something about the behaviour of the
tasks you are running makes the scheduler design to balance them in this way.
The only way around this is to use affinities to bind tasks to cpus. The only
cross-cpu "nice" awareness we currently have is between hyperthread (SMT)
logical siblings, and not true physical cores.

I've been experimenting with code to make the SMP balancing "nice" aware but
the balancing design in the 2.6 scheduler changes every 3 minutes for some
apparent gain somewhere (it is getting impossible to track these) and there
is no baseline for me to work off, so I have, for the moment, given up on
that idea.

Cheers,
Con


Attachments:
(No filename) (1.63 kB)
(No filename) (189.00 B)
Download all attachments