2010-02-05 17:41:40

by malc

Subject: Scheduler oddity


The following test exhibits somewhat odd behaviour on at least 2.6.32.3
(ppc) and 2.6.29.1 (x86_64); perhaps someone could explain why.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>   /* for strcmp */
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, const char *argv[])
{
    int pipe_fd[2];
    pid_t pid;

    if (argc == 2 && !strcmp(argv[1], "batch")) {
        int ret;
        struct sched_param p = {0};

        ret = sched_setscheduler(0, SCHED_BATCH, &p);
        if (ret) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("using batch\n");
    }
    else {
        printf("using default\n");
    }

    /* create the pipe */
    if (pipe(pipe_fd) == -1) {
        perror("pipe");
        return 1;
    }

    /* fork a process that will run wc */
    pid = fork();

    if (pid == -1) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* child */
        close(pipe_fd[1]); /* close the write end of the pipe */
        if (dup2(pipe_fd[0], STDIN_FILENO) == -1) {
            perror("dup2");
            return 1;
        }
        close(pipe_fd[0]);
        execl("/usr/bin/wc", "wc", "-c", NULL);
        perror("execl");
        return 1;
    } else {
        /* parent */
        int i;

        close(pipe_fd[0]); /* close the read end of the pipe */
        for (i = 0; i < 10000000; i++)
            write(pipe_fd[1], "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 30);
        close(pipe_fd[1]);
        wait(NULL);
    }

    return 0;
}

~$ uname -a
Linux linmac 2.6.32.3 #4 Sun Jan 31 09:52:58 MSK 2010 ppc 7447A, altivec supported PowerMac10,2 GNU/Linux

$ gcc -o test-pipe test-pipe.c

$ \time -v ./test-pipe batch
using batch
300000000
Command being timed: "./test-pipe batch"
User time (seconds): 0.50
System time (seconds): 7.47
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2496
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 343
Voluntary context switches: 9933
Involuntary context switches: 1997
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

$ \time -v ./test-pipe
using default
300000000
Command being timed: "./test-pipe"
User time (seconds): 1.53
System time (seconds): 34.74
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.27
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2496
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 343
Voluntary context switches: 9992328
Involuntary context switches: 9994617
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


P.S. The test wasn't written by me.

--
mailto:[email protected]


2010-02-06 12:12:00

by Mike Galbraith

Subject: Re: Scheduler oddity

On Fri, 2010-02-05 at 19:33 +0300, malc wrote:
> The following test exhibits somewhat odd behaviour on at least 2.6.32.3
> (ppc) and 2.6.29.1 (x86_64); perhaps someone could explain why.

Expected behavior.

SCHED_BATCH tasks do not wakeup preempt; preemption is tick driven. The
writer therefore has time to fill the pipe and block, so the reader can
then drain the pipe, leading to efficient data transfer.

SCHED_NORMAL tasks do preempt. Every write wakes a reader whose
vruntime (the CPU-utilization fairness yardstick) lags the writer's
enough to warrant preemption, so one write translates to one preemption
followed by one read. The reader can't possibly catch up to the writer
(being synchronous) in either case, but the scheduler doesn't know or
care; it simply tries to equalize the two. Since the writer's CPU
utilization stems entirely from tiny writes, that time is what goes
toward equalizing the reader. The result is the tiny I/Os the
programmer asked for: extremely low latency and utterly horrid
throughput.

-Mike

2010-02-06 15:20:21

by malc

Subject: Re: Scheduler oddity

On Sat, 6 Feb 2010, Mike Galbraith wrote:

> On Fri, 2010-02-05 at 19:33 +0300, malc wrote:
> > The following test exhibits somewhat odd behaviour on at least 2.6.32.3
> > (ppc) and 2.6.29.1 (x86_64); perhaps someone could explain why.
>
> Expected behavior.
>
> SCHED_BATCH tasks do not wakeup preempt; preemption is tick driven. The
> writer therefore has time to fill the pipe and block, so the reader can
> then drain the pipe, leading to efficient data transfer.
>
> SCHED_NORMAL tasks do preempt. Every write wakes a reader whose
> vruntime (the CPU-utilization fairness yardstick) lags the writer's
> enough to warrant preemption, so one write translates to one preemption
> followed by one read. The reader can't possibly catch up to the writer
> (being synchronous) in either case, but the scheduler doesn't know or
> care; it simply tries to equalize the two. Since the writer's CPU
> utilization stems entirely from tiny writes, that time is what goes
> toward equalizing the reader. The result is the tiny I/Os the
> programmer asked for: extremely low latency and utterly horrid
> throughput.
>

Thank you.

--
mailto:[email protected]