LinuxLists.cc - 2.4.17 absurd number of context switches

2001-12-28 16:36:46

Subject: 2.4.17 absurd number of context switches

Here's a slice of vmstat 1 on my 2-way x86, 2GB main memory machine
running Postgres 7.2beta4 on Linux 2.4.17:
cpu
r b w bi bo in cs us sy id
7 0 0 371612 58272 18576 1568896 0 0 0 168 414 33113 49 38 13
9 0 0 371612 59168 18576 1568900 0 0 0 64 215 32143 56 36 8
5 0 0 371612 58532 18576 1568924 0 0 0 696 363 33553 52 41 7
8 0 0 371612 59344 18576 1568956 0 0 16 240 374 34237 52 38 9
3 0 0 371612 58860 18576 1568996 0 0 0 128 254 31848 51 38 11
6 0 0 371612 59172 18576 1568996 0 0 0 64 234 36340 56 30 14
3 0 0 371612 59092 18576 1569004 0 0 0 232 204 32065 48 42 11
^^^^^
Check out those figures for context switches! 30,000 switches per second
with only three runnable processes and practically no block I/O seems
quite high to me. You can also see that the system is spending half its
time in the kernel, presumably in the scheduler. Postgres is barely
getting any CPU time at all, and the performance suffers noticeably.

Is this a scheduler worst-case, something to be expected, or something I
can work around?

Please CC me since vger's majordomo is an impossible chunk of shit.

-jwb

2001-12-28 16:52:28

by Alan

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

> Check out those figures for context switches! 30,000 switches per second
> with only three runnable processes and practically no block I/O seems
> quite high to me. You can also see that the system is spending half its
..
> Is this a scheduler worst-case, something to be expected, or something I
> can work around?

The scheduler is _good_ at the three process case. Run some straces it looks
more like postgres is doing wacky yield based locks.

Alan

2001-12-28 17:07:28

by Anton Tinchev

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

Is your system heavy loaded?
I'm running Postgres 7.1.2 on 2-way x86 too and noticed that context switches
are high!
Here the slice on about %50 free cpu time, but when goes up to %80-%90 -
context switches going to 10000-15000:
procs memory swap io system
cpu
r b w swpd free buff cache si so bi bo in cs us sy
id
1 0 0 87364 15448 2540 1218324 1 1 20 13 7 17 16 4
18
3 0 0 87364 12596 2552 1219336 36 0 964 0 504 1385 30 16
53
4 1 0 87364 7684 2556 1220568 0 0 1196 768 687 1645 28 16
56
2 1 0 87364 6172 2568 1218940 0 0 1156 0 604 1485 31 11
58
10 0 0 87364 5648 2572 1218764 0 0 972 0 366 1012 23 18
60
5 0 0 87364 6216 2580 1218208 0 0 604 0 411 1154 28 20
52
12 0 0 87364 5832 2580 1216952 0 0 272 0 317 1567 31 23
46
12 0 0 87364 6048 2592 1216100 0 0 168 384 448 2011 47 13
40
11 0 0 87364 5920 2596 1216116 0 0 20 0 284 2307 58 12
30
9 0 0 87364 6556 2600 1216136 0 0 12 0 284 2164 53 9
38
11 0 0 87364 6536 2608 1216220 0 0 68 0 324 2140 50 12
38
23 0 0 87364 7176 2612 1216232 0 0 4 0 317 2238 46 9
46
12 0 0 87364 6880 2616 1216360 16 0 140 256 440 2367 33 18
49
9 0 0 87364 6380 2644 1216692 0 0 336 0 454 2874 28 17
55
7 0 0 87364 5956 2648 1216772 0 0 52 0 550 2659 28 17
55
1 0 0 87364 5420 2628 1215544 0 0 32 0 538 648 13 2
85
0 0 0 87364 6180 2632 1215636 0 0 72 0 421 430 11 3
86
0 0 0 87364 6152 2632 1215664 0 0 20 1152 328 225 6 1
92
0 0 0 87364 6116 2632 1215696 0 0 36 0 278 242 7 1
91
0 0 0 87364 5512 2580 1214524 0 0 152 0 433 426 10 4
87
0 0 0 87364 9072 2580 1214536 0 0 8 0 303 290 8 1
91

"Jeffrey W. Baker" wrote:

> Here's a slice of vmstat 1 on my 2-way x86, 2GB main memory machine
> running Postgres 7.2beta4 on Linux 2.4.17:
> cpu
> r b w bi bo in cs us sy id
> 7 0 0 371612 58272 18576 1568896 0 0 0 168 414 33113 49 38 13
> 9 0 0 371612 59168 18576 1568900 0 0 0 64 215 32143 56 36 8
> 5 0 0 371612 58532 18576 1568924 0 0 0 696 363 33553 52 41 7
> 8 0 0 371612 59344 18576 1568956 0 0 16 240 374 34237 52 38 9
> 3 0 0 371612 58860 18576 1568996 0 0 0 128 254 31848 51 38 11
> 6 0 0 371612 59172 18576 1568996 0 0 0 64 234 36340 56 30 14
> 3 0 0 371612 59092 18576 1569004 0 0 0 232 204 32065 48 42 11
> ^^^^^
> Check out those figures for context switches! 30,000 switches per second
> with only three runnable processes and practically no block I/O seems
> quite high to me. You can also see that the system is spending half its
> time in the kernel, presumably in the scheduler. Postgres is barely
> getting any CPU time at all, and the performance suffers noticeably.
>
> Is this a scheduler worst-case, something to be expected, or something I
> can work around?
>
> Please CC me since vger's majordomo is an impossible chunk of shit.
>
> -jwb
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-12-28 17:05:58

by Davide Libenzi

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

2001-12-28 17:18:58

by Davide Libenzi

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001, Stephan von Krawczynski wrote:

> On Fri, 28 Dec 2001 09:07:48 -0800 (PST)
> Davide Libenzi <[email protected]> wrote:
>
> > The scheduler that Linus merged in 2.5.2-pre3 will solve the problem.
>
> Could you kindly provide a patch for 2.4.17 for Jeffrey to test and give
> comparison results to the list. This could be interesting for 2.4 too (which
> the world uses nowadays (and in the near future))?
>
> Thanks for your help ;-)

You can't, the scheduler is changed.
Try this, in sys_sched_yield() remove :

spin_lock_irq(&runqueue_lock);
move_last_runqueue(current);
spin_unlock_irq(&runqueue_lock);

and replace it with :

local_irq_disable();
if (current->counter > 0)
--current->counter;
local_irq_enable();

- Davide

2001-12-28 17:14:58

by Stephan von Krawczynski

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001 09:07:48 -0800 (PST)
Davide Libenzi <[email protected]> wrote:

> The scheduler that Linus merged in 2.5.2-pre3 will solve the problem.

Could you kindly provide a patch for 2.4.17 for Jeffrey to test and give
comparison results to the list. This could be interesting for 2.4 too (which
the world uses nowadays (and in the near future))?

Thanks for your help ;-)

Stephan

2001-12-28 17:30:21

by Davide Libenzi

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001, Alan Cox wrote:

> > > The scheduler is _good_ at the three process case. Run some straces it looks
> > > more like postgres is doing wacky yield based locks.
> >
> > The scheduler that Linus merged in 2.5.2-pre3 will solve the problem.
>
> Looking at the postgres traces here it wont make any difference at all. Not
> one iota. If I am reading it right I have processes each going
> yield, yield, yield... so the kernel does just that (and indeed posix
> semantics require that behaviour).

task A old the lock , counter = 2
task B counter = 5 and task C counter = 4 are woke up
try to look at a switch dump.
the scheduler will spend 3 entire time slices switching between B and C
before A will get back the CPU and will free the lock.

- Davide

2001-12-28 17:27:29

by Alan

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

> > The scheduler is _good_ at the three process case. Run some straces it looks
> > more like postgres is doing wacky yield based locks.
>
> The scheduler that Linus merged in 2.5.2-pre3 will solve the problem.

Looking at the postgres traces here it wont make any difference at all. Not
one iota. If I am reading it right I have processes each going
yield, yield, yield... so the kernel does just that (and indeed posix
semantics require that behaviour).

Alan

2001-12-28 17:34:40

by Alan

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

> local_irq_disable();
> if (current->counter > 0)
> --current->counter;
> local_irq_enable();

Umm: SuS sayeth..

DESCRIPTION

The sched_yield() function forces the running thread to relinquish the
processor until it again becomes the head of its thread list. It takes
no arguments.

Which doesnt seem to be what you are doing.

2001-12-28 17:35:50

by Alan

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

> the scheduler will spend 3 entire time slices switching between B and C
> before A will get back the CPU and will free the lock.

Ugggh. That sounds horribly plausible.

2001-12-28 17:39:42

by Davide Libenzi

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001, Alan Cox wrote:

> > local_irq_disable();
> > if (current->counter > 0)
> > --current->counter;
> > local_irq_enable();
>
> Umm: SuS sayeth..
>
> DESCRIPTION
>
> The sched_yield() function forces the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes
> no arguments.
>
> Which doesnt seem to be what you are doing.

1) the scheduler in 2.5.2-pre3 does it in a different way because the
dynamic priority is split from the time slice.

2) the current scheduler does not permit you doing such a thing in a smart
way so, if i've to choose i prefer an implementation that solves
real world cases. i challenge you to measure the counter tick loss
during the yield() call anyway. it's very easy indeed to measure
the current behavior, like we're currently seeing

- Davide

2001-12-28 17:45:20

by Jeffrey W. Baker

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001, Alan Cox wrote:

> > Check out those figures for context switches! 30,000 switches per second
> > with only three runnable processes and practically no block I/O seems
> > quite high to me. You can also see that the system is spending half its
> ..
> > Is this a scheduler worst-case, something to be expected, or something I
> > can work around?
>
> The scheduler is _good_ at the three process case. Run some straces it looks
> more like postgres is doing wacky yield based locks.

All I see in strace is semop forever

[pid 10076] 0.000054 semop(1179648, 0xbfffe6e8, 1 <unfinished ...>
[pid 10077] 0.000224 <... semop resumed> ) = 0
[pid 10077] 0.000077 semop(1179648, 0xbfffe1e8, 1) = 0
[pid 10077] 0.000057 semop(1179648, 0xbfffe0f8, 1 <unfinished ...>
[pid 10076] 0.000128 <... semop resumed> ) = 0
[pid 10076] 0.000035 semop(1179648, 0xbfffe6a8, 1) = 0
[pid 10076] 0.000127 semop(1179648, 0xbfffe758, 1 <unfinished ...>
[pid 10077] 0.000085 <... semop resumed> ) = 0
[pid 10077] 0.000075 semop(1179648, 0xbfffe0f8, 1) = 0
[pid 10077] 0.000155 semop(1179648, 0xbfffdfb8, 1 <unfinished ...>
[pid 10076] 0.000401 <... semop resumed> ) = 0
[pid 10076] 0.000034 semop(1179648, 0xbfffe758, 1) = 0
[pid 10076] 0.000046 semop(1179648, 0xbfffe758, 1 <unfinished ...>
[pid 10077] 0.000113 <... semop resumed> ) = 0
[pid 10077] 0.000040 semop(1179648, 0xbfffdf78, 1) = 0
[pid 10077] 0.000051 semop(1179648, 0xbfffdfc8, 1 <unfinished ...>
[pid 10076] 0.000317 <... semop resumed> ) = 0
[pid 10076] 0.000055 semop(1179648, 0xbfffe718, 1) = 0
[pid 10076] 0.000083 semop(1179648, 0xbfffe8d8, 1 <unfinished ...>
[pid 10077] 0.000217 <... semop resumed> ) = 0
[pid 10077] 0.000091 semop(1179648, 0xbfffdfc8, 1) = 0
[pid 10077] 0.000057 semop(1179648, 0xbfffdfa8, 1 <unfinished ...>
[pid 10076] 0.000191 <... semop resumed> ) = 0
[pid 10076] 0.000037 semop(1179648, 0xbfffe898, 1) = 0
[pid 10076] 0.000054 semop(1179648, 0xbfffe928, 1 <unfinished ...>
[pid 10077] 0.000056 <... semop resumed> ) = 0
[pid 10077] 0.000034 semop(1179648, 0xbfffdf68, 1) = 0

-jwb

2001-12-28 18:03:11

by Davide Libenzi

[permalink] [raw]

Subject: Re: 2.4.17 absurd number of context switches

On Fri, 28 Dec 2001, Jeffrey W. Baker wrote:

>
>
> On Fri, 28 Dec 2001, Alan Cox wrote:
>
> > > Check out those figures for context switches! 30,000 switches per second
> > > with only three runnable processes and practically no block I/O seems
> > > quite high to me. You can also see that the system is spending half its
> > ..
> > > Is this a scheduler worst-case, something to be expected, or something I
> > > can work around?
> >
> > The scheduler is _good_ at the three process case. Run some straces it looks
> > more like postgres is doing wacky yield based locks.
>
> All I see in strace is semop forever
>
> [pid 10076] 0.000054 semop(1179648, 0xbfffe6e8, 1 <unfinished ...>
> [pid 10077] 0.000224 <... semop resumed> ) = 0
> [pid 10077] 0.000077 semop(1179648, 0xbfffe1e8, 1) = 0
> [pid 10077] 0.000057 semop(1179648, 0xbfffe0f8, 1 <unfinished ...>
> [pid 10076] 0.000128 <... semop resumed> ) = 0
> [pid 10076] 0.000035 semop(1179648, 0xbfffe6a8, 1) = 0
> [pid 10076] 0.000127 semop(1179648, 0xbfffe758, 1 <unfinished ...>
> [pid 10077] 0.000085 <... semop resumed> ) = 0
> [pid 10077] 0.000075 semop(1179648, 0xbfffe0f8, 1) = 0
> [pid 10077] 0.000155 semop(1179648, 0xbfffdfb8, 1 <unfinished ...>
> [pid 10076] 0.000401 <... semop resumed> ) = 0
> [pid 10076] 0.000034 semop(1179648, 0xbfffe758, 1) = 0
> [pid 10076] 0.000046 semop(1179648, 0xbfffe758, 1 <unfinished ...>
> [pid 10077] 0.000113 <... semop resumed> ) = 0
> [pid 10077] 0.000040 semop(1179648, 0xbfffdf78, 1) = 0
> [pid 10077] 0.000051 semop(1179648, 0xbfffdfc8, 1 <unfinished ...>
> [pid 10076] 0.000317 <... semop resumed> ) = 0
> [pid 10076] 0.000055 semop(1179648, 0xbfffe718, 1) = 0
> [pid 10076] 0.000083 semop(1179648, 0xbfffe8d8, 1 <unfinished ...>
> [pid 10077] 0.000217 <... semop resumed> ) = 0
> [pid 10077] 0.000091 semop(1179648, 0xbfffdfc8, 1) = 0
> [pid 10077] 0.000057 semop(1179648, 0xbfffdfa8, 1 <unfinished ...>
> [pid 10076] 0.000191 <... semop resumed> ) = 0
> [pid 10076] 0.000037 semop(1179648, 0xbfffe898, 1) = 0
> [pid 10076] 0.000054 semop(1179648, 0xbfffe928, 1 <unfinished ...>
> [pid 10077] 0.000056 <... semop resumed> ) = 0
> [pid 10077] 0.000034 semop(1179648, 0xbfffdf68, 1) = 0

It's not the a sys_sched_yield() problem. probably and IPC_NOWAIT issue

- Davide

2001-12-29 20:50:44

by Jeffrey W. Baker

[permalink] [raw]