2015-07-08 15:45:56

by Andres Freund

Subject: Significant performance difference for postgres w/o sched_autogroup

Hi,

While debugging a performance issue, I noticed the following very odd
behaviour.

Kernel: 4.1.0-rc7-andres-00049-ge64f638
CPU: 2xE5520

Workload:

postgresql (multi-process via shared memory SQL server) and benchmark
client (pgbench, multi threaded) running on the same server. Connected
using unix sockets. The statements are relatively simple (~1.5ms on
average), forcing frequent back/forth between server processes and
pgbench threads.

I found that disabling sched_autogroup *significantly* reduces
throughput. Even when both server and client are started from the same
terminal and thus should be in the same group!
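(For reference, a minimal sketch of how the knob can be toggled between
runs; this assumes the standard sysctl and a kernel built with
CONFIG_SCHED_AUTOGROUP:)

```shell
# Check and toggle the autogroup feature at runtime (as root).
cat /proc/sys/kernel/sched_autogroup_enabled    # 1 = enabled, 0 = disabled
sysctl -w kernel.sched_autogroup_enabled=0      # disable
sysctl -w kernel.sched_autogroup_enabled=1      # re-enable
```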

There's a significant difference in %sys between autogroup enabled and
disabled: ~8% vs. ~27%. That seems like too much.

The difference in postgres performance is significant:
enabled:
progress: 1.0 s, 25857.2 tps, lat 1.171 ms stddev 3.314
progress: 2.0 s, 19582.5 tps, lat 1.586 ms stddev 5.349
progress: 3.0 s, 17796.3 tps, lat 1.804 ms stddev 7.364
progress: 4.0 s, 18736.7 tps, lat 1.696 ms stddev 7.612
progress: 5.0 s, 19269.9 tps, lat 1.640 ms stddev 6.835
progress: 6.0 s, 21389.6 tps, lat 1.467 ms stddev 5.663
progress: 7.0 s, 22153.8 tps, lat 1.417 ms stddev 8.417
progress: 8.0 s, 21550.7 tps, lat 1.487 ms stddev 7.460
progress: 9.0 s, 22602.3 tps, lat 1.401 ms stddev 6.738
progress: 10.0 s, 23475.5 tps, lat 1.327 ms stddev 7.111
tps = 21329.219141 (excluding connections establishing)

disabled:
progress: 1.0 s, 25092.5 tps, lat 1.255 ms stddev 0.511
progress: 2.0 s, 13322.0 tps, lat 2.377 ms stddev 5.435
progress: 3.0 s, 13903.0 tps, lat 2.298 ms stddev 5.075
progress: 4.0 s, 13734.0 tps, lat 2.328 ms stddev 5.146
progress: 5.0 s, 14015.0 tps, lat 2.281 ms stddev 5.099
progress: 6.0 s, 14034.0 tps, lat 2.277 ms stddev 5.002
progress: 7.0 s, 13930.0 tps, lat 2.297 ms stddev 5.136
progress: 8.0 s, 14015.0 tps, lat 2.279 ms stddev 5.161
progress: 9.0 s, 14000.0 tps, lat 2.287 ms stddev 5.117
progress: 10.0 s, 13900.6 tps, lat 2.299 ms stddev 5.142
tps = 15006.317841 (excluding connections establishing)

(Note the odd dip after the first second; that's what I was actually debugging.)

A perf diff of profiles recorded with -g -a -e cycles:k shows:
# Baseline Delta Shared Object Symbol
# ........ ....... ................... ..............................................
#
11.70% +3.87% [kernel.vmlinux] [k] unix_stream_recvmsg
6.74% +2.49% [kernel.vmlinux] [k] __fget
6.69% +4.31% [kernel.vmlinux] [k] sys_recvfrom
6.49% +1.77% [kernel.vmlinux] [k] system_call_after_swapgs
5.37% +1.39% [kernel.vmlinux] [k] _raw_spin_lock
4.63% +1.76% [kernel.vmlinux] [k] mutex_lock_interruptible
4.27% +1.27% [kernel.vmlinux] [k] mutex_unlock
3.43% +1.22% [kernel.vmlinux] [k] fput
3.41% -0.21% [kernel.vmlinux] [k] __fget_light
2.96% +1.33% libpthread-2.19.so [.] __libc_recv
2.56% +2.35% [kernel.vmlinux] [k] system_call_fastpath
2.36% +0.80% [kernel.vmlinux] [k] import_single_range
2.30% +0.82% [kernel.vmlinux] [k] sockfd_lookup_light

Looking a bit deeper into the profile shows an apparent difference in how
often recv goes through the fastpath:

enabled:
- 6.55% pgbench [kernel.vmlinux] [k] sys_recvfrom
- sys_recvfrom
+ 85.88% system_call_fastpath
+ 14.12% __libc_recv

disabled:
- 10.97% pgbench [kernel.vmlinux] [k] sys_recvfrom
- sys_recvfrom
+ 74.16% system_call_fastpath
+ 25.84% __libc_recv

Is that actually expected, or is there a bug somewhere? The difference
seems a bit large to me.

If it would be useful, I can try to script the workload in a reproducible
manner.
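As a rough sketch, the workload could be driven along these lines (the
scale factor, client/thread counts, and duration here are illustrative
guesses, not the exact parameters behind the numbers above):

```shell
# Hypothetical reproduction sketch -- parameters are guesses, not the
# exact values used above. Assumes a running postgres reachable via its
# default unix socket and a database named "bench".
pgbench -i -s 100 bench                   # initialize the test database
pgbench -c 16 -j 16 -S -P 1 -T 10 bench   # read-only run, 1s progress reports
```

(-S selects pgbench's built-in read-only script; -P 1 produces the
per-second progress lines shown above.)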

Regards,

Andres


2015-07-09 02:46:24

by Mike Galbraith

Subject: Re: Significant performance difference for postgres w/o sched_autogroup

On Wed, 2015-07-08 at 17:45 +0200, Andres Freund wrote:

> Workload:
>
> postgresql (multi-process via shared memory SQL server) and benchmark
> client (pgbench, multi threaded) running on the same server. Connected
> using unix sockets. The statements are relatively simple (~1.5ms on
> average), forcing frequent back/forth between server processes and
> pgbench threads.
>
> I found that disabling sched_autogroup *significantly* reduces
> throughput. Even when both server and client are started from the same
> terminal and thus should be in the same group!
>
> There's a significant difference in %sys between autogroup enabled and
> disabled: ~8% vs. ~27%. That seems like too much.

Seems reasonable to me. 1(tg)/2(tgs) > 1(task)/N(tasks), throughput is
what the server can sustain on its given budget, larger budget means
less client blockage, thus less %sys.

-Mike

2015-07-09 09:28:40

by Andres Freund

Subject: Re: Significant performance difference for postgres w/o sched_autogroup

On 2015-07-09 04:45:38 +0200, Mike Galbraith wrote:
> On Wed, 2015-07-08 at 17:45 +0200, Andres Freund wrote:
> > Workload:
> >
> > postgresql (multi-process via shared memory SQL server) and benchmark
> > client (pgbench, multi threaded) running on the same server. Connected
> > using unix sockets. The statements are relatively simple (~1.5ms on
> > average), forcing frequent back/forth between server processes and
> > pgbench threads.
> >
> > I found that disabling sched_autogroup *significantly* reduces
> > throughput. Even when both server and client are started from the same
> > terminal and thus should be in the same group!
> >
> > There's a significant difference in %sys between autogroup enabled and
> > disabled: ~8% vs. ~27%. That seems like too much.
>
> Seems reasonable to me. 1(tg)/2(tgs) > 1(task)/N(tasks), throughput is
> what the server can sustain on its given budget, larger budget means
> less client blockage, thus less %sys.

That'd make some sense if there were other stuff going on - but here the
same total budget in both cases leads to a 40% difference in throughput:

autogroup on: tps = 21329.219141; autogroup off: tps = 15006.317841

There's really nothing else for the kernel or background tasks to do.
It's a read-only workload that doesn't require much memory, ...
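For what it's worth, the delta works out to roughly 42% (plain arithmetic
on the two tps figures quoted above):

```shell
# Relative throughput loss with autogroup disabled, computed from the
# two tps figures quoted above.
awk 'BEGIN { printf "%.1f%%\n", (21329.219141 - 15006.317841) / 15006.317841 * 100 }'
# prints 42.1%
```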

2015-07-09 10:36:09

by Mike Galbraith

Subject: Re: Significant performance difference for postgres w/o sched_autogroup

On Thu, 2015-07-09 at 11:28 +0200, Andres Freund wrote:

> That'd make some sense if there were other stuff going on - but here the
> same total budget in both cases leads to a 40% difference in throughput:

When I suspect that what I'm hearing is horse pookey, I instrument the
source and ask my silicon horse pookey detector :)

-Mike