Date: Wed, 8 Jul 2015 17:45:50 +0200
From: Andres Freund
To: linux-kernel@vger.kernel.org, Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra
Subject: Significant performance difference for postgres w/o sched_autogroup
Message-ID: <20150708154550.GH340@alap3.anarazel.de>

Hi,

while debugging a performance issue I noticed the following very odd
bit.

Kernel: 4.1.0-rc7-andres-00049-ge64f638
CPU: 2x E5520
Workload: postgresql (a multi-process SQL server communicating via
shared memory) and its benchmark client (pgbench, multi-threaded), both
running on the same machine and connected using unix sockets. The
statements are relatively simple (~1.5ms on average), forcing frequent
back-and-forth between server processes and pgbench threads.

I found that disabling sched_autogroup *significantly* reduces
throughput, even when both server and client are started from the same
terminal and thus should be in the same group!

There's also a significant difference in %sys time with autogroup
enabled vs. disabled: ~8% vs. ~27%. That seems like too much.
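For reference, the comparison roughly looks like the following. This is
only a dry-run sketch (it echoes the commands instead of executing
them); the data directory, client/thread counts, and durations are my
assumptions, not the exact values used above:

```shell
#!/bin/sh
# Dry-run sketch of the autogroup on/off comparison.  Replace run() with
# direct execution (as root) to actually reproduce; paths and pgbench
# parameters below are assumptions.
run() { CMDS="$CMDS $*"; echo "$@"; }

for ag in 1 0; do
    # toggle autogroup scheduling (1 = enabled, 0 = disabled)
    run sysctl -w kernel.sched_autogroup_enabled=$ag
    # server and client share a terminal; connections go over unix sockets
    run pg_ctl -D /srv/pgdata restart
    # record kernel-side cycles system-wide while the benchmark runs
    run perf record -g -a -e cycles:k -o perf.data.ag$ag -- \
        pgbench -h /tmp -c 16 -j 16 -P 1 -T 10 pgbench
done
# compare the two profiles
run perf diff perf.data.ag1 perf.data.ag0
```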
The difference in postgres performance is significant:

enabled:
progress: 1.0 s, 25857.2 tps, lat 1.171 ms stddev 3.314
progress: 2.0 s, 19582.5 tps, lat 1.586 ms stddev 5.349
progress: 3.0 s, 17796.3 tps, lat 1.804 ms stddev 7.364
progress: 4.0 s, 18736.7 tps, lat 1.696 ms stddev 7.612
progress: 5.0 s, 19269.9 tps, lat 1.640 ms stddev 6.835
progress: 6.0 s, 21389.6 tps, lat 1.467 ms stddev 5.663
progress: 7.0 s, 22153.8 tps, lat 1.417 ms stddev 8.417
progress: 8.0 s, 21550.7 tps, lat 1.487 ms stddev 7.460
progress: 9.0 s, 22602.3 tps, lat 1.401 ms stddev 6.738
progress: 10.0 s, 23475.5 tps, lat 1.327 ms stddev 7.111
tps = 21329.219141 (excluding connections establishing)

disabled:
progress: 1.0 s, 25092.5 tps, lat 1.255 ms stddev 0.511
progress: 2.0 s, 13322.0 tps, lat 2.377 ms stddev 5.435
progress: 3.0 s, 13903.0 tps, lat 2.298 ms stddev 5.075
progress: 4.0 s, 13734.0 tps, lat 2.328 ms stddev 5.146
progress: 5.0 s, 14015.0 tps, lat 2.281 ms stddev 5.099
progress: 6.0 s, 14034.0 tps, lat 2.277 ms stddev 5.002
progress: 7.0 s, 13930.0 tps, lat 2.297 ms stddev 5.136
progress: 8.0 s, 14015.0 tps, lat 2.279 ms stddev 5.161
progress: 9.0 s, 14000.0 tps, lat 2.287 ms stddev 5.117
progress: 10.0 s, 13900.6 tps, lat 2.299 ms stddev 5.142
tps = 15006.317841 (excluding connections establishing)

(note the odd dip after the first second in the disabled run; that's
what I was actually debugging)

A perf diff of -g -a -e cycles:k shows:

# Baseline    Delta  Shared Object       Symbol
# ........  .......  ..................  ..............................
    11.70%   +3.87%  [kernel.vmlinux]    [k] unix_stream_recvmsg
     6.74%   +2.49%  [kernel.vmlinux]    [k] __fget
     6.69%   +4.31%  [kernel.vmlinux]    [k] sys_recvfrom
     6.49%   +1.77%  [kernel.vmlinux]    [k] system_call_after_swapgs
     5.37%   +1.39%  [kernel.vmlinux]    [k] _raw_spin_lock
     4.63%   +1.76%  [kernel.vmlinux]    [k] mutex_lock_interruptible
     4.27%   +1.27%  [kernel.vmlinux]    [k] mutex_unlock
     3.43%   +1.22%  [kernel.vmlinux]    [k] fput
     3.41%   -0.21%  [kernel.vmlinux]    [k] __fget_light
     2.96%   +1.33%  libpthread-2.19.so  [.] __libc_recv
     2.56%   +2.35%  [kernel.vmlinux]    [k] system_call_fastpath
     2.36%   +0.80%  [kernel.vmlinux]    [k] import_single_range
     2.30%   +0.82%  [kernel.vmlinux]    [k] sockfd_lookup_light

Looking a bit further into the profile shows an apparent difference in
how often recv goes through the fastpath:

enabled:
-  6.55%  pgbench  [kernel.vmlinux]  [k] sys_recvfrom
   - sys_recvfrom
      + 85.88% system_call_fastpath
      + 14.12% __libc_recv

disabled:
- 10.97%  pgbench  [kernel.vmlinux]  [k] sys_recvfrom
   - sys_recvfrom
      + 74.16% system_call_fastpath
      + 25.84% __libc_recv

Is that actually expected, or is there a bug somewhere? The difference
seems rather large to me. If interesting, I can try to script the
workload in a reproducible manner.

Regards,

Andres