Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965086AbbD0UJF (ORCPT ); Mon, 27 Apr 2015 16:09:05 -0400 Received: from mail-la0-f50.google.com ([209.85.215.50]:32940 "EHLO mail-la0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965019AbbD0UJB (ORCPT ); Mon, 27 Apr 2015 16:09:01 -0400 MIME-Version: 1.0 From: Andy Lutomirski Date: Mon, 27 Apr 2015 13:08:39 -0700 Message-ID: Subject: D-bus is three orders of magnitude too slow (Re: [GIT PULL] kdbus for 4.1-rc1) To: Lukasz Skalski Cc: Greg Kroah-Hartman , Linus Torvalds , Andrew Morton , Arnd Bergmann , "Eric W. Biederman" , One Thousand Gnomes , Tom Gundersen , Jiri Kosina , "linux-kernel@vger.kernel.org" , Daniel Mack , David Herrmann , Djalal Harouni Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7709 Lines: 168 On Fri, Apr 24, 2015 at 6:50 AM, Lukasz Skalski wrote: > Hi All, > > On 04/23/2015 07:16 PM, Greg Kroah-Hartman wrote: >> On Thu, Apr 23, 2015 at 09:46:22AM -0700, Andy Lutomirski wrote: >>> - There's still an open performance question. Namely: is kdbus performant? >> >> Yes, I thought that was already answered. Tizen posted some numbers >> with a much older version of the code, before David fixed a bunch of >> issues that he and you found, and that averaged between 25-50% faster. >> Details are in this presentation: >> http://download.tizen.org/misc/media/conference2014/slides/tdc2014-kdbus-in-tizen3.pdf >> >> The Tizen and GENIVI developers are off running numbers with the latest >> code, or so they told me through emails, but I don't know when/if that >> will ever happen, so I can't promise more than what is already here. >> > > I'm working on kdbus support for GLib ([1],[2]). I saw some questions > about kdbus performance, so I've prepared simple benchmark. Because > David already has posted some comparison results between kdbus and UDS, > I've decided to use my GLib port with native kdbus support (it should > be noted, that this port is not finished yet and there are still some > places for improvements, thus please do not treat these test results as > final). > > To perform tests I've created two simple apps: > > - server: http://fpaste.org/215157/ > - client: http://fpaste.org/215156/ After some fiddling (what's this G_BUS_TYPE_USER thing? G_BUS_TYPE_SESSION seems to work), I got this to run. Can we please take a big step back from the kernel-vs-userspace debate here? Can we try to understand why it's so bloody slow before thinking about merging something that might ossify it? With a slightly improved test (it shows how many calls happened) and userspace dbus, I got: $ taskset -c 0 dbus-launch bash $ perf stat ./test.sh HANDLER: bus_acquired_handler HANDLER: name_acquired_handler 20000 calls in 4.978810 s = 0.248940 ms per call Performance counter stats for './test.sh': 3866.770672 task-clock (msec) # 0.738 CPUs utilized 300,633 context-switches # 0.078 M/sec 3 cpu-migrations # 0.001 K/sec 817 page-faults # 0.211 K/sec 13,572,244,134 cycles # 3.510 GHz [83.26%] 8,799,771,026 stalled-cycles-frontend # 64.84% frontend cycles idle [82.75%] 6,914,976,524 stalled-cycles-backend # 50.95% backend cycles idle [68.92%] 9,915,362,980 instructions # 0.73 insns per cycle # 0.89 stalled cycles per insn [84.82%] 2,343,233,242 branches # 605.992 M/sec [82.25%] 44,699,493 branch-misses # 1.91% of all branches [82.83%] 5.238823116 seconds time elapsed That's more than 15 context switches per call. Something is severely broken. There should be almost exactly four context switches per call. NB: subtract 250ms from the elapsed time. test.sh contains a sleep 0.25 to work around a stupid race. This performance is absolutely terrible. Kdbus should be 100-1000x faster, not 2x faster, so kdbus' performance is still absolutely terrible. Sure, we can crank the message size so high that the only thing that matters is the per-byte overhead, in which case anything that uses memfd will win, but dbus-daemon could pretty easily do that to. Anyway, here's part of the problem. The client does this for each message (I added the nanosleep for this test so I could see what was going on): [pid 29592] eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 6 [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29592] write(5, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6, revents=POLLIN}]) [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6, revents=POLLIN}]) [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8 [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000 [pid 29593] <... poll resumed> ) = 1 ([{fd=5, revents=POLLIN}]) [pid 29593] read(5, "\1\0\0\0\0\0\0\0", 16) = 8 [pid 29593] sendmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\5\4\0\0=\4\0\0e\0\0\0\10\1g\0\1s\0\0\1\1o\0\r\0\0\0"..., 1149}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1149 [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}]) [pid 29593] read(5, 0x7f9115edfcf0, 16) = -1 EAGAIN (Resource temporarily unavailable) [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29593] recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\7\0\0\0>\4\0\0-\0\0\0", 16}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}]) [pid 29593] recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"\10\1g\0\1s\0\0\5\1u\0=\4\0\0\6\1s\0\4\0\0\0:1.1\0\0\0\0"..., 55}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 55 [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29593] poll([{fd=5, events=POLLIN}], 1, 0) = 1 ([{fd=5, revents=POLLIN}]) [pid 29593] write(6, "\1\0\0\0\0\0\0\0", 8 [pid 29592] <... poll resumed> ) = 1 ([{fd=6, revents=POLLIN}]) [pid 29592] futex(0x1f5c130, FUTEX_WAIT, 2, NULL [pid 29593] <... write resumed> ) = 8 [pid 29593] futex(0x1f5c130, FUTEX_WAKE, 1 [pid 29592] <... futex resumed> ) = 0 [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8 [pid 29592] futex(0x1f5c130, FUTEX_WAKE, 1) = 0 [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29592] futex(0x1f5c140, FUTEX_WAKE, 2147483647) = 0 [pid 29592] nanosleep({0, 100000}, [pid 29593] <... futex resumed> ) = 1 [pid 29593] close(6) = 0 [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout) [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8 [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2, 4294967295) = 1 ([{fd=5, revents=POLLIN}]) [pid 29593] read(5, "\4\0\0\0\0\0\0\0", 16) = 8 [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2, 4294967295 [pid 29592] <... nanosleep resumed> NULL) = 0 Hmm. So you're closing an fd for each message, thereby forcing an RCU cleanup, and you're doing some kind of inter-thread communication using several poll calls and a few futex wakes. Can someone who likes kdbus please benchmark kdbus against something that doesn't go out of its way to be inefficient? I don't even want to review kernel code that has as a major claim to fame the ability to beat this mess by a mere factor of two. For comparison, I have some code that uses Thrift, which is dog-slow [1], to serialize a message, send it, and deserialize it on the other end. The entire process takes 28 microseconds on average. That would be almost exactly three orders of magnitude faster. Of course, I'm not comparing apples to apples here, but the species of fruit being compared does not justify a three order of magnitude difference. [1] For all I know, the glib serialization stuff is also dog-slow, or perhaps snail-slow. Nonetheless, Thrift, or at least the version I'm using, is not a shining example of performance. If you want serialization performance, use Cap'n Proto. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/