Message-id: <553F601C.6030309@samsung.com>
Date: Tue, 28 Apr 2015 12:25:32 +0200
From: Lukasz Skalski <l.skalski@samsung.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101
 Thunderbird/31.5.0
MIME-version: 1.0
To: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Arnd Bergmann <arnd@arndb.de>,
        "Eric W. Biederman" <ebiederm@xmission.com>,
        One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
        Tom Gundersen <teg@jklm.no>, Jiri Kosina <jkosina@suse.cz>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Daniel Mack <daniel@zonque.org>,
        David Herrmann <dh.herrmann@gmail.com>,
        Djalal Harouni <tixxdz@opendz.org>
Subject: Re: D-bus is three orders of magnitude too slow (Re: [GIT PULL] kdbus
 for 4.1-rc1)
References: <CALCETrWLTLqZ0pioOEHakd_S+h=F1X2qXpODmEZ73JcFZNxYig@mail.gmail.com>
In-reply-to: <CALCETrWLTLqZ0pioOEHakd_S+h=F1X2qXpODmEZ73JcFZNxYig@mail.gmail.com>
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8563
Lines: 185

On 04/27/2015 10:08 PM, Andy Lutomirski wrote:
> On Fri, Apr 24, 2015 at 6:50 AM, Lukasz Skalski <l.skalski@samsung.com> wrote:
>> Hi All,
>>
>> On 04/23/2015 07:16 PM, Greg Kroah-Hartman wrote:
>>> On Thu, Apr 23, 2015 at 09:46:22AM -0700, Andy Lutomirski wrote:
>>>>  - There's still an open performance question.  Namely: is kdbus performant?
>>>
>>> Yes, I thought that was already answered.  Tizen posted some numbers
>>> with a much older version of the code, before David fixed a bunch of
>>> issues that he and you found, and that averaged between 25-50% faster.
>>> Details are in this presentation:
>>>       http://download.tizen.org/misc/media/conference2014/slides/tdc2014-kdbus-in-tizen3.pdf
>>>
>>> The Tizen and GENIVI developers are off running numbers with the latest
>>> code, or so they told me through emails, but I don't know when/if that
>>> will ever happen, so I can't promise more than what is already here.
>>>
>>
>> I'm working on kdbus support for GLib ([1],[2]). I saw some questions
>> about kdbus performance, so I've prepared simple benchmark. Because
>> David already has posted some comparison results between kdbus and UDS,
>> I've decided to use my GLib port with native kdbus support (it should
>> be noted, that this port is not finished yet and there are still some
>> places for improvements, thus please do not treat these test results as
>> final).
>>
>> To perform tests I've created two simple apps:
>>
>> - server: http://fpaste.org/215157/
>> - client: http://fpaste.org/215156/
> 
> After some fiddling (what's this G_BUS_TYPE_USER thing?
> G_BUS_TYPE_SESSION seems to work), I got this to run.
> 

As it was discussed some time ago with GLib developers, we have a plan
to create two new bus types called "user" (G_BUS_TYPE_USER) and
"machine" (G_BUS_TYPE_MACHINE). These ones try to connect to kdbus,
falling back to the old "session" and "system" buses (respectively) in
case of failure. The "session" (G_BUS_TYPE_SESSION) and "system"
(G_BUS_TYPE_SYSTEM) bus types always connect to the dbus socket, as
today.

> Can we please take a big step back from the kernel-vs-userspace debate
> here?  Can we try to understand why it's so bloody slow before
> thinking about merging something that might ossify it?
> 
> With a slightly improved test (it shows how many calls happened) and
> userspace dbus, I got:
> 
> $ taskset -c 0 dbus-launch bash
> $ perf stat ./test.sh
> HANDLER: bus_acquired_handler
> HANDLER: name_acquired_handler
> 20000 calls in 4.978810 s = 0.248940 ms per call
> 
>  Performance counter stats for './test.sh':
> 
>        3866.770672 task-clock (msec)         #    0.738 CPUs utilized
>            300,633 context-switches          #    0.078 M/sec
>                  3 cpu-migrations            #    0.001 K/sec
>                817 page-faults               #    0.211 K/sec
>     13,572,244,134 cycles                    #    3.510 GHz
>          [83.26%]
>      8,799,771,026 stalled-cycles-frontend   #   64.84% frontend
> cycles idle    [82.75%]
>      6,914,976,524 stalled-cycles-backend    #   50.95% backend
> cycles idle    [68.92%]
>      9,915,362,980 instructions              #    0.73  insns per cycle
>                                              #    0.89  stalled cycles
> per insn [84.82%]
>      2,343,233,242 branches                  #  605.992 M/sec
>          [82.25%]
>         44,699,493 branch-misses             #    1.91% of all
> branches         [82.83%]
> 
>        5.238823116 seconds time elapsed
> 
> 
> That's more than 15 context switches per call.  Something is severely
> broken.  There should be almost exactly four context switches per
> call.
> 
> NB: subtract 250ms from the elapsed time.  test.sh contains a sleep
> 0.25 to work around a stupid race.
> 
> This performance is absolutely terrible.  Kdbus should be 100-1000x
> faster, not 2x faster, so kdbus' performance is still absolutely
> terrible.
> 
> Sure, we can crank the message size so high that the only thing that
> matters is the per-byte overhead, in which case anything that uses
> memfd will win, but dbus-daemon could pretty easily do that to.
> 
> Anyway, here's part of the problem.  The client does this for each
> message (I added the nanosleep for this test so I could see what was
> going on):
> 
> [pid 29592] eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 6
> [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
> revents=POLLIN}])
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
> revents=POLLIN}])
> [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000 <unfinished ...>
> [pid 29593] <... poll resumed> )        = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] read(5, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29593] sendmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"l\1\0\1\5\4\0\0=\4\0\0e\0\0\0\10\1g\0\1s\0\0\1\1o\0\r\0\0\0"...,
> 1149}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1149
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295) = 1 ([{fd=4, revents=POLLIN}])
> [pid 29593] read(5, 0x7f9115edfcf0, 16) = -1 EAGAIN (Resource
> temporarily unavailable)
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] recvmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"l\2\1\1\7\0\0\0>\4\0\0-\0\0\0", 16}], msg_controllen=0,
> msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
> [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
> [pid 29593] recvmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"\10\1g\0\1s\0\0\5\1u\0=\4\0\0\6\1s\0\4\0\0\0:1.1\0\0\0\0"...,
> 55}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
> = 55
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}], 1, 0) = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] write(6, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
> [pid 29592] <... poll resumed> )        = 1 ([{fd=6, revents=POLLIN}])
> [pid 29592] futex(0x1f5c130, FUTEX_WAIT, 2, NULL <unfinished ...>
> [pid 29593] <... write resumed> )       = 8
> [pid 29593] futex(0x1f5c130, FUTEX_WAKE, 1 <unfinished ...>
> [pid 29592] <... futex resumed> )       = 0
> [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29592] futex(0x1f5c130, FUTEX_WAKE, 1) = 0
> [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] futex(0x1f5c140, FUTEX_WAKE, 2147483647) = 0
> [pid 29592] nanosleep({0, 100000},  <unfinished ...>
> [pid 29593] <... futex resumed> )       = 1
> [pid 29593] close(6)                    = 0
> [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295) = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295 <unfinished ...>
> [pid 29592] <... nanosleep resumed> NULL) = 0
> 
> Hmm.  So you're closing an fd for each message, thereby forcing an RCU
> cleanup, and you're doing some kind of inter-thread communication
> using several poll calls and a few futex wakes.
> 
> Can someone who likes kdbus please benchmark kdbus against something
> that doesn't go out of its way to be inefficient?  I don't even want
> to review kernel code that has as a major claim to fame the ability to
> beat this mess by a mere factor of two.
> 
> For comparison, I have some code that uses Thrift, which is dog-slow
> [1], to serialize a message, send it, and deserialize it on the other
> end.  The entire process takes 28 microseconds on average.  That would
> be almost exactly three orders of magnitude faster.  Of course, I'm
> not comparing apples to apples here, but the species of fruit being
> compared does not justify a three order of magnitude difference.
> 
> [1] For all I know, the glib serialization stuff is also dog-slow, or
> perhaps snail-slow.  Nonetheless, Thrift, or at least the version I'm
> using, is not a shining example of performance.  If you want
> serialization performance, use Cap'n Proto.
> 
> --Andy
> 

-- 
Lukasz Skalski
Samsung R&D Institute Poland
Samsung Electronics
l.skalski@samsung.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/