From: Andy Lutomirski
Date: Tue, 28 Apr 2015 13:42:10 -0700
Subject: Re: [GIT PULL] kdbus for 4.1-rc1
To: David Lang
Cc: Havoc Pennington, Linus Torvalds, Lukasz Skalski, Greg Kroah-Hartman,
    Andrew Morton, Arnd Bergmann, "Eric W. Biederman", One Thousand Gnomes,
    Tom Gundersen, Jiri Kosina, "linux-kernel@vger.kernel.org", Daniel Mack,
    David Herrmann, Djalal Harouni
References: <20150413190350.GA9485@kroah.com> <20150423130548.GA4253@kroah.com>
    <20150423163616.GA10874@kroah.com> <20150423171640.GA11227@kroah.com>
    <553A4A2F.5090406@samsung.com>

On Tue, Apr 28, 2015 at 1:34 PM, David Lang wrote:
> On Tue, 28 Apr 2015, Havoc Pennington wrote:
>
>> On Tue, Apr 28, 2015 at 1:19 PM, David Lang wrote:
>>>
>>> If the examples that are being used to show the performance advantage of
>>> kdbus vs normal dbus are doing the wrong thing, then we need to get some
>>> other examples available to people who don't live and breathe dbus that
>>> 'do things right', so that the kernel developers can see what you think
>>> is the real problem and how kdbus addresses it.
>>>
>>> So far, this 'wrong' example is the only thing that's been posted to show
>>> the performance advantage of kdbus.
>>
>> I'm hopeful someone will do that.
>>
>> fwiw, I would be suspicious of a broken benchmark if it didn't show:
>>
>> * the bus daemon means an extra read/parse and marshal/write per
>>   message, so 4 vs. 2
>> * the existence of the bus daemon therefore makes a message
>>   send/receive take roughly twice as long
>>
>> https://lwn.net/Articles/580194/ has a bit more elaboration about the
>> number of copies, validations, and context switches in each case.
>>
>> From what I can tell, the core performance claim for kdbus is that for
>> a userspace daemon to be a routing intermediary, it has to receive and
>> re-send messages. If the baseline performance of IPC is the cost to
>> send once and receive once, adding the daemon means there's twice as
>> much to do (1 more receive, 1 more send). However fast you make
>> send/receive, the daemon always means there are twice as many
>> send/receives as there would be with no daemon.
>
> There are twice as many context switches, nobody disputes that; the
> question is whether it matters.
>
> It doesn't matter whether the message router is in kernel space or user
> space, it still needs to read/parse and marshal/write the data, so you
> aren't saving that time by being in the kernel.
>
>> If that isn't what a benchmark shows, then there's a mystery to
>> explain... (one disruption to the ratio of course could be if the
>> clients use a much faster or slower dbus lib than the daemon)
>>
>> As noted many times, of course this 2x penalty for the daemon was a
>> conscious tradeoff - kdbus is trying to escape the tradeoff in order
>> to extend usage of dbus to more use cases.
>> Given the tradeoff, _existing_ uses of dbus seem to prefer the
>> performance hit to the loss of useful semantics, but potential new
>> users would like to, or need to, have both.
>
> If there is a 2x performance improvement from being in the kernel, but a
> 100x performance improvement from fixing the userspace code, the effort
> should be spent on the userspace code, not on moving things into kernel
> space.

I would guess that, if we compared a highly optimized userspace
implementation to a kernel implementation, we'd see less than a 2x
difference. After all, a userspace daemon doesn't really need to
unmarshal and re-marshal anything except the headers. For large
messages, we could use splice and avoid a couple of copies, too.

If the scheduler became a bottleneck, it could be interesting to add
something like a send-and-poll primitive. I suspect that some workloads
currently do unnecessary context switches with only standard POSIX
primitives: if A sends a message to B, there's a brief window in which
both A and B are runnable. Ideally we wouldn't context switch until A
calls poll or epoll_wait, but I don't know how well that works in
practice.

There's more room for generic improvements than just that. At LSF/MM we
were talking about more scalable epoll variants that would allow a
multithreaded daemon to be woken up on the core that received the
incoming data. That would allow an efficient multi-queue dbus with fewer
migrations and IPIs.

At some point, I'd like to implement PCID on x86 (if no one beats me to
it, and this is a low priority for me), which would allow us to skip
expensive TLB flushes when context switching. I have no idea whether ARM
can do something similar.

--Andy
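
To make the "only the headers" point above concrete, here is a minimal
sketch of what the hot path of a userspace router could look like: it
reads the 16-byte fixed D-Bus header (endianness, type, flags, version,
body length, serial, header-field array length), computes the total
message size, and forwards the bytes verbatim without unmarshalling the
body. The dst_for() routing lookup is hypothetical, little-endian
messages are assumed, and oversized messages are simply rejected where a
real router would loop or use splice().

#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Read exactly len bytes or fail. */
static int read_full(int fd, void *p, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = read(fd, (char *)p + off, len - off);
		if (n <= 0)
			return -1;
		off += n;
	}
	return 0;
}

/* Forward one message from src_fd to whatever socket dst_for() picks.
 * dst_for() is a hypothetical lookup that inspects only the header. */
static int forward_one(int src_fd, int (*dst_for)(const uint8_t *hdr))
{
	uint8_t buf[64 * 1024];
	uint32_t body_len, fields_len;
	size_t total;

	/* Parse only the 16-byte fixed header; the body stays opaque. */
	if (read_full(src_fd, buf, 16))
		return -1;

	memcpy(&body_len, buf + 4, 4);		/* body length, offset 4 */
	memcpy(&fields_len, buf + 12, 4);	/* header-field array length */

	/* The header fields are padded to an 8-byte boundary before the body. */
	total = 16 + ((fields_len + 7) & ~(size_t)7) + body_len;
	if (total > sizeof(buf))
		return -1;	/* a real router would loop or splice() here */

	if (read_full(src_fd, buf + 16, total - 16))
		return -1;

	/* One read and one write per hop; nothing is re-marshalled. */
	return write(dst_for(buf), buf, total) == (ssize_t)total ? 0 : -1;
}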
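
And a minimal sketch of the send-then-wait pattern discussed above,
assuming bus_fd is already registered with epfd for EPOLLIN; the
combined send-and-poll primitive mentioned in the mail is hypothetical
and not shown here.

#include <sys/epoll.h>
#include <unistd.h>

/* Write a request, then block for the reply.  Between write() and
 * epoll_wait() both the client and the daemon are runnable; ideally the
 * scheduler keeps running the client until epoll_wait(), so the switch
 * to the daemon happens exactly once per request. */
static int send_and_wait(int epfd, int bus_fd, const void *req, size_t len)
{
	struct epoll_event ev;

	if (write(bus_fd, req, len) != (ssize_t)len)
		return -1;

	return epoll_wait(epfd, &ev, 1, -1);
}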