Date: Wed, 1 Jul 2015 03:03:58 +0300
From: "Kalle A. Sandstrom"
To: linux-kernel@vger.kernel.org
Subject: Re: kdbus: to merge or not to merge?
Message-ID: <20150701000358.GA32283@molukki>

[delurk; apparently kdbus is not receiving the architectural review it
should. i've got quite a bit of knowledge on message-passing mechanisms in
general, and kernel IPC in particular, so i'll weigh in uninvited. apologies
for length. as my "proper" review on this topic is still under construction,
i'll try (and fail) to be brief here. i started down that road only to
realize that kdbus is quite the ball of mud even if the only thing under the
scope is its interface, and that if i held off until properly ready, i'd
risk kdbus having already been merged, making review moot.]

Ingo Molnar wrote:
> - I've been closely monitoring Linux kernel changes for over 20 years, and
>   for the last 10 years the linux/ipc/* code has been dormant: it works
>   and was kept good for existing usecases, but no-one was maintaining and
>   enhancing it with the future in mind.

It's my understanding that linux/ipc/* contains only SysV IPC -- i.e. shm,
sem, and message queues -- plus the POSIX message queues.
There are other IPC-implementing things in the kernel also, such as unix
domain sockets, pipes, shared memory via mmap(), signals, mappings that
appear shared across fork(), and whatever else provides either
kernel-mediated multi-client buffer access or some combination of shared
memory and synchronization that lets userspace exchange hot data across the
address space boundary.

It's also my understanding that no-one in their right mind would call SysV
IPC state-of-the-art even at the level of interface; indeed its presence in
the hoariest of vendor unixes suggests it's not supposed to be even close.
However, the suggested replacement in kdbus replicates the worst[-1] of all
known user-to-user IPC mechanisms, i.e. Mach. I'm not suggesting that Linux
adopt e.g. a different microkernel IPC mechanism -- those are by and large
inapplicable to a monolithic kernel for reasons of ABI (and, well, why would
you do IPC when function calls are zomgfast already?) -- but rather, that
the existing ones either are good enough at this time or can be reworked to
become near-equivalent to the state of the art in terms of performance.

> So there exists a technical vacuum: the kernel does not have any good,
> modern IPC ABI at the moment that distros can rely on as a 'golden
> standard'. This is partly technical, partly political. The technical
> reason is that SysV IPC is ancient and cumbersome. The political reason
> is that SystemD could be using and extending Android's existing kernel
> accelerated IPC subsystem (Binder) that is already upstream - but does
> not.

I'll contend that the reason for this vacuum is that the existing kernel IPC
interfaces are fine to the point that other mechanisms may be derived from
them solely in user space, without significant performance demerit and
without pushing ca. 10k SLOC of IPC broker and policy engine into kernel
space.
Furthermore, it's my well-ruminated opinion that implementations of the
userspace ABI specified in the kdbus 4.1-rc1 version (of April this year)
will always necessarily be slower than the existing IPC primitives in terms
of both throughput and latency; and that the latter are directly applicable
to constructing a more convenient user-space IPC broker that implements what
kdbus seeks to provide: naming, broadcast, unidirectional signaling,
bidirectional "method calls", and a policy mechanism.

In addition I'll argue that as currently specified, the kdbus interface --
even if tuned to its utmost -- is not only necessarily inferior to e.g. a
well-tuned version of unix domain sockets, but also fundamentally flawed in
ways that prohibit construction of robust in-system distributed programs by
kdbus' mechanisms alone (i.e. byzantine call-site workarounds
notwithstanding).

For the first, compare unix domain sockets (i.e. point-to-point mode, access
control through the filesystem [or fork() parentage], read/write/select) to
the kdbus message-sending ioctl. In the main data-exchanging portion, the
former requires only a connection identifier, a pointer to a buffer, and the
length of data in that buffer. To contrast, kdbus takes a complex
message-sending command structure with 0..n items of m kinds that the ioctl
must parse in an m-way switching loop, and then another complex
message-describing structure which has its own 1..n items of another m kinds
describing its contents, destination-lookup options, negotiation of
supported options, and so forth.

Consequently, a carefully optimized implementation of unix domain sockets
(and by extension all the data-carrying SysV etc. IPC primitives, optimized
similarly) will always be superior to kdbus for both message throughput and
latency, for the reason of kdbus' comparatively great interface complexity
alone. There's an obvious caveat here, i.e. "well, where is it, then?".
Given the overhead dictated by its interface, kdbus' performance is already
inferior for short messages. For long messages (> L1 cache size per
Stetson-Harrison[0]) the only performance benefit from kdbus is its claimed
single-copy mode of operation -- an equivalent to which could be had with ye
olde sockets by copying data from the writer directly into the reader while
one of them blocks[1] in the appropriate syscall. That the current Linux
pipes, SysV queues, unix domain sockets, etc. don't do this doesn't really
factor in.

For the second, kdbus is fundamentally designed to buffer message data, up
to a fixed limit, in the pool associated with each receiver's connection. I
cannot overstate the degree of this _outright architectural blunder_, so
I'll put an extra paragraph break here just for emphasis' sake.

A consequence of this buffering is that whenever a client sends a message
with kdbus, it must be prepared to handle an out-of-space non-delivery
status. (kdbus has two of those, one for queue length and another for
buffer space. why, i have no idea -- do clients behave differently in
response to one of them than to the other?) There's no option to e.g.
overwrite a previous message, or to discard queued messages in oldest-first
order, instead of rebuffing the sender.

For broadcast messaging, a recipient may observe that messages were dropped
by looking at a `dropped_msgs' field delivered (and then reset) as part of
the message-reception ioctl. Its value is the number of messages dropped
since the last read, so arguably a client could achieve the equivalent of
the condition's absence by resynchronizing explicitly, whenever the value is
>0, with all signal-senders on its current bus wrt which it knows the
protocol. This method could in principle apply to 1-to-1 unidirectional
messaging as well[2].

Looking at the kdbus "send message, wait for tagged reply" feature in
conjunction with these details appears to reveal two holes in its state
graph.
The first is that if replies are delivered through the requestor's buffer,
concurrent sends into that same buffer may cause it to become full (or the
queue to grow too long, whichever) before the service gets a chance to
reply. If this condition causes a reply to fall out of the IPC flow, the
requestor will hang until either its specified timeout happens or it gets
interrupted by a signal. If replies are delivered outside the shm pool, the
requestor must be prepared to pick them up using a different means from the
"in your pool w/ offset X, length Y" format the main-line kdbus interface
provides. [i've seen no such thing in the kdbus docs so far.]

As far as alternative solutions go, preallocation of space for a reply
message is an incomplete fix unless every reply's size has a known upper
bound (e.g. with use of an IDL compiler); in this scheme it'd be necessary
for the requestor to specify this bound, suffering consequences if the
number is too low, and to be prepared to handle a "not enough buffer space
for a reply" condition at send time. The kdbus docs specify no such
condition.

The second problem is that given how there can be a timeout or interrupt on
the receive side of a "method call" transaction, it's possible for the
requestor to bow out of the IPC flow _while the service is processing its
request_. This results either in the reply message being lost, or in its
ending up in the requestor's buffer, where it may appear in a later receive
loop that doesn't expect it. Either way, the client must at that point
resynchronize wrt all objects related to the request's side effects, or
abandon the IPC flow entirely and start over. (services need only confirm
their replies before effecting e.g. a chardev-like "destructively read N
bytes from buffer" operation's outcome, which is slightly less ugly.)
Tying this back into the first point: to prevent this type of
denial-of-service against sanguinely-written software, it's necessary for
kdbus to invoke the policy engine to determine that an unrelated participant
isn't allowed to consume a peer's buffer space. As this operation is absent
in unix domain sockets, an ideal implementation of kdbus 4.1-rc1 will be
slower in point-to-point communication even if the particulars of its
message-descriptor format get reworked into a lightweight alternative. In
addition, its API ends up requiring highly involved state-tracking wrappers
or inversion-of-control machinery in its clients, to the point where just
using unix domain sockets with a heavyweight user-space broker would be
nicer.

It's my opinionated conclusion that merging kdbus as-is would be the sort of
cock-up which we'll look back at, point a finger, giggle a bit, and wonder
only half-jokingly if there was something besides horse bones in that glue.
Its docs betray an absence of careful analysis, and the spec of its
interface is so loose as to make programs written for kdbus 4.1-rc1 subtly
incompatible with any later version, through deeply-baked design
consequences stemming from quirks of its current implementation.

I'm not a Linux kernel developer. But if I were, this would be where I'd
put my NAK.

Sincerely,
-KS

[-1] author's opinion
[0] no bunny rabbits were harmed
[1] the case where both use non-blocking I/O requires either a buffer or
    support from the scheduler. the former is no optimization at all, and
    the latter may be _quite involved indeed_.
[2] as for whether freedesktop.org programs will be designed and built to
    such a standard, i suspend judgement.