Date: Tue, 7 Jul 2015 00:18:14 +0300
From: "Kalle A. Sandstrom"
To: David Herrmann
Cc: linux-kernel
Subject: Re: kdbus: to merge or not to merge?
Message-ID: <20150706211814.GA12061@molukki>
References: <20150701000358.GA32283@molukki>

On Wed, Jul 01, 2015 at 06:51:41PM +0200, David Herrmann wrote:
> Hi

Thanks for the answers; in response I've got some further questions.
Again, apologies for length -- I apparently don't know how to discuss
IPC tersely.

> On Wed, Jul 1, 2015 at 2:03 AM, Kalle A. Sandstrom wrote:
> > For the first, compare unix domain sockets (i.e. point-to-point
> > mode, access control through filesystem [or fork() parentage],
> > read/write/select) to the kdbus message-sending ioctl.  In the main
> > data-exchanging portion, the former requires only a connection
> > identifier, a pointer to a buffer, and the length of data in that
> > buffer.  To contrast, kdbus takes a complex message-sending command
> > structure with 0..n items of m kinds that the ioctl must parse in a
> > m-way switching loop, and then another complex message-describing
> > structure which has its own 1..n items of another m kinds describing
> > its contents, destination-lookup options, negotiation of supported
> > options, and so forth.
>
> sendmsg(2) uses a very similar payload to kdbus. send(2) is a shortcut
> to simplify the most common use-case. I'd be more than glad to accept
> patches adding such shortcuts to kdbus, if accompanied by benchmark
> numbers and reasoning why this is a common path for dbus/etc. clients.

A shortcut special case for e.g. only iovec-like payload items, only to
a numerically designated peer, and only RPC forms, should be an
immediate gain, given that the reduced functionality would lower the
number of instructions executed, the number of unpredictable branches
met, and the number of possibly-cold cache lines accessed.  The
difference in raw cycles should be significant in comparison to the
number of kernel exits avoided during a client's RPC to a service and
the associated reply.  Assuming that such RPCs are the bulk of what
kdbus will do, and that context-switch avoidance is crucial to the
performance argument in its design, it seems silly not to have such a
fast path -- even if it is initially implemented as a simple wrapper
around the full send ioctl.  It would also put the basic send operation
on par with sendmsg(2) over a connected socket in terms of interface
complexity, and simplify any future "exit into peer without scheduler
latency" shenanigans.
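To make the comparison concrete, a reduced send path might look
something like the sketch below, next to the UDS baseline.  The
"simple send" struct and its ioctl are invented here for illustration;
they are not part of the current kdbus ABI, which instead builds a
struct kdbus_msg out of variable-size items and hands it to the full
send ioctl.

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* UDS baseline: a connection identifier, a buffer, and a length. */
  static ssize_t uds_send(int conn_fd, const void *buf, size_t len)
  {
          struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
          struct msghdr mh = { .msg_iov = &iov, .msg_iovlen = 1 };

          return sendmsg(conn_fd, &mh, 0);
  }

  /* Hypothetical kdbus shortcut: a numeric destination, one iovec-like
   * payload, and a cookie for request/reply matching -- nothing else
   * for the kernel to parse, validate, or negotiate.  Neither this
   * struct nor the ioctl number exist; they stand in for the fast path
   * argued for above. */
  #define KDBUS_CMD_SEND_SIMPLE 0        /* placeholder only */

  struct kdbus_cmd_send_simple {
          uint64_t size;
          uint64_t dst_id;        /* numeric peer id, no name lookup */
          uint64_t cookie;        /* matches the eventual reply */
          uint64_t vec_addr;      /* sender-side payload address */
          uint64_t vec_len;
  };

  static int kdbus_send_simple(int bus_fd, uint64_t dst, uint64_t cookie,
                               const void *buf, size_t len)
  {
          struct kdbus_cmd_send_simple cmd = {
                  .size = sizeof(cmd),
                  .dst_id = dst,
                  .cookie = cookie,
                  .vec_addr = (uintptr_t)buf,
                  .vec_len = len,
          };

          return ioctl(bus_fd, KDBUS_CMD_SEND_SIMPLE, &cmd);
  }

Even as a wrapper that internally expands into the full item chain, the
second form is all that an IDL-generated RPC stub would have to touch
on its hot path.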
However, these gains would go unobserved in code written to the current
kdbus ABI.  Bridging to such a fast path from the full interface would
eliminate most of its benefits while hurting its legitimate callers.
That being said, considering that the eventual de-facto user API to
kdbus is a library with explicit deserialization, endianness
conversion, and suchlike, I could see how the difference would go
unobserved.

> The kdbus API is kept generic and extendable, while trying to keep
> runtime overhead minimal. If this overhead turns out to be a
> significant runtime slowdown (which none of my benchmarks showed), we
> should consider adding shortcuts. Until then, I prefer an API that is
> consistent, easy to extend and flexible.

Out of curiosity, what payload item types do you see being added in the
near future, e.g. the next year?  UDS knows only of simple buffers,
scatter/gather iovecs, and inter-process dup(2); and recent Linux adds
sourcing from a file descriptor.  Perhaps a "pass this
previously-received message on" item?

> > Consequently, a carefully optimized implementation of unix domain
> > sockets (and by extension all the data-carrying SysV etc. IPC
> > primitives, optimized similarly) will always be superior to kdbus
> > for both message throughput and latency, [...]
>
> Yes, that's due to the point-to-point nature of UDS.

Does this change for broadcast, unassociated, or doubly-addressed[0]
operation?  For the first, kdbus must already cause allocation of cache
lines in proportion to msg_length * n_recvrs, which mutes the
single-copy advantage over a user-space broker as the number of
receivers grows.  For the second, name lookup from (say) a hash table
only adds to the required processing, though the resulting identifier
could be re-used immediately afterward; and the third mode would
prohibit that optimization altogether.

Relatedly, is there publicly-available data concerning the distribution
of the various dbus IPC modalities?  Such as a desktop booting under
systemd, running for a decent bit, and shutting down; or the automotive
industry's (presumably signaling-heavy) use cases, for which I've heard
a figure of 600k transactions before steady state quoted.

> > [...] For long messages (> L1 cache size per Stetson-Harrison[0])
> > the only performance benefit from kdbus is its claimed single-copy
> > mode of operation-- an equivalent to which could be had with ye olde
> > sockets by copying data from the writer directly into the reader
> > while one of them blocks[1] in the appropriate syscall. That the
> > current Linux pipes, SysV queues, unix domain sockets, etc. don't do
> > this doesn't really factor in.
>
> Parts of the network subsystem have supported single-copy (mmap'ed IO)
> for quite some time. kdbus mandates it, but otherwise is not special
> in that regard.

I'm not intending to discuss mmap() tricks, but rather that, with the
existing system calls, a pending write(2) would be made to substitute
for the in-kernel buffer that the corresponding read(2) grabs its bytes
from, or vice versa.  This would make conventional IPC single-copy
while permitting the receiver to designate an arbitrary location for
its data: e.g. an IPC daemon first reading a message header from a
sender's socket, figuring out its routing and allocation, and then
receiving the body directly into the destination shm pool (sketched
below).  That's not directly related to kdbus, except as a hypothetical
[transparent] speed-up for a strictly POSIX user-space reimplementation
using the same mmap()'d shm-pool receive semantics.
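As a minimal sketch of that daemon's receive path -- the wire header,
the peer lookup, and the pool allocator are all invented here for
illustration:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  /* Invented wire format: destination id plus body length. */
  struct wire_hdr {
          uint32_t dst;
          uint32_t body_len;
  };

  /* Hypothetical helpers: peer lookup, and allocation out of the
   * recipient's mmap()'d receive pool. */
  struct peer;
  struct peer *lookup_peer(uint32_t id);
  void *pool_alloc(struct peer *p, size_t len);

  static int route_one(int sender_fd)
  {
          struct wire_hdr hdr;
          void *slot;

          /* First read: just the header, to decide routing and size. */
          if (recv(sender_fd, &hdr, sizeof(hdr), MSG_WAITALL)
              != (ssize_t)sizeof(hdr))
                  return -1;

          slot = pool_alloc(lookup_peer(hdr.dst), hdr.body_len);
          if (slot == NULL)
                  return -1;      /* pool-full/quota handling goes here */

          /* Second read: the body lands directly in its final location
           * inside the recipient's pool.  With the write(2)/read(2)
           * rendezvous described above this would be the single copy;
           * as things stand today it is the second of two. */
          if (recv(sender_fd, slot, hdr.body_len, MSG_WAITALL)
              != (ssize_t)hdr.body_len)
                  return -1;

          return 0;
  }

The point is only that the receiver (here, the daemon) picks the final
location for the payload; kdbus gets the same effect by making the
per-connection receive pool part of its ABI.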
[snipped the bit about sender's buffer-full handling]

> > For broadcast messaging, a recipient may observe that messages were
> > dropped by looking at a `dropped_msgs' field delivered (and then
> > reset) as part of the message reception ioctl.  Its value is the
> > number of messages dropped since last read, so arguably a client
> > could achieve the equivalent of the condition's absence by
> > resynchronizing explicitly with all signal-senders on its current
> > bus wrt which it knows the protocol, when the value is >0.  This
> > method could in principle apply to 1-to-1 unidirectional messaging
> > as well[2].
>
> Correct.

As a follow-up question, does the synchronous RPC mode also return
`dropped_msgs'?  If so, does it reset the counter?  Either behaviour
would seem to complicate every RPC call site, and I didn't find it
discussed in the in-tree kdbus documentation.

[case of RPC client timeout/interrupt before service reply]

> If sending a reply fails, the kdbus_reply state is destructed and the
> caller must be woken up. We do that for sync-calls just fine, but the
> async case does indeed lack a wake-up in the error path. I noted this
> down and will fix it.

What's the reply sender's error code in this case, i.e. failure due to
the caller having bowed out?  The spec suggests either ENXIO (for a
disappeared client), ECONNRESET (for a deactivated client connection),
or EPIPE (by reflection to the service from how it's described in
kdbus.message.xml).

(Also, I'm baffled by the difference between ENXIO and ECONNRESET.  Is
there a circumstance where a kdbus application would care about the
difference between a peer's not having been there to begin with, and
its connection not being active?  From the service's side they seem to
collapse into one case, as sketched below.)
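For illustration, the reply path of a service ends up looking something
like this no matter which of the three codes comes back;
kdbus_send_reply() and the two cleanup helpers are hypothetical
stand-ins, not real API:

  #include <errno.h>

  struct reply;
  /* Hypothetical wrappers around the send ioctl and local cleanup;
   * only the shape of the error handling matters here. */
  int kdbus_send_reply(int conn_fd, struct reply *r);
  void drop_reply(struct reply *r);
  void fail_locally(struct reply *r, int err);

  static void send_reply_checked(int conn_fd, struct reply *r)
  {
          if (kdbus_send_reply(conn_fd, r) == 0)
                  return;

          switch (errno) {
          case ENXIO:          /* peer not there (any more) */
          case ECONNRESET:     /* peer's connection deactivated */
          case EPIPE:          /* caller no longer expects a reply */
                  /* Three names for "the caller bowed out"; the
                   * service can do nothing but discard its reply. */
                  drop_reply(r);
                  break;
          default:
                  /* a genuinely local failure */
                  fail_locally(r, errno);
                  break;
          }
  }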
[snipped speculation about out-of-pool reply delivery, which doesn't
happen]

> > The second problem is that given how there can be a timeout or
> > interrupt on the receive side of a "method call" transaction, it's
> > possible for the requestor to bow out of the IPC flow _while the
> > service is processing its request_. This results either in the reply
> > message being lost, or its ending up in the requestor's buffer to
> > appear in a loop where it may not be expected. Either
>
> (for completeness: we properly support resuming interrupted sync-calls)

How does the client do this?  A quick grep through the docs didn't show
any hits for "resume".  Moreover, can the client resume an interrupted
sync-call equivalently both before and after the service has picked the
request up, including the service's call to release the shm-pool
allocation (and possibly also the kdbus tracking structures)?  That is
to say: is there any case where non-idempotent RPCs, or their replies,
would end up duplicated due to interrupts or timeouts?

> > way, the client must at that point resynchronize wrt all objects
> > related to the request's side effects, or abandon the IPC flow
> > entirely and start over.  (services need only confirm their replies
> > before effecting e.g. a chardev-like "destructively read N bytes
> > from buffer" operation's outcome, which is slightly less ugly.)
>
> Correct. If you time-out, or refuse to resume, a sync-call, you have
> to treat this transaction as failed.

More generally speaking, what's the use case for having a timeout?
What's a client to do?  Given that an RPC client can, having timed out,
either resume (having e.g. approximated cooperative multitasking in
between? [surely not?]) or abort (to try it all over again?), I don't
understand why this API is there in the first place.  Unless it's (say)
a reheating of the poor man's distributed deadlock detection, but
that's a bit of a nasty way of looking at it -- RPC dependencies should
be acyclic anyhow.

For example, let's assume a client and two services, one being what the
client calls into and the other what the first uses to implement some
portion of its interface.  The first does a few things and then calls
into the second before replying, and the second, in its service code,
ends up incurring a delay from sleeping on a mutex, waiting for
disk/network I/O, or local system load.  The first service does not
specify a timeout, but the chain-initiating client does, and this
timeout ends up being reached due to the delay in the second service.
(Notably, the timeout will not have occurred due to a deadlock, and so
cannot be resolved by the intermediate chain releasing mutex'd
primitives and starting over.)  How does the client recover from the
timeout?  Are intermediary services required to exhibit composable
idempotence?  Is there a greater transaction bracket around a client
RPC, so that rollback/commit can happen regardless of intermediaries?

> > Tying this back into the first point: to prevent this type of
> > denial-of-service against sanguinely-written software it's necessary
> > for kdbus to invoke the policy engine to determine that an unrelated
> > participant isn't allowed to consume a peer's buffer space.
>
> It's not the policy engine, but quota-handling, but otherwise correct.

I have some further questions on the topic of shm pool semantics.  (I'm
trying to figure out what a robust RPC client call site would look
like, as that is part of kdbus' end-to-end performance.)

Does the client receive a status code if a reply fails due to the quota
mechanism?  This, again, I didn't find in the spec.

Is there some way for an RPC peer to know that replies below a certain
size will be delivered regardless of the state of the client's shm pool
at the time of reply?  Such as a per-connection parameter (i.e. one
that's implicitly a part of the protocol), or a per-RPC field that a
client may set, to achieve reliable operation without "no reply because
buffer full" handling even in the face of concurrent senders.

Does the client receive scattered data if its reception pool has enough
room for a reply message, but the largest contiguous piece is smaller
than the reply payload?  If so, is there some method by which a sender
could indicate legitimate breaks in the message contents, e.g. between
strings or integers, so that a middleware (an IDL compiler, basically)
could wrap that data into a function call's out-parameters without
doing an extra gather stage [copy] in userspace?  If not, must a client
process call into the message-handling part of its main loop (to
release shm-pool space by handling messages) whenever a reply fails for
this reason?

Interestedly,
  -KS

[0] I'm referring to the part where a send operation (or the message)
may specify both a numeric recipient ID and a name, which kdbus would
then require to match, rejecting the message otherwise.