From: Andy Lutomirski
Date: Mon, 3 Aug 2015 16:02:30 -0700
Subject: Re: kdbus: to merge or not to merge?
To: Linus Torvalds, "linux-kernel@vger.kernel.org", David Herrmann,
    Djalal Harouni, Greg KH, Havoc Pennington, "Eric W. Biederman",
    One Thousand Gnomes, Tom Gundersen, Daniel Mack
Cc: "Kalle A. Sandstrom", Borislav Petkov, cee1

On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski wrote:
> 2. Kdbus introduces a novel buffering model. Receivers allocate a big
> chunk of what's essentially tmpfs space. Assuming that space is
> available (in a virtual memory sense), senders synchronously write to
> the receivers' tmpfs space. Broadcast senders synchronously write to
> *all* receivers' tmpfs space. I think that, regardless of
> implementation, this is problematic if the sender and the receiver are
> in different memcgs. Suppose that the message is to be written to a
> page in the receiver's tmpfs space that is not currently resident. If
> the write happens in the sender's memcg context, then a receiver can
> effectively allocate an unlimited number of pages in the sender's
> memcg, which will, in practice, be the init memcg if the sender is
> systemd. This breaks the memcg model. If, on the other hand, the
> sender writes to the receiver's tmpfs space in the receiver's memcg
> context, then the sender will block (or fail? presumably
> unpredictable failures are a bad thing) if the receiver's memcg is at
> capacity.

I realize that everyone is sick of this thread. Nonetheless, I should
emphasize that I'm actually serious about this issue. I got Fedora
Rawhide working under kdbus (thanks, everyone!), and I ran this little
program:

#include <err.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[])
{
	while (1) {
		sd_bus *bus;
		if (sd_bus_open_system(&bus) < 0) {
			/* warn("sd_bus_open_system"); */
			continue;
		}
		sd_bus_close(bus);
	}
}

under both userspace dbus and under kdbus. Userspace dbus burns some
CPU -- no big deal. I expected kdbus to fail to scale and burn a
disproportionate amount of CPU (because I don't see how it /can/
scale). Instead it fell over completely. I didn't bother debugging it,
but offhand I'd guess that the system OOMed and didn't come back.

On very brief inspection, Rawhide seems to have a lot of kdbus
connections with 16MiB of mapped tmpfs stuff each. (53 of them are
mapped, and I don't know how many exist with tmpfs backing but aren't
mapped. Presumably the number only goes up as the degree of reliance
on the userspace proxy goes down. As it stands, that's over 3GB of
uncommitted backing store that my test is likely to forcibly commit
very quickly.)

Frankly, I don't understand how it's possible to cleanly implement
kdbus' broadcast or lifetime semantics* in an environment with bounded
CPU or bounded memory. (And unbounded memory just changes the problem,
since the message backlog can just get worse and worse.) I work in an
industry in which lots of parties broadcast lots of data to lots of
people. If you try to drink from the firehose and you can't swallow
fast enough, either you need to throw something out (and test your
recovery code!) or you fail. At least in finance, no one pretends that
a global order of events in different cities is practical.

* Detecting when your peer goes away is, of course, a widely
encountered and widely solved problem.
I don't know of any deployed systems that solve it by broadcasting the
lifetime of everything to everyone and relying on those broadcasts
going through, though.

--Andy