From: David Herrmann
Date: Wed, 26 Oct 2016 22:34:30 +0200
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andy Lutomirski, Jiri Kosina, Greg KH,
    Hannes Reinecke, Steven Rostedt, Arnd Bergmann, Tom Gundersen,
    Josh Triplett, Andrew Morton

Hi

On Wed, Oct 26, 2016 at 9:39 PM, Linus Torvalds wrote:
> So the thing that tends to worry me about these is resource management.
>
> If I understood the documentation correctly, this has per-user resource
> management, which guarantees that at least the system won't run out of
> memory. Good. The act of sending a message transfers the resource to
> the receiver end. Fine.
>
> However, the usual problem ends up being that a bad user can basically
> DoS a system agent, especially since for obvious performance reasons
> the send/receive has to be asynchronous.
>
> So the usual DoS model is that some user just sends a lot of messages
> to a system agent, filling up the system agent resource quota, and
> basically killing the system. No, it didn't run out of memory, but the
> system agent may not be able to do anything more, since it is now out
> of resources.
>
> Keeping the resource management with the sender doesn't solve the
> problem, it just reverses it: now the attack will be to send a lot of
> queries to the system agent, but then just refuse to listen to the
> replies - again causing the system agent to run out of resources.
>
> Usually the way this is resolved is by forcing a "request-and-reply"
> resource management model, where the person who sends out a request is
> not only the one who is accounted for the request, but also accounted
> for the reply buffer. That way the system agent never runs out of
> resources, because it's always the requesting party that has its
> resources accounted, never the system agent.
>
> You may well have solved this, but can you describe what the solution
> is without forcing people to read the code and try to analyze it?

All accounting on bus1 is done on a UID basis. This is the initial model
and tries to match POSIX semantics; more advanced accounting (cgroup-based,
etc.) is left as a future extension. So whenever we talk about "user
accounting", we mean the user abstraction in bus1, which right now is based
on UIDs but could be extended to other schemes.

All bus1 resources are owned by a peer, and each peer has a user assigned
to it (which right now corresponds to file->f_cred->uid). Whenever a peer
allocates resources, they are accounted on its user. There are global
per-user limits which cannot be exceeded. Additionally, each peer can set
its own per-peer limits to further partition the per-user limits. The
per-user limits always take precedence over the per-peer limits.

Now this is all trivial and obvious. It works like any other resource
accounting in the kernel. It becomes tricky when we try to transfer
resources.
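Before getting to transfers, here is a minimal userspace C sketch of the
static per-peer/per-user charging just described. All names (user_quota,
peer_quota, max_handles, ...) are illustrative assumptions, not the actual
bus1 structures, and the real kernel code has to do the same thing
atomically:

  #include <stdbool.h>

  /* Illustrative sketch only -- not the bus1 implementation. */
  struct user_quota {
          unsigned int n_handles;   /* handles currently charged to this UID */
          unsigned int max_handles; /* global per-user limit */
  };

  struct peer_quota {
          struct user_quota *user;  /* accounting user (file->f_cred->uid) */
          unsigned int n_handles;   /* handles charged to this peer */
          unsigned int max_handles; /* optional per-peer limit */
  };

  /*
   * Charge @n handles to @peer. The charge fails if it would exceed either
   * the per-peer limit or the global per-user limit; the per-user limit can
   * never be exceeded, regardless of how the per-peer limits partition it.
   */
  static bool peer_charge_handles(struct peer_quota *peer, unsigned int n)
  {
          if (peer->n_handles + n > peer->max_handles)
                  return false;
          if (peer->user->n_handles + n > peer->user->max_handles)
                  return false;
          peer->n_handles += n;
          peer->user->n_handles += n;
          return true;
  }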
Before SEND, a resource is always accounted on the sender. After RECV, a
resource is accounted on the receiver. That is, resource ownership is
transferred. In most cases this is obvious: memory is copied from one
address-space into another, file-descriptors are added to the file-table of
another process, etc.

Lastly, when a resource is queued, we decided to go with receiver
accounting. This means resource ownership is transferred at the time of
SEND (unlike sender accounting, which would transfer it at the time of
RECV). The reasons are manifold, but mainly we want RECV to never fail due
to accounting or resource exhaustion. We wanted SEND to do the heavy
lifting, and RECV to just dequeue. By avoiding sender-based accounting, we
avoid attacks where a receiver never dequeues messages and thus exhausts
the sender's limits.

The remaining issue is senders DoS'ing a target user. To mitigate this, we
implemented a quota system. Whenever a sender wants to transfer resources
to a receiver, it only gets access to a subset of the receiver's resource
limits. The in-flight resources are accounted on a uid<->uid basis, and the
current algorithm allows a sending user access to at most half of the
destination's limit that is not currently used by anyone else (a small
standalone sketch of this halving rule is appended at the end of this
mail).

Example: Imagine a receiver with a limit of 1024 handles. A sender
transmits a message to that receiver. It gets access to half of the limit
not used by anyone else, hence 512 handles. It does not matter how many
senders there are, nor how many messages are sent: as long as they all
belong to the same user, they share the quota and can queue at most 512
handles. If a second sending user comes into play, it gets half of the
remainder not used by anyone else, which ends up being 256. And so on...

If the peer dequeues messages in between, the numbers go up again. But if
you do the math, the most you can get is 50% of the target's resources, and
only if you are the only sender. In all other cases (intertwined transfers,
etc.) you get less.

We did look into sender-based in-flight accounting, but the same set of
issues arises. Sure, a Request+Reply model would make this easier to
handle, but we explicitly want to support a Subscribe+Event{n} model, where
there is more than one Reply to a message.

Long story short: we have uid<->uid quotas so far, which prevent DoS
attacks, unless you get access to a ridiculous number of local UIDs.
Details on which resources are accounted can be found in the wiki [1].

Thanks
David

[1] https://github.com/bus1/documentation/wiki/Quota
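P.S.: For reference, here is the tiny self-contained sketch of the halving
rule from the example above. quota_for_sender() and its parameters are
made-up names purely for illustration; the real accounting is done in the
kernel, per resource type, and per uid<->uid pair:

  #include <stdio.h>

  /*
   * Illustrative only: a sending user may consume at most half of the
   * receiver's limit that is not currently in use by other users.
   */
  static unsigned int quota_for_sender(unsigned int receiver_limit,
                                       unsigned int used_by_others)
  {
          if (used_by_others >= receiver_limit)
                  return 0;
          return (receiver_limit - used_by_others) / 2;
  }

  int main(void)
  {
          /* Receiver with a limit of 1024 handles, as in the example. */
          unsigned int first  = quota_for_sender(1024, 0);              /* 512 */
          unsigned int second = quota_for_sender(1024, first);          /* 256 */
          unsigned int third  = quota_for_sender(1024, first + second); /* 128 */

          printf("first=%u second=%u third=%u\n", first, second, third);
          return 0;
  }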