Hi Linus,
Can you opine as to whether you think that kdbus should be merged? I
don't mean whether you'd accept a pull request that Greg may or may
not send during this merge window -- I mean whether you think that
kdbus should be merged if it had appropriate review and people were
okay with the implementation.
The current state of uncertainty is problematic, I think. The kdbus
team is spending a lot of time making things compatible with kdbus,
and the latest systemd release makes kdbus userspace support
mandatory. The kernel people who would review it (myself included)
probably don't want to review new versions at a line-by-line level,
because we (myself included) either don't know whether there's any
point or don't think that it should be merged *even if the
implementation were flawless*.
For my part, here's my argument why the answer should be "no, kdbus
shouldn't be merged":
1. It's not necessary. kdbus is a giant API surface. The problems it
purports to solve are (very roughly) performance, ability to collect
metadata in a manner that doesn't suck, sandbox support, better
logging/monitoring, and availability very early in userspace startup.
I think that the performance issues should be solved in userspace --
gdbus performance is atrocious for reasons that have nothing to do
with the kernel or context switches [1]. The metadata problem, to the
extent that it's a real problem, can and should be solved by improving
AF_UNIX. The logging, monitoring, and early userspace problems can
and should be solved in userspace. See #3 below for my thoughts on
the sandbox. Right now, kdbus sounds awfully like Tux, the old in-kernel web server.
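To be concrete about the metadata point, here is a minimal sketch, purely for
illustration, of what AF_UNIX already gives you today; anything richer than
pid/uid/gid is the part that would need AF_UNIX improvements rather than a new bus:
/* Minimal illustration: pid/uid/gid of the peer are already available on
 * AF_UNIX via SO_PEERCRED.  Richer metadata (cgroup, caps, etc.) is the
 * part that would need AF_UNIX extensions, as argued above. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>
int main(void)
{
    int fds[2];
    struct ucred cred;
    socklen_t len = sizeof(cred);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) ||
        getsockopt(fds[0], SOL_SOCKET, SO_PEERCRED, &cred, &len))
        return 1;
    printf("peer pid=%d uid=%d gid=%d\n",
           (int)cred.pid, (int)cred.uid, (int)cred.gid);
    return 0;
}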
2. Kdbus introduces a novel buffering model. Receivers allocate a big
chunk of what's essentially tmpfs space. Assuming that space is
available (in a virtual memory sense), senders synchronously write to
the receivers' tmpfs space. Broadcast senders synchronously write to
*all* receivers' tmpfs space. I think that, regardless of
implementation, this is problematic if the sender and the receiver are
in different memcgs. Suppose that the message is to be written to a
page in the receiver's tmpfs space that is not currently resident. If
the write happens in the sender's memcg context, then a receiver can
effectively allocate an unlimited number of pages in the sender's
memcg, which will, in practice, be the init memcg if the sender is
systemd. This breaks the memcg model. If, on the other hand, the
sender writes to the receiver's tmpfs space in the receiver's memcg
context, then the sender will block (or fail? presumably
unpredictable failures are a bad thing) if the receiver's memcg is at
capacity.
3. The sandbox model is, in my opinion, an experiment that isn't going
to succeed. It's a poor model: a "restricted endpoint" (i.e. a
sandboxed kdbus client) sees a view of the world defined by a limited
policy language implemented by the kernel. This completely fails to
express what I think should be common use cases. If a sandboxed app
is given permission to access, say,
/org/gnome/evolution/dataserver/CalendarView/3125/12, then it knows
that it's looking at CalendarView/3125/12 (whatever that means) and
there's no way to hide the name. If someone subsequently deletes that
CalendarView and creates a new one with that name, racelessly blocking
access to the new one for the app may be complicated. If a sandbox
wants to prompt the user before allowing access to some resource, it
has a problem: the policy language doesn't seem to be able to express
request interception.
The sandbox model is also already starting to accumulate kludges.
Apparently it was recently discovered that the kdbus connection
lifetime model was incompatible with sandbox policy, so as of a recent
change [2] connection lifetime messages completely bypass sandbox
policy. Maybe this isn't obviously insecure, but it seems like a bad
sign that "it's probably okay to poke this hole" is already happening
before the thing is even merged.
I'll point out that a pure userspace implementation of sandboxed dbus
connections would be straightforward to implement today, would have
none of these problems, and would allow arbitrarily complex policy and
the flexibility to redesign it in the future if the initial design
turned out to be inappropriate for the sandbox being written. (You
could even have two different implementations to go with two different
sandboxes. Let a thousand sandboxes bloom, which is easy in userspace
but not so great in the kernel.)
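To make that concrete, here is a rough sketch, illustration only and not a
design, of the skeleton such a per-sandbox proxy could have. The convention
that the sandbox launcher hands the proxy the app's end of a socketpair() on
fd 0, and the bus socket path, are assumptions made up for the example:
/* Sketch of a per-sandbox userspace D-Bus proxy.  Assumptions for
 * illustration: the sandbox launcher hands us the app's end of a
 * socketpair() on fd 0, and the real bus lives at the usual system bus
 * socket path.  filter_outbound() is where arbitrarily complex,
 * userspace-defined policy (hiding names, prompting, interception)
 * would go -- none of it needs to live in the kernel. */
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>
static int bus_connect(const char *path)
{
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
    if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        return -1;
    return fd;
}
/* Placeholder policy hook: a real proxy would parse D-Bus messages here. */
static int filter_outbound(const char *buf, ssize_t len)
{
    (void)buf; (void)len;
    return 1;           /* this sketch allows everything */
}
int main(void)
{
    int app = 0;        /* app side of a socketpair(), inherited (assumption) */
    int bus = bus_connect("/run/dbus/system_bus_socket");
    struct pollfd pfd[2] = { { app, POLLIN, 0 }, { bus, POLLIN, 0 } };
    char buf[4096];
    ssize_t n;
    if (bus < 0)
        return 1;
    for (;;) {
        if (poll(pfd, 2, -1) < 0)
            return 1;
        if (pfd[0].revents & POLLIN) {
            if ((n = read(app, buf, sizeof(buf))) <= 0)
                return 0;
            if (filter_outbound(buf, n) && write(bus, buf, n) != n)
                return 1;
        }
        if (pfd[1].revents & POLLIN) {
            if ((n = read(bus, buf, sizeof(buf))) <= 0)
                return 0;
            if (write(app, buf, n) != n)
                return 1;
        }
    }
}
The point is that the policy hook is ordinary userspace code, so each sandbox
project can make it as simple or as elaborate as it wants.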
In summary, I think that a very high quality implementation of the
kdbus concept and API would be a bit faster than a very high quality
userspace implementation of dbus. Other than that, I think it would
actually be worse. The kdbus proponents seem to be comparing the
current kdbus implementation to the current userspace implementation,
and a favorable comparison there is not a good reason to merge it.
--Andy
[1] I spent a while today trying to benchmark sd-bus. I gave up,
because I couldn't get test code to build. I don't have the patience
to try harder.
[2] https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/commit/?h=kdbus&id=d27c8057699d164648b7d8c1559fa6529998f89d
On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <[email protected]> wrote:
> 3. The sandbox model is, in my opinion, an experiment that isn't going
> to succeed. It's a poor model: a "restricted endpoint" (i.e. a
> sandboxed kdbus client) sees a view of the world defined by a limited
> policy language implemented by the kernel. This completely fails to
> express what I think should be common use cases. If a sandboxed app
> is given permission to access, say,
> /org/gnome/evolution/dataserver/CalendarView/3125/12, then it knows
> that it's looking at CalendarView/3125/12 (whatever that means) and
> there's no way to hide the name. If someone subsequently deletes that
> CalendarView and creates a new one with that name, racelessly blocking
> access to the new one for the app may be complicated. If a sandbox
> wants to prompt the user before allowing access to some resource, it
> has a problem: the policy language doesn't seem to be able to express
> request interception.
...
>
> I'll point out that a pure userspace implementation of sandboxed dbus
> connections would be straightforward to implement today, would have
> none of these problems, and would allow arbitrarily complex policy and
> the flexibility to redesign it in the future if the initial design
> turned out to be inappropriate for the sandbox being written. (You
> could even have two different implementations to go with two different
> sandboxes. Let a thousand sandboxes bloom, which is easy in userspace
> but not so great in the kernel.)
I should add that I'm not just speculating about my dream sandbox
here. Sandstorm has a very nice sandbox model that's layered on top
of an object-oriented RPC system. Putting aside the fact that
Sandstorm would be very unlikely to use kdbus because Sandstorm is
network-transparent, the Sandstorm sandbox would not be expressible in
kdbus policy language. Nevertheless, I suspect that the current
Sandstorm implementation (over TCP, no less) outperforms kdbus by a
large margin even though it's entirely in userspace and relies heavily
on per-sandbox userspace proxies. If we tuned the kernel for faster
context switches (by implementing PCID, for example), Sandstorm would
get even faster.
If the kernel were to add APIs to accelerate something like Sandstorm,
the API would probably look a bit like a cross between AF_UNIX and the
seL4 API. The kernel would not have any policy implementation
whatsoever. D-Bus could probably be accelerated with exactly the same
API. (Preserving the D-Bus magic power of strictly ordering
broadcasts relative to unicast messages might be tricky, but I'm not
sure that property makes sense in the underlying API in any event
except on systems that magically have infinite RAM in which to buffer
messages.) (On the other hand, all of the useful benefits of that
ordering could probably be preserved by simply explicitly tracking
event ordering in userspace.)
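As a rough illustration of what I mean by tracking ordering explicitly in
userspace (all of the structures and names below are made up for the example):
/* Illustrative sketch of tracking event ordering in userspace: the bus
 * daemon stamps every message it routes (unicast or broadcast) from one
 * monotonically increasing counter, and receivers compare stamps instead
 * of relying on the kernel to buffer broadcasts synchronously. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
struct msg_header {
    uint64_t seq;          /* global order stamp, assigned by the daemon */
    uint32_t payload_len;  /* payload bytes follow the header */
};
static _Atomic uint64_t global_seq;
/* Daemon side: stamp a message just before routing it anywhere. */
static void stamp_message(struct msg_header *h)
{
    h->seq = atomic_fetch_add(&global_seq, 1) + 1;
}
/* Receiver side: decide whether one message "happened before" another by
 * comparing stamps, rather than by kernel delivery order. */
static int happened_before(const struct msg_header *a, const struct msg_header *b)
{
    return a->seq < b->seq;
}
int main(void)
{
    struct msg_header a = { 0, 0 }, b = { 0, 0 };
    stamp_message(&a);
    stamp_message(&b);
    assert(happened_before(&a, &b));   /* a was stamped first */
    return 0;
}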
But no one's asking for ksandstorm because it's not necessary, and I
don't think anyone will unless someone writes a really nice, broadly
applicable accelerated kernel IPC mechanism.
--Andy
On Mon, Jun 22, 2015 at 11:06:09PM -0700, Andy Lutomirski wrote:
> Hi Linus,
>
> Can you opine as to whether you think that kdbus should be merged?
Ah, a preemptive pull request denial, how nice.
I don't think I've ever seen such a thing before; congratulations for
creating something that must previously have been lacking in our
development model of how to work together in a community in a productive
manner.
Not.
> I
> don't mean whether you'd accept a pull request that Greg may or may
> not send during this merge window -- I mean whether you think that
> kdbus should be merged if it had appropriate review and people were
> okay with the implementation.
How about you just wait for the real merge request to be submitted, and
we can go from there. Perhaps I wasn't going to do it this release?
Perhaps I was? Who knows? Who cares.
> The current state of uncertainty is problematic, I think. The kdbus
> team is spending a lot of time making things compatible with kdbus,
> and the latest systemd release makes kdbus userspace support
> mandatory.
I stopped here in this email, as this is just flat out totally wrong,
and I don't want to waste my time trying to refute other totally wrong
statements as that would just somehow give them some validation that
they could possibly be correct.
greg k-h
On Tue, Jun 23, 2015 at 8:41 AM, Greg KH <[email protected]> wrote:
>> The current state of uncertainty is problematic, I think. The kdbus
>> team is spending a lot of time making things compatible with kdbus,
>> and the latest systemd release makes kdbus userspace support
>> mandatory.
>
> I stopped here in this email, as this is just flat out totally wrong,
> and I don't want to waste my time trying to refute other totally wrong
> statements as that would just somehow give them some validation that
> they could possibly be correct.
For the guys who don't follow systemd development, this is the
announcement in question:
http://lists.freedesktop.org/archives/systemd-devel/2015-June/033170.html
* kdbus support is no longer compile-time optional. It is now
always built-in. However, it can still be disabled at
runtime using the kdbus=0 kernel command line setting, and
that setting may be changed to default to off, by specifying
--disable-kdbus at build-time. Note though that the kernel
command line setting has no effect if the kdbus.ko kernel
module is not installed, in which case kdbus is (obviously)
also disabled. We encourage all downstream distributions to
begin testing kdbus by adding it to the kernel images in the
development distributions, and leaving kdbus support in
systemd enabled.
Now kdbus is opt-out instead of opt-in.
Although I haven't tested it so far, systemd should work just fine if
kdbus is not present, as it can fall back to dbus.
--
Thanks,
//richard
On Mon, Jun 22, 2015 at 11:41:40PM -0700, Greg KH wrote:
> On Mon, Jun 22, 2015 at 11:06:09PM -0700, Andy Lutomirski wrote:
> > Hi Linus,
> >
> > Can you opine as to whether you think that kdbus should be merged?
>
> Ah, a preemptive pull request denial, how nice.
I think you're misunderstanding Andy. IMO, he's asking whether he should
invest time in reviewing it as kdbus is not trivial to go through.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
On Tuesday, June 23, 2015 at 09:22:40, Richard Weinberger wrote:
> On Tue, Jun 23, 2015 at 8:41 AM, Greg KH <[email protected]>
wrote:
> >> The current state of uncertainty is problematic, I think. The kdbus
> >> team is spending a lot of time making things compatible with kdbus,
> >> and the latest systemd release makes kdbus userspace support
> >> mandatory.
> >
> > I stopped here in this email, as this is just flat out totally wrong,
> > and I don't want to waste my time trying to refute other totally wrong
> > statements as that would just somehow give them some validation that
> > they could possibly be correct.
>
> For the guys who not follow systemd development, this is the
> announcement in question:
> http://lists.freedesktop.org/archives/systemd-devel/2015-June/033170.html
>
> * kdbus support is no longer compile-time optional. It is now
> always built-in. However, it can still be disabled at
> runtime using the kdbus=0 kernel command line setting, and
> that setting may be changed to default to off, by specifying
> --disable-kdbus at build-time. Note though that the kernel
> command line setting has no effect if the kdbus.ko kernel
> module is not installed, in which case kdbus is (obviously)
> also disabled. We encourage all downstream distributions to
> begin testing kdbus by adding it to the kernel images in the
> development distributions, and leaving kdbus support in
> systemd enabled.
>
> Now kdbus is opt-out instead of opt-in.
> Although I didn't test it so far, systemd should work just fine if
> kdbus is not present
> as it can fall back to dbus.
Andy, I think this is partly what triggered your mail. I wrote a mail
some days ago asking for a careful review of kdbus exactly because of this,
but didn't include any Ccs.
In that I wrote:
-------------------------------------------------------------------------
I hope you kernel developers will still review kdbus as carefully as you have
so far, instead of giving in to any downstream pressure by distros.
It is exactly this attitude and this approach of systemd upstream that I
feel uneasy about. Instead of humbly waiting and working towards having
kdbus accepted into the kernel, systemd developers seem to use any means to
create indirect pressure to have it included eventually.
I hope that technical excellence will still be the entry barrier for
anything that goes into the kernel.
Please note: I am not judging the technical quality of kdbus. I think
others are more knowledgeable and better placed to do that.
-------------------------------------------------------------------------
I think this move by the systemd developers can create downstream pressure
to include kdbus in the kernel, and Andy's mail is partly a reaction to
that.
I personally wouldn't ask for it not to be included in the kernel; I
just ask for a careful review instead of giving in to any downstream
pressure the systemd developers' move may trigger.
But unlike Andy I did not review kdbus for technical quality. It seems that
Andy has strong technical concerns about it. But you, Greg, write that these
are not correct without any explanation of why you think so; you simply
declare them invalid, with no explanation at all.
Greg, if you do not want any preemptive decision not to merge kdbus before
your next pull request, I kindly also ask for no preemptive decision to
actually merge it then, just because it may already be included in Fedora or
other distro kernels. Having it in distros can be good for testing things,
but it does not necessarily say anything about the technical qualification of
the patches for the upstream kernel. So an argument like "but, look, it's in
the distros" is, in my view, not enough reason to merge it into the upstream
kernel.
The next time you send your pull request, if Andy or anyone else posts
technical concerns, I think a careful review process requires that you or
someone else actually addresses them instead of just declaring them invalid
(from your point of view!).
Because this is exactly the attitude I also found with systemd upstream: "I am
right, you are wrong, go away". It is this kind of attitude, which I have seen
on both sides of this discussion, that creates most of the friction, energy
blockage and polarity around this topic. I tried to bring this up on
systemd-devel once, but in the end I unsubscribed after having been called
"a dick" there. From Lennart himself, who on the other hand whines
about perceived rudeness in the kernel community.
So again I ask: What is it that you actually want to create? And how can you
create it (instead of creating something, like this friction and energy
blockage, that you probably didn't want to create at all)? I ask this of
anyone involved.
Thank you,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Tuesday, June 23, 2015 at 11:25:49, Martin Steigerwald wrote:
> On Tuesday, June 23, 2015 at 09:22:40, Richard Weinberger wrote:
> > On Tue, Jun 23, 2015 at 8:41 AM, Greg KH <[email protected]>
>
> wrote:
> > >> The current state of uncertainty is problematic, I think. The kdbus
> > >> team is spending a lot of time making things compatible with kdbus,
> > >> and the latest systemd release makes kdbus userspace support
> > >> mandatory.
> > >
> > > I stopped here in this email, as this is just flat out totally wrong,
> > > and I don't want to waste my time trying to refute other totally wrong
> > > statements as that would just somehow give them some validation that
> > > they could possibly be correct.
> >
> > For the guys who not follow systemd development, this is the
> > announcement in question:
> > http://lists.freedesktop.org/archives/systemd-devel/2015-June/033170.htm
> > l
> >
> > * kdbus support is no longer compile-time optional. It is now
> >
> > always built-in. However, it can still be disabled at
> > runtime using the kdbus=0 kernel command line setting, and
> > that setting may be changed to default to off, by specifying
> > --disable-kdbus at build-time. Note though that the kernel
> > command line setting has no effect if the kdbus.ko kernel
> > module is not installed, in which case kdbus is (obviously)
> > also disabled. We encourage all downstream distributions to
> > begin testing kdbus by adding it to the kernel images in the
> > development distributions, and leaving kdbus support in
> > systemd enabled.
> >
> > Now kdbus is opt-out instead of opt-in.
> > Although I didn't test it so far, systemd should work just fine if
> > kdbus is not present
> > as it can fall back to dbus.
>
> Andy, I think it was partly this what triggered your mail. I wrote a mail
> about asking for a careful review of dbus exactly due to this some days
> ago, but didn´t include any Ccs.
>
> In that I wrote:
>
> -------------------------------------------------------------------------
> I hope you kernel developers will still review kdbus carefully as you did
> so far, instead of giving in to any downstream pressure by distros.
>
> It is exactly this attitude and this approach of systemd upstream that I
> feel uneasy about. Instead of humbly waiting and working towards having
> kdbus accepted to the kernel, systemd developers seem to use any means to
> create indirect pressure to have it included eventually.
>
> I hope that it will still be technical excellence as entry barrier for
> anything that goes into the kernel.
>
> Please note: I do not judge upon the technical quality of kdbus. I think
> others are more knowledgeable to do it.
> -------------------------------------------------------------------------
>
> I think the move of systemd developers is able to create downstream
> pressure to include kdbus into the kernel and Andy´s mail is partly a
> reaction to that.
>
> I personally wouldn´t ask for it not to be included into the kernel, but I
> just ask for a careful review instead of giving in to any downstream
> pressure the move of systemd developers may trigger.
>
> But unlike Andy I did not review kdbus for technical quality. It seems
> that Andy has strong technical concerns about it. But you Greg, write
> that these are not correct without any explaination on why you think this
> is so. You outrightly write that these are invalid without any
> explaination at all.
>
> Greg, if you do not want any preemptive decision not to merge kdbus before
> your next pull request, I kindly also ask for for no preemptive decision
> to actually merge it then, as it may be included in Fedora or other
> distro kernels already. Having it in distros can be good for testing
> things, but it does not necessarily say anything about technical
> qualification of the patches for the upstream kernel. So no argument like
> in "but, look, its in the distros" in my view is enough reason to merge
> it into the upstream kernel.
>
> On the next time you do your pull request, if Andy or anyone else posts
> technical concerns, for a careful review process I think it is important
> that you or someone else actually addresses them instead of just telling
> that these are invalid (in your point of view!).
>
> Cause this is exactly again an attitude I found with systemd upstream. "I
> am right, you are wrong, go away". It is this kind of attitude – I have
> seen it on both sides of this discussion – that creates most of the
> friction and energy blockage and polarity around this topic. I tried to
> bring this up in systemd-devel once, but in the end I unsubscribed after
> having been called "being a dick" there. From Lennart himself who on the
> other hand whines about perceived rudeness in kernel community.
Let me take some judgment out of what I wrote and attempt to describe
instead:
From Lennart himself, who on the other hand complains about what he perceives
as (my wording, from memory of his Google Plus post) rudeness in the kernel
community.
> So again I ask: What is it what you actually want to create? And how can
> you create it (instead of creating something, like this friction and
> energy blockage, that you probably didn´t want to create at all)? I ask
> this to anyone involved.
>
> Thank you,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Tue, Jun 23, 2015 at 12:22 AM, Richard Weinberger
<[email protected]> wrote:
> On Tue, Jun 23, 2015 at 8:41 AM, Greg KH <[email protected]> wrote:
>>> The current state of uncertainty is problematic, I think. The kdbus
>>> team is spending a lot of time making things compatible with kdbus,
>>> and the latest systemd release makes kdbus userspace support
>>> mandatory.
>>
>> I stopped here in this email, as this is just flat out totally wrong,
>> and I don't want to waste my time trying to refute other totally wrong
>> statements as that would just somehow give them some validation that
>> they could possibly be correct.
>
> For the guys who not follow systemd development, this is the
> announcement in question:
> http://lists.freedesktop.org/archives/systemd-devel/2015-June/033170.html
>
> * kdbus support is no longer compile-time optional. It is now
> always built-in. However, it can still be disabled at
> runtime using the kdbus=0 kernel command line setting, and
> that setting may be changed to default to off, by specifying
> --disable-kdbus at build-time. Note though that the kernel
> command line setting has no effect if the kdbus.ko kernel
> module is not installed, in which case kdbus is (obviously)
> also disabled. We encourage all downstream distributions to
> begin testing kdbus by adding it to the kernel images in the
> development distributions, and leaving kdbus support in
> systemd enabled.
That is, indeed, what I was referring to. Sorry if my half-a-sentence
summary wasn't spot on.
FWIW, once there are real distros with kdbus userspace enabled,
reviewing kdbus gets more complicated -- we'll be in the position
where merging kdbus in a different form from that which was proposed
will break existing users.
--Andy
On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <[email protected]> wrote:
>
> Can you opine as to whether you think that kdbus should be merged? I
> don't mean whether you'd accept a pull request that Greg may or may
> not send during this merge window -- I mean whether you think that
> kdbus should be merged if it had appropriate review and people were
> okay with the implementation.
So I am still expecting to merge it, mainly for a rather simple
reason: I trust my submaintainers, and Greg in particular. So when a
major submaintainer wants to merge something, that pulls a *lot* of
weight with me.
That said, I have to admit to being particularly disappointed with the
performance argument for merging it. Having looked at the dbus
performance, and come to the conclusion that the reason dbus performs
abysmally badly is just pure shit user space code, I am not AT ALL
impressed by the performance argument. We don't merge kernel code just
because user space was written by a retarded monkey on crack. Kernel
code has higher standards, and yes, that also means that it tends to
perform better, but no, "user space code is shit" is not a valid
reason for pushing things into the kernel.
So quite frankly, the "better performance" argument is bogus in my opinion.
That still leaves other arguments, but it does weaken the case for
kdbus quite a bit.
Because go out and read pretty much any argument for kdbus, and the
*first* argument is always performance. The articles never say "..
because the user-space dbus code is crap", though.
Linus
On Tue, Jun 23, 2015 at 4:19 PM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Can you opine as to whether you think that kdbus should be merged? I
>> don't mean whether you'd accept a pull request that Greg may or may
>> not send during this merge window -- I mean whether you think that
>> kdbus should be merged if it had appropriate review and people were
>> okay with the implementation.
>
> So I am still expecting to merge it, mainly for a rather simple
> reason: I trust my submaintainers, and Greg in particular. So when a
> major submaintainer wants to merge something, that pulls a *lot* of
> weight with me.
Then I'll try to review the parts that I can review, time permitting,
in the event that someone sends a clean, reviewable set of patches.
Preferably not during the merge window.
If my, or anyone else's, review uncovers an ABI issue, then I will be
correspondingly grumpy now that the userspace code is slated to ship
with new systemd versions. Because we can't actually ship a
kernel.org kernel that will fail to boot with Fedora Rawhide or Arch
AUR or whatever unless kdbus=0 is set on the kernel command line.
If someone ships an actual desktop sandbox based on kdbus custom
endpoints, I'll try to poke holes in it as usual. I don't intend to
review that part for security in advance because I've already said my
part: I think the design is unfit for its purpose. Given that I don't
see how one is supposed to use it in a sensible manner for sandboxing
in the first place, it's hard to evaluate whether it will do its job a
priori.
(NB: I think I may have figured out what people mean when they say
that custom endpoints are useful for sandboxes. They might be talking
about BusPolicy= in systemd .service files. That's a nifty feature,
but it seems rather limited and doesn't seem to me like it would be
useful for things like xdg-app. Also, it could certainly be
implemented in userspace.)
--Andy
P.S. I still remain unconvinced that any of the other arguments for
merging it are better than the performance argument. But whatever.
* Linus Torvalds <[email protected]> wrote:
> On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <[email protected]> wrote:
> >
> > Can you opine as to whether you think that kdbus should be merged? I don't
> > mean whether you'd accept a pull request that Greg may or may not send during
> > this merge window -- I mean whether you think that kdbus should be merged if
> > it had appropriate review and people were okay with the implementation.
>
> So I am still expecting to merge it, mainly for a rather simple reason: I trust
> my submaintainers, and Greg in particular. So when a major submaintainer wants
> to merge something, that pulls a *lot* of weight with me.
>
> That said, I have to admit to being particularly disappointed with the
> performance argument for merging it. Having looked at the dbus performance, and
> come to the conclusion that the reason dbus performs abysmally badly is just
> pure shit user space code, I am not AT ALL impressed by the performance
> argument. We don't merge kernel code just because user space was written by a
> retarded monkey on crack. Kernel code has higher standards, and yes, that also
> means that it tends to perform better, but no, "user space code is shit" is not
> a valid reason for pushing things into the kernel.
>
> So quite frankly, the "better performance" argument is bogus in my opinion.
>
> That still leaves other arguments, but it does weaken the case for kdbus quite a
> bit.
Beyond the cons, I see four arguments in favor of kdbus:
- In its current form kdbus really does not hurt the core kernel in any appreciable
way: like Android's Binder it sits in its own corner that doesn't hurt anyone.
Here's the kdbus diffstat (merged to v4.2-rc1-to-be with trivial conflicts fixed):
97 files changed, 34069 insertions(+), 3 deletions(-)
But ignoring the kdbus/ specific bits, the diffstat shows essentially zero impact:
triton:~/tip> git diff -M linus.. --stat | grep -v kdbus
Documentation/Makefile | 2 +-
Documentation/ioctl/ioctl-number.txt | 1 +
MAINTAINERS | 13 +
Makefile | 1 +
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/magic.h | 2 +
init/Kconfig | 13 +
ipc/Makefile | 2 +-
samples/Kconfig | 7 +
samples/Makefile | 3 +-
tools/testing/selftests/Makefile | 1 +
kdbus is a driver in essence, with no core kernel impact other than its
placement in ipc/kdbus/.
Beyond some vague opportunity cost kdbus is almost zero-cost for the kernel.
- I've been closely monitoring Linux kernel changes for over 20 years, and for the
last 10 years the linux/ipc/* code has been dormant: it works and was kept good
for existing use cases, but no one was maintaining and enhancing it with the
future in mind.
So there exists a technical vacuum: the kernel does not have any good, modern
IPC ABI at the moment that distros can rely on as a 'golden standard'. This is
partly technical, partly political. The technical reason is that SysV IPC is
ancient and cumbersome. The political reason is that SystemD could be using
and extending Android's existing kernel accelerated IPC subsystem (Binder)
that is already upstream - but does not.
Now that ipc/kdbus/ has been proposed people are up in arms and suggest better
approaches to almost every aspect. Where have you been for the past 10 years
and where is your working code and the user-space project that takes advantage
of an alternative approach? I believe it's fair to say that much of that
interest and activity would dry up overnight if kdbus was rejected permanently,
which is sad.
- Once one (or two) major distros go with kdbus, it becomes a de-facto ABI. If
the ABI is bad then that distro will hurt from it regardless of whether we
merge it upstream or not - so technical pressure is there to improve it. But if
the kernel refuses to merge it, Linux users will get hurt disproportionately
badly. The kernel not being the first mover with a new ABI is absolutely
sensible. But once Linux distros have taken the initial (non-trivial) plunge,
not merging a zero-cost ABI upstream becomes more like revenge and obstruction,
which is not productive. The kernel has very little value without full
user-space, after all, so within reason the kernel project has to own up to
distro ABI mistakes as well.
- Life does not stop after a merge: once kdbus is upstream, we _can_ express
pressure to only extend the kernel side ABI in sensible ways. I am fully
prepared to NAK any crap in its ABI that I care about.
So just like I was in favor of merging Android's Binder when everyone was against
it a few years ago, I'm in favor of merging kdbus as well.
Not because I like it so much, but because I think the merge process should be
stripped of politics and emotion as much as possible: if an initial submission is
good and addresses all technical review properly, and if the cost to the core
kernel is low, then barring alternative, fully equivalent and superior patch
submissions, rejecting it does more harm than good.
I'm all for replacing good code with even better code, but I'm not in favor of
replacing good code with words.
Thanks,
Ingo
On Tue, Jun 23, 2015 at 8:06 AM, Andy Lutomirski <[email protected]> wrote:
> 3. The sandbox model is, in my opinion, an experiment that isn't going
> to succeed. It's a poor model: a "restricted endpoint" (i.e. a
> sandboxed kdbus client) sees a view of the world defined by a limited
> policy language implemented by the kernel. This completely fails to
> express what I think should be common use cases. If a sandboxed app
> is given permission to access, say,
> /org/gnome/evolution/dataserver/CalendarView/3125/12, then it knows
> that it's looking at CalendarView/3125/12 (whatever that means) and
> there's no way to hide the name. If someone subsequently deletes that
> CalendarView and creates a new one with that name, racelessly blocking
> access to the new one for the app may be complicated. If a sandbox
> wants to prompt the user before allowing access to some resource, it
> has a problem: the policy language doesn't seem to be able to express
> request interception.
>
> The sandbox model is also already starting to accumulate kludges.
> Apparently it was recently discovered that the kdbus connection
> lifetime model was incompatible with sandbox policy, so as of a recent
> change [2] connection lifetime messages completely bypass sandbox
> policy. Maybe this isn't obviously insecure, but it seems like a bad
> sign that "it's probably okay to poke this hole" is already happening
> before the thing is even merged.
>
> I'll point out that a pure userspace implementation of sandboxed dbus
> connections would be straightforward to implement today, would have
> none of these problems, and would allow arbitrarily complex policy and
> the flexibility to redesign it in the future if the initial design
> turned out to be inappropriate for the sandbox being written. (You
> could even have two different implementations to go with two different
> sandboxes. Let a thousand sandboxes bloom, which is easy in userspace
> but not so great in the kernel.)
I don't really understand this objection. I'm working on an
application sandboxing model for desktop applications (xdg-app), and
the kdbus model matches my needs well. In fact, I'm currently using a
userspace filtering proxy that implements exactly the kdbus policy
model. Of course, this adds *yet* another context switch per message.
The only problem I found is that kdbus filtering broke the ability to
track the lifetime of clients[1]. However, this has now been fixed
with exactly the change you complain about above.
I definitely don't want to do low level request interception with UI.
We learned long ago that it is a very poor fit for desktop use. At the
interception point you have no context at all about the larger scope,
such as what window caused the operation and how you would make it
modal or even just get the window parenting right. Also, if you do
this you will keep popping up windows all the time as apps do calls in
the background to be able to e.g. gray out unavailable menu items,
update folder counts, etc. Any operation that may cause user
interaction must be carefully designed to handle this.
The way I expect to use kdbus policy, for an app called, say,
"org.gnome.gedit", is to have the following policy:
TALK org.freedesktop.DBus
OWN org.gnome.gedit
OWN org.gnome.gedit.*
TALK org.freedesktop.portal.*
This allows the app to connect to and talk to the bus, own its own
name, and broadcast signals. It also lets anyone else (who is not
sandboxed) talk to the app, and it will be able to reply. This is
enough to have regular dbus activation of desktop files[2], as well
as allowing app-related custom services.
It also allows the app to talk to a set of "portals" which are
sandbox-specific APIs that supply the necessary services for sandboxed
apps to interact with each other and the host. For instance, it would
have APIs for file choosing, where all user interaction will happen on
the host side and the app just gets back the file data. Another
example is sharing with intents-like semantics, where you'd say "I
want to share text <foo>" and we open a dialog on the host side
allowing you to choose how to share the text (tweet it, open in other
app, etc) without the app knowing anything about it other than
supplying the data.
Operations like these are safe because they are interactive. An app
can't use them to silently read the users files, and the user can
always interactively abort the operation if it was unexpected.
Now, there will likely be a few cases where we need a more
fine-grained access limit. For instance you may have a service that
dynamically grants access to particular objects in a portal service to
an app. These things can be implemented fine in userspace in the
actual service itself. The way I do that currently is by looking at
the peer cgroup name, which encodes the xdg-app id. I don't see how
making up policy dynamically and uploading it to the bus is better
than just doing the filtering in the portal.
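For concreteness, here is a rough sketch of that cgroup-based lookup; the
"xdg-app-" marker is only an assumption about how the id happens to be
encoded, and the whole thing is illustration rather than the actual xdg-app
code:
/* Rough sketch of the cgroup-based peer lookup described above: find the
 * peer's pid via SO_PEERCRED, then scan /proc/<pid>/cgroup for the app id.
 * The "xdg-app-" marker is an assumption about how the id is encoded,
 * purely for illustration; a real implementation would also trim the
 * trailing scope suffix and newline. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
static int peer_app_id(int fd, char *out, size_t outlen)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);
    char path[64], line[512];
    FILE *f;
    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len))
        return -1;
    snprintf(path, sizeof(path), "/proc/%d/cgroup", (int)cred.pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        char *p = strstr(line, "xdg-app-");   /* assumed marker */
        if (p) {
            snprintf(out, outlen, "%s", p + strlen("xdg-app-"));
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
int main(void)
{
    int fds[2];
    char id[256];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds))
        return 1;
    /* Against ourselves this will normally find nothing; in the portal the
     * fd would be the connection from the sandboxed app. */
    if (peer_app_id(fds[0], id, sizeof(id)) == 0)
        printf("peer app id: %s", id);
    else
        printf("no xdg-app id found for peer\n");
    return 0;
}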
[1] http://lists.freedesktop.org/archives/dbus/2015-May/016670.html
[2] http://standards.freedesktop.org/desktop-entry-spec/latest/ar01s07.html
Ingo Molnar <[email protected]> writes:
> Not because I like it so much, but because I think the merge process should be
> stripped of politics and emotion as much as possible: if an initial submission is
> good and addresses all technical review properly, and if the cost to the core
> kernel is low, then barring alternative, fully equivalent and superior patch
> submissions, rejecting it does more harm than good.
This is largely not what happened with kdbus.
The initial submission was problematic. Many pieces of technical review
were not addressed at the time a pull request was sent to Linus. Even
now there are remaining outstanding technical items such as performance
that have not been addressed.
The cost to the rest of the core is potentially quite high as parts of
kdbus double down on the worst mistakes in the kernel's user interface.
Politics and emotion are involved because the discussions around kdbus
have not been honest:
- Lennart Poettering, who has been hugely involved in the creation and
the design of kdbus, has not shown his face on lkml during the review,
and he seems to be the only one who can actually answer many of the
technical questions about kdbus.
- Many times it was said some feature of kdbus is not important because
using it was not required, and yet in practice using that feature is
required in the common case.
- Performance has been said to be a large benefit of kdbus and yet in
the common case there will be a number of shared cache lines modified
for every message sent, for reference counts.
At a quick glance it appears that communication with every system
daemon will be serialized because they all have init as their parent
process, so every reply will modify the reference count of init's
struct pid.
At this point I honestly do not know how to have a technical dialogue
about the code in kdbus.
Pointing out that bumping several reference counts per message is a bad
idea has gotten nowhere so far.
Crazy things like using the process's command line (copied from
userspace when a message is sent) for message authentication are still
present in the code.
I don't think any of these things are particularly subtle, hard to
understand, or hard to fix yet months after they have been pointed out
the code persists.
For subtle issues, who knows. Every review I have seen seems to get to
a couple of simple things, point them out, and then stop. I am
actually very surprised at how many of these little issues
remain in the code. Enough changes have been added to the kdbus tree
to fix small issues since the last merge window that I would have thought
I would have had to look a little harder for problems.
So whatever else the case may be I think the current kdbus code base is
a long way from being ready to be merged.
Eric
On Wednesday, June 24, 2015 at 10:05:02, Ingo Molnar wrote:
> - Once one (or two) major distros go with kdbus, it becomes a de-facto
> ABI. If the ABI is bad then that distro will hurt from it regardless of
> whether we merge it upstream or not - so technical pressure is there to
> improve it. But if the kernel refuses to merge it, Linux users will get
> hurt disproportionately badly. The kernel not being the first mover with
> a new ABI is absolutely sensible. But once Linux distros have taken the
> initial (non-trivial) plunge, not merging a zero-cost ABI upstream
> becomes more like revenge and obstruction, which is not productive. The
> kernel has very little value without full user-space, after all, so
> within reason the kernel project has to own up to distro ABI mistakes as
> well.
So, in order to merge something that is not yet accepted upstream, is it an
accepted approach to encourage distros to use it nonetheless, and then get it
upstream anyway, as in "ah, look, now this and that distro uses it"?
When I read
> Not because I like it so much, but because I think the merge process
> should be stripped of politics and emotion as much as possible: if an
> initial submission is good and addresses all technical review properly,
> and if the cost to the core kernel is low, then barring alternative,
> fully equivalent and superior patch submissions, rejecting it does more
> harm than good.
I think you didn't mean it that way, as you state proper technical review as
a requirement.
Can you clarify?
Still, as far as I understand it, Andy raised technical concerns which Greg
outright rejected as invalid without any further explanation. To me, that
does not look like technical concerns being properly addressed.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Wednesday, June 24, 2015 at 10:05:02, Ingo Molnar wrote:
> Not because I like it so much, but because I think the merge process
> should be stripped of politics and emotion as much as possible: if an
> initial submission is good and addresses all technical review properly,
> and if the cost to the core kernel is low, then barring alternative,
> fully equivalent and superior patch submissions, rejecting it does more
> harm than good.
Now that is an interesting challenge.
As I realize more and more, we are all feeling beings.
Linus himself, according to his own words as I received them, wants to make
perfectly sure that the developer who receives a message from him knows
exactly how he feels, especially when he disagrees with a pull request and
does not want to take it.
To my perception the whole kernel development process is quite full of
emotion, including the message of yours I am replying to.
And now you want to get rid of it.
I bet you can: if you remove Linus… and every other kernel developer from the
development process, including yourself.
But then, who will develop the kernel?
I think a different way of handling emotions can help, and I intend to handle
them that way and see what results I create. I am aiming to feel my
feelings as they are, instead of immediately judging them or attaching a
thought to them, which basically turns them into emotions, distorts them, and
blocks my energy in them [1].
So I will attempt to feel my feelings before I answer again. I didn't do so
in my last answer to you, and I think it shows.
[1] Arnold M. Patent, "You can have it all"
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Martin Steigerwald <[email protected]> wrote:
> On Wednesday, June 24, 2015 at 10:05:02, Ingo Molnar wrote:
>
> > - Once one (or two) major distros go with kdbus, it becomes a de-facto ABI.
> > If the ABI is bad then that distro will hurt from it regardless of whether we
> > merge it upstream or not - so technical pressure is there to improve it. But
> > if the kernel refuses to merge it, Linux users will get hurt
> > disproportionately badly. The kernel not being the first mover with a new ABI
> > is absolutely sensible. But once Linux distros have taken the initial
> > (non-trivial) plunge, not merging a zero-cost ABI upstream becomes more like
> > revenge and obstruction, which is not productive. The kernel has very little
> > value without full user-space, after all, so within reason the kernel project
> > has to own up to distro ABI mistakes as well.
>
> So, in order to merge something that is not accepted upstream yet, is it an
> accepted way to encourage distros to use it nonetheless, to get it upstream then
> anyway as in "as, look, now this and this distro uses it"?
>
> When I read
>
> > Not because I like it so much, but because I think the merge process should be
> > stripped of politics and emotion as much as possible: if an initial submission
> > is good and addresses all technical review properly, and if the cost to the
> > core kernel is low, then barring alternative, fully equivalent and superior
> > patch submissions, rejecting it does more harm than good.
>
> I think you didn?t mean it that way, as you state proper technical review as a
> requirement.
>
> Can you clarify?
There's no conflict: when merging something upstream, technical feedback has to be
addressed. AFAICS that is what happened when we merged controversial bits in the
past where Linux distros jumped the gun: such as AppArmor or Binder.
The main question that gets eliminated by a major distro using something is the
(important) question of: 'does the Linux kernel need an ABI like that?'.
Distros still run a considerable risk when forking new ABIs, obviously - as
'pre-release' ABIs rarely survive upstreaming, and there's no guarantee that
they will be accepted upstream.
> Still as far as I got it, Andy raised technical concerns which Greg outrightly
> rejected as invalid without any further explaination. That does not seem like
> technical concerns have been properly addressed to me.
I haven't seen such responses but maybe I haven't managed to dig deep enough into
the rather sizable discussion. Not addressing valid technical feedback would be a
first for Greg in my book, so he definitely deserves the benefit of the doubt from me.
And the thing is, in hindsight, after such huge flamewars, years down the line,
almost never do I see the following question asked: 'what were we thinking merging
that crap??'. If any question arises it's usually along the lines of: 'what was
the big fuss about?'. So I think by and large the process works.
Thanks,
Ingo
* Martin Steigerwald <[email protected]> wrote:
> On Wednesday, June 24, 2015 at 10:05:02, Ingo Molnar wrote:
> > Not because I like it so much, but because I think the merge process
> > should be stripped of politics and emotion as much as possible: if an
> > initial submission is good and addresses all technical review properly,
> > and if the cost to the core kernel is low, then barring alternative,
> > fully equivalent and superior patch submissions, rejecting it does more
> > harm than good.
>
> Now that is an interesting challenge.
>
> As I realize more and more we are all feeling beings.
>
> Linus himself according to his own words as I received them wants to make
> perfectly sure that the developer who receives a message from him exactly
> knows how he feels, especially when he disagrees with a pull request and
> does not want to take it.
So that twists what I said: how 'I feel about a pull request' is a technical term
for: 'what is my subjective but rational technological opinion' about it.
That's not an invitation to be irrationally emotional. (I'm reasonably sure that's
what Linus meant there too, but I don't speak for him.)
Thanks,
Ingo
On Wed, Jun 24, 2015 at 2:55 AM, Alexander Larsson
<[email protected]> wrote:
>
> I don't really understand this objection. I'm working on an
> application sandboxing model for desktop applications (xdg-app), and
> the kdbus model matches my needs well. In fact, I'm currently using a
> userspace filtering proxy that implements exactly the kdbus policy
> model. Of course, this adds *yet* another context switch per message.
> The only problem I found is that kdbus filtering broke the ability to
> track the lifetime of clients[1]. However, this has now been fixed
> with exactly the change you complain about above.
I find myself wondering whether the change I complain about will be a
problem down the road. It's certainly an information leak of some
sort. Whether the information that it leaks is valuable to anyone is
an interesting question.
>
> I definitely don't want to do low level request interception with UI.
> We learned long ago that it is a very poor fit for desktop use. At the
> interception point you have no context at all about the larger scope,
> such as what window caused the operation and how you would make it
> modal or even just get the window parenting right. Also, if you do
> this you will keep popping up windows all the time as apps do calls in
> the background to be able to e.g. gray out unavailable menu items,
> update folder counts, etc. Any operation that may cause user
> interaction must be carefully designed to handle this.
>
> The way I expect to use kdbus policy, for an app called say
> "org.gnome.gedit" is to have the following policy:
> TALK org.freedesktop.DBus
> OWN org.gnome.gedit
> OWN org.gnome.gedit.*
> TALK org.freedesktop.portal.*
Aha! You're not doing what I assumed you were doing at all.
>
> This allows the app to conntect to and talk to the bus, own its own
> name and broadcast signals. It also lets anyone else (that are not
> sandboxed) talk to the app and it will be able to reply. This is
> enough to have regular dbus activation of desktop files[2], as well
> as allowing app-related custom services.
Do I understand correctly that you're committing to an iOS-like model
in which activations go to a particular named app as opposed to a more
Android-like model in which multiple providers can offer the same
service?
>
> It also allows the app to talk to a set of "portals" which are
> sandbox-specific APIs that supply the necessary services for sandboxed
> apps to interact with each other and the host.
[snip description of what the portal does]
This seems generally sensible. Here are my concerns. Feel free to
tell me I'm nuts or ask me more.
1. Other than allowing non-sandboxed code to contact sandboxed apps
directly (as opposed to via the portal), I still don't see how this is
better than having a completely separate kdbusfs instance (or
userspace socket or whatever) per app. The only things on the outside
the app talks to are org.freedesktop.portal.*, and whatever service
provides them could be taught to provide them to more than one running
sandboxed app.
By doing it with a policy rule like this, you're at risk of random
non-sandboxed programs having a bright idea to offer some completely
insecure service with a name like "org.freedesktop.portal.badidea"
that destroys security. See, for example, the tons of reports of
exploitable Android system services that shouldn't have been there in
the first place.
By using this type of policy rule, you're also preventing meaningful
use of two different portal implementations -- their names will
collide. That's fine when there's exactly one implementation that
you're developing, but it might be nice to be able to run some apps
under a super-locked-down portal, some under a standard portal, and
some under some other fork of the portal, all at once.
2. Without seeing more details, I don't see how you will defend
against name collisions. By allowing a sandboxed application to claim
a well-known name with global significance (e.g.
org.freedesktop.gedit), you're vulnerable to apps that maliciously
claim some other app's name (e.g. by sticking it in their manifest or
whatever). Search for the iOS "XARA" attacks, which mostly work like
this and almost completely break iOS security (currently unfixed
AFAIK).
3. Due to the IMO absurd way that kdbus policy works, you think you're
limiting sandboxed apps to talking to names that match entries in your
policy table. Instead, you're limiting sandboxed apps to talking to
peer ids that advertise names that match entries in your policy table.
As I understand it, you are completely and utterly hosed if your
portal implements org.freedesktop.portal.secure_printing and
org.freedesktop.admin.something_else. This issue is a large part of
the reason that I consider kdbus' policy framework to be an
unacceptable design.
> Now, there will likely be a few cases where we need a more
> fine-grained access limit. For instance you may have a service that
> dynamically grants access to particular objects in a portal service to
> an app. These things can be implemented fine in userspace in the
> actual service itself. The way I do that currently is by looking at
> the peer cgroup name, which encodes the xdg-app id. I don't see how
> making up policy dynamically and uploading it to the bus is better
> than just doing the filtering in the portal.
I don't think you should make up policy dynamically. I thought you
were and I was confused.
If you use per-sandbox busses or sockets, you completely avoid needing
to muck with cgroups names. You know a priori which request comes
from which sandboxed app because it could only have come from one
place.
--Andy
On Wed, Jun 24, 2015 at 5:38 PM, Andy Lutomirski <[email protected]> wrote:
> Was this intentionally off-list?
Nah, that was a mistake, adding back the list.
> On Wed, Jun 24, 2015 at 8:10 AM, Alexander Larsson
>> The way I did it in the userspace proxy is to allow peer exited
>> messages from services that talked to you at some point, as this is
>> the core requirement (you must be able to limit things to the lifetime
>> of clients). However, I can see how tracking that in the kernel is a
>> bit painful, so just allowing all is probably a reasonable choice.
>>
>
> Hmm. I guess this is an ugliness of dbus in general. Since dbus
> doesn't really have a concept of objects (AIUI) you can't really get a
> notification that a particular object that you have a reference to is
> gone, so you have to ask for notification that the peer providing the
> object is gone, but there was never any concept of having a reference
> to a peer, so here we are :(
You keep using words like ugly and stupid, which isn't super
impressive. My name is on the dbus specification, and I am (and was
then) well aware of systems with object references. In fact, both
previous ipc systems (Corba ORBs) that Gnome used before we designed
dbus used object references, and they had a lot of problems which dbus
solved for us. I'm not saying dbus is perfect, but it has its reasons
for the way it works.
So, dbus-the-system has some kind of notion of an object reference
(peer + object path), but the *bus* is fundamentally a way to
communicate between peers, and the object path is merely some
uninterpreted metadata. Once the message reaches the destination
process, it is essentially free to interpret the object path however
it wants. If something needs a long-lasting "reference" to an object
you can implement that by e.g. using a Subscribe method, and you can
guarantee cleanup because the bus will tell you if the peer died.
This also means that the bus itself is vastly simplified. It only has
to track peers, not every object in every peer. And clients are more
flexible with how objects are handled. They can be instantiated
lazily, or even created algorithmically from the object path if
needed.
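To make the lifetime part concrete, here is a small sketch using GDBus as one
possible client library; the watched name is just an example:
/* Small sketch of the "the bus tells you when the peer died" pattern,
 * using GDBus as one possible client library.  The watched name is only
 * an example; any well-known or unique name works the same way. */
#include <gio/gio.h>
static void appeared(GDBusConnection *conn, const gchar *name,
                     const gchar *owner, gpointer user_data)
{
    g_print("%s appeared, owned by %s\n", name, owner);
}
static void vanished(GDBusConnection *conn, const gchar *name,
                     gpointer user_data)
{
    /* The peer is gone: drop whatever "object references" or
     * subscriptions were tied to its lifetime. */
    g_print("%s vanished, cleaning up\n", name);
}
int main(void)
{
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);
    g_bus_watch_name(G_BUS_TYPE_SESSION, "org.example.SomePeer",
                     G_BUS_NAME_WATCHER_FLAGS_NONE,
                     appeared, vanished, NULL, NULL);
    g_main_loop_run(loop);
    return 0;
}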
You wish that the kernel controlled access to a particular object in a
peer, but once the message is dispatched into the target process all
bets are off anyway. Some code in that process will be parsing your message
with no real separation from the other objects. Any bug
there could give you wider access. I don't see how this fundamentally
makes the whole system much more secure. On the other hand, I *do*
remember having to track down cross-process leaks from circular
references between processes using Corba...
>> The desktop file lists the icon, name and whatnot which is displayed
>> by the desktop environment. If DBusActivatable is true, then the app
>> is started by sending dbus messages to the same name as the desktop
>> file, to the org.freedesktop.Application interface, this way we can
>> ensure a singleton app and you can do more things than just spawning
>> it.
>
> How do I install apps as an unprivileged user? What about running
> sandboxed apps that aren't installed at all? What about downloading
> one app and running three instances of it that are all isolated from
> each other?
Users install desktop files in a directory under their home directory
(typically ~/.local/share/applications/).
xdg-app apps require some form of installation before running.
You can run three instances of an app, but only one of them can own
the bus name. This works out fine if your app does not use dbus, but
it may be a problem if it uses dbus activation.
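Roughly, with GDBus (org.example.App is a made-up name), the ownership
race looks like this; whichever instance acquires the name first wins,
and later instances are just told they lost it:

  #include <gio/gio.h>

  static void on_name_acquired(GDBusConnection *conn, const gchar *name,
                               gpointer user_data)
  {
      g_print("this instance owns %s\n", name);      /* first instance */
  }

  static void on_name_lost(GDBusConnection *conn, const gchar *name,
                           gpointer user_data)
  {
      g_print("%s already owned elsewhere\n", name); /* second, third, ... */
  }

  int main(void)
  {
      GMainLoop *loop = g_main_loop_new(NULL, FALSE);

      g_bus_own_name(G_BUS_TYPE_SESSION, "org.example.App",
                     G_BUS_NAME_OWNER_FLAGS_NONE,
                     NULL /* bus acquired */,
                     on_name_acquired, on_name_lost, NULL, NULL);

      g_main_loop_run(loop);
      return 0;
  }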
>> Well, your "other than" part kinda breaks things like launching the
>> application. So, we need to be on the real bus.
>> Could you then *also* have a bus per app for talking to the portal? I
>> guess so, but I don't quite see the point. Just having the portals
>> trying to find all new buses that come and go will be all kinds of
>> painful.
>
> How many portals will there be? It seems like, if you want multiple
> portals programs (in the org.freedesktop.portal.* sense), then you'd
> have some awkwardness if each app were on its own bus and you didn't
> want a proxy, but I think you'll also have prevented yourself from
> meaningfully sandboxing the portals themselves.
You can sandbox the portals to some extent, but fundamentally they are
meant to run in some kind of "higher privileges" mode, so they have to
have access to things normal apps would not. For instance, they have
to be able to activate other dbus names.
> Android, on the other hand, sandboxes most of its service providers,
> and Binder provides a nice way to selectively grant capabilities
> between sandboxes. (The privacy and security disaster that's built on
> top of Binder is another story, but that's not Binder's fault.)
Well, the service providers are not the same as the portals. Say you
have a twitter client that you want to register to be able to share
some selected text. The twitter client can be fully sandboxed. The
portal is just the link between the requesting app and the list of
registered share providers.
> For launching the application, couldn't you just say "hey, portal,
> launch this application please?" Using custom endpoints with shared
> namespaces just for this purpose sounds unnecessary.
Well, that is essentially what a portal like the share one does.
Although it shows a user-controlled UI in between, to avoid the app
being able to start any other app it wants.
> Semi-shameless plug [1]: Sandstorm is inventing something quite
> similar to what you're doing, but they're doing it with a single proxy
> per sandbox instance and a capability model. There are no special
> portal services -- instead, there's a general and *very* clean
> mechanism for allowing other sandboxed applications to supply
> interfaces that can be requested from any sandbox (with appropriate
> user confirmation). It could be worth comparing notes. I'll try to
> see if they have a nice summary of how the pieces fit together.
That sounds similar to the portal idea, would be cool to read about.
>> I don't think that is really a problem. People can do stupid things on
>> the non-sandboxed side in all kinds of ways. You just can't protect
>> against that.
>
> But you can prevent people on the non-sandboxed side from being
> invoked *at all* from the sandbox without explicit user opt-in, which
> makes the scope for screwing things up much smaller. If you combine
> that with sandboxing the portal providers too, you end up in decent
> shape.
The non-sandboxed side can scribble all over your application. Trying
to protect against it doing something wrong is imho fruitless.
>> Basically, the app id is linked to the dbus names that the app may
>> own. So, the only way it could take over org.gnome.gedit is if you
>> actually install the app called "org.gnome.gedit". Of course if you
>> trust some random untrusted repo and install something called that you
>> get what you want...
>
> Um.
>
> Have you ever noticed that friendly android app names like "Yelp" have
> nothing whatsoever to do with the app ids? Heck, it's hard to find an
> app id shown in the UI at all when installing an Android app.
> Nevertheless, you're supposed to be safe if you install random apps,
> as long as you don't grant them dangerous permissions.
>
> So, yes, users really ought to be protected against apps that
> partially break the sandbox because their names are malicious.
Yeah. I'm aware of this. Anything that allows a free-text name and icon
to describe the app will be able to impersonate another app (to the user,
if not to the system). The only solution to this is manual verification
and trust. The limitations above make it possible for a reviewer to
check the app id and the exported files (such as the desktop file and
the dbus service file) and catch such impersonation.
So, if you trust an upstream to do such verification, you register
their public gpg key for the repo.
That obviously doesn't scale super well, as you don't want every user
to have to trust every upstream. So, what I want to do is have
repositories that only have references to other repositories + local
signatures. That way e.g. gnome could run an app store, which
essentially collects a set of upstream repositories, does some manual
verification of each new release and then slaps a signature on it. The
user could then choose to trust this middle-man instead of each
individual upstream. This is basically how android solves this: they
have people looking at submitted apps. With this approach, though, you
are not tied to only the google appstore. The whole system is distributed.
>> I understand that you think this is weird, but it is how dbus works,
>> and it is important that it keeps working like this, as this is how
>> you keep talking to the same service if a new one is taking over an
>> old name.
>
> In an object-capability system, the client asks some broker for a
> capability representing some service or name. The broker gives the
> client that capability (if policy says it's okay). If a new service
> takes over the name, the client keeps talking to the old one. If a
> service offers two names, then they'll be represented by two different
> capabilities, and everything works fine.
In dbus this is optional. You can either use the peer id + object path
to keep referencing the old object, or you can use the name + the
object path to always (race-free) go to the latest version.
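As a hedged GDBus sketch (the destination, path and interface below are
invented), the only thing that changes between the two styles is the
destination string you pass:

  #include <gio/gio.h>

  /* dest is either a unique peer id such as ":1.42" (keeps talking to the
   * old owner) or a well-known name such as "org.example.Service" (always
   * routed, race-free, to whoever currently owns that name). */
  static GVariant *call_frobnicate(GDBusConnection *conn, const char *dest,
                                   GError **error)
  {
      return g_dbus_connection_call_sync(conn, dest,
                                         "/org/example/Object",
                                         "org.example.Interface",
                                         "Frobnicate",
                                         NULL,  /* no arguments */
                                         NULL,  /* any reply type */
                                         G_DBUS_CALL_FLAGS_NONE,
                                         -1,    /* default timeout */
                                         NULL, error);
  }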
> Dbus got this wrong, full stop.
Your opinion has been noted.
>>> If you use per-sandbox busses or sockets, you completely avoid needing
>>> to muck with cgroups names. You know a priori which request comes
>>> from which sandboxed app because it could only have come from one
>>> place.
>>
>> That would only be true for the app talking to the portals, not for
>> other peers talking to your app on the regular bus (such as during
>> activation).
>
> Other than activation, when would a non-portal app talk to a sandboxed app?
Well, it would mostly happen indirectly via portals, yeah. Although I
can think of other extension points to the shell an app could install.
For instance, the gnome-shell search provider system works by the app
installing some file that tells the shell its bus name.
> Honestly, given that all of the portal interfaces are new, I find
> myself wondering whether dbus is a good thing to build them on or
> whether it might actually make sense to use something that can pass
> object references (i.e. capabilities) around for real.
You're free to design such a system and write a desktop to use it.
However, in Gnome (and in the other desktops as well), dbus is already
used for all ipc like this and all the freedesktop specs,
infrastructure, type systems, interfaces, code and frameworks are
built around that. There has to be a *massive* advantage for us to use
something else, and I'm not at all convinced by the issues you bring
up.
On Wed, 24 Jun 2015, Ingo Molnar wrote:
> And the thing is, in hindsight, after such huge flamewars, years down the line,
> almost never do I see the following question asked: 'what were we thinking merging
> that crap??'. If any question arises it's usually along the lines of: 'what was
> the big fuss about?'. So I think by and large the process works.
counterexamples, devfs, tux
David Lang
David Lang <[email protected]> writes:
> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>
>> And the thing is, in hindsight, after such huge flamewars, years down the line,
>> almost never do I see the following question asked: 'what were we thinking merging
>> that crap??'. If any question arises it's usually along the lines of: 'what was
>> the big fuss about?'. So I think by and large the process works.
>
> counterexamples, devfs, tux
The biggest I can think of is cgroups.
The way cgroups connect to processes instead of resources (semantically),
and the fact that controllers are different from fundamental entities
like schedulers.
Of course I don't think "What were we thinking?" -- I remember it all
too well in that case.
I think "What do we do now that we have made this mess?"
Eric
Am Mittwoch, 24. Juni 2015, 10:39:52 schrieb David Lang:
> On Wed, 24 Jun 2015, Ingo Molnar wrote:
> > And the thing is, in hindsight, after such huge flamewars, years down
> > the line, almost never do I see the following question asked: 'what
> > were we thinking merging that crap??'. If any question arises it's
> > usually along the lines of: 'what was the big fuss about?'. So I think
> > by and large the process works.
> counterexamples, devfs, tux
What was tux?
The filesystem tux3 has not been merged, as far as I am aware.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Wed, 24 Jun 2015, Martin Steigerwald wrote:
> Am Mittwoch, 24. Juni 2015, 10:39:52 schrieb David Lang:
>> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>>> And the thing is, in hindsight, after such huge flamewars, years down
>>> the line, almost never do I see the following question asked: 'what
>>> were we thinking merging that crap??'. If any question arises it's
>>> usually along the lines of: 'what was the big fuss about?'. So I think
>>> by and large the process works.
>> counterexamples, devfs, tux
>
> What was tux?
in-kernel webserver
David Lang
On Wed, Jun 24, 2015 at 10:11 AM, Alexander Larsson
<[email protected]> wrote:
> On Wed, Jun 24, 2015 at 5:38 PM, Andy Lutomirski <[email protected]> wrote:
>> Was this intentionally off-list?
>
> Nah, that was a mistake, adding back the list.
>
>> On Wed, Jun 24, 2015 at 8:10 AM, Alexander Larsson
>
>>> The way i did it in the userspace proxy is to allow peer exited
>>> messages from services that talked to you at some point, as this is
>>> the core requirement (you must be able to limit things to the lifetime
>>> of clients). However, i can see how tracking that in the kernel is a
>>> bit painful, so just allowing all is probably a reasonable choice.
>>>
>>
>> Hmm. I guess this is an ugliness of dbus in general. Since dbus
>> doesn't really have a concept of objects (AIUI) you can't really get a
>> notification that a particular object that you have a reference to is
>> gone, so you have to ask for notification that the peer providing the
>> object is gone, but there was never any concept of having a reference
>> to a peer, so here we are :(
>
> You keep using words like ugly and stupid, which isn't super
> impressive.
Fair enough. On the other hand, I've called my own code ugly plenty of times.
> My name is on the dbus specification, and I am (and was
> then) well aware of systems with object references. In fact, both
> previous ipc systems (Corba ORBs) that Gnome used before we designed
> dbus used object references, and they had a lot of problems which dbus
> solved for us. I'm not saying dbus is perfect, but it has its reasons
> for the way it works.
>
> So, dbus-the-system has some kind of notion of an object reference
> (peer + object path), but the *bus* is fundamentally a way to
> communicate between peers, and the object path is merely some
> uninterpreted metadata.
I'm talking about the reference part, not the object part. Peer +
object path is a name, not a reference.
> Once the message reaches the destination
> process it is essentially free to interpret the object path however
> they want. If something needs a long lasting "reference" to an object
> you can implement that by e.g. using a Subscribe method, and you can
> guarantee cleanup because the bus will tell you if the peer died.
Except you can't pass them around. So it's still reference-by-name
instead of reference-by-actual-reference.
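For contrast -- just an illustrative sketch, not something dbus itself
does -- AF_UNIX can already pass an actual, unforgeable reference around,
namely a file descriptor, as ancillary data:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Hand an already-open fd to the peer over a connected AF_UNIX socket.
   * The receiver ends up holding a real kernel reference to the object,
   * not a name it has to re-resolve (and re-authorize) later. */
  static int send_fd(int sock, int fd_to_pass)
  {
      char dummy = 'x';
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
      struct msghdr msg = { 0 };
      struct cmsghdr *cmsg;

      memset(&u, 0, sizeof(u));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof(u.buf);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }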
>
> This also means that the bus itself is vastly simplified. It only has
> to track peers, not every object in every peer. And clients are more
> flexible with how objects are handled. They can be instantiated
> lazily, or even created algorithmically from the object path if
> needed.
True. Nonetheless, things like Cap'n Proto and seL4 are quite simple
and have real references.
>
> You wish that the kernel controlled access to a particular object in a
> peer, but once the message is dispatched into the target process all
> bets are off anyway. It will be running some code parsing your message
> in the process with no real separation from the other objects. Any bug
> there could give you wider access. I don't see how this fundamentally
> makes the whole system much more secure. On the other hand, I *do*
> remember having to track down cross-process leaks from circular
> references between processes using Corba...
If you have peer ids keeping things alive on dbus, surely you can also
have circular references, no?
>
>>> The desktop file lists the icon, name and whatnot which is displayed
>>> by the desktop environment. If DBusActivatable is true, then the app
>>> is started by sending dbus messages to the same name as the desktop
>>> file, to the org.freedesktop.Application interface, this way we can
>>> ensure a singleton app and you can do more things than just spawning
>>> it.
>>
>> How do I install apps as an unprivileged user? What about running
>> sandboxed apps that aren't installed at all? What about downloading
>> one app and running three instances of it that are all isolated from
>> each other?
>
> Users install desktop files in a directory under their home directory
> (typically ~/.local/share/applications/).
>
> xdg-app apps require some form of installation before running.
IMO that's unfortunate. If nothing else, it prevents programs from
easily starting one-off sandboxed apps that weren't separately
installed.
>
> You can run three instances of an app, but only one of them can own
> the bus name. This works out fine if your app does not use dbus, but
> it may be a problem if it uses dbus activation.
I'd really like to be able to xdg-app --stateless oowriter
some_untrusted_file.docx and have it fully functional, including
printing, even if I have another instance running.
>
>>> Well, your "other than" part kinda breaks things like launching the
>>> application. So, we need to be on the real bus.
>>> Could you then *also* have a bus per app for talking to the portal? I
>>> guess so, but I don't quite see the point. Just having the portals
>>> trying to find all new buses that come and go will be all kinds of
>>> painful.
>>
>> How many portals will there be? It seems like, if you want multiple
>> portals programs (in the org.freedesktop.portal.* sense), then you'd
>> have some awkwardness if each app were on its own bus and you didn't
>> want a proxy, but I think you'll also have prevented yourself from
>> meaningfully sandboxing the portals themselves.
>
> You can sandbox the portals to some extent, but fundamentally they are
> meant to run in some kind of "higher privileges" mode, so they have to
> have access to things normal apps would not. For instance, they have
> to be able to activate other dbus names.
>
>> Android, on the other hand, sandboxes most of its service providers,
>> and Binder provides a nice way to selectively grant capabilities
>> between sandboxes. (The privacy and security disaster that's built on
>> top of Binder is another story, but that's not Binder's fault.)
>
> Well, the service providers are not the same as the portals. Say you
> have a twitter client that you want to register to be able to share
> some selected text. The twitter client can be fully sandboxed. The
> portal is just the link between the requesting app and the list of
> registered share providers.
>
Ah. I clearly am misunderstanding something. What's a portal?
>> For launching the application, couldn't you just say "hey, portal,
>> launch this application please?" Using custom endpoints with shared
>> namespaces just for this purpose sounds unnecessary.
>
> Well, that is essentially what a portal like the share one does.
> Although it shows a user-controlled UI in between, to avoid the app
> being able to start any other app it wants.
Hmm. So shouldn't xdg-app be explicitly choosing which portals are
allowed for which sandboxed apps rather than allowing
org.freedesktop.portal.*?
>
>> Semi-shameless plug [1]: Sandstorm is inventing something quite
>> similar to what you're doing, but they're doing it with a single proxy
>> per sandbox instance and a capability model. There are no special
>> portal services -- instead, there's a general and *very* clean
>> mechanism for allowing other sandboxed applications to supply
>> interfaces that can be requested from any sandbox (with appropriate
>> user confirmation). It could be worth comparing notes. I'll try to
>> see if they have a nice summary of how the pieces fit together.
>
> That sounds similar to the portal idea, would be cool to read about.
>
>>> I don't think that is really a problem. People can do stupid things on
>>> the non-sandboxed side in all kinds of ways. You just can't protect
>>> against that.
>>
>> But you can prevent people on the non-sandboxed side from being
>> invoked *at all* from the sandbox without explicit user opt-in, which
>> makes the scope for screwing things up much smaller. If you combine
>> that with sandboxing the portal providers too, you end up in decent
>> shape.
>
> The non-sandboxed side can scribble all over your application. Trying
> to protect against it doing something wrong is imho fruitless.
Other way around. Vendors can and will write blatantly and perhaps
intentionally insecure things. If they can expose them via
org.freedesktop.portal.* so their little widget is easier to
implement, they will. If it has to go through a sharing mechanism,
then it will at least only be accessible to apps that the user has
explicitly granted access to.
>
>>> Basically, the app id is linked to the dbus names that the app may
>>> own. So, the only way it could take over org.gnome.gedit is if you
>>> actually install the app called "org.gnome.gedit". Of course if you
>>> trust some random untrusted repo and install something called that you
>>> get what you want...
>>
>> Um.
>>
>> Have you ever noticed that friendly android app names like "Yelp" have
>> nothing whatsoever to do with the app ids? Heck, it's hard to find an
>> app id shown in the UI at all when installing an Android app.
>> Nevertheless, you're supposed to be safe if you install random apps,
>> as long as you don't grant them dangerous permissions.
>>
>> So, yes, users really ought to be protected against apps that
>> partially break the sandbox because their names are malicious.
>
> Yeah. I'm aware of this. Anything that allows a free-text name and icon
> to describe the app will be able to impersonate another app (to the user,
> if not to the system). The only solution to this is manual verification
> and trust. The limitations above make it possible for a reviewer to
> check the app id and the exported files (such as the desktop file and
> the dbus service file) and catch such impersonation.
> So, if you trust an upstream to do such verification, you register
> their public gpg key for the repo.
Or you could design the system such that the name of the app simply
does not matter. An app with the same name and icon can, of course,
impersonate the app to the user, but it's in a sandbox so the damage
should be limited. Allowing an app with the same
not.shown.dotted.name to cause damage seems like a design flaw.
>
> That obviously doesn't scale super well, as you don't want every user
> to have to trust every upstream. So, what I want to do is have
> repositories that only have references to other repositories + local
> signatures. That way e.g. gnome could run an app store, which
> essentially collects a set of upstream repositories, does some manual
> verification of each new release and then slaps a signature on it. The
> user could then choose to trust this middle-man instead of each
> individual upstream. This is basically how android solves this: they
> have people looking at submitted apps. With this approach, though, you
> are not tied to only the google appstore. The whole system is distributed.
It would be nice to avoid having to trust the repository at all,
except perhaps to the extent that if you're downloading a "Yelp" app,
then perhaps you'd have to trust it enough that you believe it's the
Yelp app when you type your Yelp password into it.
>
> You're free to design such a system and write a desktop to use it.
> However, in Gnome (and in the other desktops as well), dbus is already
> used for all ipc like this and all the freedesktop specs,
> infrastructure, type systems, interfaces, code and frameworks are
> built around that. There has to be a *massive* advantage for us to use
> something else, and I'm not at all convinced by the issues you bring
> up.
The custom endpoint policy thing is brand new, whereas using a
userspace proxy for xdg-app actually sounds easier than using a
separate kdbus bus. Sticking dbus in the kernel would also be new.
--
Andy Lutomirski
AMA Capital Management, LLC
On Wed, Jun 24, 2015 at 9:43 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 24, 2015 at 10:11 AM, Alexander Larsson
> <[email protected]> wrote:
>> My name is on the dbus specification, and I am (and was
>> then) well aware of systems with object references. In fact, both
>> previous ipc systems (Corba ORBs) that Gnome used before we designed
> dbus used object references, and they had a lot of problems which dbus
>> solved for us. I'm not saying dbus is perfect, but it has its reasons
>> for the way it works.
>>
>> So, dbus-the-system has some kind of notion of an object reference
>> (peer + object path), but the *bus* is fundamentally a way to
>> communicate between peers, and the object path is merely some
>> uninterpreted metadata.
>
> I'm talking about the reference part, not the object part. Peer +
> object path is a name, not a reference.
True, it's not a reference in the "refcount" style.
>> You wish that the kernel controlled access to a particular object in a
>> peer, but once the message is dispatched into the target process all
>> bets are off anyway. It will be running some code parsing your message
>> in the process with no real separation from the other objects. Any bug
>> there could give you wider access. I don't see how this fundamentally
>> makes the whole system much more secure. On the other hand, I *do*
>> remember having to track down cross-process leaks from circular
>> references between processes using Corba...
>
> If you have peer ids keeping things alive on dbus, surely you can also
> have circular references, no?
Technically you could set up a situation where this happens, but in
practice it doesn't really. Because object paths don't keep other
processes alive, you don't accidentally get circular references,
whereas this happened a lot on corba because references were the only
thing you had.
>> You can run three instances of an app, but only one of them can own
>> the bus name. This works out fine if your app does not use dbus, but
>> it may be a problem if it uses dbus activation.
>
> I'd really like to be able to xdg-app --stateless oowriter
> some_untrusted_file.docx and have it fully functional, including
> printing, even if I have another instance running.
If that were to work, you'd have to have a way to make all the
session services it needs also listen on the new custom bus for only
that app.
>> Well, the service providers are not the same as the portals. Say you
>> have a twitter client that you want to register to be able to share
>> some selected text. The twitter client can be fully sandboxed. The
>> portal is just the link between the requesting app and the list of
>> registered share providers.
>>
>
> Ah. I clearly am misunderstanding something. What's a portal?
Well, portal is a general name for "service needed for making
sandboxed apps work". So, they can be a bit different, but in essence
they are small dbus services that facilitate communication between
different apps and between the app and the host session, in a safe
way. Think of them sort of like filtering proxies, but with a gui.
>> Well, that is essentially what a portal like the share one does.
>> Although it shows a user-controlled UI in between, to avoid the app
>> being able to start any other app it wants.
>
> Hmm. So shouldn't xdg-app be explicitly choosing which portals are
> allowed for which sandboxed apps rather than allowing
> org.freedesktop.portal.*?
Right now there is no default policy for this, as we don't really have
the portal system fully formed yet. But, yeah, using portal.* was an
example of a policy; another would be to list the allowed portals
explicitly.
>> You're free to design such a system and write a desktop to use it.
>> However, in Gnome (and in the other desktops as well), dbus is already
>> used for all ipc like this and all the freedesktop specs,
>> infrastructure, type systems, interfaces, code and frameworks are
>> built around that. There has to be a *massive* advantage for us to use
>> something else, and I'm not at all convinced by the issues you bring
>> up.
>
> The custom endpoint policy thing is brand new, whereas using a
> userspace proxy for xdg-app actually sounds easier than using a
> separate kdbus bus. Sticking dbus in the kernel would also be new.
Yeah, some code in the middle is new, but the entire infrastructure
and semantics are the same. I got the feeling you were proposing
something completely different to dbus.
On Tue, Jun 23, 2015 at 08:07:41AM -0700, Andy Lutomirski wrote:
>
> FWIW, once there are real distros with kdbus userspace enabled,
> reviewing kdbus gets more complicated -- we'll be in the position
> where merging kdbus in a different form from that which was proposed
> will break existing users.
Actually, I think distros having it in their kernel before it's in mainline is
a good thing. Let them straighten out the issues that may come up
(not to mention possible CVEs). If the distros have it in their kernels and
out in the public for 6 months or more, that may give enough information as to
whether or not it should be merged.
I don't think it will complicate things even if the API changes. The distros
will have to deal with that fallout. Mainline only cares about its own
regressions. But any API changes would only be done for good reasons, and give
the distros an excuse to fix whatever was done wrong in the first place.
-- Steve
On Wed, Jun 24, 2015 at 7:14 PM, Steven Rostedt <[email protected]> wrote:
>
> I don't think it will complicate things even if the API changes. The distros
> will have to deal with that fall out. Mainline only cares about its own
> regressions. But any API changes would only be done for good reasons, and give
> the distros an excuse to fix whatever was done wrong in the first place.
I don't think that's true.
Realistically, every single kernel developer tends to work on a
machine with some random distro. If that developer cannot compile his
own kernel because his distro stops working, or has to use some
"kdbus=0" switch to turn off the kernel kdbus and (hopefuly) the
distro just switches to the legacy user mode bus, then for that
developer, merging and enabling incompatible kdbus implementation is
basically a regression.
We've seen this before. We end up stuck with the ABI of whatever user
land applications use. It doesn't matter where that ABI came from.
I do agree that distros that want to enable kdbus before any agreed
version has been merged would get to also act as guinea pigs and do
their own QA, and handle fallout from whatever problems they encounter
etc. That part might be good. But I don't think we really end up
having the option to make up some incompatible kdbus ABI
after-the-fact.
Linus
Am Mittwoch, 24. Juni 2015, 19:20:27 schrieb Linus Torvalds:
> On Wed, Jun 24, 2015 at 7:14 PM, Steven Rostedt <[email protected]>
wrote:
> > I don't think it will complicate things even if the API changes. The
> > distros will have to deal with that fall out. Mainline only cares about
> > its own regressions. But any API changes would only be done for good
> > reasons, and give the distros an excuse to fix whatever was done wrong
> > in the first place.
> I don't think that's true.
>
> Realistically, every single kernel developer tends to work on a
> machine with some random distro. If that developer cannot compile his
> own kernel because his distro stops working, or has to use some
> "kdbus=0" switch to turn off the kernel kdbus and (hopefuly) the
> distro just switches to the legacy user mode bus, then for that
> developer, merging and enabling incompatible kdbus implementation is
> basically a regression.
>
> We've seen this before. We end up stuck with the ABI of whatever user
> land applications. It doesn't matter where that ABI came from.
>
> I do agree that distro's that want to enable kdbus before any agreed
> version has been merged would get to also act as guinea pigs and do
> their own QA, and handle fallout from whatever problems they encounter
> etc. That part might be good. But I don't think we really end up
> having the option to make up some incompatible kdbus ABI
> after-the-fact.
Linus, is that a recommendation to the distros to be careful about putting
kdbus into the distro kernel right now and rather defer it, or do you think
that the ABI of kdbus is already suitable for merging, and that you see no
issue in merging a kdbus with the ABI it currently has, but otherwise
improved?
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Am Donnerstag, 25. Juni 2015, 08:01:35 schrieb Martin Steigerwald:
> Am Mittwoch, 24. Juni 2015, 19:20:27 schrieb Linus Torvalds:
> > On Wed, Jun 24, 2015 at 7:14 PM, Steven Rostedt <[email protected]>
>
> wrote:
> > > I don't think it will complicate things even if the API changes. The
> > > distros will have to deal with that fall out. Mainline only cares
> > > about
> > > its own regressions. But any API changes would only be done for good
> > > reasons, and give the distros an excuse to fix whatever was done wrong
> > > in the first place.
> >
> > I don't think that's true.
> >
> > Realistically, every single kernel developer tends to work on a
> > machine with some random distro. If that developer cannot compile his
> > own kernel because his distro stops working, or has to use some
> > "kdbus=0" switch to turn off the kernel kdbus and (hopefuly) the
> > distro just switches to the legacy user mode bus, then for that
> > developer, merging and enabling incompatible kdbus implementation is
> > basically a regression.
> >
> > We've seen this before. We end up stuck with the ABI of whatever user
> > land applications. It doesn't matter where that ABI came from.
> >
> > I do agree that distro's that want to enable kdbus before any agreed
> > version has been merged would get to also act as guinea pigs and do
> > their own QA, and handle fallout from whatever problems they encounter
> > etc. That part might be good. But I don't think we really end up
> > having the option to make up some incompatible kdbus ABI
> > after-the-fact.
>
> Linus, is that a recommendation to the distros to be careful about putting
> kdbus into the distro kernel right now and rather defer it, or do you think
> that the ABI of kdbus is already suitable for merging, and that you see no
> issue in merging a kdbus with the ABI it currently has, but otherwise
> improved?
Or do you think that there is a different option to handle this than the
two I outlined above?
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Wed, Jun 24, 2015 at 10:39:52AM -0700, David Lang wrote:
> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>
> >And the thing is, in hindsight, after such huge flamewars, years down the line,
> >almost never do I see the following question asked: 'what were we thinking merging
> >that crap??'. If any question arises it's usually along the lines of: 'what was
> >the big fuss about?'. So I think by and large the process works.
>
> counterexamples, devfs, tux
Don't knock devfs. It created a lot of things that we take for granted
now with our development model. Off the top of my head, here's a short
list:
- it showed that we can't arbitrarily make user/kernel api
changes without working with people outside of the kernel
developer community, and expect people to follow them
- the idea was sound, but the implementation was not, it had
unfixable problems, so to fix those problems, we came up with
better, kernel-wide solutions, forcing us to unify all
device/driver subsystems.
- we were forced to try to document our user/kernel apis better,
hence Documentation/ABI/ was created
- to remove devfs, we had to create a structure of _how_ to
remove features. It took me 2-3 years to be able to finally
delete the devfs code, as the infrastructure and feedback
loops were just not in place before then to allow that to
happen.
So I would strongly argue that merging devfs was a good thing; it
spurred a lot of us to get the job done correctly. Without it, we would
have never seen the need, or had the knowledge of what needed to be
done.
thanks,
greg k-h
On Wed, 24 Jun 2015, Greg KH wrote:
> On Wed, Jun 24, 2015 at 10:39:52AM -0700, David Lang wrote:
>> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>>
>>> And the thing is, in hindsight, after such huge flamewars, years down the line,
>>> almost never do I see the following question asked: 'what were we thinking merging
>>> that crap??'. If any question arises it's usually along the lines of: 'what was
>>> the big fuss about?'. So I think by and large the process works.
>>
>> counterexamples, devfs, tux
>
> Don't knock devfs. It created a lot of things that we take for granted
> now with our development model. Off the top of my head, here's a short
> list:
> - it showed that we can't arbritrary make user/kernel api
> changes without working with people outside of the kernel
> developer community, and expect people to follow them
> - the idea was sound, but the implementation was not, it had
> unfixable problems, so to fix those problems, we came up with
> better, kernel-wide solutions, forcing us to unify all
> device/driver subsystems.
> - we were forced to try to document our user/kernel apis better,
> hence Documentation/ABI/ was created
> - to remove devfs, we had to create a structure of _how_ to
> remove features. It took me 2-3 years to be able to finally
> delete the devfs code, as the infrastructure and feedback
> loops were just not in place before then to allow that to
> happen.
>
> So I would strongly argue that merging devfs was a good thing, it
> spurned a lot of us to get the job done correctly. Without it, we would
> have never seen the need, or had the knowledge of what needed to be
> done.
I don't disagree with you, but it was definitely a case of adding something that
was later regretted and removed. A lot was learned in the process, but that
wasn't the issue I was referring to.
I don't want kdbus to end up the same way. The more I think back to those
discussions, the more parallels I see between the two.
David Lang
* David Lang <[email protected]> wrote:
> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>
> > And the thing is, in hindsight, after such huge flamewars, years down the
> > line, almost never do I see the following question asked: 'what were we
> > thinking merging that crap??'. If any question arises it's usually along the
> > lines of: 'what was the big fuss about?'. So I think by and large the process
> > works.
>
> counterexamples, devfs, tux
Actually, we never merged the Tux web server upstream, and the devfs concept has
kind of made a comeback via devtmpfs.
And there are examples of bits we _should_ have merged:
- GGI (General Graphics Interface)
- [ and we should probably also have merged kgdb a decade earlier to avoid
wasting all that energy on flaming about it unnecessarily ;-) ]
And the thing is, I specifically talked about 'near zero cost' kernel patches that
don't appreciably impact the 'core kernel'.
There's plenty of examples of features with non-trivial 'core kernel' costs that
weren't merged, and rightfully IMHO:
- the STREAMS ABI
- various forms of a generic kABI that were proposed
- moving the kernel to C++ :-)
... and devfs arguably belongs into that category as well.
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
>
> * David Lang <[email protected]> wrote:
>
> > On Wed, 24 Jun 2015, Ingo Molnar wrote:
> >
> > > And the thing is, in hindsight, after such huge flamewars, years down the
> > > line, almost never do I see the following question asked: 'what were we
> > > thinking merging that crap??'. If any question arises it's usually along the
> > > lines of: 'what was the big fuss about?'. So I think by and large the process
> > > works.
> >
> > counterexamples, devfs, tux
>
> Actually, we never merged the Tux web server upstream, and the devfs concept has
> kind of made a comeback via devtmpfs.
Bits of devfs also live on in sysfs. So devfs wasn't a bad initial idea IMHO, but
we had to do one more (incompatible ...) iteration to figure out why we didn't
like it.
Furthermore, I'm pretty sure there's a snowball's chance in hell that we'd have
ended up with the current pretty cleaned up hardware/system ABI _without_ devfs.
So it was a necessary pain.
Thanks,
Ingo
On Wed, Jun 24, 2015 at 9:12 PM, David Lang <[email protected]> wrote:
> On Wed, 24 Jun 2015, Martin Steigerwald wrote:
>> Am Mittwoch, 24. Juni 2015, 10:39:52 schrieb David Lang:
>>> On Wed, 24 Jun 2015, Ingo Molnar wrote:
>>>> And the thing is, in hindsight, after such huge flamewars, years down
>>>> the line, almost never do I see the following question asked: 'what
>>>> were we thinking merging that crap??'. If any question arises it's
>>>> usually along the lines of: 'what was the big fuss about?'. So I think
>>>> by and large the process works.
>>>
>>> counterexamples, devfs, tux
>>
>> What was tux?
>
> in-kernel webserver
Which was cool, and small, and _faster_ than anything else...
Until it was integrated, and people working on (userspace) webservers
started considering its performance as a target, and soon it was
out-performed by userspace webservers...
So it did teach us a lesson...
(Perhaps the above paragraph is actually good advocacy for integrating
kdbus, and for seeding a better userspace implementation? ;-)
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Thu, Jun 25, 2015 at 08:05:58AM +0200, Martin Steigerwald wrote:
>
> Or do you think that there is a different option to handle this than the
> two I outlined above?
Hmm... distros could have their engineers **fix** the busted userspace
code, instead of fixing the problem by jamming a different
implementation into the kernel?
- Ted
Am Donnerstag, 25. Juni 2015, 09:34:56 schrieb Theodore Ts'o:
> On Thu, Jun 25, 2015 at 08:05:58AM +0200, Martin Steigerwald wrote:
> > Or, do you think, that there is a different option to handle this then
> > the both I outlined above?
>
> Hmm... distros could have their engineers **fix** the busted userspace
> code, instead of fixing the problem by jamming a different
> implementation into the kernel?
Hmm, I read on the Devuan mailing list that Qt engineers are working on doing
dbus directly inside Qt instead of using the existing libdbus. I have not
verified this claim yet. But considering what I read here about performance
issues with libdbus, I think it would make quite a lot of sense.
Also, I wonder who, among the desktop environment people besides xdg-app, will
use the sdbus stuff from systemd / libsystemd – I sure hope sdbus will work
without systemd running as PID 1, but I am not clear on this either. I doubt
that Qt will depend on it, since Qt is available for more than the Linux
platform. And if GNOME wants to be portable to at least the BSD variants, they
can't depend on it either.
So who will use the non-portable sdbus anyway – except specialized apps?
In case I missed this in the discussion so far, sorry, but from what I have
read in the various threads I am really not clear on this.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Thu, Jun 25, 2015 at 09:57:45AM +0200, Geert Uytterhoeven wrote:
> >
> > in-kernel webserver
>
> Which was cool, and small, and _faster_ than anything else...
> Until it was integrated, and people working on (userspace) webservers
> started considering its performance as a target, and soon it was
> out-performed by userspace webservers...
>
> So it did teach us a lesson...
>
> (Perhaps the above paragraph is actually good advocacy for integrating
> kdbus, and for seeding a better userspace implementation? ;-)
>
Except back then, the userspace web servers were created by the competition
and there was a strong incentive to beat tux.
But today, kdbus is written by the same folks that write dbus, and there's no
other competition. There's no incentive to fix dbus once kdbus is merged, and
in fact, it gives incentive to just drop it completely.
-- Steve
[delurk; apparently kdbus is not receiving the architectural review it should.
i've got quite a bit of knowledge on message-passing mechanisms in general, and
kernel IPC in particular, so i'll weigh in uninvited. apologies for length.
as my "proper" review on this topic is still under construction, i'll try (and
fail) to be brief here. i started down that road only to realize that kdbus is
quite the ball of mud even if the only thing under the scope is its interface,
and that if i held off until properly ready i'd risk kdbus having already been
merged, making review moot.]
Ingo Molnar wrote:
>- I've been closely monitoring Linux kernel changes for over 20 years, and for the
> last 10 years the linux/ipc/* code has been dormant: it works and was kept good
> for existing usecases, but no-one was maintaining and enhancing it with the
> future in mind.
It's my understanding that linux/ipc/* contains only SysV IPC, i.e. shm, sem,
SysV message queues, and POSIX message queues. There are other IPC-implementing
things in the kernel also, such as unix domain sockets, pipes, shared memory
via mmap(), signals, mappings that appear shared across fork(), and whatever
else provides either kernel-mediated multi-client buffer access or some
combination of shared memory and synchronization that lets userspace exchange
hot data across the address space boundary.
It's also my understanding that no-one in their right mind would call SysV IPC
state-of-the-art even at the level of interface; indeed its presence in the
hoariest of vendor unixes suggests it's not supposed to be even close.
However, the suggested replacement in kdbus replicates the worst[-1] of all
known user-to-user IPC mechanisms, i.e. Mach. I'm not suggesting that Linux
adopt e.g. a different microkernel IPC mechanism-- those are by and large
inapplicable to a monolithic kernel for reasons of ABI (and, well, why would
you do IPC when function calls are zomgfast already?)-- but rather, that the
existing ones either are good enough at this time or can be reworked to become
near-equivalent to the state of the art in terms of performance.
> So there exists a technical vacuum: the kernel does not have any good, modern
> IPC ABI at the moment that distros can rely on as a 'golden standard'. This is
> partly technical, partly political. The technical reason is that SysV IPC is
> ancient and cumbersome. The political reason is that SystemD could be using
> and extending Android's existing kernel accelerated IPC subsystem (Binder)
> that is already upstream - but does not.
I'll contend that the reason for this vacuum is that the existing kernel IPC
interfaces are fine to the point that other mechanisms may be derived from
them solely in user-space without significant performance demerit, and without
pushing ca. 10k SLOC of IPC broker and policy engine into kernel space.
Furthermore, it's my well-ruminated opinion that implementations of the
userspace ABI specified in the kdbus 4.1-rc1 version (of April this year) will
always be necessarily slower than existing IPC primitives in terms of both
throughput and latency; and that the latter are directly applicable to
constructing a more convenient user-space IPC broker that implements what
kdbus seeks to provide: naming, broadcast, unidirectional signaling,
bidirectional "method calls", and a policy mechanism.
In addition I'll argue that as currently specified, the kdbus interface-- even
if tuned to its utmost-- is not only necessarily inferior to e.g. a well-tuned
version of unix domain sockets, but also fundamentally flawed in ways that
prohibit construction of robust in-system distributed programs by kdbus'
mechanisms alone (i.e. byzantine call-site workarounds notwithstanding).
For the first, compare unix domain sockets (i.e. point-to-point mode, access
control through filesystem [or fork() parentage], read/write/select) to the
kdbus message-sending ioctl. In the main data-exchanging portion, the former
requires only a connection identifier, a pointer to a buffer, and the length
of data in that buffer. To contrast, kdbus takes a complex message-sending
command structure with 0..n items of m kinds that the ioctl must parse in a
m-way switching loop, and then another complex message-describing structure
which has its own 1..n items of another m kinds describing its contents,
destination-lookup options, negotiation of supported options, and so forth.
Consequently, a carefully optimized implementation of unix domain sockets (and
by extension all the data-carrying SysV etc. IPC primitives, optimized
similarly) will always be superior to kdbus for both message throughput and
latency, for the reason of kdbus' comparatively great interface complexity
alone.
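To spell out the socket side of that comparison (the socket path below is
of course made up), the steady-state hot path is a connected fd plus one
write() per message -- there is nothing for the kernel to parse:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  /* Connect once; error handling trimmed to the essentials. */
  static int uds_connect(const char *path)
  {
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);

      if (fd < 0)
          return -1;
      strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
      if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          close(fd);
          return -1;
      }
      return fd;
  }

  /* The entire per-message "command": a fd, a buffer pointer, a length. */
  static ssize_t uds_send(int fd, const void *buf, size_t len)
  {
      return write(fd, buf, len);
  }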
There's an obvious caveat here, i.e. "well where is it, then?". Given the
overhead dictated by its interface, kdbus' performance is already inferior for
short messages. For long messages (> L1 cache size per Stetson-Harrison[0]) the
only performance benefit from kdbus is its claimed single-copy mode of
operation-- an equivalent to which could be had with ye olde sockets by copying
data from the writer directly into the reader while one of them blocks[1] in
the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
sockets, etc. don't do this doesn't really factor in.
For the second, kdbus is fundamentally designed to buffer message data, up to
a fixed limit, in the pool associated with receivers' connections. I cannot
overstate the degree of this _outright architectural blunder_, so I'll put an
extra paragraph break here just for emphasis' sake.
A consequence of this buffering is that whenever a client sends a message with
kdbus, it must be prepared to handle an out-of-space non-delivery status.
(kdbus has two of those, one for queue length and another for buffer space.
why, i have no idea-- do clients have a different behaviour in response to one
of them from the other?) There's no option to e.g. overwrite a previous
message, or to discard queued messages in an oldest-first order, instead of
rebuffing the sender.
For broadcast messaging, a recipient may observe that messages were dropped by
looking at a `dropped_msgs' field delivered (and then reset) as part of the
message reception ioctl. Its value is the number of messages dropped since last
read, so arguably a client could achieve the equivalent of the condition's
absence by resynchronizing explicitly with all signal-senders on its current
bus wrt which it knows the protocol, when the value is >0. This method could in
principle apply to 1-to-1 unidirectional messaging as well[2].
Looking at the kdbus "send message, wait for tagged reply" feature in
conjunction with these details appears to reveal two holes in its state graph.
The first is that if replies are delivered through the requestor's buffer,
concurrent sends into that same buffer may cause it to become full (or the
queue to grow too long, w/e) before the service gets a chance to reply. If this
condition causes a reply to fall out of the IPC flow, the requestor will hang
until either its specified timeout happens or it gets interrupted by a signal.
If replies are delivered outside the shm pool, the requestor must be prepared
to pick them up using a different means from the "in your pool w/ offset X,
length Y" format the main-line kdbus interface provides. [i've seen no such
thing in the kdbus docs so far.]
As far as alternative solutions go, preallocation of space for a reply message
is an incomplete fix unless every reply's size has a known upper bound (e.g.
with use of an IDL compiler); in this scheme it'd be necessary for the
requestor to specify this, suffering consequences if the number is too low, and
to prepare to handle a "not enough buffer space for a reply" condition at send.
The kdbus docs specify no such condition.
The second problem is that given how there can be a timeout or interrupt on the
receive side of a "method call" transaction, it's possible for the requestor to
bow out of the IPC flow _while the service is processing its request_. This
results either in the reply message being lost, or its ending up in the
requestor's buffer to appear in a loop where it may not be expected. Either
way, the client must at that point resynchronize wrt all objects related to the
request's side effects, or abandon the IPC flow entirely and start over.
(services need only confirm their replies before effecting e.g. a chardev-like
"destructively read N bytes from buffer" operation's outcome, which is slightly
less ugly.)
Tying this back into the first point: to prevent this type of denial-of-service
against sanguinely-written software it's necessary for kdbus to invoke the
policy engine to determine that an unrelated participant isn't allowed to
consume a peer's buffer space. As this operation is absent in unix-domain
sockets, an ideal implementation of kdbus 4.1-rc1 will be slower in
point-to-point communication even if the particulars of its message-descriptor
format get reworked to a light-weight alternative. In addition, its API ends up
requiring highly involved state-tracking wrappers or inversion-of-control
machinery in its clients, to the point where just using unix domain sockets
with a heavyweight user-space broker would be nicer.
It's my opinionated conclusion that merging kdbus as-is would be the sort of
cock-up which we'll look back at, point a finger, giggle a bit, and wonder only
half-jokingly if there was something besides horse bones in that glue. Its docs
betray an absence of careful analysis, and the spec of its interface is so
loose as to make programs written for kdbus 4.1-rc1 subtly incompatible to any
later program through deeply-baked design consequences stemming from quirks of
its current implementation.
I'm not a Linux kernel developer. But if I were, this would be where I'd put
my NAK.
Sincerely,
-KS
[-1] author's opinion
[0] no bunny rabbits were harmed
[1] the case where both use non-blocking I/O requires either a buffer or
support from the scheduler. the former is no optimization at all, and the
latter may be _quite involved indeed_.
[2] as for whether freedesktop.org programs will be designed and built to such
a standard, i suspend judgement.
Hi
On Wed, Jul 1, 2015 at 2:03 AM, Kalle A. Sandstrom <[email protected]> wrote:
> For the first, compare unix domain sockets (i.e. point-to-point mode, access
> control through filesystem [or fork() parentage], read/write/select) to the
> kdbus message-sending ioctl. In the main data-exchanging portion, the former
> requires only a connection identifier, a pointer to a buffer, and the length
> of data in that buffer. To contrast, kdbus takes a complex message-sending
> command structure with 0..n items of m kinds that the ioctl must parse in a
> m-way switching loop, and then another complex message-describing structure
> which has its own 1..n items of another m kinds describing its contents,
> destination-lookup options, negotiation of supported options, and so forth.
sendmsg(2) uses a very similar payload to kdbus. send(2) is a shortcut
to simplify the most common use-case. I'd be more than glad to accept
patches adding such shortcuts to kdbus, if accompanied by benchmark
numbers and reasoning why this is a common path for dbus/etc. clients.
The kdbus API is kept generic and extendable, while trying to keep
runtime overhead minimal. If this overhead turns out to be a
significant runtime slowdown (which none of my benchmarks showed), we
should consider adding shortcuts. Until then, I prefer an API that is
consistent, easy to extend and flexible.
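(To make the analogy concrete -- the buffers here are placeholders -- the
shortcut and the generic form look roughly like this:)

  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  /* Common case: one contiguous buffer to one already-connected peer. */
  static ssize_t simple_send(int fd, const void *buf, size_t len)
  {
      return send(fd, buf, len, 0);
  }

  /* Generic case: a scatter/gather payload described by items, roughly the
   * shape that an item-based command structure generalizes further. */
  static ssize_t vectored_send(int fd, void *hdr, size_t hdr_len,
                               void *body, size_t body_len)
  {
      struct iovec iov[2] = {
          { .iov_base = hdr,  .iov_len = hdr_len  },
          { .iov_base = body, .iov_len = body_len },
      };
      struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

      return sendmsg(fd, &msg, 0);
  }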
> Consequently, a carefully optimized implementation of unix domain sockets (and
> by extension all the data-carrying SysV etc. IPC primitives, optimized
> similarly) will always be superior to kdbus for both message throughput and
> latency, [...]
Yes, that's due to the point-to-point nature of UDS.
> [...] For long messages (> L1 cache size per Stetson-Harrison[0]) the
> only performance benefit from kdbus is its claimed single-copy mode of
> operation-- an equivalent to which could be had with ye olde sockets by copying
> data from the writer directly into the reader while one of them blocks[1] in
> the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
> sockets, etc. don't do this doesn't really factor in.
Parts of the network subsystem have supported single-copy (mmap'ed IO)
for quite some time. kdbus mandates it, but otherwise is not special
in that regard.
> A consequence of this buffering is that whenever a client sends a message with
> kdbus, it must be prepared to handle an out-of-space non-delivery status.
> [...] There's no option to e.g. overwrite a previous
> message, or to discard queued messages in an oldest-first order, instead of
> rebuffing the sender.
Correct.
> For broadcast messaging, a recipient may observe that messages were dropped by
> looking at a `dropped_msgs' field delivered (and then reset) as part of the
> message reception ioctl. Its value is the number of messages dropped since last
> read, so arguably a client could achieve the equivalent of the condition's
> absence by resynchronizing explicitly with all signal-senders on its current
> bus wrt which it knows the protocol, when the value is >0. This method could in
> principle apply to 1-to-1 unidirectional messaging as well[2].
Correct.
> Looking at the kdbus "send message, wait for tagged reply" feature in
> conjunction with these details appears to reveal two holes in its state graph.
> The first is that if replies are delivered through the requestor's buffer,
> concurrent sends into that same buffer may cause it to become full (or the
> queue to grow too long, w/e) before the service gets a chance to reply. If this
> condition causes a reply to fall out of the IPC flow, the requestor will hang
> until either its specified timeout happens or it gets interrupted by a signal.
If sending a reply fails, the kdbus_reply state is destructed and the
caller must be woken up. We do that for sync-calls just fine, but the
async case does indeed lack a wake-up in the error path. I noted this
down and will fix it.
> If replies are delivered outside the shm pool, the requestor must be prepared
> to pick them up using a different means from the "in your pool w/ offset X,
> length Y" format the main-line kdbus interface provides. [...]
Replies are never delivered outside the shm pool.
> The second problem is that given how there can be a timeout or interrupt on the
> receive side of a "method call" transaction, it's possible for the requestor to
> bow out of the IPC flow _while the service is processing its request_. This
> results either in the reply message being lost, or its ending up in the
> requestor's buffer to appear in a loop where it may not be expected. Either
(for completeness: we properly support resuming interrupted sync-calls)
> way, the client must at that point resynchronize wrt all objects related to the
> request's side effects, or abandon the IPC flow entirely and start over.
> (services need only confirm their replies before effecting e.g. a chardev-like
> "destructively read N bytes from buffer" operation's outcome, which is slightly
> less ugly.)
Correct. If you time-out, or refuse to resume, a sync-call, you have
to treat this transaction as failed.
> Tying this back into the first point: to prevent this type of denial-of-service
> against sanguinely-written software it's necessary for kdbus to invoke the
> policy engine to determine that an unrelated participant isn't allowed to
> consume a peer's buffer space.
It's not the policy engine, but quota-handling, but otherwise correct.
Thanks
David
On Wed, Jul 01, 2015 at 06:51:41PM +0200, David Herrmann wrote:
> Hi
>
Thanks for the answers; in response I've got some further questions. Again,
apologies for length -- I apparently don't know how to discuss IPC tersely.
> On Wed, Jul 1, 2015 at 2:03 AM, Kalle A. Sandstrom <[email protected]> wrote:
> > For the first, compare unix domain sockets (i.e. point-to-point mode, access
> > control through filesystem [or fork() parentage], read/write/select) to the
> > kdbus message-sending ioctl. In the main data-exchanging portion, the former
> > requires only a connection identifier, a pointer to a buffer, and the length
> > of data in that buffer. To contrast, kdbus takes a complex message-sending
> > command structure with 0..n items of m kinds that the ioctl must parse in a
> > m-way switching loop, and then another complex message-describing structure
> > which has its own 1..n items of another m kinds describing its contents,
> > destination-lookup options, negotiation of supported options, and so forth.
>
> sendmsg(2) uses a very similar payload to kdbus. send(2) is a shortcut
> to simplify the most common use-case. I'd be more than glad to accept
> patches adding such shortcuts to kdbus, if accompanied by benchmark
> numbers and reasoning why this is a common path for dbus/etc. clients.
>
A shortcut special case for e.g. only iovec-like payload items, only to a
numerically designated peer, and only RPC forms, should be an immediate gain
given that reduced functionality would lower the number of instructions
executed, the number of unpredictable branches met, and the number of
possibly-cold cache lines accessed.
The difference in raw cycles should be significant in comparison to the
number of kernel exits avoided during a client's RPC to a service and the
associated reply. Assuming that such RPCs are the bulk of what kdbus will
do, and that context-switch avoidance is crucial to the performance argument in its
design, it seems silly not to have such a fast-path -- even if it is
initially implemented as a simple wrapper of the full send ioctl.
It would also put the basic send operation on par with sendmsg(2) over a
connected socket in terms of interface complexity, and simplify any future
"exit into peer without scheduler latency" shenanigans.
However, these gains would go unobserved in code written to the current
kdbus ABI. Bridging to such a fast-path from the full interface would
eliminate most of its benefits while hurting its legit callers.
That being said, considering that the eventual de-facto user API to kdbus is
a library with explicit deserialization, endianness conversion, and
suchlike, I could see how the difference would go unobserved.
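(For illustration -- plain POSIX on a connected AF_UNIX socket, deliberately
not reproducing any kdbus structures -- the "shortcut vs. generic" shape I
mean is roughly this:)

/* Illustration only: the send(2)/sendmsg(2) analogy, on a connected
 * AF_UNIX socket.  No kdbus ABI details are shown. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* the shortcut form: connection fd, buffer, length */
static ssize_t simple_send(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, 0);
}

/* the generic form: an iovec list wrapped in a msghdr that the kernel has
 * to parse -- roughly the level of generality that kdbus' item lists add */
static ssize_t generic_send(int fd, struct iovec *iov, int iovcnt)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = iovcnt;
    return sendmsg(fd, &msg, 0);
}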
> The kdbus API is kept generic and extendable, while trying to keep
> runtime overhead minimal. If this overhead turns out to be a
> significant runtime slowdown (which none of my benchmarks showed), we
> should consider adding shortcuts. Until then, I prefer an API that is
> consistent, easy to extend and flexible.
>
Out of curiosity, what payload item types do you see being added in the near
future, e.g. the next year? UDS knows only of simple buffers, scatter/gather
iovecs, and inter-process dup(2); and recent Linux adds sourcing from a file
descriptor. Perhaps a "pass this previously-received message on" item?
> > Consequently, a carefully optimized implementation of unix domain sockets (and
> > by extension all the data-carrying SysV etc. IPC primitives, optimized
> > similarly) will always be superior to kdbus for both message throughput and
> > latency, [...]
>
> Yes, that's due to the point-to-point nature of UDS.
>
Does this change for broadcast, unassociated, or doubly-addressed[0]
operation? For the first, kdbus must already cause allocation of cache lines
in proportion to msg_length * n_recvrs, which mutes the broker's single-copy
advantage as the number of receivers grows. For the second, name lookup from
(say) a hash table only adds to required processing, though the resulting
identifier could be re-used immediately afterward; and the third mode would
prohibit that optimization altogether.
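(Illustrative numbers only: a 4 KiB broadcast delivered to 50 subscribed
receivers still means writing on the order of 4 KiB * 50 = 200 KiB into the
receivers' pools within the one send operation, so whatever is saved by
avoiding an extra per-receiver copy becomes a smaller share of the total
work as n_recvrs grows.)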
Relatedly, is there publicly-available data concerning the distribution of
various dbus IPC modalities? Such as a desktop booting under systemd,
running for a decent bit, and shutting down; or the automotive industry's
(presumably signaling-heavy) use cases, for which I've heard a figure of
600k transactions before steady state quoted.
> > [...] For long messages (> L1 cache size per Stetson-Harrison[0]) the
> > only performance benefit from kdbus is its claimed single-copy mode of
> > operation-- an equivalent to which could be had with ye olde sockets by copying
> > data from the writer directly into the reader while one of them blocks[1] in
> > the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
> > sockets, etc. don't do this doesn't really factor in.
>
> Parts of the network subsystem have supported single-copy (mmap'ed IO)
> for quite some time. kdbus mandates it, but otherwise is not special
> in that regard.
>
I'm not intending to discuss mmap() tricks, but rather that with the
existing system calls a pending write(2) would be made to substitute for the
in-kernel buffer where the corresponding read(2) grabs its bytes from; or
vice versa. This'd make conventional IPC single-copy while permitting the
receiver to designate an arbitrary location for its data; e.g. an IPC daemon
first reading a message header from a sender's socket, figuring out its
routing and allocation, and then receiving the body directly into the
destination shm pool.
That's not directly related to kdbus, except as a hypothetical [transparent]
speed-up for a strictly POSIX user-space reimplementation using the same
mmap()'d shm-pool receive semantics.
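(As a sketch of that kind of broker, with ordinary POSIX calls, a made-up
route() helper, and short-read handling omitted:)

/* Hypothetical broker receive path: read the fixed-size header first,
 * decide where the body goes, then read the body straight into that
 * destination -- no intermediate bounce buffer in the broker. */
#include <stdint.h>
#include <unistd.h>

struct msg_hdr {
    uint32_t dst;   /* routing input */
    uint32_t len;   /* body length */
};

/* route() stands in for whatever picks a slot in the destination's
 * receive area (e.g. its shm pool). */
static int broker_receive(int src_fd,
                          void *(*route)(uint32_t dst, uint32_t len))
{
    struct msg_hdr hdr;
    void *dst_buf;

    if (read(src_fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr))
        return -1;

    dst_buf = route(hdr.dst, hdr.len);
    if (!dst_buf)
        return -1;

    /* the body lands directly where the receiver will consume it */
    return read(src_fd, dst_buf, hdr.len) == (ssize_t)hdr.len ? 0 : -1;
}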
[snipped the bit about sender's buffer-full handling]
> > For broadcast messaging, a recipient may observe that messages were dropped by
> > looking at a `dropped_msgs' field delivered (and then reset) as part of the
> > message reception ioctl. Its value is the number of messages dropped since last
> > read, so arguably a client could achieve the equivalent of the condition's
> > absence by resynchronizing explicitly with all signal-senders on its current
> > bus wrt which it knows the protocol, when the value is >0. This method could in
> > principle apply to 1-to-1 unidirectional messaging as well[2].
>
> Correct.
>
As a followup question, does the synchronous RPC mode also return
`dropped_msgs'? If so, does it reset the counter? Such operation would seem
to complicate all RPC call sites, and I didn't find it discussed in the
in-tree kdbus documentation.
[case of RPC client timeout/interrupt before service reply]
> If sending a reply fails, the kdbus_reply state is destructed and the
> caller must be woken up. We do that for sync-calls just fine, but the
> async case does indeed lack a wake-up in the error path. I noted this
> down and will fix it.
>
What's the reply sender's error code in this case, i.e. failure due to
the caller having bowed out? The spec suggests either ENXIO (for a disappeared
client), ECONNRESET (for a deactivated client connection), or EPIPE (by
reflection to the service from how it's described in kdbus.message.xml).
(Also, I'm baffled by the difference between ENXIO and ECONNRESET. Is there
a circumstance where a kdbus application would care about the difference
between a peer's not having been there to begin with, and its connection not
being active?)
[snipped speculation about out-of-pool reply delivery, which doesn't happen]
> > The second problem is that given how there can be a timeout or interrupt on the
> > receive side of a "method call" transaction, it's possible for the requestor to
> > bow out of the IPC flow _while the service is processing its request_. This
> > results either in the reply message being lost, or its ending up in the
> > requestor's buffer to appear in a loop where it may not be expected. Either
>
> (for completeness: we properly support resuming interrupted sync-calls)
>
How does the client do this? A quick grep through the docs didn't show any
hits for "resume".
Moreover, can the client resume an interrupted sync-call equivalently both
before and after the service has picked the request up, including the
service's call to release the shm-pool allocation (and possibly also the
kdbus tracking structures)? That's to say: is there any case where
non-idempotent RPCs, or their replies, would end up duplicated due to
interrupts or timeouts?
> > way, the client must at that point resynchronize wrt all objects related to the
> > request's side effects, or abandon the IPC flow entirely and start over.
> > (services need only confirm their replies before effecting e.g. a chardev-like
> > "destructively read N bytes from buffer" operation's outcome, which is slightly
> > less ugly.)
>
> Correct. If you time-out, or refuse to resume, a sync-call, you have
> to treat this transaction as failed.
>
More generally speaking, what's the use case for having a timeout? What's a
client to do?
Given that an RPC client can, having timed out, either resume (having e.g.
approximated cooperative multitasking in between? [surely not?]) or abort
(to try it over again?), I don't understand why this API is there in the
first place. Unless it's (say) a reheating of the poor man's distributed
deadlock detection, but that's a bit of a nasty way of looking at it -- RPC
dependencies should be acyclic anyhow.
For example, let's assume a client and two services, one being what the
client calls into and the other what the first uses to implement some
portion of its interface. The first does a few things and then calls into
the second before replying; the second's service code ends up incurring a
delay from sleeping on a mutex, waiting for disk/network I/O, or local
system load. The first service does not specify a timeout, but the
chain-initiating client does, and this timeout ends up being reached due to
delay in the second service. (notably, the timeout will not have occurred
due to a deadlock and so cannot be resolved by the intermediate chain
releasing mutex'd primitives and starting over.)
How does the client recover from the timeout? Are intermediary services
required to exhibit composable idempotence? Is there a greater transaction
bracket around a client RPC, so that rollback/commit can happen regardless
of intermediaries?
> > Tying this back into the first point: to prevent this type of denial-of-service
> > against sanguinely-written software it's necessary for kdbus to invoke the
> > policy engine to determine that an unrelated participant isn't allowed to
> > consume a peer's buffer space.
>
> It's not the policy engine, but quota-handling, but otherwise correct.
>
I have some further questions on the topic of shm pool semantics. (I'm
trying to figure out what a robust RPC client call-site would look like, as
that is part of kdbus' end-to-end performance.)
Does the client receive a status code if a reply fails due to the quota
mechanism? This, again, I didn't find in the spec.
Is there some way for an RPC peer to know that replies below a certain size
will be delivered regardless of the state of the client's shm pool at the
time of reply? Such as a per-connection parameter (i.e. one that's
implicitly a part of the protocol) or a per-RPC field that a client may set
to achieve reliable operation without "no reply because buffer full"
handling even in the face of concurrent senders.
Does the client receive scattered data if its reception pool has enough room
for a reply message, but the largest piece is smaller than the reply
payload? If so, is there some method by which a sender could indicate
legitimate breaks in the message contents, e.g. between strings or integers,
so that a middleware (IDL compiler, basically) could wrap that data into a
function call's out-parameters without doing an extra gather stage [copy] in
userspace? If not, must a client process call into the message-handling part
of its main loop (to release shmpool space by handling messages) whenever a
reply fails for this reason?
Interestedly,
-KS
[0] I'm referring to the part where a send operation (or the message) may
specify both a numeric recipient ID and a name, which kdbus would check
against each other, rejecting the message on a mismatch.
On Mon 2015-06-22 23:41:40, Greg KH wrote:
> On Mon, Jun 22, 2015 at 11:06:09PM -0700, Andy Lutomirski wrote:
> > Hi Linus,
> >
> > Can you opine as to whether you think that kdbus should be merged?
>
> Ah, a preemptive pull request denial, how nice.
> I don't think I've ever seen such a thing before, congratulations for
creating something that must have previously been lacking in our
development model of how to work together in a community in a productive
> manner.
Apparently, new tools are needed in the community, as normal review
comments did not stop the drivers/android/binder.c merge.
For example, binder_transaction does not exactly look like kernel
code, "TODO: fput" does not really inspire confidence, and the amount of
BUG_ON()s is quite amazing...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, Jul 8, 2015 at 3:54 PM, Pavel Machek <[email protected]> wrote:
> Apparently, new tools are needed in the community, as normal review
> comments did not stop the drivers/android/binder.c merge.
>
> For example, binder_transaction does not exactly look like kernel
> code, "TODO: fput" does not really inspire confidence, and the amount of
> BUG_ON()s is quite amazing...
Amazingly, checkpatch (without --strict) only complains about long lines.
Seems like the test for "BUG" is (and always has been) commented out...
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Thu, 2015-07-09 at 10:39 +0200, Geert Uytterhoeven wrote:
> On Wed, Jul 8, 2015 at 3:54 PM, Pavel Machek <[email protected]> wrote:
> > Apparently, new tools are needed in the community, as normal review
> > comments did not stop the drivers/android/binder.c merge.
> >
> > For example, binder_transaction does not exactly look like kernel
> > code, "TODO: fput" does not really inspire confidence, and the amount of
> > BUG_ON()s is quite amazing...
>
> Amazingly, checkpatch (without --strict) only complains about long lines.
>
> Seems like the test for "BUG" is (and always has been) commented out...
Maybe (requires --strict when scanning files)
---
scripts/checkpatch.pl | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 90e1edc..11c8186 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3439,13 +3439,15 @@ sub process {
}
}
-# # no BUG() or BUG_ON()
-# if ($line =~ /\b(BUG|BUG_ON)\b/) {
-# print "Try to use WARN_ON & Recovery code rather than BUG() or BUG_ON()\n";
-# print "$herecurr";
-# $clean = 0;
-# }
+# avoid BUG() or BUG_ON()
+ if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
+ my $msg_type = \&WARN;
+ $msg_type = \&CHK if ($file);
+ &{$msg_type}("AVOID_BUG",
+ "Avoid crashing the kernel - Try using WARN_ON & Recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
+ }
+# avoid LINUX_VERSION_CODE
if ($line =~ /\bLINUX_VERSION_CODE\b/) {
WARN("LINUX_VERSION_CODE",
"LINUX_VERSION_CODE should be avoided, code should be for the version to which it is merged\n" . $herecurr);
On Thu, Jul 9, 2015 at 12:29 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2015-07-09 at 10:39 +0200, Geert Uytterhoeven wrote:
>> On Wed, Jul 8, 2015 at 3:54 PM, Pavel Machek <[email protected]> wrote:
>> > Apparently, new tools are needed in the community, as normal review
>> > comments did not stop the drivers/android/binder.c merge.
>> >
>> > For example, binder_transaction does not exactly look like kernel
>> > code, "TODO: fput" does not really inspire confidence, and the amount of
>> > BUG_ON()s is quite amazing...
>>
>> Amazingly, checkpatch (without --strict) only complains about long lines.
>>
>> Seems like the test for "BUG" is (and always has been) commented out...
>
> Maybe (requires --strict when scanning files)
> ---
> scripts/checkpatch.pl | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
Thanks!
Detected 31 occurrences (+ 1 commented out), shudder...
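(For anyone wanting to reproduce that count, something like
    ./scripts/checkpatch.pl --strict -f drivers/android/binder.c
should do it; --strict is needed because the new message is only a CHK when
scanning whole files, per the $file check in the patch.)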
Tested-by: Geert Uytterhoeven <[email protected]>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Thu 2015-07-09 10:39:58, Geert Uytterhoeven wrote:
> On Wed, Jul 8, 2015 at 3:54 PM, Pavel Machek <[email protected]> wrote:
> > Apparently, new tools are needed in the community, as normal review
> > comments did not stop the drivers/android/binder.c merge.
> >
> > For example, binder_transaction does not exactly look like kernel
> > code, "TODO: fput" does not really inspire confidence, and the amount of
> > BUG_ON()s is quite amazing...
>
> Amazingly, checkpatch (without --strict) only complains about long lines.
Well, checkpatch only tells half of the story.
Anyway, the worst problem is that there's no documentation of the
kernel<->user interface binder provides, which makes understanding it
hard/impossible.
The closest thing to a documentation pointer is:
* Based on, but no longer compatible with, the original
* OpenBinder.org binder driver interface, which is:
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <[email protected]> wrote:
> 2. Kdbus introduces a novel buffering model. Receivers allocate a big
> chunk of what's essentially tmpfs space. Assuming that space is
> available (in a virtual memory sense), senders synchronously write to
> the receivers' tmpfs space. Broadcast senders synchronously write to
> *all* receivers' tmpfs space. I think that, regardless of
> implementation, this is problematic if the sender and the receiver are
> in different memcgs. Suppose that the message is to be written to a
> page in the receivers' tmpfs space that is not currently resident. If
> the write happens in the sender's memcg context, then a receiver can
> effectively allocate an unlimited number of pages in the sender's
> memcg, which will, in practice, be the init memcg if the sender is
> systemd. This breaks the memcg model. If, on the other hand, the
> sender writes to the receiver's tmpfs space in the receiver's memcg
> context, then the sender will block (or fail? presumably
> unpredictable failures are a bad thing) if the receiver's memcg is at
> capacity.
I realize that everyone is sick of this thread. Nonetheless, I should
emphasize that I'm actually serious about this issue. I got Fedora
Rawhide working under kdbus (thanks, everyone!), and I ran this little
program:
#include <systemd/sd-bus.h>
#include <err.h>
int main(int argc, char *argv[])
{
    while (1) {
        sd_bus *bus;
        if (sd_bus_open_system(&bus) < 0) {
            /* warn("sd_bus_open_system"); */
            continue;
        }
        sd_bus_close(bus);
    }
}
under both userspace dbus and under kdbus. Userspace dbus burns some
CPU -- no big deal. I expected kdbus to fail to scale and burn a
disproportionate amount of CPU (because I don't see how it /can/
scale). Instead it fell over completely. I didn't bother debugging
it, but offhand I'd guess that the system OOMed and didn't come back.
On very brief inspection, Rawhide seems to have a lot of kdbus
connections with 16MiB of mapped tmpfs stuff each. (53 of them
mapped, and I don't know how many exist with tmpfs backing but aren't
mapped. Presumably the number only goes up as the degree of reliance
on the userspace proxy goes down. As it stands, that's over 3GB of
uncommitted backing store that my test is likely to forcibly commit
very quickly.)
Frankly, I don't understand how it's possible to cleanly implement
kdbus' broadcast or lifetime semantics* in an environment with bounded
CPU or bounded memory. (And unbounded memory just changes the
problem, since the message backlog can just get worse and worse.)
I work in an industry in which lots of parties broadcast lots of data
to lots of people. If you try to drink from the firehose and you
can't swallow fast enough, either you need to throw something out (and
test your recovery code!) or you fail. At least in finance, no one
pretends that a global order of events in different cities is
practical.
* Detecting when your peer goes away is, of course, a widely
encountered and widely solved problem. I don't know of any deployed
systems that solve it by broadcasting the lifetime of everything to
everyone and relying on those broadcasts going through, though.
--Andy
Hi
On Tue, Aug 4, 2015 at 1:02 AM, Andy Lutomirski <[email protected]> wrote:
> I got Fedora
> Rawhide working under kdbus (thanks, everyone!), and I ran this little
> program:
>
> #include <systemd/sd-bus.h>
> #include <err.h>
>
> int main(int argc, char *argv[])
> {
>     while (1) {
>         sd_bus *bus;
>         if (sd_bus_open_system(&bus) < 0) {
>             /* warn("sd_bus_open_system"); */
>             continue;
>         }
>         sd_bus_close(bus);
You lack a call to sd_bus_unref() here. Without it, your loop contains:
while (1)
malloc(1024);
This simple malloc-loop already hogs your system. If I add the
required call to _unref(), your tool runs smoothly on my machine.
> }
> }
>
> under both userspace dbus and under kdbus. Userspace dbus burns some
> CPU -- no big deal. I expected kdbus to fail to scale and burn a
> disproportionate amount of CPU (because I don't see how it /can/
> scale). Instead it fell over completely. I didn't bother debugging
> it, but offhand I'd guess that the system OOMed and didn't come back.
I cannot see the relation to kdbus.
> On very brief inspection, Rawhide seems to have a lot of kdbus
> connections with 16MiB of mapped tmpfs stuff each. (53 of them
> mapped, and I don't know how many exist with tmpfs backing but aren't
> mapped. Presumably the number only goes up as the degree of reliance
> on the userspace proxy goes down.
What does this have to do with the proxy? Why would resource
consumption go *up* as the proxy users decline? Please elaborate.
> I don't know of any deployed
> systems that solve it by broadcasting the lifetime of everything to
> everyone and relying on those broadcasts going through, though.
Luckily, kdbus does not do this.
Thanks
David
On Tue, Aug 4, 2015 at 1:58 AM, David Herrmann <[email protected]> wrote:
>
> You lack a call to sd_bus_unref() here.
I assume it was intentional. Why would Andy talk about "scaling" otherwise?
And the worry was why the kdbus version killed the machine, but the
userspace version did not. That's a rather big difference, and not a
good one.
Possibly the kdbus version ends up not just allocating user space
memory (which we should handle gracefully), but kernel allocations too
(which absolutely have to be explicitly resource-managed).
Linus
Hi
On Tue, Aug 4, 2015 at 3:46 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Aug 4, 2015 at 1:58 AM, David Herrmann <[email protected]> wrote:
>>
>> You lack a call to sd_bus_unref() here.
>
> I assume it was intentional. Why would Andy talk about "scaling" otherwise?
>
> And the worry was why the kdbus version killed the machine, but the
> userspace version did not. That's a rather big difference, and not a
> good one.
Neither test 'kills' the machine:
* The userspace version will be killed by the OOM killer after about
20s of running (depending on how much memory you have).
* The kernel version runs for 1024 iterations (maximum kdbus
connections per user) and then produces errors.
In fact, the kernel version is even more stable than the user-space
version, and bails out much earlier. Run it on a VT and everything
works just fine.
The only issue you get with kdbus is the compat-bus-daemon, which
assert()s as a side-effect of accept4() failing. In other words, the
compat bus-daemon gets ENFILE if you open that many connections, then
assert()s and thus kills all other proxy connections. This has the
side effect, that Xorg loses access to your graphics device and thus
your screen 'freezes'. Also networkmanager bails out and stops network
connections.
This is a bug in the proxy (which is already fixed).
Thanks
David
On Tue, Aug 4, 2015 at 7:09 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Tue, Aug 4, 2015 at 3:46 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Tue, Aug 4, 2015 at 1:58 AM, David Herrmann <[email protected]> wrote:
>>>
>>> You lack a call to sd_bus_unref() here.
>>
>> I assume it was intentional. Why would Andy talk about "scaling" otherwise?
It was actually an error. I assumed that, since the user version
worked fine (at least for as long as I ran it) and the kernel version
didn't (killed X and left a blinking cursor, no visible log messages
even when run from a text console, and no obvious OOM recovery after a
long wait), it was a kdbus issue or an issue with other kdbus
clients.
I'll play with it more today.
>>
>> And the worry was why the kdbus version killed the machine, but the
>> userspace version did not. That's a rather big difference, and not a
>> good one.
>
> Neither test 'kills' the machine:
>
> * The userspace version will be killed by the OOM killer after about
> 20s of running (depending on how much memory you have).
Not on my system. Maybe too much memory?
>
> * The kernel version runs for 1024 iterations (maximum kdbus
> connections per user) and then produces errors.
>
> In fact, the kernel version is even more stable than the user-space
> version, and bails out much earlier. Run it on a VT and everything
> works just fine.
On my system, everything died as described above.
>
> The only issue you get with kdbus is the compat-bus-daemon, which
> assert()s as a side-effect of accept4() failing. In other words, the
> compat bus-daemon gets ENFILE if you open that many connections, then
> assert()s and thus kills all other proxy connections. This has the
> side effect that Xorg loses access to your graphics device and thus
> your screen 'freezes'. NetworkManager also bails out and stops network
> connections.
Ah, interesting.
>
> This is a bug in the proxy (which is already fixed).
Should I expect to see it in Rawhide soon?
Anyway, the broadcasts that I intended to exercise were
KDBUS_ITEM_ID_REMOVE. Those appear to be broadcast to everyone,
irrespective of "policy", so long as the "match" thingy allows it. As
far as I can tell, that's the default behavior (i.e. receivers accept
KDBUS_DST_ID_BROADCAST), but even if it's not default, we'll still
fail to scale as long as the number of receivers accepting
KDBUS_DST_ID_BROADCAST grows as systems become more kdbus-integrated.
The bloom filter thing won't help at all according to the docs: bloom
filters don't apply to kernel-generated notifications.
So yes, as far as I can tell, kdbus really does track object lifetime
by broadcasting every single destruction event to every single
receiver (subject to caveats above) and pokes the data into every
receiver's tmpfs space (via kdbus_bus_broadcast ->
kdbus_conn_entry_insert -> lots of other stuff -> vfs_iter_write). At
that point, there's well over a gigabyte of tmpfs space that can be
scribbled on (and thus committed and thus needs to be read) by rogue
broadcasters even on Rawhide, and Rawhide seems to have barely started
converting all the kdbus clients from using the proxy to using kdbus
directly.
IIUC, once gdbus switches over to using kdbus directly, with current
buffer sizing, the average laptop will have more kdbus pool tmpfs
space mapped than total RAM. I still don't see how this will work
well.
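(Back-of-the-envelope, with the connection count being a guess rather than a
measurement: at the 16 MiB-per-connection sizing above, 200-300 direct kdbus
connections -- not implausible once gdbus and friends stop going through the
proxy -- would map roughly 3-5 GiB of pool space, which is in the ballpark
of a typical laptop's total RAM.)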
I guess my test didn't exercise what I meant it to. I wrote it,
userspace survived (on my system) and kdbus didn't. Apparently I blew
up the bus proxy, not the pool mechanism. Next time I'll try to
better characterize exactly what it is I'm doing to my poor VM...
--Andy
On Tue, Aug 4, 2015 at 7:47 AM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Aug 4, 2015 at 7:09 AM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> On Tue, Aug 4, 2015 at 3:46 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Tue, Aug 4, 2015 at 1:58 AM, David Herrmann <[email protected]> wrote:
>>>>
>>>> You lack a call to sd_bus_unref() here.
>>>
>>> I assume it was intentional. Why would Andy talk about "scaling" otherwise?
>
> It was actually an error. I assumed that, since the user version
> worked fine (at least for as long as I ran it) and the kernel version
> didn't (killed X and left a blinking cursor, no visible log messages
> even when run from a text console, and no obvious OOM recovery after a
> long wait), it was a kdbus issue or an issue with other kdbus
> clients.
>
> I'll play with it more today.
>
I added the missing sd_bus_unref call.
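(For reference, the corrected inner loop is just:)

    while (1) {
        sd_bus *bus;
        if (sd_bus_open_system(&bus) < 0)
            continue;
        sd_bus_close(bus);
        sd_bus_unref(bus);  /* the call that was missing */
    }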
With userspace dbus, my program takes 95% CPU and dbus-daemon takes
88% CPU or so.
With kdbus, I see abuse-bus (my test), systemd-journald,
systemd-bus-proxy, auditd, gnome-shell, mission-control, sedispatch,
firewalld, polkitd, NetworkManager, systemd, avahi-daemon, audisp,
abrt-dump-jour* (whatever it's called -- it truncated), upowerd, and
systemd-logind all taking tons of CPU. I've listed them in decreasing
order of amount of CPU burned -- the top several are taking about as
much as is possible. Load average is over 13. That's if I run it
from a text console while I'm logged in to gnome in a different VT.
If I run the program from a graphical terminal, everything freezes so
hard that the cursor doesn't even make it to the next line when I hit
enter.
So I still claim that kdbus doesn't scale. I'm not even just saying
that it doesn't scale to large systems -- somewhat to my surprise, it
doesn't even seem to scale well enough for a mostly empty Rawhide
workstation system running just a graphical terminal. And I didn't
even try to find stress tests more interesting than connecting and
disconnecting in a loop.
FWIW, the old test (without the unref) appeared to be allocating 16M
of mapped kdbus pool every iteration, which seems unlikely to have
helped matters.
--Andy
Hi
On Tue, Aug 4, 2015 at 4:47 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Aug 4, 2015 at 7:09 AM, David Herrmann <[email protected]> wrote:
>> This is a bug in the proxy (which is already fixed).
>
> Should I expect to see it in Rawhide soon?
Use this workaround until it does:
$ DBUS_SYSTEM_BUS_ADDRESS="kernel:path=/sys/fs/kdbus/0-system/bus"
./your-binary
> Anyway, the broadcasts that I intended to exercise were
> KDBUS_ITEM_ID_REMOVE. Those appear to be broadcast to everyone,
> irrespective of "policy", so long as the "match" thingy allows it.
Matches are opt-in, not opt-out. Nobody will get this message unless
they opt in.
> The bloom filter thing won't help at all according to the docs: bloom
> filters don't apply to kernel-generated notifications.
Bloom filters apply to message payloads. Kernel notifications do not
carry a message payload. Message metadata can be filtered for
explicitly (without false-positives).
> So yes, as far as I can tell, kdbus really does track object lifetime
> by broadcasting every single destruction event to every single
> receiver (subject to caveats above) and pokes the data into every
> receiver's tmpfs space.
Broadcast reception is opt-in.
Thanks
David
On Wed, Aug 5, 2015 at 12:10 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Tue, Aug 4, 2015 at 4:47 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Aug 4, 2015 at 7:09 AM, David Herrmann <[email protected]> wrote:
>>> This is a bug in the proxy (which is already fixed).
>>
>> Should I expect to see it in Rawhide soon?
>
> Use this workaround until it does:
>
> $ DBUS_SYSTEM_BUS_ADDRESS="kernel:path=/sys/fs/kdbus/0-system/bus"
> ./your-binary
>
Which binary is supposed to be run like that?
>> Anyway, the broadcasts that I intended to exercise were
>> KDBUS_ITEM_ID_REMOVE. Those appear to be broadcast to everyone,
>> irrespective of "policy", so long as the "match" thingy allows it.
>
> Matches are opt-in, not opt-out. Nobody will get this message unless
> they opt in.
>
And what opts in? Either something's broken, or there's a different
scalability problem, or a whole pile of kdbus-using programs in Fedora
Rawhide do, in fact, opt in.
My interest in instrumenting kdbus and systemd to figure out the exact
mechanism by which my tiny test case causes my system to freeze is
near zero. I bet I'm actually right about the mechanism, but that's
sort of beside the point. It freezes, so /something's/ wrong. The
only real relevance of my suspicion about the failure mode is that I
think it's a design issue that isn't going to be easy to fix.
>
>> So yes, as far as I can tell, kdbus really does track object lifetime
>> by broadcasting every single destruction event to every single
>> receiver (subject to caveats above) and pokes the data into every
>> receiver's tmpfs space.
>
> Broadcast reception is opt-in.
I've pointed out several times that there's a feature in kdbus that
doesn't work well and I get told that the problematic feature is
opt-in. Given that all existing prototype userspace that I'm aware of
(systemd and its consumers) apparently opts in, I don't really care
that the feature is opt-in.
Also, given things like this:
commit d27c8057699d164648b7d8c1559fa6529998f89d
Author: David Herrmann <[email protected]>
Date: Tue May 26 09:30:14 2015 +0200
kdbus: forward ID notifications to everyone
it really does seem to me that the point of these ID notifications is
for everyone to get them.
Also, you haven't addressed the memory usage issues -- I don't see how
a full kdbus-using desktop system can be expected to fit into RAM on
anything short of the biggest and beefiest laptops. I also don't see
how an xdg-app-happy kdbus-using system (with
correspondingly many pools) will fit into RAM on even the biggest
laptops.
--Andy
Hi Andy,
On 08/05/2015 02:18 AM, Andy Lutomirski wrote:
> I added the missing sd_bus_unref call.
>
> With userspace dbus, my program takes 95% CPU and dbus-daemon takes
> 88% CPU or so.
>
> With kdbus, I see abuse-bus (my test), systemd-journald,
> systemd-bus-proxy, auditd, gnome-shell, mission-control, sedispatch,
> firewalld, polkitd, NetworkManager, systemd, avahi-daemon, audisp,
> abrt-dump-jour* (whatever it's called -- it truncated), upowerd, and
> systemd-logind all taking tons of CPU. I've listed them in decreasing
> order of amount of CPU burned -- the top several are taking about as
> much as is possible. Load average is over 13. That's if I run it
> from a text console while I'm logged in to gnome in a different VT.
That's right, I can reproduce this here. To explain what's going on, let
me provide some background.
Every time a client connects to kdbus, a new ID is assigned to the
connection, and other connections which have previously subscribed to
notifications of type KDBUS_ITEM_ID_ADD or _REMOVE get a notification
and are woken up so they can dispatch it. By default, no such matches
exist; applications have to explicitly opt in if they are interested in
these events.
In DBus (both kdbus and DBus1), such matches are installed on the
NameOwnerChanged signal, and they can be either specific to a single ID,
or broad, which will make them match on any ID. There's actually no
reason for applications to install unspecific matches, but if they do,
they will of course get what they asked for, and are woken up on every
ID that is added to or removed from the bus. What you're seeing in your
system profile is that some applications misbehave and install
unspecific matches when they shouldn't. That's a userspace bug that
needs fixing. Two candidates were actually in the systemd code base
(logind and PID1), and both are now patched.
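(For illustration, at the sd-bus level the two kinds of match look roughly
like this; the service name is only a placeholder:)

#include <systemd/sd-bus.h>

static int on_owner_changed(sd_bus_message *m, void *userdata,
                            sd_bus_error *ret_error)
{
    return 0;   /* handle the signal */
}

/* Broad match: woken up for *every* name/ID change on the bus. */
static int add_broad_match(sd_bus *bus)
{
    return sd_bus_add_match(bus, NULL,
        "type='signal',sender='org.freedesktop.DBus',"
        "interface='org.freedesktop.DBus',member='NameOwnerChanged'",
        on_owner_changed, NULL);
}

/* Specific match: only woken up when the one name we care about changes. */
static int add_specific_match(sd_bus *bus)
{
    return sd_bus_add_match(bus, NULL,
        "type='signal',sender='org.freedesktop.DBus',"
        "interface='org.freedesktop.DBus',member='NameOwnerChanged',"
        "arg0='org.example.SomeService'",
        on_owner_changed, NULL);
}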
Note that these applications are actually affected on both DBus1 and
kdbus. The reason you didn't see them trip up in your test is that
sd_bus_open() behaves differently in the two worlds. In kdbus, it will
immediately call into the kernel and register a new connection, hence
triggering the behavior described above. On DBus1, however, the HELLO
message will not be transmitted to the daemon until the first message is
sent, so no ID is assigned, and no notifications are sent. When
augmenting the test program a little so it reads its own ID on the bus,
you'll see similar behavior on DBus1 as well, but the bottleneck in this
case is the daemon, which significantly mitigates the load caused by
other tasks.
So, to wrap it up: you've triggered an existing userspace bug. The
userspace components under our control have now been fixed, and we'll
talk to other people to make them aware of the issue too. However, these
issues are not directly related to kdbus, but rather show more impact as
a side-effect now.
You've raised a valid point here. Thanks a lot for providing this test,
much appreciated!
Daniel
Hi
On Wed, Aug 5, 2015 at 10:11 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Aug 5, 2015 at 12:10 AM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> On Tue, Aug 4, 2015 at 4:47 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Tue, Aug 4, 2015 at 7:09 AM, David Herrmann <[email protected]> wrote:
>>>> This is a bug in the proxy (which is already fixed).
>>>
>>> Should I expect to see it in Rawhide soon?
>>
>> Use this workaround until it does:
>>
>> $ DBUS_SYSTEM_BUS_ADDRESS="kernel:path=/sys/fs/kdbus/0-system/bus"
>> ./your-binary
>>
>
> Which binary is supposed to be run like that?
Your test.
>>> Anyway, the broadcasts that I intended to exercise were
>>> KDBUS_ITEM_ID_REMOVE. Those appear to be broadcast to everyone,
>>> irrespective of "policy", so long as the "match" thingy allows it.
>>
>> Matches are opt-in, not opt-out. Nobody will get this message unless
>> they opt in.
>>
>
> And what opts in? Either something's broken, or there's a different
> scalability problem, or a whole pile of kdbus-using programs in Fedora
> Rawhide do, in fact, opt in.
See Daniel's explanation. If applications subscribe to all
notifications, they get what they asked for. I recommend filing bug
reports for the applications in question.
> Given that all existing prototype userspace that I'm aware of
> (systemd and its consumers) apparently opts in, I don't really care
> that the feature is opt-in.
This is just plain wrong. Out of the dozens of dbus applications, you
found like 9 which are buggy? Two of them are already fixed, the
maintainers of the other ones notified.
I'd be interested where you got this notion that "all existing
prototype userspace [...] opts in".
> Also, given things like this:
>
> commit d27c8057699d164648b7d8c1559fa6529998f89d
> Author: David Herrmann <[email protected]>
> Date: Tue May 26 09:30:14 2015 +0200
>
> kdbus: forward ID notifications to everyone
>
> it really does seem to me that the point of these ID notifications is
> for everyone to get them.
It's not. This patch just opens the policy so everyone can see those
notifications. By default, they're not delivered to anyone.
> Also, you haven't addressed the memory usage issues --
..because it doesn't change anything. If your IPC is message based and
async, _someone_ needs to buffer. I don't see the difference between
buffering locally on !EPOLLOUT or buffering in a shmem pool. In both
cases, clients have control over the buffer size. If you disagree,
please _elaborate_.
Thanks
David
On Thursday, 6 August 2015, 10:04:57, David Herrmann wrote:
> > Given that all existing prototype userspace that I'm aware of
> >
> > (systemd and its consumers) apparently opts in, I don't really care
> > that the feature is opt-in.
>
> This is just plain wrong. Out of the dozens of dbus applications, you
> found like 9 which are buggy? Two of them are already fixed, the
> maintainers of the other ones notified.
> I'd be interested where you got this notion that "all existing
> prototype userspace [...] opts in".
But these few can create the issues Andy described?
Sure, one can argue that I can set up a stress or stress-ng command line invocation
as the root user that will basically grind a Linux system to a halt – and in a way
I consider this to be a bug in the kernel as well, but one that has existed for a
long time. But a GUI application running as a user?
How about some robustness regarding what you see as bugs in userspace here?
I think "The bug is not mine" is exactly the same language we have seen here
before. If the kernel relies on bug-free userspace applications in order to do
its job properly I think it has robustness issues. One certainly wouldn´t want
this with any mission critical realtime OS. I think it is the kernel that
should be in control.
Thanks,
--
Martin
On Aug 6, 2015 1:04 AM, "David Herrmann" <[email protected]> wrote:
> > Given that all existing prototype userspace that I'm aware of
> > (systemd and its consumers) apparently opts in, I don't really care
> > that the feature is opt-in.
>
> This is just plain wrong. Out of the dozens of dbus applications, you
> found like 9 which are buggy? Two of them are already fixed, the
> maintainers of the other ones notified.
> I'd be interested where you got this notion that "all existing
> prototype userspace [...] opts in".
>
I would say instead that, out of one in-use kdbus library, I found one
that was buggy. Maybe gdbus really does use kdbus already, but on
very brief inspection it looked like it didn't at least on my test VM.
>
> > Also, you haven't addressed the memory usage issues --
>
> ..because it doesn't change anything. If your IPC is message based and
> async, _someone_ needs to buffer. I don't see the difference between
> buffering locally on !EPOLLOUT or buffering in a shmem pool. In both
> cases, clients have control over the buffer size. If you disagree,
> please _elaborate_.
If the client buffers on !EPOLLOUT and has a monster buffer, then
that's the client's problem.
If every single program has a monster buffer, then it's everyone's
problem, and the size of the problem gets multiplied by the number of
programs.
Also, sensible clients that produce bulk data will throttle on
!EPOLLOUT rather than blindly buffering, but that's not an option when
the huge buffer is on the receiver's end. Read up on "bufferbloat".
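(Concretely, "throttle on !EPOLLOUT" means a sender structured roughly like
this -- plain poll(2), nothing kdbus-specific:)

#include <poll.h>
#include <unistd.h>

/* Wait until the socket is writable, then send the next chunk.  A bulk
 * producer written this way slows down to the receiver's pace instead of
 * queueing unbounded data anywhere. */
static ssize_t throttled_write(int fd, const void *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    if (poll(&pfd, 1, -1) < 0)
        return -1;
    return write(fd, buf, len);
}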
--Andy
On Thu, Aug 6, 2015 at 12:06 AM, Daniel Mack <[email protected]> wrote:
> Hi Andy,
>
> On 08/05/2015 02:18 AM, Andy Lutomirski wrote:
>> I added the missing sd_bus_unref call.
>>
>> With userspace dbus, my program takes 95% CPU and dbus-daemon takes
>> 88% CPU or so.
>>
>> With kdbus, I see abuse-bus (my test), systemd-journald,
>> systemd-bus-proxy, auditd, gnome-shell, mission-control, sedispatch,
>> firewalld, polkitd, NetworkManager, systemd, avahi-daemon, audisp,
>> abrt-dump-jour* (whatever it's called -- it truncated), upowerd, and
>> systemd-logind all taking tons of CPU. I've listed them in decreasing
>> order of amount of CPU burned -- the top several are taking about as
>> much as is possible. Load average is over 13. That's if I run it
>> from a text console while I'm logged in to gnome in a different VT.
>
> That's right, I can reproduce this here. To explain what's going on, let
> me provide some background.
>
> Every time a client connects to kdbus, a new ID is assigned to the
> connection, and other connections which have previously subscribed to
> notifications of type KDBUS_ITEM_ID_ADD or _REMOVE get a notification
> and are woken up so they can dispatch it. By default, no such matches
> exist; applications have to explicitly opt in if they are interested in
> these events.
>
> In DBus (both kdbus and DBus1), such matches are installed on the
> NameOwnerChanged signal, and they can be either specific to a single ID,
> or broad, which will make them match on any ID. There's actually no
> reason for applications to install unspecific matches, but if they do,
> they will of course get what they asked for, and are woken up on every
> ID that is added to or removed from the bus. What you're seeing in your
> system profile is that some applications misbehave and install
> unspecific matches when they shouldn't. That's a userspace bug that
> needs fixing. Two candidates were actually in the systemd code base
> (logind and PID1), and both are now patched.
Can you point me at the patch?
It sounds like that will reduce the scalability issue with this
particular test from whatever userspace overhead exists * number of
clients to just the overhead of looping over all clients and their
matches in the kernel.
--Andy
On 08/06/2015 05:27 PM, Andy Lutomirski wrote:
>> In DBus (both kdbus and DBus1), such matches are installed on the
>> > NameOwnerChanged signal, and they can be either specific to a single ID,
>> > or broad, which will make them match on any ID. There's actually no
>> > reason for applications to install unspecific matches, but if they do,
>> > they will of course get what they asked for, and are woken up on every
>> > ID that is added to or removed from the bus. What you're seeing in your
>> > system profile is that some applications misbehave and install
>> > unspecific matches when they shouldn't. That's a userspace bug that
>> > needs fixing. Two candidates were actually in the systemd code base
>> > (logind and PID1), and both are now patched.
>
> Can you point me at the patch?
https://github.com/systemd/systemd/pull/876
https://github.com/systemd/systemd/pull/887
firewalld and possibly some other applications in the Fedora default
install use python-slip, a convenience library that currently
unconditionally installs the broad matches. I filed a bug with patches here:
https://fedorahosted.org/python-slip/ticket/2
And I filed more bugs for some GNOME components.
Thanks,
Daniel
On 08/06/2015 05:21 PM, Andy Lutomirski wrote:
> Maybe gdbus really does use kdbus already, but on
> very brief inspection it looked like it didn't at least on my test VM.
No, it's not in any released version yet. The patches for that are being
worked on though and look promising.
> If the client buffers on !EPOLLOUT and has a monster buffer, then
> that's the client's problem.
>
> If every single program has a monster buffer, then it's everyone's
> problem, and the size of the problem gets multiplied by the number of
> programs.
The size of the memory pool of a bus client is chosen by the client
itself individually during the HELLO call. It's pretty much the same as
if the client allocated the buffer itself, except that the kernel does
it on their behalf.
Also note that kdbus features a peer-to-peer based quota accounting
logic, so a single bus connection can not DOS another one by filling its
buffer.
Thanks,
Daniel
On Thu, Aug 6, 2015 at 11:14 AM, Daniel Mack <[email protected]> wrote:
> On 08/06/2015 05:21 PM, Andy Lutomirski wrote:
>> Maybe gdbus really does use kdbus already, but on
>> very brief inspection it looked like it didn't at least on my test VM.
>
> No, it's not in any released version yet. The patches for that are being
> worked on though and look promising.
>
>> If the client buffers on !EPOLLOUT and has a monster buffer, then
>> that's the client's problem.
>>
>> If every single program has a monster buffer, then it's everyone's
>> problem, and the size of the problem gets multiplied by the number of
>> programs.
>
> The size of the memory pool of a bus client is chosen by the client
> itself individually during the HELLO call. It's pretty much the same as
> if the client allocated the buffer itself, except that the kernel does
> it on their behalf.
>
> Also note that kdbus features a peer-to-peer based quota accounting
> logic, so a single bus connection can not DOS another one by filling its
> buffer.
I haven't looked at the quota code at all.
Nonetheless, it looks like the slice logic (aside: it looks *way* more
complicated than necessary -- what's wrong with circular buffers)
will, under most (but not all!) workloads, concentrate access to a
smallish fraction of the pool. This is IMO bad, since it means that
most of the time most of the pool will remain uncommitted. If, at
some point, something causes the access pattern to change and hit all
the pages (even just once), suddenly all of the pools get committed,
and your memory usage blows up.
Again, please stop blaming the clients. In practice, kdbus is a
system involving the kernel, systemd, sd-bus, and other stuff, mostly
written by the same people. If kdbus gets merged and it survives but
half the clients blow up and peoples' systems fall over, that's not
okay.
--Andy
On 08/06/2015 08:43 PM, Andy Lutomirski wrote:
> Nonetheless, it looks like the slice logic (aside: it looks *way* more
> complicated than necessary -- what's wrong with circular buffers)
> will, under most (but not all!) workloads, concentrate access to a
> smallish fraction of the pool. This is IMO bad, since it means that
> most of the time most of the pool will remain uncommitted. If, at
> some point, something causes the access pattern to change and hit all
> the pages (even just once), suddenly all of the pools get committed,
> and your memory usage blows up.
That's a general problem with memory overcommitment, and not specific to
kdbus. IOW: You'd have the same problem with a similar logic implemented
in userspace, right?
Daniel
On Fri, Aug 7, 2015 at 7:40 AM, Daniel Mack <[email protected]> wrote:
> On 08/06/2015 08:43 PM, Andy Lutomirski wrote:
>> Nonetheless, it looks like the slice logic (aside: it looks *way* more
>> complicated than necessary -- what's wrong with circular buffers)
>> will, under most (but not all!) workloads, concentrate access to a
>> smallish fraction of the pool. This is IMO bad, since it means that
>> most of the time most of the pool will remain uncommitted. If, at
>> some point, something causes the access pattern to change and hit all
>> the pages (even just once), suddenly all of the pools get committed,
>> and your memory usage blows up.
>
> That's a general problem with memory overcommitment, and not specific to
> kdbus. IOW: You'd have the same problem with a similar logic implemented
> in userspace, right?
>
Sure, except that, if it's in userspace and it starts causing
problems, then userspace can fix it without running into kernel ABI
stability issues.
--Andy
2015-08-07 2:43 GMT+08:00 Andy Lutomirski <[email protected]>:
> On Thu, Aug 6, 2015 at 11:14 AM, Daniel Mack <[email protected]> wrote:
>> On 08/06/2015 05:21 PM, Andy Lutomirski wrote:
>>> Maybe gdbus really does use kdbus already, but on
>>> very brief inspection it looked like it didn't at least on my test VM.
>>
>> No, it's not in any released version yet. The patches for that are being
>> worked on though and look promising.
>>
>>> If the client buffers on !EPOLLOUT and has a monster buffer, then
>>> that's the client's problem.
>>>
>>> If every single program has a monster buffer, then it's everyone's
>>> problem, and the size of the problem gets multiplied by the number of
>>> programs.
>>
>> The size of the memory pool of a bus client is chosen by the client
>> itself individually during the HELLO call. It's pretty much the same as
>> if the client allocated the buffer itself, except that the kernel does
>> it on their behalf.
>>
>> Also note that kdbus features a peer-to-peer based quota accounting
>> logic, so a single bus connection can not DOS another one by filling its
>> buffer.
>
> I haven't looked at the quota code at all.
>
> Nonetheless, it looks like the slice logic (aside: it looks *way* more
> complicated than necessary -- what's wrong with circular buffers)
> will, under most (but not all!) workloads, concentrate access to a
> smallish fraction of the pool. This is IMO bad, since it means that
> most of the time most of the pool will remain uncommitted. If, at
> some point, something causes the access pattern to change and hit all
> the pages (even just once), suddenly all of the pools get committed,
> and your memory usage blows up.
>
> Again, please stop blaming the clients. In practice, kdbus is a
> system involving the kernel, systemd, sd-bus, and other stuff, mostly
> written by the same people. If kdbus gets merged and it survives but
> half the clients blow up and peoples' systems fall over, that's not
> okay.
Any comments on the questions Andy raised?
In kdbus, the sender writes into a page of the receiver's tmpfs space -- does
that either let the receiver escape its memcg limit (if the write is charged
to the sender), or subject the sender to the receiver's limit?
Also, I'm curious about similar problems in these cases:
1. A UNIX domain server (SOCK_STREAM or SOCK_DGRAM) replies to its
clients, but some clients consume the messages __too slowly__ -- will the
server block, or can it serve other clients instead of blocking?
2. Processes open NETLINK_KOBJECT_UEVENT netlink sockets, but some of them
consume uevents __too slowly__ while uevents are continually triggered --
will the system block, or do those processes eventually lose some uevents?
3. Processes watch a directory via inotify, but some consume the events
__too slowly__ while file operations are continually performed on the
directory -- will the system block, or do those processes eventually lose
some events?
--
Regards,
- cee1
On Fri, Aug 07, 2015 at 06:26:31PM +0300, Linus Torvalds wrote:
> User space memory allocation is not AT ALL the same thing as kdbus.
> Kernel allocations are very very different from user allocations. We
> have reasonable, fairly tested, and generic models for handling user
> space memory allocation issues - limiting, debugging, failing, and
> handling catastrophes (ie oom). And no, even that doesn't always work
> perfectly, but at least there is a *lot* of support for it, and this
> is not some special case.
The memory in this case is a shmem file that is created by the kernel,
but on behalf of the bus client task, which will eventually own it. As
discussed with the mm developers, the same logic for accounting, OOM
handling, etc. applies to the kdbus shmem buffers, as they are written
to from the context of another task. If this is mistaken, then yes, you
are right, and the code will have to be changed.
> This discussion has been full of kdbus people ignoring Andy saying "it
> worked with the user space version, it killed the machine with kdbus".
> And now people trying to claim the issues are the same. HELL NO.
Andy found some great bugs with regards to flooding the bus with
requests, which has not been ignored at all. The same issue is present
in dbus today, but the kdbus code runs faster and more messages get
sent than with the current userspace dbus daemon, so the machine
becomes unresponsive more easily.
The issue is with userspace clients opting in to receive all
NameOwnerChanged messages on the bus, which is not a good idea as they
constantly get woken up and process them, which is why the CPU was
pegged.? This issue should now be fixed in Rawhide for some of the
packages we found that were doing this. Maintainers of other packages
have been informed.? End result, no one has ever really tested sending
"bad" messages to the current system as all existing dbus users try to
be "good actors", thanks to Andy's testing, these apps should all now
become much more robust.
In chatting with Daniel on IRC, he is writing up a summary of how the
kdbus memory pools work in more detail, and he said he would sent that
out in a day or so, so that everyone can review.
thanks,
greg k-h
On 08/09/2015 09:00 PM, Greg Kroah-Hartman wrote:
> In chatting with Daniel on IRC, he is writing up a summary of how the
> kdbus memory pools work in more detail, and he said he would send that
> out in a day or so, so that everyone can review.
Yes, let me quickly describe again how the kdbus pool logic works.
Every bus connection (peer) owns a buffer which is used in order to
receive payloads. Such payloads are either messages sent from other
connections, notifications, or answer structures returned by query
commands (name lists, etc.).
In order to avoid the kernel having to maintain an internal buffer
that connections then read from with an extra command, we decided to let
the connections own their buffer directly, so they can mmap() the memory
into their task. Allocating a local buffer to collect asynchronous
messages is what they would need to do anyway, so we implemented a
short-cut that allows the kernel to directly access the memory and write
to it. The size of this buffer pool is configured by each connection
individually, during the HELLO call, so the kernel interface is as
flexible as any other memory allocation scheme the kernel provides and
is subject to the same limits.
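Very roughly, the client side of this looks like the sketch below.
Constant and field names are written from memory of the kdbus uapi
header, so treat the details as illustrative; error handling is
omitted.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#include "kdbus.h"              /* uapi header shipped with the kdbus tree */

#define POOL_SIZE       (16UL * 1024 * 1024)

/* Connect to a bus node, announce our pool size in HELLO, then map
 * the pool read-only -- the kernel is the only writer. */
static void *connect_and_map_pool(const char *bus_node, int *fd_out)
{
        struct kdbus_cmd_hello hello = {
                .size           = sizeof(hello),
                .pool_size      = POOL_SIZE,    /* chosen per connection */
        };
        int fd = open(bus_node, O_RDWR | O_CLOEXEC);

        ioctl(fd, KDBUS_CMD_HELLO, &hello);

        *fd_out = fd;
        return mmap(NULL, POOL_SIZE, PROT_READ, MAP_SHARED, fd, 0);
}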
Internally, the connection pool is simply a shmem-backed file. From the
context of the HELLO ioctl, we are calling into shmem_file_setup(), so
the file is eventually owned by the task which created the connection
to the bus. One reason why we do the shmem file allocation in the
kernel and on behalf of the userspace task is that we clear the
VM_MAYWRITE bit to prevent the task from writing to the pool through its
mapped buffer. We also do not set VM_NORESERVE, so the entire buffer is
pre-accounted for the task that created the connection.
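In kernel terms, the shape of it is something like this simplified
sketch (not the actual kdbus code; the struct and function names are
made up, only shmem_file_setup() and the vm flags are real):

#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>

struct pool {
        struct file *f;         /* shmem backing file */
        size_t size;
};

static int pool_new(struct pool *p, size_t size)
{
        /* Created on behalf of the connecting task; no VM_NORESERVE,
         * so the whole buffer is pre-accounted for that task. */
        p->f = shmem_file_setup("kdbus-pool", size, 0);
        if (IS_ERR(p->f))
                return PTR_ERR(p->f);
        p->size = size;
        return 0;
}

static int pool_mmap(struct pool *p, struct vm_area_struct *vma)
{
        /* The owning task may only map the pool read-only; clearing
         * VM_MAYWRITE also rules out a later mprotect(PROT_WRITE). */
        if (vma->vm_flags & VM_WRITE)
                return -EPERM;
        vma->vm_flags &= ~VM_MAYWRITE;

        if (vma->vm_end - vma->vm_start > p->size)
                return -EFAULT;

        /* Delegate the actual mapping to the underlying shmem file. */
        if (vma->vm_file)
                fput(vma->vm_file);
        vma->vm_file = get_file(p->f);

        return p->f->f_op->mmap(p->f, vma);
}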
The pool implementation uses an r/b tree to organize the buffer into
slices. Those slices can be kept by userspace as long as the parsing
implementation needs to have access to them. When finished, the slices
are freed. A simple ring buffer cannot cope with the gaps that this
creates.
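To illustrate (the structure name and fields below are made up, not the
actual kdbus code): each slice is tracked by its offset and size, and
because userspace releases slices in arbitrary order, the free space
fragments into gaps that a tree can track and merge, while a ring
buffer can only reclaim space at its tail.

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative only: one allocated slice of the pool. */
struct pool_slice {
        struct rb_node rb;      /* in a tree keyed by @off */
        size_t off;             /* offset into the pool shmem file */
        size_t size;            /* length of the payload stored there */
        bool free;              /* released by userspace; may be merged */
};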
When a connection buffer is written to, it is done from the context of
another task which calls into the kdbus code through one of the ioctls.
The memcg implementation should hence charge the task that acts as
writer, which is maybe not ideal but can be changed easily with some
addition to the internal APIs. We omitted it for the current version,
which is non-intrusive with regards to other kernel subsystems.
The kdbus implementation is actually comparable to two tasks X and Y
which both have their own buffer file open and mmap()ed, and they both
pass their FD to the other side. If X now writes to Y's file, and that
is causing a page fault, X is accounted for it, correct?
The kernel does *not* do any memory allocation to buffer payload, and
all other allocations (for instance, to keep around the internal state
of a connection, names etc) are subject to conservatively chosen
limitations. There is no unbounded memory allocation in kdbus that I am
aware of. If there was, it would clearly be a bug.
Addressing the point Andy made earlier: yes, due to memory
overcommitment, OOM situations may happen with certain patterns, but the
kernel should have the same measures to deal with them that it already
has with other types of shared userspace memory. Right?
Hope that all makes sense, we're open to discussions around the desired
accounting details. I've copied linux-mm to let more people have a look
into this again.
Thanks,
Daniel
On Sun, Aug 9, 2015 at 3:11 PM, Daniel Mack <[email protected]> wrote:
>
> Internally, the connection pool is simply a shmem-backed file. From the
> context of the HELLO ioctl, we are calling into shmem_file_setup(), so
> the file is eventually owned by the task which created the connection
> to the bus. One reason why we do the shmem file allocation in the
> kernel and on behalf of the userspace task is that we clear the
> VM_MAYWRITE bit to prevent the task from writing to the pool through its
> mapped buffer. We also do not set VM_NORESERVE, so the entire buffer is
> pre-accounted for the task that created the connection.
I don't have access to the system I've been using for testing right
now, but I wonder how the kdbus pool stacks up against the rest of the
memory allocations of an average desktop process.
>
> The pool implementation uses an r/b tree to organize the buffer into
> slices. Those slices can be kept by userspace as long as the parsing
> implementation needs to have access to them. When finished, the slices
> are freed. A simple ring buffer cannot cope with the gaps that this
> creates.
>
> When a connection buffer is written to, it is done from the context of
> another task which calls into the kdbus code through one of the ioctls.
> The memcg implementation should hence charge the task that acts as
> writer, which is maybe not ideal but can be changed easily with some
> addition to the internal APIs. We omitted it for the current version,
> which is non-intrusive with regards to other kernel subsystems.
>
This has at least the following weakness. I can very easily get
systemd to write to my shmem-backed pool: simply subscribe to one of
its broadcasts. If I cause such a write to be very slow
(intentionally or otherwise), then PID 1 blocks.
If you change the memcg code to charge me instead of PID 1 (as it
should IMO), then the problem gets worse.
> The kdbus implementation is actually comparable to two tasks X and Y
> which both have their own buffer file open and mmap()ed, and they both
> pass their FD to the other side. If X now writes to Y's file, and that
> is causing a page fault, X is accounted for it, correct?
If PID 1 accepted a memfd from me (even a properly sealed one) and
wrote to it, I would wonder whether it were actually a good idea.
Does this scheme have any actual measurable advantage over the
traditional model of a small non-paged buffer in the kernel (i.e. the
way sockets work) with explicit userspace memfd use as appropriate?
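To spell out what I mean by explicit memfd use, roughly (assuming the
memfd_create() wrapper is available; error paths trimmed): the sender
allocates and fills its own buffer, seals it, and only then hands the
fd over, so the privileged receiver never writes into sender-owned
memory at all.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Build a sealed, immutable payload; the returned fd can be passed
 * to the receiver over an AF_UNIX socket via SCM_RIGHTS. */
static int sealed_payload(const void *buf, size_t len)
{
        int fd = memfd_create("payload", MFD_ALLOW_SEALING);

        if (fd < 0 || write(fd, buf, len) != (ssize_t)len)
                return -1;

        /* After this, neither side can resize or modify the contents. */
        if (fcntl(fd, F_ADD_SEALS,
                  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) < 0)
                return -1;

        return fd;
}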
--Andy
On Sun, 9 Aug 2015, Greg Kroah-Hartman wrote:
> The issue is with userspace clients opting in to receive all
> NameOwnerChanged messages on the bus, which is not a good idea, as they
> constantly get woken up to process them; that is why the CPU was
> pegged. This issue should now be fixed in Rawhide for some of the
> packages we found doing this, and maintainers of other packages
> have been informed. End result: no one had ever really tested sending
> "bad" messages to the current system, as all existing dbus users try to
> be "good actors". Thanks to Andy's testing, these apps should all now
> become much more robust.
Does it require elevated privileges to opt in to receiving all
NameOwnerChanged messages on the bus? Is it the default unless apps opt
for something more restrictive? Or is it somewhere in between?
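For context, my understanding is that today all it takes is an ordinary
AddMatch from an unprivileged client, something like the libdbus sketch
below (GDBus and sd-bus have equivalents), which is why I'm asking:

#include <dbus/dbus.h>

/* Opt in to every NameOwnerChanged signal on the bus. As far as I can
 * tell no special privileges are involved, and it is not the default;
 * each client has to ask for it. */
static void subscribe_all_name_changes(DBusConnection *conn)
{
        DBusError err;

        dbus_error_init(&err);
        dbus_bus_add_match(conn,
                           "type='signal',"
                           "interface='org.freedesktop.DBus',"
                           "member='NameOwnerChanged'",
                           &err);
        dbus_connection_flush(conn);
}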
I was under the impression that the days of writing system-level stuff that
assumes all userspace apps are going to 'play nice' went out a decade or
more ago. It's fine if a userspace app can kill itself, or possibly even the
user it's running as, but being able to kill apps running as other users, let
alone the whole system, is a problem nowadays.
Such abuse may be possible on a default system, but this is why cgroups and
namespaces were created: to give the system admin the ability to limit the
resources that any one app can consume. Introducing a new mechanism that allows
one user to consume resources allocated to another and kill the system, without
providing a kernel-level mechanism to limit the damage (as opposed to fixing
individual apps), seems rather short-sighted at best.
David Lang
On Sun, Aug 9, 2015 at 3:11 PM, Daniel Mack <[email protected]> wrote:
>
> The kdbus implementation is actually comparable to two tasks X and Y
> which both have their own buffer file open and mmap()ed, and they both
> pass their FD to the other side. If X now writes to Y's file, and that
> is causing a page fault, X is accounted for it, correct?
No.
With shared memory, there's no particularly obvious accounting rules.
In particular, when somebody maps an already allocated page, it's
basically a no-op from a memory allocation standpoint.
The whole "this is equivalent to the user space deamon" argument is
bogus. Shared memory is very very different from just sending messages
(copying the buffers) and is generally much harder to get a handle on.
And thats' what you should be comparing to.
The old "communicate over a unix domain socket" had pretty clear
accounting rules, and while unix domain sockets have some horribly
nasty issues (most are about passing fd's around) that isn't one of
them.
Anyway, the real issue for me here is that Andy is reporting all these
actual real problems that happen in practice, and the developer
replies are dismissing them on totally irrelevant grounds ("this
should be equivalent to something entirely different that nobody ever
does" or "well, people could opt out, even if they didn't" yadda yadda
yadda).
For example, the whole "tasks X and Y communicate over shmem" is
irrelevant. Normally, when people write those kinds of applications,
they are just regular applications. If they have issues, nobody else
cares. Andy's concern is that when one of X/Y is a system daemon,
tricking it into doing bad things ends up effectively killing the
system - whether the *kernel* is alive or not and did the right thing
is almost entirely immaterial.
So please. When Andy sends a bug report with an exploit that kills his
system, just stop responding with irrelevant theoretical arguments. It
is not appropriate. Instead, acknowledge the problem and work on
fixing it, none of this "but but but it's all the same" crap.
Linus