2018-06-29 06:47:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

For devices with a class, we create a "glue" directory between
the parent device and the new device with the class name.

This directory is never "explicitely" removed when empty however,
this is left to the implicit sysfs removal done by kobjects when
they are released on the last kobject_put().

This is problematic because as long as it's not been removed from
sysfs, it is still present in the class kset and in sysfs directory
structure.

The presence in the class kset exposes a use after free bug fixed
by the previous patch, but the presence in sysfs means that until
the kobject is released, which can take a while (especially with
kobject debugging), any attempt at re-creating such as binding a
new device for that class/parent pair, will result in a sysfs
duplicate file name error.

This fixes it by instead doing an explicit kobject_del() when
the glue dir is empty, by keeping track of the number of
child devices of the gluedir.

This is made easy by the fact that all glue dir operations are
done with a global mutex, and there's already a function
(cleanup_glue_dir) called in all the right places taking that
mutex that can be enhanced for this.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---

(Adding lkml, I just realized I completely forgot to CC it in
the first place on this whole conversation, blame the 1am debugging
session)

drivers/base/core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index e9eff2099896..66e15daa9980 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1436,6 +1436,13 @@ struct kobject *virtual_device_parent(struct device *dev)
struct class_dir {
struct kobject kobj;
struct class *class;
+
+ /*
+ * This counts the number of children of the gluedir in
+ * order to "cleanly" kobject_del() it when it becomes
+ * empty. It is protected by the gdb_mutex
+ */
+ unsigned int child_count;
};

#define to_class_dir(obj) container_of(obj, struct class_dir, kobj)
@@ -1470,6 +1477,7 @@ class_dir_create_and_add(struct class *class, struct kobject *parent_kobj)
return NULL;

dir->class = class;
+ dir->child_count = 1;
kobject_init(&dir->kobj, &class_dir_ktype);

dir->kobj.kset = &class->p->glue_dirs;
@@ -1526,6 +1534,8 @@ static struct kobject *get_device_parent(struct device *dev,
}
spin_unlock(&dev->class->p->glue_dirs.list_lock);
if (kobj) {
+ struct class_dir *class_dir = to_class_dir(kobj);
+ class_dir->child_count++;
mutex_unlock(&gdp_mutex);
return kobj;
}
@@ -1567,11 +1577,18 @@ static inline struct kobject *get_glue_dir(struct device *dev)
*/
static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
{
+ struct class_dir *class_dir = to_class_dir(glue_dir);
+
/* see if we live in a "glue" directory */
if (!live_in_glue_dir(glue_dir, dev))
return;

mutex_lock(&gdp_mutex);
+ if (WARN_ON(class_dir->child_count == 0))
+ goto bail;
+ if (--class_dir->child_count == 0)
+ kobject_del(glue_dir);
+ bail:
kobject_put(glue_dir);
mutex_unlock(&gdp_mutex);
}


2018-06-29 16:14:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Thu, Jun 28, 2018 at 7:22 PM Benjamin Herrenschmidt
<[email protected]> wrote:
>
> This fixes it by instead doing an explicit kobject_del() when
> the glue dir is empty, by keeping track of the number of
> child devices of the gluedir.

Ugh. I was hoping that you'd just do the "only check duplicate names
if actually in use" instead.

This patch seems to make patch 1/2 pointless. No?

Linus

2018-06-29 16:14:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Fri, Jun 29, 2018 at 6:56 AM Linus Torvalds
<[email protected]> wrote:
>
> This patch seems to make patch 1/2 pointless. No?

.. well, maybe not "pointless", since it seems to be a good idea
regardless, but you get what I mean, I think.

Linus

2018-06-30 01:07:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Fri, 2018-06-29 at 06:56 -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:22 PM Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
>
> Ugh. I was hoping that you'd just do the "only check duplicate names
> if actually in use" instead.

I had a look (see my other email). It's non-trivial. We can still look
into it, but from what I gathered of how sysfs works, it's based on
kernfs which doesn't have the kobjects nor access to the reference
count, and is the holder of the names rbtree.

So we'd need to find a way to convey that "in-use" information down to
kernfs (I thought maybe adding an optional pointer to a kref ? but
then, it's still somewhat racy ...)

Additionally, we would need a few more pairs of eyes to make sure that
sticking duplicates in that rbtree isn't going to break some corner
case in there.

It's a code base I'm not at all familiar with. So while I agree with
the overall idea that a kobject whose reference has gone down to 0
should probably be ignored by sysfs, actually doing it doesn't seem
simple.

> This patch seems to make patch 1/2 pointless. No?

Somewhat. It's stil the "right thing" to do in case a hole shows up
elsewhere, and it's an easier stable backport. But yes. I wrote it
before I figured out what to do with 2/2 (I was still trying to do what
you wanted and just skip ref=0 in sysfs name lookups).

But yes, with 2/2, there shouldn't be a kobject with a 0 ref in the
list anymore... hopefully.

I would very much appreciate the opinion of someone more familiar with
sysfs and the device core on all this though.

Cheers,
Ben.


2018-06-30 05:42:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sat, 2018-06-30 at 11:04 +1000, Benjamin Herrenschmidt wrote:
> I had a look (see my other email). It's non-trivial. We can still look
> into it, but from what I gathered of how sysfs works, it's based on
> kernfs which doesn't have the kobjects nor access to the reference
> count, and is the holder of the names rbtree.
>
> So we'd need to find a way to convey that "in-use" information down to
> kernfs (I thought maybe adding an optional pointer to a kref ? but
> then, it's still somewhat racy ...)
>
> Additionally, we would need a few more pairs of eyes to make sure that
> sticking duplicates in that rbtree isn't going to break some corner
> case in there.

Just to clarify, I will look into it, but it will take more time so I'm
keen on having the low hanging fixes in now.

Also in general, I dislike the idea of leaving things in sysfs with a
refcount of 0. That "late" cleanup in kobject_release() looks to me
more like a band-aid than a feature.

I'd be almost tempted to stick a WARN_ON in there and see who gets hit
but I worry it's going send me down one of those rabbit holes and my
time is limited these days ;-)

Cheers,
Ben.


2018-07-01 02:10:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

[ Ben - feel free to post the missing emails to lkml too ]

On Sat, Jun 30, 2018 at 6:56 PM Benjamin Herrenschmidt
<[email protected]> wrote:
>
> On Sat, 2018-06-30 at 12:29 -0700, Linus Torvalds wrote:
> >
> > Because normally, the last kobject_put() really does do a synchronous
> > kobject_del(). So normally, this is all completely pointless, and the
> > normal kobject lifetime rules should be sufficient.
>
> Not entirely sadly....
>
> There's a small race between the kref going down to 0 and the last
> kobject_put(). Something might still "catch" the 0-reference object
> during that window.

That's only true if the underlying subsystem doesn't have any
serialization itself.

Which I don't think is normal.

IOW, if you have (say) a PCI hotplug model, the PCI layer will have
the pci_hp_mutex, for example, which protects the PCI hotplug slot
lists and the kobj things.

Those locks won't protect kobj races in _general_ (ie there is no
locking between two totally unrelated buses), but they *should*
serialize the case of a device being added within one class. No?

If not, maybe we need to add some class-based serialization.

> I think it just opens an existing race more widely. The race always
> exist becaues another CPU can observe the object between the reference
> going to 0 and the last kobject_del done by kobject_release.

No it really damn well can't, exactly because we should not allow it -
and allowing it would be fundamentally racy.

We should never allow a kobject_get() with a zero reference count.
It's always a *fundamental* bug - see my other email on your 1/2
patch.

So there is no race, because the reference count itself protects the
object. Either another CPU gets it in time (before it went down to
zero), or it went down to zero and it is dead (because at that point,
deletion will have been scheduled.

Note that this is entirely independent of the auto-clean actions,
since even with absolutely zero cleanup, you still have the
fundamental issue of "reference count goes to zero causes the object
to be free'd".

Linus

2018-07-01 02:20:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sat, Jun 30, 2018 at 7:07 PM Linus Torvalds
<[email protected]> wrote:
>
> Those locks won't protect kobj races in _general_ (ie there is no
> locking between two totally unrelated buses), but they *should*
> serialize the case of a device being added within one class. No?

Side note: there had *better* be some locking whenever there is a way
to find an object, because otherwise you have a fundamental lifetime
problem: one thread finding the object at the same time another thread
frees it for the last time. Even the "unless_zero()" won't fix it,
because the final free will release the underlying object itself, so
the "zero" state is ephemeral.

That locking might be just RCU during lookup, and rcu-delaying the
release, of course. I think that's all the sysfs code needs, for
example, since that's what lookup uses.

And for any other embedded kobj cases, where you can reach the object
using some random subsystem pointers, there had better be other
locking in place for that pointer lookup vs the last removal.

kobject itself doesn't provide that locking, it only provides the
reference counting. But that's partly why it really has to disallow
any kobject_get() of a zero object, because it means that the
tear-down has been started, but the tear-down itself may not have had
time to get the lock yet (ie kobject_release() may be just about to
call the t->release() function).

But maybe I'm missing something subtle.

Linus

2018-07-01 07:18:14

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sat, 2018-06-30 at 20:57 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 8:42 PM Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > But what if something *else* still holds a reference to the kobject ?
> > It could be anything really... t
>
> But that's fine. Then the object will continue to exist, and the sysfs
> file will continue to exist, and you won't get a new glue directory.

I suspect you didn't read it my entire argument or I wasn't clear
enough :-) This is actually the crux of the problem:

Yes the object continues to exist. However, the *last* kobject_put to
it will no longer be done under whatever higher level locking the
subsystem provides (whatever prevents for example concurrent add and
removes).

Thus in that scenario the "last minute" kobject_release() done by the
last kobject_put() will be effectively unprotected from for example the
gdp_mutex (in the case of the gluedirs) or whatever other locking the
subsystem owning the kobject is using to avoid making that "refount 0"
object "discoverable".

So not only the conitnued existence of the object will potentially
trigger the duplicate sysfs name problem, but it *will* also re-open
the window for the object to be temporarily be visible with a 0
refcount.

The kobject debugging doesn't create a new race here. It just
significantly increase the size of an existing race window.

You argued yourself that the reason that window is a non-issue without
kobject debugging is that the expectation is that the release happens
synchronously with the last kobject_put done by device_del, which has
appropriate higher-level locking.

My point is that this specific assumption doesn't hold in the case
where another reference exists. In that case the object will be freed
with the race open for others to observe it while its refcount is 0.

> In fact, what you describe is a problem with *your* patch, exactly
> because you introduce another counter that is *not* the reference
> count, and now you have the problem with "old directory kobject is
> still live, but I removed the sysfs part, and now I'm creating a new
> object with the same name".
>
> Hmm?

So my patch 1/2 prevents us from finding the old dying object (and thus
from crashing) but replaces this with the duplicate name problem.

My patch 2/2 removes that second problem by ensuring we remove the
object from sysfs synchronously in device_del when it no longer
contains any children, explicitely rather than implicitely by the
virtue of doing the "last" kobject_put.

You'll notice the existing code wraps that kobject_put() with the
gdp_mutex, specifically I suppose to synronize with the creation path
of the gluedir, ie with the expectation that this kobject_put() will
synchronously remove the object from the kset.

I argue this assumption is flawed, that another reference could delay
the removal from the kset to a point where the mutex isn't held, and
thus the race re-opens.

My patch doesn't implement a separate "refcount" per-se, it implements
a "childcount". This is akin to mm_users vs mm_count.

We want the glue dir to be removed when it has no more children, we
currently "conflate" this with the kobject having no reference. This is
what I argue is incorrect. The kobject can have "unrelated" references,
that define the lifetime of the object in memory, but it's presence in
sysfs is dictated by the number of children it contains.

> > It is and there's a WARN_ON about it inside kobject_get(). I don't
> > think anybody argues against that, you are absolutely right.
>
> No. That the zero kobject_get() will not result in a warning. It just
> does a kref_get(), no warnings anywhere.

It's there but it's in refcount:

void refcount_inc(refcount_t *r)
{
WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
}
EXPORT_SYMBOL(refcount_inc);

In fact that's how I started digging into that problem in the first place :-)

> Yes, there is a kobject_get() warning, but that's about an
> _uninitialized_ kobject, not a already released one! You will get no
> warning that I can see from the "oh, you just got a stale one".

Cheers,
Ben.

> Linus

2018-07-01 17:06:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
<[email protected]> wrote:
>
> I suspect you didn't read it my entire argument or I wasn't clear
> enough :-) This is actually the crux of the problem:
>
> Yes the object continues to exist. However, the *last* kobject_put to
> it will no longer be done under whatever higher level locking the
> subsystem provides (whatever prevents for example concurrent add and
> removes).

Well, yes and no.

Why "no"?

The last dropping is actually not necessarily that interesting.
Especially with the sysfs interface, we basically know that you can
look up the object using RCU (since that's what the filesystem lookup
does), and that basically means that the refcount is always the final
serialization mechanism.There is nothing else that can possibly lock
it.

So this is where we disagree:

> Thus in that scenario the "last minute" kobject_release() done by the
> last kobject_put() will be effectively unprotected from for example the
> gdp_mutex (in the case of the gluedirs) or whatever other locking the
> subsystem owning the kobject is using to avoid making that "refount 0"
> object "discoverable".

No. *Fundamentally*, there is only one thing that protects that
object: the refcount.

And my argument is that anything that has this model (which means
anything that has any sysfs linkage, which pretty much means any
kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
had an existing stable pointer and just wants to increase the refcount
(ie "already got a reference throuigh one of my data structures that
use the lock")

But if you have that model, that means that the "last drop" is
actually almost totally irrelevant. Because it doesn't matter if it's
done inside the lock or not - you know that once it has been done,
that object is entirely gone. It's not discoverable any more -
regardless of locking. The discoverability is basically controlled
entirely by the refcount.

So what happens then?

The locking isn't important for the last release, but it *is*
important for new object *creation*.

Why?

The refcount means that once an object is gone, it's gone for
*everyone*. It's a one-way thing, and it's thread-safe. So the code
that does *creation* can do this:

- get subsystem lock
- look up object (using the "unless_zero()" model)
- if you got the object, re-use it, and you're done: drop lock and return
- otherwise, you know that nobody else can get it either
- create new object and instantiate it
- drop lock

and this means that you create a new object IFF the old object had its
refcount drop to zero. So you always have exactly one copy (or no copy
at all, in between the last drop and the creation of the new one).

See? The lack of locking at drop time didn't matter. The refcount
itself serialized things.

So the above is what I *think* the "glue_dir" logic should be. No need
for any other count. Just re-use the old glue dir if you can find it,
and create a new one if you can't.

The one important thing is that the object lookup above needs to use
the lookup needs to find the object all the way until the refcount has
become zero. And that actually means that the object MUST NOT be
removed from the object lists until *after* the refcount has been
decremented to zero. Which is actually why that "automatic cleanup"
that you hate is actually an integral and important part of the
process: removing the object *before* the refcount went to zero is
broken, because that means that the "look up object" phase can now
miss an object that still has a non-zero refcount.

> So my patch 1/2 prevents us from finding the old dying object (and thus
> from crashing) but replaces this with the duplicate name problem.

So I absolutely agree with your patch 1/2. My argument against it is
actually that I think the "unless_zero" thing needs to be more
universal.

> My patch 2/2 removes that second problem by ensuring we remove the
> object from sysfs synchronously in device_del when it no longer
> contains any children, explicitely rather than implicitely by the
> virtue of doing the "last" kobject_put.

No. See above. The reason I think your patch 2/2 is wrong is that is
actually *breaks* the above model, exactly because of that thing that
you hatre.

The explicit removal is actively wrong for the "I want to reuse the
object" model, exactly because it happens before the refcount has gone
to zero.

> > No. That the zero kobject_get() will not result in a warning. It just
> > does a kref_get(), no warnings anywhere.
>
> It's there but it's in refcount:
>
> void refcount_inc(refcount_t *r)
> {
> WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> }
> EXPORT_SYMBOL(refcount_inc);
>
> In fact that's how I started digging into that problem in the first place :-)

Hey, you are again hitting this because of extra config options.

Because the refcount_inc() that I found looks like this:

static inline void refcount_inc(refcount_t *r)
{
atomic_inc(&r->refs);
}

and has no warning.

I wonder how many people actually run with REFCOUNT_FULL that warns -
because it's way too expensive. It's not set in the default Fedora
config, for example, and that's despite how Fedora tends to set all
the other debug options.

So no, it really doesn't warn normally.

Linus

2018-07-01 23:38:24

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
>
> So this is where we disagree:
>
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
>
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
>
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
>
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant. Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.

This is almost true... the issue there is that gap of time between the
refcount going to 0 and kobject_release() actually doing the
kobject_del. Thus the object will be present in the kset for example,
and thus discoverable, while its refcount is 0.

Yes, you are absolutely right, anything that can find such a dead
object must be using kobject_get_unless_zero(). This is what patch 1/2
does and it fixes the basic crashing race.

But we are then exposed to the duplicate name issue on fast bind/unbind
or add/remove.

> So what happens then?
>
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
>
> Why?
>
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
>
> - get subsystem lock
> - look up object (using the "unless_zero()" model)
> - if you got the object, re-use it, and you're done: drop lock and return
> - otherwise, you know that nobody else can get it either
> - create new object and instantiate it
> - drop lock
>
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).

Except that create will occasionally fail due to duplicate names. So we
need to fix that one way or another (see below)

> See? The lack of locking at drop time didn't matter. The refcount
> itself serialized things.

Oh I absolutely agree with all that when it comes to object lifetime, I
don't completely agree when it comes to visibility to the outside
world, see below.

> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
>
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.

In the case of gluedir that is perfectly fine though. It's a bit like
having an open unlinked file. You want to "unlink" the gluedir
synchronously with the last of its children going away, in order to be
able to create a new one as soon as a new children comes in. However
you don't want to "free" the underlying kobject until after all users
have dropped their reference.

What I don't like is that idea of conflating discoverability and
lifetime. We may just have to agree to disagree on that one, but unless
we can make kernfs&sysfs prevent "finding" of a 0-ref object and thus
allow creating of a duplicate in that case (which isn't that simple a
problem to solve), we still have a problem.

> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
>
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.

I agree with that, absolutely. kobject_get() should probably just be an
unless_zero variant always rather than a special case.

> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
>
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that
> you hatre.
>
> The explicit removal is actively wrong for the "I want to reuse the
> object" model, exactly because it happens before the refcount has gone
> to zero.

That's indeed where we disagree.

In fact that's the entire point of the gdb_mutex today, ie the code was
clearly written with the *intention* that kobject_put() in device_del
is the last thing happening to the glue dir so that it gets
synchronously removed from the kset and sysfs along with the last
child.

This is the only reason why it makes any sense to be taking that mutex
around kobject_put.

It does so in order to avoid the race alltogether.

And that assumption in the code today is wrong because there could be
another reference being held, delaying the removal from the kset and
sysfs to outside of that mutex.

So kobject debugging's delayed freeing doesn't create a new race here,
it just exposes more widely an existing one.

So to get back to "how do we fix things", I think we agree with patch 1
and we agree that we should probably make kobject_get unconditionally
be a "unless zero" version accross the board, but let's Greg chime in
here before we do anything drastic.

This fixes the use-after-free and crash. However it does open us to a
duplicate name window where we'll fail to re-bind or re-add a device
due to a late un-reference. Kobject debugging makes this very likely to
happen by delaying the freeing by seconds, but without it, there' still
a small window.

There's two ways we've proposed to fix this. One is my patch 2 which
you don't like, which basically is equivalent to saying let's unlink
the gluedir synchronously with the last children going away, and keep
the kobeject refcount strictly as a mean to deal with the liftime of
the structure (freeing it). It uses the existing gdp_mutex and just
makes the existing assumptions in the code work.

The other way is what you prefer, which I agree is a good idea, but it
will take me at least some time to figure out a way to actually
implement it, and it is to essentially change sysfs so that objects
that have a refount of 0 aren't discoverable anymore, so that we don't
name-clash on trying to create a new one.

It's tricky because all the name business is handled by kernfs which
has no idea about kobjects of krefs. My initial idea would be to add an
optional pointer to a kref to kernfs entries, so it can check the
refcount on lookups and name collision checks. I'm not entirely certain
how to work out the locking/ordering for this though, we might need to
add memory barriers in kobject_put to match.

> > > No. That the zero kobject_get() will not result in a warning. It just
> > > does a kref_get(), no warnings anywhere.
> >
> > It's there but it's in refcount:
> >
> > void refcount_inc(refcount_t *r)
> > {
> > WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> > }
> > EXPORT_SYMBOL(refcount_inc);
> >
> > In fact that's how I started digging into that problem in the first place :-)
>
> Hey, you are again hitting this because of extra config options.

Ah possibly. I enabled every debug option I could find :-)
>
> Because the refcount_inc() that I found looks like this:
>
> static inline void refcount_inc(refcount_t *r)
> {
> atomic_inc(&r->refs);
> }
>
> and has no warning.
>
> I wonder how many people actually run with REFCOUNT_FULL that warns -
> because it's way too expensive. It's not set in the default Fedora
> config, for example, and that's despite how Fedora tends to set all
> the other debug options.
>
> So no, it really doesn't warn normally.

Yup, you're right.

Cheers,
Ben.


2018-07-02 10:23:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > I suspect you didn't read it my entire argument or I wasn't clear
> > enough :-) This is actually the crux of the problem:
> >
> > Yes the object continues to exist. However, the *last* kobject_put to
> > it will no longer be done under whatever higher level locking the
> > subsystem provides (whatever prevents for example concurrent add and
> > removes).
>
> Well, yes and no.
>
> Why "no"?
>
> The last dropping is actually not necessarily that interesting.
> Especially with the sysfs interface, we basically know that you can
> look up the object using RCU (since that's what the filesystem lookup
> does), and that basically means that the refcount is always the final
> serialization mechanism.There is nothing else that can possibly lock
> it.
>
> So this is where we disagree:
>
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
>
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
>
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
>
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant. Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.
>
> So what happens then?
>
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
>
> Why?
>
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
>
> - get subsystem lock
> - look up object (using the "unless_zero()" model)
> - if you got the object, re-use it, and you're done: drop lock and return
> - otherwise, you know that nobody else can get it either
> - create new object and instantiate it
> - drop lock
>
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).
>
> See? The lack of locking at drop time didn't matter. The refcount
> itself serialized things.
>
> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
>
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.
>
> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
>
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.
>
> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
>
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that
> you hatre.
>
> The explicit removal is actively wrong for the "I want to reuse the
> object" model, exactly because it happens before the refcount has gone
> to zero.
>
> > > No. That the zero kobject_get() will not result in a warning. It just
> > > does a kref_get(), no warnings anywhere.
> >
> > It's there but it's in refcount:
> >
> > void refcount_inc(refcount_t *r)
> > {
> > WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> > }
> > EXPORT_SYMBOL(refcount_inc);
> >
> > In fact that's how I started digging into that problem in the first place :-)
>
> Hey, you are again hitting this because of extra config options.
>
> Because the refcount_inc() that I found looks like this:
>
> static inline void refcount_inc(refcount_t *r)
> {
> atomic_inc(&r->refs);
> }
>
> and has no warning.
>
> I wonder how many people actually run with REFCOUNT_FULL that warns -
> because it's way too expensive. It's not set in the default Fedora
> config, for example, and that's despite how Fedora tends to set all
> the other debug options.
>
> So no, it really doesn't warn normally.
>
> Linus

2018-07-02 10:26:12

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, 2018-07-02 at 09:36 +1000, Benjamin Herrenschmidt wrote:
> > No. See above. The reason I think your patch 2/2 is wrong is that is
> > actually *breaks* the above model, exactly because of that thing that
> > you hatre.
> >
> > The explicit removal is actively wrong for the "I want to reuse the
> > object" model, exactly because it happens before the refcount has gone
> > to zero.
>
> That's indeed where we disagree.

Sooo...

I had some quality time with my good friend Oban 14 last night,
thinking through this and during the resulting hangover today. I've
tried very hard to "get" your perspective, but after another dive
through the code, I'm afraid I still think patch 2/2 is the right thing
to do.

(Let's ignore patch 1/1 for now and assume we get that part right)

Let me try one last ditch attempt to convince you using maybe a
different perspective: this is how sysfs is intended to work and how
the device model already does everywhere else except the gluedirs.

Sooo... let's start pulling the string bit by bit...

I think you agree since (I'm not going to try to quote you here) you
did mention adding/removing need to generally be protected by higher
level locks in whatever subsystem owns the kobjects. I don't think
there's an argument here.

This is hinted by some comments (among others) in sysfs/dir.c about the
kobject owner being responsible to ensure that removal doesn't race
with any operation, which is hard to do when such removal doesn't
happen under subsystem internal locks.

Now let's look at how things are handled for a normal struct device
kobject.

The device core provides device_add and device_del functions. Those are
completely orthogonal to the object lifetime controlled by
device_initialize (creation) and device_put. And for good reasons (yes,
there are helpers, device_register/unregister that combine them, but
those are just that... helpers).

Why ? For the exact same reason mentioned above: Along with all the
driver binding/unbinding business etc... those functions ensure that
the kobject of the device is added/removed from sysfs under all the
necessary locks in the device model, so that they cannot race with each
other.

This guarantees among others that it is possible to do a new device_add
for the same device immediately after device_del, without clashing with
a duplicate sysfs name, *even* if something still holds a reference to
the kobject for any reason (and there are plenty of these, see the cdev
example below).

The actual lifetime of the struct device is handled *separately* by
device_get/put which are just wrappers on kobject_get/put.

This is the same with a cdev. cdev_add and cdev_del or their friends
cdev_device_add/del do the same thing. The chardev case is interesting
specifically because it is much more common for these to persist
(lifetime) beyond cdev_del. You just need userspace to have an open
file descriptor to the chardev. Drivers have to deal with it of course,
they need to handle things like read/write/ioctl being called on
already open files after remove/del/unbind, and hopefully do so.

*But* it's also possible to create a new identical cdev with the same
sysfs name immediately after cdev_device_del even if the old one still
has open instanced precisely because we separate the sysfs presence,
handled through the device model internal locking, from the object
lifetime.

Now, going back to our gluedirs, in essence those are just more objects
owned by the device core. More specifically, they need to be created
synchronously with the first child device of a class added under a
given parent, and removed synchronously with the last child device of
that class under that same parent.

Delaying that removal from sysfs simply means that we now are unable to
re-add the same device after removing it for some arbitrary amount of
time, because something might still hold a kobject reference to the
gluedir for any reason, it doesn't matter.

It also means that the deletion from sysfs is no longer synchronized
with addition since it no longer happens from within the context of
decice_del which is the "owner" of the glue dir, under whatever locks
it uses to ensure there is no races between multiple devices addition
removal etc...

In fact, if you look at the code and the way the gdp_mutex is used, I
think this is *exactly* the intention of the code. That the
kobject_put() done inside device_del() is the last one, done under that
mutex, thus removing the gluedir synchronously and we have no race.

However, that code just fails to take into account the delayed removal
by kobject debugging and the possibility of another reference existing.

Now it's possible there there can be no other references of a gluedir
today, to be frank I haven't audited the entire code base to figure out
if anything can possibly hold one that isn't a child of that gluedir,
so if indeed there can't be any and the only references to the kref
possible *ever* on a gluedir are its children, then it's possibe that
the assumption holds, except with kobject debugging.

However, even if that is the case, I call it fragile. I think we should
ensure, which is what patch 2/2 does, that regardless of "when" the
object is actually freed, and regardless of what other references might
exist, the gluedir behaves just like other struct device or cdev, and
gets removed from sysfs synchronously, along with the last child, with
all the necessary device core mutexes held, as to never race with an
addition and as to never trigger the duplicate name problem.

Now, I know why you dislike patch 2/2, because it adds another counter,
and there's something "itchy" about having a refcount and that
childcount.

We don't absolutely need that childcount. It could be that we have ways
to ask sysfs whether there are any children left in that directory,
which would work as well, I just haven't really figured out how, and a
child count felt like a trivial way to solve the problem. It's
protected by the gdp_mutex so it's safe.

There are precedents to such dual counters, such as mm_users vs
mm_count. There are precedents to separating file system visibility
from lifetime, every fs does it with unlink() when there are still open
files, so I think overall, what I'm trying to achieve is just pretty
much standard practice: unlink the gluedir exactly when it needs to be
unlinked so a new one can be created when needed, and deal with
lifetime of the actual structure the usual way which can be delayed.

Now if this doesn't convince you, I'm afraid nothing will ;-)

I'm curious to hear from Greg though.

Cheers,
Ben.



2018-07-02 19:25:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, Jul 2, 2018 at 3:23 AM Benjamin Herrenschmidt
<[email protected]> wrote:
>
> Let me try one last ditch attempt to convince you using maybe a
> different perspective: this is how sysfs is intended to work and how
> the device model already does everywhere else except the gluedirs.

So don't get me wrong. I don't think that patch is _wrong_. It may
well be the best thing we can do now.

I just think some of the arguments for the patch are bogus.

I still think that the auto-cleanup is actually a good thing in
general. Not because it simplifies things (which it can do), but
because it fundamentally *allows* you to use less locking.

And that ends up being my real reason to not like your patch all that
much. It depends on

(a) an extra reference count

(b) on fairly heavy-handed locking (ie the whole "lock on release
too, not just on allocation").

and I think both of those are non-optimal.

So:

> The actual lifetime of the struct device is handled *separately* by
> device_get/put which are just wrappers on kobject_get/put.

I agree that that is the end result of your patch (and perhaps the
buggy intent of the original code).

I just don't necessarily agree it's a great thing in general. It
basically uses sysfs visibility as an argument for why lifetimes
should differ from the refcounted lifetimes.

And that may be a practical argument, but I don't think it's a very
good argument in general. I think it would arguably be much better to
be less serialized, and depend on refcounting more, and make less of a
deal about the sysfs visibility.

For example, if we *really* add back the exact same device immediately
after removing it, and the device was still in use somehow (ie the
refcount never went down to zero), maybe we really should not have
reused the same name? Or if we do re-use the same name, maybe we
should have re-used the device node itself?

See what I'm saying? I understand where your patch is coming from, but
I am suspicious of the fact that it seems to want to re-use a name
(but not the object) that is by definition still in use.

Maybe that's the right thing in this case. Considering that we have a
real oops and a real problem, and I don't have an alternative patch, I
guess the "patches talk, bullshit walks" rule applies.

Linus

2018-07-03 00:58:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, 2018-07-02 at 12:24 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 3:23 AM Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > Let me try one last ditch attempt to convince you using maybe a
> > different perspective: this is how sysfs is intended to work and how
> > the device model already does everywhere else except the gluedirs.
>
> So don't get me wrong. I don't think that patch is _wrong_. It may
> well be the best thing we can do now.
>
> I just think some of the arguments for the patch are bogus.
>
> I still think that the auto-cleanup is actually a good thing in
> general. Not because it simplifies things (which it can do), but
> because it fundamentally *allows* you to use less locking.
>
> And that ends up being my real reason to not like your patch all that
> much. It depends on
>
> (a) an extra reference count

Right, that is not that great, and it might be possible to avoid it if
we have a way to check that the sysfs dir is empty. Maybe kobj->sd-
>subdirs ? This will find subdirs, not files, but for gluedirs that's
sufficient. I can play with that today see if it works.

BTW. Talking of extra reference counts, we already have two everywhere
afaik... kernfs has its own that's separate from the kobject one. I got
lost in that maze a couple of times already.

> (b) on fairly heavy-handed locking (ie the whole "lock on release
> too, not just on allocation").

This is already there, I'm not adding it ;-) It's just ineffective
without my patch due to the possibility of the "auto-cleanup" to leak
outside the lock.

> and I think both of those are non-optimal.
>
> So:
>
> > The actual lifetime of the struct device is handled *separately* by
> > device_get/put which are just wrappers on kobject_get/put.
>
> I agree that that is the end result of your patch (and perhaps the
> buggy intent of the original code).

This is the existing device_add/del without my patch. I was describing
how the core device stuff works for the normal device directories
today. My patch makes the gludirs do the same thing basically, which I
think was the intend of the code already but was broken.

> I just don't necessarily agree it's a great thing in general. It
> basically uses sysfs visibility as an argument for why lifetimes
> should differ from the refcounted lifetimes.

Yes, it does that so that we can remove and re-add without name
clashes.

> And that may be a practical argument, but I don't think it's a very
> good argument in general. I think it would arguably be much better to
> be less serialized, and depend on refcounting more, and make less of a
> deal about the sysfs visibility.
>
> For example, if we *really* add back the exact same device immediately
> after removing it, and the device was still in use somehow (ie the
> refcount never went down to zero), maybe we really should not have
> reused the same name? Or if we do re-use the same name, maybe we
> should have re-used the device node itself?

Well, maybe but that's what we've been doing forever, and I think
drivers deal with it already. It applies to devices and to chardevs for
example.

> See what I'm saying? I understand where your patch is coming from, but
> I am suspicious of the fact that it seems to want to re-use a name
> (but not the object) that is by definition still in use.

Yes, I see what you mean.

I don't really have the same qualms as you do, it's how we've done
things forever (it's just that the gluedir code was broken), and it
generally works well.

It's not different in principle to what we do with files. You can
unlink an open file. It's still "in use" but the directory entry is
gone and one can create a new one.

It a slightly more roundabout way, we do that with mm_structs, the last
process can be gone but we keep them around due to being referenced (or
lazily attached to a kthread). Here we have 2 counters too.

> Maybe that's the right thing in this case. Considering that we have a
> real oops and a real problem, and I don't have an alternative patch, I
> guess the "patches talk, bullshit walks" rule applies.

Haha, as you prefer. Patch 1 fixes the oops, patch 2 fixes the name
collision, so in theory you could get away with just patch 1. But then
device would start failing to be re-added or re-bound.

(In my case it was a sub-device created by the driver of a parent
device, so it happens on bind/unbind of that parent driver).

I'll have a look see if I can find a way to check that the sysfs dir is
empty without adding that childcount, that would at least alleviate
some of your objection, and would probably make the patch
smaller/simpler.

Cheers,
Ben.


2018-07-03 02:16:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, Jul 2, 2018 at 5:57 PM Benjamin Herrenschmidt
<[email protected]> wrote:
>
> I'll have a look see if I can find a way to check that the sysfs dir is
> empty without adding that childcount, that would at least alleviate
> some of your objection, and would probably make the patch
> smaller/simpler.

That would definitely make me happier.

Right now we already remove the actual device node sysfs associations
synchronously with "kobject_del()" (even if it still then stays around
as a kobject), and that does the remove for the object itself - just
not the glue directory.

If we then just did a "if glue dir is empty, delete the glue dir too"
in cleanup_glue_dir(), at least ther patch would be prettier.

I *think* that should be easy. Maybe. Something almost, but not quite,
entirely unlike the below UNTESTED hack?

It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
USE UNDER ANY CIRCUMSTANCES. Think of it more as a "something like
this might work, but probably doesn't". Maybe it gives you an idea,
although that idea might be "Linus has finally lost it".

I suspect it's broken because kernfs has this notion of inactive
nodes, so just testing for the rbtree being empty is probably wrong. I
think maybe it needs to test for there not being any active nodes, but
then you'd need the kernfs_mutex etc, so that would make things much
worse.

I really haven't touched the kernfs code much, thus the "this may be
complete and utter garbage" warning.

Adding Tejun to the participants, he should know. Or Greg, who has
been silent, presumably because he's still getting over his crazy
travels.

Linus

---

--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1585,6 +1585,12 @@ static inline struct kobject
*get_glue_dir(struct device *dev)
return dev->kobj.parent;
}

+static inline bool has_children(struct kobject *dir)
+{
+ struct kernfs_node *kn = dir->sd;
+ return kn && kn->dir.children.rb_node;
+}
+
/*
* make sure cleaning up dir as the last step, we need to make
* sure .release handler of kobject is run with holding the
@@ -1597,6 +1603,10 @@ static void cleanup_glue_dir(struct device
*dev, struct kobject *glue_dir)
return;

mutex_lock(&gdp_mutex);
- kobject_put(glue_dir);
+ if (has_children(glue_dir))
+ kobject_put(glue_dir);
+ else
+ kobject_del(glue_dir);
mutex_unlock(&gdp_mutex);
}

2018-07-03 02:28:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
<[email protected]> wrote:
>
> It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> USE UNDER ANY CIRCUMSTANCES. Think of it more as a "something like
> this might work, but probably doesn't". Maybe it gives you an idea,
> although that idea might be "Linus has finally lost it".

Even if it were to work, it should probably just be done inside kernfs
and exposed as some kind of "kernfs_remove_if_empty()" function.

We happen to know that we hold the lock that creates all the entries
in the glue directory (and we don't allow users to move or create
stuff, afaik, even if we alloc chmod etc), so there should be no
races, but a generic kernfs helper function would probably have to get
proper locks and check it the right way.

Linus

2018-07-03 02:38:51

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, 2018-07-02 at 19:15 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 5:57 PM Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > I'll have a look see if I can find a way to check that the sysfs dir is
> > empty without adding that childcount, that would at least alleviate
> > some of your objection, and would probably make the patch
> > smaller/simpler.
>
> That would definitely make me happier.
>
> Right now we already remove the actual device node sysfs associations
> synchronously with "kobject_del()" (even if it still then stays around
> as a kobject), and that does the remove for the object itself - just
> not the glue directory.
>
> If we then just did a "if glue dir is empty, delete the glue dir too"
> in cleanup_glue_dir(), at least ther patch would be prettier.
>
> I *think* that should be easy. Maybe. Something almost, but not quite,
> entirely unlike the below UNTESTED hack?

Yes, something like that. I have a bunch of other things I need to
attend to today, so I might not come up with a patch before some time
tomorrow but I'll have a look.

> It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> USE UNDER ANY CIRCUMSTANCES. Think of it more as a "something like
> this might work, but probably doesn't". Maybe it gives you an idea,
> although that idea might be "Linus has finally lost it".
>
> I suspect it's broken because kernfs has this notion of inactive
> nodes, so just testing for the rbtree being empty is probably wrong. I
> think maybe it needs to test for there not being any active nodes, but
> then you'd need the kernfs_mutex etc, so that would make things much
> worse.

There's also a "subdirs" in kernfs_node which we could use. That's a
bit ugly because it only account directories but we happen to know that
the gluedirs only contain directories, so it would work fine in
practice for that case.

> I really haven't touched the kernfs code much, thus the "this may be
> complete and utter garbage" warning.
>
> Adding Tejun to the participants, he should know. Or Greg, who has
> been silent, presumably because he's still getting over his crazy
> travels.

Cheers,
Ben.

> Linus
>
> ---
>
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1585,6 +1585,12 @@ static inline struct kobject
> *get_glue_dir(struct device *dev)
> return dev->kobj.parent;
> }
>
> +static inline bool has_children(struct kobject *dir)
> +{
> + struct kernfs_node *kn = dir->sd;
> + return kn && kn->dir.children.rb_node;
> +}
> +
> /*
> * make sure cleaning up dir as the last step, we need to make
> * sure .release handler of kobject is run with holding the
> @@ -1597,6 +1603,10 @@ static void cleanup_glue_dir(struct device
> *dev, struct kobject *glue_dir)
> return;
>
> mutex_lock(&gdp_mutex);
> - kobject_put(glue_dir);
> + if (has_children(glue_dir))
> + kobject_put(glue_dir);
> + else
> + kobject_del(glue_dir);
> mutex_unlock(&gdp_mutex);
> }

2018-07-03 02:41:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > USE UNDER ANY CIRCUMSTANCES. Think of it more as a "something like
> > this might work, but probably doesn't". Maybe it gives you an idea,
> > although that idea might be "Linus has finally lost it".
>
> Even if it were to work, it should probably just be done inside kernfs
> and exposed as some kind of "kernfs_remove_if_empty()" function.

That would be harder, we'd have to expose it via a
kobject_del_if_empty() then.

Since we control completely when things are added/removed from the
gluedir and have a big fat mutex for it, I don't think locking is much
of an issue. At least in that specific case.

> We happen to know that we hold the lock that creates all the entries
> in the glue directory (and we don't allow users to move or create
> stuff, afaik, even if we alloc chmod etc), so there should be no
> races, but a generic kernfs helper function would probably have to get
> proper locks and check it the right way.

Yes that or document very clearly under what circumstances that
function can be used.

> Linus

2018-07-03 05:23:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Tue, 2018-07-03 at 12:39 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> > On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > > USE UNDER ANY CIRCUMSTANCES. Think of it more as a "something like
> > > this might work, but probably doesn't". Maybe it gives you an idea,
> > > although that idea might be "Linus has finally lost it".
> >
> > Even if it were to work, it should probably just be done inside kernfs
> > and exposed as some kind of "kernfs_remove_if_empty()" function.
>
> That would be harder, we'd have to expose it via a
> kobject_del_if_empty() then.
>
> Since we control completely when things are added/removed from the
> gluedir and have a big fat mutex for it, I don't think locking is much
> of an issue. At least in that specific case.

Ok, kernfs was a tad tougher to crack than I hoped but here's a
prototype lightly tested.

Tejun, Greg, does it look reasonable to you ?

Cheers,
Ben.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..9166f68276c6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1570,6 +1570,8 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
return;

mutex_lock(&gdp_mutex);
+ if (!kobject_has_children(glue_dir))
+ kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(&gdp_mutex);
}
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..e4bec09bc602 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1251,6 +1251,29 @@ void kernfs_activate(struct kernfs_node *kn)
mutex_unlock(&kernfs_mutex);
}

+bool kernfs_has_children(struct kernfs_node *kn)
+{
+ bool has_children = false;
+ struct kernfs_node *pos;
+
+ /* Lockless shortcut */
+ if (RB_EMPTY_NODE(&kn->rb))
+ return false;
+
+ /* Now check for active children */
+ mutex_lock(&kernfs_mutex);
+ pos = NULL;
+ while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
+ if (kernfs_active(pos)) {
+ has_children = true;
+ break;
+ }
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return has_children;
+}
+
static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..776350ac1cbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -300,6 +300,12 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
return kn->flags & KERNFS_NS;
}

+/**
+ * kernfs_has_children - Returns whether a kernfs node has children.
+ * @kn: the node to test
+ */
+bool kernfs_has_children(struct kernfs_node *kn);
+
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
char *buf, size_t buflen);
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..1e3294ab95f0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,17 @@ extern void kobject_put(struct kobject *kobj);
extern const void *kobject_namespace(struct kobject *kobj);
extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);

+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+ WARN_ON_ONCE(kref_read(&kobj->kref) == 0);
+
+ return kobj->sd && kernfs_has_children(kobj->sd);
+}
+
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;


2018-07-03 15:47:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

Hello,

On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> +bool kernfs_has_children(struct kernfs_node *kn)
> +{
> + bool has_children = false;
> + struct kernfs_node *pos;
> +
> + /* Lockless shortcut */
> + if (RB_EMPTY_NODE(&kn->rb))
> + return false;

Hmm... shouldn't it be testing !rb_first(kn->dir.children)? The above
would test whether @kn itself is unlinked.

> +
> + /* Now check for active children */
> + mutex_lock(&kernfs_mutex);
> + pos = NULL;
> + while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> + if (kernfs_active(pos)) {
> + has_children = true;
> + break;
> + }
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return has_children;

The active ref is there to synchronize removal against in-flight reads
so that kernfs_remove() can wait for them to drain. On return from
kernfs_remove(), the node is guaranteed to be off the rbtree and the
above test isn't necessary. !rb_first() test should be enough
(provided that there's external synchronization, of course).

Thanks.

--
tejun

2018-07-04 01:11:47

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

On Tue, 2018-07-03 at 08:46 -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> > +bool kernfs_has_children(struct kernfs_node *kn)
> > +{
> > + bool has_children = false;
> > + struct kernfs_node *pos;
> > +
> > + /* Lockless shortcut */
> > + if (RB_EMPTY_NODE(&kn->rb))
> > + return false;
>
> Hmm... shouldn't it be testing !rb_first(kn->dir.children)? The above
> would test whether @kn itself is unlinked.

Ah possibly, the rb tree usage got me confused a few times in there.

> > +
> > + /* Now check for active children */
> > + mutex_lock(&kernfs_mutex);
> > + pos = NULL;
> > + while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> > + if (kernfs_active(pos)) {
> > + has_children = true;
> > + break;
> > + }
> > + }
> > + mutex_unlock(&kernfs_mutex);
> > +
> > + return has_children;
>
> The active ref is there to synchronize removal against in-flight reads
> so that kernfs_remove() can wait for them to drain. On return from
> kernfs_remove(), the node is guaranteed to be off the rbtree and the
> above test isn't necessary. !rb_first() test should be enough
> (provided that there's external synchronization, of course).

Ok. Yes there's external synchronization. So a simple !rb_first test
doesn't need the mutex I suppose ?

Cheers,
Ben.

>
> Thanks.
>