Hi all,
This is a refreshed version of the patch series Tejun posted quite a while
ago that allowed sysfs attributes to commit suicide directly:
http://thread.gmane.org/gmane.linux.kernel/582130/
I'm dusting this off (with Tejun's approval) because of recent changes
I introduced into the PCI core, which allow for logical hotplug of any
device in the system (via sysfs):
http://thread.gmane.org/gmane.linux.kernel.pci/3713/
My removal mechanism uses the much-hated sysfs_schedule_callback()
mechanism, and indeed, even some light testing has already shown some
drawbacks, namely we produce at least one false positive in lockdep.
I'm taking a two-prong approch here. The first step is to modify
sysfs_schedule_callback() and get it off the global work queue. This
will eliminate false positives in lockdep, and also stop us from
hogging the shared work queue with a long-running ->remove event,
such as removing a PCI bridge near the root of the hierarchy with
lots of devices underneath. I'm having Kenji test this patch right
now, since I'd like to get it fixed for the current merge window:
http://thread.gmane.org/gmane.linux.kernel.pci/3713/focus=3756
[note: Greg, if Kenji's testing is successful, I plan on sending
that patch as another .30 change]
The other prong is getting discussion going on this patch series again.
The most contentious part is patch 1/3, wherein sysfs abuses the
module notifier call chain, and basically prevents all module unloads
until suicidal sysfs attributes have completed.
This is poison of a different flavor from last time. The earlier version
of this series modified the module API and created an interface that
allowed anyone to inhibit module unload.
This time, only sysfs is allowed to be so... special. Which is a slight
improvement, but the question as to whether sysfs should be allowed to
do something like this is unresolved.
I'd like to get Rusty's opinion on this approach; I didn't see anything
in the archives from previous threads.
A secondary minor concern is the impurity that I've introduced into
sysfs, but I think most of the folks copied here would agree that it's
a worthwhile tradeoff if we can eliminate the callback mechanism.
Finally, please note that I didn't refresh the 4th patch in the original
series, the good one that actually removes all the cruft. I figured
we could discuss the module unload inhibition first, and in the meantime,
I could let some of the merge activity settle out before touching
the callsites.
Comments appreciated.
Thanks.
/ac
---
Alex Chiang (1):
sysfs: add blocking notifier to prohibit module unload
Tejun Heo (2):
sysfs: care-free suicide for sysfs files
sysfs: make the sysfs_addrm_cxt->removed list FIFO
fs/sysfs/dir.c | 9 +++-
fs/sysfs/file.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/sysfs/mount.c | 8 +++
fs/sysfs/sysfs.h | 6 ++
4 files changed, 152 insertions(+), 6 deletions(-)
From: Tejun Heo <[email protected]>
Add sysfs_addrm_cxt->removed_tail to make the ->removed list FIFO.
This will be used to implement care-free suicide.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---
fs/sysfs/dir.c | 7 +++++--
fs/sysfs/sysfs.h | 2 +-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 82d3b79..39320a5 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -369,6 +369,7 @@ void sysfs_addrm_start(struct sysfs_addrm_cxt *acxt,
memset(acxt, 0, sizeof(*acxt));
acxt->parent_sd = parent_sd;
+ acxt->removed_tail = &acxt->removed;
/* Lookup parent inode. inode initialization is protected by
* sysfs_mutex, so inode existence can be determined by
@@ -485,8 +486,10 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
sysfs_unlink_sibling(sd);
sd->s_flags |= SYSFS_FLAG_REMOVED;
- sd->s_sibling = acxt->removed;
- acxt->removed = sd;
+ *acxt->removed_tail = sd;
+ /* keep them in FIFO order, suicide check depends on it */
+ acxt->removed_tail = &sd->s_sibling;
+ *acxt->removed_tail = NULL;
if (sysfs_type(sd) == SYSFS_DIR && acxt->parent_inode)
drop_nlink(acxt->parent_inode);
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 93c6d6b..cb8ac65 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -81,7 +81,7 @@ static inline unsigned int sysfs_type(struct sysfs_dirent *sd)
struct sysfs_addrm_cxt {
struct sysfs_dirent *parent_sd;
struct inode *parent_inode;
- struct sysfs_dirent *removed;
+ struct sysfs_dirent *removed, **removed_tail;
int cnt;
};
As its name suggests, module_inhibit_unload() inhibits all module
unloading till the matching module_allow_unload() is called. This
unload inhibition doesn't affect whether a module can be unloaded or
not. It just stalls the final module free till the inhibition is
lifted.
This sledgehammer mechanism is to be used briefly in obscure cases
where identifying or getting the module to prevent from unloading is
difficult or not worth the effort. Note that module unloading is
siberia-cold path. If the inhibtion is relatively brief in human
scale, that is, upto a few secs at maximum, it should be fine.
Even if something goes wrong with unload inhibition (e.g. someone
forgets to lift the inhibition), it doesn't prevent modules from being
loaded.
Originally written by Tejun Heo. Refreshed and implemented as a
blocking notifier that registers with the module core.
Cc: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---
fs/sysfs/file.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/sysfs/mount.c | 8 ++++++
fs/sysfs/sysfs.h | 2 ++
3 files changed, 79 insertions(+), 0 deletions(-)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 7cc4dc0..bf93b58 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -21,6 +21,7 @@
#include <linux/mutex.h>
#include <linux/limits.h>
#include <asm/uaccess.h>
+#include <asm/atomic.h>
#include "sysfs.h"
@@ -60,6 +61,74 @@ struct sysfs_buffer {
struct list_head list;
};
+static atomic_t module_unload_inhibit_cnt = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(module_unload_wait);
+
+/**
+ * sysfs_module_callback - block until module unload allowed again
+ *
+ * (ab)use the blocking notifier call chain in delete_module()
+ * and prevent module unload for our own purposes, namely a
+ * suicidal sysfs attribute has finished killing itself.
+ *
+ * This prevents a module from freeing its code section before
+ * we are done accessing it.
+ */
+int sysfs_module_callback(struct notifier_block *self, unsigned long val,
+ void *data)
+{
+ DECLARE_WAITQUEUE(wait, current);
+
+ if (val == MODULE_STATE_COMING)
+ return NOTIFY_DONE;
+
+ add_wait_queue(&module_unload_wait, &wait);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (atomic_read(&module_unload_inhibit_cnt))
+ schedule();
+ __set_current_state(TASK_RUNNING);
+ remove_wait_queue(&module_unload_wait, &wait);
+
+ return NOTIFY_DONE;
+}
+
+/**
+ * module_inhibit_unload - inhibit module unload
+ *
+ * Inhibit module unload until allowed again. All module unload
+ * operations which reach zero reference count after this call
+ * has returned are guaranteed to be stalled till inhibition is
+ * lifted.
+ *
+ * This is a simple mechanism to prevent premature unload while
+ * code on a to-be-unloaded module is still executing. Unload
+ * inhibitions must be finite and relatively short.
+ *
+ * LOCKING:
+ * None.
+ */
+static void module_inhibit_unload(void)
+{
+ atomic_inc(&module_unload_inhibit_cnt);
+}
+
+/**
+ * module_allow_unload - allow module unload
+ *
+ * Allow module unload. Must be balanced with calls to
+ * module_inhibit_unload().
+ *
+ * LOCKING:
+ * None.
+ */
+static void module_allow_unload(void)
+{
+ if (atomic_dec_and_test(&module_unload_inhibit_cnt))
+ wake_up_all(&module_unload_wait);
+
+ BUG_ON(atomic_read(&module_unload_inhibit_cnt) < 0);
+}
+
/**
* fill_read_buffer - allocate and fill buffer from object.
* @dentry: dentry pointer.
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index ab343e3..c805bec 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -83,6 +83,11 @@ static struct file_system_type sysfs_fs_type = {
.kill_sb = kill_anon_super,
};
+static struct notifier_block sysfs_module_nb = {
+ .notifier_call = sysfs_module_callback,
+ .priority = 0
+};
+
int __init sysfs_init(void)
{
int err = -ENOMEM;
@@ -109,6 +114,9 @@ int __init sysfs_init(void)
}
} else
goto out_err;
+
+ register_module_notifier(&sysfs_module_nb);
+
out:
return err;
out_err:
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index cb8ac65..5d9b340 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -153,6 +153,8 @@ int sysfs_inode_init(void);
* file.c
*/
extern const struct file_operations sysfs_file_operations;
+extern int sysfs_module_callback(struct notifier_block *self, unsigned long val,
+ void *data);
int sysfs_add_file(struct sysfs_dirent *dir_sd,
const struct attribute *attr, int type);
From: Tejun Heo <[email protected]>
Life can be weary and some sysfs files choose to commit suicide (kills
itself when written to). This is troublesome because while a sysfs
file is being accessed, the accessing task holds active references to
the node and its parent. Removing a sysfs node waits for active
references to be drained. The suicidal thread ends up waiting for its
own active reference and thus is sadly forced to live till the end of
the time.
Till now, this has been dealt with by requiring suicidal node to ask
someone else (workqueue) to kill it. In recognition of the
inhumanitarian nature of this solution, this patch implements care-free
suicide support.
Suicide attempt is automagically detected and the active references
the suiciding task is holding are put early to avoid the above
described deadlock. Module unload is inhibited till the sysfs file
access is complete to avoid freeing the code region too early.
This patch only implements care-free suicide support. The next patch
will convert the users.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---
fs/sysfs/dir.c | 2 ++
fs/sysfs/file.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/sysfs/sysfs.h | 2 +-
3 files changed, 67 insertions(+), 3 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 39320a5..993edd1 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -583,6 +583,8 @@ void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt)
sd->s_sibling = NULL;
sysfs_drop_dentry(sd);
+ if (sysfs_type(sd) == SYSFS_KOBJ_ATTR)
+ sysfs_file_check_suicide(sd);
sysfs_deactivate(sd);
sysfs_put(sd);
}
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index bf93b58..7bd29cc 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -59,6 +59,8 @@ struct sysfs_buffer {
int needs_read_fill;
int event;
struct list_head list;
+ struct task_struct *accessor;
+ int committed_suicide;
};
static atomic_t module_unload_inhibit_cnt = ATOMIC_INIT(0);
@@ -130,6 +132,57 @@ static void module_allow_unload(void)
}
/**
+ * sysfs_file_check_suicide - check whether a file is trying to kill itself
+ * @sd: sysfs_dirent of interest
+ *
+ * Check whether @sd is trying to commit suicide. If so, help it
+ * by putting active references early such that deactivation
+ * doesn't deadlock waiting for its own active references.
+ *
+ * This works because a leaf node is always removed before its
+ * parent. By the time deactivation is called on the parent, the
+ * suiciding node has already put the active references to itself
+ * and the parent.
+ *
+ * LOCKING:
+ * None.
+ */
+void sysfs_file_check_suicide(struct sysfs_dirent *sd)
+{
+ struct sysfs_open_dirent *od;
+ struct sysfs_buffer *buffer;
+
+ spin_lock(&sysfs_open_dirent_lock);
+
+ od = sd->s_attr.open;
+ if (od) {
+ list_for_each_entry(buffer, &od->buffers, list) {
+ if (buffer->accessor != current)
+ continue;
+
+ /* it's trying to commit suicide, help it */
+
+ /* Inhibit unload till the suiciding method is
+ * complete. This is to avoid premature
+ * unload of the owner of the suiciding
+ * method.
+ */
+ module_inhibit_unload();
+
+ /* Global unload inhibition in effect, safe to
+ * put active references.
+ */
+ sysfs_put_active_two(sd);
+ buffer->committed_suicide = 1;
+
+ break;
+ }
+ }
+
+ spin_unlock(&sysfs_open_dirent_lock);
+}
+
+/**
* fill_read_buffer - allocate and fill buffer from object.
* @dentry: dentry pointer.
* @buffer: data buffer for file.
@@ -158,9 +211,14 @@ static int fill_read_buffer(struct dentry * dentry, struct sysfs_buffer * buffer
return -ENODEV;
buffer->event = atomic_read(&attr_sd->s_attr.open->event);
+ buffer->accessor = current;
+
count = ops->show(kobj, attr_sd->s_attr.attr, buffer->page);
- sysfs_put_active_two(attr_sd);
+ if (buffer->committed_suicide)
+ module_allow_unload();
+ else
+ sysfs_put_active_two(attr_sd);
/*
* The code works fine with PAGE_SIZE return but it's likely to
@@ -275,9 +333,13 @@ flush_write_buffer(struct dentry * dentry, struct sysfs_buffer * buffer, size_t
if (!sysfs_get_active_two(attr_sd))
return -ENODEV;
+ buffer->accessor = current;
rc = ops->store(kobj, attr_sd->s_attr.attr, buffer->page, count);
- sysfs_put_active_two(attr_sd);
+ if (buffer->committed_suicide)
+ module_allow_unload();
+ else
+ sysfs_put_active_two(attr_sd);
return rc;
}
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 5d9b340..d44603c 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -155,7 +155,7 @@ int sysfs_inode_init(void);
extern const struct file_operations sysfs_file_operations;
extern int sysfs_module_callback(struct notifier_block *self, unsigned long val,
void *data);
-
+void sysfs_file_check_suicide(struct sysfs_dirent *sd);
int sysfs_add_file(struct sysfs_dirent *dir_sd,
const struct attribute *attr, int type);
Interesting.
Fixing a read/writer deadlock by allowing the writers to nest
inside the readers.
My first impression is that it is too clever.
Furthermore I think this is walking around the edges of a more
general problem. How should we serial hotplug and hotunplug
in general. In what context should remove methods run in.
My impression is that we have a huge hole in our infrastructure
for hotplug drivers. Problems like how do we get a user space
context for the code to run in and how do we handle
multiple hotplug actions for overlapping device trees from
stomping on each other.
My hypothesis is once we solve this for the general case of
device hotplug and removal we won't need special support from
sysfs. At least not in the suicidal way.
We still have very weird cases such as the lock inversion that
we have today between rtnl_lock and active reference count,
coming from the networking code.
Eric
On Tue, 24 Mar 2009, Alex Chiang wrote:
> Hi all,
>
> This is a refreshed version of the patch series Tejun posted quite a while
> ago that allowed sysfs attributes to commit suicide directly:
>
> http://thread.gmane.org/gmane.linux.kernel/582130/
> The most contentious part is patch 1/3, wherein sysfs abuses the
> module notifier call chain, and basically prevents all module unloads
> until suicidal sysfs attributes have completed.
>
> This is poison of a different flavor from last time. The earlier version
> of this series modified the module API and created an interface that
> allowed anyone to inhibit module unload.
>
> This time, only sysfs is allowed to be so... special. Which is a slight
> improvement, but the question as to whether sysfs should be allowed to
> do something like this is unresolved.
I tend to agree with Eric that this feels a little like a band-aid, and
a more general solution would be preferable. But I don't have one to
offer, and getting the immediate problems fixed is also important.
Why change the inhibit-module-unload interface? This new approach
seems a lot more complicated than needed; a simple rwsem should work
okay. Exposing it to the entire kernel when only sysfs uses it doesn't
matter -- there must be plenty of EXPORTed symbols with only one user.
Which reminds me... What happens if two different processes write to
the same suicidal sysfs attribute at the same time?
Alan Stern
* Eric W. Biederman <[email protected]>:
>
> Interesting.
>
> Fixing a read/writer deadlock by allowing the writers to nest
> inside the readers.
>
> My first impression is that it is too clever.
Clever points go to Tejun. All I did was refresh the series
slightly. :)
> Furthermore I think this is walking around the edges of a more
> general problem. How should we serial hotplug and hotunplug
> in general. In what context should remove methods run in.
>
> My impression is that we have a huge hole in our infrastructure
> for hotplug drivers. Problems like how do we get a user space
> context for the code to run in and how do we handle
> multiple hotplug actions for overlapping device trees from
> stomping on each other.
>
> My hypothesis is once we solve this for the general case of
> device hotplug and removal we won't need special support from
> sysfs. At least not in the suicidal way.
I agree that we have problems in our infrastructure, especially,
as you point out, overlapping device trees, etc.
I see your point about adding extra cruft into sysfs to work
around a special case while leaving the hard problem unsolved.
Perhaps the status quo is better. I do think that getting
suicidal sysfs attributes off the global workqueue is a band-aid
that actually helps, vs. the proposed patches here which are
questionable in nature.
Oh well.
Thanks for the comments.
/ac
>
> We still have very weird cases such as the lock inversion that
> we have today between rtnl_lock and active reference count,
> coming from the networking code.
>
> Eric
>
* Alan Stern <[email protected]>:
> On Tue, 24 Mar 2009, Alex Chiang wrote:
>
> > Hi all,
> >
> > This is a refreshed version of the patch series Tejun posted quite a while
> > ago that allowed sysfs attributes to commit suicide directly:
> >
> > http://thread.gmane.org/gmane.linux.kernel/582130/
>
> > The most contentious part is patch 1/3, wherein sysfs abuses the
> > module notifier call chain, and basically prevents all module unloads
> > until suicidal sysfs attributes have completed.
> >
> > This is poison of a different flavor from last time. The earlier version
> > of this series modified the module API and created an interface that
> > allowed anyone to inhibit module unload.
> >
> > This time, only sysfs is allowed to be so... special. Which is a slight
> > improvement, but the question as to whether sysfs should be allowed to
> > do something like this is unresolved.
>
> I tend to agree with Eric that this feels a little like a band-aid, and
> a more general solution would be preferable. But I don't have one to
> offer, and getting the immediate problems fixed is also important.
Well, getting the sysfs callback off the global workqueue is an
immediate fix that:
- introduces no conceptual change
- fixes the lockdep false positive
- doesn't try to be clever with references
If the consensus here is that this suicide patch series is simply
a band-aid, then I think my other patch will have solved the
problem as much as possible without getting mired in a
conversation about truth and beauty.
> Why change the inhibit-module-unload interface? This new approach
> seems a lot more complicated than needed; a simple rwsem should work
> okay. Exposing it to the entire kernel when only sysfs uses it doesn't
> matter -- there must be plenty of EXPORTed symbols with only one user.
My concern was more the other way around, that exposing a
sledgehammer interface to anyone who wants to inhibit module
unload might not seem like such a wise choice.
I felt that going through the blocking notifier call chain was a
little more proper, in the sense of, "ok well we're going to
allow this inhibit-unload but we know exactly who's doing it".
But that seems irrelevant now.
> Which reminds me... What happens if two different processes write to
> the same suicidal sysfs attribute at the same time?
Good question; I didn't test that with Tejun's patches.
Using the callback mechanism, and a recent patch I wrote that
Greg accepted for 2.6.30, we only allow one in-flight callback
per sysfs attribute/kobject at a time. The loser of the race gets
-EAGAIN while the remove is occurring, and then when the
attribute goes away, gets "file not found" (or something
similar).
Thanks.
/ac
Alex Chiang <[email protected]> writes:
> * Eric W. Biederman <[email protected]>:
>>
>> Interesting.
>>
>> Fixing a read/writer deadlock by allowing the writers to nest
>> inside the readers.
>>
>> My first impression is that it is too clever.
>
> Clever points go to Tejun. All I did was refresh the series
> slightly. :)
>
>> Furthermore I think this is walking around the edges of a more
>> general problem. How should we serial hotplug and hotunplug
>> in general. In what context should remove methods run in.
>>
>> My impression is that we have a huge hole in our infrastructure
>> for hotplug drivers. Problems like how do we get a user space
>> context for the code to run in and how do we handle
>> multiple hotplug actions for overlapping device trees from
>> stomping on each other.
>>
>> My hypothesis is once we solve this for the general case of
>> device hotplug and removal we won't need special support from
>> sysfs. At least not in the suicidal way.
>
> I agree that we have problems in our infrastructure, especially,
> as you point out, overlapping device trees, etc.
>
> I see your point about adding extra cruft into sysfs to work
> around a special case while leaving the hard problem unsolved.
>
> Perhaps the status quo is better. I do think that getting
> suicidal sysfs attributes off the global workqueue is a band-aid
> that actually helps, vs. the proposed patches here which are
> questionable in nature.
Sounds like it. I'm not trying to shoot this down, rather
I'm trying to figure out how to solve this cleanly, as I am slowly
trying to sort out the pci hotplug and unplug issues.
I'm not certain how general we can be. pci layer, device layer or kobject
layer, but I think it makes sense to have a dedicated work queue to use
when devices are removed. As every hotplug driver currently has to
invent one. The fake hotplug code is very normal in this respect.
If we can get the work queue creation and the calling of remove put
into the generic pci layer, we should be able to simply all of the
hotplug controller drivers.
I'm not seeing a patch from you where you are using a separate
workqueue. Am I missing something? But if we can place that workqueue
in say the pci layer I think it would be just a little re factoring
and not a lot more code.
Eric
* Eric W. Biederman <[email protected]>:
> Alex Chiang <[email protected]> writes:
>
> > * Eric W. Biederman <[email protected]>:
> >>
> >> Interesting.
> >>
> >> Fixing a read/writer deadlock by allowing the writers to nest
> >> inside the readers.
> >>
> >> My first impression is that it is too clever.
> >
> > Clever points go to Tejun. All I did was refresh the series
> > slightly. :)
> >
> >> Furthermore I think this is walking around the edges of a more
> >> general problem. How should we serial hotplug and hotunplug
> >> in general. In what context should remove methods run in.
> >>
> >> My impression is that we have a huge hole in our infrastructure
> >> for hotplug drivers. Problems like how do we get a user space
> >> context for the code to run in and how do we handle
> >> multiple hotplug actions for overlapping device trees from
> >> stomping on each other.
> >>
> >> My hypothesis is once we solve this for the general case of
> >> device hotplug and removal we won't need special support from
> >> sysfs. At least not in the suicidal way.
> >
> > I agree that we have problems in our infrastructure, especially,
> > as you point out, overlapping device trees, etc.
> >
> > I see your point about adding extra cruft into sysfs to work
> > around a special case while leaving the hard problem unsolved.
> >
> > Perhaps the status quo is better. I do think that getting
> > suicidal sysfs attributes off the global workqueue is a band-aid
> > that actually helps, vs. the proposed patches here which are
> > questionable in nature.
>
> Sounds like it. I'm not trying to shoot this down, rather
> I'm trying to figure out how to solve this cleanly, as I am slowly
> trying to sort out the pci hotplug and unplug issues.
Please do keep me informed on any progress you make or thoughts
you have here.
> I'm not certain how general we can be. pci layer, device layer or kobject
> layer, but I think it makes sense to have a dedicated work queue to use
> when devices are removed. As every hotplug driver currently has to
> invent one. The fake hotplug code is very normal in this respect.
>
> If we can get the work queue creation and the calling of remove put
> into the generic pci layer, we should be able to simply all of the
> hotplug controller drivers.
Hm, that is a good idea.
Simplifying all the various hotplug drivers is on my TODO list,
but it's a long and tricky process. I agree though, there is no
reason why they should all be as complicated as they are.
> I'm not seeing a patch from you where you are using a separate
> workqueue. Am I missing something?
http://lkml.org/lkml/2009/3/25/489
But I suspect that is not the workqueue you are looking for. ;)
> But if we can place that workqueue in say the pci layer I think
> it would be just a little re factoring and not a lot more code.
The PCI layer doesn't need a workqueue to remove devices, not on
its own behalf.
You are talking about providing something for the benefit of all
the hotplug drivers, right?
Thanks.
/ac
Hello,
Eric W. Biederman wrote:
>>> Fixing a read/writer deadlock by allowing the writers to nest
>>> inside the readers.
>>>
>>> My first impression is that it is too clever.
>> Clever points go to Tejun. All I did was refresh the series
>> slightly. :)
Thanks for the points. I do agree that it could be a bit too clever,
but the thing is that protecting the code area from going underneath
something is a pretty special thing to begin with and I think it's
better to apply special solution rather than trying to work around it
using general mechanisms. So, I actually think the global inhibit
thing is one of the better ways to deal with the nasty-in-nature
problem.
>>> My hypothesis is once we solve this for the general case of
>>> device hotplug and removal we won't need special support from
>>> sysfs. At least not in the suicidal way.
>> I agree that we have problems in our infrastructure, especially,
>> as you point out, overlapping device trees, etc.
I don't really see how some grand general solution would solve
deadlock problem at sysfs layer, care to elaborate a bit?
>> I see your point about adding extra cruft into sysfs to work
>> around a special case while leaving the hard problem unsolved.
>>
>> Perhaps the status quo is better. I do think that getting
>> suicidal sysfs attributes off the global workqueue is a band-aid
>> that actually helps, vs. the proposed patches here which are
>> questionable in nature.
>
> Sounds like it. I'm not trying to shoot this down, rather
> I'm trying to figure out how to solve this cleanly, as I am slowly
> trying to sort out the pci hotplug and unplug issues.
The problem I see is that there aren't too many users and the solution
is a bit too narrow focused, but with increasing support for
hotplug/unplug, I think the problem is becoming more widespread and
the workqueue solution is quite fragile and cumbersome for each and
every user, so unless there are other directions we can pursue (the
general one above maybe?), I think it's better to add a bit of
complexity to sysfs rather than forcing everyone user of it to do it.
Thanks.
--
tejun
Alex Chiang <[email protected]> writes:
>> Sounds like it. I'm not trying to shoot this down, rather
>> I'm trying to figure out how to solve this cleanly, as I am slowly
>> trying to sort out the pci hotplug and unplug issues.
>
> Please do keep me informed on any progress you make or thoughts
> you have here.
>
>> I'm not certain how general we can be. pci layer, device layer or kobject
>> layer, but I think it makes sense to have a dedicated work queue to use
>> when devices are removed. As every hotplug driver currently has to
>> invent one. The fake hotplug code is very normal in this respect.
>>
>> If we can get the work queue creation and the calling of remove put
>> into the generic pci layer, we should be able to simply all of the
>> hotplug controller drivers.
>
> Hm, that is a good idea.
>
> Simplifying all the various hotplug drivers is on my TODO list,
> but it's a long and tricky process. I agree though, there is no
> reason why they should all be as complicated as they are.
>
>> I'm not seeing a patch from you where you are using a separate
>> workqueue. Am I missing something?
>
> http://lkml.org/lkml/2009/3/25/489
>
> But I suspect that is not the workqueue you are looking for. ;)
Not quite.
>> But if we can place that workqueue in say the pci layer I think
>> it would be just a little re factoring and not a lot more code.
>
> The PCI layer doesn't need a workqueue to remove devices, not on
> its own behalf.
>
> You are talking about providing something for the benefit of all
> the hotplug drivers, right?
Yes. The common case is that we discover a card needs to be or
has been removed from an interrupt handler.
Eric
Tejun Heo <[email protected]> writes:
> Thanks for the points. I do agree that it could be a bit too clever,
> but the thing is that protecting the code area from going underneath
> something is a pretty special thing to begin with and I think it's
> better to apply special solution rather than trying to work around it
> using general mechanisms. So, I actually think the global inhibit
> thing is one of the better ways to deal with the nasty-in-nature
> problem.
Protecting the data structures from going away is just as important,
and the module_inhibit does not address that.
When I looked at it I could not see any touches of kobj in the sysfs
code after we dropped the reference count in a strange place, but
I haven't been able to convince myself that we will be safe.
>>>> My hypothesis is once we solve this for the general case of
>>>> device hotplug and removal we won't need special support from
>>>> sysfs. At least not in the suicidal way.
>>> I agree that we have problems in our infrastructure, especially,
>>> as you point out, overlapping device trees, etc.
>
> I don't really see how some grand general solution would solve
> deadlock problem at sysfs layer, care to elaborate a bit?
See below. I'm really not thinking of doing anything different
just putting the code somewhere different that sysfs.
>>> I see your point about adding extra cruft into sysfs to work
>>> around a special case while leaving the hard problem unsolved.
>>>
>>> Perhaps the status quo is better. I do think that getting
>>> suicidal sysfs attributes off the global workqueue is a band-aid
>>> that actually helps, vs. the proposed patches here which are
>>> questionable in nature.
>>
>> Sounds like it. I'm not trying to shoot this down, rather
>> I'm trying to figure out how to solve this cleanly, as I am slowly
>> trying to sort out the pci hotplug and unplug issues.
>
> The problem I see is that there aren't too many users and the solution
> is a bit too narrow focused, but with increasing support for
> hotplug/unplug, I think the problem is becoming more widespread and
> the workqueue solution is quite fragile and cumbersome for each and
> every user, so unless there are other directions we can pursue (the
> general one above maybe?), I think it's better to add a bit of
> complexity to sysfs rather than forcing everyone user of it to do it.
My view is that this is a general hotplug problem and not a sysfs problem.
Further I see inhibiting module reload as only solving have the problem
as dropping the kobject reference opens a window to a use after free on
the kobj.
The problem that I see is that we are missing support from the device
model for hotunplug. Running the device remove method from process
context is required. Typically hotplug controllers discover a
device has been removed or will be removed in interrupt context.
Therefore every hotplug driver I have looked at has it's own workqueue
to solve the problem of getting the notification of a hotplug event
from an inappropriate context.
So the general problem that I see is that I need a solution to trigger
a remove from interrupt context and that same solution will happen to
work just fine for sysfs.
Eric
Hello,
Eric W. Biederman wrote:
> Tejun Heo <[email protected]> writes:
>
>> Thanks for the points. I do agree that it could be a bit too clever,
>> but the thing is that protecting the code area from going underneath
>> something is a pretty special thing to begin with and I think it's
>> better to apply special solution rather than trying to work around it
>> using general mechanisms. So, I actually think the global inhibit
>> thing is one of the better ways to deal with the nasty-in-nature
>> problem.
>
> Protecting the data structures from going away is just as important,
> and the module_inhibit does not address that.
Yeap, I was talking about the code issue only.
> When I looked at it I could not see any touches of kobj in the sysfs
> code after we dropped the reference count in a strange place, but
> I haven't been able to convince myself that we will be safe.
The reference is dropped when the suiciding thread calls delete on the
sysfs node. It forfeits its right to access the object when it
deletes it, which makes sense. The things which are guaranteed after
deleting the base object are the code it's running off of and the
sysfs object itself. I think it's pretty intuitive from user's POV.
> My view is that this is a general hotplug problem and not a sysfs
> problem. Further I see inhibiting module reload as only solving
> have the problem as dropping the kobject reference opens a window to
> a use after free on the kobj.
kobject_del(obj); obj->whatever; isn't any different from kfree(p);
*p;. If the caller accesses the object after deleting it, it's gonna
fail unless it already held a separate reference count. There is no
window.
> The problem that I see is that we are missing support from the device
> model for hotunplug. Running the device remove method from process
> context is required. Typically hotplug controllers discover a
> device has been removed or will be removed in interrupt context.
>
> Therefore every hotplug driver I have looked at has it's own workqueue
> to solve the problem of getting the notification of a hotplug event
> from an inappropriate context.
>
> So the general problem that I see is that I need a solution to trigger
> a remove from interrupt context and that same solution will happen to
> work just fine for sysfs.
I think the problem is more driver domain specific and not quite sure
whether one size would fit all. We have a lot of drivers in the tree.
I think the best approach would be trying to move upwards from the
bottom. ie. Consolidate hotplug / error handling support from low
level drivers to specific driver subsystem, from driver subsystems to
higher layer (ie. block layer) and then see whether there can be more
commonalities which can be factored, but the chance is that once
things are pushed upwards enough, moving it into the kobject layer
probably wouldn't worth the trouble. Well, it's all speculations at
this point tho.
Thanks.
--
tejun
Hello,
Alex Chiang wrote:
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 39320a5..993edd1 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -583,6 +583,8 @@ void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt)
> sd->s_sibling = NULL;
>
> sysfs_drop_dentry(sd);
> + if (sysfs_type(sd) == SYSFS_KOBJ_ATTR)
> + sysfs_file_check_suicide(sd);
> sysfs_deactivate(sd);
> sysfs_put(sd);
> }
I think there's a hole here. sysfs_file_check_suicide() should be
done inside sysfs_deactivate() such that commiting suicide atomically
deactivates the sd. This will solve the multiple writes to suicide
node problem nicely.
Thanks.
--
tejun
On Thu, 26 Mar 2009, Tejun Heo wrote:
> > The problem that I see is that we are missing support from the device
> > model for hotunplug. Running the device remove method from process
> > context is required. Typically hotplug controllers discover a
> > device has been removed or will be removed in interrupt context.
> >
> > Therefore every hotplug driver I have looked at has it's own workqueue
> > to solve the problem of getting the notification of a hotplug event
> > from an inappropriate context.
> >
> > So the general problem that I see is that I need a solution to trigger
> > a remove from interrupt context and that same solution will happen to
> > work just fine for sysfs.
>
> I think the problem is more driver domain specific and not quite sure
> whether one size would fit all. We have a lot of drivers in the tree.
> I think the best approach would be trying to move upwards from the
> bottom. ie. Consolidate hotplug / error handling support from low
> level drivers to specific driver subsystem, from driver subsystems to
> higher layer (ie. block layer) and then see whether there can be more
> commonalities which can be factored, but the chance is that once
> things are pushed upwards enough, moving it into the kobject layer
> probably wouldn't worth the trouble. Well, it's all speculations at
> this point tho.
It sounds like Eric's point is that sysfs_schedule_callback() is a
special-purpose interface devoted to sysfs, whereas a more generally
useful interface would allow delayed unregistration of any struct
device. (Or perhaps delayed invocation of an arbitrary function with a
struct device as the argument, but unregistration is the single most
important usage.)
The actual interface wouldn't be very different from
sysfs_schedule_callback(). In fact, the only changes I see offhand are
that the new routine would accept a pointer to struct device instead of
a pointer to struct kobject and there wouldn't be any func argument.
The idea is that this would come in useful both for suicidal sysfs
attributes and for hot-unplug events detected by an interrupt handler.
But there's something I'm not clear on. If hot-unplug events are
detected by an interrupt handler, then what about hot-plug events?
Wouldn't they be detected by the same interrupt handler? Obviously you
can't register new devices in interrupt context, so there must be a
workqueue or kernel thread involved somewhere. Shouldn't the two types
of events be managed by the same workqueue/thread?
Alan Stern
On Thu, 26 Mar 2009 10:21:23 -0400 (EDT),
Alan Stern <[email protected]> wrote:
> The idea is that this would come in useful both for suicidal sysfs
> attributes and for hot-unplug events detected by an interrupt handler.
Yes; the ccw bus uses it's own workqueue, so it doesn't need
device_schedule_callback() to commit suicide. I guess other busses
could do the same.
>
> But there's something I'm not clear on. If hot-unplug events are
> detected by an interrupt handler, then what about hot-plug events?
> Wouldn't they be detected by the same interrupt handler? Obviously you
> can't register new devices in interrupt context, so there must be a
> workqueue or kernel thread involved somewhere. Shouldn't the two types
> of events be managed by the same workqueue/thread?
They should, you want to serialize plug/unplug. You'll even want to use
the same queue for plug/unplug not detected in interrupt context.
The next question is how granular those workqueues should be. Per
subsystem? Per bus? Something else?