Hi Vegard, sysfs folks,
Vegard was nice enough to test my PCI remove/rescan patches under
kmemcheck. Maybe "torture" is a more appropriate term. ;)
My patch series introduces a sysfs "remove" attribute for PCI
devices, which will remove that device (and child devices).
http://thread.gmane.org/gmane.linux.kernel.pci/3495
Vegard decided that he wanted to do something like:
# while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
which caused a nasty oops in my code. You can see the results of
his testing in the thread I referenced above.
After looking at my code for a bit, I decided that maybe it
wasn't completely my fault. ;) See, I'm using device_schedule_callback()
which really is a wrapper around sysfs_schedule_callback() which
is the way that a sysfs attribute is supposed to remove itself to
prevent deadlock.
The problem that Vegard's test exposed is that if you repeatedly
call a sysfs attribute that's supposed to remove itself using
device_schedule_callback, we'll keep scheduling work queue tasks
with a kobj that we really want to release.
[nb, I bet that /sys/bus/scsi/devices/.../delete will exhibit the
same problems]
This is very racy, and at some point, whatever remove handler
we've scheduled with device_schedule_callback will end up
referencing a freed kobj.
I came up with the below patch which changes the semantics of
device/sysfs_schedule_callback. We now only allow one in-flight
callback per kobj, and return -EBUSY if that kobj already has a
callback scheduled for it.
This patch, along with my updated 07/11 patch in my series,
prevents at least the first oops that Vegard reported, and I
suspect it prevents the second kmemcheck error too, although I
haven't tested under kmemcheck (yet*).
I'm looking for comments on the approach I took, specifically:
- are we ok with the new restriction I imposed?
- is it ok to return -EBUSY to our callers?
- is the simple linked list proof of concept
implementation going to scale too poorly?
To answer my own first two questions, I checked for callers of
both device_ and sysfs_schedule_callback, and it looks like
everyone is using it the same way: to schedule themselves for
removal. That is, although the interface could be used to
schedule any sort of callback, the only use case is the removal
use case. I don't think it will be a problem to limit ourselves
to one remove callback per kobj.
Maybe this patch really wants to be a new interface called
sysfs_schedule_callback_once or _single, where we check for an
already-scheduled callback for a kobj, and if we pass, then we
simply continue on to the existing, unchanged
sysfs_schedule_callback. I don't feel too strongly about creating
a new interface, but my belief is that changing the semantics of
the existing interface is probably the better solution.
My opinion on my own third question is that removing a device is
not in the performance path, so a simple linked list is
sufficient.
Depending on the feedback here, I'll resend this patch with a
full changelog (and giving credit to Vegard/kmemcheck as Ingo
requested I do) or I can rework it.
Thanks.
/ac
*: I googled for kmemcheck and the most recent tree I found was
from 2008. Is that really true? Do the patches apply to current
upstream?
---
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 1f4a3f8..e05a172 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -659,13 +659,16 @@ void sysfs_remove_file_from_group(struct kobject *kobj,
EXPORT_SYMBOL_GPL(sysfs_remove_file_from_group);
struct sysfs_schedule_callback_struct {
- struct kobject *kobj;
+ struct list_head workq_list;
+ struct kobject *kobj;
void (*func)(void *);
void *data;
struct module *owner;
struct work_struct work;
};
+static DEFINE_MUTEX(sysfs_workq_mutex);
+static LIST_HEAD(sysfs_workq);
static void sysfs_schedule_callback_work(struct work_struct *work)
{
struct sysfs_schedule_callback_struct *ss = container_of(work,
@@ -674,6 +677,9 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
(ss->func)(ss->data);
kobject_put(ss->kobj);
module_put(ss->owner);
+ mutex_lock(&sysfs_workq_mutex);
+ list_del(&ss->workq_list);
+ mutex_unlock(&sysfs_workq_mutex);
kfree(ss);
}
@@ -700,10 +706,19 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
void *data, struct module *owner)
{
- struct sysfs_schedule_callback_struct *ss;
+ struct sysfs_schedule_callback_struct *ss, *tmp;
if (!try_module_get(owner))
return -ENODEV;
+
+ mutex_lock(&sysfs_workq_mutex);
+ list_for_each_entry_safe(ss, tmp, &sysfs_workq, workq_list)
+ if (ss->kobj == kobj) {
+ mutex_unlock(&sysfs_workq_mutex);
+ return -EBUSY;
+ }
+ mutex_unlock(&sysfs_workq_mutex);
+
ss = kmalloc(sizeof(*ss), GFP_KERNEL);
if (!ss) {
module_put(owner);
@@ -715,6 +730,10 @@ int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
ss->data = data;
ss->owner = owner;
INIT_WORK(&ss->work, sysfs_schedule_callback_work);
+ INIT_LIST_HEAD(&ss->workq_list);
+ mutex_lock(&sysfs_workq_mutex);
+ list_add_tail(&ss->workq_list, &sysfs_workq);
+ mutex_unlock(&sysfs_workq_mutex);
schedule_work(&ss->work);
return 0;
}
On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
> Hi Vegard, sysfs folks,
>
> Vegard was nice enough to test my PCI remove/rescan patches under
> kmemcheck. Maybe "torture" is a more appropriate term. ;)
>
> My patch series introduces a sysfs "remove" attribute for PCI
> devices, which will remove that device (and child devices).
>
> http://thread.gmane.org/gmane.linux.kernel.pci/3495
>
> Vegard decided that he wanted to do something like:
>
> # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
>
> which caused a nasty oops in my code. You can see the results of
> his testing in the thread I referenced above.
>
> After looking at my code for a bit, I decided that maybe it
> wasn't completely my fault. ;) See, I'm using device_schedule_callback()
why? Are you really in interrupt context here to need to do the remove
at a later time?
> which really is a wrapper around sysfs_schedule_callback() which
> is the way that a sysfs attribute is supposed to remove itself to
> prevent deadlock.
Yeah, it's the scsi code that needed this mess :(
If at all possible, I would recommend not using it.
> The problem that Vegard's test exposed is that if you repeatedly
> call a sysfs attribute that's supposed to remove itself using
> device_schedule_callback, we'll keep scheduling work queue tasks
> with a kobj that we really want to release.
>
> [nb, I bet that /sys/bus/scsi/devices/.../delete will exhibit the
> same problems]
I think it's harder to remove devices multiple times, as it isn't done
through sysfs, but from another external request. But I could be wrong.
> This is very racy, and at some point, whatever remove handler
> we've scheduled with device_schedule_callback will end up
> referencing a freed kobj.
>
> I came up with the below patch which changes the semantics of
> device/sysfs_schedule_callback. We now only allow one in-flight
> callback per kobj, and return -EBUSY if that kobj already has a
> callback scheduled for it.
>
> This patch, along with my updated 07/11 patch in my series,
> prevents at least the first oops that Vegard reported, and I
> suspect it prevents the second kmemcheck error too, although I
> haven't tested under kmemcheck (yet*).
>
> I'm looking for comments on the approach I took, specifically:
>
> - are we ok with the new restriction I imposed?
> - is it ok to return -EBUSY to our callers?
> - is the simple linked list proof of concept
> implementation going to scale too poorly?
>
> To answer my own first two questions, I checked for callers of
> both device_ and sysfs_schedule_callback, and it looks like
> everyone is using it the same way: to schedule themselves for
> removal. That is, although the interface could be used to
> schedule any sort of callback, the only use case is the removal
> use case. I don't think it will be a problem to limit ourselves
> to one remove callback per kobj.
I agree.
> Maybe this patch really wants to be a new interface called
> sysfs_schedule_callback_once or _single, where we check for an
> already-scheduled callback for a kobj, and if we pass, then we
> simply continue on to the existing, unchanged
> sysfs_schedule_callback. I don't feel too strongly about creating
> a new interface, but my belief is that changing the semantics of
> the existing interface is probably the better solution.
>
> My opinion on my own third question is that removing a device is
> not in the performance path, so a simple linked list is
> sufficient.
>
> Depending on the feedback here, I'll resend this patch with a
> full changelog (and giving credit to Vegard/kmemcheck as Ingo
> requested I do) or I can rework it.
I have no objection to this change. But I really would recommend not
using this interface at all if possible.
thanks,
greg k-h
* Greg KH <[email protected]>:
> On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
> > Hi Vegard, sysfs folks,
> >
> > Vegard was nice enough to test my PCI remove/rescan patches under
> > kmemcheck. Maybe "torture" is a more appropriate term. ;)
> >
> > My patch series introduces a sysfs "remove" attribute for PCI
> > devices, which will remove that device (and child devices).
> >
> > http://thread.gmane.org/gmane.linux.kernel.pci/3495
> >
> > Vegard decided that he wanted to do something like:
> >
> > # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
> >
> > which caused a nasty oops in my code. You can see the results of
> > his testing in the thread I referenced above.
> >
> > After looking at my code for a bit, I decided that maybe it
> > wasn't completely my fault. ;) See, I'm using device_schedule_callback()
>
> why? Are you really in interrupt context here to need to do the remove
> at a later time?
What other interface can I use to remove objects from sysfs?
Demo:
Here's the topology of my machine. Sorry about the long line, but
device 0000:04 is a complex device with an internal bridge and 4
PCI functions.
[root@tahitifp1 ~]# lspci -vt
+-[0000:03]---00.0-[0000:04-07]----00.0-[0000:05-07]--+-02.0-[0000:06]--+-00.0 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
| | \-00.1 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
| \-04.0-[0000:07]--+-00.0 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
| \-00.1 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
Here's the state of my machine before I do anything.
[root@tahitifp1 ~]# ls /sys/bus/pci/devices/
0000:00:01.0 0000:00:02.2 0000:02:02.0 0000:05:04.0 0000:08:00.0
0000:00:01.1 0000:00:04.0 0000:02:02.1 0000:06:00.0 0000:0a:00.0
0000:00:01.2 0000:01:01.0 0000:03:00.0 0000:06:00.1
0000:00:02.0 0000:01:01.1 0000:04:00.0 0000:07:00.0
0000:00:02.1 0000:02:01.0 0000:05:02.0 0000:07:00.1
We see the following PCI buses.
[root@tahitifp1 ~]# ls /sys/class/pci_bus/
0000:00 0000:02 0000:04 0000:06 0000:08 0000:0a
0000:01 0000:03 0000:05 0000:07 0000:09 0000:0b
Use my shiny new patch set. :)
[root@tahitifp1 ~]# echo 1 > /sys/bus/pci/devices/0000\:04\:00.0/remove
Now device 0000:04:00.0 is gone, as are all the devices below it.
Just as we expect.
[root@tahitifp1 ~]# ls /sys/bus/pci/devices/
0000:00:01.0 0000:00:02.0 0000:00:04.0 0000:02:01.0 0000:03:00.0
0000:00:01.1 0000:00:02.1 0000:01:01.0 0000:02:02.0 0000:08:00.0
0000:00:01.2 0000:00:02.2 0000:01:01.1 0000:02:02.1 0000:0a:00.0
And the PCI buses are gone too.
[root@tahitifp1 ~]# ls /sys/class/pci_bus/
0000:00 0000:01 0000:02 0000:03 0000:08 0000:09 0000:0a 0000:0b
Let's get them back with the 2nd part of my shiny patch set.
[root@tahitifp1 ~]# echo 1 > /sys/bus/pci/rescan
All the devices came back:
[root@tahitifp1 ~]# ls /sys/bus/pci/devices/
0000:00:01.0 0000:00:02.2 0000:02:02.0 0000:05:04.0 0000:08:00.0
0000:00:01.1 0000:00:04.0 0000:02:02.1 0000:06:00.0 0000:0a:00.0
0000:00:01.2 0000:01:01.0 0000:03:00.0 0000:06:00.1
0000:00:02.0 0000:01:01.1 0000:04:00.0 0000:07:00.0
0000:00:02.1 0000:02:01.0 0000:05:02.0 0000:07:00.1
And so did our PCI buses.
[root@tahitifp1 ~]# ls /sys/class/pci_bus/
0000:00 0000:02 0000:04 0000:06 0000:08 0000:0a
0000:01 0000:03 0000:05 0000:07 0000:09 0000:0b
What other internal API should I be using for something like the
above to work? Especially if someone does something crazy like
that while loop that Vegard showed?
As I understand it, attempting to allow device 0000:04:00.0 to
directly remove itself in the sysfs attribute's store method will
result in deadlock, is that not the case?
And if it will result in deadlock, then I have to use the
device_schedule_callback() API, right?
> > which really is a wrapper around sysfs_schedule_callback() which
> > is the way that a sysfs attribute is supposed to remove itself to
> > prevent deadlock.
>
> Yeah, it's the scsi code that needed this mess :(
>
> If at all possible, I would recommend not using it.
>
> > The problem that Vegard's test exposed is that if you repeatedly
> > call a sysfs attribute that's supposed to remove itself using
> > device_schedule_callback, we'll keep scheduling work queue tasks
> > with a kobj that we really want to release.
> >
> > [nb, I bet that /sys/bus/scsi/devices/.../delete will exhibit the
> > same problems]
>
> I think it's harder to remove devices multiple times, as it isn't done
> through sysfs, but from another external request. But I could be wrong.
I'm not sure I understand what you mean by this.
> > This is very racy, and at some point, whatever remove handler
> > we've scheduled with device_schedule_callback will end up
> > referencing a freed kobj.
> >
> > I came up with the below patch which changes the semantics of
> > device/sysfs_schedule_callback. We now only allow one in-flight
> > callback per kobj, and return -EBUSY if that kobj already has a
> > callback scheduled for it.
> >
> > This patch, along with my updated 07/11 patch in my series,
> > prevents at least the first oops that Vegard reported, and I
> > suspect it prevents the second kmemcheck error too, although I
> > haven't tested under kmemcheck (yet*).
> >
> > I'm looking for comments on the approach I took, specifically:
> >
> > - are we ok with the new restriction I imposed?
> > - is it ok to return -EBUSY to our callers?
> > - is the simple linked list proof of concept
> > implementation going to scale too poorly?
> >
> > To answer my own first two questions, I checked for callers of
> > both device_ and sysfs_schedule_callback, and it looks like
> > everyone is using it the same way: to schedule themselves for
> > removal. That is, although the interface could be used to
> > schedule any sort of callback, the only use case is the removal
> > use case. I don't think it will be a problem to limit ourselves
> > to one remove callback per kobj.
>
> I agree.
>
> > Maybe this patch really wants to be a new interface called
> > sysfs_schedule_callback_once or _single, where we check for an
> > already-scheduled callback for a kobj, and if we pass, then we
> > simply continue on to the existing, unchanged
> > sysfs_schedule_callback. I don't feel too strongly about creating
> > a new interface, but my belief is that changing the semantics of
> > the existing interface is probably the better solution.
> >
> > My opinion on my own third question is that removing a device is
> > not in the performance path, so a simple linked list is
> > sufficient.
> >
> > Depending on the feedback here, I'll resend this patch with a
> > full changelog (and giving credit to Vegard/kmemcheck as Ingo
> > requested I do) or I can rework it.
>
> I have no objection to this change. But I really would recommend not
> using this interface at all if possible.
I'm open to other suggestions, but my imagination (or grepping
skills) must be limited. :)
If, as I suspect, I really do need to use that interface,
please let me know and I'll re-submit my patch with proper
changelog, etc.
Thanks.
/ac
Alex Chiang wrote:
> * Greg KH <[email protected]>:
>> On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
>>> Hi Vegard, sysfs folks,
>>>
>>> Vegard was nice enough to test my PCI remove/rescan patches under
>>> kmemcheck. Maybe "torture" is a more appropriate term. ;)
>>>
>>> My patch series introduces a sysfs "remove" attribute for PCI
>>> devices, which will remove that device (and child devices).
>>>
>>> http://thread.gmane.org/gmane.linux.kernel.pci/3495
>>>
>>> Vegard decided that he wanted to do something like:
>>>
>>> # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
>>>
>>> which caused a nasty oops in my code. You can see the results of
>>> his testing in the thread I referenced above.
>>>
>>> After looking at my code for a bit, I decided that maybe it
>>> wasn't completely my fault. ;) See, I'm using device_schedule_callback()
>> why? Are you really in interrupt context here to need to do the remove
>> at a later time?
>
> What other interface can I use to remove objects from sysfs?
I haven't read your code yet but I seem to recall doing something
similar. Ah.. okay, this one didn't get in and I forgot about this.
http://thread.gmane.org/gmane.linux.kernel/582130
But, yeah, committing suicide is currently quite hairy. I thought SCSI
did it correctly with all the grab/release dances. Does SCSI have the
problem too?
Thanks.
--
tejun
On Wed, Mar 11, 2009 at 01:03:59AM -0600, Alex Chiang wrote:
> * Greg KH <[email protected]>:
> > On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
> > > Hi Vegard, sysfs folks,
> > >
> > > Vegard was nice enough to test my PCI remove/rescan patches under
> > > kmemcheck. Maybe "torture" is a more appropriate term. ;)
> > >
> > > My patch series introduces a sysfs "remove" attribute for PCI
> > > devices, which will remove that device (and child devices).
> > >
> > > http://thread.gmane.org/gmane.linux.kernel.pci/3495
> > >
> > > Vegard decided that he wanted to do something like:
> > >
> > > # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
> > >
> > > which caused a nasty oops in my code. You can see the results of
> > > his testing in the thread I referenced above.
> > >
> > > After looking at my code for a bit, I decided that maybe it
> > > wasn't completely my fault. ;) See, I'm using device_schedule_callback()
> >
> > why? Are you really in interrupt context here to need to do the remove
> > at a later time?
>
> What other interface can I use to remove objects from sysfs?
What everyone else does, device_unregister()? SCSI can't do this
because it is in interrupt context at the moment, that is why it delays
the action from happening.
> Demo:
>
> Here's the topology of my machine. Sorry about the long line, but
> device 0000:04 is a complex device with an internal bridge and 4
> PCI functions.
>
> [root@tahitifp1 ~]# lspci -vt
> +-[0000:03]---00.0-[0000:04-07]----00.0-[0000:05-07]--+-02.0-[0000:06]--+-00.0 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
> | | \-00.1 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
> | \-04.0-[0000:07]--+-00.0 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
> | \-00.1 Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
>
> Here's the state of my machine before I do anything.
>
> [root@tahitifp1 ~]# ls /sys/bus/pci/devices/
> 0000:00:01.0 0000:00:02.2 0000:02:02.0 0000:05:04.0 0000:08:00.0
> 0000:00:01.1 0000:00:04.0 0000:02:02.1 0000:06:00.0 0000:0a:00.0
> 0000:00:01.2 0000:01:01.0 0000:03:00.0 0000:06:00.1
> 0000:00:02.0 0000:01:01.1 0000:04:00.0 0000:07:00.0
> 0000:00:02.1 0000:02:01.0 0000:05:02.0 0000:07:00.1
>
> We see the following PCI buses.
>
> [root@tahitifp1 ~]# ls /sys/class/pci_bus/
> 0000:00 0000:02 0000:04 0000:06 0000:08 0000:0a
> 0000:01 0000:03 0000:05 0000:07 0000:09 0000:0b
>
> Use my shiny new patch set. :)
>
> [root@tahitifp1 ~]# echo 1 > /sys/bus/pci/devices/0000\:04\:00.0/remove
>
> Now device 0000:04:00.0 is gone, as are all the devices below it.
> Just as we expect.
And this is just what other subsystems do (like USB) when their device
disappears, nothing new here :)
Why can't you use device_unregister()? Or, you could use device_del(),
which lets you rely on the fact that the device structure is still
around for a bit, but it will disappear from sysfs. Just don't forget
to do the final put_device() on it to free the memory and "really"
release it.
Or am I missing something else here?
thanks,
greg k-h
On Wed, 11 Mar 2009 08:32:28 -0700,
Greg KH <[email protected]> wrote:
> Why can't you use device_unregister()? Or, you could use device_del(),
> which lets you rely on the fact that the device structure is still
> around for a bit, but it will disappear from sysfs. Just don't forget
> to do the final put_device() on it to free the memory and "really"
> release it.
>
> Or am I missing something else here?
You can't unregister a device from one of its attribute callbacks, it
locks up in sysfs (removing the sysfs dir waits for all active
references to be dropped, but the reference obtained before calling
->store won't be dropped until after ->store returned...)
device_schedule_callback() was introduced to solve exactly that problem.
(For the original oops, I'd rather solve the problem by making sure the
caller doesn't trigger removal several times - should probably be less
code than the proposed patch?)
* Cornelia Huck <[email protected]>:
> On Wed, 11 Mar 2009 08:32:28 -0700,
> Greg KH <[email protected]> wrote:
>
> > Why can't you use device_unregister()? Or, you could use device_del(),
> > which lets you rely on the fact that the device structure is still
> > around for a bit, but it will disappear from sysfs. Just don't forget
> > to do the final put_device() on it to free the memory and "really"
> > release it.
> >
> > Or am I missing something else here?
>
> You can't unregister a device from one of its attribute callbacks, it
> locks up in sysfs (removing the sysfs dir waits for all active
> references to be dropped, but the reference obtained before calling
> ->store won't be dropped until after ->store returned...)
> device_schedule_callback() was introduced to solve exactly that problem.
Yes, that is exactly why I used device_schedule_callback.
> (For the original oops, I'd rather solve the problem by making sure the
> caller doesn't trigger removal several times - should probably be less
> code than the proposed patch?)
The scsi stuff [sdev_store_delete()] is susceptible to exactly
the same problem.
I could put the "only allow one removal per kobj" code into my
specific code path, but I think it makes more sense to put it in
the common code.
Thanks.
/ac
On Wed, Mar 11, 2009 at 06:47:29PM +0100, Cornelia Huck wrote:
> On Wed, 11 Mar 2009 08:32:28 -0700,
> Greg KH <[email protected]> wrote:
>
> > Why can't you use device_unregister()? Or, you could use device_del(),
> > which lets you rely on the fact that the device structure is still
> > around for a bit, but it will disappear from sysfs. Just don't forget
> > to do the final put_device() on it to free the memory and "really"
> > release it.
> >
> > Or am I missing something else here?
>
> You can't unregister a device from one of its attribute callbacks,
Doh! You're right, very sorry Alex for missing this.
> (For the original oops, I'd rather solve the problem by making sure the
> caller doesn't trigger removal several times - should probably be less
> code than the proposed patch?)
Any ideas on how to do this?
thanks,
greg k-h
* Greg KH <[email protected]>:
> On Wed, Mar 11, 2009 at 06:47:29PM +0100, Cornelia Huck wrote:
> > On Wed, 11 Mar 2009 08:32:28 -0700,
> > Greg KH <[email protected]> wrote:
> >
> > > Why can't you use device_unregister()? Or, you could use device_del(),
> > > which lets you rely on the fact that the device structure is still
> > > around for a bit, but it will disappear from sysfs. Just don't forget
> > > to do the final put_device() on it to free the memory and "really"
> > > release it.
> > >
> > > Or am I missing something else here?
> >
> > You can't unregister a device from one of its attribute callbacks,
>
> Doh! You're right, very sorry Alex for missing this.
That's ok.
> > (For the original oops, I'd rather solve the problem by making sure the
> > caller doesn't trigger removal several times - should probably be less
> > code than the proposed patch?)
>
> Any ideas on how to do this?
I still think the original patch I proposed is the right answer.
/ac
---
file.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
---
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 1f4a3f8..e05a172 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -659,13 +659,16 @@ void sysfs_remove_file_from_group(struct kobject *kobj,
EXPORT_SYMBOL_GPL(sysfs_remove_file_from_group);
struct sysfs_schedule_callback_struct {
- struct kobject *kobj;
+ struct list_head workq_list;
+ struct kobject *kobj;
void (*func)(void *);
void *data;
struct module *owner;
struct work_struct work;
};
+static DEFINE_MUTEX(sysfs_workq_mutex);
+static LIST_HEAD(sysfs_workq);
static void sysfs_schedule_callback_work(struct work_struct *work)
{
struct sysfs_schedule_callback_struct *ss = container_of(work,
@@ -674,6 +677,9 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
(ss->func)(ss->data);
kobject_put(ss->kobj);
module_put(ss->owner);
+ mutex_lock(&sysfs_workq_mutex);
+ list_del(&ss->workq_list);
+ mutex_unlock(&sysfs_workq_mutex);
kfree(ss);
}
@@ -700,10 +706,19 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
void *data, struct module *owner)
{
- struct sysfs_schedule_callback_struct *ss;
+ struct sysfs_schedule_callback_struct *ss, *tmp;
if (!try_module_get(owner))
return -ENODEV;
+
+ mutex_lock(&sysfs_workq_mutex);
+ list_for_each_entry_safe(ss, tmp, &sysfs_workq, workq_list)
+ if (ss->kobj == kobj) {
+ mutex_unlock(&sysfs_workq_mutex);
+ return -EBUSY;
+ }
+ mutex_unlock(&sysfs_workq_mutex);
+
ss = kmalloc(sizeof(*ss), GFP_KERNEL);
if (!ss) {
module_put(owner);
@@ -715,6 +730,10 @@ int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
ss->data = data;
ss->owner = owner;
INIT_WORK(&ss->work, sysfs_schedule_callback_work);
+ INIT_LIST_HEAD(&ss->workq_list);
+ mutex_lock(&sysfs_workq_mutex);
+ list_add_tail(&ss->workq_list, &sysfs_workq);
+ mutex_unlock(&sysfs_workq_mutex);
schedule_work(&ss->work);
return 0;
}
* Tejun Heo <[email protected]>:
> Alex Chiang wrote:
> > * Greg KH <[email protected]>:
> >> On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
> >>> Hi Vegard, sysfs folks,
> >>>
> >>> Vegard was nice enough to test my PCI remove/rescan patches under
> >>> kmemcheck. Maybe "torture" is a more appropriate term. ;)
> >>>
> >>> My patch series introduces a sysfs "remove" attribute for PCI
> >>> devices, which will remove that device (and child devices).
> >>>
> >>> http://thread.gmane.org/gmane.linux.kernel.pci/3495
> >>>
> >>> Vegard decided that he wanted to do something like:
> >>>
> >>> # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
> >>>
> >>> which caused a nasty oops in my code. You can see the results of
> >>> his testing in the thread I referenced above.
> >>>
> >>> After looking at my code for a bit, I decided that maybe it
> >>> wasn't completely my fault. ;) See, I'm using device_schedule_callback()
> >> why? Are you really in interrupt context here to need to do the remove
> >> at a later time?
> >
> > What other interface can I use to remove objects from sysfs?
>
> I haven't read your code yet but I seem to recall doing something
> similar. Ah.. okay, this one didn't get in and I forgot about this.
>
> http://thread.gmane.org/gmane.linux.kernel/582130
>
> But, yeah, committing suicide is currently quite hairy. I thought SCSI
> did it correctly with all the grab/release dances. Does SCSI have the
> problem too?
I haven't dived into the SCSI code yet, but they are doing some
sort of magic that I don't understand with their state machine.
Regardless, I think we have two issues.
1. The existing callback mechanism that everyone hates
has a "bug".
2. Your suicide patches haven't made it into mainline yet.
The reason that I think that the "bug" is with the callback
mechanism is because any caller can repeatedly schedule suicide
over and over again, and the callback handler will eventually get
a stale pointer. Rather than make all the callsites handle the
locking, doesn't it make more sense for the infrastructure to do
it?
I realize we're trying to fix something that everyone wants to go
away, but the PCI rescan patches add some pretty useful
functionality and are pretty much ready to go except for this. I
could add the bookkeeping into my suicide path, but that's
actually a slightly bigger patch, because now I have to malloc my
own callback structs. And again, I think it's more appropriate to
put that sort of code into the core.
Can we fix 1 in the short term and move towards 2 as the real
solution?
Thanks.
/ac
On Wed, Mar 11, 2009 at 06:27:37PM -0600, Alex Chiang wrote:
> * Tejun Heo <[email protected]>:
> > Alex Chiang wrote:
> > > * Greg KH <[email protected]>:
> > >> On Tue, Mar 10, 2009 at 05:20:27PM -0600, Alex Chiang wrote:
> > >>> Hi Vegard, sysfs folks,
> > >>>
> > >>> Vegard was nice enough to test my PCI remove/rescan patches under
> > >>> kmemcheck. Maybe "torture" is a more appropriate term. ;)
> > >>>
> > >>> My patch series introduces a sysfs "remove" attribute for PCI
> > >>> devices, which will remove that device (and child devices).
> > >>>
> > >>> http://thread.gmane.org/gmane.linux.kernel.pci/3495
> > >>>
> > >>> Vegard decided that he wanted to do something like:
> > >>>
> > >>> # while true ; do echo 1 > /sys/bus/pci/devices/.../remove ; done
> > >>>
> > >>> which caused a nasty oops in my code. You can see the results of
> > >>> his testing in the thread I referenced above.
> > >>>
> > >>> After looking at my code for a bit, I decided that maybe it
> > >>> wasn't completely my fault. ;) See, I'm using device_schedule_callback()
> > >> why? Are you really in interrupt context here to need to do the remove
> > >> at a later time?
> > >
> > > What other interface can I use to remove objects from sysfs?
> >
> > I haven't read your code yet but I seem to recall doing something
> > similar. Ah.. okay, this one didn't get in and I forgot about this.
> >
> > http://thread.gmane.org/gmane.linux.kernel/582130
> >
> > But, yeah, committing suicide is currently quite hairy. I thought SCSI
> > did it correctly with all the grab/release dances. Does SCSI have the
> > problem too?
>
> I haven't dived into the SCSI code yet, but they are doing some
> sort of magic that I don't understand with their state machine.
>
> Regardless, I think we have two issues.
>
> 1. The existing callback mechanism that everyone hates
> has a "bug".
>
> 2. Your suicide patches haven't made it into mainline yet.
>
> The reason that I think that the "bug" is with the callback
> mechanism is because any caller can repeatedly schedule suicide
> over and over again, and the callback handler will eventually get
> a stale pointer. Rather than make all the callsites handle the
> locking, doesn't it make more sense for the infrastructure to do
> it?
>
> I realize we're trying to fix something that everyone wants to go
> away, but the PCI rescan patches add some pretty useful
> functionality and are pretty much ready to go except for this. I
> could add the bookkeeping into my suicide path, but that's
> actually a slightly bigger patch, because now I have to malloc my
> own callback structs. And again, I think it's more appropriate to
> put that sort of code into the core.
>
> Can we fix 1 in the short term and move towards 2 as the real
> solution?
I have no objection to this plan.
thanks,
greg k-h
On Wed, 11 Mar 2009 12:42:25 -0600,
Alex Chiang <[email protected]> wrote:
> * Greg KH <[email protected]>:
> > On Wed, Mar 11, 2009 at 06:47:29PM +0100, Cornelia Huck wrote:
> > > (For the original oops, I'd rather solve the problem by making sure the
> > > caller doesn't trigger removal several times - should probably be less
> > > code than the proposed patch?)
> >
> > Any ideas on how to do this?
>
> I still think the original patch I proposed is the right answer.
How about just putting a marker on your device that is going to be
unregistered and refusing to schedule it again? (This marker could also
be used to block other undesired actions; that's what ccwgroup does.)
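[Editorial note: the per-device marker Cornelia describes can be sketched in
userspace C with C11 atomics standing in for the kernel's atomic_t. All names
here (`mydev`, `mydev_schedule_remove`) are hypothetical; only the first caller
wins the flag, so later ->store invocations cannot queue a second removal.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical device with a "removal pending" marker, similar in
 * spirit to ccwgroup's gdev->onoff flag. */
struct mydev {
	atomic_int remove_pending;	/* 0 = idle, 1 = removal scheduled */
};

/* Called from the sysfs ->store handler.  Only the first caller wins;
 * later callers see the marker already set and bail out instead of
 * scheduling a second (and eventually fatal) removal callback. */
int mydev_schedule_remove(struct mydev *dev)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&dev->remove_pending,
					    &expected, 1))
		return -1;	/* stand-in for -EBUSY/-EAGAIN */
	/* ...schedule the real removal callback here... */
	return 0;
}
```

The cost of this approach, as Alex notes below, is that every caller of the
callback API has to carry its own flag.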
* Cornelia Huck <[email protected]>:
> On Wed, 11 Mar 2009 Alex Chiang <[email protected]> wrote:
>
> > * Greg KH <[email protected]>:
> > > On Wed, Mar 11, 2009 Cornelia Huck wrote:
> > > > (For the original oops, I'd rather solve the problem by
> > > > making sure the caller doesn't trigger removal several
> > > > times - should probably be less code than the proposed
> > > > patch?)
> > >
> > > Any ideas on how to do this?
> >
> > I still think the original patch I proposed is the right
> > answer.
>
> How about just putting a marker on your device that is going to
> be unregistered and refusing to schedule it again? (This marker
> could also be used to block other undesired actions; that's
> what ccwgroup does.)
I looked at what ccwgroup does, and you're right that it's a
smaller patch, and easy to implement.
But I still think that it makes more sense to fix the underlying
sysfs_schedule_callback() interface, rather than asking all the
callers to implement their own exclusion mechanisms.
Thanks.
/ac
* Greg KH <[email protected]>:
> On Wed, Mar 11, 2009 at 06:27:37PM -0600, Alex Chiang wrote:
> >
> > I haven't dived into the SCSI code yet, but they are doing some
> > sort of magic that I don't understand with their state machine.
> >
> > Regardless, I think we have two issues.
> >
> > 1. The existing callback mechanism that everyone hates
> > has a "bug".
> >
> > 2. Your suicide patches haven't made it into mainline yet.
> >
> > The reason that I think that the "bug" is with the callback
> > mechanism is because any caller can repeatedly schedule suicide
> > over and over again, and the callback handler will eventually get
> > a stale pointer. Rather than make all the callsites handle the
> > locking, doesn't it make more sense for the infrastructure to do
> > it?
> >
> > I realize we're trying to fix something that everyone wants to go
> > away, but the PCI rescan patches add some pretty useful
> > functionality and are pretty much ready to go except for this. I
> > could add the bookkeeping into my suicide path, but that's
> > actually a slightly bigger patch, because now I have to malloc my
> > own callback structs. And again, I think it's more appropriate to
> > put that sort of code into the core.
> >
> > Can we fix 1 in the short term and move towards 2 as the real
> > solution?
>
> I have no objection to this plan.
>
> thanks,
>
> greg k-h
From: Alex Chiang <[email protected]>
sysfs: only allow one scheduled removal callback per kobj
The only way for a sysfs attribute to remove itself (without
deadlock) is to use the sysfs_schedule_callback() interface.
Vegard Nossum discovered that a poorly written sysfs ->store
callback can repeatedly schedule remove callbacks on the same
device over and over, e.g.
$ while true ; do echo 1 > /sys/devices/.../remove ; done
If the 'remove' attribute uses the sysfs_schedule_callback API
and also does not protect itself from concurrent accesses, its
callback handler will be called multiple times, and will
eventually attempt to perform operations on a freed kobject,
leading to many problems.
Instead of requiring all callers of sysfs_schedule_callback to
implement their own synchronization, provide the protection in
the infrastructure.
Now, sysfs_schedule_callback will only allow one scheduled
callback per kobject. On subsequent calls with the same kobject,
return -EAGAIN.
This is a short term fix. The long term fix is to allow sysfs
attributes to remove themselves directly, without any of this
callback hokey pokey.
Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Alex Chiang <[email protected]>
---
Greg, I think this is .30 material; we're late in the -rc cycle
now and we're changing the semantics of an API.
Cornelia, I understand your earlier point about a smaller patch
in the caller, but I think pushing the code down into the
infrastructure is the right thing to do. Also, I wasn't brave
enough to patch your ccwgroup_ungroup_store(), but I think you
won't need the gdev->onoff stuff anymore in that code path.
Thanks.
---
file.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
---
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 1f4a3f8..289c43a 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -659,13 +659,16 @@ void sysfs_remove_file_from_group(struct kobject *kobj,
 EXPORT_SYMBOL_GPL(sysfs_remove_file_from_group);
 
 struct sysfs_schedule_callback_struct {
-	struct kobject		*kobj;
+	struct list_head	workq_list;
+	struct kobject		*kobj;
 	void			(*func)(void *);
 	void			*data;
 	struct module		*owner;
 	struct work_struct	work;
 };
 
+static DEFINE_MUTEX(sysfs_workq_mutex);
+static LIST_HEAD(sysfs_workq);
 static void sysfs_schedule_callback_work(struct work_struct *work)
 {
 	struct sysfs_schedule_callback_struct *ss = container_of(work,
@@ -674,6 +677,9 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
 	(ss->func)(ss->data);
 	kobject_put(ss->kobj);
 	module_put(ss->owner);
+	mutex_lock(&sysfs_workq_mutex);
+	list_del(&ss->workq_list);
+	mutex_unlock(&sysfs_workq_mutex);
 	kfree(ss);
 }
@@ -695,15 +701,25 @@ static void sysfs_schedule_callback_work(struct work_struct *work)
  * until @func returns.
  *
  * Returns 0 if the request was submitted, -ENOMEM if storage could not
- * be allocated, -ENODEV if a reference to @owner isn't available.
+ * be allocated, -ENODEV if a reference to @owner isn't available,
+ * -EAGAIN if a callback has already been scheduled for @kobj.
  */
 int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
 		void *data, struct module *owner)
 {
-	struct sysfs_schedule_callback_struct *ss;
+	struct sysfs_schedule_callback_struct *ss, *tmp;
 
 	if (!try_module_get(owner))
 		return -ENODEV;
+
+	mutex_lock(&sysfs_workq_mutex);
+	list_for_each_entry_safe(ss, tmp, &sysfs_workq, workq_list)
+		if (ss->kobj == kobj) {
+			mutex_unlock(&sysfs_workq_mutex);
+			return -EAGAIN;
+		}
+	mutex_unlock(&sysfs_workq_mutex);
+
 	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
 	if (!ss) {
 		module_put(owner);
@@ -715,6 +731,10 @@ int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
 	ss->data = data;
 	ss->owner = owner;
 	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
+	INIT_LIST_HEAD(&ss->workq_list);
+	mutex_lock(&sysfs_workq_mutex);
+	list_add_tail(&ss->workq_list, &sysfs_workq);
+	mutex_unlock(&sysfs_workq_mutex);
 	schedule_work(&ss->work);
 	return 0;
 }
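[Editorial note: the core idea of the patch above can be modeled in plain
userspace C, with pthread mutexes and a singly linked list replacing the
kernel's mutex/list_head/workqueue primitives. The names `cb_entry`,
`schedule_callback`, and `run_one_callback` are hypothetical analogues, not
kernel API; the point is only the "one in-flight callback per object" check.]

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Userspace model of sysfs_schedule_callback_struct: one entry per
 * scheduled callback, chained on a global pending list. */
struct cb_entry {
	const void *kobj;		/* identity of the target object */
	void (*func)(void *);
	void *data;
	struct cb_entry *next;
};

static pthread_mutex_t workq_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct cb_entry *workq;		/* models sysfs_workq */

/* Model of sysfs_schedule_callback(): refuse a second in-flight
 * callback for the same object, as the patch does. */
int schedule_callback(const void *kobj, void (*func)(void *), void *data)
{
	struct cb_entry *e;

	pthread_mutex_lock(&workq_mutex);
	for (e = workq; e; e = e->next)
		if (e->kobj == kobj) {
			pthread_mutex_unlock(&workq_mutex);
			return -EAGAIN;
		}
	pthread_mutex_unlock(&workq_mutex);

	e = malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	e->kobj = kobj;
	e->func = func;
	e->data = data;

	pthread_mutex_lock(&workq_mutex);
	e->next = workq;
	workq = e;
	pthread_mutex_unlock(&workq_mutex);
	return 0;
}

/* Model of the work handler: run one pending callback, then unlink
 * and free its entry so the same object may be scheduled again. */
int run_one_callback(void)
{
	struct cb_entry *e;

	pthread_mutex_lock(&workq_mutex);
	e = workq;
	if (e)
		workq = e->next;
	pthread_mutex_unlock(&workq_mutex);
	if (!e)
		return 0;
	if (e->func)
		e->func(e->data);
	free(e);
	return 1;
}
```

A second schedule_callback() for the same object returns -EAGAIN until
run_one_callback() has retired the pending entry, which is exactly the
semantic change the patch makes.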
On Thu, 12 Mar 2009 16:02:31 -0600,
Alex Chiang <[email protected]> wrote:
> From: Alex Chiang <[email protected]>
>
> sysfs: only allow one scheduled removal callback per kobj
>
> The only way for a sysfs attribute to remove itself (without
> deadlock) is to use the sysfs_schedule_callback() interface.
>
> Vegard Nossum discovered that a poorly written sysfs ->store
> callback can repeatedly schedule remove callbacks on the same
> device over and over, e.g.
>
> $ while true ; do echo 1 > /sys/devices/.../remove ; done
>
> If the 'remove' attribute uses the sysfs_schedule_callback API
> and also does not protect itself from concurrent accesses, its
> callback handler will be called multiple times, and will
> eventually attempt to perform operations on a freed kobject,
> leading to many problems.
>
> Instead of requiring all callers of sysfs_schedule_callback to
> implement their own synchronization, provide the protection in
> the infrastructure.
>
> Now, sysfs_schedule_callback will only allow one scheduled
> callback per kobject. On subsequent calls with the same kobject,
> return -EAGAIN.
>
> This is a short term fix. The long term fix is to allow sysfs
> attributes to remove themselves directly, without any of this
> callback hokey pokey.
>
> Cc: [email protected]
> Reported-by: [email protected]
> Signed-off-by: Alex Chiang <[email protected]>
> ---
> Greg, I think this is .30 material; we're late in the -rc cycle
> now and we're changing the semantics of an API.
>
> Cornelia, I understand your earlier point about a smaller patch
> in the caller, but I think pushing the code down into the
> infrastructure is the right thing to do.
OK, I don't have further objections.
> Also, I wasn't brave
> enough to patch your ccwgroup_ungroup_store(), but I think you
> won't need the gdev->onoff stuff anymore in that code path.
We still need it to prevent online/offline vs. ungroup races.
While device_schedule_callback() should not be able to return -EAGAIN
on us, I'll sleep better if you could add the following snippet to your
patch:
---
drivers/s390/cio/ccwgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- linux-2.6.orig/drivers/s390/cio/ccwgroup.c
+++ linux-2.6/drivers/s390/cio/ccwgroup.c
@@ -104,8 +104,9 @@ ccwgroup_ungroup_store(struct device *de
 	rc = device_schedule_callback(dev, ccwgroup_ungroup_callback);
 out:
 	if (rc) {
-		/* Release onoff "lock" when ungrouping failed. */
-		atomic_set(&gdev->onoff, 0);
+		if (rc != -EAGAIN)
+			/* Release onoff "lock" when ungrouping failed. */
+			atomic_set(&gdev->onoff, 0);
 		return rc;
 	}
 	return count;
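[Editorial note: the subtlety in Cornelia's snippet is that on -EAGAIN the
onoff "lock" must stay held, because the already-queued callback still owns
it. A hedged userspace model, with hypothetical names and an injected
scheduling result standing in for device_schedule_callback():]

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-ins for kernel errno values used by the model. */
#define MODEL_EAGAIN	(-11)
#define MODEL_EIO	(-5)

static atomic_int onoff;	/* models gdev->onoff */
static int schedule_result;	/* injected result of the schedule call */

/* Model of ccwgroup_ungroup_store() with the snippet applied. */
int ungroup_store(void)
{
	int expected = 0;
	int rc;

	/* Take the onoff "lock"; a concurrent online/offline wins. */
	if (!atomic_compare_exchange_strong(&onoff, &expected, 1))
		return MODEL_EAGAIN;

	rc = schedule_result;	/* models device_schedule_callback() */
	if (rc) {
		if (rc != MODEL_EAGAIN)
			/* Release the "lock" only when no callback is
			 * queued; on -EAGAIN the queued callback owns it. */
			atomic_store(&onoff, 0);
		return rc;
	}
	return 0;
}
```

On an ordinary failure the flag is dropped so a retry can proceed; on
-EAGAIN it is kept, matching the three-line change above.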
* Cornelia Huck <[email protected]>:
> On Thu, 12 Mar 2009 16:02:31 -0600,
> Alex Chiang <[email protected]> wrote:
> >
> > Cornelia, I understand your earlier point about a smaller patch
> > in the caller, but I think pushing the code down into the
> > infrastructure is the right thing to do.
>
> OK, I don't have further objections.
Thanks.
> > Also, I wasn't brave
> > enough to patch your ccwgroup_ungroup_store(), but I think you
> > won't need the gdev->onoff stuff anymore in that code path.
>
> We still need it to prevent online/offline vs. ungroup races.
>
> While device_schedule_callback() should not be able to return -EAGAIN
> on us, I'll sleep better if you could add the following snippet to your
> patch:
Added and resent to Greg.
Thanks.