2002-11-20 09:18:50

by Rick Lindsley

Subject: [BUG] 2.5.47: sysfs hierarchy can begin to disintegrate

2.5.47 (top of bk tree on 11/14, to be precise)

/sys has a sysfs file system on it. I'd expect /sys/block to contain
hd[abc], an assortment of ram disks, and perhaps some SCSI disks.
However, these all appear under /sys instead of /sys/block. /sys/block
exists but is empty.

Interesting observation: the megaraid controller has some problem right
now (suspected to be hardware) wherein its scsi disks appear at boot
time but then cannot be accessed later and are subsequently detached.
Since this could well be a little-used and little-tested path, my
suspicion is that either the megaraid, scsi, or sysfs code has a bug when
disks are detached. So far, I've not been able to find one, however, so
I thought I'd report this in case others might know just where to peek.
Could it be that removing entries from sysfs is done incorrectly in
some cases?

This was reproduced consistently on a machine at OSDL with the assistance
of Cliff White ... however I was not able to reproduce on my own 4-way
machine which has IDE disks and RAM disks but no megaraid.

I'm continuing to investigate this but there are more of you than of me
so ... :)

Rick
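
The symptom reported above (disk entries sitting directly under /sys while
/sys/block is empty) can be checked from user space. What follows is only a
rough sketch under stated assumptions: the helper names are hypothetical, the
name patterns (hd?, sd?, ramN) are taken from the disks mentioned in the
report, and the sysfs root is passed in rather than assumed to be /sys.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does NAME look like a disk entry such as hda, sdb, ram0? */
int looks_like_disk(const char *name)
{
	if ((strncmp(name, "hd", 2) == 0 || strncmp(name, "sd", 2) == 0) &&
	    islower((unsigned char)name[2]))
		return 1;
	if (strncmp(name, "ram", 3) == 0 && isdigit((unsigned char)name[3]))
		return 1;
	return 0;
}

/* Count directory entries in DIR that look like disks; -1 if DIR is unreadable. */
int count_disk_entries(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	int n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL)
		if (looks_like_disk(de->d_name))
			n++;
	closedir(d);
	return n;
}

/*
 * Returns 1 if disk-like entries appear at the sysfs root while
 * <root>/block has none (the reported symptom), 0 otherwise,
 * and -1 if either directory cannot be read.
 */
int sysfs_disks_misplaced(const char *sysfs_root)
{
	char block_dir[4096];
	int len, at_root, under_block;

	len = snprintf(block_dir, sizeof(block_dir), "%s/block", sysfs_root);
	if (len < 0 || (size_t)len >= sizeof(block_dir))
		return -1;

	at_root = count_disk_entries(sysfs_root);
	under_block = count_disk_entries(block_dir);
	if (at_root < 0 || under_block < 0)
		return -1;
	return at_root > 0 && under_block == 0;
}
```

Calling sysfs_disks_misplaced("/sys") on an affected machine would be expected
to return 1; it only inspects names and says nothing about why the entries
lost their parent.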


2002-11-20 17:42:49

by Mike Anderson

Subject: Re: [BUG] 2.5.47: sysfs hierarchy can begin to disintegrate

Rick Lindsley wrote:
> 2.5.47 (top of bk tree on 11/14, to be precise)
>
> /sys has a sysfs file system on it. I'd expect /sys/block to contain
> hd[abc], an assortment of ram disks, and perhaps some SCSI disks.
> However, these all appear under /sys instead of /sys/block. /sys/block
> exists but is empty.
>
> Interesting observation: the megaraid controller has some problem right
> now (suspected to be hardware) wherein its scsi disks appear at boot
> time but then cannot be accessed later and are subsequently detached.
> Since this could well be a little-used and little-tested path, my
> suspicion is that either the megaraid, scsi, or sysfs code has a bug when
> disks are detached. So far, I've not been able to find one, however, so
> I thought I'd report this in case others might know just where to peek.
> Could it be that removing entries from sysfs is done incorrectly in
> some cases?

Rick,
There are cleanup issues in sysfs previously reported by others.
The patch below previously sent to the list by patmans and
myself helps in repeated insmod / rmmod / shutdown testing of
scsi modules for us. We have not seen the null parent pointer
problem you are seeing which causes your objects to show up
under /sys/ (YMMV).

-andmike
--
Michael Anderson

bus.c | 4 +++-
core.c | 2 --
driver.c | 2 ++
3 files changed, 5 insertions(+), 3 deletions(-)
------

===== drivers/base/bus.c 1.22 vs edited =====
--- 1.22/drivers/base/bus.c Thu Oct 31 08:20:23 2002
+++ edited/drivers/base/bus.c Wed Nov 6 22:06:39 2002
@@ -209,8 +209,10 @@
 				attach(dev);
 			else
 				dev->driver = NULL;
-		} else
+		} else {
 			attach(dev);
+			error = 0;
+		}
 	}
 	return error;
 }
===== drivers/base/core.c 1.50 vs edited =====
--- 1.50/drivers/base/core.c Thu Oct 31 08:20:23 2002
+++ edited/drivers/base/core.c Wed Nov 6 13:07:47 2002
@@ -173,8 +173,6 @@
 		return -EINVAL;
 
 	device_initialize(dev);
-	if (dev->parent)
-		get_device(dev->parent);
 	error = device_add(dev);
 	if (error && dev->parent)
 		put_device(dev->parent);
===== drivers/base/driver.c 1.14 vs edited =====
--- 1.14/drivers/base/driver.c Wed Oct 30 16:35:48 2002
+++ edited/drivers/base/driver.c Wed Nov 6 21:41:48 2002
@@ -127,6 +127,8 @@
 	drv->present = 0;
 	spin_unlock(&device_lock);
 	pr_debug("driver %s:%s: unregistering\n",drv->bus->name,drv->name);
+	bus_remove_driver(drv);
+	kobject_unregister(&drv->kobj);
 	put_driver(drv);
 }
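
One way to read the core.c hunk above is as a reference-count balance fix. The
user-space sketch below is a toy model, not kernel code: the toy_* names are
hypothetical, and it ASSUMES that the add step takes its own reference on the
parent (that part is not shown in the posted diff). Under that assumption, the
extra get before the add leaves the parent's count one too high after a full
register/remove cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a device: just a reference count and a parent link. */
struct toy_device {
	int refcount;
	struct toy_device *parent;
};

static void toy_get(struct toy_device *dev)
{
	if (dev)
		dev->refcount++;
}

static void toy_put(struct toy_device *dev)
{
	if (dev) {
		assert(dev->refcount > 0);
		dev->refcount--;
	}
}

/* ASSUMPTION: adding a device takes a reference on its parent. */
static int toy_add(struct toy_device *dev)
{
	toy_get(dev->parent);
	return 0;	/* the toy add never fails */
}

/* Removing a device drops the reference taken by toy_add(). */
static void toy_del(struct toy_device *dev)
{
	toy_put(dev->parent);
}

/* Shape of the code before the patch: an extra get ahead of the add. */
int toy_register_old(struct toy_device *dev)
{
	int error;

	if (dev->parent)
		toy_get(dev->parent);
	error = toy_add(dev);
	if (error && dev->parent)
		toy_put(dev->parent);
	return error;
}

/* Shape of the code after the patch: the add alone takes the reference. */
int toy_register_new(struct toy_device *dev)
{
	int error = toy_add(dev);

	if (error && dev->parent)
		toy_put(dev->parent);
	return error;
}

/* Parent's reference count after one register/remove cycle of a child. */
int parent_refs_after_cycle(int (*reg)(struct toy_device *))
{
	struct toy_device parent = { 1, NULL };
	struct toy_device child = { 1, &parent };

	reg(&child);
	toy_del(&child);
	return parent.refcount;
}
```

In this model the old path leaves the parent at 2 references after the cycle,
while the new path returns it to 1; a count that never drops back would keep a
parent from being released during repeated insmod / rmmod testing.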

2002-11-20 17:51:22

by Patrick Mochel

Subject: Re: [BUG] 2.5.47: sysfs hierarchy can begin to disintegrate


On Wed, 20 Nov 2002, Mike Anderson wrote:

> Rick Lindsley wrote:
> > 2.5.47 (top of bk tree on 11/14, to be precise)
> >
> > /sys has a sysfs file system on it. I'd expect /sys/block to contain
> > hd[abc], an assortment of ram disks, and perhaps some SCSI disks.
> > However, these all appear under /sys instead of /sys/block. /sys/block
> > exists but is empty.
> >
> > Interesting observation: the megaraid controller has some problem right
> > now (suspected to be hardware) wherein its scsi disks appear at boot
> > time but then cannot be accessed later and are subsequently detached.
> > Since this could well be a little-used and little-tested path, my
> > suspicion is that either the megaraid, scsi, or sysfs code has a bug when
> > disks are detached. So far, I've not been able to find one, however, so
> > I thought I'd report this in case others might know just where to peek.
> > Could it be that removing entries from sysfs is done incorrectly in
> > some cases?
>
> Rick,
> There are cleanup issues in sysfs previously reported by others.
> The patch below previously sent to the list by patmans and
> myself helps in repeated insmod / rmmod / shutdown testing of
> scsi modules for us. We have not seen the null parent pointer
> problem you are seeing which causes your objects to show up
> under /sys/ (YMMV).

I have not seen it, either. What SCSI driver is it? Is it modular or not?
Or, does it matter?

I apologize for the long delay in replying and integrating patches. I've
been moving, and readjusting, and sifting through email from the past 1.5
months, and trying to fix up the various problems that people have
reported. I have integrated the patch just posted, BTW.

-pat


2002-11-20 20:04:21

by Patrick Mansfield

Subject: Re: [BUG] 2.5.47: sysfs hierarchy can begin to disintegrate

On Wed, Nov 20, 2002 at 01:25:47AM -0800, Rick Lindsley wrote:

> Interesting observation: the megaraid controller has some problem right
> now (suspected to be hardware) wherein its scsi disks appear at boot
> time but then cannot be accessed later and are subsequently detached.
> Since this could well be a little-used and little-tested path, my
> suspicion is that either the megaraid, scsi, or sysfs code has a bug when
> disks are detached. So far, I've not been able to find one, however, so

Regarding megaraid, Stephen Hemming and Mark Haverkamp posted about
problems, leading to this:

http://marc.theaimsgroup.com/?l=linux-scsi&m=103728900228366&w=2

Generally, the megaraid had problems where the aic driver did not, using
the same disk drives. AFAICT the megaraid caused a problem, and its error
handling is nonexistent, so it is unable to recover from the problem. I'm
sure Mark or Stephen could provide details.

> I thought I'd report this in case others might know just where to peek.
> Could it be that removing entries from sysfs is done incorrectly in
> some cases?

If you are talking about rmmod your-scsi-adapter, or removing a scsi_device
via /proc/scsi, yes, there are problems in sysfs and in scsi. None of the
sysfs fixes are in the current bk tree; I assume Pat Mochel needs to push
them (he hasn't been online recently).

We can push a simple scsi fix (via James B), but the scsi sysfs code is not
good; Mike Anderson is working on some patches for that.

-- Patrick Mansfield