On Fri, May 12, 2006 at 06:53:27AM +0200, Or Gerlitz wrote:
> On 5/11/06, Erik Mouw <[email protected]> wrote:
> >Hi,
> >
> >While playing with libata in 2.6.17-git from today, I got this bug:
> >
> >SCSI subsystem initialized
> >libata version 1.20 loaded.
> >sata_promise 0000:02:05.0: version 1.04
> >kmem_cache_create: duplicate cache scsi_cmd_cache
> > <c0159591> kmem_cache_create+0x331/0x390
> > <e0924371> scsi_setup_command_freelist+0x71/0xf0 [scsi_mod]
> > <e092588e> scsi_host_alloc+0x17e/0x270 [scsi_mod]
> > <e08fd061> ata_host_add+0x41/0xc0 [libata]
> > <c0148c9f> __kzalloc+0x1f/0x50
> > <e08fd190> ata_device_add+0xb0/0x240 [libata]
> > <e0839baf> pdc_ata_init_one+0x27f/0x330 [sata_promise]
> > <c01dd1a9> pci_call_probe+0x19/0x20
> > <c01dd20e> __pci_device_probe+0x5e/0x70
> > <c01dd24f> pci_device_probe+0x2f/0x50
> > <c0220977> driver_probe_device+0xb7/0xe0
> > <c02b338a> klist_dec_and_del+0x1a/0x20
> > <c0220a30> __driver_attach+0x0/0x90
> > <c0220aa1>__driver_attach+0x71/0x90
> > <c021fd49> bus_for_each_dev+0x69/0x80
> > <c0220ae6> driver_attach+0x26/0x30
> > <c0220a30> __driver_attach+0x0/0x90
> > <c02202a3> bus_add_driver+0x83/0xc0
> > <c01dd4ed> __pci_register_driver+0x4d/0x70
> > <e0880017> pdc_ata_init+0x17/0x1b [sata_promise]
> > <c013a1a0> sys_init_module+0x120/0x1b0
> > <c0102f27> syscall_call+0x7/0xb
>
> I was pointing on 2.6.17 kmem_cache related issues while back to LKML
> and it turns out that you can reproduce them easily with standalone
> trivial module, see
> http://openib.org/pipermail/openib-general/2006-April/020582.html
>
> So far no one picked the issue anb i was adviced to use bisection...
I tracked it down with git bisect. The culprit is this commit:
56cf6504fc1c0c221b82cebc16a444b684140fb7 is first bad commit
diff-tree 56cf6504fc1c0c221b82cebc16a444b684140fb7 (from
d98550e334715b2d9e45f8f0f4e1608720108640)
Author: Russell King <[email protected]>
Date: Fri May 5 17:57:52 2006 +0100
[BLOCK] Fix oops on removal of SD/MMC card
The block layer keeps a reference (driverfs_dev) to the struct
device associated with the block device, and uses it internally
for generating uevents in block_uevent.
Block device uevents include umounting the partition, which can
occur after the backing device has been removed.
Unfortunately, this reference is not counted. This means that
if the struct device is removed from the device tree, the block
layers reference will become stale.
Guard against this by holding a reference to the struct device
in add_disk(), and only drop the reference when we're releasing
the gendisk kobject - in other words when we can be sure that no
further uevents will be generated for this block device.
Signed-off-by: Russell King <[email protected]>
Acked-by: Jens Axboe <[email protected]>
:040000 040000 4923c988a57db93382546cd83f4e3043b06c0eed 08e396a4fbf3c7d0beb9178f0fd0c9205c0b5305 M block
After reverting this commit in 2.6.17-rc4 I can't trigger the bug
anymore. Might be worth fixing before 2.6.17-final.
Note: I'm going on holiday next week, so I'm not able to test any
fixes. However, because this bug is very easy to trigger[1], anybody
with root on NFS and fully modular SCSI or libata should be able to
test.
Erik
[1] See http://marc.theaimsgroup.com/?l=linux-scsi&m=114736053211943&w=2
--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
On Fri, 12 May 2006, Erik Mouw wrote:
> On Fri, May 12, 2006 at 06:53:27AM +0200, Or Gerlitz wrote:
> > On 5/11/06, Erik Mouw <[email protected]> wrote:
> > >Hi,
> > >
> > >While playing with libata in 2.6.17-git from today, I got this bug:
> > >
> > >SCSI subsystem initialized
> > >libata version 1.20 loaded.
> > >sata_promise 0000:02:05.0: version 1.04
> > >kmem_cache_create: duplicate cache scsi_cmd_cache
> > > <c0159591> kmem_cache_create+0x331/0x390
> > > <e0924371> scsi_setup_command_freelist+0x71/0xf0 [scsi_mod]
> > > <e092588e> scsi_host_alloc+0x17e/0x270 [scsi_mod]
> > > <e08fd061> ata_host_add+0x41/0xc0 [libata]
> > > <c0148c9f> __kzalloc+0x1f/0x50
> > > <e08fd190> ata_device_add+0xb0/0x240 [libata]
> > > <e0839baf> pdc_ata_init_one+0x27f/0x330 [sata_promise]
> > > <c01dd1a9> pci_call_probe+0x19/0x20
> > > <c01dd20e> __pci_device_probe+0x5e/0x70
> > > <c01dd24f> pci_device_probe+0x2f/0x50
> > > <c0220977> driver_probe_device+0xb7/0xe0
> > > <c02b338a> klist_dec_and_del+0x1a/0x20
> > > <c0220a30> __driver_attach+0x0/0x90
> > > <c0220aa1>__driver_attach+0x71/0x90
> > > <c021fd49> bus_for_each_dev+0x69/0x80
> > > <c0220ae6> driver_attach+0x26/0x30
> > > <c0220a30> __driver_attach+0x0/0x90
> > > <c02202a3> bus_add_driver+0x83/0xc0
> > > <c01dd4ed> __pci_register_driver+0x4d/0x70
> > > <e0880017> pdc_ata_init+0x17/0x1b [sata_promise]
> > > <c013a1a0> sys_init_module+0x120/0x1b0
> > > <c0102f27> syscall_call+0x7/0xb
> >
> > I was pointing on 2.6.17 kmem_cache related issues while back to LKML
> > and it turns out that you can reproduce them easily with standalone
> > trivial module, see
> > http://openib.org/pipermail/openib-general/2006-April/020582.html
> >
> > So far no one picked the issue anb i was adviced to use bisection...
>
> I tracked it down with git bisect. The culprit is this commit:
>
> 56cf6504fc1c0c221b82cebc16a444b684140fb7 is first bad commit
> diff-tree 56cf6504fc1c0c221b82cebc16a444b684140fb7 (from
> d98550e334715b2d9e45f8f0f4e1608720108640)
> Author: Russell King <[email protected]>
> Date: Fri May 5 17:57:52 2006 +0100
>
> [BLOCK] Fix oops on removal of SD/MMC card
>
> The block layer keeps a reference (driverfs_dev) to the struct
> device associated with the block device, and uses it internally
> for generating uevents in block_uevent.
>
> Block device uevents include umounting the partition, which can
> occur after the backing device has been removed.
...
THat's really weird... I reported a completely unrelated problem some
days ago:
http://marc.theaimsgroup.com/?l=linux-kernel&m=114728598328769&w=2
> After reverting this commit in 2.6.17-rc4 I can't trigger the bug
> anymore. Might be worth fixing before 2.6.17-final.
and similarly reverted the commit in question...
I'm still trying to track this down...
> Note: I'm going on holiday next week, so I'm not able to test any
> fixes. However, because this bug is very easy to trigger[1], anybody
> with root on NFS and fully modular SCSI or libata should be able to
> test.
--
Andrew Vasquez
On Fri, 12 May 2006, Erik Mouw wrote:
> On Fri, May 12, 2006 at 06:53:27AM +0200, Or Gerlitz wrote:
> >
> > I was pointing on 2.6.17 kmem_cache related issues while back to LKML
> > and it turns out that you can reproduce them easily with standalone
> > trivial module, see
> > http://openib.org/pipermail/openib-general/2006-April/020582.html
> >
> > So far no one picked the issue anb i was adviced to use bisection...
>
> I tracked it down with git bisect. The culprit is this commit:
>
> 56cf6504fc1c0c221b82cebc16a444b684140fb7 is first bad commit
Ok, that's just _strange_. Clearly bisection picked the right commit,
since:
> After reverting this commit in 2.6.17-rc4 I can't trigger the bug
> anymore.
but at the same time, I don't see what that two-liner patch to
block/genhd.c has to do with that trivial module which doesn't even _use_
any of the functions that the patch affects.
It sounds like you used some other test-case than the trivial module above
to test bisection?
So when you say "I tracked _it_ down with git bisect" (my emphasis), it
sounds like "it" was really something else than the trivial stand-alone
module. Or are you saying that the trivial stand-alone module _also_ got
fixed?
> Might be worth fixing before 2.6.17-final.
Yes. We could just revert that commit, but it seems correct, and I'd
really like for somebody to understand _why_ that commit matters at all. I
certainly don't see the overlap here..
So I get the feeling that you bisected just the case below, right?
> Note: I'm going on holiday next week, so I'm not able to test any
> fixes. However, because this bug is very easy to trigger[1], anybody
> with root on NFS and fully modular SCSI or libata should be able to
> test.
>
> [1] See http://marc.theaimsgroup.com/?l=linux-scsi&m=114736053211943&w=2
Hmm. Jens? Any ideas?
Linus
On Fri, 2006-05-12 at 10:36 -0700, Linus Torvalds wrote:
> Ok, that's just _strange_. Clearly bisection picked the right commit,
> since:
I can guess what the problem is.
The kmem_cache_release is triggered from the device model release of the
host generic device. I suspect we've induced a tangle in here that
means scsi devices (and hence hosts) get deleted but never released.
I'll look at the release paths and see if I can work out what it is.
James
On Fri, 2006-05-12 at 12:47 -0500, James Bottomley wrote:
> I'll look at the release paths and see if I can work out what it is.
OK, here's the scoop. The problem patch adds a get of driverfs_dev in
add_disk(), but doesn't put it again until disk_release() (which occurs
on final put_disk() of the gendisk).
However, in SCSI, the driverfs_dev is the sdev_gendev. That means
there's a reference held on sdev_gendev until final disk put.
Unfortunately, we use the driver model driver_remove to trigger
del_gendisk (which removes the gendisk from visibility and decrements
the refcount), so we've introduced an unbreakable deadlock in the
reference counting with this.
I suggest simply reversing this patch at the moment. If Russell and
Jens can tell me what they're trying to do I'll see if there's another
way to do it.
James
On Fri, 12 May 2006, James Bottomley wrote:
>
> I suggest simply reversing this patch at the moment. If Russell and
> Jens can tell me what they're trying to do I'll see if there's another
> way to do it.
Reverted, with a big changelog entry to explain why.
Linus
On Fri, May 12, 2006 at 10:27:29AM -0700, Andrew Vasquez wrote:
> On Fri, 12 May 2006, Erik Mouw wrote:
> > I tracked it down with git bisect. The culprit is this commit:
> >
> > 56cf6504fc1c0c221b82cebc16a444b684140fb7 is first bad commit
> > diff-tree 56cf6504fc1c0c221b82cebc16a444b684140fb7 (from
> > d98550e334715b2d9e45f8f0f4e1608720108640)
> > Author: Russell King <[email protected]>
> > Date: Fri May 5 17:57:52 2006 +0100
> >
> > [BLOCK] Fix oops on removal of SD/MMC card
> >
> > The block layer keeps a reference (driverfs_dev) to the struct
> > device associated with the block device, and uses it internally
> > for generating uevents in block_uevent.
> >
> > Block device uevents include umounting the partition, which can
> > occur after the backing device has been removed.
> ...
>
> THat's really weird... I reported a completely unrelated problem some
> days ago:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114728598328769&w=2
AFAICS it's the same bug. Apparently you can trigger it with any scsi
host driver.
Erik
--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> It sounds like you used some other test-case than the trivial module above
> to test bisection?
Yes, my test case is in here:
http://marc.theaimsgroup.com/?l=linux-scsi&m=114736053211943&w=2
> So when you say "I tracked _it_ down with git bisect" (my emphasis), it
> sounds like "it" was really something else than the trivial stand-alone
> module. Or are you saying that the trivial stand-alone module _also_ got
> fixed?
Sorry for being unclear. "it" is the bug report I send to linux-scsi
(see url above).
Erik
--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
On Fri, May 12, 2006 at 01:58:45PM -0500, James Bottomley wrote:
> On Fri, 2006-05-12 at 12:47 -0500, James Bottomley wrote:
> > I'll look at the release paths and see if I can work out what it is.
>
> OK, here's the scoop. The problem patch adds a get of driverfs_dev in
> add_disk(), but doesn't put it again until disk_release() (which occurs
> on final put_disk() of the gendisk).
>
> However, in SCSI, the driverfs_dev is the sdev_gendev. That means
> there's a reference held on sdev_gendev until final disk put.
> Unfortunately, we use the driver model driver_remove to trigger
> del_gendisk (which removes the gendisk from visibility and decrements
> the refcount), so we've introduced an unbreakable deadlock in the
> reference counting with this.
That explains why I could only trigger the bug with a disk attached to
the controller.
Thanks for figuring out,
Erik
--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> Yes. We could just revert that commit, but it seems correct, and I'd
> really like for somebody to understand _why_ that commit matters at all. I
> certainly don't see the overlap here..
Reverting the commit breaks MMC/SD in a very real way, and the fix
is plainly correct and is actually the only possible fix that can be
applied.
It sounds to me like SCSI is relying on some buggy behaviour which is
specific to the way that the kernel works with the fix removed. Maybe
thing is kfree'ing and then reallocating something which remained
registered somewhere when it shouldn't do?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 01:58:45PM -0500, James Bottomley wrote:
> I suggest simply reversing this patch at the moment. If Russell and
> Jens can tell me what they're trying to do I'll see if there's another
> way to do it.
When a MMC card is pulled, we remove the MMC device structure (which
is what the driverfs_dev points at.) At this point, the MMC layer
*totally* forgets about the MMC device and deletes it.
Unfortunately, an uncounted reference is kept while the partition is
mounted by the gendisk layer, which when the partition is unmounted
via hotplug causes another hotplug event to be generated with respect
to this freed MMC device structure, and hence you get an oops.
Since the MMC layer has lost all knowledge of the device, the only
possible solution is as given in that patch.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 12:09:50PM -0700, Linus Torvalds wrote:
> On Fri, 12 May 2006, James Bottomley wrote:
> > I suggest simply reversing this patch at the moment. If Russell and
> > Jens can tell me what they're trying to do I'll see if there's another
> > way to do it.
>
> Reverted, with a big changelog entry to explain why.
Great, I'm fucked by the SCSI folk again.
Can we revert the patch which broke the MMC/SD layer - the one which
added the mount/unmount hotplug events as well then.
That way we get back to a working MMC/SD layer as well as a working
SCSI layer.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, 12 May 2006, Russell King wrote:
>
> Great, I'm fucked by the SCSI folk again.
No, you introduced a regression. This isn't the SCSI layer being evil,
this is the "regressions aren't acceptable".
Fixing one bug and introducing another is actively _worse_ than having the
same bug stay around.
> Can we revert the patch which broke the MMC/SD layer - the one which
> added the mount/unmount hotplug events as well then.
>
> That way we get back to a working MMC/SD layer as well as a working
> SCSI layer.
That's certainly the logical fix - push the pain up the chain to the
person who introduced it. Which commit is that, do you know?
Really, the added ref-count should be gotten by whoever holds on to the
thing, and it sounds like it's the hotplug event that caused this and
should have held on to its hotplug reference.
Greg added to the Cc: list in case he already knows off-hand which commit
it is..
Linus
On Fri, May 12, 2006 at 09:38:50PM +0100, Russell King wrote:
> On Fri, May 12, 2006 at 12:09:50PM -0700, Linus Torvalds wrote:
> > On Fri, 12 May 2006, James Bottomley wrote:
> > > I suggest simply reversing this patch at the moment. If Russell and
> > > Jens can tell me what they're trying to do I'll see if there's another
> > > way to do it.
> >
> > Reverted, with a big changelog entry to explain why.
>
> Great, I'm fucked by the SCSI folk again.
>
> Can we revert the patch which broke the MMC/SD layer - the one which
> added the mount/unmount hotplug events as well then.
Why would the mount/unmount hotplug event change break MMC/SD? Do you
have a reference to the patch in question?
thanks,
greg k-h
On Fri, May 12, 2006 at 01:50:46PM -0700, Linus Torvalds wrote:
> On Fri, 12 May 2006, Russell King wrote:
> > Great, I'm fucked by the SCSI folk again.
>
> No, you introduced a regression. This isn't the SCSI layer being evil,
> this is the "regressions aren't acceptable".
No - I introduced a correct fix.
I think actually we're heading towards needing Linux V2 - the rewrite.
It seems that fixing simple bugs cause other bugs, and that means we're
heading into a maintainability nightmare.
> > Can we revert the patch which broke the MMC/SD layer - the one which
> > added the mount/unmount hotplug events as well then.
> >
> > That way we get back to a working MMC/SD layer as well as a working
> > SCSI layer.
>
> That's certainly the logical fix - push the pain up the chain to the
> person who introduced it. Which commit is that, do you know?
No idea I'm afraid, and since I've had a _really_ extremely shitty couple
of days I'm not about to start going to look for it.
> Really, the added ref-count should be gotten by whoever holds on to the
> thing, and it sounds like it's the hotplug event that caused this and
> should have held on to its hotplug reference.
... which would be the genhd layer - it's keeping the driverfs_dev
around so that the genhd layer can generate hotplug events using it
at mount/umount time. Thanks for just re-confirming my original fix
and that it's SCSI which is the real problem. 8)
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 01:52:15PM -0700, Greg KH wrote:
> On Fri, May 12, 2006 at 09:38:50PM +0100, Russell King wrote:
> > On Fri, May 12, 2006 at 12:09:50PM -0700, Linus Torvalds wrote:
> > > On Fri, 12 May 2006, James Bottomley wrote:
> > > > I suggest simply reversing this patch at the moment. If Russell and
> > > > Jens can tell me what they're trying to do I'll see if there's another
> > > > way to do it.
> > >
> > > Reverted, with a big changelog entry to explain why.
> >
> > Great, I'm fucked by the SCSI folk again.
> >
> > Can we revert the patch which broke the MMC/SD layer - the one which
> > added the mount/unmount hotplug events as well then.
>
> Why would the mount/unmount hotplug event change break MMC/SD? Do you
> have a reference to the patch in question?
Please read the commit message in the change in question.
The block layer holds on to a reference to a struct device which
isn't refcounted (until I added it with my patch.) Hence struct
gendisk structures have a completely independent lifetime and are
only destroyed when all references to them are removed.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 01:50:46PM -0700, Linus Torvalds wrote:
> On Fri, 12 May 2006, Russell King wrote:
> > Can we revert the patch which broke the MMC/SD layer - the one which
> > added the mount/unmount hotplug events as well then.
> >
> > That way we get back to a working MMC/SD layer as well as a working
> > SCSI layer.
>
> That's certainly the logical fix - push the pain up the chain to the
> person who introduced it. Which commit is that, do you know?
>
> Really, the added ref-count should be gotten by whoever holds on to the
> thing, and it sounds like it's the hotplug event that caused this and
> should have held on to its hotplug reference.
>
> Greg added to the Cc: list in case he already knows off-hand which commit
> it is..
No, I don't know, that's why I just asked :)
And this bug doesn't have anything to do with why my mmc/sd cards are
suddenly not showing up at all anymore in my laptop, right? I need to
track that regression from 2.6.17-rc1 down...
thanks,
greg k-h
On Fri, 12 May 2006, Russell King wrote:
>
> On Fri, May 12, 2006 at 01:50:46PM -0700, Linus Torvalds wrote:
> >
> > No, you introduced a regression. This isn't the SCSI layer being evil,
> > this is the "regressions aren't acceptable".
>
> No - I introduced a correct fix.
Not so.
You introduced a commit that fixed one thing, and broke another thing.
This is not rocket science, Russell. This is how programming works. There
are dependencies on behaviour, and if you change behaviour, it can fix one
problem while breaking another one, because different sides depend on
different semantics.
Real complexity is in linkages and communication. Surprise, surprise.
The _real_ broken ends up being elsewhere. It might just be a
communication error (two different subsystems got the wrong idea because
of bad communication about how something _should_ be used), but it migth
also be some bad design.
But saying that the two one-liners were the "correct fix" is premature
just because they fixed the symptoms for _you_.
> It seems that fixing simple bugs cause other bugs, and that means we're
> heading into a maintainability nightmare.
That's just stupid, Russell.
You DID NOT FIX A SIMPLE BUG.
You made a change, that on the face of it _looked_ like a simple fix for a
simple bug. It wasn't. No amount of you saying so makes it so. It broke
something else, and suddenly what you claim was a simple and clear bug-fix
obviously isn't.
Ergo, it ended up not being "fixing simple bugs" at all.
> No idea I'm afraid, and since I've had a _really_ extremely shitty couple
> of days I'm not about to start going to look for it.
Right. You're pissy. That's fine. But you having a shitty couple of days
doesn't mean you need to take it out on everybody else. Especially not
when people actually offered to help, and asked for details on exactly
what the problem that _you_ tracked was.
You depend on the exact same infrastructure that SCSI depends on. SCSI has
different expectations of that infrastructure than you do. Arguably, there
are more SCSI users out there than MMC users, so when you claim that it's
SCSI's fault, you're being biased. It's equally possible that the fault
was in the MMC layer all along, and that the MMC layers use of the
structure was the thing that caused the oops for the MMC later.
Since you didn't bother to even point to the oopses or the patch that
caused then now, you aren't actually giving any reason to believe that
your view of the situation is the only right one.
So you havign a hissy-fit actually makes it harder for people to even help
you. James already said
"If Russell and Jens can tell me what they're trying to do I'll
see if there's another way to do it."
but you didn't, you just started ranting.
See? Not very useful, is it?
> > Really, the added ref-count should be gotten by whoever holds on to the
> > thing, and it sounds like it's the hotplug event that caused this and
> > should have held on to its hotplug reference.
>
> ... which would be the genhd layer - it's keeping the driverfs_dev
> around so that the genhd layer can generate hotplug events using it
> at mount/umount time. Thanks for just re-confirming my original fix
> and that it's SCSI which is the real problem. 8)
No. You want your fix to be the correct one, but the fact is, it may well
not be that at all. Maybe - once we look at what the oops is - we find
that it really _is_ the MMC layer that just doesn't follow the assumptions
that the device layer was doing.
I said "sounds like". And without help, we can't much help you. And being
pissy about it, will just make everybody more sure that it's the MMC
layer.
Reference counts aren't always simple. Exactly because you can get a
"recursive" reference count, where a device holds on to its reference
count just by account of existing - and that is a BUG, because such a
reference count will obviously never go down to zero.
Now, not going down to zero will certainly hide any problems that come
from the object being released early, because it will _never_ be released.
And it sounds like this is actually what your patch _really_ caused. The
oops went away, but perhaps at the cost of a leak?
And hey, maybe your patch was proper, and the bug really _is_ in the SCSI
layer. In which case we get back to that message from James that you
ignored.
Linus
On Fri, 12 May 2006, Greg KH wrote:
>
> No, I don't know, that's why I just asked :)
>
> And this bug doesn't have anything to do with why my mmc/sd cards are
> suddenly not showing up at all anymore in my laptop, right? I need to
> track that regression from 2.6.17-rc1 down...
Well, that's certainly an interesting regression to look into too..
Here's all I know as back-ground.. Russell's patch certainly _looked_ ok,
which is why it then got acked and merged, but that was before we had
multiple people reporting that it breaks things for them.
Linus
--
On Thu, 4 May 2006, Russell King wrote:
>
> Linus,
>
> This has been confirmed to fix an issue which Mikkel discovered, and
> I'm now seeing reports from other people about this.
>
> I'm not getting any response on it from either Al or Jens and it
> appears to be otherwise ignored - help! Would you take this patch
> direct from me?
>
> ----- Forwarded message from Russell King <[email protected]> -----
>
> Date: Fri, 7 Apr 2006 15:40:46 +0100
> From: Russell King <[email protected]>
> To: Pierre Ossman <[email protected]>,
> Al Viro <[email protected]>, Jens Axboe <[email protected]>
> Cc: Mikkel Erup <[email protected]>, Greg KH <[email protected]>,
> [email protected], Andrew Morton <[email protected]>
> Subject: Re: sdhci driver produces kernel oops on ejecting the card
>
> On Fri, Apr 07, 2006 at 10:47:54AM +0200, Pierre Ossman wrote:
> > Mikkel Erup wrote:
> > >
> > > It happens with 2.6.16-git20 as well.
> > > Attached are the log file and kernel .config
> > >
> >
> > Since it ooopses during umount, I'm guessing that it's a problem
> > somewhere in mmc_block. I'll try to get some time to look closer at it
> > during the weekend. Perhaps Russell has some idea until then?
>
> $ grep -n driverfs_dev block/genhd.c
> 558: physdev = disk->driverfs_dev;
> $
>
> Hmm, okay, genhd contains a reference to a device object, but there's
> no sign of _any_ refcounting in sight.
>
> What's happening is that the MMC card block device is setup and registered
> with genhd. We set md->disk->driverfs_dev to point at the owning device
> structure.
>
> This generates a uevent, which causes disk->driverfs_dev to be dereferenced.
> All fine here. We mount the partition, which causes another uevent to be
> generated, again dereferencing disk->driverfs_dev.
>
> If we remove the MMC card, we destroy the MMC card block device. This
> seems to generate another uevent for the block device. At this point,
> the counted references to the MMC card block device fall to zero.
>
> But wait! There's still uncounted disk->driverfs_dev reference waiting
> for...
>
> You unmount the partition. This calls block_uevent, which dereferences
> disk->driverfs_dev. You know what happens now.
>
> The levels above genhd can't do the refcounting because they don't know
> when stuff has finished with driverfs_dev. So the only place for sane
> refcounting seems to be genhd.c, as per the patch below.
>
> Comments?
>
> diff -up -x BitKeeper -x ChangeSet -x SCCS -x _xlk -x *.orig -x *.rej -x .git a/block/genhd.c b/block/genhd.c
> --- a/block/genhd.c Sat Feb 18 10:31:37 2006
> +++ b/block/genhd.c Fri Apr 7 15:22:21 2006
> @@ -262,6 +262,7 @@ static int exact_lock(dev_t dev, void *d
> */
> void add_disk(struct gendisk *disk)
> {
> + get_device(disk->driverfs_dev);
> disk->flags |= GENHD_FL_UP;
> blk_register_region(MKDEV(disk->major, disk->first_minor),
> disk->minors, NULL, exact_match, exact_lock, disk);
> @@ -507,6 +508,7 @@ static struct attribute * default_attrs[
> static void disk_release(struct kobject * kobj)
> {
> struct gendisk *disk = to_disk(kobj);
> + put_device(disk->driverfs_dev);
> kfree(disk->random);
> kfree(disk->part);
> free_disk_stats(disk);
>
>
> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of: 2.6 Serial core
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
> ----- End forwarded message -----
>
> --
> Russell King
>
On Fri, May 12, 2006 at 10:03:01PM +0100, Russell King wrote:
> On Fri, May 12, 2006 at 01:52:15PM -0700, Greg KH wrote:
> > On Fri, May 12, 2006 at 09:38:50PM +0100, Russell King wrote:
> > > On Fri, May 12, 2006 at 12:09:50PM -0700, Linus Torvalds wrote:
> > > > On Fri, 12 May 2006, James Bottomley wrote:
> > > > > I suggest simply reversing this patch at the moment. If Russell and
> > > > > Jens can tell me what they're trying to do I'll see if there's another
> > > > > way to do it.
> > > >
> > > > Reverted, with a big changelog entry to explain why.
> > >
> > > Great, I'm fucked by the SCSI folk again.
> > >
> > > Can we revert the patch which broke the MMC/SD layer - the one which
> > > added the mount/unmount hotplug events as well then.
> >
> > Why would the mount/unmount hotplug event change break MMC/SD? Do you
> > have a reference to the patch in question?
>
> Please read the commit message in the change in question.
Hm, I did. I can't seem to find the offending patch you refer to, have
a changeset number?
thanks,
greg k-h
On Fri, May 12, 2006 at 10:03:01PM +0100, Russell King wrote:
> The block layer holds on to a reference to a struct device which
> isn't refcounted (until I added it with my patch.) Hence struct
> gendisk structures have a completely independent lifetime and are
> only destroyed when all references to them are removed.
Yes, they are and that's intentional.
Can you explain WTF do you drop that reference so late and not in the
del_gendisk() time?
On Fri, May 12, 2006 at 09:34:16PM +0100, Russell King wrote:
> On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> > Yes. We could just revert that commit, but it seems correct, and I'd
> > really like for somebody to understand _why_ that commit matters at all. I
> > certainly don't see the overlap here..
>
> Reverting the commit breaks MMC/SD in a very real way, and the fix
> is plainly correct and is actually the only possible fix that can be
> applied.
Bullshit. Could you explain what generic code dereferences ->driverfs_dev
after del_gendisk()? If you see such beast, please tell; _that_ is the
real bug.
On Fri, 12 May 2006, Linus Torvalds wrote:
>
> You introduced a commit that fixed one thing, and broke another thing.
And btw, don't take that "you" personally.
This happens.
All the time. And definitely _not_ just to you.
It's why common infrastructure can be such an incredible pain: it may be a
nice common layer, but it does obviously end up affecting a hell of a lot
of different devices and usages. "Private" code is often much better, and
that's what we used to have.
Now, sadly, I think we need that common device layer infrastructure
exactly because otherwise we could never have done any global device
management etc, so in this case that common interface is definitely a
"necessary evil".
And with that necessary evil comes the linkages that it implies.
We'd all be much happier of one piece of code didn't depend on five or six
other pieces of code, and a bug-fix in one place would be guaranteed to
not ever have any other side effects.
We'd all also be much happier if we were all young, healthy, good-looking
and drive Lamborghinis. And didn't have incipient beer-bellies (not that
_I_ would ever have one, of course.. Oh, no. I'm obviously talking about
all you other scruffy people. Me, I'm perfect.)
Sadly, neither of the above schenarios are really very realistic.
So we'd love to have more information. Please?
Linus
On Fri, May 12, 2006 at 02:27:19PM -0700, Linus Torvalds wrote:
> [remove all the inflamatory stuff]
Anyway, you asked for the original oopsen, and here they are. Enjoy.
From: Todd Blumer <[email protected]>
On a PXA27x handheld (iPAQ hx4700), when we eject a mounted SD memory
card, we get a kernel panic (kernel trying to clean up non-existent
device). One hack patch to avoid the panic is:
--- fs/partitions/check.c 10 Apr 2006 22:57:27 -0000 1.15
+++ fs/partitions/check.c 4 May 2006 20:30:15 -0000
@@ -491,6 +491,7 @@
kfree(disk_name);
}
put_device(disk->driverfs_dev);
+ disk->driverfs_dev = 0; /* HACK - what's the right
solution? */
}
kobject_uevent(&disk->kobj, KOBJ_REMOVE);
kobject_del(&disk->kobj);
...
root@ipaq-pxa270:~# Unable to handle kernel NULL pointer dereference at
virtual2pgd = c218c000
[00000002] *pgd=a20ec031, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1]
Modules linked in: nls_iso8859_1 nls_cp437 snd_pcm_oss snd_mixer_oss
snd_hx4700rCPU: 0
PC is at strlen+0xc/0x30
LR is at kobject_get_path+0x2c/0xc0
pc : [<c00ee56c>] lr : [<c00eb724>] Not tainted
sp : c21bfde4 ip : c21bfdf4 fp : c21bfdf0
r10: c21bfe28 r9 : 000007a5 r8 : c21bfe2c
r7 : c3554074 r6 : 000000d0 r5 : 00000001 r4 : c3554074
r3 : 00000002 r2 : c01fa88f r1 : 00000002 r0 : 00000002
Flags: nzCv IRQs on FIQs on Mode SVC_32 Segment user
Control: 397F Table: A218C000 DAC: 00000015
Process umount (pid: 2384, stack limit = 0xc21be198)
Stack: (0xc21bfde4 to 0xc21c0000)
fde0: c21bfe14 c21bfdf4 c00eb724 c00ee56c c34876c0 c355400c
c3542b78
fe00: 0000001a c21bfe2c c21bfe58 c21bfe18 c00e2edc c00eb704 000007a5
c21bfe28
fe20: c01fa888 000000fe 00000012 00000002 c3c9585b c0226e04 c32ff7a8
c3542b60
fe40: c3c9583f c01fab8c c0226db0 c21bfeb0 c21bfe5c c00ec130 c00e2dcc
c3c9585b
fe60: 000007a5 c3c95800 c0226dc4 c3e15b20 00000001 00000001 00000000
00000000
fe80: ffffffff bee45e28 c0331040 c3554200 c3554200 00000000 00000000
c21bff28
fea0: c21bff20 c21bfec0 c21bfeb4 c0084d6c c00ebf1c c21bfed8 c21bfec4
c0084f04
fec0: c0084d48 c3554200 c0225d84 c21bfef0 c21bfedc c0083d90 c0084ef0
c3554200
fee0: c032bd20 c21bff08 c21bfef4 c009b804 c0083d30 c21bff28 c21be000
c21bff1c
ff00: c21bff0c c008baec c009b79c 00000000 c21bff94 c21bff20 c009c0ec
c008bad8
ff20: c21bff20 c21bff20 c20e73a0 c032bd20 c21bff3c c0024db0 c00ee388
00000001
ff40: 00000001 00000000 00000000 ffffffff bee45e28 00000000 00000000
c21bffb0
ff60: 00000000 00091050 c21bff9c bee44db8 000923c8 bee44db8 00000016
c001cf64
ff80: c21be000 00091050 c21bffa4 c21bff98 c009c114 c009beec 00000000
c21bffa8
ffa0: c001cdc0 c009c10c bee44db8 000923c8 bee44db8 bee44dba 000923ca
0000006d
ffc0: bee44db8 000923c8 bee44db8 00000000 00090710 00000000 00091050
00000000
ffe0: 40160e70 bee44d9c 00063fc4 40160e74 60000010 bee44db8 401bd7d8
400dad9c
Backtrace:
[<c00ee560>] (strlen+0x0/0x30) from [<c00eb724>]
(kobject_get_path+0x2c/0xc0)
[<c00eb6f8>] (kobject_get_path+0x0/0xc0) from [<c00e2edc>]
(block_uevent+0x11c/) r8 = C21BFE2C r7 = 0000001A r6 = C3542B78 r5 =
C355400C
r4 = C34876C0
[<c00e2dc0>] (block_uevent+0x0/0x1f4) from [<c00ec130>]
(kobject_uevent+0x220/0)[<c00ebf10>] (kobject_uevent+0x0/0x478) from
[<c0084d6c>] (bdev_uevent+0x30/0x3)[<c0084d3c>] (bdev_uevent+0x0/0x34)
from [<c0084f04>] (kill_block_super+0x20/0x)[<c0084ee4>]
(kill_block_super+0x0/0x3c) from [<c0083d90>] (deactivate_super+0x) r5 =
C0225D84 r4 = C3554200
[<c0083d24>] (deactivate_super+0x0/0x84) from [<c009b804>]
(mntput_no_expire+0x) r5 = C032BD20 r4 = C3554200
[<c009b790>] (mntput_no_expire+0x0/0xc0) from [<c008baec>]
(path_release_on_umo) r5 = C21BE000 r4 = C21BFF28
[<c008bacc>] (path_release_on_umount+0x0/0x24) from [<c009c0ec>]
(sys_umount+0x) r4 = 00000000
[<c009bee0>] (sys_umount+0x0/0x220) from [<c009c114>]
(sys_oldumount+0x14/0x18)
[<c009c100>] (sys_oldumount+0x0/0x18) from [<c001cdc0>]
(ret_fast_syscall+0x0/0)Code: e89da800 e1a0c00d e92dd800 e24cb004 (e5d03000)
and:
From: Mikkel Erup <mikkelerup () yahoo ! com>
BUG: unable to handle kernel NULL pointer dereference
at virtual address 00000002
printing eip:
c021570d
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat
speedstep_lib i915 drm mmc_block nvram pcmcia
yenta_socket rsrc_nonstatic sdhci pcmcia_core e1000
mmc_core evdev
CPU: 0
EIP: 0060:[<c021570d>] Not tainted VLI
EFLAGS: 00010202 (2.6.16-git20 #3)
EIP is at kobject_get_path+0x2d/0xe0
eax: 00000000 ebx: 00000001 ecx: ffffffff edx:
00000000
esi: e35e9e74 edi: 00000002 ebp: 000007a5 esp:
dcea1e5c
ds: 007b es: 007b ss: 0068
Process umount (pid: 3407, threadinfo=dcea0000
task=dd334a30)
Stack: <0>c038cad3 dcea1ea0 00000008 dcea1e9c e2765200
e35e9e0c df304158 000007a5
c020bb5d e35e9e74 000000d0 dcea1ea0 de9ad85b
000007a5 dcea1ea4 c038cacb
000000fe 00000002 00000012 c020ba50 c03c8860
0000001a de9ad800 c0215d1e
Call Trace:
<c020bb5d> block_uevent+0x10d/0x210 <c020ba50>
block_uevent+0x0/0x210
<c0215d1e> kobject_uevent+0x1fe/0x520 <c0166860>
bdev_uevent+0x20/0x40
<c0166cc1> kill_block_super+0x21/0x50 <c0166f80>
deactivate_super+0x70/0xa0
<c017fecb> sys_umount+0x4b/0x280 <c010306f>
sysenter_past_esp+0x54/0x75
Code: d2 57 56 53 bb 01 00 00 00 83 ec 10 8b 74 24 24
8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 8b 3e 85 ff
74 1d b9 ff ff ff ff 89 d0 <f2> ae f7 d1 49 8b 76 24 8d
2c 19 8d 5d 01 85 f6 75 e1 85 db 75
Now, can I have your permission not to do anything more tonight? Is
that okay with you, or are you going to rant about that as well?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 10:43:54PM +0100, Al Viro wrote:
> On Fri, May 12, 2006 at 09:34:16PM +0100, Russell King wrote:
> > On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> > > Yes. We could just revert that commit, but it seems correct, and I'd
> > > really like for somebody to understand _why_ that commit matters at all. I
> > > certainly don't see the overlap here..
> >
> > Reverting the commit breaks MMC/SD in a very real way, and the fix
> > is plainly correct and is actually the only possible fix that can be
> > applied.
>
> Bullshit. Could you explain what generic code dereferences ->driverfs_dev
> after del_gendisk()? If you see such beast, please tell; _that_ is the
> real bug.
Al, I think you're going to eat your own bull on this one.
You'll find that in the reply I've just sent to Linus - two oopen
reported by two different people since the mount/umount hotplug
events got re-merged.
The problem case is:
- insert card
- mount filesystem
- remove card
- umount filesystem <bang, oops>
The generic code is block_uevent(), which is called at umount time
_after_ the gendisk has been deleted (which happens when the card has
been removed.)
Basically, you can't umount a destroyed block device without oopsing.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 10:43:54PM +0100, Al Viro wrote:
> On Fri, May 12, 2006 at 09:34:16PM +0100, Russell King wrote:
> > On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> > > Yes. We could just revert that commit, but it seems correct, and I'd
> > > really like for somebody to understand _why_ that commit matters at all. I
> > > certainly don't see the overlap here..
> >
> > Reverting the commit breaks MMC/SD in a very real way, and the fix
> > is plainly correct and is actually the only possible fix that can be
> > applied.
>
> Bullshit. Could you explain what generic code dereferences ->driverfs_dev
> after del_gendisk()? If you see such beast, please tell; _that_ is the
> real bug.
Aha... So block_uevent() appears to be badly broken. Lovely...
OK, could somebody explain WTF is userland supposed to do with event
refering to device that had been gone for a long time?
On Fri, May 12, 2006 at 10:55:20PM +0100, Russell King wrote:
> On Fri, May 12, 2006 at 10:43:54PM +0100, Al Viro wrote:
> > On Fri, May 12, 2006 at 09:34:16PM +0100, Russell King wrote:
> > > On Fri, May 12, 2006 at 10:36:57AM -0700, Linus Torvalds wrote:
> > > > Yes. We could just revert that commit, but it seems correct, and I'd
> > > > really like for somebody to understand _why_ that commit matters at all. I
> > > > certainly don't see the overlap here..
> > >
> > > Reverting the commit breaks MMC/SD in a very real way, and the fix
> > > is plainly correct and is actually the only possible fix that can be
> > > applied.
> >
> > Bullshit. Could you explain what generic code dereferences ->driverfs_dev
> > after del_gendisk()? If you see such beast, please tell; _that_ is the
> > real bug.
>
> Al, I think you're going to eat your own bull on this one.
>
> You'll find that in the reply I've just sent to Linus - two oopen
> reported by two different people since the mount/umount hotplug
> events got re-merged.
>
> The problem case is:
>
> - insert card
> - mount filesystem
> - remove card
> - umount filesystem <bang, oops>
>
> The generic code is block_uevent(), which is called at umount time
> _after_ the gendisk has been deleted (which happens when the card has
> been removed.)
>
> Basically, you can't umount a destroyed block device without oopsing.
So block_uevent() is bogus; great, we've located the real bug. The
real question: what uses these events and what can userland _do_ when
it gets something about a device that had been gone for a month?
Secondary question: who had resurrected that crap? I distinctly remember
killing it off...
On Fri, 12 May 2006, Russell King wrote:
>
> From: Todd Blumer <[email protected]>
> On a PXA27x handheld (iPAQ hx4700), when we eject a mounted SD memory
> card, we get a kernel panic (kernel trying to clean up non-existent
> device). One hack patch to avoid the panic is:
>
> --- fs/partitions/check.c 10 Apr 2006 22:57:27 -0000 1.15
> +++ fs/partitions/check.c 4 May 2006 20:30:15 -0000
> @@ -491,6 +491,7 @@
> kfree(disk_name);
> }
> put_device(disk->driverfs_dev);
> + disk->driverfs_dev = 0; /* HACK - what's the right solution? */
> }
> kobject_uevent(&disk->kobj, KOBJ_REMOVE);
> kobject_del(&disk->kobj);
Btw, on the face it of, I really think that this patch is correct
regardless of any other issues.
We're clearly getting rid of "disk->driverfs_dev" (that's what the
"put_device()" does). So we should clear it out - using
"disk->driverfs_dev" afterwards is clearly not _valid_, because we've
dropped the reference.
Of course, we shouldn't assign zero to it. We should assign NULL.
Alternatively, we could poison it (so that anybody who tries to use it
gets a nice oops ASAP).
Now, regardless of whether that is correct or not, it might not be the
only problem, of course. It sounds like there is something else going on
here too.
Linus
On Fri, 12 May 2006, Al Viro wrote:
>
> Secondary question: who had resurrected that crap? I distinctly remember
> killing it off...
If you did, I don't think it ever got into the kernel.
It was added by Kay Sievers on Nov 3, 2004, according to the old history
(back then it was in drivers/block/genhd.c, and the function was called
"block_hotplug()", but apart from renaming the function and moving the
file, it's recognizably the same.
Of course, you may have killed off an even earlier incarnation..
Linus
On Fri, May 12, 2006 at 03:22:42PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 12 May 2006, Al Viro wrote:
> >
> > Secondary question: who had resurrected that crap? I distinctly remember
> > killing it off...
>
> If you did, I don't think it ever got into the kernel.
>
> It was added by Kay Sievers on Nov 3, 2004, according to the old history
> (back then it was in drivers/block/genhd.c, and the function was called
> "block_hotplug()", but apart from renaming the function and moving the
> file, it's recognizably the same.
>
> Of course, you may have killed off an even earlier incarnation..
Ah, right - there was another uevent mess in fs/super.c. Sorry, got them
confused... and that FPOS _is_ back.
gregkh, may I ask why had bdev_uevent() been resurrected?
On Fri, May 12, 2006 at 03:22:42PM -0700, Linus Torvalds wrote:
> On Fri, 12 May 2006, Al Viro wrote:
> > Secondary question: who had resurrected that crap? I distinctly remember
> > killing it off...
>
> If you did, I don't think it ever got into the kernel.
>
> It was added by Kay Sievers on Nov 3, 2004, according to the old history
> (back then it was in drivers/block/genhd.c, and the function was called
> "block_hotplug()", but apart from renaming the function and moving the
> file, it's recognizably the same.
>
> Of course, you may have killed off an even earlier incarnation..
The changes in question are:
commit fa675765afed59bb89adba3369094ebd428b930b
tree 777a8c1bb48ef7de39073104f974209f4a462b6f
parent b00dc3ad74fdb676552d46ee573b88e927240d0c
author Greg Kroah-Hartman <[email protected]> Wed, 22 Feb 2006 09:39:02 -0800
committer Greg Kroah-Hartman <[email protected]> Wed, 22 Feb 2006 09:39:02 -0800
Revert mount/umount uevent removal
This change reverts the 033b96fd30db52a710d97b06f87d16fc59fee0f1 commit
from Kay Sievers that removed the mount/umount uevents from the kernel.
Some older versions of HAL still depend on these events to detect when a
new device has been mounted. These events are not correctly emitted,
and are broken by design, and so, should not be relied upon by any
future program. Instead, the /proc/mounts file should be polled to
properly detect this kind of event.
A feature-removal-schedule.txt entry has been added, noting when this
interface will be removed from the kernel.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 033b96fd30db52a710d97b06f87d16fc59fee0f1
tree 00fbccf2cf478307e213f298a221e330f3ba12ae
parent 0f76e5acf9dc788e664056dda1e461f0bec93948
author Kay Sievers <[email protected]> 1131685795 +0100
committer Greg Kroah-Hartman <[email protected]> 1136420287 -0800
[PATCH] remove mount/umount uevents from superblock handling
The names of these events have been confusing from the beginning
on, as they have been more like claim/release events. We needed these
events for noticing HAL if storage devices have been mounted.
Thanks to Al, we have the proper solution now and can poll()
/proc/mounts instead to get notfied about mount tree changes.
Signed-off-by: Kay Sievers <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, 12 May 2006, Linus Torvalds wrote:
>
> On Fri, 12 May 2006, Russell King wrote:
> >
> > From: Todd Blumer <[email protected]>
> > On a PXA27x handheld (iPAQ hx4700), when we eject a mounted SD memory
> > card, we get a kernel panic (kernel trying to clean up non-existent
> > device). One hack patch to avoid the panic is:
> >
> > --- fs/partitions/check.c 10 Apr 2006 22:57:27 -0000 1.15
> > +++ fs/partitions/check.c 4 May 2006 20:30:15 -0000
> > @@ -491,6 +491,7 @@
> > kfree(disk_name);
> > }
> > put_device(disk->driverfs_dev);
> > + disk->driverfs_dev = 0; /* HACK - what's the right solution? */
> > }
> > kobject_uevent(&disk->kobj, KOBJ_REMOVE);
> > kobject_del(&disk->kobj);
>
> Btw, on the face it of, I really think that this patch is correct
> regardless of any other issues.
.. and I suspect it also shows what the "other issues" are.
Shouldn't that KOBJ_REMOVE uevent happen _before_ we do all the freeing of
the backing dev object? That KOBJ_REMOVE thing actually seems to want to
report the pathname for the disk it removes. Preferably before the thing
is gone and can't be reported on..
Ie shouldn't the diff be something like this?
Linus
---
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 45ae7dd..7ef1f09 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -533,6 +533,7 @@ void del_gendisk(struct gendisk *disk)
devfs_remove_disk(disk);
+ kobject_uevent(&disk->kobj, KOBJ_REMOVE);
if (disk->holder_dir)
kobject_unregister(disk->holder_dir);
if (disk->slave_dir)
@@ -545,7 +546,7 @@ void del_gendisk(struct gendisk *disk)
kfree(disk_name);
}
put_device(disk->driverfs_dev);
+ disk->driverfs_dev = NULL;
}
- kobject_uevent(&disk->kobj, KOBJ_REMOVE);
kobject_del(&disk->kobj);
}
On Fri, May 12, 2006 at 11:28:16PM +0100, Al Viro wrote:
> Ah, right - there was another uevent mess in fs/super.c. Sorry, got them
> confused... and that FPOS _is_ back.
Actually...
What happens is that turd in fs/super.c (one that should not have been
resurrected) triggers call of block_uevent(). Which uses ->driverfs_dev.
Which is a bug.
So IMO we should simply kill that animal _again_, and see if block_uevent()
is actually need for anything on its own.
On Fri, May 12, 2006 at 02:32:05PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 12 May 2006, Greg KH wrote:
> >
> > No, I don't know, that's why I just asked :)
> >
> > And this bug doesn't have anything to do with why my mmc/sd cards are
> > suddenly not showing up at all anymore in my laptop, right? I need to
> > track that regression from 2.6.17-rc1 down...
>
> Well, that's certainly an interesting regression to look into too..
No, nevermind, I got this to work again. Turns out you need to have a
SD card in the reader when loading the sdhci module for it to work
properly. I'll take it up with the author of the driver as to if this
is "correct" or not.
thanks,
greg k-h
On Fri, May 12, 2006 at 11:48:04PM +0100, Al Viro wrote:
> On Fri, May 12, 2006 at 11:28:16PM +0100, Al Viro wrote:
> > Ah, right - there was another uevent mess in fs/super.c. Sorry, got them
> > confused... and that FPOS _is_ back.
>
> Actually...
>
> What happens is that turd in fs/super.c (one that should not have been
> resurrected) triggers call of block_uevent(). Which uses ->driverfs_dev.
> Which is a bug.
>
> So IMO we should simply kill that animal _again_, and see if block_uevent()
> is actually need for anything on its own.
PS: it _is_ OK to trigger than puppy from add_disk()/del_gendisk() and in
between. I'm not sure if anyone needs it for anything, though. Triggering
it from bdev_uevent() is definitely bogus.
On Fri, 12 May 2006, Al Viro wrote:
>
> PS: it _is_ OK to trigger than puppy from add_disk()/del_gendisk() and in
> between. I'm not sure if anyone needs it for anything, though. Triggering
> it from bdev_uevent() is definitely bogus.
Wouldn't it be perfectly ok if we just made sure to keep the disk device
refcount elevated by the _mount_?
Ie Russell's patch to elevate it around everything didn't work, but
elevating the bdev->bd_disk reference count by a mount, and dropping it on
umount (after doing the umount event) should be ok. No?
Linus
On Fri, May 12, 2006 at 04:04:53PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 12 May 2006, Al Viro wrote:
> >
> > PS: it _is_ OK to trigger than puppy from add_disk()/del_gendisk() and in
> > between. I'm not sure if anyone needs it for anything, though. Triggering
> > it from bdev_uevent() is definitely bogus.
>
> Wouldn't it be perfectly ok if we just made sure to keep the disk device
> refcount elevated by the _mount_?
>
> Ie Russell's patch to elevate it around everything didn't work, but
> elevating the bdev->bd_disk reference count by a mount, and dropping it on
> umount (after doing the umount event) should be ok. No?
->bd_disk _is_ pinned down. However, it's not the problem - we keep
gendisk for as long as anyone uses it. Which might be after the underlying
object is gone; that one is controlled by driver and driver calls
del_gendisk() when it decides that we are through.
The problem is with disk->driverfs_dev, not disk itself. Block layer
has no fscking business touching it after del_gendisk() - if nothing else,
we might have _no_ underlying object at all from the very beginning.
So anything that wants events related to partitions, let alone mounting,
can't expect to see PHYSDEV... crap. Moreover, it can bloody well
get to PHYSDEV... itself *if* it wants to and if it's there. There's
a reason why we have that symlink in sys/block/<device> and userland can
bloody well access it on its own.
On Sat, May 13, 2006 at 12:21:31AM +0100, Al Viro wrote:
> The problem is with disk->driverfs_dev, not disk itself. Block layer
> has no fscking business touching it after del_gendisk() - if nothing else,
> we might have _no_ underlying object at all from the very beginning.
>
> So anything that wants events related to partitions, let alone mounting,
> can't expect to see PHYSDEV... crap. Moreover, it can bloody well
> get to PHYSDEV... itself *if* it wants to and if it's there. There's
> a reason why we have that symlink in sys/block/<device> and userland can
> bloody well access it on its own.
BTW, the best option is to kill bdev_uevent() again. Short of that,
skip PHYSDEV mess if disk doesn't have GENHD_FL_UP.
On Sat, 13 May 2006, Al Viro wrote:
>
> BTW, the best option is to kill bdev_uevent() again. Short of that,
> skip PHYSDEV mess if disk doesn't have GENHD_FL_UP.
I do think the mount/umount events are valid and interesting, so I'd much
rather see the second version.
However, that does beg the question: wouldn't that effectively be what the
patch I posted would do? Notably the "disk->driverfs_dev = NULL" part
after we've dropped it (the "KOBJ_REMOVE" event move is a separate issue,
mixed here into the same patch, but should result in possibly better name
generation for the event).
Basically, onces driverfs_dev has been dropped, we NULL it out, and then
the people who use it automatically get the right result.
Yes? No? "You're a total klutz, Linus, that patch won't actually do
anything, because <xyz>"?
Linus
---
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 45ae7dd..7ef1f09 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -533,6 +533,7 @@ void del_gendisk(struct gendisk *disk)
devfs_remove_disk(disk);
+ kobject_uevent(&disk->kobj, KOBJ_REMOVE);
if (disk->holder_dir)
kobject_unregister(disk->holder_dir);
if (disk->slave_dir)
@@ -545,7 +546,7 @@ void del_gendisk(struct gendisk *disk)
kfree(disk_name);
}
put_device(disk->driverfs_dev);
+ disk->driverfs_dev = NULL;
}
- kobject_uevent(&disk->kobj, KOBJ_REMOVE);
kobject_del(&disk->kobj);
}
On Fri, May 12, 2006 at 03:37:59PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 12 May 2006, Linus Torvalds wrote:
> >
> > On Fri, 12 May 2006, Russell King wrote:
> > >
> > > From: Todd Blumer <[email protected]>
> > > On a PXA27x handheld (iPAQ hx4700), when we eject a mounted SD memory
> > > card, we get a kernel panic (kernel trying to clean up non-existent
> > > device). One hack patch to avoid the panic is:
> > >
> > > --- fs/partitions/check.c 10 Apr 2006 22:57:27 -0000 1.15
> > > +++ fs/partitions/check.c 4 May 2006 20:30:15 -0000
> > > @@ -491,6 +491,7 @@
> > > kfree(disk_name);
> > > }
> > > put_device(disk->driverfs_dev);
> > > + disk->driverfs_dev = 0; /* HACK - what's the right solution? */
> > > }
> > > kobject_uevent(&disk->kobj, KOBJ_REMOVE);
> > > kobject_del(&disk->kobj);
> >
> > Btw, on the face it of, I really think that this patch is correct
> > regardless of any other issues.
>
> .. and I suspect it also shows what the "other issues" are.
>
> Shouldn't that KOBJ_REMOVE uevent happen _before_ we do all the freeing of
> the backing dev object? That KOBJ_REMOVE thing actually seems to want to
> report the pathname for the disk it removes. Preferably before the thing
> is gone and can't be reported on..
>
> Ie shouldn't the diff be something like this?
It looks sane to me. Russell, does it solve your oops too?
thanks,
greg k-h
On Fri, May 12, 2006 at 11:28:16PM +0100, Al Viro wrote:
> On Fri, May 12, 2006 at 03:22:42PM -0700, Linus Torvalds wrote:
> >
> >
> > On Fri, 12 May 2006, Al Viro wrote:
> > >
> > > Secondary question: who had resurrected that crap? I distinctly remember
> > > killing it off...
> >
> > If you did, I don't think it ever got into the kernel.
> >
> > It was added by Kay Sievers on Nov 3, 2004, according to the old history
> > (back then it was in drivers/block/genhd.c, and the function was called
> > "block_hotplug()", but apart from renaming the function and moving the
> > file, it's recognizably the same.
> >
> > Of course, you may have killed off an even earlier incarnation..
>
> Ah, right - there was another uevent mess in fs/super.c. Sorry, got them
> confused... and that FPOS _is_ back.
>
> gregkh, may I ask why had bdev_uevent() been resurrected?
One user, running an old version of HAL, on a distro that had an updated
version available (Gentoo), refused to upgrade it and was upset that
when they plugged a usb-storage device in, it did not show up as an icon
on the desktop.
Andrew pointed out that we broke the "don't break the userspace API"
kernel rule, so I put the patch back, and noted when it would finally be
removed.
I would gladly remove it again if everyone can agree that they will not
get upset at me again when someone running obsolete/broken userspace
programs complains...
thanks,
greg k-h
On Fri, May 12, 2006 at 04:50:32PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 13 May 2006, Al Viro wrote:
> >
> > BTW, the best option is to kill bdev_uevent() again. Short of that,
> > skip PHYSDEV mess if disk doesn't have GENHD_FL_UP.
>
> I do think the mount/umount events are valid and interesting, so I'd much
> rather see the second version.
mount/umount events are BS, but that's a separate story. In case you've
missed the original discussion:
* mount/umount is not tied to block device; if you really want
to track mount tree changes, you need much more; if you want to track
the set of filesystems being active, you _still_ need more.
* "block device is used by fs" is valid, but bdev_uevent() is _NOT_
that; we miss e.g. journals.
* "block device is being claimed exclusively" - even more interesting,
and again, bdev_uevent() isn't it.
* "underlying object of _the_ block device used by a filesystem"
is not well-defined, to put it mildly.
IOW, it's a big mess. Note that if you care about real mount events
("set of mounts has changed") you can do that already, and get _all_
of them - including mount --move, etc. poll(2) on /proc/mounts
will do that for you. And you don't need to be priveleged for that,
while we are at it.
That's what hald wanted bdev_uevent() for. All attempts to find what
other users are really trying to get had not brought anything.
I'm not saying that we don't need something similar, but let's try to make
sure it makes _sense_ and isn't just "oh, my userland code rereads too
often; guess if I would see when kernel reaches this, this and that point
I would be able to reduce the frequency of rereads and wouldn't miss too
much in the cases I've ran into so far".
In the current form bdev_uevent() makes no sense. It pushes a lot of
vaguely related stuff and doesn't cover _any_ sane use fully.
So yeah, I'm seriously pissed off at the bdev_uevent() merge - at hald
crowd for pushing it to gregkh, at gregkh for blindly passing it on
and at myself for missing it back then. Note that as soon as hald folks
_did_ explain what they really wanted, they've got a sane solution (and
are using it now just fine).
> However, that does beg the question: wouldn't that effectively be what the
> patch I posted would do? Notably the "disk->driverfs_dev = NULL" part
> after we've dropped it (the "KOBJ_REMOVE" event move is a separate issue,
> mixed here into the same patch, but should result in possibly better name
> generation for the event).
>
> Basically, onces driverfs_dev has been dropped, we NULL it out, and then
> the people who use it automatically get the right result.
More or less. We don't want to mess with refcounts at all; aside of that,
yes, checking if it's NULL is OK. subsystem mutex should be enough, AFAICS.
On Fri, May 12, 2006 at 04:57:45PM -0700, Greg KH wrote:
> On Fri, May 12, 2006 at 03:37:59PM -0700, Linus Torvalds wrote:
> > On Fri, 12 May 2006, Linus Torvalds wrote:
> > > On Fri, 12 May 2006, Russell King wrote:
> > > >
> > > > From: Todd Blumer <[email protected]>
> > > > On a PXA27x handheld (iPAQ hx4700), when we eject a mounted SD memory
> > > > card, we get a kernel panic (kernel trying to clean up non-existent
> > > > device). One hack patch to avoid the panic is:
> > > >
> > > > --- fs/partitions/check.c 10 Apr 2006 22:57:27 -0000 1.15
> > > > +++ fs/partitions/check.c 4 May 2006 20:30:15 -0000
> > > > @@ -491,6 +491,7 @@
> > > > kfree(disk_name);
> > > > }
> > > > put_device(disk->driverfs_dev);
> > > > + disk->driverfs_dev = 0; /* HACK - what's the right solution? */
> > > > }
> > > > kobject_uevent(&disk->kobj, KOBJ_REMOVE);
> > > > kobject_del(&disk->kobj);
> > >
> > > Btw, on the face it of, I really think that this patch is correct
> > > regardless of any other issues.
> >
> > .. and I suspect it also shows what the "other issues" are.
> >
> > Shouldn't that KOBJ_REMOVE uevent happen _before_ we do all the freeing of
> > the backing dev object? That KOBJ_REMOVE thing actually seems to want to
> > report the pathname for the disk it removes. Preferably before the thing
> > is gone and can't be reported on..
> >
> > Ie shouldn't the diff be something like this?
>
> It looks sane to me. Russell, does it solve your oops too?
It appears to.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, May 12, 2006 at 04:50:32PM -0700, Linus Torvalds wrote:
> On Sat, 13 May 2006, Al Viro wrote:
> >
> > BTW, the best option is to kill bdev_uevent() again. Short of that,
> > skip PHYSDEV mess if disk doesn't have GENHD_FL_UP.
>
> I do think the mount/umount events are valid and interesting, so I'd much
> rather see the second version.
>
> However, that does beg the question: wouldn't that effectively be what the
> patch I posted would do? Notably the "disk->driverfs_dev = NULL" part
> after we've dropped it (the "KOBJ_REMOVE" event move is a separate issue,
> mixed here into the same patch, but should result in possibly better name
> generation for the event).
>
> Basically, onces driverfs_dev has been dropped, we NULL it out, and then
> the people who use it automatically get the right result.
>
> Yes? No? "You're a total klutz, Linus, that patch won't actually do
> anything, because <xyz>"?
Just want to confirm that I can't recreate the SCSI slab error anymore
with your patch (032ebf2620ef99a4fedaa0f77dc2272095ac5863) in the
current -git kernel.
Erik
--
+-- Erik Mouw -- http://www.harddisk-recovery.nl -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands