Hi Alan & Bart !
While fixing my hotswap media-bay IDE controller for 2.6, I found
a locking problem with IDE (again ? :) in ide_unregister_hw. Basically
the problem is that it calls blk_cleanup_queue(), which is unsafe to
call with a lock held (it will call flush_workqueue() at one point).
Other side effect, flush_workqueue() will re-enable IRQs, thus allowing
us to get an IRQ while holding the spinlock -> double lock, but that's
just a side effect of calling flush_workqueue in that context.
So the call to blk_cleanup_queue() shall be moved outside of the
spinlock. I don't know much about the BIO details, is it possible
to first unregister_blkdev, then only call blk_cleanup_queue() ? That
would help making sure we don't get a request sneaking in ?
Ben.
And there's more to it... ide_unregister() doesn't copy hwif->hold from
old to new hwif causing my hotswap bay to lose it's iops on next plug,
it doesn't call unregister_device() for neither hwif->gendev nor
drive[n]->gendev, causing the device model list to be corrupted after
an unregister, ...
Here's a patch that makes it work (media bay hotswap works and doesn't
break the device model device list), however, locking is probably broken
as hell (I just release the spinlock here or there).
I'm not sure what is the best way to fix that locking issue. Regarding
userland access, I beleive the settings semaphore is the way to go.
Regarding the blk queue, I don't know what's the "official" way to
atomically tear down a queue, making sure we properly complete (fail ?)
requests that may have slipped in, and make sure that queue won't be
called again, Jens ? In most cases, once that's done, the rest of
the locking in ide_unregister() is mostly to guard against userland
access (settings) and other ide_register() trying to take the slot
while we are still cleaning up, which can be dealt either with a
"I'm busy housekeeping" flag in hwif, or a semaphore.
What do you think ?
In the meantime, here's the patch that at least makes it work, though
it's probably still racy: (Against 2.6-test2)
(Note to PPC users: there are other bugs in current
drivers/macintosh/mediabay.c in there that need fixes I have here, those
will be pushed separately).
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1008 -> 1.1009
# drivers/ide/ide.c 1.58 -> 1.59
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/08/03 [email protected] 1.1009
# Fixes to ide_unregister() to avoid device list corruption and
# scheduling at interrupt time. Also make sure "hold" flag is
# properly transferred from old to new interface
# --------------------------------------------
#
diff -Nru a/drivers/ide/ide.c b/drivers/ide/ide.c
--- a/drivers/ide/ide.c Sun Aug 3 11:57:56 2003
+++ b/drivers/ide/ide.c Sun Aug 3 11:57:56 2003
@@ -689,8 +689,8 @@
#ifdef CONFIG_PROC_FS
destroy_proc_ide_drives(hwif);
#endif
- hwgroup = hwif->hwgroup;
+ hwgroup = hwif->hwgroup;
/*
* free the irq if we were the only hwif using it
*/
@@ -745,7 +745,11 @@
drive->id = NULL;
}
drive->present = 0;
+ /* Fucked up locking ... */
+ spin_unlock_irq(&ide_lock);
blk_cleanup_queue(&drive->queue);
+ device_unregister(&drive->gendev);
+ spin_lock_irq(&ide_lock);
}
if (hwif->next == hwif) {
BUG_ON(hwgroup->hwif != hwif);
@@ -770,6 +774,11 @@
BUG_ON(hwgroup->hwif == hwif);
}
+ /* More fucked up locking ... */
+ spin_unlock_irq(&ide_lock);
+ device_unregister(&hwif->gendev);
+ spin_lock_irq(&ide_lock);
+
#if !defined(CONFIG_DMA_NONPCI)
if (hwif->dma_base) {
(void) ide_release_dma(hwif);
@@ -794,6 +803,7 @@
put_disk(disk);
}
unregister_blkdev(hwif->major, hwif->name);
+
old_hwif = *hwif;
init_hwif_data(index); /* restore hwif data to pristine status */
hwif->hwgroup = old_hwif.hwgroup;
@@ -812,6 +822,7 @@
hwif->swdma_mask = old_hwif.swdma_mask;
hwif->chipset = old_hwif.chipset;
+ hwif->hold = old_hwif.hold;
#ifdef CONFIG_BLK_DEV_IDEPCI
hwif->pci_dev = old_hwif.pci_dev;
@@ -864,6 +875,13 @@
hwif->ide_dma_retune = old_hwif.ide_dma_retune;
hwif->ide_dma_lostirq = old_hwif.ide_dma_lostirq;
hwif->ide_dma_timeout = old_hwif.ide_dma_timeout;
+ hwif->ide_dma_queued_on = old_hwif.ide_dma_queued_on;
+ hwif->ide_dma_queued_off = old_hwif.ide_dma_queued_off;
+#ifdef CONFIG_BLK_DEV_IDE_TCQ
+ hwif->ide_dma_queued_read = old_hwif.ide_dma_queued_read;
+ hwif->ide_dma_queued_write = old_hwif.ide_dma_queued_write;
+ hwif->ide_dma_queued_start = old_hwif.ide_dma_queued_start;
+#endif
#endif
#if 0
On Sun, Aug 03 2003, Jens Axboe wrote:
> On Sun, Aug 03 2003, Benjamin Herrenschmidt wrote:
> > Hi Alan & Bart !
> >
> > While fixing my hotswap media-bay IDE controller for 2.6, I found
> > a locking problem with IDE (again ? :) in ide_unregister_hw. Basically
> > the problem is that it calls blk_cleanup_queue(), which is unsafe to
> > call with a lock held (it will call flush_workqueue() at one point).
> > Other side effect, flush_workqueue() will re-enable IRQs, thus allowing
> > us to get an IRQ while holding the spinlock -> double lock, but that's
> > just a side effect of calling flush_workqueue in that context.
>
> Irk someone made blk_cleanup_queue() non-atomic. I blame Andrew. And now
> it looks like it's impossible to make it atomic again :/. Not very nice,
> imo it's preferable to keep such unregister functions atomic.
>
> > So the call to blk_cleanup_queue() shall be moved outside of the
> > spinlock. I don't know much about the BIO details, is it possible
> > to first unregister_blkdev, then only call blk_cleanup_queue() ? That
>
> That should work, yes.
>
> > would help making sure we don't get a request sneaking in ?
>
> Hmm not really, there's still a chance that could happen.
and unregister_blkdev() itself isn't even atomic. So I guess IDE does
need fixing anyways.
--
Jens Axboe
On Sun, Aug 03 2003, Benjamin Herrenschmidt wrote:
> Hi Alan & Bart !
>
> While fixing my hotswap media-bay IDE controller for 2.6, I found
> a locking problem with IDE (again ? :) in ide_unregister_hw. Basically
> the problem is that it calls blk_cleanup_queue(), which is unsafe to
> call with a lock held (it will call flush_workqueue() at one point).
> Other side effect, flush_workqueue() will re-enable IRQs, thus allowing
> us to get an IRQ while holding the spinlock -> double lock, but that's
> just a side effect of calling flush_workqueue in that context.
Irk someone made blk_cleanup_queue() non-atomic. I blame Andrew. And now
it looks like it's impossible to make it atomic again :/. Not very nice,
imo it's preferable to keep such unregister functions atomic.
> So the call to blk_cleanup_queue() shall be moved outside of the
> spinlock. I don't know much about the BIO details, is it possible
> to first unregister_blkdev, then only call blk_cleanup_queue() ? That
That should work, yes.
> would help making sure we don't get a request sneaking in ?
Hmm not really, there's still a chance that could happen.
--
Jens Axboe
On Sun, Aug 03 2003, Benjamin Herrenschmidt wrote:
>
> > > would help making sure we don't get a request sneaking in ?
> >
> > Hmm not really, there's still a chance that could happen.
>
> Not too familiar with BIO here, but we would need some kind of
> "dead" flag to cause a reject of any try to insert a new request
> in the queue, don't you think ?
That's exactly right.
> Then, IDE could do something like:
>
> - set dead flag
> - wait for all pending requests to drain (easy: insert a barrier
> in the queue and wait on it, with a hack for the barrier insertion
> to bypass the dead flag... ugh... maybe a blk_terminate_queue()
> doing all that would be helpful ?)
> - unregister blkdev
> - then tear down the queue (leaving the "empty" queue with the dead
> flag set, not just memset(...,0,...), so that any bozo keeping a
> reference to it will be rejected trying to insert request instead
> of trying to tap an uninitalized queue object
>
> What do you think ?
Sounds like just the ticket. It's basically impossible to properly
shutdown a queue without being able to quisce it like you describe. IO
events are unpredictable :)
--
Jens Axboe
> > would help making sure we don't get a request sneaking in ?
>
> Hmm not really, there's still a chance that could happen.
Not too familiar with BIO here, but we would need some kind of
"dead" flag to cause a reject of any try to insert a new request
in the queue, don't you think ?
Then, IDE could do something like:
- set dead flag
- wait for all pending requests to drain (easy: insert a barrier
in the queue and wait on it, with a hack for the barrier insertion
to bypass the dead flag... ugh... maybe a blk_terminate_queue()
doing all that would be helpful ?)
- unregister blkdev
- then tear down the queue (leaving the "empty" queue with the dead
flag set, not just memset(...,0,...), so that any bozo keeping a
reference to it will be rejected trying to insert request instead
of trying to tap an uninitalized queue object
What do you think ?
Ben.
>
>
>Then, IDE could do something like:
>
> - set dead flag
> - wait for all pending requests to drain (easy: insert a barrier
> in the queue and wait on it, with a hack for the barrier insertion
> to bypass the dead flag... ugh... maybe a blk_terminate_queue()
> doing all that would be helpful ?)
> - unregister blkdev
> - then tear down the queue (leaving the "empty" queue with the dead
> flag set, not just memset(...,0,...), so that any bozo keeping a
> reference to it will be rejected trying to insert request instead
> of trying to tap an uninitalized queue object
>
>What do you think ?
>
>
The last step is bad - sooner or later the queue will be kfreed, and if
there are bozos around that still have references, they would access
random memory. It must be guaranteed that all references expired before
the tear down begins. Just leaving a dead flag set is not sufficient.
--
Manfred
> The last step is bad - sooner or later the queue will be kfreed, and if
> there are bozos around that still have references, they would access
> random memory. It must be guaranteed that all references expired before
> the tear down begins. Just leaving a dead flag set is not sufficient.
Jens ? I see no refcounting of the queue, but then we enter the
deep guts of the block layer and all this is pretty much obscure
to me. What can be done here ?
Ben.
Benjamin Herrenschmidt wrote:
>>The last step is bad - sooner or later the queue will be kfreed, and if
>>there are bozos around that still have references, they would access
>>random memory. It must be guaranteed that all references expired before
>>the tear down begins. Just leaving a dead flag set is not sufficient.
>>
>>
>Jens ? I see no refcounting of the queue,. . .
>
struct request_queue's kobj field perhaps?
On Sun, Aug 03 2003, Lou Langholtz wrote:
> Benjamin Herrenschmidt wrote:
>
> >>The last step is bad - sooner or later the queue will be kfreed, and if
> >>there are bozos around that still have references, they would access
> >>random memory. It must be guaranteed that all references expired before
> >>the tear down begins. Just leaving a dead flag set is not sufficient.
> >>
> >>
> >Jens ? I see no refcounting of the queue,. . .
> >
> struct request_queue's kobj field perhaps?
Queue still needs dynamically allocated for this to work, and it isn't.
This is one of Al's projects, but he seems to be busy and/or away. It's
pretty straight forward but pretty massive change, wonder if Linus could
be talked into it...
--
Jens Axboe
On 3 Aug 2003, Benjamin Herrenschmidt wrote:
> And there's more to it... ide_unregister() doesn't copy hwif->hold from
> old to new hwif causing my hotswap bay to lose it's iops on next plug,
> it doesn't call unregister_device() for neither hwif->gendev nor
> drive[n]->gendev, causing the device model list to be corrupted after
> an unregister, ...
What is a goal of calling init_hwif_data() in ide_unregister()?
I guess it is used mainly to clear hwif->io_ports and hwif->irq.
Therefore even if you are using hwif->hold flag io_ports will be set to
default values, so how do you later find your hwif?
Hmmm... what about not copying anything and calling init_hwif_data()
only if !hwif->hold?
--
Bartlomiej
On Tue, 2003-08-05 at 02:28, Bartlomiej Zolnierkiewicz wrote:
> On 3 Aug 2003, Benjamin Herrenschmidt wrote:
>
> > And there's more to it... ide_unregister() doesn't copy hwif->hold from
> > old to new hwif causing my hotswap bay to lose it's iops on next plug,
> > it doesn't call unregister_device() for neither hwif->gendev nor
> > drive[n]->gendev, causing the device model list to be corrupted after
> > an unregister, ...
>
> What is a goal of calling init_hwif_data() in ide_unregister()?
> I guess it is used mainly to clear hwif->io_ports and hwif->irq.
> Therefore even if you are using hwif->hold flag io_ports will be set to
> default values, so how do you later find your hwif?
What is the goal ? good question ;) I'd be happy with removing most
of the junk in init_hwif_data, but we need to go a bit further there
for 2.7, maybe we should discuss that one irc one of these days ;)
We probably want to remove the static array of hwifs and change that
into pointers, hwif themselves beeing fully initialized 'offline' by
the host driver, then handed out to the ide layer...
In the meantime, the current code works because init_hwif_data()
calls ide_init_hwif_ports() which is an arch hook, which will fill
the proper io base, so the hwif can still be found. Since the IOps
themselves are saved/restored in ide_unregister, we end up with
proper IO base and proper IOps still there.
In fact, I suspect the only remaining useful thing done by
init_hwif_date() in there is to clear the drive structures.
> Hmmm... what about not copying anything and calling init_hwif_data()
> only if !hwif->hold?
We may probably still want to clear the drive array and maybe a
the present flag, no ?
Also, look at my patch, we also _NEED_ to add some proper
device_unregister calls to ide_unregister() or this function will
leave dangling entries in the device list, and since those have the
same restrictions as the new blk_cleanup_queue(), we really need to
do that without the lock held.
I'd suggest merging my patch for now, it won't make things much
worse than what they are today regarding racyness of IDE registration
and unregistration, we an look into sanitizing this as a 2.7 goal.
Ben.
On 5 Aug 2003, Benjamin Herrenschmidt wrote:
> On Tue, 2003-08-05 at 02:28, Bartlomiej Zolnierkiewicz wrote:
> > On 3 Aug 2003, Benjamin Herrenschmidt wrote:
> >
> > > And there's more to it... ide_unregister() doesn't copy hwif->hold from
> > > old to new hwif causing my hotswap bay to lose it's iops on next plug,
> > > it doesn't call unregister_device() for neither hwif->gendev nor
> > > drive[n]->gendev, causing the device model list to be corrupted after
> > > an unregister, ...
> >
> > What is a goal of calling init_hwif_data() in ide_unregister()?
> > I guess it is used mainly to clear hwif->io_ports and hwif->irq.
> > Therefore even if you are using hwif->hold flag io_ports will be set to
> > default values, so how do you later find your hwif?
>
> What is the goal ? good question ;) I'd be happy with removing most
> of the junk in init_hwif_data, but we need to go a bit further there
> for 2.7, maybe we should discuss that one irc one of these days ;)
> We probably want to remove the static array of hwifs and change that
> into pointers, hwif themselves beeing fully initialized 'offline' by
> the host driver, then handed out to the ide layer...
Yes, plus adding HBA structure.
> In the meantime, the current code works because init_hwif_data()
> calls ide_init_hwif_ports() which is an arch hook, which will fill
> the proper io base, so the hwif can still be found. Since the IOps
It only works with default/legacy io bases.
> themselves are saved/restored in ide_unregister, we end up with
> proper IO base and proper IOps still there.
> In fact, I suspect the only remaining useful thing done by
> init_hwif_date() in there is to clear the drive structures.
>
> > Hmmm... what about not copying anything and calling init_hwif_data()
> > only if !hwif->hold?
>
> We may probably still want to clear the drive array and maybe a
> the present flag, no ?
Oh yes, this is the main goal if ide_unregister() :-).
> Also, look at my patch, we also _NEED_ to add some proper
> device_unregister calls to ide_unregister() or this function will
> leave dangling entries in the device list, and since those have the
> same restrictions as the new blk_cleanup_queue(), we really need to
> do that without the lock held.
Yes.
> I'd suggest merging my patch for now, it won't make things much
> worse than what they are today regarding racyness of IDE registration
> and unregistration, we an look into sanitizing this as a 2.7 goal.
Okay :\.
--
Bartlomiej
> It only works with default/legacy io bases.
Which is all I care about in this specific situation. The "mediabay"
hotswap controller is driven by ide-pmac, which provides it's own
ide_init_hwif_ports() and ide_default_io_base() arch hooks. (In the
PPC arch, those arch functions can be themselves hooked in by the
actual platform code, on pmac, this goes to ide-pmac, though I also
deal with added PCI cards).
> > Also, look at my patch, we also _NEED_ to add some proper
> > device_unregister calls to ide_unregister() or this function will
> > leave dangling entries in the device list, and since those have the
> > same restrictions as the new blk_cleanup_queue(), we really need to
> > do that without the lock held.
>
> Yes.
That reminds me that without this fix, even ide-cs will break things
when ejecting a CF card for example.
> > I'd suggest merging my patch for now, it won't make things much
> > worse than what they are today regarding racyness of IDE registration
> > and unregistration, we an look into sanitizing this as a 2.7 goal.
>
> Okay :\.
Heh, thanks... Well... who else uses ide_unregister except ide-pmac and
ide-cs anyway ? If it's really that little used, it's quite under control
and we can try to clean it up a bit more then during early 2.6.0...
Right now, I'm waiting to see what we can come up to with Jens Axboe
regarding race-free teardown of the blk queue. Once that's agreed, we
can think about just removing the call to init_hwif_data and just reinit
the drive array ourselves. That would be a good thing anyway as the
current ide_unregister allocation of a full hwif_t on stack is causing
other problems (stack bloat typically, an issue with people trying to
shrink the kernel process stack to 1 page).
Ben.
On 5 Aug 2003, Benjamin Herrenschmidt wrote:
> > It only works with default/legacy io bases.
>
> Which is all I care about in this specific situation. The "mediabay"
> hotswap controller is driven by ide-pmac, which provides it's own
> ide_init_hwif_ports() and ide_default_io_base() arch hooks. (In the
> PPC arch, those arch functions can be themselves hooked in by the
> actual platform code, on pmac, this goes to ide-pmac, though I also
> deal with added PCI cards).
>
> > > Also, look at my patch, we also _NEED_ to add some proper
> > > device_unregister calls to ide_unregister() or this function will
> > > leave dangling entries in the device list, and since those have the
> > > same restrictions as the new blk_cleanup_queue(), we really need to
> > > do that without the lock held.
> >
> > Yes.
>
> That reminds me that without this fix, even ide-cs will break things
> when ejecting a CF card for example.
I am aware of that.
> > > I'd suggest merging my patch for now, it won't make things much
> > > worse than what they are today regarding racyness of IDE registration
> > > and unregistration, we an look into sanitizing this as a 2.7 goal.
> >
> > Okay :\.
>
> Heh, thanks... Well... who else uses ide_unregister except ide-pmac and
> ide-cs anyway ? If it's really that little used, it's quite under control
> and we can try to clean it up a bit more then during early 2.6.0...
HDIO_UNREGISTER_HWIF ioctl. Together with HDIO_SCAN_HWIF ioctl it
provides dirty hotswap functionality. Thats why I mentioned about problem
with ide_unregister() and non default io bases.
> Right now, I'm waiting to see what we can come up to with Jens Axboe
> regarding race-free teardown of the blk queue. Once that's agreed, we
> can think about just removing the call to init_hwif_data and just reinit
> the drive array ourselves. That would be a good thing anyway as the
> current ide_unregister allocation of a full hwif_t on stack is causing
> other problems (stack bloat typically, an issue with people trying to
> shrink the kernel process stack to 1 page).
Decreasing stack usage is an additional motivation to simplify
ide_unregister(). We really need to remove ide_unregister() from J?rn's
top stack (l)users list ;-).
Thanks,
--
Bartlomiej
On Maw, 2003-08-05 at 13:30, Bartlomiej Zolnierkiewicz wrote:
> HDIO_UNREGISTER_HWIF ioctl. Together with HDIO_SCAN_HWIF ioctl it
> provides dirty hotswap functionality. Thats why I mentioned about problem
> with ide_unregister() and non default io bases.
I'm currently testing busstate ioctl based rescanning in 2.4 (long story involving
a laptop). I've made the busstate off code clean up the device - so the hwif becomes
device->present=0 for all devices and busstate on rescans the bus and updates the
device objects.
Its a hack in many ways to keep the base objects present so that the brown and sticky
issues of blk queues in 2.4 never meet the fan of object destruction.