I noticed that currently calling del_gendisk leads to a guaranteed
deadlock if attempted from .suspend or .resume functions.
Something like this:
[<ffffffff8106620a>] ? prepare_to_wait+0x2a/0x90
[<ffffffff810790bd>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8140db12>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8112a39e>] bdi_sched_wait+0xe/0x20
[<ffffffff8140af6f>] __wait_on_bit+0x5f/0x90
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8140b018>] out_of_line_wait_on_bit+0x78/0x90
[<ffffffff81065fd0>] ? wake_bit_function+0x0/0x40
[<ffffffff8112a2d3>] ? bdi_queue_work+0xa3/0xe0
[<ffffffff8112a37f>] bdi_sync_writeback+0x6f/0x80
[<ffffffff8112a3d2>] sync_inodes_sb+0x22/0x120
[<ffffffff8112f1d2>] __sync_filesystem+0x82/0x90
[<ffffffff8112f3db>] sync_filesystem+0x4b/0x70
[<ffffffff811391de>] fsync_bdev+0x2e/0x60
[<ffffffff812226be>] invalidate_partition+0x2e/0x50
[<ffffffff8116b92f>] del_gendisk+0x3f/0x140
[<ffffffffa00c0233>] mmc_blk_remove+0x33/0x60 [mmc_block]
[<ffffffff81338977>] mmc_bus_remove+0x17/0x20
[<ffffffff812ce746>] __device_release_driver+0x66/0xc0
[<ffffffff812ce89d>] device_release_driver+0x2d/0x40
[<ffffffff812cd9b5>] bus_remove_device+0xb5/0x120
[<ffffffff812cb46f>] device_del+0x12f/0x1a0
[<ffffffff81338a5b>] mmc_remove_card+0x5b/0x90
[<ffffffff8133ac27>] mmc_sd_remove+0x27/0x50
[<ffffffff81337d8c>] mmc_resume_host+0x10c/0x140
[<ffffffffa00850e9>] sdhci_resume_host+0x69/0xa0 [sdhci]
[<ffffffffa0bdc39e>] sdhci_pci_resume+0x8e/0xb0 [sdhci_pci]
bdi_queue_work seems to be the problem.
Some device drivers need to logically remove their cards in .suspend,
because the card is removable and can be changed while the system is
suspended.
Best regards,
Maxim Levitsky
On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> I noticed that currently calling del_gendisk leads to a guaranteed
> deadlock if attempted from .suspend or .resume functions.
>
> Something like this:
>
> [stack trace snipped]
>
> bdi_queue_work seems to be the problem.
>
> Some device drivers need to logically remove their cards in .suspend,
> because the card is removable and can be changed while the system is
> suspended.
>
> Best regards,
> Maxim Levitsky
>
Any update?
Best regards,
Maxim Levitsky
On Monday 15 February 2010, Maxim Levitsky wrote:
> On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > I noticed that currently calling del_gendisk leads to a guaranteed
> > deadlock if attempted from .suspend or .resume functions.
Well, it shouldn't be called from there, then.
> > Something like this:
> >
> > [stack trace snipped]
> >
> > bdi_queue_work seems to be the problem.
> >
> > Some device drivers need to logically remove their cards in .suspend,
> > because the card is removable and can be changed while the system is
> > suspended.
I don't know how to resolve this right now.
Rafael
On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
> On Monday 15 February 2010, Maxim Levitsky wrote:
> > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > I noticed that currently calling del_gendisk leads to a guaranteed
> > > deadlock if attempted from .suspend or .resume functions.
>
> Well, it shouldn't be called from there, then.
Even if drivers avoid calling it from within suspend methods, they have
to be able to call it from within resume methods. After all, the
resume method may find that the disk's device has vanished.
> > > Something like this:
> > >
> > > [stack trace snipped]
> > >
> > > bdi_queue_work seems to be the problem.
> > >
> > > Some device drivers need to logically remove their cards in .suspend,
> > > because the card is removable and can be changed while the system is
> > > suspended.
>
> I don't know how to resolve this right now.
This is a matter for Jens. Is the bdi writeback task freezable? If it
is, should it be made unfreezable?
Alan Stern
On Tue, 2010-02-16 at 11:27 -0500, Alan Stern wrote:
> On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
>
> > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > I noticed that currently calling del_gendisk leads to a guaranteed
> > > > deadlock if attempted from .suspend or .resume functions.
> >
> > Well, it shouldn't be called from there, then.
>
> Even if drivers avoid calling it from within suspend methods, they have
> to be able to call it from within resume methods. After all, the
> resume method may find that the disk's device has vanished.
>
> > > > Something like this:
> > > >
> > > > [stack trace snipped]
> > > >
> > > > bdi_queue_work seems to be the problem.
> > > >
> > > > Some device drivers need to logically remove their cards in .suspend,
> > > > because the card is removable and can be changed while the system is
> > > > suspended.
> >
> > I don't know how to resolve this right now.
>
> This is a matter for Jens. Is the bdi writeback task freezable? If it
> is, should it be made unfreezable?
Any update?
Best regards,
Maxim Levitsky
On Tue, Feb 16 2010, Alan Stern wrote:
> On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
>
> > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > I noticed that currently calling del_gendisk leads to a guaranteed
> > > > deadlock if attempted from .suspend or .resume functions.
> >
> > Well, it shouldn't be called from there, then.
>
> Even if drivers avoid calling it from within suspend methods, they have
> to be able to call it from within resume methods. After all, the
> resume method may find that the disk's device has vanished.
del_gendisk() needs process context at least, since it'll sleep (not
just for sync/invalidate, but other parts of the destruction as well).
> > > > Something like this:
> > > >
> > > > [stack trace snipped]
> > > >
> > > > bdi_queue_work seems to be the problem.
> > > >
> > > > Some device drivers need to logically remove their cards in .suspend,
> > > > because the card is removable and can be changed while the system is
> > > > suspended.
> >
> > I don't know how to resolve this right now.
>
> This is a matter for Jens. Is the bdi writeback task freezable? If it
> is, should it be made unfreezable?
I'm not a big expert on what tasks should be freezable or not. As it
stands, the writeback tasks will attempt to freeze and thaw with the
system. I guess that screws the sync from resume call, since it's not
running and the sync will wait for it to retrieve and finish that work
item.
To the suspend experts - can we safely mark the writeback tasks as
non-freezable?
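For reference, a kernel thread only gets frozen if it opts in; the
usual pattern looks roughly like this (a minimal sketch, not the
actual bdi code):

#include <linux/kthread.h>
#include <linux/freezer.h>

static int writeback_thread_sketch(void *data)
{
	set_freezable();		/* opt in to the system freezer */

	while (!kthread_should_stop()) {
		try_to_freeze();	/* parks here across suspend/resume */
		/* ... pick up a queued bdi work item and write back
		 * dirty inodes; a sync waiting on such an item blocks
		 * for as long as the task is parked above ... */
	}
	return 0;
}

Making the tasks non-freezable would essentially mean dropping the
set_freezable() call.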
--
Jens Axboe
On Tue, 23 Feb 2010, Jens Axboe wrote:
> On Tue, Feb 16 2010, Alan Stern wrote:
> > On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
> >
> > > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > > I noticed that currently calling del_gendisk leads to a guaranteed
> > > > > deadlock if attempted from .suspend or .resume functions.
> > >
> > > Well, it shouldn't be called from there, then.
> >
> > Even if drivers avoid calling it from within suspend methods, they have
> > to be able to call it from within resume methods. After all, the
> > resume method may find that the disk's device has vanished.
>
> del_gendisk() needs process context at least, since it'll sleep (not
> just for sync/invalidate, but other parts of the destruction as well).
That's not a problem; suspend and resume run in process context.
> > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > is, should it be made unfreezable?
>
> I'm not a big expert on what tasks should be freezable or not. As it
> stands, the writeback tasks will attempt to freeze and thaw with the
> system. I guess that screws the sync from resume call, since it's not
> running and the sync will wait for it to retrieve and finish that work
> item.
>
> To the suspend experts - can we safely mark the writeback tasks as
> non-freezable?
The reason for freezing those tasks is to avoid writebacks at random
times during a system sleep transition, when the underlying device may
already be suspended, right?
In principle, a device's writeback task could be unfrozen immediately
after the device is resumed. In practice this might not solve the
problem, since the del_gendisk() call occurs _within_ the device's
resume routine. I suppose del_gendisk() could be made responsible for
unfreezing the writeback task.
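Something like this, perhaps (an untested sketch, assuming the per-bdi
task pointer and the frozen()/thaw_process() helpers from
<linux/freezer.h>):

#include <linux/backing-dev.h>
#include <linux/freezer.h>

/* Hypothetical helper for del_gendisk(): thaw the disk's writeback
 * task so the sync it triggers can make progress during resume. */
static void thaw_bdi_writeback(struct backing_dev_info *bdi)
{
	if (bdi->wb.task && frozen(bdi->wb.task))
		thaw_process(bdi->wb.task);
}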
The best solution would be to have del_gendisk() avoid waiting for the
writeback task in cases where the underlying device has been removed.
I don't know if that is feasible, however.
Alan Stern
P.S.: Jens, given a pointer to a struct gendisk or to a struct
request_queue, is there a good way to tell whether there are any dirty
buffers for that device waiting to be written out? This is for
purposes of runtime power management -- in the initial implementation,
I want to avoid powering down a block device if it is open or has any
dirty buffers. In other words, only completely idle devices should be
powered down (a good example would be a card reader with no memory card
inserted).
On Tue, Feb 23 2010, Alan Stern wrote:
> > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > is, should it be made unfreezable?
> >
> > I'm not a big expert on what tasks should be freezable or not. As it
> > stands, the writeback tasks will attempt to freeze and thaw with the
> > system. I guess that screws the sync from resume call, since it's not
> > running and the sync will wait for it to retrieve and finish that work
> > item.
> >
> > To the suspend experts - can we safely mark the writeback tasks as
> > non-freezable?
>
> The reason for freezing those tasks is to avoid writebacks at random
> times during a system sleep transition, when the underlying device may
> already be suspended, right?
Right, or at least it would seem pointless to have them running while
the device is suspended. But my point was that if it's feasible to
just leave them running, perhaps that would be easier.
> In principle, a device's writeback task could be unfrozen immediately
> after the device is resumed. In practice this might not solve the
> problem, since the del_gendisk() call occurs _within_ the device's
> resume routine. I suppose del_gendisk() could be made responsible for
> unfreezing the writeback task.
And that's back to the question of whether or not that is a nice thing to
do. It seems a bit dirty, but otoh where else to do it. Perhaps just
using the kblockd to postpone the del_gendisk() to out-of-resume context
would be the best approach.
> The best solution would be to have del_gendisk() avoid waiting for the
> writeback task in cases where the underlying device has been removed.
> I don't know if that is feasible, however.
kblockd?
> P.S.: Jens, given a pointer to a struct gendisk or to a struct
> request_queue, is there a good way to tell whether there are any dirty
> buffers for that device waiting to be written out? This is for
> purposes of runtime power management -- in the initial implementation,
> I want to avoid powering down a block device if it is open or has any
> dirty buffers. In other words, only completely idle devices should be
> powered down (a good example would be a card reader with no memory card
> inserted).
There's no foolproof way. For most file systems I think you could get
away with checking the q->bdi dirty lists to see if there's anything
pending. But that won't always work if the fs uses a different backing
dev info than the queue itself.
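Something like this, roughly (a sketch against the per-bdi writeback
lists; it's racy without the proper locking, so treat it as a hint
only):

#include <linux/blkdev.h>
#include <linux/backing-dev.h>

/* Advisory check: is anything queued for writeback on the queue's
 * own bdi? Misses filesystems that use a different backing_dev_info. */
static bool queue_bdi_has_dirty_io(struct request_queue *q)
{
	struct bdi_writeback *wb = &q->backing_dev_info.wb;

	return !list_empty(&wb->b_dirty) ||
	       !list_empty(&wb->b_io) ||
	       !list_empty(&wb->b_more_io);
}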
--
Jens Axboe
On Tue, 23 Feb 2010, Jens Axboe wrote:
> On Tue, Feb 23 2010, Alan Stern wrote:
> > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > is, should it be made unfreezable?
> > >
> > > I'm not a big expert on what tasks should be freezable or not. As it
> > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > system. I guess that screws the sync from resume call, since it's not
> > > running and the sync will wait for it to retrieve and finish that work
> > > item.
> > >
> > > To the suspend experts - can we safely mark the writeback tasks as
> > > non-freezable?
> >
> > The reason for freezing those tasks is to avoid writebacks at random
> > times during a system sleep transition, when the underlying device may
> > already be suspended, right?
>
> Right, or at least it would seem pointless to have them running while
> the device is suspended. But my point was that if it's feasible to
> just leave them running, perhaps that would be easier.
I don't have a clear picture of how the block layer operates. For
example, what is the reason for this comment in the definition of
struct gendisk?
struct device *driverfs_dev; // FIXME: remove
Isn't that crucial for making a disk show up in sysfs? Is the comment
out of date?
A possible approach is to add suspend and resume methods for this
driverfs_dev, and make them responsible for stopping and restarting
the writeback task instead of relying on the freezer. Then
del_gendisk() could cleanly restart the task when necessary.
> > In principle, a device's writeback task could be unfrozen immediately
> > after the device is resumed. In practice this might not solve the
> > problem, since the del_gendisk() call occurs _within_ the device's
> > resume routine. I suppose del_gendisk() could be made responsible for
> > unfreezing the writeback task.
>
> And that's back to the question of whether or not that is a nice thing to
> do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> using the kblockd to postpone the del_gendisk() to out-of-resume context
> would be the best approach.
That would involve a layering violation, wouldn't it? Either the
driver would have to interface with kblockd directly, or else
del_gendisk() would need to know whether the writeback task was frozen.
On the whole, I think it's best for the block layer to retain full
control over its own tasks and requirements.
Alan Stern
On Tue, 23 Feb 2010, Jens Axboe wrote:
> > P.S.: Jens, given a pointer to a struct gendisk or to a struct
> > request_queue, is there a good way to tell whether there are any dirty
> > buffers for that device waiting to be written out? This is for
> > purposes of runtime power management -- in the initial implementation,
> > I want to avoid powering down a block device if it is open or has any
> > dirty buffers. In other words, only completely idle devices should be
> > powered down (a good example would be a card reader with no memory card
> > inserted).
>
> There's no foolproof way. For most file systems I think you could get
> away with checking the q->bdi dirty lists to see if there's anything
> pending. But that won't always work if the fs uses a different backing
> dev info than the queue itself.
That's not what I meant. Dirty buffers on a filesystem make no
difference because they always get written out when the filesystem is
unmounted. The device file remains open as long as the filesystem
is mounted, which would prevent the device from being powered down.
I was asking about dirty buffers on a block device that isn't holding a
filesystem -- where the raw device is being used directly for I/O.
Alan Stern
On Tue, Feb 23 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > P.S.: Jens, given a pointer to a struct gendisk or to a struct
> > > request_queue, is there a good way to tell whether there are any dirty
> > > buffers for that device waiting to be written out? This is for
> > > purposes of runtime power management -- in the initial implementation,
> > > I want to avoid powering down a block device if it is open or has any
> > > dirty buffers. In other words, only completely idle devices should be
> > > powered down (a good example would be a card reader with no memory card
> > > inserted).
> >
> > There's no foolproof way. For most file systems I think you could get
> > away with checking the q->bdi dirty lists to see if there's anything
> > pending. But that won't always work if the fs uses a different backing
> > dev info than the queue itself.
>
> That's not what I meant. Dirty buffers on a filesystem make no
> difference because they always get written out when the filesystem is
> unmounted. The device file remains open as long as the filesystem
> is mounted, which would prevent the device from being powered down.
>
> I was asking about dirty buffers on a block device that isn't holding a
> filesystem -- where the raw device is being used directly for I/O.
OK, so just specifically the page cache of the device. Is that really
enough of an issue to warrant special checking? I mean, what normal
setup would even use buffered raw device access?
But if you wanted, I guess the only way would be to look up
dirty/writeback pages on the bdev inode mapping. For that you'd need the
bdev, not the gendisk or the queue though.
--
Jens Axboe
On Tue, Feb 23 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > On Tue, Feb 23 2010, Alan Stern wrote:
> > > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > > is, should it be made unfreezable?
> > > >
> > > > I'm not a big expert on what tasks should be freezable or not. As it
> > > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > > system. I guess that screws the sync from resume call, since it's not
> > > > running and the sync will wait for it to retrieve and finish that work
> > > > item.
> > > >
> > > > To the suspend experts - can we safely mark the writeback tasks as
> > > > non-freezable?
> > >
> > > The reason for freezing those tasks is to avoid writebacks at random
> > > times during a system sleep transition, when the underlying device may
> > > already be suspended, right?
> >
> > Right, or at least it would seem pointless to have them running while
> > the device is suspended. But my point was that if it's feasible to
> > just leave them running, perhaps that would be easier.
>
> I don't have a clear picture of how the block layer operates. For
> example, what is the reason for this comment in the definition of
> struct gendisk?
>
> struct device *driverfs_dev; // FIXME: remove
>
> Isn't that crucial for making a disk show up in sysfs? Is the comment
> out of date?
Don't ask me; I'd suggest using git blame to find out who wrote that
and pinging them.
> A possible approach is to add suspend and resume methods for this
> driverfs_dev, and make them responsible for stopping and restarting
> the writeback task instead of relying on the freezer. Then
> del_gendisk() could cleanly restart the task when necessary.
That sounds over-engineered to me.
> > > In principle, a device's writeback task could be unfrozen immediately
> > > after the device is resumed. In practice this might not solve the
> > > problem, since the del_gendisk() call occurs _within_ the device's
> > > resume routine. I suppose del_gendisk() could be made responsible for
> > > unfreezing the writeback task.
> >
> > And that's back to the question of whether or not that is a nice thing to
> > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > would be the best approach.
>
> That would involve a layering violation, wouldn't it? Either the
> driver would have to interface with kblockd directly, or else
> del_gendisk() would need to know whether the writeback task was frozen.
>
> On the whole, I think it's best for the block layer to retain full
> control over its own tasks and requirements.
You would export such functionality - del_gendisk_deferred(), or
something like that. The kblockd suggestion was an implementation
detail, not something the driver would concern itself with. It's not
exactly picture perfect, but it could be used from e.g. resume context
where the device isn't fully live yet.
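As a rough sketch (del_gendisk_deferred() is only a proposed name, not
an existing API, and this assumes the deferred work is never waited on
from the resume path itself):

#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_del {
	struct work_struct work;
	struct gendisk *disk;
};

static void del_gendisk_work(struct work_struct *work)
{
	struct deferred_del *dd =
		container_of(work, struct deferred_del, work);

	del_gendisk(dd->disk);	/* now runs outside the resume path */
	kfree(dd);
}

int del_gendisk_deferred(struct gendisk *disk)
{
	/* GFP_NOIO: may be called while the device can't do I/O */
	struct deferred_del *dd = kzalloc(sizeof(*dd), GFP_NOIO);

	if (!dd)
		return -ENOMEM;
	dd->disk = disk;
	INIT_WORK(&dd->work, del_gendisk_work);
	return kblockd_schedule_work(disk->queue, &dd->work);
}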
--
Jens Axboe
On Tue, 23 Feb 2010, Jens Axboe wrote:
> > That's not what I meant. Dirty buffers on a filesystem make no
> > difference because they always get written out when the filesystem is
> > unmounted. The device file remains open as long as the filesystem
> > is mounted, which would prevent the device from being powered down.
> >
> > I was asking about dirty buffers on a block device that isn't holding a
> > filesystem -- where the raw device is being used directly for I/O.
>
> OK, so just specifically the page cache of the device. Is that really
> enough of an issue to warrant special checking? I mean, what normal
> setup would even use buffered raw device access?
Doesn't fdisk use it? There might be other applications too.
> But if you wanted, I guess the only way would be to look up
> dirty/writeback pages on the bdev inode mapping. For that you'd need the
> bdev, not the gendisk or the queue though.
I can get the bdev from the gendisk by calling bdget_disk() with a
partition number of 0, right? What would the next step be? Would this
check for dirty pages associated with any of the partitions or would it
only look at pages associated with the inode for the entire disk?
Alan Stern
On Tue, 23 Feb 2010, Jens Axboe wrote:
> > > And that's back to the question of whether or not that is a nice thing to
> > > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > > would be the best approach.
> >
> > That would involve a layering violation, wouldn't it? Either the
> > driver would have to interface with kblockd directly, or else
> > del_gendisk() would need to know whether the writeback task was frozen.
> >
> > On the whole, I think it's best for the block layer to retain full
> > control over its own tasks and requirements.
>
> You would export such functionality - del_gendisk_deferred(), or
> something like that. The kblockd suggestion was an implementation
> detail, not something the driver would concern itself with. It's not
> exactly picture perfect, but it could be used from e.g. resume context
> where the device isn't fully live yet.
Hmm. There's still no way for the driver to know whether or not the
writeback task is frozen when it wants to call del_gendisk(). It
would have to defer _all_ such calls. And all hot-pluggable block
drivers would have to do this -- would that be acceptable?
How about plugging the request queue instead of freezing the writeback
task? Would that work? It should be easy enough for a driver to
unplug the queue before unregistering its device from within a resume
method.
Alan Stern
On Wed, Feb 24 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > That's not what I meant. Dirty buffers on a filesystem make no
> > > difference because they always get written out when the filesystem is
> > > unmounted. The device file remains open as long as the filesystem
> > > is mounted, which would prevent the device from being powered down.
> > >
> > > I was asking about dirty buffers on a block device that isn't holding a
> > > filesystem -- where the raw device is being used directly for I/O.
> >
> > OK, so just specifically the page cache of the device. Is that really
> > enough of an issue to warrant special checking? I mean, what normal
> > setup would even use buffered raw device access?
>
> Doesn't fdisk use it? There might be other applications too.
It does, but that should be a very short-lived issue (since the dirty
buffers will get flushed).
> > But if you wanted, I guess the only way would be to look up
> > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > bdev, not the gendisk or the queue though.
>
> I can get the bdev from the gendisk by calling bdget_disk() with a
> partition number of 0, right? What would the next step be? Would this
> check for dirty pages associated with any of the partitions or would it
> only look at pages associated with the inode for the entire disk?
It would cover the entire bdev.
--
Jens Axboe
On Wed, Feb 24 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > > And that's back to the question of whether or not that is a nice thing to
> > > > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > > > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > > > would be the best approach.
> > >
> > > That would involve a layering violation, wouldn't it? Either the
> > > driver would have to interface with kblockd directly, or else
> > > del_gendisk() would need to know whether the writeback task was frozen.
> > >
> > > On the whole, I think it's best for the block layer to retain full
> > > control over its own tasks and requirements.
> >
> > You would export such functionality - del_gendisk_deferred(), or
> > something like that. The kblockd suggestion was an implementation
> > detail, not something the driver would concern itself with. It's not
> > exactly picture perfect, but it could be used from e.g. resume context
> > where the device isn't fully live yet.
>
> Hmm. There's still no way for the driver to know whether or not the
> writeback task is frozen when it wants to call del_gendisk(). It
> would have to defer _all_ such calls. And all hot-pluggable block
> drivers would have to do this -- would that be acceptable?
I was assuming it knew it was being called from a critical location,
like from resume. I guess the callback just iterates the bus devices
and calls each device's remove method, so that doesn't quite work
without other changes.
> How about plugging the request queue instead of freezing the writeback
> task? Would that work? It should be easy enough for a driver to
> unplug the queue before unregistering its device from within a resume
> method.
We have specific methods for freezing, stopping, or starting the
queue; perhaps those would be appropriate for suspend/resume actions.
Stopping the queue effectively prevents the queueing function from
being called. If there are dirty pages for the device, though, it
would not help, as you would still get stuck waiting for that IO to
complete.
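The stop/start methods here would be blk_stop_queue() and
blk_start_queue(), both called with the queue lock held; a sketch of
suspend/resume usage:

#include <linux/blkdev.h>

static void pm_quiesce_queue(struct request_queue *q)
{
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	blk_stop_queue(q);	/* request_fn won't be called anymore */
	spin_unlock_irqrestore(q->queue_lock, flags);
}

static void pm_restart_queue(struct request_queue *q)
{
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	blk_start_queue(q);	/* pending requests start flowing again */
	spin_unlock_irqrestore(q->queue_lock, flags);
}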
--
Jens Axboe
On Wed, 24 Feb 2010, Jens Axboe wrote:
> > > But if you wanted, I guess the only way would be to look up
> > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > bdev, not the gendisk or the queue though.
> >
> > I can get the bdev from the gendisk by calling bdget_disk() with a
> > partition number of 0, right? What would the next step be? Would this
> > check for dirty pages associated with any of the partitions or would it
> > only look at pages associated with the inode for the entire disk?
>
> It would cover the entire bdev.
Okay, so once I've got the bdev, how do I look up the dirty/writeback
pages on the inode mapping?
Alan Stern
On Wed, 24 Feb 2010, Jens Axboe wrote:
> > How about plugging the request queue instead of freezing the writeback
> > task? Would that work? It should be easy enough for a driver to
> > unplug the queue before unregistering its device from within a resume
> > method.
>
> We have specific methods for freezing, stopping, or starting the
> queue; perhaps those would be appropriate for suspend/resume actions.
> Stopping the queue effectively prevents the queueing function from
> being called. If there are dirty pages for the device, though, it
> would not help, as you would still get stuck waiting for that IO to
> complete.
If the resume method were to restart the queue before unregistering
the device, pending dirty pages wouldn't cause any problems. They'd
get sent down to the driver and rejected immediately because the
device was dead or gone.
The difficulty with this approach is that it requires individual
attention for each block device driver. Either the driver has to
freeze/stop/plug the queue during suspend (and restart it during
resume) or else the device's writeback task has to be frozen.
Can this be encapsulated by a function in the block layer? For
example, drivers could call blk_set_hot_unpluggable(bdev) for devices
that might need to be unregistered during resume. Then they would
become responsible for managing the device's queue.
Alan Stern
On Wed, Feb 24 2010, Alan Stern wrote:
> On Wed, 24 Feb 2010, Jens Axboe wrote:
>
> > > > But if you wanted, I guess the only way would be to look up
> > > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > > bdev, not the gendisk or the queue though.
> > >
> > > I can get the bdev from the gendisk by calling bdget_disk() with a
> > > partition number of 0, right? What would the next step be? Would this
> > > check for dirty pages associated with any of the partitions or would it
> > > only look at pages associated with the inode for the entire disk?
> >
> > It would cover the entire bdev.
>
> Okay, so once I've got the bdev, how do I look up the dirty/writeback
> pages on the inode mapping?
I _think_ you can get away with not doing a radix lookup for dirty
pages, just looking at the BDI_RECLAIMABLE stat on the bdi. That would
be:
bdi_stat(bdev->bd_inode->i_mapping->backing_dev_info, BDI_RECLAIMABLE);
--
Jens Axboe
On Thu, Feb 25, 2010 at 09:20:35AM +0100, Jens Axboe wrote:
> On Wed, Feb 24 2010, Alan Stern wrote:
> > On Wed, 24 Feb 2010, Jens Axboe wrote:
> >
> > > > > But if you wanted, I guess the only way would be to look up
> > > > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > > > bdev, not the gendisk or the queue though.
> > > >
> > > > I can get the bdev from the gendisk by calling bdget_disk() with a
> > > > partition number of 0, right? What would the next step be? Would this
> > > > check for dirty pages associated with any of the partitions or would it
> > > > only look at pages associated with the inode for the entire disk?
> > >
> > > It would cover the entire bdev.
> >
> > Okay, so once I've got the bdev, how do I look up the dirty/writeback
> > pages on the inode mapping?
>
> I _think_ you can get away with not doing a radix lookup for dirty
> pages, just looking at the BDI_RECLAIMABLE stat on the bdi. That would
> be:
>
> bdi_stat(bdev->bd_inode->i_mapping->backing_dev_info, BDI_RECLAIMABLE);
mapping_tagged(bdev->bd_inode->i_mapping, PAGECACHE_TAG_DIRTY);
is about as low-overhead as it gets, since the radix tree propagates
tags back up to the root, i.e. no page lookups are needed at all to
determine if it is dirty.
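So an idle check given only a gendisk could look something like this
(an untested sketch; bdget_disk(disk, 0) returns the whole-disk bdev):

#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/pagemap.h>

/* Cheap idle test: only reads the radix tree root tags. */
static bool disk_has_dirty_pages(struct gendisk *disk)
{
	struct block_device *bdev = bdget_disk(disk, 0);
	bool busy;

	if (!bdev)
		return false;
	busy = mapping_tagged(bdev->bd_inode->i_mapping,
			      PAGECACHE_TAG_DIRTY) ||
	       mapping_tagged(bdev->bd_inode->i_mapping,
			      PAGECACHE_TAG_WRITEBACK);
	bdput(bdev);
	return busy;
}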
Cheers,
Dave.
--
Dave Chinner
[email protected]