It seems that a process blocked in a write to an xfs filesystem due to
xfs_freeze cannot be frozen by the freezer.
I see this if I suspend my laptop while doing something xfs-filesystem
intensive, like a kernel build. My suspend scripts freeze the XFS
filesystem (as Dave said I should), which presumably blocks some writer,
and then the freezer times out and fails to complete.
Here's part of the process dump the freezer does when it times out:
cc1 D 00000000 0 18138 18137
dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
Call Trace:
[<c0228d97>] xfs_write+0xf4/0x6d9
[<c0226038>] xfs_file_aio_write+0x53/0x5b
[<c0171c15>] do_sync_write+0xae/0xec
[<c0172343>] vfs_write+0xa4/0x120
[<c01728d7>] sys_write+0x3b/0x60
[<c0106fae>] sysenter_past_esp+0x6b/0xa1
=======================
I haven't looked at how to fix this yet. I only just worked out why I
was getting suspend failures.
J
On Thursday, 22 of November 2007, Jeremy Fitzhardinge wrote:
> It seems that a process blocked in a write to an xfs filesystem due to
> xfs_freeze cannot be frozen by the freezer.
The freezer doesn't handle tasks in TASK_UNINTERRUPTIBLE and I don't know how
to make it handle them without at least partially defeating its purpose.
> I see this if I suspend my laptop while doing something xfs-filesystem
> intensive, like a kernel build. My suspend scripts freeze the XFS
> filesystem (as Dave said I should), which presumably blocks some writer,
> and then the freezer times out and fails to complete.
>
> Here's part of the process dump the freezer does when it times out:
>
> cc1 D 00000000 0 18138 18137
> dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
> 009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
> c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
> Call Trace:
> [<c0228d97>] xfs_write+0xf4/0x6d9
> [<c0226038>] xfs_file_aio_write+0x53/0x5b
> [<c0171c15>] do_sync_write+0xae/0xec
> [<c0172343>] vfs_write+0xa4/0x120
> [<c01728d7>] sys_write+0x3b/0x60
> [<c0106fae>] sysenter_past_esp+0x6b/0xa1
> =======================
>
>
> I haven't looked at how to fix this yet. I only just worked out why I
> was getting suspend failures.
Well, you can add freezer_do_not_count()/freezer_count() annotations to
xfs_write() (and whatever else is blocked as a result of the XFS being frozen).
Generally, that would be risky without the freezing of XFS, however, because
filesystem data might then leak out to the storage device after the hibernation
image has been created, which would result in filesystem corruption after the
resume.
Still, if you only suspend to RAM, that should be safe.
Greetings,
Rafael
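For reference, freezer_do_not_count() and freezer_count() are small helpers in
include/linux/freezer.h; roughly what they look like in kernels of that era
(abridged):

/* include/linux/freezer.h, circa 2.6.24 (abridged) */
static inline void freezer_do_not_count(void)
{
        current->flags |= PF_FREEZER_SKIP;      /* freezer skips this task */
}

static inline void freezer_count(void)
{
        current->flags &= ~PF_FREEZER_SKIP;
        try_to_freeze();        /* freeze right away if a freeze is pending */
}

The idea is to bracket a potentially long uninterruptible sleep with the two
calls: the freezer skips the task while it sleeps, and the task freezes itself
as soon as it wakes up.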
Rafael J. Wysocki wrote:
> On Thursday, 22 of November 2007, Jeremy Fitzhardinge wrote:
>
>> It seems that a process blocked in a write to an xfs filesystem due to
>> xfs_freeze cannot be frozen by the freezer.
>>
>
> The freezer doesn't handle tasks in TASK_UNINTERRUPTIBLE and I don't know how
> to make it handle them without at least partially defeating its purpose.
>
Well, I guess the question is whether an xfs-frozen writer really needs
to be UNINTERRUPTIBLE from the freezer's perspective (clearly it does
from usermode's perspective - filesystem writes just don't return EINTR).
From a quick poke around, it looks to me like freezing is actually
implemented in the VFS layer rather than in XFS itself: is that right?
Could vfs_check_frozen() be changed to something that is freezer-compatible?
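For context, the wait Jeremy is presumably hitting is indeed at the VFS level;
in kernels of that era vfs_check_frozen() is a plain uninterruptible wait in
include/linux/fs.h:

#define vfs_check_frozen(sb, level) \
        wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

One conceivable freezer-aware variant, simply bracketing that wait with the
helpers Rafael mentioned (a sketch only, not something that exists in the
tree):

#define vfs_check_frozen_freezable(sb, level)                           \
do {                                                                    \
        freezer_do_not_count();                                         \
        wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)));  \
        freezer_count();                                                \
} while (0)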
>> I see this if I suspend my laptop while doing something xfs-filesystem
>> intensive, like a kernel build. My suspend scripts freeze the XFS
>> filesystem (as Dave said I should), which presumably blocks some writer,
>> and then the freezer times out and fails to complete.
>>
>> Here's part of the process dump the freezer does when it times out:
>>
>> cc1 D 00000000 0 18138 18137
>> dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
>> 009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
>> c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
>> Call Trace:
>> [<c0228d97>] xfs_write+0xf4/0x6d9
>> [<c0226038>] xfs_file_aio_write+0x53/0x5b
>> [<c0171c15>] do_sync_write+0xae/0xec
>> [<c0172343>] vfs_write+0xa4/0x120
>> [<c01728d7>] sys_write+0x3b/0x60
>> [<c0106fae>] sysenter_past_esp+0x6b/0xa1
>> =======================
>>
>>
>> I haven't looked at how to fix this yet. I only just worked out why I
>> was getting suspend failures.
>>
>
> Well, you can add freezer_do_not_count()/freezer_count() annotations to
> xfs_write() (and whatever else is blocked as a result of the XFS being frozen).
>
What would be the implications of that? Would that just prevent
freezing while there's something blocked there?
> Generally, that would be risky without the freezing of XFS, however, because
> filesystem data might then leak out to the storage device after the hibernation
> image has been created, which would result in filesystem corruption after the
> resume.
>
> Still, if you only suspend to RAM, that should be safe.
>
I specifically added it because I was getting data loss due to crashes
during suspend/resume problems. It's been pretty stable lately, but I
may as well remove the xfs_freeze from my suspend scripts if this is the
solution.
I think the broader issue is that there's no reason in principle why
something blocked due to xfs-freezing (or vfs freezing) should prevent
the freezer from completing.
J
On Monday, 26 of November 2007, Jeremy Fitzhardinge wrote:
> Rafael J. Wysocki wrote:
> > On Thursday, 22 of November 2007, Jeremy Fitzhardinge wrote:
> >
> >> It seems that a process blocked in a write to an xfs filesystem due to
> >> xfs_freeze cannot be frozen by the freezer.
> >>
> >
> > The freezer doesn't handle tasks in TASK_UNINTERRUPTIBLE and I don't know how
> > to make it handle them without at least partially defeating its purpose.
> >
>
> Well, I guess the question is whether an xfs-frozen writer really needs
> to be UNINTERRUPTIBLE from the freezer's perspective (clearly it does
> from usermode's perspective - filesystem writes just don't return EINTR).
>
> From a quick poke around, it looks to me like freezing is actually
> implemented in the VFS layer rather than in XFS itself: is that right?
I don't know the details.
> Could vfs_check_frozen() be changed to something that is freezer-compatible?
That seems doable in principle. I'll have a closer look at it.
> >> I see this if I suspend my laptop while doing something xfs-filesystem
> >> intensive, like a kernel build. My suspend scripts freeze the XFS
> >> filesystem (as Dave said I should), which presumably blocks some writer,
> >> and then the freezer times out and fails to complete.
> >>
> >> Here's part of the process dump the freezer does when it times out:
> >>
> >> cc1 D 00000000 0 18138 18137
> >> dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
> >> 009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
> >> c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
> >> Call Trace:
> >> [<c0228d97>] xfs_write+0xf4/0x6d9
> >> [<c0226038>] xfs_file_aio_write+0x53/0x5b
> >> [<c0171c15>] do_sync_write+0xae/0xec
> >> [<c0172343>] vfs_write+0xa4/0x120
> >> [<c01728d7>] sys_write+0x3b/0x60
> >> [<c0106fae>] sysenter_past_esp+0x6b/0xa1
> >> =======================
> >>
> >>
> >> I haven't looked at how to fix this yet. I only just worked out why I
> >> was getting suspend failures.
> >>
> >
> > Well, you can add freezer_do_not_count()/freezer_count() annotations to
> > xfs_write() (and whatever else is blocked as a result of the XFS being frozen).
> >
>
> What would be the implications of that? Would that just prevent
> freezing while there's something blocked there?
The freezer will not wait for this particular task. Still, the task will have
TIF_FREEZE set, so it will freeze as soon as freezer_count() is reached,
unless the thawing of tasks is carried out first.
If used in the right place, it's reasonably safe, but we need to know what
the right place is. [That's how we handle vfork(), BTW.]
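For illustration, the vfork() case Rafael mentions is handled in kernel/fork.c
by bracketing the parent's potentially unbounded wait, roughly:

        /* do_fork(), abridged */
        if (clone_flags & CLONE_VFORK) {
                freezer_do_not_count();         /* the freezer won't wait for us here */
                wait_for_completion(&vfork);    /* may sleep until the child execs or exits */
                freezer_count();                /* freezable again; freezes now if requested */
        }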
> > Generally, that would be risky without the freezing of XFS, however, because
> > filesystem data might then leak out to the storage device after the hibernation
> > image has been created, which would result in filesystem corruption after the
> > resume.
> >
> > Still, if you only suspend to RAM, that should be safe.
> >
>
> I specifically added it because I was getting data loss due to crashes
> during suspend/resume problems. It's been pretty stable lately, but I
> may as well remove the xfs_freeze from my suspend scripts if this is the
> solution.
Not exactly. :-)
> I think the broader issue is that there's no reason in principle why
> something blocked due to xfs-freezing (or vfs freezing) should prevent
> the freezer from completing.
Agreed, but the only way to tell the freezer "don't wait for this task", if the
task in question is in TASK_UNINTERRUPTIBLE, is to annotate it.
Greetings,
Rafael
On Sat, Nov 24, 2007 at 12:47:21AM +0100, Rafael J. Wysocki wrote:
> On Thursday, 22 of November 2007, Jeremy Fitzhardinge wrote:
> > It seems that a process blocked in a write to an xfs filesystem due to
> > xfs_freeze cannot be frozen by the freezer.
>
> The freezer doesn't handle tasks in TASK_UNINTERRUPTIBLE and I don't know how
> to make it handle them without at least partially defeating its purpose.
So how do you handle threads that are blocked on I/O or a lock during
the system freeze process, then?
> > I see this if I suspend my laptop while doing something xfs-filesystem
> > intensive, like a kernel build. My suspend scripts freeze the XFS
> > filesystem (as Dave said I should), which presumably blocks some writer,
> > and then the freezer times out and fails to complete.
> >
> > Here's part of the process dump the freezer does when it times out:
> >
> > cc1 D 00000000 0 18138 18137
> > dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
> > 009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
> > c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
> > Call Trace:
> > [<c0228d97>] xfs_write+0xf4/0x6d9
> > [<c0226038>] xfs_file_aio_write+0x53/0x5b
> > [<c0171c15>] do_sync_write+0xae/0xec
> > [<c0172343>] vfs_write+0xa4/0x120
> > [<c01728d7>] sys_write+0x3b/0x60
> > [<c0106fae>] sysenter_past_esp+0x6b/0xa1
> > =======================
> >
> >
> > I haven't looked at how to fix this yet. I only just worked out why I
> > was getting suspend failures.
>
> Well, you can add freezer_do_not_count()/freezer_count() annotations to
> xfs_write() (and whatever else is blocked as a result of the XFS being frozen).
May as well annotate the whole VFS, then, because once the transaction
subsystem is frozen any operation that modifies the filesystem will get
blocked like this.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Monday, 26 of November 2007, David Chinner wrote:
> On Sat, Nov 24, 2007 at 12:47:21AM +0100, Rafael J. Wysocki wrote:
> > On Thursday, 22 of November 2007, Jeremy Fitzhardinge wrote:
> > > It seems that a process blocked in a write to an xfs filesystem due to
> > > xfs_freeze cannot be frozen by the freezer.
> >
> > The freezer doesn't handle tasks in TASK_UNINTERRUPTIBLE and I don't know how
> > to make it handle them without at least partially defeating its purpose.
>
> So how do you handle threads that are blocked on I/O or a lock during
> the system freeze process, then?
We wait until they can continue.
> > > I see this if I suspend my laptop while doing something xfs-filesystem
> > > intensive, like a kernel build. My suspend scripts freeze the XFS
> > > filesystem (as Dave said I should), which presumably blocks some writer,
> > > and then the freezer times out and fails to complete.
> > >
> > > Here's part of the process dump the freezer does when it times out:
> > >
> > > cc1 D 00000000 0 18138 18137
> > > dd5f1e24 00200082 00000002 00000000 ecdeeb00 ecdeec64 c200f280 00000001
> > > 009c09a0 dd5f1e0c dd5f1e0c 0000000f 00000000 00000000 00000000 dd5f1e74
> > > c7beb480 dd5f1e88 dd5f1ea8 c0228d97 e8889540 dd5f1e38 c015b75d dd5f1e44
> > > Call Trace:
> > > [<c0228d97>] xfs_write+0xf4/0x6d9
> > > [<c0226038>] xfs_file_aio_write+0x53/0x5b
> > > [<c0171c15>] do_sync_write+0xae/0xec
> > > [<c0172343>] vfs_write+0xa4/0x120
> > > [<c01728d7>] sys_write+0x3b/0x60
> > > [<c0106fae>] sysenter_past_esp+0x6b/0xa1
> > > =======================
> > >
> > >
> > > I haven't looked at how to fix this yet. I only just worked out why I
> > > was getting suspend failures.
> >
> > Well, you can add freezer_do_not_count()/freezer_count() annotations to
> > xfs_write() (and whatever else is blocked as a result of the XFS being frozen).
>
> May as well annotate the whole VFS, then, because once the transaction
> subsystem is frozen any operation that modifies the filesystem will get
> blocked like this.
Well, I don't know how this mechanism actually works, so I can't comment.
Is there a mutex on which tasks block if the filesystem is frozen?
Greetings,
Rafael
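For reference, there is no mutex as such: struct super_block carries a freeze
state plus a waitqueue, writers sleep on it via the vfs_check_frozen() shown
earlier, and thaw_bdev() wakes them again. Roughly (circa 2.6.24, relevant
bits only):

/* include/linux/fs.h */
enum {
        SB_UNFROZEN     = 0,    /* writes allowed */
        SB_FREEZE_WRITE = 1,    /* new writes held off */
        SB_FREEZE_TRANS = 2,    /* transactions held off as well */
};

struct super_block {
        /* ... */
        int                     s_frozen;
        wait_queue_head_t       s_wait_unfrozen;
        /* ... */
};

/* and thaw_bdev() ends, roughly, with: */
        sb->s_frozen = SB_UNFROZEN;
        smp_wmb();
        wake_up(&sb->s_wait_unfrozen);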
On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
> On Monday, 26 of November 2007, David Chinner wrote:
> > So how do you handle threads that are blocked on I/O or a lock during
> > the system freeze process, then?
>
> We wait until they can continue.
So if I have a process blocked on an unavailable NFS mount, I can't
suspend?
--
Matthew Garrett | [email protected]
On Tuesday, 27 of November 2007, Matthew Garrett wrote:
> On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
> > On Monday, 26 of November 2007, David Chinner wrote:
> > > So how do you handle threads that are blocked on I/O or a lock during
> > > the system freeze process, then?
> >
> > We wait until they can continue.
>
> So if I have a process blocked on an unavailable NFS mount, I can't
> suspend?
That's correct, you can't.
[And I know what you're going to say. ;-)]
Greetings,
Rafael
On Nov 27, 2007, at 12:40:24, Rafael J. Wysocki wrote:
> On Tuesday, 27 of November 2007, Matthew Garrett wrote:
>> On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
>>> On Monday, 26 of November 2007, David Chinner wrote:
>>>> So how do you handle threads that are blocked on I/O or a lock
>>>> during the system freeze process, then?
>>>
>>> We wait until they can continue.
>>
>> So if I have a process blocked on an unavailable NFS mount, I can't
>> suspend?
>
> That's correct, you can't.
>
> [And I know what you're going to say. ;-)]
Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
instead of a zero preempt_count()? Really what we should do is just
iterate over all of the actual physical devices and tell each one
"Block new IO requests preemptably, finish pending DMA, put the
hardware in low-power mode, and prepare for suspend/hibernate". As
long as each driver knows how to do those simple things we can have
an entirely consistent kernel image for both suspend and for
hibernation.
When all tasks are preemptable we can very trivially rely on the
drivers to enforce the "Stop new IO submission" with a dirt-simple
semaphore or waitqueue. The sleep itself will be
TASK_UNINTERRUPTIBLE, but it will be done from a preemptible context.
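A minimal sketch of the per-driver gate being described; all names are
hypothetical, and real code would need proper locking or memory barriers
around the flag:

#include <linux/bio.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(mydev_unblock_wait);
static int mydev_blocked;               /* set while the device is suspended */

static void mydev_submit_io(struct bio *bio)
{
        /* Uninterruptible but preemptible sleep while the device is down. */
        wait_event(mydev_unblock_wait, !mydev_blocked);
        /* ... hand the request to the hardware ... */
}

static void mydev_suspend(void)
{
        mydev_blocked = 1;              /* hold off new submissions */
        /* ... drain pending DMA, put the hardware into low power ... */
}

static void mydev_resume(void)
{
        /* ... power the hardware back up, restore state ... */
        mydev_blocked = 0;
        wake_up(&mydev_unblock_wait);   /* release blocked submitters */
}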
That way the system suspend time is the sum of the suspend times of
the devices on the system, and the suspend time of any given device
is the sum of its maximum non-preemptible critical section and the
time to flush all of its remaining pending DMA/etc. This is almost
completely independent of the load-level of the machine, and it does
not depend on things like NFS filesystems. The one gotcha is that it
does not flush dirty filesystem pages to disk first, although that
could be fixed with a few VFS and blockdev hooks which hierarchically
flush and "freeze" block devices and filesystems before actually
disabling devices much the way that device-mapper can pause a device
to take a snapshot and end up with a clean journal on the filesystem
afterwards.
Cheers,
Kyle Moffett
On Tuesday, 27 of November 2007, Kyle Moffett wrote:
> On Nov 27, 2007, at 12:40:24, Rafael J. Wysocki wrote:
> > On Tuesday, 27 of November 2007, Matthew Garrett wrote:
> >> On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
> >>> On Monday, 26 of November 2007, David Chinner wrote:
> >>>> So how do you handle threads that are blocked on I/O or a lock
> >>>> during the system freeze process, then?
> >>>
> >>> We wait until they can continue.
> >>
> >> So if I have a process blocked on an unavailable NFS mount, I can't
> >> suspend?
> >
> > That's correct, you can't.
> >
> > [And I know what you're going to say. ;-)]
>
> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
> instead of a zero preempt_count()? Really what we should do is just
> iterate over all of the actual physical devices and tell each one
> "Block new IO requests preemptably, finish pending DMA, put the
> hardware in low-power mode, and prepare for suspend/hibernate". As
> long as each driver knows how to do those simple things we can have
> an entirely consistent kernel image for both suspend and for
> hibernation.
Well, this is more-or-less how we all imagine that should be done eventually.
The main problem is how to implement it without causing too much breakage.
Also, there are some dirty details that need to be taken into consideration.
> When all tasks are preemptable we can very trivially rely on the
> drivers to enforce the "Stop new IO submission" with a dirt-simple
> semaphore or waitqueue. The sleep itself will be
> TASK_UNINTERRUPTIBLE, but it will be done from a preemptible context.
If there are any drivers that make their devices available via mmap(), that
won't be sufficient.
Probably, we'll need two iterations over devices to handle all corner cases.
Moreover, for hibernation we need to resume at least some devices in order
to save the image, which shouldn't result in unblocking the waiting tasks.
> That way the system suspend time is the sum of the suspend times of
> the devices on the system, and the suspend time of any given device
> is the sum of its maximum non-preemptible critical section and the
> time to flush all of its remaining pending DMA/etc. This is almost
> completely independent of the load-level of the machine, and it does
> not depend on things like NFS filesystems. The one gotcha is that it
> does not flush dirty filesystem pages to disk first, although that
> could be fixed with a few VFS and blockdev hooks which hierarchically
> flush and "freeze" block devices and filesystems before actually
> disabling devices much the way that device-mapper can pause a device
> to take a snapshot and end up with a clean journal on the filesystem
> afterwards.
Yes, I generally agree.
Greetings,
Rafael
Rafael J. Wysocki wrote:
> Well, this is more-or-less how we all imagine that should be done eventually.
>
> The main problem is how to implement it without causing too much breakage.
> Also, there are some dirty details that need to be taken into consideration.
>
For Xen suspend/resume, I'd like to use the freezer to get all threads
into a known consistent state (where, specifically, they don't have any
outstanding pagetable updates pending). In other words, the freezer as
it currently stands is what I want, modulo some of these issues where it
gets caught up unexpectedly. If threads end up getting frozen anywhere
preempt isn't explicitly disabled, it wouldn't work for me.
J
On Nov 27, 2007, at 17:49:18, Jeremy Fitzhardinge wrote:
> Rafael J. Wysocki wrote:
>> Well, this is more-or-less how we all imagine that should be done
>> eventually.
>>
>> The main problem is how to implement it without causing too much
>> breakage. Also, there are some dirty details that need to be
>> taken into consideration.
>
> For Xen suspend/resume, I'd like to use the freezer to get all
> threads into a known consistent state (where, specifically, they
> don't have any outstanding pagetable updates pending). In other
> words, the freezer as it currently stands is what I want, modulo
> some of these issues where it gets caught up unexpectedly. If
> threads end up getting frozen anywhere preempt isn't explicitly
> disabled, it wouldn't work for me.
The problem with "one freezer" is that "known consistent state" means
something completely different to every single driver and subsystem.
Xen wants it to mean "No pending page table updates and no more
updates from this point forward". A network driver wants it to mean
"All pending network packets DMAed out or in and the device shut down
with all remaining packets queued". A SATA controller wants it to
mean "All DMA quiesced and no more commands", etc.
The only way to have that work is to put minimal definitions of what
state you care about in the drivers themselves. For Xen this means
that you need to have an appropriately-timed suspend handler which
hooks into Xen code very precisely to create and preserve the "No
pending page table updates" state that you care about. It will be
more work in the short term but it's the only maintainable solution
in the long term IMO.
Cheers,
Kyle Moffett
Kyle Moffett wrote:
> On Nov 27, 2007, at 17:49:18, Jeremy Fitzhardinge wrote:
>> Rafael J. Wysocki wrote:
>>> Well, this is more-or-less how we all imagine that should be done
>>> eventually.
>>>
>>> The main problem is how to implement it without causing too much
>>> breakage. Also, there are some dirty details that need to be taken
>>> into consideration.
>>
>> For Xen suspend/resume, I'd like to use the freezer to get all
>> threads into a known consistent state (where, specifically, they
>> don't have any outstanding pagetable updates pending). In other
>> words, the freezer as it currently stands is what I want, modulo some
>> of these issues where it gets caught up unexpectedly. If threads end
>> up getting frozen anywhere preempt isn't explicitly disabled, it
>> wouldn't work for me.
>
> The problem with "one freezer" is that "known consistent state" means
> something completely different to every single driver and subsystem.
Not really. The freezer puts tasks into a particular well-understood
state: they're either in usermode, or in the kernel in the
refrigerator. And since the places which call into the refrigerator are
explicit in the source, and not terribly numerous, it's easy to audit
exactly what the state is at each call.
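For reference, user tasks hit the refrigerator on the signal-delivery path
(get_signal_to_deliver() calls try_to_freeze()), while freezable kernel
threads call it themselves at well-defined points; a typical loop (names
hypothetical):

#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int my_worker(void *unused)
{
        set_freezable();                /* kthreads are not freezable by default */

        while (!kthread_should_stop()) {
                try_to_freeze();        /* the explicit refrigerator call site */
                /* ... do one unit of work ... */
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}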
> Xen wants it to mean "No pending page table updates and no more
> updates from this point forward". A network driver wants it to mean
> "All pending network packets DMAed out or in and the device shut down
> with all remaining packets queued". A SATA controller wants it to mean
> "All DMA quiesced and no more commands", etc.
Well, those are somewhat different. The existing suspend/resume driver
callbacks are sufficient for a device to be in that state. What I want
for Xen is more global: I just want to make sure tasks are not preempted
in the middle of a state which can't be suspended. The specific details
of the state I want are moderately complex, but short lived. The
problem with other mechanisms - like stop_machine - is that they can
leave threads preempted in one of the states I can't handle, whereas the
freezer is more deterministic.
> The only way to have that work is to put minimal definitions of what
> state you care about in the drivers themselves. For Xen this means
> that you need to have an appropriately-timed suspend handler which
> hooks into Xen code very precisely to create and preserve the "No
> pending page table updates" state that you care about. It will be
> more work in the short term but it's the only maintainable solution in
> the long term IMO.
No, that doesn't really work. Aside from scattering hooks everywhere
there are pagetable updates, there's no real existing place to hook into.
While I could put those hooks in, they would amount to changing the
kernel-internal pagetable update interface for everyone to deal with a
corner case of a fairly obscure user - I don't think it's a good tradeoff.
The freezer is nice because the state it puts each task into is
well-defined, and is well-suited for Xen's use. In fact, I would agree
with you that the use I want to put the freezer to better suits it than
its current use in suspend/resume.
J
Hi!
> >>>>So how do you handle threads that are blocked on I/O or a lock
> >>>>during the system freeze process, then?
> >>>
> >>>We wait until they can continue.
> >>
> >>So if I have a process blocked on an unavailable NFS mount, I can't
> >>suspend?
> >
> >That's correct, you can't.
> >
> >[And I know what you're going to say. ;-)]
>
> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
> instead of a zero preempt_count()? Really what we should do is just
> iterate over all of the actual physical devices and tell each one
> "Block new IO requests preemptably, finish pending DMA, put the
> hardware in low-power mode, and prepare for suspend/hibernate". As
> long as each driver knows how to do those simple things we can have
> an entirely consistent kernel image for both suspend and for
> hibernation.
"each driver" means this is a lot of work. But yes, that is probably
way to go, and patch would be welcome.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi.
Pavel Machek wrote:
> Hi!
>
>>>>>> So how do you handle threads that are blocked on I/O or a lock
>>>>>> during the system freeze process, then?
>>>>> We wait until they can continue.
>>>> So if I have a process blocked on an unavailable NFS mount, I can't
>>>> suspend?
>>> That's correct, you can't.
>>>
>>> [And I know what you're going to say. ;-)]
>> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
>> instead of a zero preempt_count()? Really what we should do is just
>> iterate over all of the actual physical devices and tell each one
>> "Block new IO requests preemptably, finish pending DMA, put the
>> hardware in low-power mode, and prepare for suspend/hibernate". As
>> long as each driver knows how to do those simple things we can have
>> an entirely consistent kernel image for both suspend and for
>> hibernation.
>
> "each driver" means this is a lot of work. But yes, that is probably
> way to go, and patch would be welcome.
Yes, that does work. It's what I've done in my (preliminary) support for
fuse.
Regards,
Nigel
On Wednesday, 2 of January 2008, Nigel Cunningham wrote:
> Hi.
>
> Pavel Machek wrote:
> > Hi!
> >
> >>>>>> So how do you handle threads that are blocked on I/O or a lock
> >>>>>> during the system freeze process, then?
> >>>>> We wait until they can continue.
> >>>> So if I have a process blocked on an unavailable NFS mount, I can't
> >>>> suspend?
> >>> That's correct, you can't.
> >>>
> >>> [And I know what you're going to say. ;-)]
> >> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
> >> instead of a zero preempt_count()? Really what we should do is just
> >> iterate over all of the actual physical devices and tell each one
> >> "Block new IO requests preemptably, finish pending DMA, put the
> >> hardware in low-power mode, and prepare for suspend/hibernate". As
> >> long as each driver knows how to do those simple things we can have
> >> an entirely consistent kernel image for both suspend and for
> >> hibernation.
> >
> > "each driver" means this is a lot of work. But yes, that is probably
> > way to go, and patch would be welcome.
>
> Yes, that does work. It's what I've done in my (preliminary) support for
> fuse.
Hmm, can you please elaborate a bit?
Rafael
Hi.
Rafael J. Wysocki wrote:
> On Wednesday, 2 of January 2008, Nigel Cunningham wrote:
>> Pavel Machek wrote:
>>>>>>>> So how do you handle threads that are blocked on I/O or a lock
>>>>>>>> during the system freeze process, then?
>>>>>>> We wait until they can continue.
>>>>>> So if I have a process blocked on an unavailable NFS mount, I can't
>>>>>> suspend?
>>>>> That's correct, you can't.
>>>>>
>>>>> [And I know what you're going to say. ;-)]
>>>> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
>>>> instead of a zero preempt_count()? Really what we should do is just
>>>> iterate over all of the actual physical devices and tell each one
>>>> "Block new IO requests preemptably, finish pending DMA, put the
>>>> hardware in low-power mode, and prepare for suspend/hibernate". As
>>>> long as each driver knows how to do those simple things we can have
>>>> an entirely consistent kernel image for both suspend and for
>>>> hibernation.
>>> "each driver" means this is a lot of work. But yes, that is probably
>>> way to go, and patch would be welcome.
>> Yes, that does work. It's what I've done in my (preliminary) support for
>> fuse.
>
> Hmm, can you please elaborate a bit?
Sorry. I wasn't very clear, was I? And I'm not sure now whether
you're meaning "How does fuse support relate to freezing block devices?"
or "What's this about fuse support?". Let me therefore seek to answer
both questions:
Higher level, I know (filesystems rather than block devices), but I meant
that the general concept of blocking new requests and completing existing
ones worked fine for the supposedly impossible fuse support.
Re fuse support, let me start by saying "I know this doesn't handle all
situations, but I think it's a good enough proof-of-concept implementation".
I added some simple hooks to the code for submitting new work to fuse
threads.
#define FUSE_MIGHT_FREEZE(superblock, desc) \
do { \
        int printed = 0; \
        while (superblock->s_frozen != SB_UNFROZEN) { \
                if (!printed) { \
                        printk("%d frozen in " desc ".\n", current->pid); \
                        printed = 1; \
                } \
                try_to_freeze(); \
                yield(); \
        } \
} while (0)
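A call-site sketch for the macro (the actual hook placement in Nigel's patch
is not quoted in this thread; the line below is only an illustration, assuming
a fuse operation that has the directory inode at hand):

        /* e.g. near the top of a fuse operation that queues work for the server */
        FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_lookup");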
On top of this, I made a (too simple at the moment) freeze_filesystems
function which iterates through &super_blocks in reverse order, freezing
fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
currently allow for the possibility of someone mounting (say) ext3 on
fuse, but that would just be an extension of what's already done.
The end result is:
int freeze_processes(void)
{
        int error;

        printk(KERN_INFO "Stopping fuse filesystems.\n");
        freeze_filesystems(FS_FREEZER_FUSE);
        freezer_state = FREEZER_FILESYSTEMS_FROZEN;
        printk(KERN_INFO "Freezing user space processes ... ");
        error = try_to_freeze_tasks(FREEZER_USER_SPACE);
        if (error)
                goto Exit;
        printk(KERN_INFO "done.\n");

        sys_sync();
        printk(KERN_INFO "Stopping normal filesystems.\n");
        freeze_filesystems(FS_FREEZER_NORMAL);
        freezer_state = FREEZER_USERSPACE_FROZEN;
        printk(KERN_INFO "Freezing remaining freezable tasks ... ");
        error = try_to_freeze_tasks(FREEZER_KERNEL_THREADS);
        if (error)
                goto Exit;
        printk(KERN_INFO "done.");
        freezer_state = FREEZER_FULLY_ON;
Exit:
        BUG_ON(in_atomic());
        printk("\n");
        return error;
}
Sorry if that's more info than you wanted.
Nigel
On Thursday, 3 January 2008, Nigel Cunningham wrote:
> On top of this, I made a (too simple at the moment) freeze_filesystems
> function which iterates through &super_blocks in reverse order, freezing
> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
> currently allow for the possibility of someone mounting (say) ext3 on
> fuse, but that would just be an extension of what's already done.
How do you deal with fuse server tasks using other fuse filesystems?
How does freeze_filesystems() look?
Regards
Oliver
Hi.
Oliver Neukum wrote:
> On Thursday, 3 January 2008, Nigel Cunningham wrote:
>> On top of this, I made a (too simple at the moment) freeze_filesystems
>> function which iterates through &super_blocks in reverse order, freezing
>> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
>> currently allow for the possibility of someone mounting (say) ext3 on
>> fuse, but that would just be an extension of what's already done.
>
> How do you deal with fuse server tasks using other fuse filesystems?
Since they're frozen in reverse order, the dependent one would be frozen
first.
> How does freeze_filesystems() look?
Removing my ugly debugging statements, it's currently:
/**
 * freeze_filesystems - lock all filesystems and force them into a
 * consistent state
 */
void freeze_filesystems(int which)
{
        struct super_block *sb;

        lockdep_off();

        /*
         * Freeze in reverse order so filesystems dependent upon others are
         * frozen in the right order (eg. loopback on ext3).
         */
        list_for_each_entry_reverse(sb, &super_blocks, s_list) {
                if (sb->s_type->fs_flags & FS_IS_FUSE &&
                    sb->s_frozen == SB_UNFROZEN &&
                    which & FS_FREEZER_FUSE) {
                        sb->s_frozen = SB_FREEZE_TRANS;
                        sb->s_flags |= MS_FROZEN;
                        continue;
                }

                if (!sb->s_root || !sb->s_bdev ||
                    (sb->s_frozen == SB_FREEZE_TRANS) ||
                    (sb->s_flags & MS_RDONLY) ||
                    (sb->s_flags & MS_FROZEN) ||
                    !(which & FS_FREEZER_NORMAL))
                        continue;

                freeze_bdev(sb->s_bdev);
                sb->s_flags |= MS_FROZEN;
        }

        lockdep_on();
}
Nigel
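The matching thaw path is not quoted in the thread; a minimal sketch of what a
hypothetical thaw_filesystems() counterpart might look like, reusing the
FS_IS_FUSE/MS_FROZEN conventions from the patch above and glossing over
reference counting:

void thaw_filesystems(int which)
{
        struct super_block *sb;

        lockdep_off();

        /* Thaw in mount order, the reverse of the freeze above. */
        list_for_each_entry(sb, &super_blocks, s_list) {
                if (!(sb->s_flags & MS_FROZEN))
                        continue;

                if (sb->s_type->fs_flags & FS_IS_FUSE) {
                        if (!(which & FS_FREEZER_FUSE))
                                continue;
                        sb->s_frozen = SB_UNFROZEN;
                } else {
                        if (!(which & FS_FREEZER_NORMAL))
                                continue;
                        thaw_bdev(sb->s_bdev, sb);
                }
                sb->s_flags &= ~MS_FROZEN;
        }

        lockdep_on();
}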
On Thursday, 3 January 2008 10:52:53, Nigel Cunningham wrote:
> Hi.
>
> Oliver Neukum wrote:
> > On Thursday, 3 January 2008, Nigel Cunningham wrote:
> >> On top of this, I made a (too simple at the moment) freeze_filesystems
> >> function which iterates through &super_blocks in reverse order, freezing
> >> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
> >> currently allow for the possibility of someone mounting (say) ext3 on
> >> fuse, but that would just be an extension of what's already done.
> >
> > How do you deal with fuse server tasks using other fuse filesystems?
>
> Since they're frozen in reverse order, the dependent one would be frozen
> first.
Say I do:
a) mount fuse on /tmp/first
b) mount fuse on /tmp/second
Then the server task for (a) does "ls /tmp/second". So it will be frozen,
right? How do you then freeze (a)? And keep in mind that the server task
may have forked.
Regards
Oliver
Hi.
Oliver Neukum wrote:
> On Thursday, 3 January 2008 10:52:53, Nigel Cunningham wrote:
>> Hi.
>>
>> Oliver Neukum wrote:
>>> On Thursday, 3 January 2008, Nigel Cunningham wrote:
>>>> On top of this, I made a (too simple at the moment) freeze_filesystems
>>>> function which iterates through &super_blocks in reverse order, freezing
>>>> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
>>>> currently allow for the possibility of someone mounting (say) ext3 on
>>>> fuse, but that would just be an extension of what's already done.
>>> How do you deal with fuse server tasks using other fuse filesystems?
>> Since they're frozen in reverse order, the dependent one would be frozen
>> first.
>
> Say I do:
>
> a) mount fuse on /tmp/first
> b) mount fuse on /tmp/second
>
> Then the server task for (a) does "ls /tmp/second". So it will be frozen,
> right? How do you then freeze (a)? And keep in mind that the server task
> may have forked.
I guess I should first ask, is this a real life problem or a
hypothetical twisted web? I don't see why you would want to make two
filesystems interdependent - it sounds like the way to create livelock
and deadlocks in normal use, before we even begin to think about
hibernating.
Regards,
Nigel
On Thursday, 3 of January 2008, Nigel Cunningham wrote:
> Hi.
>
> Rafael J. Wysocki wrote:
> > On Wednesday, 2 of January 2008, Nigel Cunningham wrote:
> >> Pavel Machek wrote:
> >>>>>>>> So how do you handle threads that are blocked on I/O or a lock
> >>>>>>>> during the system freeze process, then?
> >>>>>>> We wait until they can continue.
> >>>>>> So if I have a process blocked on an unavailable NFS mount, I can't
> >>>>>> suspend?
> >>>>> That's correct, you can't.
> >>>>>
> >>>>> [And I know what you're going to say. ;-)]
> >>>> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"
> >>>> instead of a zero preempt_count()? Really what we should do is just
> >>>> iterate over all of the actual physical devices and tell each one
> >>>> "Block new IO requests preemptably, finish pending DMA, put the
> >>>> hardware in low-power mode, and prepare for suspend/hibernate". As
> >>>> long as each driver knows how to do those simple things we can have
> >>>> an entirely consistent kernel image for both suspend and for
> >>>> hibernation.
> >>> "each driver" means this is a lot of work. But yes, that is probably
> >>> way to go, and patch would be welcome.
> >> Yes, that does work. It's what I've done in my (preliminary) support for
> >> fuse.
> >
> > Hmm, can you please elaborate a bit?
>
> Sorry. I wasn't very clear, was I? And I'm not sure now whether
> you're meaning "How does fuse support relate to freezing block devices?"
> or "What's this about fuse support?". Let me therefore seek to answer
> both questions:
>
> Higher level, I know (filesystems rather than block devices), but I meant
> that the general concept of blocking new requests and completing existing
> ones worked fine for the supposedly impossible fuse support.
>
> Re fuse support, let me start by saying "I know this doesn't handle all
> situations, but I think it's a good enough proof-of-concept implementation".
>
> I added some simple hooks to the code for submitting new work to fuse
> threads.
>
> #define FUSE_MIGHT_FREEZE(superblock, desc) \
> do { \
> 	int printed = 0; \
> 	while (superblock->s_frozen != SB_UNFROZEN) { \
> 		if (!printed) { \
> 			printk("%d frozen in " desc ".\n", current->pid); \
> 			printed = 1; \
> 		} \
> 		try_to_freeze(); \
> 		yield(); \
> 	} \
> } while (0)
>
> On top of this, I made a (too simple at the moment) freeze_filesystems
> function which iterates through &super_blocks in reverse order, freezing
> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
> currently allow for the possibility of someone mounting (say) ext3 on
> fuse, but that would just be an extension of what's already done.
>
> The end result is:
>
> int freeze_processes(void)
> {
> 	int error;
> 
> 	printk(KERN_INFO "Stopping fuse filesystems.\n");
> 	freeze_filesystems(FS_FREEZER_FUSE);
> 	freezer_state = FREEZER_FILESYSTEMS_FROZEN;
> 	printk(KERN_INFO "Freezing user space processes ... ");
> 	error = try_to_freeze_tasks(FREEZER_USER_SPACE);
> 	if (error)
> 		goto Exit;
> 	printk(KERN_INFO "done.\n");
> 
> 	sys_sync();
> 	printk(KERN_INFO "Stopping normal filesystems.\n");
> 	freeze_filesystems(FS_FREEZER_NORMAL);
> 	freezer_state = FREEZER_USERSPACE_FROZEN;
> 	printk(KERN_INFO "Freezing remaining freezable tasks ... ");
> 	error = try_to_freeze_tasks(FREEZER_KERNEL_THREADS);
> 	if (error)
> 		goto Exit;
> 	printk(KERN_INFO "done.");
> 	freezer_state = FREEZER_FULLY_ON;
> Exit:
> 	BUG_ON(in_atomic());
> 	printk("\n");
> 	return error;
> }
>
> Sorry if that's more info than you wanted.
No, that's fine, thanks.
Greetings,
Rafael
On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:
> Hi.
>
> Oliver Neukum wrote:
> > On Thursday, 3 January 2008 10:52:53, Nigel Cunningham wrote:
> >> Hi.
> >>
> >> Oliver Neukum wrote:
> >>> On Thursday, 3 January 2008, Nigel Cunningham wrote:
> >>>> On top of this, I made a (too simple at the moment) freeze_filesystems
> >>>> function which iterates through &super_blocks in reverse order, freezing
> >>>> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
> >>>> currently allow for the possibility of someone mounting (say) ext3 on
> >>>> fuse, but that would just be an extension of what's already done.
> >>> How do you deal with fuse server tasks using other fuse filesystems?
> >> Since they're frozen in reverse order, the dependent one would be frozen
> >> first.
> >
> > Say I do:
> >
> > a) mount fuse on /tmp/first
> > b) mount fuse on /tmp/second
> >
> > Then the server task for (a) does "ls /tmp/second". So it will be frozen,
> > right? How do you then freeze (a)? And keep in mind that the server task
> > may have forked.
>
> I guess I should first ask, is this a real life problem or a
> hypothetical twisted web? I don't see why you would want to make two
> filesystems interdependent - it sounds like the way to create livelock
> and deadlocks in normal use, before we even begin to think about
> hibernating.
Good questions. I personally don't use fuse, but I do care about power
management. The problem I see is that an unprivileged user could make
that dependency, even inadvertently.
Regards
Oliver
On Jan 04, 2008, at 15:54:06, Oliver Neukum wrote:
> On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:
>> Hi.
>>> a) mount fuse on /tmp/first
>>> b) mount fuse on /tmp/second
>>>
>>> Then the server task for (a) does "ls /tmp/second". So it will be
>>> frozen, right? How do you then freeze (a)? And keep in mind that
>>> the server task may have forked.
>>
>> I guess I should first ask, is this a real life problem or a
>> hypothetical twisted web? I don't see why you would want to make
>> two filesystems interdependent - it sounds like the way to create
>> livelock and deadlocks in normal use, before we even begin to
>> think about hibernating.
>
> Good questions. I personally don't use fuse, but I do care about
> power management. The problem I see is that an unprivileged user
> could make that dependency, even inadvertently.
I don't think it makes sense for the kernel to try to keep track of
hard data dependencies for FUSE filesystems, or to even *attempt* to
auto-suspend them. You should instead allow a privileged program to
initiate a "freeze-and-flush" operation on a particular FUSE
filesystem and optionally wait for it to finish. Then your userspace
would be configured with the appropriate data dependencies and would
stop FUSE filesystems in the appropriate order.
In addition, the kernel would automatically understand
ext3=>loopback=>fuse, and when asked to freeze the "fuse" part, it
would first freeze the "ext3" and the "loopback" parts using similar
mechanisms as device-mapper currently uses when you do "dmsetup
suspend mydev" followed by "echo 0 $SIZE snapshot /dev/mapper/mydev-
base /dev/mapper/mydev-snap-back p 8 | dmsetup load mydev" (IE: when
you create a snapshot of a given device).
Naturally userspace could deadlock itself (although not the kernel)
by freezing a block device and then attempting to access it, but
since the "freeze" operation is limited to root this is not a big
issue. The way to freeze all filesystems safely would be to clone a
new mount namespace, mlockall(), mount a tmpfs, pivot_root() into the
tmpfs, bind-mount the filesystems you want to freeze directly onto
subdirectories of the tmpfs, and then freeze them in an appropriate
order.
Besides which the worst-case is a pretty straightforward non-critical
failure; you might fail to fully sync a FUSE filesystem because its
daemon is asleep waiting on something (possibly even just sitting in
a "sleep(10000)" call with all signals masked). You simply need to
make sure that all tasks are asleep outside of driver critical
sections so that you can properly suspend your device tree.
Cheers,
Kyle Moffett
On Fri 2008-01-04 21:54:06, Oliver Neukum wrote:
> On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:
> > Hi.
> >
> > Oliver Neukum wrote:
> > > On Thursday, 3 January 2008 10:52:53, Nigel Cunningham wrote:
> > >> Hi.
> > >>
> > >> Oliver Neukum wrote:
> > >>> On Thursday, 3 January 2008, Nigel Cunningham wrote:
> > >>>> On top of this, I made a (too simple at the moment) freeze_filesystems
> > >>>> function which iterates through &super_blocks in reverse order, freezing
> > >>>> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
> > >>>> currently allow for the possibility of someone mounting (say) ext3 on
> > >>>> fuse, but that would just be an extension of what's already done.
> > >>> How do you deal with fuse server tasks using other fuse filesystems?
> > >> Since they're frozen in reverse order, the dependent one would be frozen
> > >> first.
> > >
> > > Say I do:
> > >
> > > a) mount fuse on /tmp/first
> > > b) mount fuse on /tmp/second
> > >
> > > Then the server task for (a) does "ls /tmp/second". So it will be frozen,
> > > right? How do you then freeze (a)? And keep in mind that the server task
> > > may have forked.
> >
> > I guess I should first ask, is this a real life problem or a
> > hypothetical twisted web? I don't see why you would want to make two
> > filesystems interdependent - it sounds like the way to create livelock
> > and deadlocks in normal use, before we even begin to think about
> > hibernating.
>
> Good questions. I personally don't use fuse, but I do care about power
> management. The problem I see is that an unprivileged user could make
> that dependency, even inadvertently.
Other problem is that unprivileged user can do it with evil intent. So
called "denial-of-service" attack.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi.
Pavel Machek wrote:
> On Fri 2008-01-04 21:54:06, Oliver Neukum wrote:
>> On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:
>>> Oliver Neukum wrote:
>>>> On Thursday, 3 January 2008 10:52:53, Nigel Cunningham wrote:
>>>>> Oliver Neukum wrote:
>>>>>> On Thursday, 3 January 2008, Nigel Cunningham wrote:
>>>>>>> On top of this, I made a (too simple at the moment) freeze_filesystems
>>>>>>> function which iterates through &super_blocks in reverse order, freezing
>>>>>>> fuse filesystems or ordinary ones. I say 'too simple' because it doesn't
>>>>>>> currently allow for the possibility of someone mounting (say) ext3 on
>>>>>>> fuse, but that would just be an extension of what's already done.
>>>>>> How do you deal with fuse server tasks using other fuse filesystems?
>>>>> Since they're frozen in reverse order, the dependent one would be frozen
>>>>> first.
>>>> Say I do:
>>>>
>>>> a) mount fuse on /tmp/first
>>>> b) mount fuse on /tmp/second
>>>>
>>>> Then the server task for (a) does "ls /tmp/second". So it will be frozen,
>>>> right? How do you then freeze (a)? And keep in mind that the server task
>>>> may have forked.
>>> I guess I should first ask, is this a real life problem or a
>>> hypothetical twisted web? I don't see why you would want to make two
>>> filesystems interdependent - it sounds like the way to create livelock
>>> and deadlocks in normal use, before we even begin to think about
>>> hibernating.
>> Good questions. I personally don't use fuse, but I do care about power
>> management. The problem I see is that an unprivileged user could make
>> that dependency, even inadvertently.
>
> Other problem is that unprivileged user can do it with evil intent. So
> called "denial-of-service" attack.
Only in this case it would be a denial-of-denial-of-service attack,
since it would stop you hibernating or suspending :).
This is still all hypothetical. If I could have a real life case where
this could actually happen, it would help a lot.
Nigel
Hi!
(replying to *very* old mail).
>>>> We wait until they can continue.
>>>
>>> So if I have a process blocked on an unavailable NFS mount, I can't
>>> suspend?
>>
>> That's correct, you can't.
>>
>> [And I know what you're going to say. ;-)]
>
> Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE" instead
> of a zero preempt_count()? Really what we should do is just iterate over
> all of the actual physical devices and tell each one "Block new IO requests
> preemptably, finish pending DMA, put the hardware in low-power mode, and
> prepare for suspend/hibernate". As long as each driver knows how to do
> those simple things we can have an entirely consistent kernel image for
> both suspend and for hibernation.
Patch would be welcome, actually. It turns out blocking new
IO-requests is not completely trivial.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, 23 Jun 2008, Pavel Machek wrote:
> (replying to *very* old mail).
>
> >>>> We wait until they can continue.
> >>>
> >>> So if I have a process blocked on an unavailable NFS mount, I can't
> >>> suspend?
> >>
> >> That's correct, you can't.
> >>
> >> [And I know what you're going to say. ;-)]
> >
> > Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE" instead
> > of a zero preempt_count()? Really what we should do is just iterate over
> > all of the actual physical devices and tell each one "Block new IO requests
> > preemptably, finish pending DMA, put the hardware in low-power mode, and
> > prepare for suspend/hibernate". As long as each driver knows how to do
> > those simple things we can have an entirely consistent kernel image for
> > both suspend and for hibernation.
>
> Patch would be welcome, actually. It turns out blocking new
> IO-requests is not completely trivial.
Is this the same thing the per-device IO-queue-freeze patches for HDAPS also
need to do? If so, you may want to talk to Elias Oltmanns
<[email protected]> about it. Added to CC.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Henrique de Moraes Holschuh <[email protected]> wrote:
> On Mon, 23 Jun 2008, Pavel Machek wrote:
>> (replying to *very* old mail).
>
>>
>> >>>> We wait until they can continue.
>> >>>
>> >>> So if I have a process blocked on an unavailable NFS mount, I can't
>> >>> suspend?
>> >>
>> >> That's correct, you can't.
>> >>
>> >> [And I know what you're going to say. ;-)]
>> >
>> > Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE" instead
>> > of a zero preempt_count()? Really what we should do is just iterate over
>> > all of the actual physical devices and tell each one "Block new IO requests
>> > preemptably, finish pending DMA, put the hardware in low-power mode, and
>> > prepare for suspend/hibernate". As long as each driver knows how to do
>> > those simple things we can have an entirely consistent kernel image for
>> > both suspend and for hibernation.
>>
>> Patch would be welcome, actually. It turns out blocking new
>> IO-requests is not completely trivial.
Quite. But I'm not sure I see what this is all about yet. From the IDE
and SCSI subsystems I remember that they block all I/O from higher levels
once the suspend callbacks have been executed. I haven't made an effort
to understand the freezer (or indeed anything related to hibernation)
yet since I don't even use hibernation myself (only s2ram). Do you have
any suggestion where to start reading up on things so I can get an idea
what the issues are and what you would like IDE / SCSI / ... to do?
>
> Is this the same thing the per-device IO-queue-freeze patches for
> HDAPS also
> need to do? If so, you may want to talk to Elias Oltmanns
> <[email protected]> about it. Added to CC.
Thanks for the heads up Henrique. Even though these issues seem to be
related up to a certain degree, there probably are some important
differences. When suspending a system, the emphasis is on leaving the
system in a consistent state (think of journalled file systems), whereas
disk shock protection is mainly concerned with stopping I/O as soon as
possible. As yet, I cannot possibly say to what extent these two
concepts can be reconciled in the sense of sharing some common code.
Regards,
Elias
Hi!
> >> Patch would be welcome, actually. It turns out blocking new
> >> IO-requests is not completely trivial.
>
> Quite. But I'm not sure I see what this is all about yet. From the IDE
> and SCSI subsystems I remember that they block all I/O from higher levels
> once the suspend callbacks have been executed. I haven't made an effort
> to understand the freezer (or indeed anything related to hibernation)
> yet since I don't even use hibernation myself (only s2ram). Do you have
s2ram also uses freezer these days. Difference is s2ram does not
really need it.
> any suggestion where to start reading up on things so I can get an idea
> what the issues are and what you would like IDE / SCSI / ... to do?
I'd like block layer to block any process that tries to do I/O.
> > Is this the same thing the per-device IO-queue-freeze patches for
> > HDAPS also
> > need to do? If so, you may want to talk to Elias Oltmanns
> > <[email protected]> about it. Added to CC.
>
> Thanks for the heads up Henrique. Even though these issues seem to be
> related up to a certain degree, there probably are some important
> differences. When suspending a system, the emphasis is on leaving the
> system in a consistent state (think of journalled file systems), whereas
> disk shock protection is mainly concerned with stopping I/O as soon as
> possible. As yet, I cannot possibly say to what extent these two
> concepts can be reconciled in the sense of sharing some common code.
Actually, I believe requirements are same.
'don't do i/o in dangerous period'.
swsusp will just do sync() before entering dangerous period. That
provides consistent-enough state...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
> > > Is this the same thing the per-device IO-queue-freeze patches for
> > > HDAPS also
> > > need to do? If so, you may want to talk to Elias Oltmanns
> > > <[email protected]> about it. Added to CC.
> >
> > Thanks for the heads up Henrique. Even though these issues seem to be
> > related up to a certain degree, there probably are some important
> > differences. When suspending a system, the emphasis is on leaving the
> > system in a consistent state (think of journalled file systems), whereas
> > disk shock protection is mainly concerned with stopping I/O as soon as
> > possible. As yet, I cannot possibly say to what extent these two
> > concepts can be reconciled in the sense of sharing some common code.
>
> Actually, I believe requirements are same.
>
> 'don't do i/o in dangerous period'.
>
> swsusp will just do sync() before entering dangerous period. That
> provides consistent-enough state...
As I've said many times before - if the requirement is "don't do
I/O" then you have to freeze the filesystem. In no way does 'sync'
prevent filesystems from doing I/O.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Monday, 30 of June 2008, Dave Chinner wrote:
> On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
> > > > Is this the same thing the per-device IO-queue-freeze patches for
> > > > HDAPS also
> > > > need to do? If so, you may want to talk to Elias Oltmanns
> > > > <[email protected]> about it. Added to CC.
> > >
> > > Thanks for the heads up Henrique. Even though these issues seem to be
> > > related up to a certain degree, there probably are some important
> > > differences. When suspending a system, the emphasis is on leaving the
> > > system in a consistent state (think of journalled file systems), whereas
> > > disk shock protection is mainly concerned with stopping I/O as soon as
> > > possible. As yet, I cannot possibly say to what extent these two
> > > concepts can be reconciled in the sense of sharing some common code.
> >
> > Actually, I believe requirements are same.
> >
> > 'don't do i/o in dangerous period'.
> >
> > swsusp will just do sync() before entering dangerous period. That
> > provides consistent-enough state...
>
> As I've said many times before - if the requirement is "don't do
> I/O" then you have to freeze the filesystem. In no way does 'sync'
> prevent filesystems from doing I/O.....
Well, it seems we can handle this on the block layer level, by temporarily
replacing the elevator with something that will selectively prevent fs I/O
from reaching the layers below it.
I talked with Jens about it on a very general level, but it seems doable at
first sight.
Thanks,
Rafael
On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > Actually, I believe requirements are same.
> > >
> > > 'don't do i/o in dangerous period'.
> > >
> > > swsusp will just do sync() before entering dangerous period. That
> > > provides consistent-enough state...
> >
> > As I've said many times before - if the requirement is "don't do
> > I/O" then you have to freeze the filesystem. In no way does 'sync'
> > prevent filesystems from doing I/O.....
>
> Well, it seems we can handle this on the block layer level, by temporarily
> replacing the elevator with something that will selectively prevent fs I/O
> from reaching the layers below it.
>
> I talked with Jens about it on a very general level, but it seems doable at
> first sight.
Why would you hack the block layer when we already have a perfectly fine
facility to achieve what you want? freeze_bdev is there exactly for the
purpose of making the filesystem consistent on disk and then freezing all
I/O.
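For reference, the call pattern Christoph is pointing at looks roughly like this; the signatures are the 2.6.26-era ones from memory (freeze_bdev() returning the frozen superblock, thaw_bdev() taking it back), and the wrapper names are invented for the example:

#include <linux/fs.h>
#include <linux/buffer_head.h>

static struct super_block *example_frozen_sb;

/* Sync the filesystem and block further writes until it is thawed. */
static void example_quiesce(struct block_device *bdev)
{
	example_frozen_sb = freeze_bdev(bdev);
}

/* Wake up writers that were blocked on the frozen superblock. */
static void example_unquiesce(struct block_device *bdev)
{
	thaw_bdev(bdev, example_frozen_sb);
}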
On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> On Monday, 30 of June 2008, Dave Chinner wrote:
> > On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
> > > > > Is this the same thing the per-device IO-queue-freeze patches for
> > > > > HDAPS also
> > > > > need to do? If so, you may want to talk to Elias Oltmanns
> > > > > <[email protected]> about it. Added to CC.
> > > >
> > > > Thanks for the heads up Henrique. Even though these issues seem to be
> > > > related up to a certain degree, there probably are some important
> > > > differences. When suspending a system, the emphasis is on leaving the
> > > > system in a consistent state (think of journalled file systems), whereas
> > > > disk shock protection is mainly concerned with stopping I/O as soon as
> > > > possible. As yet, I cannot possibly say to what extent these two
> > > > concepts can be reconciled in the sense of sharing some common code.
> > >
> > > Actually, I believe requirements are same.
> > >
> > > 'don't do i/o in dangerous period'.
> > >
> > > swsusp will just do sync() before entering dangerous period. That
> > > provides consistent-enough state...
> >
> > As I've said many times before - if the requirement is "don't do
> > I/O" then you have to freeze the filesystem. In no way does 'sync'
> > prevent filesystems from doing I/O.....
>
> Well, it seems we can handle this on the block layer level, by temporarily
> replacing the elevator with something that will selectively prevent fs I/O
> from reaching the layers below it.
Why? What part of freeze_bdev() doesn't work for you?
Cheers,
Dave.
--
Dave Chinner
[email protected]
Dave Chinner wrote:
> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
>
>> On Monday, 30 of June 2008, Dave Chinner wrote:
>>
>>> On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
>>>
>>>>>> Is this the same thing the per-device IO-queue-freeze patches for
>>>>>> HDAPS also
>>>>>> need to do? If so, you may want to talk to Elias Oltmanns
>>>>>> <[email protected]> about it. Added to CC.
>>>>>>
>>>>> Thanks for the heads up Henrique. Even though these issues seem to be
>>>>> related up to a certain degree, there probably are some important
>>>>> differences. When suspending a system, the emphasis is on leaving the
>>>>> system in a consistent state (think of journalled file systems), whereas
>>>>> disk shock protection is mainly concerned with stopping I/O as soon as
>>>>> possible. As yet, I cannot possibly say to what extent these two
>>>>> concepts can be reconciled in the sense of sharing some common code.
>>>>>
>>>> Actually, I believe requirements are same.
>>>>
>>>> 'don't do i/o in dangerous period'.
>>>>
>>>> swsusp will just do sync() before entering dangerous period. That
>>>> provides consistent-enough state...
>>>>
>>> As I've said many times before - if the requirement is "don't do
>>> I/O" then you have to freeze the filesystem. In no way does 'sync'
>>> prevent filesystems from doing I/O.....
>>>
>> Well, it seems we can handle this on the block layer level, by temporarily
>> replacing the elevator with something that will selectively prevent fs I/O
>> from reaching the layers below it.
>>
>
> Why? What part of freeze_bdev() doesn't work for you?
Well, my original problem - which is still an issue - is that a process
writing to a frozen XFS filesystem is stuck in D state, and therefore
cannot be frozen as part of suspend.
J
On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> Dave Chinner wrote:
>> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
>>> On Monday, 30 of June 2008, Dave Chinner wrote:
>>>> On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
>>>>>>> Is this the same thing the per-device IO-queue-freeze patches for
>>>>>>> HDAPS also
>>>>>>> need to do? If so, you may want to talk to Elias Oltmanns
>>>>>>> <[email protected]> about it. Added to CC.
>>>>>>>
>>>>>> Thanks for the heads up Henrique. Even though these issues seem to be
>>>>>> related up to a certain degree, there probably are some important
>>>>>> differences. When suspending a system, the emphasis is on leaving the
>>>>>> system in a consistent state (think of journalled file systems), whereas
>>>>>> disk shock protection is mainly concerned with stopping I/O as soon as
>>>>>> possible. As yet, I cannot possibly say to what extent these two
>>>>>> concepts can be reconciled in the sense of sharing some common code.
>>>>>>
>>>>> Actually, I believe requirements are same.
>>>>>
>>>>> 'don't do i/o in dangerous period'.
>>>>>
>>>>> swsusp will just do sync() before entering dangerous period. That
>>>>> provides consistent-enough state...
>>>>>
>>>> As I've said many times before - if the requirement is "don't do
>>>> I/O" then you have to freeze the filesystem. In no way does 'sync'
>>>> prevent filesystems from doing I/O.....
>>>>
>>> Well, it seems we can handle this on the block layer level, by temporarily
>>> replacing the elevator with something that will selectively prevent fs I/O
>>> from reaching the layers below it.
>>
>> Why? What part of freeze_bdev() doesn't work for you?
>
> Well, my original problem - which is still an issue - is that a process
> writing to a frozen XFS filesystem is stuck in D state, and therefore
> cannot be frozen as part of suspend.
Silly me - how could I forget the three headed monkey getting in
the way of our happy trip to beer island?
Seriously, though, how is stopping I/O in the elevator going to
change that? What do you do with a sync I/O (read or write)? The
process is going to have to go to sleep somewhere in D state waiting
for that I/O to complete. If you're going to intercept such
processes somewhere else to do something magic, then why not put
that magic in vfs_check_frozen()?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Monday, 30 of June 2008, Christoph Hellwig wrote:
> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > > Actually, I believe requirements are same.
> > > >
> > > > 'don't do i/o in dangerous period'.
> > > >
> > > > swsusp will just do sync() before entering dangerous period. That
> > > > provides consistent-enough state...
> > >
> > > As I've said many times before - if the requirement is "don't do
> > > I/O" then you have to freeze the filesystem. In no way does 'sync'
> > > prevent filesystems from doing I/O.....
> >
> > Well, it seems we can handle this on the block layer level, by temporarily
> > replacing the elevator with something that will selectively prevent fs I/O
> > from reaching the layers below it.
> >
> > I talked with Jens about it on a very general level, but it seems doable at
> > first sight.
>
> Why would you hack the block layer when we already have a perfectly fine
> facility to achieve what you want? freeze_bdev is there exactly for the
> purpose of making the filesystem consistent on disk and then freezing all
> I/O.
We tried that in the past and it didn't work very well due to some bad
interactions with the md layer, which we wanted to stay functional while we
were saving the image.
Also, do all of the supported filesystems implement this feature?
On Monday, 30 of June 2008, Dave Chinner wrote:
> On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > Dave Chinner wrote:
> >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> >>> On Monday, 30 of June 2008, Dave Chinner wrote:
> >>>> On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
> >>>>>>> Is this the same thing the per-device IO-queue-freeze patches for
> >>>>>>> HDAPS also
> >>>>>>> need to do? If so, you may want to talk to Elias Oltmanns
> >>>>>>> <[email protected]> about it. Added to CC.
> >>>>>>>
> >>>>>> Thanks for the heads up Henrique. Even though these issues seem to be
> >>>>>> related up to a certain degree, there probably are some important
> >>>>>> differences. When suspending a system, the emphasis is on leaving the
> >>>>>> system in a consistent state (think of journalled file systems), whereas
> >>>>>> disk shock protection is mainly concerned with stopping I/O as soon as
> > >>>>>> possible. As yet, I cannot possibly say to what extent these two
> >>>>>> concepts can be reconciled in the sense of sharing some common code.
> >>>>>>
> >>>>> Actually, I believe requirements are same.
> >>>>>
> >>>>> 'don't do i/o in dangerous period'.
> >>>>>
> >>>>> swsusp will just do sync() before entering dangerous period. That
> >>>>> provides consistent-enough state...
> >>>>>
> >>>> As I've said many times before - if the requirement is "don't do
> >>>> I/O" then you have to freeze the filesystem. In no way does 'sync'
> >>>> prevent filesystems from doing I/O.....
> >>>>
> >>> Well, it seems we can handle this on the block layer level, by temporarily
> >>> replacing the elevator with something that will selectively prevent fs I/O
> >>> from reaching the layers below it.
> >>
> >> Why? What part of freeze_bdev() doesn't work for you?
> >
> > Well, my original problem - which is still an issue - is that a process
> > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > cannot be frozen as part of suspend.
I thought we were talking about the post-freezer situation.
> Silly me - how could I forget the three headed monkey getting in
> the way of our happy trip to beer island?
>
> Seriously, though, how is stopping I/O in the elevator going to
> change that?
We can do that after creating the image and before we let devices run again.
This way we won't need to worry about the freezer.
> What do you do with a sync I/O (read or write)? The
> process is going to have to go to sleep somewhere in D state waiting
> for that I/O to complete. If you're going to intercept such
> processes somewhere else to do something magic, then why not put
> that magic in vfs_check_frozen()?
This might work too, but it would be nice to do something independent of the
freezer, so that we can drop the freezer when we want and not when we are
forced to.
Thanks,
Rafael
On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> On Monday, 30 of June 2008, Dave Chinner wrote:
> > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > Dave Chinner wrote:
> > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > >>> replacing the elevator with something that will selectively prevent fs I/O
> > >>> from reaching the layers below it.
> > >>
> > >> Why? What part of freeze_bdev() doesn't work for you?
> > >
> > > Well, my original problem - which is still an issue - is that a process
> > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > cannot be frozen as part of suspend.
>
> I thought we were talking about the post-freezer situation.
>
> > Silly me - how could I forget the three headed monkey getting in
> > the way of our happy trip to beer island?
> >
> > Seriously, though, how is stopping I/O in the elevator going to
> > change that?
>
> We can do that after creating the image and before we let devices run again.
> This way we won't need to worry about the freezer.
You're suggesting that you let processes trying to do I/O continue
until *after* the memory image is taken? How is that going to work?
You've got to quiesce the filesystems totally *before* taking an image
of memory - it's the only way to guarantee that the in-memory state
and the on-disk state are consistent on resume.
Don't re-invent the wheel - use the API we already have that does
exactly what needs to be done.
> > What do you do with a sync I/O (read or write)? The
> > process is going to have to go to sleep somewhere in D state waiting
> > for that I/O to complete. If you're going to intercept such
> > processes somewhere else to do something magic, then why not put
> > that magic in vfs_check_frozen()?
>
> This might work too, but it would be nice to do something independent of the
> freezer, so that we can drop the freezer when we want and not when we are
> forced to.
vfs_check_frozen() is completely independent of the process freezer.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tuesday, 1 of July 2008, Dave Chinner wrote:
> On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> > On Monday, 30 of June 2008, Dave Chinner wrote:
> > > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > > Dave Chinner wrote:
> > > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > > >>> replacing the elevator with something that will selectively prevent fs I/O
> > > >>> from reaching the layers below it.
> > > >>
> > > >> Why? What part of freeze_bdev() doesn't work for you?
> > > >
> > > > Well, my original problem - which is still an issue - is that a process
> > > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > > cannot be frozen as part of suspend.
> >
> > I thought we were talking about the post-freezer situation.
> >
> > > Silly me - how could I forget the three headed monkey getting in
> > > the way of our happy trip to beer island?
> > >
> > > Seriously, though, how is stopping I/O in the elevator going to
> > > change that?
> >
> > We can do that after creating the image and before we let devices run again.
> > This way we won't need to worry about the freezer.
>
> You're suggesting that you let processes trying to do I/O continue
> until *after* the memory image is taken?
I'm not going to let the data get to the disk.
> How is that going to work?
> You've got to quiesce the filesystems totally *before* taking an image
> of memory - it's the only way to guarantee that the in-memory state
> and the on-disk state are consistent on resume.
No, it's not the only way.
We have to ensure that the fs data that didn't make it to the disk(s)
before creating the snapshot image will not be written to the disk(s) after
the image has been created. In theory one can think of many ways to achieve
that and the freezing of filesystems is certainly one of those.
> Don't re-invent the wheel - use the API we already have that does
> exactly what needs to be done.
>
> > > What do you do with a sync I/O (read or write)? The
> > > process is going to have to go to sleep somewhere in D state waiting
> > > for that I/O to complete. If you're going to intercept such
> > > processes somewhere else to do something magic, then why not put
> > > that magic in vfs_check_frozen()?
> >
> > This might work too, but it would be nice to do something independent of the
> > freezer, so that we can drop the freezer when we want and not when we are
> > forced to.
>
> vfs_check_frozen() is completely independent of the process freezer.
Well, can you please tell me how exactly that works, then?
Thanks,
Rafael
On Tue, Jul 01, 2008 at 12:38:41AM +0200, Rafael J. Wysocki wrote:
> On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, 30 of June 2008, Dave Chinner wrote:
> > > > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > > > Dave Chinner wrote:
> > > > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > > > >>> replacing the elevator with something that will selectively prevent fs I/O
> > > > >>> from reaching the layers below it.
> > > > >>
> > > > >> Why? What part of freeze_bdev() doesn't work for you?
> > > > >
> > > > > Well, my original problem - which is still an issue - is that a process
> > > > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > > > cannot be frozen as part of suspend.
> > >
> > > I thought we were talking about the post-freezer situation.
> > >
> > > > Silly me - how could I forget the three headed monkey getting in
> > > > the way of our happy trip to beer island?
> > > >
> > > > Seriously, though, how is stopping I/O in the elevator going to
> > > > change that?
> > >
> > > We can do that after creating the image and before we let devices run again.
> > > This way we won't need to worry about the freezer.
> >
> > You're suggesting that you let processes trying to do I/O continue
> > until *after* the memory image is taken?
>
> I'm not going to let the data get to the disk.
Yes, but you still haven't answered the original question - What are
you going to do with sync I/O that leaves a process in D state
because you've prevented the I/O from being completed?
> > > > What do you do with a sync I/O (read or write)? The
> > > > process is going to have to go to sleep somewhere in D state waiting
> > > > for that I/O to complete. If you're going to intercept such
> > > > processes somewhere else to do something magic, then why not put
> > > > that magic in vfs_check_frozen()?
> > >
> > > This might work too, but it would be nice to do something independent of the
> > > freezer, so that we can drop the freezer when we want and not when we are
> > > forced to.
> >
> > vfs_check_frozen() is completely independent of the process freezer.
>
> Well, can you please tell me how exactly that works, then?
Try looking at the code. When we freeze a filesystem sb->s_frozen
changes state depending on the level of freeze currently obtained
by the filesystem. And:
#define vfs_check_frozen(sb, level) \
wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
Pretty bloody simple, really.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun 2008-06-29 23:37:31, Jeremy Fitzhardinge wrote:
> Dave Chinner wrote:
>> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
>>
>>> On Monday, 30 of June 2008, Dave Chinner wrote:
>>>
>>>> On Thu, Jun 26, 2008 at 05:09:10PM +0200, Pavel Machek wrote:
>>>>
>>>>>>> Is this the same thing the per-device IO-queue-freeze patches for
>>>>>>> HDAPS also
>>>>>>> need to do? If so, you may want to talk to Elias Oltmanns
>>>>>>> <[email protected]> about it. Added to CC.
>>>>>>>
>>>>>> Thanks for the heads up Henrique. Even though these issues seem to be
>>>>>> related up to a certain degree, there probably are some important
>>>>>> differences. When suspending a system, the emphasis is on leaving the
>>>>>> system in a consistent state (think of journalled file systems), whereas
>>>>>> disk shock protection is mainly concerned with stopping I/O as soon as
>>>>>> possible. As yet, I cannot possibly say to what extent these two
>>>>>> concepts can be reconciled in the sense of sharing some common code.
>>>>>>
>>>>> Actually, I believe requirements are same.
>>>>>
>>>>> 'don't do i/o in dangerous period'.
>>>>>
>>>>> swsusp will just do sync() before entering dangerous period. That
>>>>> provides consistent-enough state...
>>>>>
>>>> As I've said many times before - if the requirement is "don't do
>>>> I/O" then you have to freeze the filesystem. In no way does 'sync'
>>>> prevent filesystems from doing I/O.....
>>>>
>>> Well, it seems we can handle this on the block layer level, by temporarily
>>> replacing the elevator with something that will selectively prevent fs I/O
>>> from reaching the layers below it.
>>>
>>
>> Why? What part of freeze_bdev() doesn't work for you?
>
> Well, my original problem - which is still an issue - is that a process
> writing to a frozen XFS filesystem is stuck in D state, and therefore
> cannot be frozen as part of suspend.
Well, if it is in D state but does not hold any important locks, you
can just add "try_to_freeze()" in the place where it is sleeping,
right?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
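One possible shape of Pavel's suggestion - a freezer-aware variant of the frozen-superblock wait - might look like the sketch below. It is hypothetical, not existing kernel code: it assumes the caller holds no locks at the sleep point and simply interleaves try_to_freeze() with the wait on s_wait_unfrozen.

#include <linux/fs.h>
#include <linux/freezer.h>

/* Hypothetical freezer-friendly variant of vfs_check_frozen(). */
static void example_check_frozen_freezable(struct super_block *sb, int level)
{
	while (sb->s_frozen >= level) {
		/* Let the freezer grab this task while it waits for a thaw. */
		try_to_freeze();
		wait_event_timeout(sb->s_wait_unfrozen,
				   sb->s_frozen < level, HZ);
	}
}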
On Tuesday, 1 of July 2008, Dave Chinner wrote:
> On Tue, Jul 01, 2008 at 12:38:41AM +0200, Rafael J. Wysocki wrote:
> > On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > > On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> > > > On Monday, 30 of June 2008, Dave Chinner wrote:
> > > > > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > > > > Dave Chinner wrote:
> > > > > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > > > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > > > > >>> replacing the elevator with something that will selectively prevent fs I/O
> > > > > >>> from reaching the layers below it.
> > > > > >>
> > > > > >> Why? What part of freeze_bdev() doesn't work for you?
> > > > > >
> > > > > > Well, my original problem - which is still an issue - is that a process
> > > > > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > > > > cannot be frozen as part of suspend.
> > > >
> > > > I thought we were talking about the post-freezer situation.
> > > >
> > > > > Silly me - how could I forget the three headed monkey getting in
> > > > > the way of our happy trip to beer island?
> > > > >
> > > > > Seriously, though, how is stopping I/O in the elevator going to
> > > > > change that?
> > > >
> > > > We can do that after creating the image and before we let devices run again.
> > > > This way we won't need to worry about the freezer.
> > >
> > > You're suggesting that you let processes trying to do I/O continue
> > > until *after* the memory image is taken?
> >
> > I'm not going to let the data get to the disk.
>
> Yes, but you still haven't answered the original question - What are
> you going to do with sync I/O that leaves a process in D state
> because you've prevented the I/O from being completed?
I don't want to intercept those processes, just allow them to block on that I/O.
> > > > > What do you do with a sync I/O (read or write)? The
> > > > > process is going to have to go to sleep somewhere in D state waiting
> > > > > for that I/O to complete. If you're going to intercept such
> > > > > processes somewhere else to do something magic, then why not put
> > > > > that magic in vfs_check_frozen()?
> > > >
> > > > This might work too, but it would be nice to do something independent of the
> > > > freezer, so that we can drop the freezer when we want and not when we are
> > > > forced to.
> > >
> > > vfs_check_frozen() is completely independent of the process freezer.
> >
> > Well, can you please tell me how exactly that works, then?
>
> Try looking at the code. When we freeze a filesystem sb->s_frozen
> changes state depending on the level of freeze currently obtained
> by the filesystem. And:
>
> #define vfs_check_frozen(sb, level) \
> wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
>
> Pretty bloody simple, really.
OK
Do all of the filesystems implement the freezing?
Rafael
On Tue, Jul 01, 2008 at 04:35:43PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > On Tue, Jul 01, 2008 at 12:38:41AM +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > > > On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> > > > > On Monday, 30 of June 2008, Dave Chinner wrote:
> > > > > > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > > > > > Dave Chinner wrote:
> > > > > > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > > > > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > > > > > >>> replacing the elevator with something that will selectively prevent fs I/O
> > > > > > >>> from reaching the layers below it.
> > > > > > >>
> > > > > > >> Why? What part of freeze_bdev() doesn't work for you?
> > > > > > >
> > > > > > > Well, my original problem - which is still an issue - is that a process
> > > > > > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > > > > > cannot be frozen as part of suspend.
> > > > >
> > > > > I thought we were talking about the post-freezer situation.
> > > > >
> > > > > > Silly me - how could I forget the three headed monkey getting in
> > > > > > the way of our happy trip to beer island?
> > > > > >
> > > > > > Seriously, though, how is stopping I/O in the elevator going to
> > > > > > change that?
> > > > >
> > > > > We can do that after creating the image and before we let devices run again.
> > > > > This way we won't need to worry about the freezer.
> > > >
> > > > You're suggesting that you let processes trying to do I/O continue
> > > > until *after* the memory image is taken?
> > >
> > > I'm not going to let the data get to the disk.
> >
> > Yes, but you still haven't answered the original question - What are
> > you going to do with sync I/O that leaves a process in D state
> > because you've prevented the I/O from being completed?
>
> I don't want to intercept those processes, just allow them to block on that I/O.
So you're going to allow them to go to D state somewhere. Ok, so
what's the problem with blocking them in vfs_check_frozen(), then?
> Do all of the filesystems implement the freezing?
Most of the major ones - those that implement ->write_super_lockfs()
should work just fine.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tuesday, 1 of July 2008, Dave Chinner wrote:
> On Tue, Jul 01, 2008 at 04:35:43PM +0200, Rafael J. Wysocki wrote:
> > On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > > On Tue, Jul 01, 2008 at 12:38:41AM +0200, Rafael J. Wysocki wrote:
> > > > On Tuesday, 1 of July 2008, Dave Chinner wrote:
> > > > > On Mon, Jun 30, 2008 at 11:00:43PM +0200, Rafael J. Wysocki wrote:
> > > > > > On Monday, 30 of June 2008, Dave Chinner wrote:
> > > > > > > On Sun, Jun 29, 2008 at 11:37:31PM -0700, Jeremy Fitzhardinge wrote:
> > > > > > > > Dave Chinner wrote:
> > > > > > > >> On Mon, Jun 30, 2008 at 01:22:47AM +0200, Rafael J. Wysocki wrote:
> > > > > > > >>> Well, it seems we can handle this on the block layer level, by temporarily
> > > > > > > >>> replacing the elevator with something that will selectively prevent fs I/O
> > > > > > > >>> from reaching the layers below it.
> > > > > > > >>
> > > > > > > >> Why? What part of freeze_bdev() doesn't work for you?
> > > > > > > >
> > > > > > > > Well, my original problem - which is still an issue - is that a process
> > > > > > > > writing to a frozen XFS filesystem is stuck in D state, and therefore
> > > > > > > > cannot be frozen as part of suspend.
> > > > > >
> > > > > > I thought we were talking about the post-freezer situation.
> > > > > >
> > > > > > > Silly me - how could I forget the three headed monkey getting in
> > > > > > > the way of our happy trip to beer island?
> > > > > > >
> > > > > > > Seriously, though, how is stopping I/O in the elevator going to
> > > > > > > change that?
> > > > > >
> > > > > > We can do that after creating the image and before we let devices run again.
> > > > > > This way we won't need to worry about the freezer.
> > > > >
> > > > > You're suggesting that you let processes trying to do I/O continue
> > > > > until *after* the memory image is taken?
> > > >
> > > > I'm not going to let the data get to the disk.
> > >
> > > Yes, but you still haven't answered the original question - What are
> > > you going to do with sync I/O that leaves a process in D state
> > > because you've prevented the I/O from being completed?
> >
> > I don't want to intercept those processes, just allow them to block on that I/O.
>
> So you're going to allow them to go to D state somewhere. Ok, so
> what's the problem with blocking them in vfs_check_frozen(), then?
>
> > Do all of the filesystems implement the freezing?
>
> Most of the major ones - those that implement ->write_super_lockfs()
> should work just fine.
Okay, so we can do that.
I'm certainly not against freezing the filesystems before hibernation, at least.
In fact we tried that in the past, but there were some locking problems I was
unable to resolve at that time.
Unfortunately I'm not very familiar with the VFS and filesystems code, so some
experts' help would be very much appreciated.
Thanks,
Rafael
Rafael J. Wysocki wrote:
>>> I talked with Jens about it on a very general level, but it seems doable at
>>> first sight.
>> Why would you hack the block layer when we already have a perfectly fine
>> facility to achieve what you want? freeze_bdev is there exactly for the
>> purpose of making the filesystem consistent on disk and then freezing all
>> I/O.
>
> We tried that in the past and it didn't work very well due to some bad
> interactions with the md layer, which we wanted to stay functional while we
> were saving the image.
Hm, details or a link?
> Also, do all of the supported filesystems implement this feature?
ext3, ext4, gfs2, jfs, reiserfs, xfs, all provide a write_super_lockfs
op, which is what freeze_bdev uses. I think that the rest is generic,
for simpler filesystems.
-Eric
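For completeness, the hooks being discussed are just two superblock operations; a skeleton using the 2.6.26-era op names (later kernels renamed them to ->freeze_fs/->unfreeze_fs), with placeholder bodies, would look roughly like this:

#include <linux/fs.h>

/* Called via freeze_bdev(): flush dirty data and metadata and quiesce
 * the journal so the on-disk image is consistent while frozen. */
static void examplefs_write_super_lockfs(struct super_block *sb)
{
	/* filesystem-specific quiesce goes here */
}

/* Called via thaw_bdev(): allow transactions and writes again. */
static void examplefs_unlockfs(struct super_block *sb)
{
	/* filesystem-specific unquiesce goes here */
}

static const struct super_operations examplefs_sops = {
	.write_super_lockfs	= examplefs_write_super_lockfs,
	.unlockfs		= examplefs_unlockfs,
	/* ... remaining super_operations ... */
};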