LinuxLists.cc - [PATCH 3/3] Add timeout feature

[permalink] [raw]

Subject: Re: [dm-devel] Re: [PATCH 3/3] Add timeout feature

On Thu, Jul 03, 2008 at 01:47:10PM +0100, Alasdair G Kergon wrote:
> On Thu, Jul 03, 2008 at 09:11:05PM +0900, Takashi Sato wrote:
> > If the freezer accesses the frozen filesystem and causes a deadlock,
> > the above ideas can't solve it
>
> But you could also say that if the 'freezer' process accesses the frozen
> filesystem and deadlocks then that's just a bug and that userspace code
> should be fixed and there's no need to introduce the complexity of a
> timeout parameter.

Seconded - that was also my primary objection to the timeout code.

> The point I'm trying to make here is:
> Under what real-world circumstances might multiple concurrent freezing
> attempts occur, and which of A, B or C (or other variations) would be
> the most appropriate way of handling such situations?
>
> A common example is people running xfs_freeze followed by an lvm command
> which also attempts to freeze the filesystem.

Yes, I've seen that reported a number of times.

> I can see a case for B or C, but personally I prefer A:
>
> > > 1 succeeds, freezes
> > > 2 succeeds, remains frozen
> > > 3 succeeds, remains frozen
> > > 4 succeeds, thaws

Agreed, though I'd modify the definition of that case to be "remain
frozen until the last thaw occurs". That has the advantage that
it's relatively simple to implement with just a counter...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-04 12:08:09

by Takashi Sato

[permalink] [raw]

Subject: Re: [dm-devel] Re: [PATCH 3/3] Add timeout feature

Hi Alasdair, Eric and Dave,

> On Thu, Jul 03, 2008 at 01:47:10PM +0100, Alasdair G Kergon wrote:
>> On Thu, Jul 03, 2008 at 09:11:05PM +0900, Takashi Sato wrote:
>> > If the freezer accesses the frozen filesystem and causes a deadlock,
>> > the above ideas can't solve it
>>
>> But you could also say that if the 'freezer' process accesses the frozen
>> filesystem and deadlocks then that's just a bug and that userspace code
>> should be fixed and there's no need to introduce the complexity of a
>> timeout parameter.
>
> Seconded - that was also my primary objection to the timeout code.

I will consider removing the timeout.

>> The point I'm trying to make here is:
>> Under what real-world circumstances might multiple concurrent freezing
>> attempts occur, and which of A, B or C (or other variations) would be
>> the most appropriate way of handling such situations?
>>
>> A common example is people running xfs_freeze followed by an lvm command
>> which also attempts to freeze the filesystem.
>
> Yes, I've seen that reported a number of times.
>
>> I can see a case for B or C, but personally I prefer A:
>>
>> > > 1 succeeds, freezes
>> > > 2 succeeds, remains frozen
>> > > 3 succeeds, remains frozen
>> > > 4 succeeds, thaws
>
> Agreed, though I'd modify the definition of that case to be "remain
> frozen until the last thaw occurs". That has the advantage that
> it's relatively simple to implement with just a counter...

I agree this idea.
But I have one concern. When device-mapper's freeze follows FIFREEZE,
can device-mapper freeze only device-mapper's part correctly?
And when device-mapper's thaw follows FITHAW,
can device-mapper thaw only device-mapper's part?

Cheers, Takashi

2008-07-07 11:07:31

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Hi!

> I still disagree with this whole patch. There is not reason to let
> the freeze request timeout - an auto-unfreezing will only confuse the
> hell out of the caller. The only reason where the current XFS freeze
> call can hang and this would be theoretically useful is when the

What happens when someone dirties so much data that vm swaps out
whatever process that frozen the filesystem?

I though that was why the timeout was there...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-08 23:10:32

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Mon, Jul 07, 2008 at 01:07:31PM +0200, Pavel Machek wrote:
> Hi!
>
> > I still disagree with this whole patch. There is not reason to let
> > the freeze request timeout - an auto-unfreezing will only confuse the
> > hell out of the caller. The only reason where the current XFS freeze
> > call can hang and this would be theoretically useful is when the
>
> What happens when someone dirties so much data that vm swaps out
> whatever process that frozen the filesystem?

a) you can't dirty a frozen filesystem - by definition a frozen
filesystem is a *clean filesystem* and *cannot be dirtied*.
b) Swap doesn't write through the filesystem
c) you can still read from a frozen filesystem to page your
executableѕ in.
d) if dirtying another unfrozen filesystem swaps out your
application so it can't run, then there's a major VM bug.
Regardless, until the app completes it is relying on the
filesystem being frozen, so it better remain frozen....

> I though that was why the timeout was there...

Not that I know of.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-08 23:19:38

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed 2008-07-09 09:10:27, Dave Chinner wrote:
> On Mon, Jul 07, 2008 at 01:07:31PM +0200, Pavel Machek wrote:
> > Hi!
> >
> > > I still disagree with this whole patch. There is not reason to let
> > > the freeze request timeout - an auto-unfreezing will only confuse the
> > > hell out of the caller. The only reason where the current XFS freeze
> > > call can hang and this would be theoretically useful is when the
> >
> > What happens when someone dirties so much data that vm swaps out
> > whatever process that frozen the filesystem?
>
> a) you can't dirty a frozen filesystem - by definition a frozen
> filesystem is a *clean filesystem* and *cannot be dirtied*.

Can you stop me?

mmap("/some/huge_file", MAP_SHARED);

then write to memory mapping?

> b) Swap doesn't write through the filesystem
> c) you can still read from a frozen filesystem to page your
> executable?? in.

atime modification should mean dirty data, right? And dirty data mean
memory pressure, right?

> d) if dirtying another unfrozen filesystem swaps out your
~~~~~~~
> application so it can't run, then there's a major VM bug.
> Regardless, until the app completes it is relying on the
> filesystem being frozen, so it better remain frozen....

Agreed. With emphasis on "another".

> > I though that was why the timeout was there...
>
> Not that I know of.

Ok, lets see how you deal with mmap.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-09 00:52:54

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 01:20:31AM +0200, Pavel Machek wrote:
> On Wed 2008-07-09 09:10:27, Dave Chinner wrote:
> > On Mon, Jul 07, 2008 at 01:07:31PM +0200, Pavel Machek wrote:
> > > Hi!
> > >
> > > > I still disagree with this whole patch. There is not reason to let
> > > > the freeze request timeout - an auto-unfreezing will only confuse the
> > > > hell out of the caller. The only reason where the current XFS freeze
> > > > call can hang and this would be theoretically useful is when the
> > >
> > > What happens when someone dirties so much data that vm swaps out
> > > whatever process that frozen the filesystem?
> >
> > a) you can't dirty a frozen filesystem - by definition a frozen
> > filesystem is a *clean filesystem* and *cannot be dirtied*.
>
> Can you stop me?
>
> mmap("/some/huge_file", MAP_SHARED);
>
> then write to memory mapping?

Sure - we can put a hook in ->page_mkwrite() to prevent it. We
don't right now because nobody in the real world really cares if one
half of a concurrent user data change is in the old snapshot or the
new one......

> > b) Swap doesn't write through the filesystem
> > c) you can still read from a frozen filesystem to page your
> > executable?? in.
>
> atime modification should mean dirty data, right?

Metadata, not data. If that's really a problem (and it never has
been for XFS because we always allow in memory changes to atime)
then touch_atime could be easily changed to avoid this...

> And dirty data mean
> memory pressure, right?

If you walk enough inodes while the filesystem is frozen, it
theoretically could happen. Typically a filesystem is only for a
few seconds at a time so in the real world this has never, ever been
a problem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 01:12:12

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 10:52:54AM +1000, Dave Chinner wrote:
> If you walk enough inodes while the filesystem is frozen, it
> theoretically could happen. Typically a filesystem is only for a
> few seconds at a time so in the real world this has never, ever been
> a problem.

I had argued for the timeout (and so it's mostly my fault that
Takashi-San included it as a feature) mainly because I was (and still
amm) deeply paranoid about the competence of the application
programers who might use this feature. I could see them screwing up
leaving it locked forever --- perhaps when their program core dumps or
when the user types ^C and they forgot to install a signal handler,
leaving the filesystem frozen forever.

In the meantime, user applications that try to create files on that
filesystem, or write to files already opened when the filesystem are
frozen will accumulate dirty pages in the page cache, until the system
finally falls over.

Think about some of the evil perpetrated by hal and the userspace
suspend-resume scripts (and how much complexity with random XML
fragments getting parsed by various dbus plugins), and tell me with a
straight face that you would trust these modern-day desktop
application writers with this interface. Because they *will* find
some interesting way to (ab)use it.....

Also, I didn't think the extra code complexity to implements timeouts
was *that* bad --- it seemed fairly small for the functionality.

Best regards,

- Ted

2008-07-09 05:51:30

by Brad Boyer

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, Jul 08, 2008 at 09:09:22PM -0400, Theodore Tso wrote:
> I had argued for the timeout (and so it's mostly my fault that
> Takashi-San included it as a feature) mainly because I was (and still
> amm) deeply paranoid about the competence of the application
> programers who might use this feature. I could see them screwing up
> leaving it locked forever --- perhaps when their program core dumps or
> when the user types ^C and they forgot to install a signal handler,
> leaving the filesystem frozen forever.
>
> In the meantime, user applications that try to create files on that
> filesystem, or write to files already opened when the filesystem are
> frozen will accumulate dirty pages in the page cache, until the system
> finally falls over.
>
> Think about some of the evil perpetrated by hal and the userspace
> suspend-resume scripts (and how much complexity with random XML
> fragments getting parsed by various dbus plugins), and tell me with a
> straight face that you would trust these modern-day desktop
> application writers with this interface. Because they *will* find
> some interesting way to (ab)use it.....
>
> Also, I didn't think the extra code complexity to implements timeouts
> was *that* bad --- it seemed fairly small for the functionality.

Just as an extra point of reference, VxFS supports a freeze/thaw by
ioctl very similar to this including a timeout in seconds. This
means someone else thought it was a useful feature.

http://sfdoccentral.symantec.com/sf/5.0/linux/manpages/vxfs/vxfsio_7.html

Brad Boyer
[email protected]

2008-07-09 06:13:55

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, 8 Jul 2008, Theodore Tso wrote:
> On Wed, Jul 09, 2008 at 10:52:54AM +1000, Dave Chinner wrote:
> > If you walk enough inodes while the filesystem is frozen, it
> > theoretically could happen. Typically a filesystem is only for a
> > few seconds at a time so in the real world this has never, ever been
> > a problem.
>
> I had argued for the timeout (and so it's mostly my fault that
> Takashi-San included it as a feature) mainly because I was (and still
> amm) deeply paranoid about the competence of the application
> programers who might use this feature. I could see them screwing up
> leaving it locked forever --- perhaps when their program core dumps or
> when the user types ^C and they forgot to install a signal handler,
> leaving the filesystem frozen forever.

A much better way to deal with that would be to limit the freeze to
the lifetime of the process somehow. E.g. the freeze ioctl sets a bit
in struct file (I'm sure we have some available) and release on the
file checks this bit and thaws the filesystem.

This would mean that freeze and thaw will have to be done on the same
file descriptor, but this isn't unreasonable to expect, is it?

Miklos

2008-07-09 06:16:39

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 08:13:21AM +0200, Miklos Szeredi wrote:
> This would mean that freeze and thaw will have to be done on the same
> file descriptor, but this isn't unreasonable to expect, is it?

It is certainly not the current use case, where you run one command
to freeze the filesystem and another one to unfreeze it.

2008-07-09 06:22:56

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 9 Jul 2008, Christoph Hellwig wrote:
> On Wed, Jul 09, 2008 at 08:13:21AM +0200, Miklos Szeredi wrote:
> > This would mean that freeze and thaw will have to be done on the same
> > file descriptor, but this isn't unreasonable to expect, is it?
>
> It is certainly not the current use case, where you run one command
> to freeze the filesystem and another one to unfreeze it.

So instead of

freeze_fs mountpoint
backup-command
unfreeze_fs mountpoint

the user would have do to

run_freezed mountpoint backup-command

I find the second one nicer, regardless of any reliability issues.

Miklos

2008-07-09 06:42:16

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 09 Jul 2008 08:22:56 +0200
Miklos Szeredi <[email protected]> wrote:

> On Wed, 9 Jul 2008, Christoph Hellwig wrote:
> > On Wed, Jul 09, 2008 at 08:13:21AM +0200, Miklos Szeredi wrote:
> > > This would mean that freeze and thaw will have to be done on the
> > > same file descriptor, but this isn't unreasonable to expect, is
> > > it?
> >
> > It is certainly not the current use case, where you run one command
> > to freeze the filesystem and another one to unfreeze it.
>
> So instead of
>
> freeze_fs mountpoint
> backup-command
> unfreeze_fs mountpoint
>
> the user would have do to
>
> run_freezed mountpoint backup-command
>
> I find the second one nicer, regardless of any reliability issues.

nah he needs to do

make_snapshot ; backup-command ; unref_snapshot.

freezing isn't the right solution for the backup problem ;)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 06:48:42

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, 8 Jul 2008, Arjan van de Ven wrote:
> nah he needs to do
>
> make_snapshot ; backup-command ; unref_snapshot.
>
> freezing isn't the right solution for the backup problem ;)

Confused, what's freezing _is_ for then? Patch description says:

Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests. So, we cannot take a backup which keeps
the filesystem's consistency with the storage device's features
(snapshot and replication) while it is mounted.
In many case, a commercial filesystem (e.g. VxFS) has
the freeze feature and it would be used to get the consistent backup.
If Linux's standard filesytem ext3 has the freeze feature, we can do it
without a commercial filesystem.

Miklos

2008-07-09 06:55:02

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 09 Jul 2008 08:48:42 +0200
Miklos Szeredi <[email protected]> wrote:

> On Tue, 8 Jul 2008, Arjan van de Ven wrote:
> > nah he needs to do
> >
> > make_snapshot ; backup-command ; unref_snapshot.
> >
> > freezing isn't the right solution for the backup problem ;)
>
> Confused, what's freezing _is_ for then? Patch description says:
>
> Currently, ext3 in mainline Linux doesn't have the freeze feature
> which suspends write requests. So, we cannot take a backup which
> keeps the filesystem's consistency with the storage device's features
> (snapshot and replication) while it is mounted.

I tihnk the idea there is

freeze . do the snapshot op . unfreeze . make backup of snapshot

one can argue about the need of doing the first 3 steps via a userland
loop; it sure sounds like one needs to be really careful to not do any
writes to the fs from the app that does snapshots (and that includes
doing any syscalls in the kernel that allocate memory.. just because
that already could cause unrelated data to be written from inside the
app. Not fun.)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 06:59:59

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, Jul 08, 2008 at 11:41:20PM -0700, Arjan van de Ven wrote:
> On Wed, 09 Jul 2008 08:22:56 +0200
> Miklos Szeredi <[email protected]> wrote:
> > > On Wed, Jul 09, 2008 at 08:13:21AM +0200, Miklos Szeredi wrote:
> > > > This would mean that freeze and thaw will have to be done on the
> > > > same file descriptor, but this isn't unreasonable to expect, is
> > > > it?
> > >
> > > It is certainly not the current use case, where you run one command
> > > to freeze the filesystem and another one to unfreeze it.
> >
> > So instead of
> >
> > freeze_fs mountpoint
> > backup-command
> > unfreeze_fs mountpoint
> >
> > the user would have do to
> >
> > run_freezed mountpoint backup-command
> >
> > I find the second one nicer, regardless of any reliability issues.
>
> nah he needs to do
>
> make_snapshot ; backup-command ; unref_snapshot.
>
> freezing isn't the right solution for the backup problem ;)

You're forgetting that to take a snapshot you generally need to
freeze the filesystem. ;) i.e:

freeze; make_snapshot; unfreeze
backup-command
unref_snapshot

Yes, dm_snapshot does the freeze/snapshot/unfreeze for you, but the
point is there is a freeze in the example you gave.

The argument against Miklos' version is that there may be multiple
commands to execute while the fs is frozen.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 07:08:32

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, 8 Jul 2008, Arjan van de Ven wrote:
> I tihnk the idea there is
>
> freeze . do the snapshot op . unfreeze . make backup of snapshot

Ah, so then my proposal would become

run_frozen mountpoint do-snapshot
do-backup
release-snapshot

and if they are afraid of deadlocks they can just implement the
timeout in userspace:

run_frozen -t timeout mountpoint do-snapshot

'run_frozen' can be a trivial 30 line app, that can be guaranteed not
to deadlock.

> one can argue about the need of doing the first 3 steps via a userland
> loop; it sure sounds like one needs to be really careful to not do any
> writes to the fs from the app that does snapshots (and that includes
> doing any syscalls in the kernel that allocate memory.. just because
> that already could cause unrelated data to be written from inside the
> app. Not fun.)

Userland always has to be careful when messing with raw devices. That
alone is not a reason to put the snapshotting facility in kernel IMO.

Miklos

2008-07-09 07:13:51

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Tue, Jul 08, 2008 at 11:55:02PM -0700, Arjan van de Ven wrote:
> On Wed, 09 Jul 2008 08:48:42 +0200
> Miklos Szeredi <[email protected]> wrote:
>
> > On Tue, 8 Jul 2008, Arjan van de Ven wrote:
> > > nah he needs to do
> > >
> > > make_snapshot ; backup-command ; unref_snapshot.
> > >
> > > freezing isn't the right solution for the backup problem ;)
> >
> > Confused, what's freezing _is_ for then? Patch description says:
> >
> > Currently, ext3 in mainline Linux doesn't have the freeze feature
> > which suspends write requests. So, we cannot take a backup which
> > keeps the filesystem's consistency with the storage device's features
> > (snapshot and replication) while it is mounted.
>
> I tihnk the idea there is
>
> freeze . do the snapshot op . unfreeze . make backup of snapshot
>
> one can argue about the need of doing the first 3 steps via a userland
> loop; it sure sounds like one needs to be really careful to not do any
> writes to the fs from the app that does snapshots (and that includes
> doing any syscalls in the kernel that allocate memory.. just because
> that already could cause unrelated data to be written from inside the
> app. Not fun.)

Bloody hell! Doesn't *anyone* understand that a frozen filesystem is
*clean*? That the process of freezing it ensures all dirty data and
metadata is written out before the freeze completes? And that once
frozen, it can't be dirtied until unfrozen?

That's 3 (or is it 4 - maybe 5 now that I think about it) different
ppl in 24 hours that have made this same broken argument about
being unable to write back dirty data on a frozen filesystem......

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 07:14:18

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 9 Jul 2008, Dave Chinner wrote:
> The argument against Miklos' version is that there may be multiple
> commands to execute while the fs is frozen.

Which is what a shell is for ;)

Miklos

2008-07-09 07:33:58

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 09:13:34AM +0200, Miklos Szeredi wrote:
> On Wed, 9 Jul 2008, Dave Chinner wrote:
> > The argument against Miklos' version is that there may be multiple
> > commands to execute while the fs is frozen.
>
> Which is what a shell is for ;)

Yeah, weĺl, with your method I ca't tell a user to:

# xfs_freeze -f /mntpt
# xfs_db -r -c 'sb 0' -c 'p rootino' /dev/foo
128
# xfs_db -r -c 'ino 128' -c 'p core.format' /dev/foo
core.format = 2 (extents)
# xfs_db .....
.....
# xfs_freeze -u /mntpt

i.e. using the freeze to force all metadata to disk and
prevent it from changing while doing interactive debugging
of some problem.

Yes, a one-shot freeze/unfreeze mechanism works for some
use cases. The point is that it does not work for them all.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 08:11:29

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 9 Jul 2008, Dave Chinner wrote:
> On Wed, Jul 09, 2008 at 09:13:34AM +0200, Miklos Szeredi wrote:
> > On Wed, 9 Jul 2008, Dave Chinner wrote:
> > > The argument against Miklos' version is that there may be multiple
> > > commands to execute while the fs is frozen.
> >
> > Which is what a shell is for ;)
>
> Yeah, weĺl, with your method I ca't tell a user to:
>
> # xfs_freeze -f /mntpt
> # xfs_db -r -c 'sb 0' -c 'p rootino' /dev/foo
> 128
> # xfs_db -r -c 'ino 128' -c 'p core.format' /dev/foo
> core.format = 2 (extents)
> # xfs_db .....
> .....
> # xfs_freeze -u /mntpt
>
> i.e. using the freeze to force all metadata to disk and
> prevent it from changing while doing interactive debugging
> of some problem.

# run_freeze /mntpt /bin/bash
# ...
# ^D

It's the same, no?

Miklos

2008-07-09 11:10:10

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

>
> Bloody hell! Doesn't *anyone* understand that a frozen filesystem is
> *clean*? That the process of freezing it ensures all dirty data and
> metadata is written out before the freeze completes? And that once
> frozen, it can't be dirtied until unfrozen?

What do you mean by "it can't be diritied until unfrozen". What
happens if I have a kernel compilation happening on a filesystem which
I am trying to freeze? Does

(a) the freeze fail (because the checks equivalent to what happens
when you remount a filesystem read-only happen)?

(b) The process gets a kill -9 when it tries to write a file on the
frozen filesystem?

(c) The process gets a kill -STOP when it tries to write
to a file on the frozen filesystem?

(d) The process won't fail, but just continue to run, filling the page
cache with dirty pages that can't be written out because the
filesystem is frozen?

If the answer is (b) or (c), and if you don't have a timeout, and the
backup process which has frozen the filesystem tries to write to the
filesystem, hilarity will ensue....

> That's 3 (or is it 4 - maybe 5 now that I think about it) different
> ppl in 24 hours that have made this same broken argument about
> being unable to write back dirty data on a frozen filesystem......

It's not a question of writing back dirty data, it's the fact that you
*can't*, leading to the page cache filling up wirth dirty data,
leading eventually to the OOM killer running --- and since the last
time I tried suggesting that if the process holding the file
descriptor freezing the filesystem, that idea got shot down (I see
it's been suggested again), if that happens, there is going to be no
other recovery path other than the Big Red Button.

- Ted

2008-07-09 11:15:37

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 10:11:01AM +0200, Miklos Szeredi wrote:
> On Wed, 9 Jul 2008, Dave Chinner wrote:
> > On Wed, Jul 09, 2008 at 09:13:34AM +0200, Miklos Szeredi wrote:
> > > On Wed, 9 Jul 2008, Dave Chinner wrote:
> > > > The argument against Miklos' version is that there may be multiple
> > > > commands to execute while the fs is frozen.
> > >
> > > Which is what a shell is for ;)
> >
> > Yeah, weĺl, with your method I ca't tell a user to:
> >
> > # xfs_freeze -f /mntpt
> > # xfs_db -r -c 'sb 0' -c 'p rootino' /dev/foo
> > 128
> > # xfs_db -r -c 'ino 128' -c 'p core.format' /dev/foo
> > core.format = 2 (extents)
> > # xfs_db .....
> > .....
> > # xfs_freeze -u /mntpt
> >
> > i.e. using the freeze to force all metadata to disk and
> > prevent it from changing while doing interactive debugging
> > of some problem.
>
> # run_freeze /mntpt /bin/bash
> # ...
> # ^D
>
> It's the same, no?

For that case, yeah. But it's a horrible hack - if that's the
best we can come up with for this freeze/unfreeze then we've
already lost.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 11:50:04

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 07:09:00AM -0400, Theodore Tso wrote:
> >
> > Bloody hell! Doesn't *anyone* understand that a frozen filesystem is
> > *clean*? That the process of freezing it ensures all dirty data and
> > metadata is written out before the freeze completes? And that once
> > frozen, it can't be dirtied until unfrozen?
>
> What do you mean by "it can't be diritied until unfrozen". What
> happens if I have a kernel compilation happening on a filesystem which
> I am trying to freeze? Does

> (a) the freeze fail (because the checks equivalent to what happens
> when you remount a filesystem read-only happen)?
>
> (b) The process gets a kill -9 when it tries to write a file on the
> frozen filesystem?
>
> (c) The process gets a kill -STOP when it tries to write
> to a file on the frozen filesystem?
>
> (d) The process won't fail, but just continue to run, filling the page
> cache with dirty pages that can't be written out because the
> filesystem is frozen?

(e) none of the above. The kernel compilation will appear to pause
until the filesystem is unfrozen. No other visible effect should
occur. It will get blocked in a write or filesystem transaction
because the fs is frozen.

Look at vfs_check_frozen() - any call to that will block if the
filesystem is frozen or being frozen. The generic hook is in
__generic_file_aio_write_nolock() and various other filesystems have
calls in their specific write paths (fuse, ntfs, ocfs2, xfs, xip) to
do this.

For all other modifications, filesystem specific methods of
blocking transactions are used. XFS uses vfs_check_frozen() in
xfs_trans_alloc(), ext3 (and probably ocfs2) do it via
their ->write_super_lockfs method calling journal_lock_updates(),
ext4 via jbd2_lock_updates() and so on....

When the filesystem is unfrozen the journal is unlocked and
anything sleeping on the vfs_check_frozen() waitqueue is
woken.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-09 12:24:59

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, Jul 09, 2008 at 09:49:58PM +1000, Dave Chinner wrote:
> (e) none of the above. The kernel compilation will appear to pause
> until the filesystem is unfrozen. No other visible effect should
> occur. It will get blocked in a write or filesystem transaction
> because the fs is frozen.

So if the process which froze the filesystem accidentally tries
writing to a log file (or database file containing the backup
information, or whatever) that happens to be on the filesystem that is
frozen, that process will get blocked and you end up in a deadlock;
did I get that right?

- Ted

2008-07-09 12:59:20

by Olaf Frączyk

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 2008-07-09 at 08:24 -0400, Theodore Tso wrote:
> On Wed, Jul 09, 2008 at 09:49:58PM +1000, Dave Chinner wrote:
> > (e) none of the above. The kernel compilation will appear to pause
> > until the filesystem is unfrozen. No other visible effect should
> > occur. It will get blocked in a write or filesystem transaction
> > because the fs is frozen.
>
> So if the process which froze the filesystem accidentally tries
> writing to a log file (or database file containing the backup
> information, or whatever) that happens to be on the filesystem that is
> frozen, that process will get blocked and you end up in a deadlock;
> did I get that right?
Where do you see the deadlock?
The process doesn't have a lock on filesystem or something. You can
always unfreeze from another process.

Regards,

Olaf
--
Olaf Frączyk <[email protected]>

2008-07-09 13:53:36

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 9 Jul 2008 17:13:46 +1000
Dave Chinner <[email protected]> wrote:

> > one can argue about the need of doing the first 3 steps via a
> > userland loop; it sure sounds like one needs to be really careful
> > to not do any writes to the fs from the app that does snapshots
> > (and that includes doing any syscalls in the kernel that allocate
> > memory.. just because that already could cause unrelated data to be
> > written from inside the app. Not fun.)
>
> Bloody hell! Doesn't *anyone* understand that a frozen filesystem is
> *clean*? That the process of freezing it ensures all dirty data and
> metadata is written out before the freeze completes? And that once
> frozen, it can't be dirtied until unfrozen?
>
> That's 3 (or is it 4 - maybe 5 now that I think about it) different
> ppl in 24 hours that have made this same broken argument about
> being unable to write back dirty data on a frozen filesystem......

unless you also freeze the system as a whole (in a 'refrigerator
suspend' way).. the "clean" part is just about a nanosecond long. After
that stuff gets dirty again (you're doing that sendfile to receive more
packets from the FTP upload etc etc).
Sure you can pause those. But there's a real risk that you end up
pausing the app that you want to unfreeze the fs (via the memory
allocation->direct reclaim path). And no, mlock doesn't help.
Especially with delayed allocation, where data writes will cause
metadata activity, this is not just theory.

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 13:56:04

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 9 Jul 2008 21:49:58 +1000
Dave Chinner <[email protected]> wrote:

>
> (e) none of the above. The kernel compilation will appear to pause
> until the filesystem is unfrozen. No other visible effect should
> occur. It will get blocked in a write or filesystem transaction
> because the fs is frozen.
>
> Look at vfs_check_frozen() - any call to that will block if the
> filesystem is frozen or being frozen. The generic hook is in
> __generic_file_aio_write_nolock() and various other filesystems have
> calls in their specific write paths (fuse, ntfs, ocfs2, xfs, xip) to
> do this.

yeah and mmap doesn't happen
>
> For all other modifications, filesystem specific methods of
> blocking transactions are used. XFS uses vfs_check_frozen() in
> xfs_trans_alloc(), ext3 (and probably ocfs2) do it via
> their ->write_super_lockfs method calling journal_lock_updates(),
> ext4 via jbd2_lock_updates() and so on....

and what if it's the process that you need to unfreeze the fs later?
Good luck.

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 13:58:11

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Jumping into the battle...

Advfs implemented freezefs and thawfs in 2001 so here is
the design rational from a commercial unix view.

Note - We already had built-in snapshots for local disk
consistent backups so some choices might be different on Linux.

NEED - provide way for SAN and hardware raid storage to do
its snapshot/copy function while the system was in-use and
get an image that could mount cleanly. Without freeze, at
a minimum we usually needed filesystem metadata recovery
to run, worst case is completely unusable snapshits :)

freezefs() is single-level:

ENOTSUPPOTED - by any other fs
EOK - done
EINPROGRESS
EALREADY

As implemented, freezefs only ensures the metadata is
consistent so the filesystem copy can mount anywhere.

This means ONLY SOME metadata (or no metadata) is flushed and
then all metadata updates are stopped. User/kernel writes
to already allocated file pages WILL go to a frozen disk.

It also means writers that need storage allocation (not
delaloc or existing) and things that semantically must
force on-disk updates will hang during the freeze.

These semantics meet the need and has the advantage of the
best perfomance. The design specification for freezefs
provided flags on the api to add more consistency options
later if they were desired:
- flush all dirty metadata
- flush all existing dirty file data
- prevent new dirty file data to disk
but they would all add to the "kill the system" problem.

freezefs has the timeout argument and the default timeout
is a system config parameter:
> 0 specifies the timeout value
= 0 uses the default timeout
< 0 disable timeout

A program could call the freezefs/thawfs api, but the
only current use is the separate commands

# freezefs
# [do your hardware raid stuff]
# thawfs

This is either operator driven or script/cron driven
because hardware raid providers (especially our own)
are really unfriendly and not helpful.

NUMBER ONE RULE - freeze must not hang/crash the system
because that defeats the customer reason for wanting it.

WHY A TIMEOUT - need a way for operator to abort the
freeze because with a frozen filesystem they may not
even be able to do a login to thaw it!

Users get pissed when the system is hung for a long
time and our experience with SAN devices is that their
response to commands is often unreasonably long.

In addition to the user controllable timeout mechanism,
we internally implement AUTO-THAW in the filesystem
whenever necessary to prevent a kernel hang/crash.

If an AUTO-THAW occurs, we post to the log and an
event manager so the user knows the snapshot is bad.

jim

2008-07-09 13:57:58

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed, 09 Jul 2008 14:59:20 +0200
Olaf Frączyk <[email protected]> wrote:

> On Wed, 2008-07-09 at 08:24 -0400, Theodore Tso wrote:
> > On Wed, Jul 09, 2008 at 09:49:58PM +1000, Dave Chinner wrote:
> > > (e) none of the above. The kernel compilation will appear to
> > > pause until the filesystem is unfrozen. No other visible effect
> > > should occur. It will get blocked in a write or filesystem
> > > transaction because the fs is frozen.
> >
> > So if the process which froze the filesystem accidentally tries
> > writing to a log file (or database file containing the backup
> > information, or whatever) that happens to be on the filesystem that
> > is frozen, that process will get blocked and you end up in a
> > deadlock; did I get that right?
> Where do you see the deadlock?
> The process doesn't have a lock on filesystem or something. You can
> always unfreeze from another process.
>

if it's one of your main filesystems... good luck starting a shell
without writing a single thing to disk... FAIL.

2008-07-09 14:13:29

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Oops..

> WHY A TIMEOUT - need a way for operator to abort the
> freeze

instead of "abort" I should have said "limit" because
it is really proactive control (so they will not get
called in the middle of the night by pissed users).

I forgot to make it clear that TIMEOUT is the same as AUTO-THAW
in logging errors so the adnin knows they have a bad snapshot and
can do it again.

I also forgot to say that our customers all say they can deal
with retrying a snapshot, but not with unknown bad snapshots
and most definitely not with killing their 24/7 operations!

jim

2008-07-09 20:55:55

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Wed 2008-07-09 09:08:07, Miklos Szeredi wrote:
> On Tue, 8 Jul 2008, Arjan van de Ven wrote:
> > I tihnk the idea there is
> >
> > freeze . do the snapshot op . unfreeze . make backup of snapshot
>
> Ah, so then my proposal would become
>
> run_frozen mountpoint do-snapshot
> do-backup
> release-snapshot
>
> and if they are afraid of deadlocks they can just implement the
> timeout in userspace:
>
> run_frozen -t timeout mountpoint do-snapshot
>
> 'run_frozen' can be a trivial 30 line app, that can be guaranteed not
> to deadlock.

Userland apps can be swapped out and need kernel memory allocations
during syscalls.

I bet even sleep(30) uses kmalloc internally.

So yes, even trivial applications can deadlock.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-09 20:56:15

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Hi!

> > > > > I still disagree with this whole patch. There is not reason to let
> > > > > the freeze request timeout - an auto-unfreezing will only confuse the
> > > > > hell out of the caller. The only reason where the current XFS freeze
> > > > > call can hang and this would be theoretically useful is when the
> > > >
> > > > What happens when someone dirties so much data that vm swaps out
> > > > whatever process that frozen the filesystem?
> > >
> > > a) you can't dirty a frozen filesystem - by definition a frozen
> > > filesystem is a *clean filesystem* and *cannot be dirtied*.
> >
> > Can you stop me?
> >
> > mmap("/some/huge_file", MAP_SHARED);
> >
> > then write to memory mapping?
>
> Sure - we can put a hook in ->page_mkwrite() to prevent it. We
> don't right now because nobody in the real world really cares if one
> half of a concurrent user data change is in the old snapshot or the
> new one......
>
> > > b) Swap doesn't write through the filesystem
> > > c) you can still read from a frozen filesystem to page your
> > > executable?? in.
> >
> > atime modification should mean dirty data, right?
>
> Metadata, not data. If that's really a problem (and it never has
> been for XFS because we always allow in memory changes to atime)
> then touch_atime could be easily changed to avoid this...
>
> > And dirty data mean
> > memory pressure, right?
>
> If you walk enough inodes while the filesystem is frozen, it
> theoretically could happen. Typically a filesystem is only for a
> few seconds at a time so in the real world this has never, ever been
> a problem.

So we have freezing interface that does not really freeze, and
that can break the system when filesystem is frozen for too long...
:-(.

Maybe you could use process freezer -- cgroup people are adding
userspace interface to that -- to solve those... but that would mean
stopping everyone but thread doing freezing...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-13 12:06:02

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Hi!

> NEED - provide way for SAN and hardware raid storage to do
> its snapshot/copy function while the system was in-use and
> get an image that could mount cleanly. Without freeze, at
> a minimum we usually needed filesystem metadata recovery
> to run, worst case is completely unusable snapshits :)
>
> freezefs() is single-level:
>
> ENOTSUPPOTED - by any other fs
> EOK - done
> EINPROGRESS
> EALREADY
>
> As implemented, freezefs only ensures the metadata is
> consistent so the filesystem copy can mount anywhere.
>
> This means ONLY SOME metadata (or no metadata) is flushed and
> then all metadata updates are stopped. User/kernel writes
> to already allocated file pages WILL go to a frozen disk.

That's the difference here. They do write file data, and thus avoid
mmap()-writes problem.

...and they _still_ provide auto-thaw.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-13 17:15:43

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Pavel Machek wrote:

>>This means ONLY SOME metadata (or no metadata) is flushed and
>>then all metadata updates are stopped. User/kernel writes
>>to already allocated file pages WILL go to a frozen disk.
>
> That's the difference here. They do write file data, and thus avoid
> mmap()-writes problem.
>
> ...and they _still_ provide auto-thaw.
> Pavel

One of the hardest things to make people understand is that
stopping file data writes in the filesystem during a freeze
is not just dangerous, it is also __worthless__ unless you
have a complete "user environment freeze" mechanism.

In a real 24/7 environment, the DB and application stack
may be poorly glued together stuff from multiple vendors.

And unless each independent component has a freeze and they
can all be coordinated, the data in the pipeline is never
stable enough to say "if you stop all writes to disk and
take a snapshot, this is the same as an orderly shutdown,
backup, restore, and startup".

If you need to stop applications before a freeze, there
is no reason to implement "stop writing file data to disk".

The only real way to make it work (and what the smart apps
do) is to have application "checkpoint" commands so they
can roll-back to a stable point from the snapshot while
allowing new user activity to proceed.

People who don't have checkpoints or some other way to
make their environment stable with a transitioning snapshot
must stop all user activity before snapshotting and have
maintenance windows defined to do that.

jim

2008-07-14 06:35:37

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

On Sun 2008-07-13 13:15:43, jim owens wrote:
> Pavel Machek wrote:
>
>>> This means ONLY SOME metadata (or no metadata) is flushed and
>>> then all metadata updates are stopped. User/kernel writes
>>> to already allocated file pages WILL go to a frozen disk.
>>
>> That's the difference here. They do write file data, and thus avoid
>> mmap()-writes problem.
>>
>> ...and they _still_ provide auto-thaw.
>> Pavel
>
> One of the hardest things to make people understand is that
> stopping file data writes in the filesystem during a freeze
> is not just dangerous, it is also __worthless__ unless you
> have a complete "user environment freeze" mechanism.

Eh?

> And unless each independent component has a freeze and they
> can all be coordinated, the data in the pipeline is never
> stable enough to say "if you stop all writes to disk and
> take a snapshot, this is the same as an orderly shutdown,
> backup, restore, and startup".

If you also freeze data writes, at least snapshot is not worse than
_unexpected_ shutdown.

And apps should already be ready for unexpected shutdowns (using fsync
etc).

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-14 13:14:59

by Takashi Sato

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Hi,

jim owens wrote:
> In addition to the user controllable timeout mechanism,
> we internally implement AUTO-THAW in the filesystem
> whenever necessary to prevent a kernel hang/crash.
>
> If an AUTO-THAW occurs, we post to the log and an
> event manager so the user knows the snapshot is bad.

I am interested in AUTO-THAW.
What is the difference between the timeout and AUTO-THAW?
When the kernel detects a deadlock, does it occur to solve it?

Cheers, Takashi

2008-07-14 13:19:05

[permalink] [raw]

Subject: Re: [PATCH 3/3] Add timeout feature

Pavel Machek wrote:

> If you also freeze data writes, at least snapshot is not worse than
> _unexpected_ shutdown.

True. But the key point is also that stopping file writes is
__no better__ than an unexpected shutdown for applications.

Once kernel people accept that it is identical to an unknown
shutdown state, they realize that as you said...

> And apps should already be ready for unexpected shutdowns (using fsync
> etc).

The people writing the code outside the kernel are the ones
responsible for the logic of handling "we don't know what
application writes made it to disk".

Remember that immediately after fsync(), the next write
can make the file inconsistent. For example, look at this
simple sales processing database type sequence:

write(sale_last_name_file, "Mackek"
write(sale_first_name_file, "Pavel")
fsync(sale_last_name_file)
fsync(sale_first_name_file)
...
write(sale_last_name_file, "Owens")
write(sale_first_name_file, "Jim")
fsync(sale_last_name_file)

FREEZEFS [defined to NOT allow file data writes]

So if we restart from that snapshot, the sales order
thinks the customer is "Pavel Owens"... unless there
is some mechanism such as seqence numbers that tie
the files together. And if they have such a mechanism
then it doesn't matter if we allow even more writes
such as:

write(sale_first_name_file, "Ted")

There just is no way inside the kernel to know a
freeze was triggered at a stable application point
such that data before the freeze must be on disk
and data after the freez must not go to disk.

This issue is not unique to freeze, it is also
present with filesystems that have snapshots and
cow/versioning built in.

jim

2008-07-14 14:04:43