2009-01-19 00:52:13

by Theodore Ts'o

Subject: The meaning of data=ordered as it relates to delayed allocation


An Ubuntu user recently complained about a large number of recently
updated files which were zero-length after a crash. I started looking
more closely at that, and it's because we have an interesting
interpretation of data=ordered. It applies to blocks which are already
allocated, but not to blocks which haven't been allocated yet. This
can be surprising for users; and indeed, for many workloads where you
aren't using berk_db or some other database, all of the files written will
be newly created files (or files which are getting rewritten after
opening with O_TRUNC), so there won't be any difference between
data=writeback and data=ordered.

So I wonder if we should either:

(a) make data=ordered force block allocation and writeback --- which
should just be a matter of disabling the
redirty_page_for_writepage() code path in ext4_da_writepage()

(b) add a new mount option, call it data=delalloc-ordered which is (a)

(c) change the default mount option to be data=writeback

(d) Do (b) and make it the default

(e) Keep things the way they are

Thoughts, comments? My personal favorite is (b). This allows users
who want something that works functionally much more like ext3 to get
that, while giving us the current speed advantages of a more aggressive
delayed allocation.

- Ted


2009-01-19 04:43:57

by Aneesh Kumar K.V

Subject: Re: The meaning of data=ordered as it relates to delayed allocation

On Sun, Jan 18, 2009 at 07:52:10PM -0500, Theodore Ts'o wrote:
>
> An Ubuntu user recently complained about a large number of recently
> updated files which were zero-length after a crash. I started looking
> more closely at that, and it's because we have an interesting
> interpretation of data=ordered. It applies to blocks which are already
> allocated, but not to blocks which haven't been allocated yet. This
> can be surprising for users; and indeed, for many workloads where you
> aren't using berk_db or some other database, all of the files written will
> be newly created files (or files which are getting rewritten after
> opening with O_TRUNC), so there won't be any difference between
> data=writeback and data=ordered.


The meaning of data=ordered is to ensure that we don't update the inode's
i_size without first writing the data blocks within i_size. So even with
delayed allocation, if we have an i_size update (this happens when we
allocate blocks) we would write the data blocks first.

With that interpretation, having a zero-block file after a crash is fine.
But we should not find the files corrupted (i.e., files with wrong contents).

>
> So I wonder if we should either:
>
> (a) make data=ordered force block allocation and writeback --- which
> should just be a matter of disabling the
> redirty_page_for_writepage() code path in ext4_da_writepage()


We can't do that, because we cannot do block allocation there. So we need
to redirty the pages that have unmapped buffer_heads.

>
> (b) add a new mount option, call it data=delalloc-ordered which is (a)
>
> (c) change the default mount option to be data=writeback


This won't guarantee that i_size/metadata get updated ONLY after data blocks
are written.

>
> (d) Do (b) and make it the default
>
> (e) Keep things the way they are
>
> Thoughts, comments? My personal favorite is (b). This allows users
> who want something that works functionally much more like ext3 to get
> that, while giving us the current speed advantages of a more aggressive
> delayed allocation.
>
> - Ted

2009-01-19 12:45:16

by Theodore Ts'o

Subject: Re: The meaning of data=ordered as it relates to delayed allocation

On Mon, Jan 19, 2009 at 10:13:45AM +0530, Aneesh Kumar K.V wrote:
> > So I wonder if we should either:
> >
> > (a) make data=ordered force block allocation and writeback --- which
> > should just be a matter of disabling the
> > redirty_page_for_writepage() code path in ext4_da_writepage()
>
> We can't do that, because we cannot do block allocation there. So we need
> to redirty the pages that have unmapped buffer_heads.

What is preventing us from doing block allocation from
ext4_da_writepage()?

- Ted

2009-01-19 14:46:04

by Aneesh Kumar K.V

Subject: Re: The meaning of data=ordered as it relates to delayed allocation

On Mon, Jan 19, 2009 at 07:45:13AM -0500, Theodore Tso wrote:
> On Mon, Jan 19, 2009 at 10:13:45AM +0530, Aneesh Kumar K.V wrote:
> > > So I wonder if we should either:
> > >
> > > (a) make data=ordered force block allocation and writeback --- which
> > > should just be a matter of disabling the
> > > redirty_page_for_writepage() code path in ext4_da_writepage()
> >
> > We can't do that, because we cannot do block allocation there. So we need
> > to redirty the pages that have unmapped buffer_heads.
>
> What is preventing us from doing block allocation from
> ext4_da_writepage()?
>

The callback is called with the page lock held, and we can't start a
journal with the page lock held.
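Roughly, the constraint looks like this (illustrative pseudocode, not
real kernel code; the bodies are drastically simplified):

```
/* ->writepage is invoked with the page already locked: */
lock_page(page);
    ext4_da_writepage(page, wbc);
        /* Allocating the delayed blocks here would need a handle: */
        handle = ext4_journal_start(inode, needed_credits);
        /* ...but journal_start() may block waiting for the running
         * transaction to commit, and the commit code may in turn need
         * to write out locked pages -> potential deadlock.  So pages
         * with unmapped buffer_heads must be redirtied instead: */
        redirty_page_for_writepage(wbc, page);
unlock_page(page);
```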

-aneesh

2009-01-19 19:11:04

by Andreas Dilger

Subject: Re: The meaning of data=ordered as it relates to delayed allocation

On Jan 18, 2009 19:52 -0500, Theodore Ts'o wrote:
> An Ubuntu user recently complained about a large number of recently
> updated files which were zero-length after a crash. I started looking
> more closely at that, and it's because we have an interesting
> interpretation of data=ordered. It applies to blocks which are already
> allocated, but not to blocks which haven't been allocated yet. This
> can be surprising for users; and indeed, for many workloads where you
> aren't using berk_db or some other database, all of the files written will
> be newly created files (or files which are getting rewritten after
> opening with O_TRUNC), so there won't be any difference between
> data=writeback and data=ordered.
>
> So I wonder if we should either:
>
> (a) make data=ordered force block allocation and writeback --- which
> should just be a matter of disabling the
> redirty_page_for_writepage() code path in ext4_da_writepage()

That would re-introduce the "Firefox" problem where fsync of one file forces
all other files being written to flush their data blocks to disk.

> (b) add a new mount option, call it data=delalloc-ordered which is (a)

I'd prefer a better name, like "flushall-ordered" or similar, because to
me "delalloc-ordered" would imply the current behaviour.

> (c) change the default mount option to be data=writeback

That can expose garbage data to the user, which the current behaviour
does not do.

> (d) Do (b) and make it the default
>
> (e) Keep things the way they are
>
> Thoughts, comments? My personal favorite is (b). This allows users
> who want something that works functionally much more like ext3 to get
> that, while giving us the current speed advantages of a more aggressive
> delayed allocation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-01-26 13:24:47

by Jan Kara

Subject: Re: The meaning of data=ordered as it relates to delayed allocation

> On Mon, Jan 19, 2009 at 07:45:13AM -0500, Theodore Tso wrote:
> > On Mon, Jan 19, 2009 at 10:13:45AM +0530, Aneesh Kumar K.V wrote:
> > > > So I wonder if we should either:
> > > >
> > > > (a) make data=ordered force block allocation and writeback --- which
> > > > should just be a matter of disabling the
> > > > redirty_page_for_writepage() code path in ext4_da_writepage()
> > >
> > > We can't do that, because we cannot do block allocation there. So we need
> > > to redirty the pages that have unmapped buffer_heads.
> >
> > What is preventing us from doing block allocation from
> > ext4_da_writepage()?
> >
> The callback is called with page lock held and we can't start a journal
> with page lock held.
There is actually an even more fundamental problem with this.
ext4_da_writepage() is called from the JBD2 commit code to commit
ordered-mode data buffers. But when you're committing a transaction you
might not have enough space in the journal to do the allocation. OTOH I
think it would be a good thing to allocate blocks at transaction commit
(in ordered mode), exactly because we'd have better consistency
guarantees. As Andreas points out, the downside would be that the fsync
problem we have with ext3 starts manifesting itself again.

The question is how to technically implement allocation at commit time
- if we had transaction credits reserved it would not be a big deal, but
the problem is, we usually highly overestimate the number of credits
needed for an allocation and these errors accumulate. So the result
would be that we'd have to commit transactions much earlier than we do
now.

So it might be technically simpler to just ask pdflush to flush dirty
data on ext4 filesystems more often. We could actually queue a writeout
after every transaction commit. For users the result should be roughly
the same as writeout on transaction commit. Although it might take a bit
of time to allocate and write out those 100M of dirty memory that can in
theory accumulate on an average desktop, so there might be noticeable
differences. But it's going to be less noticeable than the 30s kupdate
timeout which is the default now.
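Roughly (illustrative pseudocode; the commit-callback hook and the
writeback helper here are assumptions, not existing interfaces):

```
/* Pseudocode: kick per-superblock writeback from the journal commit
 * path, so delayed-allocation data gets flushed roughly once per commit
 * instead of waiting for the kupdate timeout. */
static void ext4_commit_callback(journal_t *journal, transaction_t *txn)
{
        struct super_block *sb = journal->j_private;

        /* Ask the flusher threads (pdflush) to start writeback of
         * dirty pages on this filesystem.  This allocates blocks and
         * writes the data but does not block the commit itself. */
        queue_writeback_for_sb(sb);     /* hypothetical helper */
}
```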

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs